Parallel Architectures
Why Multicores?
The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, then grew only 20% over the next three years (about 6% per year).
Uniprocessor designs were delivering diminishing returns.
[from Patterson & Hennessy]
Power Wall
• The design goal for the late 1990s and early 2000s was to drive the clock rate up.
• This was done by adding more transistors to a smaller chip.
• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.
[from Patterson & Hennessy]
Roadmap for CPU Clock Speed: Circa 2005
Here is the result of the best thinking in 2005: by 2015, the clock speed of the top “hot chip” would be in the 12-15 GHz range.
[from Patterson & Hennessy]
The CPU Clock Speed Roadmap (A Few Revisions Later)
This reflects the practical experience gained with dense chips that were literally “hot”: they radiated considerable thermal power and were difficult to cool.
Law of physics: all electrical power consumed is eventually radiated as heat.
[from Patterson & Hennessy]
The Multicore Approach
Multiple cores on the same chip:
– Simpler
– Slower
– Less power demanding
The Memory Gap
• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost.
[Figure: relative performance of CPU vs. memory, 1980-2005, log scale from 1 to 100,000; the processor-memory gap widens steadily]
[from Patterson & Hennessy]
Spring 2011 -- Lecture #15
Transition to Multicore
Sequential App Performance
Parallel Architectures
• Definition: “A parallel architecture is a collection of processing elements that cooperate and communicate to solve large problems fast”
• Questions about parallel architectures:
– How many processing elements are there?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?
Flynn Taxonomy of Parallel Computers

| Instruction streams | Single data stream | Parallel data streams |
|---|---|---|
| Single | SISD | SIMD |
| Multiple | MISD | MIMD |

M.J. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
• Flynn’s taxonomy provides a simple, but very broad, classification of computer architectures:
• Single Instruction, Single Data (SISD)
– A single processor with a single instruction stream, operating sequentially on a single data stream.
• Single Instruction, Multiple Data (SIMD)
– A single instruction stream is broadcast to every processor; all processors execute the same instructions in lock-step on their own local data streams.
• Multiple Instruction, Multiple Data (MIMD)
– Each processor can independently execute its own instruction stream on its own local data stream.
• SISD machines are the traditional single-processor, sequential computers, also known as the von Neumann architecture, as opposed to “non-von” parallel computers.
• SIMD machines are synchronous, with more fine-grained parallelism: they run a large number of parallel processes, one for each data element in a parallel vector or array.
• MIMD machines are asynchronous, with more coarse-grained parallelism: they run a smaller number of parallel processes, one for each processor, operating on the large chunks of data local to each processor.
Single Instruction/Single Data Stream: SISD
• Sequential computer
• No parallelism in either the instruction or data streams
• Examples of SISD architecture are traditional uniprocessor machines
Multiple Instruction/Single Data Stream: MISD
• Computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized
– For example, certain kinds of array processors
• No longer commonly encountered; mainly of historical interest
Single Instruction/Multiple Data Stream: SIMD
• Computer that exploits multiple data streams against a single instruction stream, for operations that can be naturally parallelized
– e.g., SIMD instruction extensions or a Graphics Processing Unit (GPU)
• Single control unit
• Multiple datapaths (processing elements, PEs) running in parallel
– PEs are interconnected and exchange/share data as directed by the control unit
– Each PE performs the same operation on its own local data
Multiple Instruction/Multiple Data Streams: MIMD
• Multiple autonomous processors simultaneously execute different instructions on different data.
• MIMD architectures include multicores and Warehouse-Scale Computers (datacenters).
Parallel Computing Architectures: Memory Model

Parallel Architecture = Computer Architecture + Communication Architecture

Question: how do we organize and distribute memory in a multicore architecture?

2 classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor
2. Physically distributed-memory multiprocessor

2 classes of multiprocessors with respect to addressing:
1. Shared
2. Private

[Figure: three organizations]
– UMA (Uniform Memory Access): symmetric multiprocessor (SMP); processors share a centralized memory through an interconnection, with a single shared address space
– NUMA (Non-Uniform Memory Access): distributed-shared-memory multiprocessor; each processor has a local memory, physically distributed but forming a shared address space
– MPP (Massively Parallel Processors): message-passing (shared-nothing) multiprocessor; physically distributed local memories with private address spaces, communicating by send/receive
Memory Performance Metrics
• Latency is the overhead in setting up a connection between processors for passing data.
– This is the most crucial problem for all parallel architectures: obtaining good performance over a range of applications depends critically on low latency for accessing remote data.
• Bandwidth is the amount of data per unit time that can be passed between processors.
– This needs to be large enough to support efficient passing of large amounts of data between processors, as well as collective communications and I/O for large data sets.
• Scalability is how well latency and bandwidth scale with the addition of more processors.
– This is usually only a problem for architectures with many cores.
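The latency/bandwidth trade-off above can be captured by a simple first-order cost model (a sketch; the numbers below are illustrative, not measurements of any real machine):

```python
def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """First-order model: fixed setup latency plus serialization time."""
    return latency_s + size_bytes / bandwidth_bps

# Illustrative numbers: 1 microsecond latency, 1 GB/s bandwidth.
small = transfer_time(8, 1e-6, 1e9)          # latency-dominated
large = transfer_time(8_000_000, 1e-6, 1e9)  # bandwidth-dominated
```

Small messages are dominated by latency, large messages by bandwidth, which is why both metrics matter.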
Distributed Shared Memory Architecture: NUMA
• The data set is distributed among processors:
– each processor accesses only its own data from local memory
– if data from another section of memory (i.e., another processor) is required, it is obtained by a remote access.
• Much larger latency for accessing non-local data, but can scale to large numbers (thousands) of processors for many applications.
– Advantage: scalability
– Disadvantage: locality problems and connection congestion
• The aggregated memory of the whole system appears as one single address space.
[Figure: processors P1..P3, each with a local memory M1..M3, connected by a communication network to a host processor]
Distributed Memory: Message-Passing Architectures
• Each processor is connected to exclusive local memory
– i.e., no other CPU has direct access to it
• Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
• On each CPU runs a serial process that can communicate with processes on other CPUs by means of the network.
• Non-blocking vs. blocking communication
• MPI problems:
– All data layout must be handled by software
– Message passing has high software overhead
[Figure: eight nodes, each a processor (P) with local memory (Mem) and a network interface (NI), attached to an interconnect network]
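The send/receive structure of a message-passing node can be mimicked in ordinary Python with threads and queues (a conceptual sketch only; a real system would use MPI over the network interfaces shown above, and all names here are invented for illustration):

```python
import threading
import queue

def worker(inbox, outbox):
    """A 'node': block on receive, compute on local data, send the result."""
    msg = inbox.get()    # blocking receive
    outbox.put(msg * 2)  # send the result back to the peer

to_worker, from_worker = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(to_worker, from_worker))
t.start()
to_worker.put(21)            # send
result = from_worker.get()   # blocking receive
t.join()
```

The blocking `get()` mirrors a blocking receive; a non-blocking receive would be `inbox.get(block=False)`, which raises `queue.Empty` when no message has arrived.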
Shared Memory Architecture: UMA

[Figure: four processors, each with primary and secondary caches, sharing a global memory over a bus]
• Each processor has access to all the memory, through a shared memory bus and/or communication network.
– Memory bandwidth and latency are the same for all processors and all memory locations.
• Lower latency for accessing non-local data, but difficult to scale to large numbers of processors; usually used for small numbers (order 100 or less) of processors.
Shared Memory Candidates

[Figure: three organizations: shared main memory (per-processor primary and secondary caches), shared secondary cache (per-processor primary caches), and shared primary cache]
• Caches are used to reduce latency and to lower bus traffic.
• Must provide hardware to ensure that caches and memory are consistent (cache coherency).
• Must provide a hardware mechanism to support process synchronization.
Challenge of Parallel Processing
• The two biggest performance challenges in using multiprocessors:
– Insufficient parallelism
• The problem of inadequate application parallelism must be attacked primarily in software, with new algorithms that can have better parallel performance.
– Long-latency remote communication
• Reducing the impact of long remote latency can be attacked both by the architecture and by the programmer.
Amdahl’s Law
• Speedup due to enhancement E is:

Speedup w/ E = Exec time w/o E / Exec time w/ E

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected. Then:

Execution Time w/ E = Execution Time w/o E × [(1 − F) + F/S]

Speedup w/ E = 1 / [(1 − F) + F/S]
Amdahl’s Law

Speedup = 1 / [(1 − F) + F/S]
(non-sped-up part: (1 − F); sped-up part: F/S)

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speed-up?

Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1 / 0.75 ≈ 1.33
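The worked example above can be checked with a few lines of Python (a direct transcription of the formula):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Half the program accelerated 2x, as in the example above:
speedup = amdahl_speedup(0.5, 2)
```

Even an infinite `s` cannot push the speedup past `1 / (1 - f)`, which is the point of the next slide.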
Amdahl’s Law
If the portion of the program that can be parallelized is small, then the speedup is limited.
The non-parallel portion limits the performance.
Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem.
– Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors.
• Growing the problem is needed to amortize sources of OVERHEAD (additional code, not present in the original sequential program, needed to execute the program in parallel).
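The contrast can be made concrete. Under strong scaling, Amdahl's law bounds the speedup; under weak scaling, the commonly used Gustafson model (not on this slide; added here for illustration) gives the scaled speedup:

```python
def strong_scaling_speedup(parallel_fraction, n):
    """Amdahl: fixed problem size, n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def weak_scaling_speedup(parallel_fraction, n):
    """Gustafson: problem size grown proportionally with n processors."""
    return (1.0 - parallel_fraction) + parallel_fraction * n

# Same 50% parallel code on 4 processors under the two regimes:
strong = strong_scaling_speedup(0.5, 4)
weak = weak_scaling_speedup(0.5, 4)
```

With the same 50% parallel fraction, weak scaling reports a higher speedup because the serial part shrinks relative to the grown problem.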
• Symmetric shared-memory machines usually support the caching of both shared and private data.
• Private data are used by a single processor, while shared data are used by multiple processors.
• When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that on a uniprocessor.
• When shared data are cached, the shared value may be replicated in multiple caches. This replication also reduces the contention that may exist for shared data items being read by multiple processors simultaneously.
• Caching of shared data, however, introduces a new problem: cache coherence.
Symmetric Shared-Memory Architectures

[Figure: four processors, each with primary and secondary caches, connected by a bus to a global memory]
Example Cache Coherence Problem
– Cores see different values for u after event 3.
– With write-back caches, the value written back to memory depends on the order in which caches flush or write back the value.
– Unacceptable for programming, and it is frequent!

[Figure: processors P1, P2, P3 with caches, sharing memory where u = 5; events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u, (5) P2 reads u]
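The incoherence in this example can be reproduced with a toy model of write-back caches (a sketch: plain Python dicts stand in for hardware caches and memory, and the event numbering follows the figure):

```python
memory = {"u": 5}
caches = [dict() for _ in range(3)]  # private caches for P1, P2, P3

def read(p, var):
    if var not in caches[p]:     # miss: fetch the value from memory
        caches[p][var] = memory[var]
    return caches[p][var]

def write_back_write(p, var, value):
    caches[p][var] = value       # write-back: memory is NOT updated yet

read(0, "u")                     # event 1: P1 reads u -> 5
read(2, "u")                     # event 2: P3 reads u -> 5
write_back_write(2, "u", 7)      # event 3: P3 writes u = 7 (its cache only)
stale_p1 = read(0, "u")          # event 4: P1 still sees the old 5
stale_p2 = read(1, "u")          # event 5: P2 loads the stale 5 from memory
```

After event 3, P3 holds 7 while P1 and P2 see 5: exactly the "different values for u" the slide describes.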
Keeping Multiple Caches Coherent
• Architect’s job: with shared memory, keep cache values coherent.
• Idea: when any processor has a cache miss or writes, notify other processors via the interconnection network.
– If only reading, many processors can have copies.
– If a processor writes, invalidate all other copies.
• A shared written result can “ping-pong” between caches.
Shared Memory Multiprocessor
Use a snoopy mechanism to keep all processors’ view of memory coherent.

[Figure: processors M1, M2, M3, each behind a snoopy cache on the memory bus, with physical memory and DMA-attached disks]
Example: Write-Through Invalidate
• Must invalidate before step 3.
• Write-update would use more broadcast-medium bandwidth; all recent SMP multicores use write-invalidate.
[Figure: the same three-processor example; with write-through invalidate, P3’s write of u = 7 invalidates the other cached copies and updates memory, so subsequent reads return 7]
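Extending the toy cache model with write-through invalidation removes the stale reads (again a sketch; `caches` and `memory` are plain dicts standing in for hardware state):

```python
memory = {"u": 5}
caches = [dict() for _ in range(3)]  # private caches for P1, P2, P3

def read(p, var):
    if var not in caches[p]:
        caches[p][var] = memory[var]
    return caches[p][var]

def write_through_invalidate(p, var, value):
    for q, cache in enumerate(caches):   # broadcast invalidation first
        if q != p:
            cache.pop(var, None)
    caches[p][var] = value
    memory[var] = value                  # write-through: memory updated too

read(0, "u"); read(2, "u")           # P1 and P3 cache u = 5
write_through_invalidate(2, "u", 7)  # P3 writes; other copies invalidated
fresh_p1 = read(0, "u")              # miss -> reloads 7 from memory
fresh_p2 = read(1, "u")
```

Because the stale copies were invalidated before the write completed, every later read misses and fetches the new value.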
Need for a More Scalable Protocol
• Snoopy schemes do not scale because they rely on broadcast.
• Hierarchical snoopy schemes have the root as a bottleneck.
• Directory-based schemes allow scaling:
– They avoid broadcasts by keeping track of all CPUs caching a memory block, and then using point-to-point messages to maintain coherence.
– They allow the flexibility to use any scalable point-to-point network.
Scalable Approach: Directories
• Every memory block has associated directory information:
– keeps track of copies of cached blocks and their states
– on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
– in scalable networks, communication with the directory and the copies is through network transactions
• Many alternatives exist for organizing directory information.
Basic Operation of a Directory
• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

• Read from main memory by processor i:
– If the dirty bit is OFF: { read from main memory; turn p[i] ON; }
– If the dirty bit is ON: { recall the line from the dirty processor (downgrade its cache state to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }

• Write to main memory by processor i:
– If the dirty bit is OFF: { send invalidations to all caches that have the block; turn the dirty bit ON; supply data to i; turn p[i] ON; ... }

[Figure: processors with caches attached to an interconnection network; memory holds the directory, with presence bits and a dirty bit per block]
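The read/write handlers above translate almost line for line into code (a sketch; the class and method names are invented, only the cases listed on the slide are handled, and for simplicity the entry holds the current block value rather than modeling the owner's cache separately):

```python
class DirectoryEntry:
    """Directory state for one memory block: k presence bits + 1 dirty bit."""
    def __init__(self, k, value):
        self.presence = [False] * k  # p[i]: does cache i hold the block?
        self.dirty = False           # is some cache the exclusive owner?
        self.owner = None
        self.value = value

    def read(self, i):
        if self.dirty:               # recall the line from the dirty processor,
            self.dirty = False       # downgrade the owner to shared,
            self.presence[self.owner] = True
            self.owner = None        # and treat memory as up to date again
        self.presence[i] = True      # turn p[i] ON
        return self.value            # supply the (possibly recalled) data to i

    def write(self, i, value):
        if not self.dirty:           # invalidate all other cached copies
            self.presence = [False] * len(self.presence)
        self.dirty = True            # i becomes the exclusive owner
        self.owner = i
        self.presence[i] = True
        self.value = value
```

For example, after processor 2 writes a block and processor 1 then reads it, the directory recalls the dirty line, clears the dirty bit, and leaves both caches marked present.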
Real Manycore Architectures
• ARM Cortex-A9
• GPU
• P2012
ARM Cortex-A9 Processors
• 98% of mobile phones use at least one ARM processor.
• 90% of embedded 32-bit systems use ARM.
• The Cortex-A9 processors are the highest-performance ARM processors implementing the full richness of the widely supported ARMv7 architecture.
Cortex-A9 CPU
• Superscalar out-of-order instruction execution
– Any of the four subsequent pipelines can select instructions from the issue queue.
• Advanced instruction fetch and branch prediction
• Up to four instruction cache line prefetches pending
– Further reduces the impact of memory latency so as to maintain instruction delivery.
• Between two and four instructions per cycle forwarded continuously into instruction decode
• Counters for performance monitoring
The Cortex-A9 MPCore Multicore Processor
• Design-configurable processor supporting between 1 and 4 CPUs
• Each processor may be independently configured for its cache sizes, FPU, and NEON
• Snoop Control Unit
• Accelerator Coherence Port
Snoop Control Unit and Accelerator Coherence Port
• The SCU is responsible for managing:
– the interconnect,
– arbitration,
– communication,
– cache-to-cache and system memory transfers,
– cache coherence.
• The Cortex-A9 MPCore processor also exposes these capabilities to other system accelerators and non-cached DMA-driven mastering peripherals:
– to increase performance,
– to reduce system-wide power consumption by sharing access to the processor’s cache hierarchy.
• This system coherence also reduces the software complexity involved in otherwise maintaining software coherence within each OS driver.
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor:
– programmability
– precision
– power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
– The GPU accelerates the critical path of the application.
• Data-parallel algorithms leverage GPU attributes:
– large data arrays, streaming throughput
– fine-grain SIMD parallelism
– low-latency floating-point (FP) computation
• Applications (see GPGPU.org):
– game effects (FX), physics, image processing
– physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Motivation 1: Computational Power
– GPUs are fast…
– GPUs are getting faster, faster
Motivation 2: Flexible, Precise, and Cheap
– Modern GPUs are deeply programmable
• Solidifying high-level language support
– Modern GPUs support high precision
• 32-bit floating point throughout the pipeline
• High enough for many (not all) applications
CPU-“style” cores
Slimming down
Two cores
Four cores
Sixteen cores
Add ALUs
128 elements in parallel
But what about branches?
Stalls!
• Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
• Memory access latency = hundreds to thousands of cycles.
• We’ve removed the fancy caches and logic that help avoid stalls.
• But we have LOTS of independent work items.
• Idea #3: interleave the processing of many elements on a single core to avoid stalls caused by high-latency operations.
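Idea #3 can be quantified with a simplified cost model (a sketch under strong assumptions: one outstanding load per group, and enough groups that compute work fully covers the load latency):

```python
def serial_cycles(n_groups, load_latency, compute):
    """Each group waits for its load, then computes: nothing overlaps."""
    return n_groups * (load_latency + compute)

def interleaved_cycles(n_groups, load_latency, compute):
    """Loads are issued while other groups compute; only the first load's
    latency is exposed, provided n_groups * compute >= load_latency."""
    return load_latency + n_groups * compute

# Illustrative numbers: 4 groups, 100-cycle loads, 25 cycles of compute each.
serial = serial_cycles(4, 100, 25)            # 500 cycles
hidden = interleaved_cycles(4, 100, 25)       # 200 cycles
```

The gap grows with latency: the stalled cycles of one group become useful compute cycles for the others, which is exactly the throughput argument of the following slides.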
Hiding stalls
Throughput!
NVIDIA Tesla
• Three key ideas:
– Use many “slimmed-down” cores running in parallel.
– Pack cores full of ALUs (by sharing the instruction stream across groups of work items).
– Avoid latency stalls by interleaving the execution of many groups of work items/threads.
• When one group stalls, work on another group.
On-Chip Memory
• Each multiprocessor has on-chip memory of the four following types:
– one set of local 32-bit registers per processor;
– a parallel shared memory that is shared by all scalar processor cores and is where the shared memory space resides;
– a read-only constant cache that is shared by all scalar processor cores and speeds up reads from the constant memory space, a read-only region of device memory;
– a read-only texture cache that is shared by all scalar processor cores and speeds up reads from the texture memory space, a read-only region of device memory; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering.
• The local and global memory spaces are read-write regions of device memory and are not cached.
Shared Memory
• Is on-chip:
– much faster than global memory
– divided into equally sized memory banks
– as fast as a register when there are no bank conflicts
• Successive 32-bit words are assigned to successive banks.
• Each bank has a bandwidth of 32 bits per clock cycle.
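The bank mapping follows directly from "successive 32-bit words go to successive banks". A small helper (a sketch; 32 banks of 4-byte words are assumed here, which matches early CUDA hardware but not every GPU) shows which access patterns conflict:

```python
N_BANKS = 32  # assumption: 32 banks of 4-byte words
WORD = 4

def bank_of(addr):
    """Successive 32-bit words map to successive banks, wrapping around."""
    return (addr // WORD) % N_BANKS

def max_conflict_degree(addrs):
    """Worst-case number of threads hitting one bank (1 = conflict-free)."""
    counts = {}
    for a in addrs:
        b = bank_of(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

stride1 = [tid * 4 for tid in range(32)]  # consecutive words: no conflicts
stride2 = [tid * 8 for tid in range(32)]  # every other word: 2-way conflicts
```

A stride-1 access pattern touches each bank exactly once; a stride-2 pattern folds two threads onto every even bank, serializing those accesses.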
[Figure: examples of shared memory access patterns without bank conflicts]

[Figure: examples of shared memory access patterns with bank conflicts]
Global Memory: Coalescing
• The device is capable of reading 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction.
• Global memory bandwidth is used most efficiently when simultaneous memory accesses can be coalesced into a single memory transaction of 32, 64, or 128 bytes.
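A quick way to see whether a set of accesses coalesces is to count how many aligned segments they touch (a sketch; real coalescing rules vary by compute capability, and this simply counts 64-byte aligned segments):

```python
SEGMENT = 64  # assumption: 64-byte aligned transaction size

def transactions_needed(addrs, segment=SEGMENT):
    """Number of aligned segments touched = number of memory transactions."""
    return len({a // segment for a in addrs})

# 16 threads reading consecutive 4-byte floats from an aligned base:
coalesced = transactions_needed([tid * 4 for tid in range(16)])  # 1
# The same threads with a stride of two floats span two segments:
strided = transactions_needed([tid * 8 for tid in range(16)])    # 2
```

Contiguous, aligned accesses collapse into one transaction; strided or scattered accesses multiply the transaction count and waste bandwidth.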
Coalescing Examples
NVIDIA’s Fermi-Generation CUDA Compute Architecture
The key architectural highlights of Fermi are:
• Third-generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM, 4x over GT200
– 8x the peak double-precision floating-point performance over GT200
• Second-generation Parallel Thread Execution ISA
– Unified address space with full C++ support
– Optimized for OpenCL and DirectCompute
• Improved memory subsystem
– NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
– Improved atomic memory operation performance
• NVIDIA GigaThread Engine
– 10x faster application context switching
– Concurrent kernel execution
– Out-of-order thread block execution
– Dual overlapped memory transfer engines
Third-Generation Streaming Multiprocessor
• 512 high-performance CUDA cores
– Each SM features 32 CUDA processors.
– Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU).
• 16 load/store units
– Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
– Supporting units load and store the data at each address to cache or DRAM.
• Four Special Function Units
– Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root.
P2012 Introduction
• The P2012 cluster is the computing node of the P2012 fabric.
• The P2012 cluster has two variants:
– a homogeneous computing variant,
– a heterogeneous computing variant.
• A single architecture serves both variants.

[Figure: a fabric of P2012 clusters with a fabric controller and system bridges]
P2012 Cluster Main Features
• Symmetric multiprocessing
• Uniform memory access within the cluster
• Non-uniform memory access between clusters
• Up to 16 + 1 processors per cluster
• Up to 30.6 GOPS peak per cluster (assuming non-SIMD extension) at 600 MHz
• Up to 20.4 GFLOPS (32-bit) peak per cluster at 600 MHz
• 2 DMA channels allowing up to 6.4 GB/s data transfer
• HW support for synchronization:
– fast barrier (within a cluster only) in ~4 cycles for 16 processors
– flexible barrier in ~20 cycles for 16 processors
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
• High level of customization through:
– the number of STxP70 processing elements,
– the STxP70 extensions (ISA customization),
– up to 32 user-defined HWPEs,
– memory sizes,
– the banking factor of the shared memory.
![Page 74: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/74.jpg)
P2012 Cluster Overview — P2012 Cluster Architecture

[Diagram: the Multi-core Sub-system (ENCore <N>) connected to the Global Interconnect Interface]
• N x STxP70 cores
• 2xN-banked shared data memory
• N-to-2M logarithmic interconnect (memory)
• Peripheral logarithmic interconnect
• Runtime accelerator (HWS)
• Timers
• Cluster interfaces (I/O)
![Page 75: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/75.jpg)
P2012 Cluster Overview — P2012 Cluster Architecture

[Diagram: the Multi-core Sub-system (ENCore <N>) and the Cluster Controller (CC) connected to the Global Interconnect Interface]
• 1 STxP70-based cluster processor
• 16KB P$ & TCDM
• CC peripheral (boot, …)
• Clock, variability, power controller (CVP)
• Cluster Controller Interconnect
![Page 76: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/76.jpg)
P2012 Cluster Overview — P2012 Cluster Architecture

[Diagram: the Multi-core Sub-system (ENCore <N>), the Cluster Controller (CC), and the Debug and Test Unit (DTU) connected to the Global Interconnect Interface]
• Provides controllability and observability to the application developer
• Breakpoint propagation inside the cluster and across the fabric
![Page 77: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/77.jpg)
P2012 Cluster Overview — P2012 Cluster Architecture

[Diagram: the Multi-core Sub-system (ENCore <N>), the Cluster Controller (CC), the DTU, custom HW Processing Elements, and the Streaming Interface (SIF) connected to the Global Interconnect Interface]
• P x HW Processing Elements
• Stream Flow Local Interconnect (LIC)
• HWPE to/from LIC interfaces (HWPE_WPR)
• CC to/from LIC interface (SIF)
![Page 78: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/78.jpg)
P2012 Cluster Overview (Cont'd) — P2012 Cluster Architecture

[Diagram: detailed cluster view — N STxP70 cores (each with a 16KB P$) and the STxP70 Cluster Processor (CP) reach the Shared Tightly Coupled Data Memory (TCDM, memory banks #1 to #2xN) through the TCDM logarithmic interconnect; a peripheral logarithmic interconnect links the DMA channels #0/#1, Timers, HWS, and the CC peripherals (CVP-CC, CC Interconnect/CCI) to the Global Interconnect Interface (ENC2EXT, EXT2PER, EXT2MEM); HWPEs #1 to #P attach through HWPE_WPR & SIF blocks to the Stream Flow local interconnect; the Debug and Test Unit (DTU) spans the cluster. In the 16-core heterogeneous variant the cores are STxP70+FPx, with 32 memory banks and a 32-KB CC TCDM.]
• 32-bit RISC processor
• 16 KB P$, no local data memory
• 600 MHz in 32 nm
• Variable-length ISA
• Up to two instructions executed per cycle
• Configurable core
• Extendible through its ISA
• Complete software development tool chain
![Page 79: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/79.jpg)
P2012 Cluster Overview (Cont'd) — P2012 Cluster Architecture
• Parametric multi-core crossbar with a logarithmic structure
• Reduced arbitration complexity
• Round-robin arbitration scheme
• Up to N memory accesses per cycle
• Test-and-set support
![Page 80: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/80.jpg)
P2012 Cluster Overview (Cont'd) — P2012 Cluster Architecture
• Supports 1D & 2D transfers
• Up to 3.2 GB/s peak per DMA channel
• Supports up to 16 outstanding transactions
• Out-of-order (OoO) support
![Page 81: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/81.jpg)
P2012 Cluster Overview (Cont'd) — P2012 Cluster Architecture
• Ultrafast frequency adaptation (power control)
• Continuous critical-path monitoring (dynamic bin sampling)
• Continuous thermal sensing (temperature control)
![Page 82: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/82.jpg)
P2012 Cluster Overview (Cont'd) — P2012 Cluster Architecture
• Highly flexible and configurable interconnect
• Asynchronous implementation
• Low-area or high-performance targets
• Natural GALS enabler
• High robustness to variations
![Page 84: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/84.jpg)
LD/ST and DMA memory transfers
• Intra-cluster:
  – LD/ST (UMA)
  – DMA: from/to TCDM to/from HWPE
• Inter-cluster:
  – LD/ST (NUMA)
  – DMA: L1 to/from L1
• Cluster to/from L2 memory:
  – LD/ST (NUMA)
  – DMA: L1 to/from L2
• Cluster to/from L3 memory (through the System Bridge):
  – LD/ST (NUMA)
  – DMA: L1 to/from L3
[Diagram: P2012 Fabric — a Fabric Controller and System Bridges connected to an array of P2012 Clusters and the L2-MEM]
![Page 85: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/85.jpg)
P2012 as GP Accelerator
[Diagram: an ARM host drives the P2012 Fabric (Fabric Controller plus Clusters 0–3, each with an L1 TCDM), backed by L2 memory and L3 (DRAM)]
![Page 86: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/86.jpg)
Summary
• The P2012 Cluster includes up to 16 + 1 STxP70 cores, delivering up to 30.6 GOPS and 20.4 GFLOPS peak
• Up to 6.4 GB/s aggregate DMA transfers (two 3.2 GB/s channels)
• Symmetric multi-processing in a UMA fashion within a cluster; shared data memory in a NUMA fashion between clusters
• Fast multi-processor synchronization thanks to HW support
• Seamless combination of non-programmable (HWPEs) and programmable (PEs) processing elements
![Page 87: Parallel Architectures](https://reader036.fdocuments.us/reader036/viewer/2022081514/56812dd9550346895d932520/html5/thumbnails/87.jpg)
Mobile SoC in 2012…
• Features
  – TSMC 40 nm (LP/G)
  – Dual-core Cortex-A9, 1–1.2 GHz (G)
  – GPU, etc.: 330–400 MHz (LP)
  – GeForce ULV (8 shaders)
  – 2 separate Vdd rails
  – 1 MB L2$
  – 32b LPDDR2 (600 MHz DR)

NVIDIA Tegra II SoC (2011)

• A few (2, 4, 8) high-power processors (ARM): we need to handle power peaks
• Efficient accelerator fabrics with many (tens of) PEs: we need to improve efficiency
• Lots of (cool) memory, but we need more