Computer Architecture Guidance
Keio University
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Contents
Techniques on Parallel Processing: Parallel Architectures, Parallel Programming → on real machines
Advanced uni-processor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)
Class
Lecture using PowerPoint (70–90 mins.): the ppt file is uploaded on the web site
http://www.am.ics.keio.ac.jp, and you can download/print it before the lecture.
When the file is uploaded, a message is sent to you by e-mail.
Textbook: “Parallel Computers” by H. Amano (Sho-ko-do), but too old….
Exercise (20 mins, or homework): simple design or calculation on design issues. Sorry, it often becomes homework.
Evaluation
Exercise on Parallel Programming using GPU (50%). Caution! If the program does not run, the unit cannot be given even if you finish all other exercises.
Exercise after every lecture (50%)
TSUBAME2.0 (Xeon+Tesla, Top500 2010/11 4th), Tianhe-1 (天河一号) (Xeon+FireStream, Top500 2009/11 5th)
GPGPU (General-Purpose computing on Graphics Processing Units)
※ The items in parentheses show the development environment.
glossary 1
Since some of you say you cannot understand the English terms at all, a glossary is provided. This glossary is valid only in the computer field; usage may differ considerably from general English.
Parallel: means running truly at the same time. When things merely appear to run at the same time, the term concurrent is used to distinguish them; conceptually, concurrent > parallel.
Exercise: here, the short exercise given at the end of each lecture.
GPU (Graphics Processing Unit): the Cell Broadband Engine had been used in this class, but GPUs were introduced last year. A newer, faster model is planned for this year.
Computer Architecture 1: Introduction to Parallel Architectures
Keio University
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Parallel Architecture
Purposes, Classifications, Terms, Trends
A parallel architecture consists of multiple processing units which work simultaneously.→ Thread level parallelism
Boundary between Parallel machines and Uniprocessors
ILP (Instruction-Level Parallelism): a single program counter; parallelism inside/between instructions
TLP (Thread-Level Parallelism): multiple program counters; parallelism between processes and jobs
Uniprocessors
Parallel Machines: Definition
Hennessy & Patterson’s Computer Architecture: A Quantitative Approach
Multicore Revolution
Niagara 2
1. The end of increasing clock frequency:
   - Power consumption becomes too high.
   - A large wiring delay in recent processes.
   - The gap between CPU performance and memory latency.
2. The limitation of ILP.
3. Since 2003, multicore and manycore have become popular.
Increasing power consumption
End of Moore’s Law in computer performance
[Figure: single-processor performance growth over time: 1.25×/year, then 1.5×/year (= Moore’s Law), then 1.2×/year]
Purposes of providing multiple processors
Performance: a job can be executed quickly with multiple processors.
Dependability: if a processing unit is damaged, the total system can remain available → redundant systems.
Resource sharing: multiple jobs share memory and/or I/O modules for cost-effective processing → distributed systems.
Low energy: high performance with low-frequency operation.
Parallel Architecture: Performance Centric!
glossary 2
Simultaneously: "at the same time," almost the same as in parallel, but with a slightly different nuance: in parallel suggests doing similar things at the same time, while simultaneously just means things happen at the same time, whatever they are.
Thread: a single flow of control through a program. Thread-level parallelism (TLP) is parallelism between threads; here, following Hennessy and Patterson's textbook, the term is used when the program counters are independent, though some people use it differently. In contrast, parallelism between instructions under a single PC is called ILP.
Dependability: fault tolerance in a broad sense, covering both reliability and availability; in short, robustness against failures. A redundant system improves dependability by providing extra resources.
Distributed system: a system that processes work in a distributed fashion, for efficiency and/or improved dependability.
Flynn’s Classification
The number of instruction streams: M (Multiple) / S (Single)
The number of data streams: M/S
SISD: uniprocessors (including superscalar and VLIW)
MISD: not existing (analog computers?)
SIMD
MIMD
He gave a lecture at Keio last May.
SIMD (Single Instruction Stream, Multiple Data Streams)
[Figure: one instruction memory broadcasts each instruction to multiple processing units, each with its own data memory]
• All processing units execute the same instruction
• Low degree of flexibility
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GPGPU (coarse grain)
• CM-2 (fine grain)
Two types of SIMD
Coarse grain: each node performs floating-point numerical operations
  Old supercomputers: ILLIAC-IV, BSP, GF-11
  Multimedia instructions in recent high-end CPUs
  Accelerators: GPU, ClearSpeed
  Dedicated on-chip approach: NEC’s IMAP
Fine grain: each node only performs operations of a few bits
  ICL DAP, CM-2, MP-2
  Image/signal processing
  Connection Machine (CM-2) extends the application to artificial intelligence (CmLisp)
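A minimal sketch of the coarse-grain idea in plain C (the function name and the 4-element data are illustrative, not taken from any machine above): a single operation is applied across all data elements, which is exactly the loop shape that one multimedia/vector instruction executes several lanes of at once.

```c
#include <assert.h>

/* SIMD in miniature: one operation (add) applied to every data
 * element. A 128-bit multimedia instruction set would execute 4 of
 * these 32-bit lanes per instruction; a fine-grain machine like CM-2
 * would instead run bit-serial operations on many more PEs. */
void simd_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)   /* one lane per processing unit */
        c[i] = a[i] + b[i];
}
```

Because every lane runs the same instruction, data-dependent control flow is what limits the flexibility noted above.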
GeForce GTX280: 240 cores
[Figure: groups of thread processors, each group with per-block shared memory (PBSM); a Thread Execution Manager and Input Assembler connect to the host; load/store access to global memory]
GPU (NVIDIA’s GTX580)
512 GPU cores (128 × 4), 768 KB L2 cache, 40nm CMOS, 550 mm²
[Figure: four clusters of 128 cores around a shared L2 cache]
IMAP-CE
The LPA (Linear Processor Array) consists of 16 PE groups.
[Figure: PE groups of 8-bit PEs with shift registers (SR) for video data in and out, an inter-PE data selector, background transfer control, DRAM/CPU/bus interfaces, and a Control Processor (CP) with a 32KB instruction cache, a 2KB data cache, 16b×32 general registers, 4-word/clock instruction fetch, and ALU/MUL units]
ClearSpeed CSX600
[Figure: an array of poly execution units, each with FP MUL, FP ADD, DIV/SQRT, MAC, ALU, a register file, SRAM, and PIO; PIO collection/distribution; poly mcoded/LS/PIO control and a poly scoreboard; a mono execution unit with thread control, a semaphore unit, and data/control caches]
96 execution units which work at 250MHz
GRAPE-DR
Kei Hiraki “GRAPE-DR” http://www.fpl.org (FPL2007)
Renesas MTX
[Figure: 2048 PEs connected through an I/O interface; each PE has a 2-bit ALU, data registers 0/1, instruction pointers 0/1, a selector, and valid/OH/V/H channels; 256-bit and 4096-bit datapaths; instruction memory and controller]
The future of SIMD
Coarse-grain SIMD: GPGPU became the main stream of accelerators. Other SIMD accelerators: CSX600, GRAPE-DR. Multimedia instructions will continue to be used in the future.
Fine-grain SIMD: advantageous for specific applications like image processing; used as on-chip accelerators. General-purpose machines are difficult to build (e.g., CM-2 → CM-5).
MIMD
[Figure: multiple processors connected through interconnection networks to memory modules (instructions and data)]
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
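A minimal POSIX-threads sketch of the MIMD model (the thread count and loop length are arbitrary choices for illustration): each thread is an independent instruction stream with its own program counter, and the mutex provides the synchronization called for above.

```c
#include <pthread.h>

enum { NTHREADS = 4, PER_THREAD = 1000 };

static long counter = 0;                                /* shared memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread executes its own instruction stream (MIMD). */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < PER_THREAD; i++) {
        pthread_mutex_lock(&lock);      /* synchronization is required */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_mimd(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Without the mutex, the four streams would race on the shared counter, which is why MIMD machines need explicit synchronization where SIMD machines do not.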
Classification of MIMD machines: structure of the shared memory
UMA (Uniform Memory Access model): provides shared memory which can be accessed from all processors in the same manner.
NUMA (Non-Uniform Memory Access model): provides shared memory, but not uniformly accessed.
NORA/NORMA (No Remote Memory Access model): provides no shared memory; communication is done with message passing.
UMA
The simplest structure of shared-memory machines; the extension of uniprocessors.
An OS which is an extension of a single-processor OS can be used.
Programming is easy.
System size is limited.
Bus connected / switch connected.
A total system can be implemented on a single chip: on-chip multiprocessor / chip multiprocessor / single-chip multiprocessor. Examples: IBM Power 5, NEC/ARM chip multiprocessors for embedded systems.
An example of UMA : Bus connected
[Figure: four PUs, each with a snoop cache, on a shared bus to main memory]
SMP (Symmetric MultiProcessor)
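The snoop caches in the figure keep their copies consistent by watching ("snooping") the shared bus. A toy write-invalidate, write-through protocol for a single address sketches the idea (this is a simplification for illustration, not the exact protocol of any real SMP):

```c
/* Toy write-invalidate snooping: every cache watches the shared bus;
 * when one PU writes, all other cached copies are invalidated. */
enum { PUS = 4 };

typedef struct { int valid; int data; } Line;

static Line cache[PUS];    /* each PU's cached copy of one address */
static int memory_word;    /* main memory on the shared bus */

void pu_read(int pu) {
    if (!cache[pu].valid) {            /* miss: fetch over the bus */
        cache[pu].data = memory_word;
        cache[pu].valid = 1;
    }
}

void pu_write(int pu, int value) {
    for (int other = 0; other < PUS; other++)  /* snooped invalidation */
        if (other != pu)
            cache[other].valid = 0;
    cache[pu].data = value;
    cache[pu].valid = 1;
    memory_word = value;               /* write-through for simplicity */
}
```

A stale copy is never read: the next read after a remote write misses and refetches the new value from memory.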
On-chip multiprocessor (note that this is a logical image)
MPCore (ARM+NEC)
[Figure: four CPU/VFP cores, each with L1 memory and a CPU interface with timer and watchdog; an interrupt distributor with private FIQ lines and per-CPU IRQs; a Snoop Control Unit (SCU) with a coherence control bus and duplicated L1 tags; a private peripheral bus; an L2 cache on a private 64-bit AXI R/W bus]
SMP for embedded applications
SUN T1
[Figure: 8 cores connected through a crossbar switch to a shared FPU and four L2 cache banks with directories, then to memory]
Each core: single-issue, six-stage pipeline RISC with 16KB instruction / 8KB data L1 caches.
L2 cache: 3MB in total, 64-byte interleaved.
Multi-Core (Intel’s Nehalem-EX)
8 CPU cores, 24MB L3 cache, 45nm CMOS, 600 mm²
[Figure: eight CPU cores surrounding a shared L3 cache]
Heterogeneous vs. Homogeneous
Homogeneous: consisting of the same processing elements. A single task can be easily executed in parallel; a single programming environment.
Heterogeneous: consisting of various types of processing elements. Mainly for task-level parallel processing; high performance per cost. Most recent high-end processors for cellular phones use this structure. However, programming is difficult.
NEC MP211
[Figure: three ARM926 PEs (PE0–PE2) and an SPX-K602 DSP on a multi-layer AHB, with DMAC, USB OTG, 3D accelerator, rotater, image accelerator, camera/DTV/LCD interfaces, async and APB bridges, IIC, UART, timers, watchdog, memory card, PCM, security accelerator, interrupt controller, GPIO, SIO, PMU, PLL/OSC, instruction RAM, a scheduler and bus interface to an SDRAM controller and SRAM interface, 640KB on-chip SRAM, and external FLASH / DDR SDRAM]
Heterogeneous-type UMA
NUMA
Each processor provides a local memory and accesses other processors’ memories through the network.
Address translation and cache control often make the hardware structure complicated.
Scalable: programs for UMA can run without modification, and performance improves with the system size.
Competitive with WS/PC clusters using software DSM.
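The defining NUMA property, a single logical address space whose regions are homed in different nodes' local memories, can be sketched as an address-to-node mapping (the node count and address-space size are illustrative assumptions, not any real machine's parameters):

```c
/* NUMA address mapping: the single logical address space is split
 * into equal regions, each homed in one node's local memory. An
 * access is local (fast) when the requesting node is the home node,
 * and remote (slow, through the network) otherwise. */
enum { NODES = 4 };

int home_node(unsigned long addr, unsigned long space_size) {
    return (int)(addr / (space_size / NODES));
}

int is_local_access(int requesting_node, unsigned long addr,
                    unsigned long space_size) {
    return home_node(addr, space_size) == requesting_node;
}
```

The same program runs on any node; only the latency of each access differs, which is why UMA programs run on NUMA without modification.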
Typical structure of NUMA
[Figure: four nodes (0–3) on an interconnection network; each node’s local memory forms one region (0–3) of the single logical address space]
Classification of NUMA
Simple NUMA: remote memory is not cached. Simple structure, but the access cost of remote memory is large.
CC-NUMA (Cache-Coherent NUMA): cache consistency is maintained with hardware. The structure tends to be complicated.
COMA (Cache-Only Memory Architecture): no home memory; complicated control mechanism.
Cray’s T3D: a simple NUMA supercomputer (1993), using the Alpha 21064.
The Earth Simulator (2002): simple NUMA.
IBM BlueGene/L: the fastest computer, also simple NUMA. (From the IBM web site.)
Cell (IBM/SONY/Toshiba)
[Figure: the PPE (IBM Power CPU core: 2-way superscalar, 2-thread, 32KB+32KB L1, 512KB L2) and eight SPEs (Synergistic Processing Elements: SIMD cores, 128-bit (32-bit × 4), 2-way superscalar, each with a 512KB local store and a DMA unit) connected by the EIB 2+2 ring bus to the MIC (external DRAM) and the BIC (FlexIO)]
The local stores (LS) of the SPEs are mapped onto the same address space as the PPE.
SGI Origin
[Figure: two PEs and main memory connected directly to a hub chip; hub chips form a bristled hypercube network]
Main memory is connected directly with the hub chip; one cluster consists of 2 PEs.
SGI’s CC-NUMA Origin3000 (2000), using the R12000.
TRIPS
[Figure: chip floorplan with the TRIPS L2 cache (OCN) and TRIPS processors 0 and 1 (OPN), built from tiles: R = register tile, E = execution tile, I = instruction cache tile, D = data cache tile, G = global control tile, N = network tile, M = memory tile; plus a DDRAM controller, DMA controller, chip-to-chip (C2C) interface, and the OCN/OPN interconnects]
DDM (Data Diffusion Machine)
NORA/NORMA
No shared memory; communication is done with message passing.
Simple structure but high peak performance; a cost-effective solution.
Hard for programming.
Inter-PU communication: cluster computing.
Tile processors: on-chip NORMA for embedded applications.
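Message passing without shared memory can be sketched with two OS processes and a pipe standing in for two nodes and their interconnect (the doubling computation is just an illustration):

```c
#include <unistd.h>
#include <sys/wait.h>

/* Two processes share no memory; the pipe plays the role of the
 * interconnection network, and write/read are the message passing. */
int remote_double(int value) {
    int fd[2];
    if (pipe(fd) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                       /* "remote node": compute, send */
        int result = value * 2;
        if (write(fd[1], &result, sizeof result) != sizeof result)
            _exit(1);
        _exit(0);
    }
    int received = -1;
    if (read(fd[0], &received, sizeof received) != sizeof received)
        received = -1;                    /* "local node": receive */
    waitpid(pid, NULL, 0);
    close(fd[0]);
    close(fd[1]);
    return received;
}
```

All data movement is explicit, which is what makes NORA/NORMA machines simple to build but hard to program.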
Early hypercube machine: nCUBE2
Fujitsu’s NORA AP1000 (1990): mesh connection, SPARC
Intel’s Paragon XP/S (1991): mesh connection, i860
PC Cluster
Beowulf cluster (NASA’s Beowulf Project, 1994, by Sterling): commodity components, TCP/IP, free software.
Others: commodity components, high-performance networks like Myrinet / InfiniBand, dedicated software.
RHiNET-2 cluster
Tilera’s Tile64
Tile Pro, Tile Gx
Linux runs in each core.
Intel 80-Core Chip
Intel 80-core chip [Vangal,ISSCC’07]
Multi-core + Accelerator
Intel’s Sandy Bridge AMD’s Fusion APU
[Figure: Sandy Bridge: four cores, each with an LLC slice, plus an on-die GPU, system agent, and I/O. Fusion APU: two CPU cores, two GPUs, a video decoder, a memory controller, and a platform interface]
glossary 3
Flynn’s classification: the classification Flynn (a professor at Stanford) used in his paper; see the main text for the details.
Coarse grain: here, the processing elements are large enough to perform floating-point operations. The opposite is fine grain, where each element can only perform operations of a few bits.
Illiac-IV, BSP, GF-11, Connection Machine CM-2, MP-2 are machine names: famous SIMD machines of the past.
Synchronization and shared memory will be explained in detail in later lectures.
Message passing: exchanging data directly without using shared memory.
Embedded system; Homogeneous: made of the same kind of elements; Heterogeneous: made of different kinds of elements.
Coherent cache: a cache whose contents are guaranteed to be consistent; cache consistency (consistency of the contents) will also be explained in later lectures.
Commodity component: a standard part, cheap and easy to obtain.
Power 5, Origin2000, Cray XD-1, AP1000, and nCUBE are also machine names, as are the Earth Simulator and IBM BlueGene/L (currently the fastest).
Terms (1)
Multiprocessors: MIMD machines with shared memory.
Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous.
Extended definition: all parallel machines (wrong usage).
Terms (2)
Multicomputer: MIMD machines without shared memory, that is, NORA/NORMA.
Array computer: a machine consisting of an array of processing elements, i.e., SIMD. Also (wrongly) used for a supercomputer for array calculations.
Loosely coupled / tightly coupled: loosely coupled = NORA, tightly coupled = UMA. But are NORAs really loosely coupled?? Don’t use these terms if possible.
Classification
Stored-programming based:
  SIMD: fine grain / coarse grain
  MIMD:
    Multiprocessors: bus-connected UMA, switch-connected UMA, NUMA (simple NUMA, CC-NUMA, COMA)
    Multicomputers: NORA
Others: systolic architecture, data-flow architecture, mixed control, demand-driven architecture
Exercise 1
In 2011, the Japanese supercomputer K got the award of “world’s fastest computer.”
Which type is K classified into? Why do you think so? The reason is important!
If you take this class, send the answer with your name and student number to [email protected].
You can use either Japanese or English.