Computer Architecture Guidance
Keio University
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Contents
Techniques on Parallel Processing: Parallel Architectures, Parallel Programming → on real machines
Advanced uni-processor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)
Class
Lecture using PowerPoint (70–90 mins.): the ppt file is uploaded on the web site
http://www.am.ics.keio.ac.jp, and you can download/print it before the lecture.
When the file is uploaded, a message is sent to you by e-mail.
Textbook: “Parallel Computers” by H. Amano (Sho-ko-do), but too old….
Exercise (20 mins, or homework): simple design or calculation on design issues. Sorry, it often becomes homework.
Evaluation
Exercise on Parallel Programming using GPU (50%). Caution! If the program does not run, the unit cannot be given even if you finish all other exercises.
Exercise after every lecture (50%)
TSUBAME2.0 (Xeon+Tesla, Top500 2010/11 4th), Tianhe-1 (天河一号) (Xeon+FireStream, Top500 2009/11 5th)
GPGPU (General-Purpose computing on Graphics Processing Units)
※ The items in parentheses show the development environment.
glossary 1
Since some of you say you cannot understand the English terms at all, a glossary is provided. This glossary is valid only in the computer field; usage may differ considerably from general English.
Parallel: means running truly at the same time. When things merely appear to run at the same time, the term concurrent is used to distinguish them; conceptually, concurrent > parallel.
Exercise: here, the short exercise given at the end of each lecture.
GPU (Graphics Processing Unit): the Cell Broadband Engine had been used in this class, but GPUs were introduced last year. A newer, faster model is planned for this year.
Computer Architecture 1: Introduction to Parallel Architectures
Keio University
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Parallel Architecture
Purposes, Classifications, Terms, Trends
A parallel architecture consists of multiple processing units which work simultaneously.→ Thread level parallelism
Boundary between Parallel machines and Uniprocessors
ILP (Instruction-Level Parallelism): a single program counter; parallelism inside/between instructions
TLP (Thread-Level Parallelism): multiple program counters; parallelism between processes and jobs
Uniprocessors
Parallel Machines: Definition
Hennessy & Patterson’s Computer Architecture: A Quantitative Approach
Multicore Revolution
Niagara 2
1. The end of increasing clock frequency:
   - Power consumption becomes too high.
   - A large wiring delay in recent processes.
   - The gap between CPU performance and memory latency.
2. The limitation of ILP.
3. Since 2003, multicore and manycore have become popular.
Increasing power consumption
End of Moore’s Law in computer performance
[Figure: single-processor performance growth over time: 1.25×/year, then 1.5×/year (= Moore’s Law), then 1.2×/year]
Purposes of providing multiple processors
Performance: a job can be executed quickly with multiple processors.
Dependability: if a processing unit is damaged, the total system can remain available → redundant systems.
Resource sharing: multiple jobs share memory and/or I/O modules for cost-effective processing → distributed systems.
Low energy: high performance with low-frequency operation.
Parallel Architecture: Performance Centric!
glossary 2
Simultaneously: "at the same time," almost the same as in parallel, but with a slightly different nuance: in parallel suggests doing similar things at the same time, while simultaneously just means things happen at the same time, whatever they are.
Thread: a single flow of control through a program. Thread-level parallelism (TLP) is parallelism between threads; here, following Hennessy and Patterson's textbook, the term is used when the program counters are independent, though some people use it differently. In contrast, parallelism between instructions under a single PC is called ILP.
Dependability: fault tolerance in a broad sense, covering both reliability and availability; in short, robustness against failures. A redundant system improves dependability by providing extra resources.
Distributed system: a system that processes work in a distributed fashion, for efficiency and/or improved dependability.
Flynn’s Classification
The number of instruction streams: M (Multiple) / S (Single)
The number of data streams: M/S
SISD: uniprocessors (including superscalar and VLIW)
MISD: not existing (analog computers?)
SIMD
MIMD
He gave a lecture at Keio last May.
SIMD (Single Instruction Stream, Multiple Data Streams)
[Figure: one instruction memory broadcasts each instruction to multiple processing units, each with its own data memory]
• All processing units execute the same instruction
• Low degree of flexibility
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GPGPU (coarse grain)
• CM-2 (fine grain)
Two types of SIMD
Coarse grain: each node performs floating-point numerical operations
  Old supercomputers: ILLIAC-IV, BSP, GF-11
  Multimedia instructions in recent high-end CPUs
  Accelerators: GPU, ClearSpeed
  Dedicated on-chip approach: NEC’s IMAP
Fine grain: each node only performs operations of a few bits
  ICL DAP, CM-2, MP-2
  Image/signal processing
  Connection Machine (CM-2) extends the application to artificial intelligence (CmLisp)
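A minimal sketch of the coarse-grain idea in plain C (the function name and the 4-element data are illustrative, not taken from any machine above): a single operation is applied across all data elements, which is exactly the loop shape that one multimedia/vector instruction executes several lanes of at once.

```c
#include <assert.h>

/* SIMD in miniature: one operation (add) applied to every data
 * element. A 128-bit multimedia instruction set would execute 4 of
 * these 32-bit lanes per instruction; a fine-grain machine like CM-2
 * would instead run bit-serial operations on many more PEs. */
void simd_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)   /* one lane per processing unit */
        c[i] = a[i] + b[i];
}
```

Because every lane runs the same instruction, data-dependent control flow is what limits the flexibility noted above.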
GeForce GTX280: 240 cores
[Figure: groups of thread processors, each group with per-block shared memory (PBSM); a Thread Execution Manager and Input Assembler connect to the host; load/store access to global memory]
GPU (NVIDIA’s GTX580)
512 GPU cores (128 × 4), 768 KB L2 cache, 40nm CMOS, 550 mm²
[Figure: four clusters of 128 cores around a shared L2 cache]
IMAP-CE
The LPA (Linear Processor Array) consists of 16 PE groups.
[Figure: PE groups of 8-bit PEs with shift registers (SR) for video data in and out, an inter-PE data selector, background transfer control, DRAM/CPU/bus interfaces, and a Control Processor (CP) with a 32KB instruction cache, a 2KB data cache, 16b×32 general registers, 4-word/clock instruction fetch, and ALU/MUL units]
ClearSpeed CSX600
[Figure: an array of poly execution units, each with FP MUL, FP ADD, DIV/SQRT, MAC, ALU, a register file, SRAM, and PIO; PIO collection/distribution; poly mcoded/LS/PIO control and a poly scoreboard; a mono execution unit with thread control, a semaphore unit, and data/control caches]
96 execution units which work at 250MHz
GRAPE-DR
Kei Hiraki “GRAPE-DR” http://www.fpl.org (FPL2007)
Renesas MTX
[Figure: 2048 PEs connected through an I/O interface; each PE has a 2-bit ALU, data registers 0/1, instruction pointers 0/1, a selector, and valid/OH/V/H channels; 256-bit and 4096-bit datapaths; instruction memory and controller]
The future of SIMD
Coarse-grain SIMD: GPGPU became the main stream of accelerators. Other SIMD accelerators: CSX600, GRAPE-DR. Multimedia instructions will continue to be used in the future.
Fine-grain SIMD: advantageous for specific applications like image processing; used as on-chip accelerators. General-purpose machines are difficult to build (e.g., CM-2 → CM-5).
MIMD
[Figure: multiple processors connected through interconnection networks to memory modules (instructions and data)]
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
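A minimal POSIX-threads sketch of the MIMD model (the thread count and loop length are arbitrary choices for illustration): each thread is an independent instruction stream with its own program counter, and the mutex provides the synchronization called for above.

```c
#include <pthread.h>

enum { NTHREADS = 4, PER_THREAD = 1000 };

static long counter = 0;                                /* shared memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread executes its own instruction stream (MIMD). */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < PER_THREAD; i++) {
        pthread_mutex_lock(&lock);      /* synchronization is required */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_mimd(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Without the mutex, the four streams would race on the shared counter, which is why MIMD machines need explicit synchronization where SIMD machines do not.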
Classification of MIMD machines: structure of the shared memory
UMA (Uniform Memory Access model): provides shared memory which can be accessed from all processors in the same manner.
NUMA (Non-Uniform Memory Access model): provides shared memory, but not uniformly accessed.
NORA/NORMA (No Remote Memory Access model): provides no shared memory; communication is done with message passing.
UMA
The simplest structure of shared-memory machines; the extension of uniprocessors.
An OS which is an extension of a single-processor OS can be used.
Programming is easy.
System size is limited.
Bus connected / switch connected.
A total system can be implemented on a single chip: on-chip multiprocessor / chip multiprocessor / single-chip multiprocessor. Examples: IBM Power 5, NEC/ARM chip multiprocessors for embedded systems.
An example of UMA : Bus connected
[Figure: four PUs, each with a snoop cache, on a shared bus to main memory]
SMP (Symmetric MultiProcessor)
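The snoop caches in the figure keep their copies consistent by watching ("snooping") the shared bus. A toy write-invalidate, write-through protocol for a single address sketches the idea (this is a simplification for illustration, not the exact protocol of any real SMP):

```c
/* Toy write-invalidate snooping: every cache watches the shared bus;
 * when one PU writes, all other cached copies are invalidated. */
enum { PUS = 4 };

typedef struct { int valid; int data; } Line;

static Line cache[PUS];    /* each PU's cached copy of one address */
static int memory_word;    /* main memory on the shared bus */

void pu_read(int pu) {
    if (!cache[pu].valid) {            /* miss: fetch over the bus */
        cache[pu].data = memory_word;
        cache[pu].valid = 1;
    }
}

void pu_write(int pu, int value) {
    for (int other = 0; other < PUS; other++)  /* snooped invalidation */
        if (other != pu)
            cache[other].valid = 0;
    cache[pu].data = value;
    cache[pu].valid = 1;
    memory_word = value;               /* write-through for simplicity */
}
```

A stale copy is never read: the next read after a remote write misses and refetches the new value from memory.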
On-chip multiprocessor (note that this is a logical image)
MPCore (ARM+NEC)
[Figure: four CPU/VFP cores, each with L1 memory and a CPU interface with timer and watchdog; an interrupt distributor with private FIQ lines and per-CPU IRQs; a Snoop Control Unit (SCU) with a coherence control bus and duplicated L1 tags; a private peripheral bus; an L2 cache on a private 64-bit AXI R/W bus]
SMP for embedded applications
SUN T1
[Figure: 8 cores connected through a crossbar switch to a shared FPU and four L2 cache banks with directories, then to memory]
Each core: single-issue, six-stage pipeline RISC with 16KB instruction / 8KB data L1 caches.
L2 cache: 3MB in total, 64-byte interleaved.
Multi-Core (Intel’s Nehalem-EX)
8 CPU cores, 24MB L3 cache, 45nm CMOS, 600 mm²
[Figure: eight CPU cores surrounding a shared L3 cache]
Heterogeneous vs. Homogeneous
Homogeneous: consisting of the same processing elements. A single task can be easily executed in parallel; a single programming environment.
Heterogeneous: consisting of various types of processing elements. Mainly for task-level parallel processing; high performance per cost. Most recent high-end processors for cellular phones use this structure. However, programming is difficult.
NEC MP211
[Figure: three ARM926 PEs (PE0–PE2) and an SPX-K602 DSP on a multi-layer AHB, with DMAC, USB OTG, 3D accelerator, rotater, image accelerator, camera/DTV/LCD interfaces, async and APB bridges, IIC, UART, timers, watchdog, memory card, PCM, security accelerator, interrupt controller, GPIO, SIO, PMU, PLL/OSC, instruction RAM, a scheduler and bus interface to an SDRAM controller and SRAM interface, 640KB on-chip SRAM, and external FLASH / DDR SDRAM]
Heterogeneous-type UMA
NUMA
Each processor provides a local memory and accesses other processors’ memories through the network.
Address translation and cache control often make the hardware structure complicated.
Scalable: programs for UMA can run without modification, and performance improves with the system size.
Competitive with WS/PC clusters using software DSM.
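The defining NUMA property, a single logical address space whose regions are homed in different nodes' local memories, can be sketched as an address-to-node mapping (the node count and address-space size are illustrative assumptions, not any real machine's parameters):

```c
/* NUMA address mapping: the single logical address space is split
 * into equal regions, each homed in one node's local memory. An
 * access is local (fast) when the requesting node is the home node,
 * and remote (slow, through the network) otherwise. */
enum { NODES = 4 };

int home_node(unsigned long addr, unsigned long space_size) {
    return (int)(addr / (space_size / NODES));
}

int is_local_access(int requesting_node, unsigned long addr,
                    unsigned long space_size) {
    return home_node(addr, space_size) == requesting_node;
}
```

The same program runs on any node; only the latency of each access differs, which is why UMA programs run on NUMA without modification.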
Typical structure of NUMA
[Figure: four nodes (0–3) on an interconnection network; each node’s local memory forms one region (0–3) of the single logical address space]
Classification of NUMA
Simple NUMA: remote memory is not cached. Simple structure, but the access cost of remote memory is large.
CC-NUMA (Cache-Coherent NUMA): cache consistency is maintained with hardware. The structure tends to be complicated.
COMA (Cache-Only Memory Architecture): no home memory; complicated control mechanism.
Cray’s T3D: a simple NUMA supercomputer (1993), using the Alpha 21064.
The Earth Simulator (2002): simple NUMA.
IBM BlueGene/L: the fastest computer, also simple NUMA. (From the IBM web site.)
Cell (IBM/SONY/Toshiba)
[Figure: the PPE (IBM Power CPU core: 2-way superscalar, 2-thread, 32KB+32KB L1, 512KB L2) and eight SPEs (Synergistic Processing Elements: SIMD cores, 128-bit (32-bit × 4), 2-way superscalar, each with a 512KB local store and a DMA unit) connected by the EIB 2+2 ring bus to the MIC (external DRAM) and the BIC (FlexIO)]
The local stores (LS) of the SPEs are mapped onto the same address space as the PPE.
SGI Origin
[Figure: two PEs and main memory connected directly to a hub chip; hub chips form a bristled hypercube network]
Main memory is connected directly with the hub chip; one cluster consists of 2 PEs.
SGI’s CC-NUMA Origin3000 (2000), using the R12000.
TRIPS
[Figure: chip floorplan with the TRIPS L2 cache (OCN) and TRIPS processors 0 and 1 (OPN), built from tiles: R = register tile, E = execution tile, I = instruction cache tile, D = data cache tile, G = global control tile, N = network tile, M = memory tile; plus a DDRAM controller, DMA controller, chip-to-chip (C2C) interface, and the OCN/OPN interconnects]
DDM (Data Diffusion Machine)
NORA/NORMA
No shared memory; communication is done with message passing.
Simple structure but high peak performance; a cost-effective solution.
Hard for programming.
Inter-PU communication: cluster computing.
Tile processors: on-chip NORMA for embedded applications.
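Message passing without shared memory can be sketched with two OS processes and a pipe standing in for two nodes and their interconnect (the doubling computation is just an illustration):

```c
#include <unistd.h>
#include <sys/wait.h>

/* Two processes share no memory; the pipe plays the role of the
 * interconnection network, and write/read are the message passing. */
int remote_double(int value) {
    int fd[2];
    if (pipe(fd) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                       /* "remote node": compute, send */
        int result = value * 2;
        if (write(fd[1], &result, sizeof result) != sizeof result)
            _exit(1);
        _exit(0);
    }
    int received = -1;
    if (read(fd[0], &received, sizeof received) != sizeof received)
        received = -1;                    /* "local node": receive */
    waitpid(pid, NULL, 0);
    close(fd[0]);
    close(fd[1]);
    return received;
}
```

All data movement is explicit, which is what makes NORA/NORMA machines simple to build but hard to program.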
Early hypercube machine: nCUBE2
Fujitsu’s NORA AP1000 (1990): mesh connection, SPARC
Intel’s Paragon XP/S (1991): mesh connection, i860
PC Cluster
Beowulf cluster (NASA’s Beowulf Project, 1994, by Sterling): commodity components, TCP/IP, free software.
Others: commodity components, high-performance networks like Myrinet / InfiniBand, dedicated software.
RHiNET-2 cluster
Tilera’s Tile64
Tile Pro, Tile Gx
Linux runs in each core.
Intel 80-Core Chip
Intel 80-core chip [Vangal,ISSCC’07]
Multi-core + Accelerator
Intel’s Sandy Bridge AMD’s Fusion APU
[Figure: Sandy Bridge: four cores, each with an LLC slice, plus an on-die GPU, system agent, and I/O. Fusion APU: two CPU cores, two GPUs, a video decoder, a memory controller, and a platform interface]
glossary 3
Flynn’s classification: the classification Flynn (a professor at Stanford) used in his paper; see the main text for the details.
Coarse grain: here, the processing elements are large enough to perform floating-point operations. The opposite is fine grain, where each element can only perform operations of a few bits.
Illiac-IV, BSP, GF-11, Connection Machine CM-2, MP-2 are machine names: famous SIMD machines of the past.
Synchronization and shared memory will be explained in detail in later lectures.
Message passing: exchanging data directly without using shared memory.
Embedded system; Homogeneous: made of the same kind of elements; Heterogeneous: made of different kinds of elements.
Coherent cache: a cache whose contents are guaranteed to be consistent; cache consistency (consistency of the contents) will also be explained in later lectures.
Commodity component: a standard part, cheap and easy to obtain.
Power 5, Origin2000, Cray XD-1, AP1000, and nCUBE are also machine names, as are the Earth Simulator and IBM BlueGene/L (currently the fastest).
Terms (1)
Multiprocessors: MIMD machines with shared memory.
Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous.
Extended definition: all parallel machines (wrong usage).
Terms (2)
Multicomputer: MIMD machines without shared memory, that is, NORA/NORMA.
Array computer: a machine consisting of an array of processing elements, i.e., SIMD. Also (wrongly) used for a supercomputer for array calculations.
Loosely coupled / tightly coupled: loosely coupled = NORA, tightly coupled = UMA. But are NORAs really loosely coupled?? Don’t use these terms if possible.
Classification
Stored-programming based:
  SIMD: fine grain / coarse grain
  MIMD:
    Multiprocessors: bus-connected UMA, switch-connected UMA, NUMA (simple NUMA, CC-NUMA, COMA)
    Multicomputers: NORA
Others: systolic architecture, data-flow architecture, mixed control, demand-driven architecture
Exercise 1
In 2011, the Japanese supercomputer K got the award of “world’s fastest computer.”
Which type is K classified into? Why do you think so? The reason is important!
If you take this class, send the answer with your name and student number to [email protected].
You can use either Japanese or English.