My general research area: System-level support for parallel and distributed computing
•User: run my_prog
•System: load my_prog on nodes, run it, collect results
Efficiency: make the program run as fast as possible; use system resources (CPU, RAM, disk) efficiently
User-level abstractions that make the system easy to use
[Diagram: my_prog distributed across cluster nodes CPU 1 … CPU n, each with its own memory and disk]
Nswap: Network RAM for cluster computers
Cluster computer: multiple inexpensive, “independent” machines connected by a network, running system software that makes them look and act like a single parallel computer
[Diagram: cluster nodes connected by a network, unified by cluster system software]
General Purpose Cluster
Multiple parallel programs share the cluster
•Assigned to some, possibly overlapping, machines
•Share network, memory, CPU, and disk resources
Program workload changes over time
•New programs may enter the system
•Existing programs may complete and leave
==> Imbalances in RAM and CPU usage across nodes: some nodes don't have enough RAM, some have unused RAM
[Diagram: programs P1, P2, P3 assigned to overlapping sets of cluster nodes]
What happens when a node doesn't have enough RAM?
When there is more data than fits into memory, the OS moves data to/from disk as needed (swapping)
Time to access data:
•CPU: ~0.000000005 secs
•RAM: ~100 x slower than CPU
•Disk: ~10 million x slower
-> Swapping is really, really slow
[Diagram: a single node with CPU, memory, and disk]
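The latency gap above can be made concrete with a little arithmetic. This is a rough model using only the slide's ratios (CPU access ~5 ns, RAM ~100x slower, disk ~10,000,000x slower); the page count is an illustrative choice, not a measurement:

```python
# Rough latency model from the slide's numbers. Illustrative only.
CPU_ACCESS = 5e-9                       # seconds, ~0.000000005 s
RAM_ACCESS = CPU_ACCESS * 100           # ~100x slower than CPU
DISK_ACCESS = CPU_ACCESS * 10_000_000   # ~10 million x slower

# Touching 1,000 pages' worth of data from RAM vs. from disk (swapping):
pages = 1_000
print(f"RAM:  {pages * RAM_ACCESS:.4f} s")
print(f"Disk: {pages * DISK_ACCESS:.1f} s")
print(f"Disk is {DISK_ACCESS / RAM_ACCESS:,.0f}x slower than RAM")
```

With these ratios, the same 1,000 accesses take a fraction of a millisecond from RAM but nearly a minute from disk, which is why "swapping is really, really slow."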
Network Swapping in a Cluster
Bypass the disk and swap pages of RAM to remote idle memory in the cluster
[Diagram: Node 1's CPU/RAM swapping pages over the network to idle RAM on Node 2 and Node 3, bypassing the local disk]
• Network Swapping: expand Node 1’s memory using idle RAM of other cluster nodes rather than local disk
Why Nswap?
Swapping to Disk: 3561.66 seconds
Nswap: 105.50 seconds (a speed-up of 33.8)
The network is much faster than disk, so swapping data over the network to remote RAM is much faster than swapping to local disk
Nswap Architecture
Divided into two parts that run on each node:
1) Nswap client: device driver for the network swap “device”
•The OS makes swap-in & swap-out requests to it
2) Nswap server: manages part of RAM for caching remotely swapped pages (the Nswap Cache)
[Diagram: on each of Node A and Node B, an Nswap client in OS space and an Nswap server with its Nswap Cache in user space, connected by the Nswap Communication Layer over the network; a swap-out page flows from A's client to B's server]
This summer
Answer questions about policies for growing/shrinking the RAM available to Nswap, and implement solution(s):
•How do we know when idle RAM is available?
•Can we predict when idle RAM will be available for a long enough time to make it useful for Nswap?
•How much idle RAM should we take for Nswap?
•How much RAM should we “give back” to the system when it needs it?
Investigate incorporating Flash memory into the memory hierarchy and using it with Nswap to speed up swap-ins
System-level support for computation involving massive amounts of globally dispersed data (cluster computing on steroids)
•Internet-scale distributed/parallel computing
•Caching, prefetching, programming interface?
How Pages Move Around the Cluster
Swap Out: Node A (not enough RAM) sends a page to Node B (has idle RAM)
Swap In: Node A fetches the page back from Node B
Migrate from node to node with changes in workload: when Node B needs more RAM, it migrates A's page to Node C
[Diagram: pages flowing A → B on swap-out, B → A on swap-in, and A → B → C on migration]
Reliable Network RAM
Automatically restore remotely swapped page data lost in a node crash
How: We need redundancy
•Extra space to store redundant info
•Avoid using the slow disk
•Use idle RAM in the cluster to store redundant data
•Minimize use of idle RAM for redundant data
•Extra computation to compute redundant data
•Minimize extra computation overhead
Soln 1: Mirroring
When a page is swapped out, send it to be stored in the idle RAM of 2 nodes
•If the first node fails, we can fetch a copy from the second node
+ Easy to implement
- Requires twice as much idle RAM space for the same amount of data
- Requires twice as much network bandwidth
  - two page sends across the network vs. one
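The mirroring idea can be sketched in a few lines. This is a toy model, not Nswap's real implementation (which is a Linux kernel driver): the `Server` class and function names here are illustrative stand-ins for remote Nswap servers and their caches.

```python
# Toy sketch of mirrored swap-out: every page is sent to the idle RAM of
# two different servers, so a single node crash loses no data.
class Server:
    def __init__(self, name):
        self.name = name
        self.cache = {}                  # page_id -> page data ("Nswap cache")

    def store(self, page_id, data):
        self.cache[page_id] = data

def mirrored_swap_out(page_id, data, servers):
    """Two page sends across the network: 2x RAM and 2x bandwidth cost."""
    primary, backup = servers[0], servers[1]
    primary.store(page_id, data)
    backup.store(page_id, data)

def swap_in(page_id, servers):
    """Fetch the page from the first live server that still holds it."""
    for s in servers:
        if page_id in s.cache:
            return s.cache[page_id]
    raise KeyError(page_id)

b, c = Server("B"), Server("C")
mirrored_swap_out(42, b"page-data", [b, c])
b.cache.clear()                          # simulate node B crashing
print(swap_in(42, [b, c]))               # still recoverable from node C
```

The double `store` call makes the trade-off visible: recovery is trivial, but every swap-out pays for two sends and two cached copies.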
Soln 2: Centralized Parity
Encode redundant info for a set of pages across different nodes in a single parity page
If we lose data, we can recover it using the parity page and the other data pages in the set
[Diagram: data pages XORed bit-by-bit into a parity page; a lost page is recovered by XORing the parity page with the remaining data pages]
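The recovery step works because XOR is its own inverse. A minimal sketch with toy 4-byte "pages" (real pages are 4 KB; these values are made up for illustration):

```python
# XOR parity in miniature: the parity page is the bytewise XOR of all data
# pages in the set, so any single lost page can be rebuilt from the parity
# page and the surviving pages.
def xor_pages(pages):
    out = bytearray(len(pages[0]))
    for page in pages:
        for i, byte in enumerate(page):
            out[i] ^= byte
    return bytes(out)

pages = [b"\x41\x00\x10\xff", b"\x02\x30\x00\x01", b"\x11\x11\x11\x11"]
parity = xor_pages(pages)

# The node holding pages[1] crashes; recover from parity + survivors:
recovered = xor_pages([parity, pages[0], pages[2]])
assert recovered == pages[1]
```

One parity page protects a whole set of pages, which is why this needs far less idle RAM than mirroring.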
Centralized Parity (cont.)
A single dedicated cluster node is the parity server
•Stores all parity pages
•Implements page recovery on a crash
Parity Logging: regular nodes compute a parity page locally as they swap out pages; only when the parity page is full is it sent to the parity server
•One extra page send to the parity server for every N page swap-outs (vs. 2 sends on every swap-out for mirroring)
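Parity logging amounts to XOR-accumulating locally and flushing once per group. The sketch below is hypothetical (the class name and the `sent` list standing in for network sends to the parity server are illustrative, not Nswap's API):

```python
# Toy sketch of parity logging: each swapped-out page is XORed into a
# local parity accumulator; the parity page is sent to the parity server
# only once every N swap-outs.
PAGE_SIZE = 8
N = 4                                   # pages per parity group

class ParityLogger:
    def __init__(self):
        self.parity = bytearray(PAGE_SIZE)
        self.count = 0
        self.sent = []                  # stands in for sends to the parity server

    def on_swap_out(self, page):
        for i, byte in enumerate(page):
            self.parity[i] ^= byte      # fold the page into the local parity
        self.count += 1
        if self.count == N:             # parity page full: one network send
            self.sent.append(bytes(self.parity))
            self.parity = bytearray(PAGE_SIZE)
            self.count = 0

log = ParityLogger()
for k in range(8):                      # 8 swap-outs ...
    log.on_swap_out(bytes([k]) * PAGE_SIZE)
print(len(log.sent))                    # ... cost only 2 parity sends
```

Eight swap-outs cost two extra sends here, versus eight extra sends under mirroring.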
Soln 3: Decentralized Parity
No dedicated parity server: parity pages are distributed across cluster nodes
[Diagram: sets of data pages and their XOR parity pages spread across different cluster nodes]
Centralized vs. Decentralized
Results
Future Work
Acknowledgements
Sequential Programming
Designed to run on computers with one processor (CPU)
•The CPU knows how to do a small number of simple things (instructions)
A sequential program is an ordered set of instructions the CPU executes to solve a larger problem
(ex) Compute 3^4:
1. Multiply 3 and 3
2. Multiply the result and 3
3. Multiply the result and 3
4. Print out the result
[Diagram: a single node with CPU, memory, and disk]
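The four steps above, written as code:

```python
# The slide's sequential computation of 3^4, one instruction at a time:
result = 3 * 3        # step 1: multiply 3 and 3
result = result * 3   # step 2: multiply the result and 3
result = result * 3   # step 3: multiply the result and 3
print(result)         # step 4: print out the result -> 81
```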
Sequential Algorithm
For each time step do:
  For each grid element X do:
    compute X's new value:
    X = f(old X, neighbor 1, neighbor 2, …)
[Diagram: a grid of elements; element x is updated from its neighbors]
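A minimal sketch of this grid update in code. The slide leaves f abstract; averaging an element with its neighbors is an illustrative choice (a simple diffusion-style update), not the simulation's actual f:

```python
# One time step of the sequential grid algorithm: every element's new
# value is f(old value, neighbors). Here f is a plain average.
def step(grid):
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]        # write new values separately
    for i in range(rows):
        for j in range(cols):
            neighbors = [grid[x][y]
                         for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= x < rows and 0 <= y < cols]
            new[i][j] = (grid[i][j] + sum(neighbors)) / (1 + len(neighbors))
    return new

grid = [[0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0],
        [0.0, 0.0, 0.0]]
print(step(grid)[1][1])   # 1.8: the hot center spreads toward its neighbors
```

Note that new values are written into a separate grid so every element is computed from old values, matching the "old X, old neighbors" form on the slide.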
The Earth Simulator
Japan Agency for Marine-Earth Science and Technology
How a Computer Executes a Program
1. The OS loads program code & data from disk into RAM
2. The OS loads the CPU with the first instruction to run
3. The CPU starts executing instructions one at a time
4. The OS may need to move data to/from RAM & disk as the program runs
[Diagram: CPU, memory (RAM), and disk, with steps 1 and 2 marked]
How Fast is this?
CPU speed determines the max number of instructions it can execute
•Upper bound: 1 clock cycle ≈ 1 instruction
•1 GHz clock: ~1 billion instructions/sec
The max is never achieved:
•When the CPU needs to access RAM, it takes ~100 cycles
•If the OS needs to bring in more data from disk (RAM is fixed-size; not all program data can fit), it takes ~1,000,000 cycles
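A back-of-the-envelope calculation shows how quickly memory stalls eat the 1-billion-instructions/sec peak. The 1-in-10 RAM-access ratio below is an illustrative assumption, not a number from the slide:

```python
# 1 GHz CPU: ~1 instruction per cycle at peak, but a RAM access costs
# ~100 cycles. Suppose 1 instruction in 10 must go to RAM:
CYCLES_PER_SEC = 1_000_000_000
avg_cycles = (9 * 1 + 1 * 100) / 10   # 10.9 cycles per instruction, average
print(f"{CYCLES_PER_SEC / avg_cycles:,.0f} instructions/sec")  # far below 1 billion
```

Even a modest miss rate drops throughput by more than 10x; a disk access at ~1,000,000 cycles is catastrophically worse.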
Fast desktop machine (speaker's note: this is the idea, but check these numbers!)
1 GigaHertz processor
•Takes ~0.000000005 seconds to access data
2 GigaBytes of memory
•2^31 bytes
•Takes ~0.000001 seconds to access data
80 GB of disk space
•Takes ~0.01 seconds to access data: 1 million times slower than if the data is on the CPU
Requirements of Simulation
Petabytes of data
•1 petabyte is 2^50 bytes (1,125,899,906,842,624 bytes)
Billions of computations at each time step
We need help:
•A single computer cannot do one time step in real time
•Need a supercomputer
•Lots of processors running the simulation program in parallel
•Lots of memory space
•Lots of disk storage space
Parallel Programming
Divide the data and computation into several pieces and let several processors simultaneously compute their piece
[Diagram: a data array (3.6, 1.2, 2.3, 2.6, …) partitioned among Processor 1 through Processor n]
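The divide-and-combine idea in miniature: each "processor" (here just a loop iteration, not a real parallel worker) sums its own slice of the data, and combining the partial sums gives the full answer. The first four values come from the slide's diagram; the rest are made up to fill out the example:

```python
# Split the data into pieces, let each "processor" sum its own piece,
# then combine the partial results.
data = [3.6, 1.2, 2.3, 2.6, 5.0, 0.4, 1.1, 2.2]
n_procs = 4
chunk = len(data) // n_procs

# Each of the 4 "processors" computes the sum of its own slice:
partials = [sum(data[p * chunk:(p + 1) * chunk]) for p in range(n_procs)]
total = sum(partials)                 # combine: same answer as sum(data)
print(total)
```

With real processors the four partial sums run at the same time, so the work takes roughly 1/4 of the sequential time plus the cost of combining.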
Supercomputers of the 90's
Massively parallel
•1,000's of processors
Custom, state of the art
•Hardware
•Operating system
•Specialized programming languages and programming tools
Fastest Computer*
[Chart: GFlops/sec (log scale, 1 to 1,000,000) of the world's fastest computer, 1990–2005: Cray Y-MP, TMC CM-5, Intel Paragon, ASCI Blue, ASCI White, Earth Simulator, Blue Gene]
A computation that took 1 year in 1980 took 16 minutes in 1995 and 27 seconds in 2000
*(www.top500.org & Jack Dongarra)
Fastest Computer*
[Chart: GFlops/sec (log scale, 1 to 10,000), 1990–2000: Cray Y-MP, TMC CM-2, TMC CM-5, Intel Paragon, ASCI Blue, ASCI White]
*(www.top500.org & Jack Dongarra)
Fastest Computer*
[Chart: GFlops/sec (linear scale, 0 to 300,000), 2000–2005: ASCI White, Earth Simulator, Blue Gene]
*(www.top500.org & Jack Dongarra)
Problems with Supercomputers of the 90's
Expensive
Time to delivery: ~2 years
Out of date soon
Cluster: Supercomputer of the 00's
A massively parallel supercomputer built from a network of unimpressive PCs
•Each node is off-the-shelf hardware running an off-the-shelf OS
[Diagram: PCs connected by a network]
Are Clusters Good?
+ Inexpensive
•Parallel computing for the masses
+ Easy to upgrade
•Individual components can be easily replaced
•Off-the-shelf parts, HW and SW
•Can constantly and cheaply build a faster parallel computer
- Using off-the-shelf parts
•Lag time between the latest advances and availability outside the research lab
•Using parts that are not designed for parallel systems
Currently 7 of the world's 10 fastest computers are clusters
System-level Support for Clusters
Implement the view of a single large parallel machine on top of separate machines
•A single, big, shared memory on top of n small, separate ones
•A single, big, shared disk on top of n small, separate ones
[Diagram: cluster nodes connected by a network]
Nswap: Network Swapping
Implements a view of a single, large, shared memory on top of cluster nodes' individual RAM (physical memory)
•When one cluster node needs more memory space than it has, Nswap enables it to use the idle remote RAM of other cluster node(s) to increase its “memory” space
Traditional Memory Management
The OS moves parts (pages) of running programs in/out of RAM
•RAM: limited-size, expensive, fast storage
•Disk: larger, inexpensive, slow (1,000,000 x slower) storage
•Swap: virtual memory that is really on disk; expands memory using the disk
[Diagram: processor, RAM holding Program 1 and Program 2 pages, and disk holding files and a swap area]
Network Swapping in a Cluster
Swap pages to remote idle memory in the cluster
• Network Swapping: expand memory using the RAM of other cluster nodes
[Diagram: Node 1's processor/RAM swapping pages over the network to idle RAM on Node 2 and Node 3, instead of to its local disk of files and swap]
Nswap Goals
Transparency
•Processes don't need to do anything special to use Nswap
Efficiency and Scalability
•Point-to-point model (rather than a central server)
•Don't require complete state info to make swapping decisions
Adaptability
•Adjusts to local processing needs on each node
•Grow/shrink the portion of a node's RAM used for remote swapping as its memory use changes
Nswap Architecture
Divided into two parts that run on each node:
1) Nswap client: device driver for the network swap “device”
•The OS makes swap-in & swap-out requests to it
2) Nswap server: manages part of RAM for caching remotely swapped pages (the Nswap Cache)
[Diagram: on each of Node A and Node B, an Nswap client in OS space and an Nswap server with its Nswap Cache in user space, connected by the Nswap Communication Layer over the network; a swap-out page flows from A's client to B's server]
How Pages Move Around the Cluster
Swap Out: Node A (client) sends a page to Node B (server)
Swap In: Client A fetches the page back from Server B
Migrate from server to server: when Server B is full, it migrates A's page to Server C
[Diagram: Client A ↔ Server B on swap-out/swap-in, and Server B → Server C on migration]
Complications
Simultaneous conflicting operations
•Ex: migration and swap-in for the same page
Garbage pages in the system
•When a program terminates, we need to remove its remotely swapped pages from the servers
Node failure
•Can lose remotely swapped page data
Currently, our project…
Implemented on a Linux cluster of 8 nodes connected by a switched 100 Mb/sec Ethernet network
•All nodes have a faster disk than network:
•Disk: up to 500 Mb/sec
•Network: up to 100 Mb/sec
-> We expect Nswap to be slower than swapping to disk
Experiments
Workload 1: sequential reads & writes to a large chunk of memory
•Best case for swapping to disk
Workload 2: random reads & writes to memory
•Disk arm seeks within the swap partition
Workload 3: Workload 1 + file I/O
•Disk arm seeks between the swap and file partitions
Workload 4: Workload 2 + file I/O
Workload Execution Times
•Nswap is faster than swapping to the (much faster) disk for workloads 2, 3, and 4
[Bar chart: execution times (0–800 seconds) for WL1–WL4, comparing Disk (500 Mb/sec) vs. Nswap (100 Mb/sec)]
Nswap on Faster Networks
Measured on disk, 10 Mb/s, and 100 Mb/s; calculated speed-up values for 1,000 & 10,000 Mb/s (speed-ups in parentheses are relative to 10 Mb/s)

| Workload | Disk | 10 Mb/s | 100 Mb/s | 1,000 Mb/s | 10,000 Mb/s |
|---|---|---|---|---|---|
| (1) | 12.27 | 306.69 | 56.8 (speedup 5.4) | 28.9 (10.6) | 26.3 (11.6) |
| (2) | 266.79 | 847.74 | 153.5 (5.5) | 77.3 (10.9) | 70.3 (12.1) |
| (4) | 6265.39 | 9605.91 | 1733.9 (5.54) | 866.2 (11.1) | 786.7 (12.2) |
Conclusions
Nswap: scalable, adaptable, transparent network swapping for Linux clusters
Results show Nswap is:
•comparable to swapping to disk on a slow network
•much faster than disk on faster networks
•Based on network vs. disk speed trends, Nswap will be even better in the future
Acknowledgements
Students: Sean Finney '03, Matti Klock '03, Kuzman Ganchev '03, Gabe Rosenkoetter '02, Michael Spiegel '03, Rafael Hinojosa '01
Michener Fellowship for Second Semester Leave Support
More information, results, details: EuroPar'03 paper, CCSCNE'03 poster, http://www.cs.swarthmore.edu/~newhall/