NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Comparison of Communication and I/O of the Cray T3E and IBM SP
Jonathan Carter
NERSC User Services
Overview
• Node Characteristics
• Interconnect Characteristics
• MPI Performance
• I/O Configuration
• I/O Performance
T3E Architecture
• Distributed memory, single CPU processing elements
[Diagram: processing elements, each with a CPU and Memory, attached to the Interconnect]
T3E Communication Network
• Processing Elements (PE) are connected by a 3D torus.
T3E Communication Network
• The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional
• Sustainable bandwidth is about 480 Mbytes/sec bidirectional
• Latency is 1μs
• Shmem API gives latency of 1μs, bandwidth 350 Mbyte/sec bidirectional
SP Architecture
• Cluster of SMP nodes
[Diagram: SMP nodes, each with multiple CPUs sharing Memory, attached to the Interconnect]
SP Communication Network
• Nodes are connected via adapters to the SP Switch. The switch is composed of boards, each of which links 16 nodes. Boards are linked to form a larger network.
[Diagram: Nodes attached to a Switch Board]
SP Communication Network
• The peak bandwidth of adapter and switch is 300 Mbyte/sec bidirectional
• Latency of the switch is about 2μs
• Sustainable bandwidth is about 185 Mbytes/sec bidirectional
MPI Performance
| | T3E | SP (intra-node) | SP (inter-node) |
|---|---|---|---|
| Latency (μs) | 12 | 10 | 22 |
| Bandwidth (Mbyte/s) | 270 | 300 | 150 |
Intra-node is 1 MPI process per node; 2 MPI processes (typical) will halve bandwidth.
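One rough way to read the latency and bandwidth figures above is the usual "latency plus size over bandwidth" estimate. The sketch below is only an illustration: the linear model and the function name est_transfer_us are assumptions, not part of the measurements, and a Mbyte is taken as 10^6 bytes.

```c
#include <assert.h>

/* Estimated point-to-point transfer time in microseconds under the simple
 * model t = latency + size / bandwidth (an assumed model, not a measurement).
 * latency_us is in microseconds; bandwidth is in Mbyte/s, taken here as
 * 10^6 bytes/s, so 1 Mbyte/s is 1 byte per microsecond. */
double est_transfer_us(double latency_us, double bandwidth_mbyte_s, double nbytes)
{
    assert(bandwidth_mbyte_s > 0.0 && nbytes >= 0.0);
    return latency_us + nbytes / bandwidth_mbyte_s;
}
```

With the SP inter-node figures, est_transfer_us(22.0, 150.0, 1500.0) gives 32: 22 μs of latency plus 10 μs of transfer, so under this model latency dominates for small messages.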
MPI Performance
[Figure: MPI_Reduce (sum), time (μs, axis 0 to 4000) versus number of processes (16, 32, 64, 128); series: T3E 256 bytes, SP 256 bytes, T3E 1024 bytes, SP 1024 bytes]
MPI Performance
[Figure: MPI_Bcast, time (μs, axis 0 to 700) versus number of processes (16, 32, 64, 128); series: T3E 256 bytes, SP 256 bytes, T3E 1024 bytes, SP 1024 bytes]
T3E I/O Configuration
• PEs do not have local disk
• All PEs access all filesystems equivalently
• Path for (optimum) I/O generally looks like:
– PE to I/O node via torus
– I/O node to Fibre Channel Node (FCN) via Gigaring
– FCN to Disk Array via Fibre loop
• In some cases data on an APP PE must be transferred to a system buffer on an OS PE, then out to an FCN
T3E I/O Configuration
[Diagram: I/O nodes and FCNs on the Gigaring, with FCNs connected to Disk Arrays]
SP I/O Configuration
• Nodes have local disk. One SCSI disk for all local filesystems. Non-optimal.
• All nodes access Global Parallel File System (GPFS) filesystems equivalently
• Path for GPFS I/O looks like:
– Node to GPFS Node via IP over the switch
– GPFS Node to Disk Array via SSA loop
SP I/O Configuration
[Diagram: Nodes connected through the Switch to GPFS Nodes, which attach to the Disk Array]
T3E Filesystems
• /usr/tmp
– fast
– subject to 14 day purge, not backed up
– check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
• $TMPDIR
– fast
– purged at end of job or session
– shares quota with /usr/tmp
• $HOME
– slower
– permanent, backed up
– check quota with quota (usually 2 Gbyte and 3500 inodes)
SP Filesystems
• /scratch and $SCRATCH
– global
– fast (GPFS)
– subject to 14 day purge (or at session end for $SCRATCH), not backed up
– check quota with myquota (usually 100 Gbyte and 6000 inodes)
• $TMPDIR
– local (created in /scr) - only 2 Gbyte total
– slower
– purged at end of job or session
• $HOME
– global
– slower (GPFS)
– permanent, not backed up yet
– check quota with myquota (usually 4 Gbyte and 5000 inodes)
Types of I/O
• Bewildering number of choices on both machines:
– Standard Language I/O: Fortran or C (ANSI or POSIX)
– Vendor extensions to language I/O
– MPI I/O
– Cray FFIO library (can be used from Fortran or C)
– IBM MIO library, requires code changes
Standard Language I/O
• Fortran direct access is slightly more efficient than sequential access both on the T3E (see comments on FFIO later) and the SP. It also allows file transferability.
• C language I/O (fopen, fwrite, etc.) is inefficient on both machines.
• POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see comments on FFIO later). Works well on the SP.
Vendor Extensions to Language I/O
• Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.
• IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.
MPI I/O
• Part of MPI-2
• Interface for High Performance Parallel I/O
– data partitioning
– collective I/O
– asynchronous I/O
– portability and interoperability between T3E and SP
• Different subset implemented on T3E and SP
Summary of access routines for T3E
| Positioning | Synchronism | Coordination: Non-collective | Coordination: Collective |
|---|---|---|---|
| Explicit | Blocking | READ_AT | READ_AT_ALL |
| Explicit | Non-blocking | IREAD_AT, WAIT | READ_AT_ALL_BEGIN, READ_AT_ALL_END |
| Individual | Blocking | READ | READ_ALL |
| Individual | Non-blocking | IREAD, WAIT | READ_ALL_BEGIN, READ_ALL_END |
| Shared | Blocking | READ_SHARED | READ_ORDERED |
| Shared | Non-blocking | IREAD_SHARED, WAIT | READ_ORDERED_BEGIN, READ_ORDERED_END |
Summary of access routines for SP
| Positioning | Synchronism | Coordination: Non-collective | Coordination: Collective |
|---|---|---|---|
| Explicit | Blocking | READ_AT | READ_AT_ALL |
| Explicit | Non-blocking | IREAD_AT, WAIT | READ_AT_ALL_BEGIN, READ_AT_ALL_END |
| Individual | Blocking | READ | READ_ALL |
| Individual | Non-blocking | IREAD, WAIT | READ_ALL_BEGIN, READ_ALL_END |
| Shared | Blocking | READ_SHARED | READ_ORDERED |
| Shared | Non-blocking | IREAD_SHARED, WAIT | READ_ORDERED_BEGIN, READ_ORDERED_END |
Cray FFIO library
• FFIO is a set of I/O layers tuned for different I/O characteristics
• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available for C through POSIX-like calls, e.g. ffopen, ffwrite
FFIO - The assign command
• controls program behavior at runtime
• the assign command controls
– which FFIO layer is active
– striping across multiple partitions
– lots more
• scope of assign
– File name
– Fortran unit number
– File type (e.g. all sequential unformatted files)
IBM MIO library
• User interface based on POSIX I/O routines, so requires program modification
• Useful trace module to collect statistics
• Not much experience with using on GPFS filesystem
• Coming soon
I/O Strategies - Exclusive access files
• Each process reads and writes to a separate file
– Language I/O
  • Increase language I/O performance with the FFIO library on the T3E (for example, specify a large buffer with the bufa layer). For Fortran direct access, the default buffer is only the larger of the record length and 32 Kbytes
  • Read/write large amounts of data per request on the SP
– MPI I/O
  • Read/write large amounts of data per request
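The one-file-per-process strategy with large requests can be sketched with POSIX I/O as follows. This is a minimal illustration, not code from the talk: the helper names write_all and write_private_file and the file name pattern out.<rank> are hypothetical, and the rank is simply passed in as an integer.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer to fd, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Each process writes its own file, handing the data to the system in one
 * large request. The name "out.<rank>" is a hypothetical naming scheme. */
int write_private_file(int rank, const void *data, size_t len)
{
    char name[64];
    assert(rank >= 0);
    snprintf(name, sizeof name, "out.%d", rank);
    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = write_all(fd, data, len);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The point of the sketch is that the application assembles its data in memory first and issues few, large writes rather than many small ones.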
bufa FFIO layer Overview
• bufa is an asynchronous buffering layer
• performs read-ahead, write-behind
• specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers
• buffer space increases your application's memory requirements
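The memory cost of a given bufa specification follows directly from the units described above. The helper below is a small sketch of that arithmetic; the names bufa_buffer_bytes and FFIO_BLOCK_BYTES are hypothetical, and it assumes the buffer space is simply nbufs buffers of bs blocks each.

```c
#include <assert.h>
#include <stddef.h>

#define FFIO_BLOCK_BYTES 4096u  /* one 4 Kbyte block, the unit of bs */

/* Buffer memory in bytes implied by "-F bufa:bs:nbufs", assuming it is
 * nbufs buffers of bs blocks of 4 Kbytes each. */
size_t bufa_buffer_bytes(size_t bs, size_t nbufs)
{
    assert(bs > 0 && nbufs > 0);
    return bs * FFIO_BLOCK_BYTES * nbufs;
}
```

For instance, bufa:256:4 would correspond to four 1 Mbyte buffers, about 4 Mbytes of additional memory under this assumption.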
I/O Strategies - Shared files
• All PEs read and write the same file simultaneously
– Language I/O (requires FFIO library global layer for T3E)
– MPI I/O
– On T3E, language I/O with FFIO library global layer and Cray extensions for additional flexibility
Positioning with a shared file
• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
– Use a direct access file, and read/write(rec=num)
– Use Cray T3E extensions setpos and getpos to position file pointer (not portable)
• C
– Use ffseek
• MPI I/O
– MPI I/O fileview generally takes care of this. Positioning routines also available.
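The record-based positioning used by Fortran direct access can be mirrored in C with explicit offsets. The sketch below uses POSIX pwrite as a portable stand-in rather than the Cray ffseek interface mentioned above; the function names record_offset and write_record are hypothetical, and fixed-length records are assumed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Byte offset of 1-based record `rec` in a file of fixed-length records,
 * mirroring Fortran direct access read/write(rec=num). */
off_t record_offset(long rec, size_t reclen)
{
    assert(rec >= 1);
    return (off_t)(rec - 1) * (off_t)reclen;
}

/* Write one record at an explicit position. pwrite() neither uses nor moves
 * the file pointer, so each process is responsible for its own positioning.
 * Returns 0 on success, -1 on error or short write. */
int write_record(int fd, long rec, const void *buf, size_t reclen)
{
    ssize_t n = pwrite(fd, buf, reclen, record_offset(rec, reclen));
    return (n == (ssize_t)reclen) ? 0 : -1;
}
```

A process that owns records r1..r2 of the shared file would call write_record once per record, or write a contiguous run of records in a single larger request at record_offset(r1, reclen).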
global FFIO layer Overview
• global is a caching and buffering layer which enables multiple PEs to read and write to the same file
• if one PE has already read the data, an additional read request from another PE will result in a remote memory copy
• file open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers per PE
GPFS and shared files
• On the T3E the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
• On the SP, GPFS implements a safe update scheme via tokens and a token manager.
– If two processes access the same block of a GPFS file (256 Kbytes), a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably.
– MPI I/O merges requests from different processes to alleviate this problem
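When positioning is done by hand, one way to sidestep the token negotiation is to make each process's region of the shared file start and end on block boundaries. The sketch below shows the arithmetic only; the function names are hypothetical, and the 256 Kbyte block size is the figure quoted above, which may differ on other GPFS configurations.

```c
#include <assert.h>
#include <stdint.h>

#define GPFS_BLOCK_BYTES (256u * 1024u)  /* block size quoted above */

/* Index of the block containing byte `offset`. */
uint64_t gpfs_block_of(uint64_t offset)
{
    return offset / GPFS_BLOCK_BYTES;
}

/* Nonzero if byte ranges [a_off, a_off + a_len) and [b_off, b_off + b_len)
 * touch at least one common block. Both lengths must be greater than zero. */
int ranges_share_block(uint64_t a_off, uint64_t a_len,
                       uint64_t b_off, uint64_t b_len)
{
    assert(a_len > 0 && b_len > 0);
    uint64_t a_first = gpfs_block_of(a_off);
    uint64_t a_last  = gpfs_block_of(a_off + a_len - 1);
    uint64_t b_first = gpfs_block_of(b_off);
    uint64_t b_last  = gpfs_block_of(b_off + b_len - 1);
    return a_first <= b_last && b_first <= a_last;
}

/* Round a per-process chunk size up to a whole number of blocks, so that
 * process p can own [p * chunk, (p + 1) * chunk) without sharing a block. */
uint64_t round_up_to_block(uint64_t nbytes)
{
    return (nbytes + GPFS_BLOCK_BYTES - 1) / GPFS_BLOCK_BYTES * GPFS_BLOCK_BYTES;
}
```

Two adjacent 100000-byte regions both fall in block 0 and would contend, whereas regions padded to round_up_to_block(100000) = 262144 bytes each occupy their own block.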
I/O Performance Comparison
• Each process writes a 200 Mbyte file. 2 processes per node on SP.
[Figure: I/O rate in Mbyte/sec (axis 0 to 1200) versus number of processes (16, 32, 64); series: T3E Write, T3E Read, SP Write, SP Read]
Further Information
• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray Publication - Application Programmer’s I/O Guide
• Cray Publication - Cray T3E Fortran Optimization Guide
• man assign
• XL Fortran User’s Guide