T2K...648 node (quad-core x 4 socket / node) Opteron “Barcelona” B8000 CPU 2.3 GHz x 4 x 4 core...


Page 1:

CCS HPC T2K


Page 2:

T2K

• 20 6

• 95 TFLOPS, 800 TB

“T2K Open Supercomputer Alliance”

– T2K (Tokyo): 140 TFLOPS
– T2K (Kyoto): 61 TFLOPS + SMP

Page 3:

T2K

Page 4:

648 nodes (quad-core x 4 sockets / node)
Opteron “Barcelona” B8000 CPU
2.3 GHz x 4 flop/cycle x 4 cores x 4 sockets = 147.2 GFLOPS / node
147.2 GFLOPS x 648 nodes = 95.3 TFLOPS / system
20.7 TB memory / system

800 TB (1 PB physical), RAID-6
Lustre cluster file system
InfiniBand x 2
Meta-data server, file server

fault tolerance

70 racks
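The peak figures on this slide can be rechecked with plain shell arithmetic. The sketch below assumes that the first "x 4" in the formula is the number of double-precision floating-point operations per core per cycle, and that each node carries 32 GB of memory (4 sockets x 4 DIMMs x 2 GB, as the node diagrams later in the deck suggest). The variable and function names are illustrative only.

```shell
#!/bin/sh
# Peak-performance arithmetic for one node and for the whole system.
# Values are kept in integer MFLOPS so shell arithmetic stays exact.

CLOCK_MHZ=2300        # 2.3 GHz Opteron "Barcelona"
FLOP_PER_CYCLE=4      # assumption: meaning of the first "x 4" on the slide
CORES_PER_SOCKET=4
SOCKETS_PER_NODE=4
NODES=648
MEM_PER_NODE_GB=32    # assumption: 4 sockets x 4 DIMMs x 2 GB

# Peak of one node in MFLOPS.
node_peak_mflops() {
    echo $(( CLOCK_MHZ * FLOP_PER_CYCLE * CORES_PER_SOCKET * SOCKETS_PER_NODE ))
}

# Peak of all nodes together in MFLOPS.
system_peak_mflops() {
    echo $(( $(node_peak_mflops) * NODES ))
}

# Total memory of all nodes in GB.
system_memory_gb() {
    echo $(( MEM_PER_NODE_GB * NODES ))
}

echo "node peak   : $(node_peak_mflops) MFLOPS"
echo "system peak : $(system_peak_mflops) MFLOPS"
echo "memory      : $(system_memory_gb) GB"
```

The script reports 147200 MFLOPS per node, 95385600 MFLOPS for the system and 20736 GB of memory, which correspond to the 147.2 GFLOPS, 95.3 TFLOPS and 20.7 TB quoted on the slide.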

Page 5:

• Multi-core & multi-socket
– AMD quad-core Opteron 8000 (Barcelona)
– 4 sockets / node: 147 GFLOPS/node
– Opteron NUMA: 40 GB/s/node

• Multi-rail
– 4X DDR InfiniBand (Mellanox ConnectX) x 4 / node
– Quad-rail InfiniBand: 8 GB/s/node

• MVAPICH (modified by Appro), multi-rail configuration
– MPI rail, multi-rail
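The 8 GB/s/node figure for the quad-rail network can be reproduced from the 4X DDR link rate. This is a sketch under the usual assumption that a 4X DDR InfiniBand link signals at 20 Gb/s and uses 8b/10b encoding, so 8 of every 10 signalled bits carry data; the result is a per-direction peak, not a measured value. Function names are illustrative.

```shell
#!/bin/sh
# Per-direction peak bandwidth of the quad-rail InfiniBand network.

RAILS=4
SIGNAL_GBIT=20   # 4X DDR signalling rate per rail, in Gb/s

# Data rate of one rail in Gb/s after 8b/10b encoding.
rail_data_gbit() {
    echo $(( SIGNAL_GBIT * 8 / 10 ))
}

# Aggregate data rate of all rails in GB/s (8 bits per byte).
node_bandwidth_gbyte() {
    echo $(( $(rail_data_gbit) * RAILS / 8 ))
}

echo "per rail : $(rail_data_gbit) Gb/s"
echo "per node : $(node_bandwidth_gbyte) GB/s"
```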

Page 6:

InfiniBand ConnectX x 4 (PCI-Express x8 lanes)

Page 7:

T2K

• Software
– OS: Red Hat Enterprise Linux v.5 WS (Linux kernel 2.6)
– Languages: F90, C, C++, Java
– Compilers: PGI (Portland Group), Intel
– MPI: MVAPICH (Appro)
– Libraries: IMSL, ACML, ScaLAPACK
– Performance tools: PGPROF, PAPI

• Programming models
– OpenMP
– MPI

Page 8:

• – 16 16 –

• OpenMP
– 16 16
– OpenMP directive

• MPI
– MPI (Message Passing Interface)

• – or OpenMP MPI

Page 9:

• OpenMP

– CPU

• L1 L2

• L3

• 16

– 4MPI

multi-thread

multi-thread

MPI

Page 10:

[Figure: node block diagram. Labels: Dual Channel Reg DDR2; four banks of 2 GB 667 MHz DDR2 DIMM x 4; HyperTransport, 8 GB/s (full-duplex); NVIDIA nForce 3600 bridge and NVIDIA nForce 3050 bridge (I/O hub) with PCI-Express x16 / x8 / x4, PCI-X, SAS and USB; links (A)1, (A)2, (B)1, (B)2 at 4 GB/s (full-duplex); two sets of Mellanox MHGH28-XTC ConnectX HCA x 2 (1.2 µs MPI latency, 4X DDR 20 Gb/s).]

Page 11:

MPI

• Multi-rail

– 4 rails / 1 MPI process (x 16 threads)
– 4 rails / 4 MPI processes (x 4 threads)
– 1 rail / 1 MPI process (x 4 threads)
– 4 rails / 16 MPI processes (x 1 thread)

• Multi-rail InfiniBand
– fail-over
– InfiniBand Subnet Management
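The process and thread combinations listed above all fill the 16 cores of a node. A minimal sketch of that bookkeeping follows, assuming the thread count is simply 16 divided by the number of MPI processes per node; `threads_per_process` is a hypothetical helper, and `OMP_NUM_THREADS` is the standard OpenMP environment variable that such a value would normally be exported to.

```shell
#!/bin/sh
# Threads per MPI process when a 16-core node is split evenly.

CORES_PER_NODE=16

# Print the thread count for a given number of MPI processes per node.
# Only process counts that divide 16 evenly are accepted.
threads_per_process() {
    procs=$1
    case $procs in
        1|2|4|8|16)
            echo $(( CORES_PER_NODE / procs ))
            ;;
        *)
            echo "processes per node must divide $CORES_PER_NODE" >&2
            return 1
            ;;
    esac
}

# The three layouts named on the slide.
for p in 1 4 16; do
    echo "$p MPI process(es) per node: OMP_NUM_THREADS=$(threads_per_process "$p")"
done
```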

Page 12:

4 links / 1 MPI process (x 16 threads)

[Figure: node block diagram with the same labels as the diagram on Page 10.]

Page 13:

1 link / 1 MPI process (x 4 threads)

[Figure: node block diagram with the same labels as the diagram on Page 10.]

Page 14:

T2K

[Figure: CPU cores (CPU0-3, CPU4-7, CPU8-11, CPU12-15) and InfiniBand HCAs (mlx4_0, mlx4_1, mlx4_2, mlx4_3).]

Page 15:

T2K

[Figure: node layout. Sockets: CPU0-3 (sock0), CPU4-7 (sock1), CPU8-11 (sock2), CPU12-15 (sock3). Memory: Mem0, Mem1, Mem2, Mem3, 8 GB each. Bridge & PCI-e: IO55, MCP55. InfiniBand HCAs: mlx4_0, mlx4_1, mlx4_2, mlx4_3.]

Page 16:

CPU

• numactl

$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8062 MB
node 0 free: 112 MB
node 1 size: 8080 MB
node 1 free: 327 MB
node 2 size: 8080 MB
node 2 free: 274 MB
node 3 size: 8080 MB
node 3 free: 354 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
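The `numactl --hardware` listing is easy to summarise with awk. The sketch below works on a copy of the output shown on this slide so that it runs without `numactl` installed; on a real node the same filters could read from `numactl --hardware` directly. The function names are hypothetical helpers, not part of numactl.

```shell
#!/bin/sh
# Summarise per-node memory from `numactl --hardware` style output.

# Sample text copied from the slide (free-memory values are a snapshot).
sample_numactl_hardware() {
cat <<'EOF'
available: 4 nodes (0-3)
node 0 size: 8062 MB
node 0 free: 112 MB
node 1 size: 8080 MB
node 1 free: 327 MB
node 2 size: 8080 MB
node 2 free: 274 MB
node 3 size: 8080 MB
node 3 free: 354 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
EOF
}

# Print "<node> <size in MB>" for every NUMA node on stdin.
node_sizes_mb() {
    awk '$1 == "node" && $3 == "size:" { print $2, $4 }'
}

# Print the sum of all node sizes in MB.
total_memory_mb() {
    awk '$1 == "node" && $3 == "size:" { sum += $4 } END { print sum }'
}

sample_numactl_hardware | node_sizes_mb
echo "total: $(sample_numactl_hardware | total_memory_mb) MB"
```

The four node sizes add up to 32302 MB, that is the roughly 32 GB of one node split evenly across its four sockets.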

Page 17:

CPU

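• --interleave – interleave memory allocation across the given nodes (e.g. all, 0-2)

• --preferred – prefer allocating memory on the given node

• --physcpubind – run only on the given CPU numbers

• --cpunodebind – run only on the CPUs of the given nodes

• --membind – allocate memory only from the given nodes

• --localalloc – allocate memory on the node where the process is running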

usage: numactl [--interleave=nodes] [--preferred=node] [--physcpubind=cpus] [--cpunodebind=nodes] [--membind=nodes] [--localalloc] command args ...

Page 18:

MPI

• MPI

– % module load pgi mvapich2/pgi # PGI

– % module load intel mvapich2/intel # Intel

– % module load mvapich2/gnu # GCC

– mpicc, mpiCC, mpif77, mpif90

Page 19:

16 MPI processes

mpirun_rsh -np 16 numactl --localalloc ./a.out

# numactl

4 MPI processes

SOCKET=$(( $MPIRUN_RANK % 4 ))

/opt/sge/mpi/make_hostfile.sh 4 > ${JOB_ID}_host

mpirun_rsh -np 4 -hostfile ${JOB_ID}_host \
    numactl --cpunodebind=$SOCKET --localalloc ./a.out

rm ${JOB_ID}_host

# numactl
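Because `MPIRUN_RANK` differs for every process, the socket number has to be computed inside each process rather than once in the job script. One way to arrange that, sketched here as an assumption rather than as the site's actual setup, is a small wrapper that `mpirun_rsh` starts in place of the application. The wrapper name `numa_wrap.sh` and its helper functions are hypothetical, and rank modulo 4 only yields distinct sockets when the four ranks of a node are numbered consecutively.

```shell
#!/bin/sh
# numa_wrap.sh (hypothetical): bind each MPI process to one socket.
# Intended use: mpirun_rsh -np 4 -hostfile <hostfile> ./numa_wrap.sh ./a.out

SOCKETS_PER_NODE=4

# Map an MPI rank to a socket (NUMA node) number.
socket_for_rank() {
    echo $(( $1 % SOCKETS_PER_NODE ))
}

# Print the numactl command line for a rank and program (dry run).
numa_command() {
    rank=$1
    shift
    echo "numactl --cpunodebind=$(socket_for_rank "$rank") --localalloc $*"
}

# When started under mpirun_rsh, MPIRUN_RANK is set: replace this
# shell with the bound application. Otherwise do nothing.
if [ -n "${MPIRUN_RANK:-}" ] && [ "$#" -gt 0 ]; then
    exec numactl --cpunodebind="$(socket_for_rank "$MPIRUN_RANK")" \
        --localalloc "$@"
fi
```

Calling `numa_command 6 ./a.out` prints `numactl --cpunodebind=2 --localalloc ./a.out`, which makes it easy to check the mapping before submitting a job.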

Page 20:

• T2K IB4

• MVAPICH2

• MV2_NUM_HCAS

– mpirun_rsh … MV2_NUM_HCAS=4 ./a.out 1 4

• T2K-Tsukuba
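`MV2_NUM_HCAS` is passed to `mpirun_rsh` as a `NAME=VALUE` argument placed before the executable. The helper below is a small sketch that validates the requested number of HCAs before building that argument; it assumes the valid range on this system is 1 to 4, which is how the trailing "1 4" on the slide is read here. `hca_setting` is a hypothetical name.

```shell
#!/bin/sh
# Build the MV2_NUM_HCAS argument for mpirun_rsh after a range check.

MAX_HCAS=4   # assumption: four ConnectX HCAs per node, as on the earlier slides

# Print "MV2_NUM_HCAS=<n>" when n is an integer between 1 and MAX_HCAS.
hca_setting() {
    n=$1
    if [ "$n" -ge 1 ] 2>/dev/null && [ "$n" -le "$MAX_HCAS" ]; then
        echo "MV2_NUM_HCAS=$n"
    else
        echo "HCA count must be between 1 and $MAX_HCAS" >&2
        return 1
    fi
}

# Dry run: show the setting for each allowed rail count.
for rails in 1 2 3 4; do
    hca_setting "$rails"
done
```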