TACC Lonestar Cluster Upgrade to 300 Teraflops

Transcript of TACC Lonestar Cluster Upgrade to 300 Teraflops

Page 1: TACC Lonestar Cluster Upgrade to 300 Teraflops

Tommy Minyard
SC10, November 16, 2010

Page 2: TACC Mission & Strategy

The mission of the Texas Advanced Computing Center is to enable discoveries that advance science and society through the application of advanced computing technologies.

To accomplish this mission, TACC:

– Evaluates, acquires & operates advanced computing systems

– Provides training, consulting, and documentation to users

– Collaborates with researchers to apply advanced computing techniques

– Conducts research & development to produce new computational technologies

(Diagram labels: Resources & Services; Research & Development)

Page 3: TACC Staff Expertise

• Operating as an Advanced Computing Center since 1986

• More than 80 employees at TACC
– 20 Ph.D.-level research staff
– Graduate and undergraduate students

• Currently support thousands of users on production systems

Page 4: TACC Resources are Comprehensive and Balanced

• HPC systems to enable larger simulations and analyses with faster turnaround times

• Scientific visualization resources to enable large data analysis and knowledge discovery

• Data & information systems to store large datasets from simulations, analyses, digital collections, instruments, and sensors

• Distributed/grid computing servers & software to integrate all resources into computational grids

• Network equipment for high-bandwidth data movements and transfers between systems

Page 5: TACC’s Migration Towards HPC Clusters

• 1986: TACC founded
– Historically had large Cray systems

• 2000: First experimental cluster
– 16 AMD workstations

• 2001: First production clusters
– 64-processor Pentium III Linux cluster
– 20-processor Itanium Linux cluster

• 2003: First terascale cluster, Lonestar
– 1,028-processor Dell Xeon Linux cluster

• 2006: Largest US academic cluster deployed
– 5,840-core 64-bit Xeon Linux cluster

Page 6: Current Dell Production Systems

• Lonestar – 1,460-node, dual-core, InfiniBand HPC production system, 62 Teraflops

• Longhorn – 256-node, quad-core Nehalem visualization and GPGPU computing cluster

• Colt – 10-node high-end visualization system with 3x3 tiled wall display

• Stallion – 23-node, large-scale visualization system with 15x5 tiled wall display (more than 300M pixels; see the arithmetic below)

• Discovery – 90-node benchmark system with a variety of processors, InfiniBand DDR & QDR
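
The Stallion pixel count follows directly from the tile layout; assuming 2560x1600 panels (an assumption about the exact displays), the arithmetic works out:

  15 x 5 = 75 tiles
  75 x 2560 x 1600 ≈ 307 million pixels, consistent with "more than 300M pixels"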

Page 7: TACC Lonestar System

Dell dual-core 64-bit Xeon Linux cluster
5,840 CPU cores (62.1 Tflops)
10+ TB memory, 100+ TB disk
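
As a quick sanity check, the quoted peak matches the dual-core Xeon parts of that era, assuming 2.66 GHz clocks and 4 floating-point operations per core per cycle (assumptions about the exact SKU):

  5,840 cores x 2.66 GHz x 4 flops/cycle ≈ 62.1 Tflops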

Page 8: Galerkin wave propagation

• Lucas Wilcox, Institute for Computational Engineering and Sciences, UT-Austin

• Seismic wave propagation, PI Omar Ghattas
– Part of research recently on the cover of Science
– Finalist for the Gordon Bell Prize at SC10

Pages 9–10: (image-only slides)

Page 11: Molecular Dynamics

• David LeBard, Institute for Computational Molecular Science, Temple University

• Pretty Fast Analysis: A software suite for analyzing large scale simulations on supercomputers and GPU clusters

• Presented to the American Chemical Society, August 2010

Page 12: PFA example: E(r) around lysozyme

E_OH(r) = r̂_OH · E(r), calculating the distribution P(E_OH)

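
The quantity on this slide is the Coulomb field of the system's partial charges projected onto a water OH bond direction, accumulated into a histogram P(E_OH) over molecules and frames. Below is a minimal sketch of that core step, assuming simple point charges and Gaussian units; the Charge type and the efield/e_oh helpers are hypothetical illustrations, not PFA's actual API:

    #include <stdio.h>
    #include <math.h>

    /* A point partial charge at position (x, y, z); illustrative only. */
    typedef struct { double x, y, z, q; } Charge;

    /* Coulomb field E(r) from n point charges (Gaussian units, k = 1). */
    static void efield(const Charge *c, int n, const double r[3], double E[3])
    {
        E[0] = E[1] = E[2] = 0.0;
        for (int i = 0; i < n; i++) {
            double d[3] = { r[0] - c[i].x, r[1] - c[i].y, r[2] - c[i].z };
            double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
            double w  = c[i].q / (r2 * sqrt(r2));   /* q / |d|^3 */
            for (int k = 0; k < 3; k++)
                E[k] += w * d[k];
        }
    }

    /* E_OH = unit(r_OH) . E(r): the field projected on the OH bond direction. */
    static double e_oh(const double rOH[3], const double E[3])
    {
        double len = sqrt(rOH[0]*rOH[0] + rOH[1]*rOH[1] + rOH[2]*rOH[2]);
        return (rOH[0]*E[0] + rOH[1]*E[1] + rOH[2]*E[2]) / len;
    }

    int main(void)
    {
        Charge charges[] = { { 0.0, 0.0, 0.0, -1.0 }, { 3.0, 0.0, 0.0, 0.5 } };
        double rO[3]  = { 5.0, 0.0, 0.0 };   /* probe point: water oxygen */
        double rOH[3] = { 1.0, 0.0, 0.0 };   /* OH bond vector */
        double E[3];

        efield(charges, 2, rO, E);
        printf("E_OH = %g\n", e_oh(rOH, E)); /* one sample for P(E_OH) */
        return 0;
    }

In the full analysis this dot product is evaluated for every water molecule in every frame, an embarrassingly parallel workload, which is what makes it map well onto supercomputers and GPU clusters.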

Page 13: Lonestar Upgrade

• Current Lonestar already has 4+ years of operation

• Needed replacement to support UT and TeraGrid users, along with several other large projects

• Submitted proposal to NSF with matching UT funds, along with funds from UT-ICES, Texas A&M, and Texas Tech

Page 14: New Lonestar Summary

• Compute power – 301.7 Teraflops (arithmetic check below)
– 1,888 Dell M610 two-socket blades
– Intel X5680 3.33 GHz six-core “Westmere” processors
– 22,656 total processing cores

• Memory – 44 Terabytes
– 2 GB/core, 24 GB/node
– 132 TB/s aggregate memory bandwidth

• Disk subsystem – 1.2 Petabytes
– Two DDN SFA10000 controllers, 300 2TB drives each
– ~20 GB/sec total aggregate I/O bandwidth
– 2 MDS, 16 OSS nodes

• Interconnect – InfiniBand QDR
– Four Mellanox 648-port InfiniBand switches
– Full non-blocking fabric
– Mellanox ConnectX-2 InfiniBand cards
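
The headline figures are internally consistent, assuming Westmere's 4 double-precision flops per core per cycle (stated here as an assumption about the microarchitecture):

  1,888 blades x 2 sockets x 6 cores = 22,656 cores
  22,656 cores x 3.33 GHz x 4 flops/cycle ≈ 301.8 Tflops (quoted as 301.7 Teraflops)
  1,888 nodes x 24 GB/node ≈ 45 TB (quoted as 44 Terabytes)
  2 controllers x 300 drives x 2 TB = 1.2 Petabytes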

Page 15: System design challenges

• Limited by power and cooling

• X5680 processor: 130 Watts per socket!
– M1000e chassis fully populated: ~7 kW of power

• Three M1000e chassis per rack – 21 kW per rack
– Six 208V, 30-amp circuits per rack (headroom arithmetic below)

• Forty total compute racks, four switch racks
– Planning mix of underfloor and overhead cabling
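
The per-rack provisioning checks out, assuming the usual 80% continuous-load derating on each branch circuit (an assumption; local electrical code and practice vary):

  3 chassis x ~7 kW ≈ 21 kW of demand per rack
  6 circuits x 208 V x 30 A x 0.8 ≈ 30 kW of usable supply per rack

leaving roughly 9 kW of headroom per rack.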

Pages 16–17: (image-only slides)

Page 18: Software Stack

• Reevaluating current cluster management kits and resource managers/schedulers
– Platform PCM/LSF
– Univa UD
– Bright Cluster Manager
– SLURM, MOAB, PBS, Torque

• Current plan (a sanity-check MPI example follows below):
– TACC custom cluster install and administration scripts
– SGE 6.2U5
– Lustre 1.8.4
– Intel Compilers
– MPI Libraries: MVAPICH, MVAPICH2, OpenMPI
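
Since MVAPICH, MVAPICH2, and OpenMPI all implement the standard MPI interface, user codes build unchanged against any of the three. A minimal program for sanity-checking whichever stack is loaded (the generic mpicc/mpirun workflow is assumed; site-specific compiler wrappers and SGE submission details differ):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
        MPI_Get_processor_name(name, &len);    /* node this rank landed on */
        printf("rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

Seeing every rank report a distinct node, in the expected count, confirms that the scheduler, the InfiniBand fabric, and the MPI launcher agree.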

Page 19: Questions?