TACC Lonestar Cluster Upgrade to 300 Teraflops
Tommy Minyard
SC10
November 16, 2010
TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable discoveries that advance science and society through the application of advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates advanced computing systems
– Provides training, consulting, and documentation to users
– Collaborates with researchers to apply advanced computing techniques
– Conducts research & development to produce new computational technologies
Resources & Services
Research & Development
TACC Staff Expertise
• Operating as an Advanced Computing Center since 1986
• More than 80 employees at TACC
– 20 Ph.D.-level research staff
– Graduate and undergraduate students
• Currently support thousands of users on production systems
TACC Resources are Comprehensive and Balanced
• HPC systems to enable larger simulations and analyses with faster turnaround times
• Scientific visualization resources to enable large data analysis and knowledge discovery
• Data & information systems to store large datasets from simulations, analyses, digital collections, instruments, and sensors
• Distributed/grid computing servers & software to integrate all resources into computational grids
• Network equipment for high-bandwidth data movements and transfers between systems
TACC’s Migration Towards HPC Clusters
• 1986: TACC founded
– Historically operated large Cray systems
• 2000: First experimental cluster
– 16 AMD workstations
• 2001: First production clusters
– 64-processor Pentium III Linux cluster
– 20-processor Itanium Linux cluster
• 2003: First terascale cluster, Lonestar
– 1,028-processor Dell Xeon Linux cluster
• 2006: Largest US academic cluster deployed
– 5,840-core 64-bit Xeon Linux cluster
Current Dell Production Systems
• Lonestar – 1,460-node, dual-core InfiniBand HPC production system, 62 Teraflops
• Longhorn – 256-node, quad-core Nehalem visualization and GPGPU computing cluster
• Colt – 10-node high-end visualization system with 3x3 tiled-wall display
• Stallion – 23-node, large-scale vis system with 15x5 tiled-wall display (more than 300M pixels)
• Discovery – 90-node benchmark system with a variety of processors, InfiniBand DDR & QDR
TACC Lonestar System
Dell dual-core 64-bit Xeon Linux cluster
5,840 CPU cores (62.1 Tflops)
10+ TB memory, 100+ TB disk
Galerkin wave propagation
• Lucas Wilcox, Institute for Computational Engineering and Sciences, UT-Austin
• Seismic wave propagation, PI Omar Ghattas; part of research recently featured on the cover of Science and a finalist for the Gordon Bell Prize at SC10
Molecular Dynamics
• David LeBard, Institute for Computational Molecular Science, Temple University
• Pretty Fast Analysis: A software suite for analyzing large scale simulations on supercomputers and GPU clusters
• Presented at the American Chemical Society, August 2010
PFA example: E(r) around lysozyme
[Figure: electric field E(r) mapped around lysozyme; E_OH(r) = r_OH · E(r), used to calculate the distribution P(E_OH)]
Lonestar Upgrade
• Current Lonestar has been in operation for 4+ years
• Replacement needed to support UT and TeraGrid users along with several other large projects
• Submitted proposal to NSF with matching UT funds, along with funds from UT-ICES, Texas A&M and Texas Tech
New Lonestar Summary
• Compute power – 301.7 Teraflops
– 1,888 Dell M610 two-socket blades
– Intel X5680 3.33 GHz six-core “Westmere” processors
– 22,656 total processing cores
• Memory – 44 Terabytes
– 2 GB/core, 24 GB/node
– 132 TB/s aggregate memory bandwidth
• Disk subsystem – 1.2 Petabytes
– Two DDN SFA10000 controllers, 300 2 TB drives each
– ~20 GB/sec total aggregate I/O bandwidth
– 2 MDS, 16 OSS nodes
• Interconnect – InfiniBand QDR
– Four Mellanox 648-port InfiniBand switches
– Full non-blocking fabric
– Mellanox ConnectX-2 InfiniBand cards
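The headline numbers above can be cross-checked with simple arithmetic. One assumption not stated on the slides: a Westmere core retires 4 double-precision flops per cycle (128-bit SSE, 2 adds + 2 multiplies), which is how the peak-Tflops figure is conventionally computed.

```python
# Back-of-the-envelope check of the New Lonestar summary numbers.
# Assumption (not on the slide): 4 DP flops/cycle/core on Westmere.
FLOPS_PER_CYCLE = 4

cores = 22_656                    # 1,888 blades x 2 sockets x 6 cores
clock_hz = 3.33e9                 # Intel X5680
peak_tflops = cores * clock_hz * FLOPS_PER_CYCLE / 1e12
print(f"peak: {peak_tflops:.2f} Tflops")   # close to the slide's 301.7

nodes = 1_888
mem_tb = nodes * 24 / 1024        # 24 GB/node, in binary terabytes
print(f"memory: {mem_tb:.0f} TB")          # ~44 TB

disk_pb = 2 * 300 * 2 / 1000      # two controllers x 300 drives x 2 TB each
print(f"disk: {disk_pb:.1f} PB")           # 1.2 PB
```

Each figure lands on the value quoted in the summary, so the slide's totals are internally consistent.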
System design challenges
• Limited by power and cooling
• X5680 processor: 130 Watts per socket!
– Fully populated M1000e chassis draws ~7 kW of power
• Three M1000e chassis per rack – 21 kW per rack
– Six 208V, 30-amp circuits per rack
• Forty total compute racks, four switch racks
– Planning mix of underfloor and overhead cabling
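The per-rack electrical budget implied by those bullets can be sketched as follows. The 80% continuous-load derating is my assumption (standard US branch-circuit practice), not something stated in the deck.

```python
# Rack power budget sketch for the figures on the slide.
# Assumption: circuits are derated to 80% for continuous load (NEC rule).
chassis_kw = 7.0                  # fully populated M1000e, per the slide
rack_kw = 3 * chassis_kw          # three chassis per rack
print(f"rack load: {rack_kw:.0f} kW")        # 21 kW, as quoted

circuits, volts, amps = 6, 208, 30
capacity_kva = circuits * volts * amps / 1000
derated_kva = capacity_kva * 0.8
print(f"circuit capacity: {capacity_kva:.1f} kVA "
      f"({derated_kva:.1f} kVA at 80% derating)")
```

Six 208 V, 30 A circuits give roughly 37 kVA raw (about 30 kVA derated), which leaves headroom above the 21 kW compute load for fans, switches, and power-supply losses.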
Software Stack
• Reevaluating current cluster management kits and resource managers/schedulers
– Platform PCM/LSF
– Univa UD
– Bright Cluster Manager
– SLURM, MOAB, PBS, Torque
• Current plan:
– TACC custom cluster install and administration scripts
– SGE 6.2U5
– Lustre 1.8.4
– Intel Compilers
– MPI libraries: MVAPICH, MVAPICH2, OpenMPI
Questions?