HPC Growing Pains
IT Lessons Learned from the Biomedical Data Deluge
John L. Wofford
Center for Computational Biology & Bioinformatics
Columbia University
What is C2B2?
• Internationally recognized biomedical computing center.
• Broad range of computational biomedical and biology research, from biophysics to genomics.
• More than 15 Labs and nearly 200 faculty, staff, students and postdocs across multiple campuses and 8 departments.
• IT staff of 8, covering everything from desktop to HPC & datacenter.
Affiliates
HPC Growing Pains
                            Before (2008)    After (2012)
Total CPU-cores             ~500             ~4500
Largest cluster: CPU        400 cores        ~4000 cores
Largest cluster: memory     800 GB           8 TB
Annual CPU-hours            2M CPU-hrs       >50M CPU-hrs
Average daily active users  20               120
Storage capacity            30 TB            ~1 PB
Data center space           800 sq. ft.      4000 sq. ft.
Over the past 3 years we have grown our HPC resources by an order of magnitude, driven largely by genomic data storage and
processing demands.
Outline
I. Intro: Biomedical data growth
II. Storage challenges
II.1. Performance
II.2. Capacity
II.3. Data integrity
III. Conclusions
IT Lessons Learned from the Biomedical Data Deluge
• A few stock facts:
• With a collective 269 petabytes of data, education was among the U.S. economy’s top 10 sectors storing the largest amount of data in 2009, according to a McKinsey Global Institute survey.
• The world will generate 1.8 zettabytes of data this year alone, according to IDC’s 2011 Digital Universe survey.
• Worldwide data volume is growing a minimum of 59% annually, a Gartner report estimates, outrunning Kryder’s law for disk capacity per cost growth.
• Biomedical data, driven primarily by gene sequencing, is growing dramatically faster than the industry average and Kryder’s law.
• ...not only does that data need to be stored, it needs to be heavily analyzed, demanding both performance and capacity.
What data deluge?
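To make the growth-rate comparison concrete, here is a minimal compounding sketch in Python. The 59% figure is the Gartner number cited above; the assumption that disk capacity per dollar doubles roughly every two years is an illustrative reading of Kryder's law, not a figure from this talk.

```python
# A minimal compounding sketch (illustrative): 59%/yr data growth vs. a
# Kryder's-law-style doubling of disk capacity per dollar every ~2 years.
data_growth = 1.59            # 59% per year (the Gartner figure above)
kryder_growth = 2 ** 0.5      # doubling every ~2 years, ~41% per year (assumption)

data = capacity_per_dollar = 1.0
for year in range(1, 6):
    data *= data_growth
    capacity_per_dollar *= kryder_growth
    print(f"year {year}: data x{data:.1f}, disk-per-$ x{capacity_per_dollar:.1f}, "
          f"relative spend x{data / capacity_per_dollar:.1f}")

# Even at the industry-average rate, the budget needed just to hold the data
# creeps up every year; biomedical data grows faster still.
```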
Sequence data production rates
From Kahn S.D. On the future of genomic data. Science 2011;331:728-729.
Figure: C2B2 data growth, June ’08 - March ’12. Data usage (TB) vs. months since June ’08, showing logical usage, the usage trend, raw capacity, the raw requirement (50% overhead), and the industry trend (59% annual growth).
The storage challenge:
Design a storage system that can:
1. Perform well enough to analyze the data (i.e. stand up to a top500 supercomputer);
2. Scale from terabytes to petabytes (without having to constantly rebuild);
3. Protect important data (from crashes, users and floods).
Performance
• We have 4000 CPUs working around the clock on analyzing data; we want to keep them all fed with data all of the time.
• Our workload is “random” and not well behaved. It’s notoriously difficult to design for this kind of workload. Ideally, we want a solution that can be flexible as our workloads change.
• The more disks we have spinning in parallel, the better the performance. We’re going to need a lot of disks. Using a rough heuristic, 1 disk per compute node would mean ~500 disks.
• But, to make that many disks useful, we’re going to need a lot of processing and network capabilities.
For parallel processes you need parallel file access.
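As a rough illustration of that sizing logic, here is a back-of-envelope sketch. Every number in it (the per-node bandwidth target, the per-disk streaming rate, the random-workload derating factor) is an assumed, illustrative value rather than a measurement from our cluster.

```python
# Back-of-envelope spindle sizing. Every number is an illustrative assumption,
# not a measurement from our cluster.
NODES = 500               # compute nodes (~4000 cores at ~8 cores per node)
MB_S_PER_NODE = 50        # target sustained bandwidth per node
DISK_STREAM_MB_S = 100    # one SATA disk, purely sequential
RANDOM_DERATE = 0.25      # "random, not well behaved" workloads get a fraction of that

demand_mb_s = NODES * MB_S_PER_NODE
per_disk_mb_s = DISK_STREAM_MB_S * RANDOM_DERATE
disks_needed = demand_mb_s / per_disk_mb_s

print(f"aggregate demand ~{demand_mb_s / 1000:.0f} GB/s -> ~{disks_needed:.0f} disks")
# Under these assumptions you land at roughly a thousand disks: the same order
# of magnitude as the "one disk per compute node" heuristic above.
```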
Traditional NAS Architecture
Single NAS head with multiple disk arrays
• Pro
• Support: Time-tested architecture with many major, competing vendors.
• Capacity scaling: relatively easy on modern NAS.
• Con
• Performance scaling: difficult & unpredictable.
• Management: Storage pools must be managed and tuned.
• Reliability: NAS head provides single failure point.
Diagram: Traditional NAS (single controller, many arrays). Storage clients (cluster nodes, servers, desktops, ...) send requests over the network to a single NAS head (CPU, cache, network) that processes network file-storage requests (NFS, CIFS, etc.) and manages SAN storage pools; behind it, disk arrays (JBODs or RAID) attach over a SAN or direct-attached interconnect (FC, SATA, ...). Typical vendors: NetApp, BlueArc, EMC, ...
Clustered NAS architecture
Single filesystem distributed across many nodes.
• Pro
• Capacity scaling: new nodes automatically integrate.
• Performance scaling: new nodes add CPU, Cache and network performance.
• Reliability: most architectures can survive multiple node failures.
• Con
• Support: Relatively new technology. Few vendors (but rapidly growing).
Diagram: Clustered NAS. A single filesystem is presented by multiple nodes, each with its own CPU, cache, network interface, and disk pool, and each able to process filesystem requests; a high-speed backend network transfers data between the nodes, and storage clients (cluster nodes, servers, desktops, ...) can mount the filesystem from any of them. Typical vendors: Isilon, Panasas, Gluster, ...
• In late 2009 we had scaled to where we thought we should be, but our system was unresponsive, with constant, very high load.
• More puzzling, the load didn’t seem to have any correlation with network traffic, disk throughput, or even load on the compute cluster.
Clustered NAS doesn’t solve everything...
What we found (with the help of good analytics)
Namespace reads
Namespace operations consume CPU and waste I/O capability.
• It’s common in biomedical data to have thousands, or even millions, of small files in a project.
• We have ~500M files, with an average file size of less than 8 KB.
• Many genome “databases” are directory structures of flat files that get “indexed” by filename (NCBI, for instance, hosts > 20k files in their databases).
• Our system was thrashing, and we weren’t getting a lot of I/O... the 40% namespace reads were killing our performance.
Distribution of protocol operations: namespace read 40%, read 32%, write 15%, other 13%.
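For readers who want to check their own trees, here is a minimal sketch of the kind of survey that surfaces a small-file problem. It is not the analytics tooling we actually used, and note the irony: walking a large tree is itself a namespace-heavy workload.

```python
# A minimal sketch: survey a directory tree's file-size profile
# (the kind of check that shows an average file size well under 8 KB).
import os
import sys

def size_profile(root: str, small_cutoff: int = 8 * 1024):
    count = total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            count += 1
            total += size
            small += size < small_cutoff
    return count, total, small

if __name__ == "__main__":
    n, total, small = size_profile(sys.argv[1] if len(sys.argv) > 1 else ".")
    if n:
        print(f"{n} files, mean size {total / n / 1024:.1f} KB, "
              f"{small / n:.0%} under 8 KB")
```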
SSD Accelerated Namespace
Namespace data lives on SSDs (very low seek time); ordinary data lives on ordinary disks.
Diagram: SSD-enabled nodes. Each node keeps filesystem metadata on SSD and file data on ordinary disks; solid-state disks provide the fast seek times that namespace reads need.
• Namespace reads are seek intensive (not throughput intensive).
• SSD seek times are generally more than 40x faster than those of spinning disks.
• We were able to spread our filesystem metadata over SSDs on our nodes, dramatically increasing namespace performance and decreasing system load.
• We experienced an immediate, ~8x increase in filesystem responsiveness, and overall system performance increase.
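A back-of-envelope way to see why moving metadata to SSD helps so much is to count the positioning time that namespace reads consume. All rates below are illustrative assumptions, not measurements from our system (apart from the 40% namespace-read share, which comes from our protocol statistics above).

```python
# Back-of-envelope: positioning time consumed by namespace reads, HDD vs. SSD.
# All rates are illustrative assumptions, not measurements from our system.
HDD_SEEK_S = 0.010         # ~10 ms average positioning time on a SATA disk
SSD_SEEK_S = 0.00025       # ~0.25 ms on an SSD (>40x faster, per the slide)
OPS_PER_SEC = 50_000       # hypothetical aggregate protocol-operation rate
NAMESPACE_FRACTION = 0.40  # share of ops that are namespace reads (from our stats)

ns_ops = OPS_PER_SEC * NAMESPACE_FRACTION   # 20,000 metadata lookups per second

# Device-seconds of pure positioning consumed per wall-clock second:
print(f"on spinning disk: {ns_ops * HDD_SEEK_S:.0f} disk-seconds per second")    # ~200
print(f"on SSD:           {ns_ops * SSD_SEEK_S:.0f} device-seconds per second")  # ~5

# 200 disk-seconds per second is the equivalent of ~200 spindles doing nothing
# but metadata seeks; moving metadata to SSD hands that time back to data I/O.
```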
Capacity
1. While a high-performance clustered NAS naturally scales capacity along with performance, it’s an expensive way to build capacity.
2. We don’t want to have the “big” filesystem and the “fast” filesystem. We want everyone to see the same files from everywhere.
3. High-performance systems benefit from uniform hardware. Capacity scaling benefits from being able to use the latest, biggest, densest disks.
4. In fact, a clustered NAS is typically made of entirely uniform hardware, so how do you upgrade without time-consuming data migrations?
How we scaled a single filesystem from 168 TB to 1 PB
Multi-tiered Clustered NAS
Single filesystem distributed across many pools of nodes.
• Multiple “pools” of nodes share a single filesystem namespace.
• Different pools can have different performance/capacity, allowing for independent scaling of capacity and performance.
• New pools can be added, and old removed, allowing seamless upgrades.
• All pools are active, so nodes configured for large capacity can still serve data to low-demand devices.
Diagram: Multi-tiered Clustered NAS (multiple pools, single namespace). A high-speed pool of SSD-accelerated storage nodes serves the compute clusters, sequencers, and other demanding clients; a high-capacity pool of nearline storage nodes serves infrastructure servers, desktops, and other low-demand devices. Rule-based data migration moves data between the pools over a shared backend network. Clients include high-performance compute clusters (blade chassis, compute nodes, virtual machines), data-generating research equipment (e.g. gene sequencers), server infrastructure, and desktops & workstations.
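To make the rule-based data migration idea concrete, here is a minimal sketch of a placement rule. The pool names, the 30-day threshold, and the use of access time are all illustrative assumptions; in practice the tiering is done by the filesystem's own policy engine, not a script like this.

```python
# A minimal sketch of a rule-based placement decision. Pool names, the 30-day
# threshold, and the use of atime are illustrative; the real migration is done
# by the filesystem's policy engine.
import os
import time

HIGH_SPEED_POOL = "ssd-accelerated"    # serves compute clusters and sequencers
HIGH_CAPACITY_POOL = "nearline"        # dense, cheap disks for cold data

def target_pool(path: str, hot_days: int = 30) -> str:
    """Route recently accessed files to the fast pool and cold files to capacity."""
    age_days = (time.time() - os.lstat(path).st_atime) / 86400
    return HIGH_SPEED_POOL if age_days <= hot_days else HIGH_CAPACITY_POOL

print(target_pool(__file__))   # a freshly touched file lands in the fast pool
```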
Evolution of a filesystem
Figure: raw capacity (TB) vs. months since inception, plotted against the capacity requirement.
2008, inception: 168 TB
2009, new nodes: 276 TB
2010 Q1, new nodes (separate cluster): 496 TB
2010 Q3, merged pools + new nodes (SSD): 672 TB
2010 Q4, swapped original nodes for new nodes: 648 TB
2011, capacity upgrade (denser nodes): 984 TB
Caveat: A single namespace has challenges
• Since the namespace doesn’t refresh, it tends to grow and grow. We currently have > 500 Million files in our filesystem.
• While it’s nice to have all of your files in one space, it takes a lot of effort to keep it organized.
• In the initial deployment, we spent roughly 30x longer planning the filesystem structure than deploying the hardware.
• If you plan your filesystem poorly, it could take a long time to relocate or remove all of those files.
Data Integrity
1. Users: “Oops! I didn’t mean to delete that!”
2. Glitches: “Error mounting: mount: wrong fs type, bad option, bad superblock on /dev/...”
3. Floods: “Who installed the water main over the data center?!”
How do you protect large-scale, important data from users, glitches & floods?
Tape vs. Disk-to-Disk
Tape is dead. Right?
Diagram: tape backup (NAS file servers feeding backup servers and a tape library) vs. disk-to-disk backup (a primary NAS replicating to a secondary NAS).
Comparison criteria (Tape vs. D-t-D): fast, easy to maintain, reliable, cheap, low-power, long shelf life.
• Using only tape is impractical. LTO-5 can write 1 TB in ~6 hr, or 200 TB in 50 days. You can split the job across drives, but that becomes a management nightmare.
• Using only disk is cost prohibitive: a complete disk system costs far more than a complete tape system. Plus:
• you need to keep the disks powered;
• there’s no easy (or safe) way to archive disks;
• it has to grow faster than primary storage (if you want historical archives).
Neither option is ideal on its own.
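The tape arithmetic above is easy to reproduce. Here is a small sketch using the ~6 hours-per-terabyte figure from the slide; the eight-drive case in the second call is a hypothetical, not our configuration.

```python
# Reproducing the tape math: ~6 hours per terabyte per LTO-5 drive, as cited above.
TB_PER_DRIVE_HOUR = 1 / 6          # effective write rate per drive

def backup_days(dataset_tb: float, drives: int = 1) -> float:
    """Days of continuous writing needed to stream a dataset to tape."""
    return dataset_tb / (TB_PER_DRIVE_HOUR * drives) / 24

print(f"{backup_days(200):.0f} days with one drive")        # ~50 days
print(f"{backup_days(200, drives=8):.1f} days with eight")  # ~6 days, but 8 streams to manage
```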
Our middle ground
Snapshots + replication + tape = protection from:
protection from:
• Users: frequent snapshots on the source provide easy “oops” recovery to the user.
• Glitches: replication provides short-term rapid recovery. Added snapshots extend replication archives to the mid-term (~6 mo.).
• Floods: Tape backup provides cheap, reliable archival of data, for large-scale disaster recovery (or important files from ’06). Leaving backup windows flexible keeps tape manageable.
Diagram: the data backup path. The primary storage cluster (short-term snapshots for easy user recovery) replicates frequently (daily) to a replication cluster that keeps a live copy of critical data with historical snapshots; the replication cluster is lazily archived (~6 mo.) to tape storage for long-term archival, and one copy per year goes to an offsite tape library for disaster recovery.
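Expressed as a policy table, the schedule in the diagram looks roughly like the sketch below. The snapshot frequency and retention values are illustrative placeholders, not our production settings.

```python
# The protection schedule from the diagram, expressed as a policy table.
# Frequencies and retention values are illustrative placeholders.
protection_policy = {
    "snapshots": {                 # "oops" recovery for users
        "where": "primary storage cluster",
        "frequency": "frequent (e.g. hourly)",
        "retention": "short-term",
    },
    "replication": {               # rapid recovery from glitches
        "where": "replication cluster",
        "frequency": "daily",
        "retention": "~6 months (with added snapshots)",
    },
    "tape": {                      # large-scale disaster recovery
        "where": "offsite tape library",
        "frequency": "~1 copy per year",
        "retention": "long-term archive",
    },
}

for tier, policy in protection_policy.items():
    print(tier, policy)
```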
Backup Infrastructure
The big picture:
Diagram: a multi-tiered, scale-out storage architecture from HPC infrastructure to the desktop. The primary storage cluster (multiple pools, single filesystem) combines a high-speed pool of SSD-accelerated nodes for compute clusters and sequencers with a nearline pool for infrastructure servers and desktops, with rule-based data migration between them. High-performance compute clusters (blade chassis, compute and virtual-machine nodes), data-generating research equipment (e.g. gene sequencers), virtualization and physical server infrastructure, and desktops & workstations all connect over a 10 Gbps aggregation network. Critical data replicates live to a replication cluster, which feeds tape storage for long-term archival and an offsite tape library for disaster recovery.
Conclusion
• With clustered NAS and SSD acceleration, we’re regularly seeing filesystem throughput in excess of 10 Gbps and IOPS well over 500k without issue.
• So far we’ve managed to stay ahead of our data-growth curve with multi-tiered storage. We plan to at least double capacity in the next 6-12 months with no major architectural changes.
• With a combination of snapshots, disk-to-disk replication and tape, we’re getting daily backups of all important data as well as long-term archives.
• Thank you! Questions?
Putting it all together.