Post on 15-Jan-2016
1
Data Management and Data Storage
Ben Rogers
ITS – Research Services
2
Data Awareness
Data Management
Data Storage
Campus Resources
Questions
Overview
3
Sequencing changes faster than IT
Understand the data you will produce
Understand the data you will keep
Understand how the data will move
Data Awareness
4
Understand the sizes of the data each instrument produces
◦ How often will you collect this data?
◦ What IT resources are needed for each data set?
How will you handle:
◦ Raw Data
◦ Intermediate Data
◦ Derived Data
Data You Produce
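A rough way to size what an instrument will produce is to multiply run size by run frequency, then scale for intermediate and derived data. The sketch below assumes hypothetical numbers throughout (200GB per run, two runs per week, intermediate data as large as the raw data, derived data at 10% of it); none of these figures come from the slides, so substitute measurements from your own instrument.

```python
def yearly_storage_tb(run_size_gb, runs_per_week,
                      intermediate_factor=1.0, derived_factor=0.1,
                      weeks_per_year=52):
    """Estimate yearly storage in TB for raw, intermediate, and derived data.

    The two factors are assumed ratios relative to raw data size;
    they are placeholders, not measured values.
    """
    raw_gb = run_size_gb * runs_per_week * weeks_per_year
    return {
        "raw": raw_gb / 1000,
        "intermediate": raw_gb * intermediate_factor / 1000,
        "derived": raw_gb * derived_factor / 1000,
    }


# Hypothetical instrument: 200GB per run, 2 runs per week
estimate = yearly_storage_tb(run_size_gb=200, runs_per_week=2)
for kind, size_tb in estimate.items():
    print(f"{kind}: {size_tb:.1f} TB/year")
print(f"total: {sum(estimate.values()):.1f} TB/year")
```

Keeping the three categories separate makes it easier to decide later which of them is worth retaining.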
5
Must decide what data to keep
◦ How long?
◦ How will it be stored?
Is it cheaper to:
◦ Rerun the experiment
◦ Rerun the analysis
Data You Keep
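The keep-or-regenerate question can be framed as simple arithmetic: compare the cost of storing the data for the retention period against the expected cost of regenerating it. This is only a sketch; the function names, the per-TB price, the rerun cost, and the rerun probability are all hypothetical inputs you would need to supply.

```python
def storage_cost(size_tb, years, cost_per_tb_year):
    """Total cost of keeping size_tb of data for the given number of years."""
    return size_tb * years * cost_per_tb_year


def cheaper_to_keep(size_tb, years, cost_per_tb_year,
                    rerun_cost, rerun_probability):
    """True if storing the data costs no more than the expected rerun cost.

    rerun_probability is an assumed chance (0 to 1) that the data
    will be needed again during the retention period.
    """
    expected_rerun = rerun_cost * rerun_probability
    return storage_cost(size_tb, years, cost_per_tb_year) <= expected_rerun


# Hypothetical: 2TB kept for 3 years at $100 per TB per year,
# versus a $5,000 rerun that has a 50% chance of being needed
print(cheaper_to_keep(2, 3, 100, rerun_cost=5000, rerun_probability=0.5))
```

The same comparison can be run separately for rerunning the experiment (regenerating raw data) and rerunning the analysis (regenerating intermediate and derived data).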
6
Data captured by the instrument must be moved
Terabytes of data may be involved
Moving terabytes of data across networks is non-trivial
◦ The network is not always the bottleneck
Data Movement
7
Common Data Movements
◦ Instrument to local capture storage
◦ Capture storage to shared storage
◦ Shared storage to HPC resource
◦ Shared storage to desktop
◦ Shared storage to backup/replication
Data Movement
8
Globus Online – Fastest for big files but requires GridFTP
scp – Fetch (Mac), WS_FTP (Windows)
Network Drive File Copy – Slowest but simplest
External Hard Drive – Reasonably fast but requires physical movement
Tools for Moving Data
9
Transfer Mechanism | Transfer Speed
External Hard Drive | 100MB/second read + 100MB/second write + Walking Time
Gigabit Ethernet | Up to 120MB/second
Typical Desktop Hard Drive | 100MB/second
Typical Desktop SSD | 300MB/second
GridFTP over 1Gb | 120MB/second
CIFS over 1Gb | 60-80MB/second
scp over 1Gb | 60-100MB/second
Fastest network filesystem on campus | 600MB/second single copy, 6GB/second aggregate
Data Movement Speed
Moving 1TB can easily take 3 hours or more!
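The 3-hour figure follows from the speeds on this slide: a transfer runs at the rate of its slowest stage, so 1TB limited by a 100MB/second disk needs about 10,000 seconds. A minimal sketch of that arithmetic, assuming decimal units (1TB = 10^12 bytes) and a hypothetical helper named `transfer_hours`:

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def transfer_hours(size_bytes, *stage_rates_mb_s):
    """Hours to move size_bytes through stages with the given MB/second rates.

    The slowest stage (disk, network, or protocol) sets the overall rate.
    """
    bottleneck_bytes_per_s = min(stage_rates_mb_s) * MB
    return size_bytes / bottleneck_bytes_per_s / 3600


# 1TB read from a desktop hard drive (100MB/s) over Gigabit Ethernet (120MB/s)
print(round(transfer_hours(1 * TB, 100, 120), 2))      # 2.78

# Same path, but using CIFS at the low end of its range (60MB/s)
print(round(transfer_hours(1 * TB, 100, 120, 60), 2))  # 4.63
```

In the first case the disk, not the network, is the limit, which is the point made earlier about the network not always being the bottleneck.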
10
Very important
There are many solutions
◦ Wiki, spreadsheet, database, etc.
◦ Campus Options: Campus Wiki, Sharepoint, Redcap, Galaxy
Make sure you have backups!
Data Management
11
Cheap storage is easy
◦ 2TB External USB Drive
Big storage is harder
◦ 50TB Storage Server
Big, fast, cheap, safe storage is much harder
◦ 50TB Storage Server Pair
Checksum High Performance Network Backups
Cost of storage does not scale linearly
Data Storage
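One common way to apply checksums when data is copied between storage systems is to hash the source and the copy and compare the results. The sketch below uses SHA-256 from Python's standard `hashlib`; the choice of algorithm and the function names are assumptions for illustration, not something the slides specify.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in 1MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for very large files
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path):
    """True if both files have identical contents according to SHA-256."""
    return file_sha256(source_path) == file_sha256(copy_path)
```

Calling `copies_match(source_path, copy_path)` after a transfer, and before deleting the source, catches silent corruption that a file-size comparison would miss.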
12
Economy of Scale Reversed
[Chart: storage cost in dollars (axis from $0 to $1,800,000) at 1TB, 50TB, and 500TB for three options: AWS, USB, and USBx3]
13
Where you store the data can impact how fast you can analyze your data.
During testing on Helium, we saw over 100% difference in BWA analysis time depending on where the data was stored.
If you do analysis on your desktop, fast storage will likely improve analysis time for NGS.
Galaxy is being optimized to take advantage of this.
If you run directly on a cluster, ask for storage recommendations.
Storage & Analysis
14
Galaxy
Redcap
Helium
◦ Colocation
◦ /nfsscratch – 110TB
◦ /glusterscratch – 146TB
R Drive
ITS Research Data Storage Service Pilot
Lab/Shared ZFS Systems
Campus Resources
15
Today
◦ All data in Galaxy should be considered transient (deleted after 30 days)
◦ Data processing platform only
◦ Please back up all data that is valuable to you!
Future
◦ Solutions to allow longer-term storage of data
Galaxy
16
Increased availability of 10Gb Networking
Research Data Storage Service
Backup Service
Cloud Storage
Galaxy Data Libraries
Future Directions
17
Ben-rogers@uiowa.edu
Questions?
18
Tom Bair – Economy of Scale Reversed
Safe Photo – http://americanbestlocksmith.com/wp-content/uploads/2010/09/safe-installation.jpg
BioTeam - http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CF4QFjAA&url=http%3A%2F%2Fwww.bioteam.net%2Fwp-content%2Fuploads%2F2010%2F03%2Fcdag-xgen-storageForNGS_v3.pdf&ei=0cwWUJPJG4WHqQGihoHoDw&usg=AFQjCNFrzHSvQ8y4Ze3igsXd9mFV_EWb_Q
Credits