Ben Rogers ITS – Research Services 1. Data Awareness Data Management Data Storage Campus...

Data Management and Data Storage Ben Rogers ITS – Research Services 1


Page 1: Ben Rogers ITS – Research Services 1.  Data Awareness  Data Management  Data Storage  Campus Resources  Questions 2.


Data Management and Data Storage

Ben Rogers
ITS – Research Services


Overview

Data Awareness
Data Management
Data Storage
Campus Resources
Questions


Data Awareness

Sequencing changes faster than IT
Understand the data you will produce
Understand the data you will keep
Understand how the data will move


Data You Produce

Understand the sizes of the data each instrument produces
  ◦ How often will you collect this data?
  ◦ What IT resources are needed for each data set?
How will you handle:
  ◦ Raw Data
  ◦ Intermediate Data
  ◦ Derived Data
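The sizing questions above can be turned into a rough yearly estimate. The sketch below is illustrative only: the `Instrument` class, the `yearly_volume_tb` function, and the `derived_multiplier` parameter are hypothetical names, and the numbers in the usage note are placeholders rather than measurements from any real instrument.

```python
from dataclasses import dataclass


@dataclass
class Instrument:
    """Hypothetical description of one data-producing instrument."""
    name: str
    gb_per_run: float      # raw data written per run, in GB (placeholder)
    runs_per_month: float  # how often data is collected (placeholder)


def yearly_volume_tb(instruments, derived_multiplier=1.0):
    """Estimate total TB per year across all instruments.

    derived_multiplier is an assumed ratio of intermediate plus derived
    data to raw data (1.0 means analysis doubles the footprint).
    """
    raw_gb = sum(i.gb_per_run * i.runs_per_month * 12 for i in instruments)
    return raw_gb * (1.0 + derived_multiplier) / 1000.0
```

For example, `yearly_volume_tb([Instrument("sequencer", 100, 10)], derived_multiplier=0.5)` describes a made-up instrument producing 100 GB per run, ten runs a month, with intermediate and derived data adding half again as much.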


Data You Keep

Must decide what data to keep
  ◦ How long?
  ◦ How will it be stored?
Is it cheaper to:
  ◦ Rerun the experiment
  ◦ Rerun the analysis


Data Movement

Data captured by the instrument must be moved
Terabytes of data may be involved
Moving terabytes of data across networks is non-trivial
  ◦ The network is not always the bottleneck


Data Movement

Common Data Movements
  ◦ Instrument to local capture storage
  ◦ Capture storage to shared storage
  ◦ Shared storage to HPC resource
  ◦ Shared storage to desktop
  ◦ Shared storage to backup/replication


Tools for Moving Data

Globus Online – Fastest for big files but requires GridFTP
scp – Fetch (Mac), WS_FTP (Windows)
Network Drive File Copy – Slowest but simplest
External Hard Drive – Reasonably fast but requires physical movement


Data Movement Speed

Transfer Mechanism                      Transfer Speed
External Hard Drive                     100MB/second read + 100MB/second write + Walking Time
Gigabit Ethernet                        Up to 120MB/second
Typical Desktop Hard Drive              100MB/second
Typical Desktop SSD                     300MB/second
GridFTP over 1Gb                        120MB/second
CIFS over 1Gb                           60-80MB/second
scp over 1Gb                            60-100MB/second
Fastest network filesystem on campus    600MB/second single copy, 6GB/second aggregate

Moving 1TB can easily take 3 hours or more!
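The 3-hour figure follows from simple arithmetic on the speeds listed for this slide. The sketch below shows that calculation; the `transfer_hours` function name is made up for illustration, 1TB is taken as 1,000,000MB, and the rate is assumed to be sustained for the whole copy, which real transfers rarely manage.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at a sustained rate.

    Assumes decimal units (1 TB = 1,000,000 MB) and no protocol
    overhead, retries, or competing traffic.
    """
    size_mb = size_tb * 1_000_000
    return size_mb / rate_mb_per_s / 3600


# Rates taken from the Data Movement Speed slide (MB/second).
rates = {
    "Gigabit Ethernet (best case)": 120,
    "Typical desktop hard drive": 100,
    "CIFS over 1Gb (low end)": 60,
}

for name, rate in rates.items():
    print(f"{name}: {transfer_hours(1, rate):.1f} hours per TB")
```

At 100MB/second, 1TB takes about 2.8 hours, and at 60MB/second it takes more than 4.5 hours, which is consistent with the slide's warning.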


Data Management

Very important
There are many solutions
  ◦ Wiki, spreadsheet, database, etc.
  ◦ Campus Options
      Campus Wiki
      Sharepoint
      Redcap
      Galaxy
Make sure you have backups!


Data Storage

Cheap storage is easy
  ◦ 2TB External USB Drive
Big storage is harder
  ◦ 50TB Storage Server
Big, fast, cheap, safe storage is much harder
  ◦ 50TB Storage Server Pair
      Checksum
      High Performance Network
      Backups
Cost of storage does not scale linearly
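The slide lists checksums as one ingredient of safe storage without saying how they are computed. One common approach, shown here as a minimal sketch rather than the method used on any campus system, is to hash a file before and after copying it and compare the digests. The function names `sha256_of` and `copies_match` are hypothetical; `hashlib` is the Python standard library.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source, copy):
    """True if both files have identical contents according to SHA-256."""
    return sha256_of(source) == sha256_of(copy)
```

Storing the digest alongside the data also allows a later check that a file has not been silently corrupted on disk.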


Economy of Scale Reversed

[Chart: storage cost, on an axis from $0 to $1,800,000, at 1TB, 50TB, and 500TB for three options: AWS, USB, and USBx3]


Storage & Analysis

Where you store the data can affect how fast you can analyze it.

During testing on Helium, we saw over 100% difference in analysis time for BWA depending on where we stored the data.

If you are doing analysis on your desktop, fast storage will likely improve analysis time for NGS.

Galaxy is being optimized to take advantage of this.

If you are running directly on a cluster, ask for recommendations.
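A quick way to compare candidate storage locations before committing to one is to time a sequential read of the same large file from each. The sketch below is a rough probe under stated assumptions, not a benchmark: `read_throughput_mb_per_s` is a hypothetical helper, and a file that was read recently may be served from the operating system's cache, which inflates the result.

```python
import time


def read_throughput_mb_per_s(path, chunk_size=8 * 1024 * 1024):
    """Sequentially read a file and return the observed rate in MB/second.

    Uses decimal megabytes (1 MB = 1,000,000 bytes) to match the units
    used elsewhere in these slides.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total_bytes / 1_000_000 / elapsed
```

Running it against a copy of the same input file on local disk, a network drive, and cluster scratch space gives a first indication of which location is likely to slow an analysis down.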


Campus Resources

Galaxy
Redcap
Helium
  ◦ Colocation
  ◦ /nfsscratch – 110TB
  ◦ /glusterscratch – 146TB
R Drive
ITS Research Data Storage Service Pilot
Lab/Shared ZFS Systems


Galaxy

Today
  ◦ All data in Galaxy should be considered transient
      Deleted after 30 days
  ◦ Data processing platform only
  ◦ Please back up all data that is valuable to you!
Future
  ◦ Solutions to allow longer term storage of data


Future Directions

Increased availability of 10Gb Networking
Research Data Storage Service
Backup Service
Cloud Storage
Galaxy Data Libraries


Questions?

[email protected]


Credits

Tom Bair – Economy of Scale Reversed
Safe Photo - http://americanbestlocksmith.com/wp-content/uploads/2010/09/safe-installation.jpg
BioTeam - http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CF4QFjAA&url=http%3A%2F%2Fwww.bioteam.net%2Fwp-content%2Fuploads%2F2010%2F03%2Fcdag-xgen-storageForNGS_v3.pdf&ei=0cwWUJPJG4WHqQGihoHoDw&usg=AFQjCNFrzHSvQ8y4Ze3igsXd9mFV_EWb_Q