If you build it, will they come?

Dirk Petersen, Scientific Computing Director,
Fred Hutchinson Cancer Research Center

Joe Arnold, Chief Product Officer, President,
SwiftStack

Using OpenStack Swift to build a large scale active archive in a Scientific Computing environment

Challenge

Need an archive to offload expensive storage

❖ Low cost storage

❖ High throughput: load large genome files for HPC

❖ Faster and lower cost than S3 & no proprietary lock-in

About Fred Hutch

❖ Cancer & HIV research

❖ 3 Nobel Laureates

❖ $430M budget / 85% NIH funding

❖ 2,700 employees

❖ Conservative use of information technology

IT at Fred Hutch

❖ Multiple data centers with >1000 kW capacity

❖ 100 staff in Center IT plus divisional IT

❖ Team of 3 Sysadmins to support storage

❖ IT funded by indirects (F&A)

❖ Storage Chargebacks started Nov 2014

❖ 1.03 PUE, natural air cooled

Inside Fred Hutch data center

About SwiftStack

❖ Object Storage software

❖ Built with OpenStack Swift

❖ SwiftStack is a leading contributor and the Project Technical Lead

❖ Software-defined storage platform for object storage

[Architecture diagram: the SwiftStack Controller is an out-of-band, software-defined controller providing security (authentication & authorization), a user dashboard, and device & node management. Runtime agents handle load balancing, monitoring, utilization, and device inventory. The Swift object storage engine runs on standard OS/hardware nodes spread across three data centers, exposed via the Swift API and NFS/CIFS.]

SwiftStack Resources

https://swiftstack.com/books/

Researchers concerned about …

❖ Significant storage costs: $40/TiB/month chargebacks (first 5 TB is free) and declining grant funding

❖ “If you charge us, please give us some cheap storage for old and big files”

❖ (Mis)perception of storage value (“I can buy a hard drive at Best Buy”)

Not what you want: unsecured and unprotected external USB storage

Finance concerned about …

❖ Cost predictability and scale

❖ Data growth drives storage costs of up to $1M per year

❖ Genomics data grows at 40%/year and chargebacks don’t cover all costs

❖ Expensive forklift upgrades every few years

❖ The public cloud (e.g. Amazon S3) set a new transparent cost benchmark

How much does it cost?

❖ Only small changes vs. 2014

❖ Kryder’s law obsolete at <15%/year?

❖ Swift now down to Glacier cost (hardware down to $3/TB/month)

❖ No price reductions in the cloud

❖ 4TB (~$120) and 6TB (~$250) drives cost the same

❖ Do you want a fault domain of 144TB or 216TB in your storage servers?

❖ Don’t save on CPU: erasure coding is coming!

[Chart: monthly storage cost ($/TB/month): NAS $40, Amazon S3 $28, Google $26, SwiftStack $11. For comparison, AWS EFS is $300/TB/month.]

Economy File in production in 2014

❖ Chargebacks drove the Hutch to embrace more economical storage

❖ Selected Swift object storage managed by SwiftStack

❖ Go-live in 2014; strong interest and expansion in 2015

❖ Researchers do not want to pay the price for standard enterprise storage

Chargebacks spike Swift utilization!

❖ Started storage chargebacks on Nov 1st

❖ Triggered strong growth in October

❖ Users sought to avoid the high cost of enterprise NAS and put as much as possible into lower cost Swift

❖ Underestimated the success of Swift

❖ Needed to stop migration to buy more hardware

❖ Can migrate 30+ TB per day today

Standard Hardware

❖ Supermicro with Silicon Mechanics

❖ 2.1PB raw capacity; ~700TB usable

❖ No RAID controllers; no storage lost to RAID

❖ Seagate SATA drives (desktop)

❖ 2 x 120GB Intel S3700 SSDs; OS + metadata

❖ 10GBase-T connectivity

❖ (2) Intel Xeon E5 CPUs

❖ 64GB RAM

Management of OpenStack Swift using SwiftStack

❖ Out-of-band management controller

❖ SwiftStack provides control & visibility

❖ Monitoring and stats at cluster, node, and drive levels

❖ Authentication & Authorization

❖ Capacity & Utilization Management via Quotas and Rate Limits

❖ Alerting & Diagnostics

SwiftStack Automation

❖ Deployment automation

❖ Lets us roll out Swift nodes in 10 minutes

❖ Upgrading Swift across clusters with 1 click

❖ 0.25 FTE to manage cluster

Supporting Scientific Computing Workloads

HPC Use Cases & Tools

HPC Requirements

❖ High aggregate throughput

❖ Current network architecture is the bottleneck

❖ Many parallel streams used to max out throughput

❖ Ideal for HPC cluster architecture

Not a Filesystem

No traditional file system hierarchy; we just have containers that can contain millions of objects (aka files)

Huh, no sub-directories? But how the heck can I upload my uber-complex bioinformatics file system with 11 folder hierarchies to Swift?

Filesystem Mapping with Swift

We simulate the hierarchical structure by simply putting forward slashes (/) in the object name (or file name)

❖ So, how do you actually copy a folder?

❖ However, the Swift client is frequently used, well supported, maintained and really fast!!

$ swift upload --changed --segment-size=2G --use-slo \
    --object-name="pseudo/folder" "container" "/my/local/folder"

Really? Can’t we get this a little easier?

Introducing Swift Commander

❖ Swift Commander, a simple shell wrapper for the Swift client, curl and some other tools, makes working with Swift very easy.

❖ Sub-commands such as swc ls, swc cd, swc rm, swc more give you a feel that is quite similar to a Unix file system

❖ Actively maintained and available at: https://github.com/FredHutch/Swift-commander/

$ swc upload /my/posix/folder /my/Swift/folder

$ swc compare /my/posix/folder /my/Swift/folder

$ swc download /my/Swift/folder /my/scratch/fs

Much easier…

Some additional examples

Swift Commander + Metadata

❖ Didn’t someone say that object storage systems were great at using metadata?

❖ Yes, and you can just add a few key:value pairs as upload arguments:

❖ Query the metadata via swc, or use an external search engine such as Elasticsearch

$ swc meta /my/Swift/folder

Meta Cancer: breast

Meta Collaborators: jill,joe,jim

Meta Project: grant-xyz

$ swc upload /my/posix/folder /my/Swift/folder \
    project:grant-xyz collaborators:jill,joe,jim cancer:breast
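
These key:value pairs end up as standard Swift object metadata, which is why the swc meta output above mirrors swift stat; the same metadata can therefore be set or read with the plain Swift client as well. A minimal sketch, with container and object names that are purely illustrative:

$ swift post container pseudo/folder/genome.bam \
    -m "Project:grant-xyz" -m "Collaborators:jill,joe,jim" -m "Cancer:breast"
$ swift stat container pseudo/folder/genome.bam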

Integrating with HPC

❖ Integrating Swift in HPC workflows is not really hard

❖ Example: running samtools using persistent scratch space (files deleted if not accessed for 30 days)

if ! [[ -f /fh/scratch/delete30/pi/raw/genome.bam ]]; then
  swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam
fi
samtools view -F 0xD04 -c /fh/scratch/delete30/pi/raw/genome.bam > otherfile

A complex 50-line HPC submission script prepping a GATK workflow requires just 3 more lines!!
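
To illustrate what those extra lines typically look like, here is a sketch of a Slurm batch script that stages input from Swift into scratch before the workflow and pushes results back afterwards; the paths and the run_gatk_workflow.sh step are placeholders, not the actual submission script:

#!/bin/bash
#SBATCH -N1 -c4
# stage the input from Swift into persistent scratch
swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam
# placeholder for the existing ~50-line GATK prep/run logic
./run_gatk_workflow.sh /fh/scratch/delete30/pi/raw/genome.bam
# archive the results back into Swift
swc upload /fh/scratch/delete30/pi/results /Swiftfolder/results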

Other HPC Integrations

❖ Use the HPC system to download lots of bam files in parallel

❖ 30 cluster jobs run in parallel on 30 1G nodes (which is my HPC limit)

❖ My scratch file system says it loads data at 1.4 GB/s

❖ This means that each bam file is downloaded at 47 MB/s on average, and downloading this dataset of 1.2 TB takes 14 min

$ swc ls /Ext/seq_20150112/ > bamfiles.txt
$ while read FILE; do
>   sbatch -N1 -c4 --wrap="swc download /Ext/seq_20150112/$FILE .";
> done < bamfiles.txt

$ squeue -u petersen

JOBID PARTITION NAME USER ST TIME NODES NODELIST

17249368 campus sbatch petersen R 15:15 1 gizmof120

17249371 campus sbatch petersen R 15:15 1 gizmof123

17249378 campus sbatch petersen R 15:15 1 gizmof130

$ fhgfs-ctl --userstats --names --interval=5 --nodetype=storage

====== 10 s ======

Sum: 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]

petersen 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]

Swift Commander + Small Files

So, we could tar up this entire directory structure… but then we have one giant tar ball.

Solution: tar up sub-dirs in one file, but create a tar ball for each level.

E.g. for /folder1/folder2/folder3, restoring folder2 and below requires just folder2.tar.gz + folder3.tar.gz

$ swc arch /my/posix/folder /my/Swift/folder

$ swc unarch /my/Swift/folder /my/scratch/fs

It’s available at https://github.com/FredHutch/Swift-commander/blob/master/bin/swbundler.py

It’s Easy

It’s Fast

❖ Archiving uses multiple processes; measured up to 400 MB/s from one Linux box.

❖ Each process uses pigz multithreaded gzip compression (Example: compressing a 1GB DNA string down to 272MB takes 111 sec using gzip, 5 seconds using pigz)

❖ Restore can use standard gzip
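
Because pigz writes standard gzip streams, the parallel speedup applies only on the archiving side while restores stay portable; a minimal sketch of both steps, with the thread count and file name chosen purely for illustration:

$ pigz -p 8 dna_string.txt      # parallel compression, produces dna_string.txt.gz
$ gzip -d dna_string.txt.gz     # restore anywhere with plain gzip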

Desktop Clients & Collaboration

❖ Reality: Every archive requires access via GUI tools

❖ Requirements

❖ Easy to use

❖ Do not create any proprietary data structures in Swift that cannot be read by other tools

Cyberduck desktop client running on Windows

Desktop Clients & Collaboration

❖ Another example: ExpanDrive and Storage Made Easy

❖ Works with Windows and Mac

❖ Integrates with Mac Finder and is mountable as a drive in Windows

rclone: mass copy, backup, data migration

❖ rclone is a multithreaded data copy / mirror tool

❖ Consistent performance on Linux, Mac and Windows

❖ E.g. keep a mirror of a Synology workgroup NAS (QNAP has a built-in Swift mirror option)

❖ Data remains accessible by swc and desktop clients

❖ Mirror protected by Swift undelete (currently 60 days retention)
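
A minimal sketch of such a mirror job, assuming a Swift remote has already been configured with rclone config; the remote name hutch-swift and container nas-mirror are made up for illustration:

$ rclone sync /mnt/synology/workgroup hutch-swift:nas-mirror   # one-way mirror into a Swift container
$ rclone ls hutch-swift:nas-mirror                             # spot-check what landed in Swift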

Galaxy: Scientific Workflow Management

❖ Galaxy web-based high throughput computing at the Hutch uses Swift as primary storage in production today

❖ SwiftStack patches contributed to the Galaxy Project

❖ Swift allows us to delegate “root” access to bioinformaticians

❖ Integrated with the Slurm HPC scheduler: automatically assigns the default PI account for each user

Summary

Discovery is driven by technologies that generate larger and larger datasets

❖ Object storage ideal for

❖ Ever-growing data volumes

❖ High throughput required for HPC

❖ Faster and lower cost than S3 & no proprietary lock-in

Thank you!

Dirk Petersen, Scientific Computing Director,
Fred Hutchinson Cancer Research Center

Joe Arnold, Chief Product Officer, President,
SwiftStack