Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
File Systems for your Cluster
Selecting a storage solution for tier 2
Suggestions and experiences
Jos van Wezel, Institute for Scientific Computing
Karlsruhe, Germany
Overview
• Estimated sizes and needs
• GridKa today and roadmap
• Connection models
• Hardware choices
• Software choices
• LCG
Scaling the tiers
• Tier 0: 2 PB disk, 10 PB tape, 6000 kSI (data collection, distribution to tier 1)
• Tier 1: 1 PB disk, 10 PB tape, 2000 kSI (data processing, calibration, archiving for tier 2, distribution to tier 2)
• Tier 2: 0.2 PB disk, no tape, 3000 kSI (data selections, simulation, distribution to tier 3)
• Tier 3: location and/or group specific
1 Opteron today ~ 1 kSI (see the sketch below)
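As a rough orientation, a minimal sketch (assuming the slide's rule of thumb of ~1 kSI per Opteron node; the tier figures are the ones listed above) that translates the compute targets into node counts:

    # Rough tier sizing, assuming ~1 kSI per Opteron node (rule of thumb above).
    TIERS = {
        "Tier 0": {"disk_pb": 2.0, "tape_pb": 10.0, "ksi": 6000},
        "Tier 1": {"disk_pb": 1.0, "tape_pb": 10.0, "ksi": 2000},
        "Tier 2": {"disk_pb": 0.2, "tape_pb": 0.0, "ksi": 3000},
    }
    KSI_PER_NODE = 1.0  # assumption from the slide: 1 Opteron ~ 1 kSI

    for name, t in TIERS.items():
        nodes = t["ksi"] / KSI_PER_NODE
        print(f"{name}: ~{nodes:.0f} nodes, {t['disk_pb']} PB disk, {t['tape_pb']} PB tape")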
GridKa growth
[Chart: GridKa capacity growth 2002-2009, spanning LCG Phase I, Phase II and Phase III; disk and tape capacity in terabytes (axis 0-3000) and compute power in kSI95 CPU (axis 0-6000).]
Storage at GridKa
[Diagram: GPFS servers (each 1 x 2 Gb FC, 1-2 x 1 Gb Ethernet) attached over a Fibre Channel SAN (32 x 2 Gb FC) to RAID 5 devices with 1120 disks totalling 120 TB. GPFS is served via NFS and dCache via dcap over TCP/IP to the cluster nodes; 2 dCache pool nodes with 1 TB of disk connect to a tape library through a TSM server.]
GridKa road map
• 2004-2005
– expand and stabilize the GPFS / NFS combination
– possibly install Lustre
– integrate dCache
– look for an alternative to TSM, if really needed
– try SATA disks
• 2004-2007
– decide the path for the parallel FS and dCache
– decide the tape backend
– scale for LHC (200-300 MB/s continuous for some weeks; see the calculation below)
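To put that sustained-rate target in perspective, a minimal back-of-the-envelope sketch (the run lengths are assumptions; the slide only says "some weeks"):

    # Data volume implied by 200-300 MB/s sustained over a multi-week run.
    SECONDS_PER_WEEK = 7 * 24 * 3600

    for rate_mb_s in (200, 300):
        for weeks in (2, 4):  # "some weeks": run lengths assumed here
            total_tb = rate_mb_s * SECONDS_PER_WEEK * weeks / 1e6
            print(f"{rate_mb_s} MB/s for {weeks} weeks -> ~{total_tb:.0f} TB")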
Tier 2 targets (source: G. Quast / Uni-KA)
• 5 MB/s throughput per node
• 300 nodes
• 1000 MB/s aggregate
• 200 TB overall disk storage
(see the sanity check below)
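A minimal sanity check on these targets (the concurrency reading is mine; the slide only lists the numbers):

    # Tier 2 targets: per-node rate x node count vs. the aggregate target.
    PER_NODE_MB_S = 5
    NODES = 300
    AGGREGATE_TARGET_MB_S = 1000

    peak = PER_NODE_MB_S * NODES           # 1500 MB/s if every node streams at once
    active = AGGREGATE_TARGET_MB_S / peak  # fraction of nodes busy at the target
    print(f"worst case {peak} MB/s; the 1000 MB/s target assumes ~{active:.0%} of nodes active")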
Estimate your needs (1)
• Can you charge for the storage?
– influences the choice between on-line and off-line (tape) storage
– classification of data (volatile, precious, high IO, low IO)
• How many nodes will access the storage simultaneously?
– absolute number of nodes
– number of nodes that run a particular job
– job classification to separate accesses
Estimate your needs (2)
• What kind of access (read/write/transfer sizes)?
– ability to control the access pattern
• pre-staging
• software tuning
– job classification to influence the access pattern
• spread via the scheduler
• What size will the storage eventually have?
– exploit random access via a large number of controllers: up to 4 TB or 100 MB/s per controller (see the sizing sketch below)
– need high-speed disks
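A minimal sizing sketch built on the rule of thumb above, 4 TB or 100 MB/s per controller (the example inputs are assumptions):

    import math

    # Controllers needed so neither the capacity nor the bandwidth limit is hit.
    TB_PER_CONTROLLER = 4      # rule of thumb from the slide
    MB_S_PER_CONTROLLER = 100  # rule of thumb from the slide

    def controllers_needed(total_tb, total_mb_s):
        by_capacity = math.ceil(total_tb / TB_PER_CONTROLLER)
        by_bandwidth = math.ceil(total_mb_s / MB_S_PER_CONTROLLER)
        return max(by_capacity, by_bandwidth)

    # Example: the tier 2 targets from the earlier slide.
    print(controllers_needed(total_tb=200, total_mb_s=1000))  # -> 50, capacity-bound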
Disk technology keys
• Disk areal density is larger than tape's
– disks are rigid
• The density growth rate for disks continues (but more slowly)
– deviation from Moore's law (same for CPUs)
• The superparamagnetic effect is not yet limiting progress
– the end has been in sight for 20 years
• The convergence of disk and tape costs has stopped
– still a factor of 4 to 5 difference
Disks and tape will both be around for at least another 10 years
Disk areal density vs. head-to-media spacing
[Chart: areal density (Mb/in², log scale from 10^-3 to 10^6) versus head-to-media spacing (nm, 1 to 100000); the data points range from the IBM RAMAC (1956: 5 MB, 2000 bits/in²) to the Hitachi Deskstar 7K400 (2004: 400 GB, 61 Gb/in²).]
To SATA or not, compared to SCSI/FC
• Up to 4 times cheaper (3 k€/TB vs. 10 k€/TB; compared below)
• 2 times slower in a multi-user environment (access time)
• Not really for 24/7 operation (more failures)
• Larger capacity per disk: max 140 GB SCSI vs. 400 GB SATA (today)
• No large-scale experience
• Drive warranty of only 1 or 2 years
• GridKa uses SCSI, SAN and expensive controllers
• Bad experiences with IDE NAS boxes (160 GB disks, 3Ware controllers)
• New products combine SATA disks with expensive controllers
• IO operations matter more than throughput for most accesses
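A minimal cost comparison at the per-TB prices above (the capacity is an assumed example, taken from the tier 2 target):

    # Disk cost at the slide's prices: SATA ~3 k€/TB, SCSI/FC ~10 k€/TB.
    PRICES_KEUR_PER_TB = {"SATA": 3, "SCSI/FC": 10}
    capacity_tb = 200  # assumed example: the tier 2 disk target

    for tech, price in PRICES_KEUR_PER_TB.items():
        print(f"{tech}: {capacity_tb * price} k€ for {capacity_tb} TB")
    # SATA 600 k€ vs. SCSI/FC 2000 k€: the gap buys a lot of spare drives.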
Network attached storage
IO path via the network:
[Diagram: cluster nodes reach the file servers over an IP network; the IO path from the file servers to their disks is local, via Fibre Channel or SCSI.]
NAS example
• Server with 4 dual SCSI buses: more than 1 GB/s transfer
• 4 x 2 SATA RAID boxes (16 x 250 GB each): ~4 TB per bus
• 2 x 4 x 2 x 4 TB = 64 TB on one server (see the breakdown below)
• Estimated 30 k€, or 35 k€ with point-to-point FC. Not that bad.
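A minimal sketch of that capacity chain (the factor names are my reading of the bullets above; the slide itself only gives the product):

    # Capacity chain of the example NAS server.
    channels_per_bus = 2   # each SCSI bus is dual
    buses = 4
    boxes_per_channel = 2  # the "4 x 2" SATA RAID boxes
    tb_per_box = 4         # 16 x 250 GB = 4 TB per box

    total_tb = channels_per_bus * buses * boxes_per_channel * tb_per_box
    print(total_tb)  # -> 64 TB on one server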
SAN
[Diagram: cluster nodes each have a direct IO path to the storage, via a Fibre Channel SAN or iSCSI.]
SAN or Ethernet
• SAN has easier management
– exchange of hardware without interruption
– joining separate storage elements
• iSCSI needs a separate network (SCSI over IP)
• Very scalable performance
– via switches or directors
• 1 SCSI bus maxes out at 320 MB/s
– better than current FC, but FC is duplex
– not a fabric
– example follows
• ELVM for easier management
• Network block device
• Kernel 2.6: new 16 TB limit
• SAN is expensive (500 € per HBA, 1000 € per switch port)
• The direct-connection limitation can be partly compensated via a high-speed interconnect (InfiniBand, Myrinet etc.)
• Tightly coupled cluster with InfiniBand; can be used for FC too, depending on the FS software.
Combining FC and InfiniBand
[Diagram: one group of cluster nodes reaches the SAN disk collection over an FCP network, another group over an InfiniBand network.]
Software to drive the hardware
• File systems
– GPFS (IBM): GridKa uses this, and so does Uni-KA
– SAN-FS (IBM, $$): supports a range of architectures
– Lustre (HP, $): Uni-KA Rechenzentrum cluster
– PVFS: stability is rather low
– GFS (now RedHat) or OpenGFS
– NFS
• the Linux implementation is messy, but RH 3.0 EL seems promising
• NAS boxes reach impressive throughput, are stable, offer easy management and grow as needed (NetApp, Exanet)
– Terragrid (very new)
• (Almost-POSIX) access via library preload (see the sketch below)
– write once / read many
– changing a file means creating a new one and deleting the old
– not usable for all software (e.g. no DBMS!)
– examples: GridFTP (gfal), (x)rootd (rfio), dCache (dcap/gfal/rfio)
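A minimal sketch of what write once / read many means for an application that wants to "update" a file (plain Python on a local path; under the preload libraries the same create-new, delete-old pattern applies, and the helper and naming convention are mine):

    import os

    def rewrite_wo_rm(path, new_data):
        """'Update' a file on a write-once store: the file itself is immutable,
        so write a complete new file and drop the old one."""
        tmp = path + ".new"          # hypothetical naming convention
        with open(tmp, "wb") as f:   # create the replacement and write it once
            f.write(new_data)
        os.remove(path)              # delete the old version
        os.rename(tmp, path)         # the new file takes the old name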
GPFS
[Diagram: file blocks A, B, C, D striped round-robin over a row of disks.]
• Stripes over n disks (see the sketch below)
• Linux and AIX, or combined
• Max FS size 70 TB
• HSM option
• Scalable and very robust
• Easy management
• SAN, IP+SAN, or IP only
• Add and remove storage on-line
• Vendor lock-in
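A minimal sketch of the round-robin striping the diagram shows (block size and disk count are assumptions; GPFS's real allocator is considerably more involved):

    # Round-robin striping: block i of a file lands on disk i mod n.
    BLOCK_SIZE = 256 * 1024  # assumed stripe block size
    N_DISKS = 4

    def stripe(data, n_disks=N_DISKS):
        disks = [[] for _ in range(n_disks)]
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for i, block in enumerate(blocks):
            disks[i % n_disks].append(block)  # IO spreads over all spindles
        return disks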
Accumulated throughput as a function of the number of nodes/RAID arrays (GPFS)
[Chart: read and write throughput in MB/s (0-1200) versus the number of nodes/RAID arrays (1-10), one curve each for reading and writing.]
SAN FS
[Diagram: a metadata server cluster and Linux / Windows / Mac clients (running the STFS file system) talk over an IP network using the Storage Tank protocol (TCP or UDP); over a Fibre Channel / iSCSI SAN, the metadata servers reach the metadata volumes (attributes, policies) and the clients the file data volumes in the disk collection.]
• Metadata server failover
• Policy-based management
• Add and remove storage on-line
• $$$
LUSTRE
[Diagram: clients reach the metadata servers (active MDS with failover) and the Linux OST servers with their (SAN) disks over an IP network.]
• Object based
• LDAP config database
• Failover of OSTs
• Support for heterogeneous networks, e.g. InfiniBand
• Advanced security
• Open source
SRM: Storage Resource Manager
• Glue between the worldwide grid and local mass storage (SE)
• A storage element should offer:
– GridFTP
– an SRM interface
– information publication via MDS
• LCG has SRM2 almost ready; SRM1 is in operation
• SRM is built upon known MSS (CASTOR, dCache, Jasmine)
• dCache implements SRM v1
User SRM interaction
Legend:
– LFN: logical file name
– RMC: replication metadata catalog
– GUID: grid unique identifier
– RLC: replica location catalog
– RLI: replica location index
– RLC + RLI = RLS
– RLS: replica location service
– SURL: site URL
– TURL: transfer URL
[Diagram: the user resolves the LFN to a GUID via the RMC and the GUID to a SURL via the RLS; the SRM maps the SURL to a TURL and pins the file on the SRM-managed storage (dCache); the user then open()s and reads/writes through GFAL, and close() releases the pin. See the sketch below.]
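A minimal sketch of that resolution chain, with hypothetical stand-ins for the RMC / RLS / SRM calls (names, signatures and the fake catalogs are illustrative, not a real LCG or GFAL API):

    # Hypothetical stand-ins for the services in the diagram.
    CATALOG = {"lfn:/demo/file": "guid-0001"}             # fake RMC contents
    REPLICAS = {"guid-0001": "srm://se.example.org/demo"} # fake RLS contents

    def rmc_lfn_to_guid(lfn):    # 1. RMC: logical file name -> GUID
        return CATALOG[lfn]

    def rls_guid_to_surl(guid):  # 2. RLS: GUID -> site URL of a replica
        return REPLICAS[guid]

    def srm_pin(surl):           # 3. SRM: pin the replica, return a transfer URL
        return surl.replace("srm://", "dcap://")

    def srm_release(surl):       # 5. SRM: release the pin after close()
        pass

    def read_grid_file(lfn):
        surl = rls_guid_to_surl(rmc_lfn_to_guid(lfn))
        turl = srm_pin(surl)     # 4. the TURL is what the preloaded library opens
        try:
            print(f"would open {turl} via dcap/GFAL and read/write")
        finally:
            srm_release(surl)

    read_grid_file("lfn:/demo/file")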
In short
• Loosely coupled cluster: Ethernet
• Tightly coupled cluster: InfiniBand
• From 100 to 200 TB: locally attached, NFS and/or RFIO
• Above 200 TB: SAN, cluster file system and RFIO
• HSM via dCache
– Grid SRM interface
– tape: TSM / GSI solution?? or Vanderbilt Enstor
Some encountered difficulties
• Prescribed chain of software revision levels
– support is given only to those who live by the rules
– disk -> controller -> HBA -> driver -> kernel -> application
• Linux limitations
– block addressability < 2^31 (with 512-byte sectors, about 1 TiB per block device)
– number of LUs < 128
• NFS on Linux is a moving target
– enhancements or fixes almost always introduce new bugs
– limited experience in large (> 100 clients) installations
• Storage units become difficult to handle
– exchanging 1 TB and rebalancing a live 5 TB file system takes 20 hrs
– restoring a 5 TB file system can take up to a week
– procurement needs 1 FTE per 10^6 €
Thank you for your attention