Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
File Systems for your Cluster
Selecting a storage solution for tier 2
Suggestions and experiences
Jos van Wezel, Institute for Scientific Computing
Karlsruhe, Germany
Overview
• Estimated sizes and needs
• GridKa today and roadmap
• Connection models
• Hardware choices
• Software choices
• LCG
Scaling the tiers
• Tier 0: 2 PB disk, 10 PB tape, 6000 kSI (data collection, distribution to tier 1)
• Tier 1: 1 PB disk, 10 PB tape, 2000 kSI (data processing, calibration, archiving for tier 2, distribution to tier 2)
• Tier 2: 0.2 PB disk, no tape, 3000 kSI (data selections, simulation, distribution to tier 3)
• Tier 3: location and/or group specific
1 Opteron today ~ 1 kSI (see the sketch below)
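As a rough orientation, a minimal sketch (assuming the slide's rule of thumb of ~1 kSI per Opteron node; the tier figures are the ones listed above) that translates the compute targets into node counts:

    # Rough tier sizing, assuming ~1 kSI per Opteron node (rule of thumb above).
    TIERS = {
        "Tier 0": {"disk_pb": 2.0, "tape_pb": 10.0, "ksi": 6000},
        "Tier 1": {"disk_pb": 1.0, "tape_pb": 10.0, "ksi": 2000},
        "Tier 2": {"disk_pb": 0.2, "tape_pb": 0.0, "ksi": 3000},
    }
    KSI_PER_NODE = 1.0  # assumption from the slide: 1 Opteron ~ 1 kSI

    for name, t in TIERS.items():
        nodes = t["ksi"] / KSI_PER_NODE
        print(f"{name}: ~{nodes:.0f} nodes, {t['disk_pb']} PB disk, {t['tape_pb']} PB tape")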
GridKa growth
[Chart: GridKa capacity growth 2002-2009, spanning LCG Phase I, Phase II and Phase III; disk and tape capacity in terabytes (axis 0-3000) and compute power in kSI95 CPU (axis 0-6000).]
Storage at GridKa
[Diagram: GPFS servers (each 1 x 2 Gb FC, 1-2 x 1 Gb Ethernet) attached over a Fibre Channel SAN (32 x 2 Gb FC) to RAID 5 devices with 1120 disks totalling 120 TB. GPFS is served via NFS and dCache via dcap over TCP/IP to the cluster nodes; 2 dCache pool nodes with 1 TB of disk connect to a tape library through a TSM server.]
GridKa road map
• 2004-2005
– expand and stabilize the GPFS / NFS combination
– possibly install Lustre
– integrate dCache
– look for an alternative to TSM, if really needed
– try SATA disks
• 2004-2007
– decide the path for the parallel FS and dCache
– decide the tape backend
– scale for LHC (200-300 MB/s continuous for some weeks; see the calculation below)
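To put that sustained-rate target in perspective, a minimal back-of-the-envelope sketch (the run lengths are assumptions; the slide only says "some weeks"):

    # Data volume implied by 200-300 MB/s sustained over a multi-week run.
    SECONDS_PER_WEEK = 7 * 24 * 3600

    for rate_mb_s in (200, 300):
        for weeks in (2, 4):  # "some weeks": run lengths assumed here
            total_tb = rate_mb_s * SECONDS_PER_WEEK * weeks / 1e6
            print(f"{rate_mb_s} MB/s for {weeks} weeks -> ~{total_tb:.0f} TB")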
Tier 2 targets (source: G. Quast / Uni-KA)
• 5 MB/s throughput per node
• 300 nodes
• 1000 MB/s aggregate
• 200 TB overall disk storage
(see the sanity check below)
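A minimal sanity check on these targets (the concurrency reading is mine; the slide only lists the numbers):

    # Tier 2 targets: per-node rate x node count vs. the aggregate target.
    PER_NODE_MB_S = 5
    NODES = 300
    AGGREGATE_TARGET_MB_S = 1000

    peak = PER_NODE_MB_S * NODES           # 1500 MB/s if every node streams at once
    active = AGGREGATE_TARGET_MB_S / peak  # fraction of nodes busy at the target
    print(f"worst case {peak} MB/s; the 1000 MB/s target assumes ~{active:.0%} of nodes active")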
Estimate your needs (1)
• Can you charge for the storage?
– influences the choice between on-line and off-line (tape) storage
– classification of data (volatile, precious, high IO, low IO)
• How many nodes will access the storage simultaneously?
– absolute number of nodes
– number of nodes that run a particular job
– job classification to separate accesses
Estimate your needs (2)
• What kind of access (read/write/transfer sizes)?
– ability to control the access pattern
• pre-staging
• software tuning
– job classification to influence the access pattern
• spread via the scheduler
• What size will the storage eventually have?
– exploit random access via a large number of controllers: up to 4 TB or 100 MB/s per controller (see the sizing sketch below)
– need high-speed disks
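A minimal sizing sketch built on the rule of thumb above, 4 TB or 100 MB/s per controller (the example inputs are assumptions):

    import math

    # Controllers needed so neither the capacity nor the bandwidth limit is hit.
    TB_PER_CONTROLLER = 4      # rule of thumb from the slide
    MB_S_PER_CONTROLLER = 100  # rule of thumb from the slide

    def controllers_needed(total_tb, total_mb_s):
        by_capacity = math.ceil(total_tb / TB_PER_CONTROLLER)
        by_bandwidth = math.ceil(total_mb_s / MB_S_PER_CONTROLLER)
        return max(by_capacity, by_bandwidth)

    # Example: the tier 2 targets from the earlier slide.
    print(controllers_needed(total_tb=200, total_mb_s=1000))  # -> 50, capacity-bound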
Disk technology keys
• Disk areal density is larger than tape's
– disks are rigid
• The density growth rate for disks continues (but more slowly)
– deviation from Moore's law (same for CPUs)
• The superparamagnetic effect is not yet limiting progress
– the end has been in sight for 20 years
• The convergence of disk and tape costs has stopped
– still a factor of 4 to 5 difference
Disks and tape will both be around for at least another 10 years
Disk areal density vs. head-to-media spacing
[Chart: areal density (Mb/in², log scale from 10^-3 to 10^6) versus head-to-media spacing (nm, 1 to 100000); the data points range from the IBM RAMAC (1956: 5 MB, 2000 bits/in²) to the Hitachi Deskstar 7K400 (2004: 400 GB, 61 Gb/in²).]
To SATA or not, compared to SCSI/FC
• Up to 4 times cheaper (3 k€/TB vs. 10 k€/TB; compared below)
• 2 times slower in a multi-user environment (access time)
• Not really for 24/7 operation (more failures)
• Larger capacity per disk: max 140 GB SCSI vs. 400 GB SATA (today)
• No large-scale experience
• Drive warranty of only 1 or 2 years
• GridKa uses SCSI, SAN and expensive controllers
• Bad experiences with IDE NAS boxes (160 GB disks, 3Ware controllers)
• New products combine SATA disks with expensive controllers
• IO operations matter more than throughput for most accesses
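A minimal cost comparison at the per-TB prices above (the capacity is an assumed example, taken from the tier 2 target):

    # Disk cost at the slide's prices: SATA ~3 k€/TB, SCSI/FC ~10 k€/TB.
    PRICES_KEUR_PER_TB = {"SATA": 3, "SCSI/FC": 10}
    capacity_tb = 200  # assumed example: the tier 2 disk target

    for tech, price in PRICES_KEUR_PER_TB.items():
        print(f"{tech}: {capacity_tb * price} k€ for {capacity_tb} TB")
    # SATA 600 k€ vs. SCSI/FC 2000 k€: the gap buys a lot of spare drives.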
Network attached storage
IO path via the network:
[Diagram: cluster nodes reach the file servers over an IP network; the IO path from the file servers to their disks is local, via Fibre Channel or SCSI.]
NAS example
• Server with 4 dual SCSI buses: more than 1 GB/s transfer
• 4 x 2 SATA RAID boxes (16 x 250 GB each): ~4 TB per bus
• 2 x 4 x 2 x 4 TB = 64 TB on one server (see the breakdown below)
• Estimated 30 k€, or 35 k€ with point-to-point FC. Not that bad.
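A minimal sketch of that capacity chain (the factor names are my reading of the bullets above; the slide itself only gives the product):

    # Capacity chain of the example NAS server.
    channels_per_bus = 2   # each SCSI bus is dual
    buses = 4
    boxes_per_channel = 2  # the "4 x 2" SATA RAID boxes
    tb_per_box = 4         # 16 x 250 GB = 4 TB per box

    total_tb = channels_per_bus * buses * boxes_per_channel * tb_per_box
    print(total_tb)  # -> 64 TB on one server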
SAN
[Diagram: cluster nodes each have a direct IO path to the storage, via a Fibre Channel SAN or iSCSI.]
SAN or Ethernet
• SAN has easier management
– exchange of hardware without interruption
– joining separate storage elements
• iSCSI needs a separate network (SCSI over IP)
• Very scalable performance
– via switches or directors
• 1 SCSI bus maxes out at 320 MB/s
– better than current FC, but FC is duplex
– not a fabric
– example follows
• ELVM for easier management
• Network block device
• Kernel 2.6: new 16 TB limit
• SAN is expensive (500 € per HBA, 1000 € per switch port)
• The direct-connection limitation can be partly compensated via a high-speed interconnect (InfiniBand, Myrinet etc.)
• Tightly coupled cluster with InfiniBand; can be used for FC too, depending on the FS software.
Combining FC and InfiniBand
[Diagram: one group of cluster nodes reaches the SAN disk collection over an FCP network, another group over an InfiniBand network.]
Software to drive the hardware
• File systems
– GPFS (IBM): GridKa uses this, and so does Uni-KA
– SAN-FS (IBM, $$): supports a range of architectures
– Lustre (HP, $): Uni-KA Rechenzentrum cluster
– PVFS: stability is rather low
– GFS (now RedHat) or OpenGFS
– NFS
• the Linux implementation is messy, but RH 3.0 EL seems promising
• NAS boxes reach impressive throughput, are stable, offer easy management and grow as needed (NetApp, Exanet)
– Terragrid (very new)
• (Almost-POSIX) access via library preload (see the sketch below)
– write once / read many
– changing a file means creating a new one and deleting the old
– not usable for all software (e.g. no DBMS!)
– examples: GridFTP (gfal), (x)rootd (rfio), dCache (dcap/gfal/rfio)
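A minimal sketch of what write once / read many means for an application that wants to "update" a file (plain Python on a local path; under the preload libraries the same create-new, delete-old pattern applies, and the helper and naming convention are mine):

    import os

    def rewrite_wo_rm(path, new_data):
        """'Update' a file on a write-once store: the file itself is immutable,
        so write a complete new file and drop the old one."""
        tmp = path + ".new"          # hypothetical naming convention
        with open(tmp, "wb") as f:   # create the replacement and write it once
            f.write(new_data)
        os.remove(path)              # delete the old version
        os.rename(tmp, path)         # the new file takes the old name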
GPFS
[Diagram: file blocks A, B, C, D striped round-robin over a row of disks.]
• Stripes over n disks (see the sketch below)
• Linux and AIX, or combined
• Max FS size 70 TB
• HSM option
• Scalable and very robust
• Easy management
• SAN, IP+SAN, or IP only
• Add and remove storage on-line
• Vendor lock-in
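A minimal sketch of the round-robin striping the diagram shows (block size and disk count are assumptions; GPFS's real allocator is considerably more involved):

    # Round-robin striping: block i of a file lands on disk i mod n.
    BLOCK_SIZE = 256 * 1024  # assumed stripe block size
    N_DISKS = 4

    def stripe(data, n_disks=N_DISKS):
        disks = [[] for _ in range(n_disks)]
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for i, block in enumerate(blocks):
            disks[i % n_disks].append(block)  # IO spreads over all spindles
        return disks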
Accumulated throughput as a function of the number of nodes/RAID arrays (GPFS)
[Chart: read and write throughput in MB/s (0-1200) versus the number of nodes/RAID arrays (1-10), one curve each for reading and writing.]
SAN FS
[Diagram: a metadata server cluster and Linux / Windows / Mac clients (running the STFS file system) talk over an IP network using the Storage Tank protocol (TCP or UDP); over a Fibre Channel / iSCSI SAN, the metadata servers reach the metadata volumes (attributes, policies) and the clients the file data volumes in the disk collection.]
• Metadata server failover
• Policy-based management
• Add and remove storage on-line
• $$$
LUSTRE
[Diagram: clients reach the metadata servers (active MDS with failover) and the Linux OST servers with their (SAN) disks over an IP network.]
• Object based
• LDAP config database
• Failover of OSTs
• Support for heterogeneous networks, e.g. InfiniBand
• Advanced security
• Open source
SRM: Storage Resource Manager
• Glue between the worldwide grid and local mass storage (SE)
• A storage element should offer:
– GridFTP
– an SRM interface
– information publication via MDS
• LCG has SRM2 almost ready; SRM1 is in operation
• SRM is built upon known MSS (CASTOR, dCache, Jasmine)
• dCache implements SRM v1
User SRM interaction
Legend:
– LFN: logical file name
– RMC: replication metadata catalog
– GUID: grid unique identifier
– RLC: replica location catalog
– RLI: replica location index
– RLC + RLI = RLS
– RLS: replica location service
– SURL: site URL
– TURL: transfer URL
[Diagram: the user resolves the LFN to a GUID via the RMC and the GUID to a SURL via the RLS; the SRM maps the SURL to a TURL and pins the file on the SRM-managed storage (dCache); the user then open()s and reads/writes through GFAL, and close() releases the pin. See the sketch below.]
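A minimal sketch of that resolution chain, with hypothetical stand-ins for the RMC / RLS / SRM calls (names, signatures and the fake catalogs are illustrative, not a real LCG or GFAL API):

    # Hypothetical stand-ins for the services in the diagram.
    CATALOG = {"lfn:/demo/file": "guid-0001"}             # fake RMC contents
    REPLICAS = {"guid-0001": "srm://se.example.org/demo"} # fake RLS contents

    def rmc_lfn_to_guid(lfn):    # 1. RMC: logical file name -> GUID
        return CATALOG[lfn]

    def rls_guid_to_surl(guid):  # 2. RLS: GUID -> site URL of a replica
        return REPLICAS[guid]

    def srm_pin(surl):           # 3. SRM: pin the replica, return a transfer URL
        return surl.replace("srm://", "dcap://")

    def srm_release(surl):       # 5. SRM: release the pin after close()
        pass

    def read_grid_file(lfn):
        surl = rls_guid_to_surl(rmc_lfn_to_guid(lfn))
        turl = srm_pin(surl)     # 4. the TURL is what the preloaded library opens
        try:
            print(f"would open {turl} via dcap/GFAL and read/write")
        finally:
            srm_release(surl)

    read_grid_file("lfn:/demo/file")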
In short
• Loosely coupled cluster: Ethernet
• Tightly coupled cluster: InfiniBand
• From 100 to 200 TB: locally attached, NFS and/or RFIO
• Above 200 TB: SAN, cluster file system and RFIO
• HSM via dCache
– Grid SRM interface
– tape: TSM / GSI solution?? or Vanderbilt Enstor
Some encountered difficulties
• Prescribed chain of software revision levels
– support is given only to those who live by the rules
– disk -> controller -> HBA -> driver -> kernel -> application
• Linux limitations
– block addressability < 2^31 (with 512-byte sectors, about 1 TiB per block device)
– number of LUs < 128
• NFS on Linux is a moving target
– enhancements or fixes almost always introduce new bugs
– limited experience in large (> 100 clients) installations
• Storage units become difficult to handle
– exchanging 1 TB and rebalancing a live 5 TB file system takes 20 hrs
– restoring a 5 TB file system can take up to a week
– procurement needs 1 FTE per 10^6 €
Thank you for your attention