www.panasas.com
Go Faster. Go Parallel.
Scalable Performance of the Panasas Parallel File System
Brent Welch, Director of Software Architecture, Panasas, Inc.
NSC 08
Scalable Performance of the Panasas Parallel File System
Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson*, Brian Mueller, Jason Small, Jim Zelenka, Bin Zhou (Panasas Inc; *Carnegie Mellon and Panasas Inc)
USENIX FAST 08 Conference
Slide 3 | FAST08 Panasas, Inc.
Outline
Panasas Background, Hardware and Software
Per-File, Client Driven RAID
Declustering and Scalable Rebuild
Metadata management and performance
Slide 4 | FAST08 Panasas, Inc.
Panasas Company Overview
Founded: 1999 by Prof. Garth Gibson, co-inventor of RAID
Technology: parallel file system and parallel storage appliance
Locations: US HQ in Fremont, CA; R&D centers in Pittsburgh and Minneapolis; EMEA: UK, DE, FR, IT, ES, BE, Russia; APAC: China, Japan, Korea, India, Australia
Customers: FCS October 2003, deployed at 200+ customers
Market focus: energy, academia, government, life sciences, manufacturing, finance
Alliances and primary investors: ISVs and resellers (shown as logos on the slide)
Slide 5 | FAST08 Panasas, Inc.
Accelerating Enterprise Parallel Storage Adoption
Slide 6 | FAST08 Panasas, Inc.
Panasas Architecture
Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth
Scalable performance with commodity parts provides excellent price/performance
Object-based storage provides additional scalability and security advantages over block-based SAN file systems
Automatic management of storage resources to balance load across the cluster
Shared file system (POSIX) with the advantages of NAS plus the direct-to-storage performance of DAS and SAN
Slide 7 | FAST08 Panasas, Inc.
Panasas Blade Hardware
DirectorBlade and StorageBlade modules
Integrated GE switch; midplane routes GE and power
Shelf front: 1 DirectorBlade, 10 StorageBlades
Shelf rear: battery module (2 power units)
Slide 8 | FAST08 Panasas, Inc.
Panasas Product Advantages
Proven implementation with appliance-like ease of use/deployment
Running mission-critical workloads at global F500 companies
Scalable performance with Object-based RAID
No degradation as the storage system scales in size
Unmatched RAID rebuild rates – parallel reconstruction
Unique data integrity features
Vertical parity on drives to mitigate media errors and silent corruptions
Per-file RAID provides scalable rebuild and per-file fault isolation
Network verified parity for end-to-end data verification at the client
Scalable system size with integrated cluster management
Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers
Simultaneous access from over 12,000 servers
Slide 9 | FAST08 Panasas, Inc.
Internal cluster management makes a large collection of blades work as a single system
[Diagram: out-of-band architecture with direct, parallel paths from clients to storage nodes. Compute nodes (up to 12,000) run the PanFS client and speak iSCSI/OSD directly to the storage nodes (1,000+), which run OSDFS; manager nodes (100+) run the SysMgr and NFS/CIFS services and communicate with clients via RPC.]
Slide 10 | FAST08 Panasas, Inc.
Proven Panasas Scalability
Storage Cluster Sizes Today (e.g.)
Boeing: 50 DirectorBlades, 500 StorageBlades in one system (plus 25 DirectorBlades and 250 StorageBlades each in two other smaller systems)
LANL RoadRunner: 100 DirectorBlades, 1000 StorageBlades in one system today, planning to increase to 144 shelves next year
Intel has 5,000 active DirectFLOW clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000-client version of 2.3, and will deploy "lots" of compute nodes against 3.2 later this year.
BP uses 200 StorageBlade storage pools as their building block
LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades
Most customers run systems in the 100 to 200 blade size range
Slide 11 | FAST08 Panasas, Inc.
Linear Performance Scaling
Breakthrough data throughput AND random I/O
Performance and scalability for all workloads
Slide 12 | FAST08 Panasas, Inc.
Scaling the system
Scale the system and clients at the same time (N-to-N IOzone)
[Chart: IOzone read and write sequential I/O performance, 4 MB block size, on the Thunderbird cluster. X-axis: number of StorageBlades/clients (10/8, 40/32, 80/64, 120/96); y-axis: aggregate bandwidth (MB/sec, 0 to 4,500). Read and write bandwidth both grow as StorageBlades and clients are added together.]
Slide 13 | FAST08 Panasas, Inc.
Scaling Clients
IOzone Multi-Shelf Performance Test (4GB sequential write/read)
[Chart: aggregate bandwidth (MB/sec, 0 to 3,000) vs. number of clients (10 to 200), with separate write and read curves for 1-, 2-, 4-, and 8-shelf systems.]
Slide 14 | FAST08 Panasas, Inc.
IOR Segmented IO
IOR -a POSIX -C -i 3 -t 4M -b $num_clients
[Chart: throughput (MB/sec, 0 to 1,000) vs. number of clients (0 to 36) for shared-file read, shared-file write, separate-file read, and separate-file write.]
Slide 15 | FAST08 Panasas, Inc.
Panasas Parallel Storage Outperforms Clustered NFS
Paradigm GeoDepth Prestack Migration (source: Paradigm & Panasas, February 2007)
[Chart: total runtime in minutes. NFS on 4 shelves: 7 hours 17 mins at an average read bandwidth of 300 MB/s. DirectFLOW on 1 shelf: 5 hours 16 mins at 350 MB/s. DirectFLOW on 4 shelves: 2 hours 51 mins at 650 MB/s, 2.5x faster (less time) than NFS.]
Slide 16 | FAST08 Panasas, Inc.
FLUENT Comparison of PanFS vs. NFS on University of Cambridge Cluster
[Chart: time of solver + data file write (seconds, lower is better) vs. number of cores for the 111M-cell truck aero model. PanFS with FLUENT 12: 2541 s at 64 cores, 1318 s at 128 cores, 779 s at 256 cores. NFS with FLUENT 6.3: 4568 s, 2680 s, and 1790 s at the same core counts. Annotated scaling factors between successive core counts: 1.9x and 1.7x on PanFS versus 1.7x and 1.5x on NFS.]
NOTE: Read times are not included in these results
Slide 17 | FAST08 Panasas, Inc.
Details of the FLUENT 111M Cell Model
Number of cells: 111,091,452
Solver: PBNS, DES, unsteady
Iterations: 5 time steps, 100 total iterations, data save after last iteration
Output size: FLUENT v6.3 serial I/O .dat file, 14,808 MB; FLUENT v12 serial I/O .dat file, 16,145 MB; FLUENT v12 parallel I/O .pdat file, 19,683 MB
Workload: unsteady external aero for a 111M cell truck; 5 time steps with 100 iterations and a single .dat file write
University of Cambridge DARWIN cluster (http://www.hpc.cam.ac.uk): Dell, 585 nodes, 2340 cores, 8 GB per node, 4.6 TB total memory
CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache
Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches
File system: Panasas PanFS, 4 shelves, 20 TB capacity
Operating system: Scientific Linux CERN SLC release 4.6
Slide 18 | FAST08 Panasas, Inc.
Automatic per-file RAID
System assigns RAID level based on file size
<= 64 KB: RAID 1 for efficient space allocation
> 64 KB: RAID 5 for optimum system performance
> 1 GB: two-level RAID 5 for scalable performance
RAID 1 and RAID 10 for optimized small writes
Automatic transition from RAID 1 to 5 without re-striping
Programmatic control for application-specific layout optimizations
Create with layout hint
Inherit layout from parent directory
[Figure: a small file mirrored with RAID 1; a large file striped with RAID 5; a very large file striped with two-level RAID 5.]
Clients are responsible for writing data and its parity
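As a rough illustration of the size-based policy above, the following Python sketch (hypothetical helper, not Panasas source code) shows how a layout choice keyed on file size might look:

def choose_layout(file_size_bytes):
    """Pick a per-file RAID layout from the file's size.
    Thresholds follow the slide: RAID 1 up to 64 KB, RAID 5 up to 1 GB,
    two-level RAID 5 beyond that. Illustrative sketch only."""
    KB, GB = 1024, 1024 ** 3
    if file_size_bytes <= 64 * KB:
        return "RAID 1"            # mirroring: space-efficient for small files
    if file_size_bytes <= 1 * GB:
        return "RAID 5"            # parity striping for system throughput
    return "two-level RAID 5"      # stripes of stripes for very large files

print(choose_layout(32 * 1024))      # RAID 1
print(choose_layout(100 * 1024**2))  # RAID 5
print(choose_layout(5 * 1024**3))    # two-level RAID 5

A file that grows past 64 KB transitions from RAID 1 to RAID 5 without re-striping the data already written.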
Slide 19 | FAST08 Panasas, Inc.
Declustered RAID
Files are striped across component objects on different StorageBlades
Component objects include file data and file parity for reconstruction
File attributes are replicated with two component objects
Declustered, randomized placement distributes RAID workload
[Figure: component objects, as mirrored pairs or 9-OSD parity stripes, are placed randomly across a 2-shelf BladeSet. When a blade fails, rebuild reads about half of each surviving OSD and writes a little to each OSD, so rebuild scales linearly with the size of the pool.]
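To make declustered placement concrete, here is a small simulation (invented parameters, not PanFS code) showing that rebuild reads spread almost evenly over the survivors:

import random
from collections import Counter

def place_stripes(num_files, stripe_width, num_osds, seed=0):
    """Randomly pick stripe_width distinct OSDs for each file's parity group."""
    rng = random.Random(seed)
    return [rng.sample(range(num_osds), stripe_width) for _ in range(num_files)]

placements = place_stripes(num_files=10000, stripe_width=9, num_osds=20)

failed = 7                 # one StorageBlade fails
load = Counter()           # rebuild reads demanded of each survivor
for stripe in placements:
    if failed in stripe:
        for osd in stripe:
            if osd != failed:
                load[osd] += 1

# The read load lands nearly evenly on all 19 survivors, so every OSD
# added to the pool adds rebuild bandwidth.
print(min(load.values()), max(load.values()))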
Slide 20 | FAST08 Panasas, Inc.
Scalable RAID Rebuild
Shorter repair time in larger storage pools
Customers report 30-minute rebuilds for 800 GB in a 40+ shelf BladeSet
Variability at 12 shelves due to uneven utilization of DirectorBlade modules
Larger numbers of smaller files was better
Reduced rebuild at 8 and 10 shelves because of wider parity stripe
Rebuild bandwidth is the rate at which data is regenerated (writes)
Overall system throughput is N times higher because of the necessary reads
Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel
Declustering spreads disk I/O over more disk arms (StorageBlades)
[Chart: rebuild bandwidth (MB/sec, 0 to 140) vs. number of shelves (0 to 14) for four cases: one volume with 1 GB files, one volume with 100 MB files, N volumes with 1 GB files, and N volumes with 100 MB files.]
Slide 21 | FAST08 Panasas, Inc.
RAID Rebuild vs Stripe Width
Panasas system automatically selects stripe width up to 11 wide
8 to 11 wide is best for bandwidth performance
System packs an even number of stripes into Bladeset, leaving at least one spare
Narrower stripes rebuild faster
Less data must be read for each reconstructed write
More DirectorBlades helps
1, 2, or 3 per shelf
50+ in a single system
[Chart: rebuild bandwidth (MB/sec, 0 to 160) vs. RAID-5 stripe configuration from 1+1 to 7+1, for two systems: 4 managers + 18 storage blades and 3 managers + 8 storage blades.]
Slide 22 | FAST08 Panasas, Inc.
Scalable rebuild is mandatory
Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time
Larger storage pools must mitigate their risk by decreasing repair times
The math says: if (e.g.) 100 drives are in 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, the risk is the same as having a single RAID set of 100 drives with a rebuild time of N/10.
Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read.
Only declustering provides scalable rebuild rates.
Risk grows with the total number of drives, the number of drives per RAID set, and the repair time.
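A quick check of that proportionality in Python (treating double-failure risk as total drives x drives per set x repair time, per the argument above):

def relative_risk(total_drives, drives_per_set, repair_hours):
    """Double-failure exposure, up to a constant factor: the first failure
    can hit any drive; the second matters only if it lands in the same
    RAID set while the repair is still running."""
    return total_drives * drives_per_set * repair_hours

N = 30.0  # illustrative repair time in hours
print(relative_risk(100, 10, N))        # 10 RAID sets of 10 drives each
print(relative_risk(100, 100, N / 10))  # one 100-drive set, 10x faster repair
# Both print 30000.0: equal risk. Declustering is what actually makes
# repair time shrink as the pool grows.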
Slide 23 | FAST08 Panasas, Inc.
Creating a File in 2 milliseconds
[Diagram: numbered file-create flow. (1) The client sends Create to the metadata server; (2) the server records the operation in its op-log and cap-log; (3) the server issues creates to the OSDs, which journal them in their transaction logs (4-5); (6) the server replies to the client, saving the reply in its reply cache; (7-8) the client writes file data directly to the OSDs. The logs live in NVRAM and are replicated to a backup manager in 90 usec over 1GE.]
Directories are objects with lists of name-to-object-ID and location-hint mappings
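The speed comes from journaling, not fast disks. A minimal sketch of the idea (hypothetical classes; the real system replicates NVRAM logs between DirectorBlade modules):

class Log:
    def __init__(self):
        self.entries = []   # stands in for an NVRAM region

class MetadataServer:
    """Write-ahead op-log: append locally, mirror to the backup, then
    acknowledge; on-disk state is updated asynchronously."""
    def __init__(self, backup_log):
        self.oplog = Log()
        self.backup = backup_log

    def create(self, name):
        entry = {"op": "create", "name": name}
        self.oplog.entries.append(entry)
        self.backup.entries.append(entry)  # ~90 usec replication over 1GE
        return "ack"  # safe to reply before any disk I/O completes

server = MetadataServer(backup_log=Log())
print(server.create("file1"))  # the client sees a ~2 ms create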
Slide 24 | FAST08 Panasas, Inc.
Metarate operations/sec
[Chart: create and utime operations/sec (0 to 4,000) on DB-100 and DB-100a DirectorBlades, each accessed via DirectFLOW and via NFS, compared with a Linux NFS server.]
Slide 25 | FAST08 Panasas, Inc.
Metarate operations/sec
[Chart: aggregate utime and create operations/sec (0 to 2,500) as the number of clients scales from 1 to 15.]
Slide 26 | FAST08 Panasas, Inc.
Metarate operations/sec
[Chart: reported operations/sec (0 to 50,000) for 1 to 15 clients.]
Reported rate is speed of slowest client times number of clients, which is an MPI metric
(Cache hits in the client that created all the files)
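Restated as code, with invented sample rates:

per_client_ops = [3500, 3400, 3600, 3300]  # hypothetical per-client rates
reported = len(per_client_ops) * min(per_client_ops)
print(reported)  # 13200: every client is charged the slowest client's rate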
Slide 27 | FAST08 Panasas, Inc.
MPI Coordinated File Operations
Panasas mdtest Performance
[Chart: operations/second (0 to 10,000) for Create File, Stat File, Remove File, Create Dir, Stat Dir, and Remove Dir, comparing unique-directory, single-process, and shared-directory runs.]
mpirun -n 64 mdtest -d $dir -n 100 -i 3 -N 1 -v -u
Slide 28 | FAST08 Panasas, Inc.
Summary
Per-file, object-based RAID gives scalable on-line performance
Offloads the metadata server
Parallel block allocation among the storage nodes
Declustered parity group placement yields linear increase in rebuild rates with the size of the storage pool
May become the only way to effectively handle large capacity drives
Metadata is stored as attributes on objects
File create is complex, but made fast with efficient journal implementation
Coarse-grained metadata workload distribution is a simple way to scale
Slide 29 | FAST08 Panasas, Inc.
Technology Review
Turn-key deployment and automatic resource configuration
Scalable Object RAID
Very fast RAID rebuild
Vertical Parity to trap silent corruptions
Network parity for end-to-end data verification
Distributed system platform with quorum-based fault tolerance
Coarse grain metadata clustering
Metadata fail over
Automatic capacity load leveling
Storage Clusters scaling to ~1000 nodes today
Compute clusters scaling to 12,000 nodes today
Blade-based hardware with 1Gb/sec building block
Bigger building block going forward
Slide 30 | FAST08 Panasas, Inc.
The pNFS Standard
The pNFS standard defines the NFSv4.1 protocol extensions between the server and client
The I/O protocol between the client and storage is specified elsewhere, for example:
SCSI Block Commands (SBC) over Fibre Channel (FC)
SCSI Object-based Storage Device (OSD) over iSCSI
Network File System (NFS)
The control protocol between the server and storage devices is also specified elsewhere, for example:
SCSI Object-based Storage Device (OSD) over iSCSI
[Diagram: pNFS client, NFSv4.1 server, and storage; the client speaks NFSv4.1 to the server and the I/O protocol to storage, while the server uses the control protocol to the storage devices.]
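The division of labor can be sketched as a toy Python model (LAYOUTGET and GETDEVICEINFO are real NFSv4.1 operations; everything else here is invented for illustration):

class FakeMDS:
    """Stand-in NFSv4.1 metadata server that hands out layouts."""
    def __init__(self, layouts, devices):
        self.layouts, self.devices = layouts, devices
    def layoutget(self, path):         # models the LAYOUTGET operation
        return self.layouts[path]
    def getdeviceinfo(self, dev_id):   # models GETDEVICEINFO
        return self.devices[dev_id]

class FakeOSD:
    """Stand-in storage device holding stripe units of a file."""
    def __init__(self, chunks):
        self.chunks = chunks
    def read(self, index):
        return self.chunks[index]

osds = {0: FakeOSD({0: b"hello "}), 1: FakeOSD({0: b"world"})}
mds = FakeMDS(layouts={"/f": [(0, 0), (1, 0)]}, devices=osds)

# pNFS read: one round trip to the MDS for the layout, then out-of-band
# reads go directly to the storage devices, bypassing the server.
layout = mds.layoutget("/f")
print(b"".join(mds.getdeviceinfo(dev).read(idx) for dev, idx in layout))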
Slide 31 | FAST08 Panasas, Inc.
Key pNFS Participants
Panasas (Objects)
Network Appliance (Files over NFSv4)
IBM (Files, based on GPFS)
EMC (Blocks, HighRoad MPFSi)
Sun (Files over NFSv4)
U of Michigan/CITI (Files over PVFS2)
Slide 32 | FAST08 Panasas, Inc.
pNFS Status
pNFS is part of the IETF NFSv4 minor version 1 standard draft
Working group is passing draft up to IETF area directors, expect RFC later in ’08
Prototype interoperability continues: San Jose Connect-a-thon March '06, February '07, May '08
Ann Arbor NFS Bake-a-thon September ’06, October ’07
Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)
Availability: TBD, gated behind NFSv4 adoption and working implementations of pNFS
Patch sets to be submitted to Linux NFS maintainer starting “soon”
Vendor announcements in 2008
Early adopters in 2009
Production ready in 2010
www.panasas.com
Go Faster. Go Parallel.
Questions?
Thank you for your time!
Slide 34 | FAST08 Panasas, Inc.
Panasas Global Storage Model
[Diagram: client nodes on a TCP/IP network mount two Panasas systems by DNS name. System A holds BladeSet 1 (volumes VolX, VolY, VolZ) and BladeSet 2 (volumes delta, home); System B holds BladeSet 3 (volumes VolL, VolM, VolN). A BladeSet is a physical storage pool; a volume is a logical quota tree. Example paths: /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.]
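A sketch of how such a path decomposes (hypothetical parser mirroring the naming in the diagram):

def parse_panfs_path(path):
    """Split /panfs/<dns-name>/<volume>/<rest>: the DNS name selects the
    Panasas system, the volume names a quota tree in one of its BladeSets,
    and the rest is the file within that volume."""
    parts = path.strip("/").split("/")
    assert parts[0] == "panfs"
    return parts[1], parts[2], "/".join(parts[3:])

print(parse_panfs_path("/panfs/sysa/delta/file2"))  # ('sysa', 'delta', 'file2')
print(parse_panfs_path("/panfs/sysb/volm/proj38"))  # ('sysb', 'volm', 'proj38')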
Slide 35 | FAST08 Panasas, Inc.
IB and other network fabrics
Panasas is a TCP/IP, GE-based storage product
Universal deployment, Universal routability
Commodity price curve
Panasas customers use IB, Myrinet, Quadrics, …
Cluster interconnect du jour for performance, not necessarily cost
IO routers connect cluster fabric to GE backbone
Analogous to an “IO node”, but just does TCP/IP routing (no storage)
Robust connectivity through IP multipath routing
Scalable throughput at approx 650 MB/sec per IO router (PCI-e class)
Working on a 1GB/sec IO router
IB-GE switching platforms
Cisco/Voltaire switch provides wire-speed bridging
Slide 36 | FAST08 Panasas, Inc.
Multi-Cluster sharing: scalable BW with fail over
[Diagram: compute clusters A, B, and C, each with I/O nodes, connect through Layer 2 switches to shared Panasas storage and to the site network with NFS, DNS, Kerberos, and archive services; colors depict subnets.]
Slide 37 | FAST08 Panasas, Inc.
New and Unique: Network Parity
Horizontal Parity
Vertical Parity
Network Parity
Extends parity capability across the data path to the client or server node
Enables End-to-End data integrity validation
Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission
Client either receives valid data or an error notification
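A toy end-to-end check (PanFS stripes carry RAID 5 parity; this sketch just XORs equal-sized byte strings to show the client-side verification step):

from functools import reduce

def xor_parity(chunks):
    """XOR equal-length byte strings; stands in for real RAID 5 parity."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def verified_read(data_chunks, stored_parity):
    """Recompute parity over the received stripe and compare with the
    parity read from storage; corruption anywhere on the path from the
    disk to the client shows up as a mismatch."""
    if xor_parity(data_chunks) != stored_parity:
        raise IOError("parity mismatch: corruption detected end to end")
    return b"".join(data_chunks)

stripe = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(stripe)
print(verified_read(stripe, parity))  # b'abcdefghijkl'
# Flip any bit in the stripe or parity and verified_read raises instead
# of silently returning bad data.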