Repack and Tape Label Options
Tim Bell, Charles Curran, Gordon Lee
June 27th 2008
CERN - IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)
The Bulk Repack Problem
• IBM and Sun have new drives coming
– Aim for production at CERN in January
– Higher capacity (1TB per tape)
– Faster drives (up to 160MBytes/s)
• Require repacking to avoid buying new media and robot slots
• Current dataset
– 104 million files
– 15PB storage
– 39000 tapes
Why are we copying?
Vendor   Current  Future   At CERN  Delta Capacity  Cost to purchase
IBM      700GB    1000GB   9692     2.9PB           0.5MCHF
Sun 513  500GB    1000GB   14890    7.4PB           1.3MCHF
Sun 613  500GB    1000GB   15408    7.7PB           1.4MCHF
Total                               18.0PB          3.2MCHF
• Cost to purchase is the additional media and slots required if we write at new densities but do not copy and recycle old tapes
• Adds up to a saving of 3.2M CHF
• With higher density and repack, media requirements for 2009 are covered
• Without higher density, 2 new 10,000 slot robots would be required in 2009
Per-VO file sizes
[Chart: Average File Size on Tape per VO (MB, 0-3500) for alice, atlas, cms, compass, lhcb, na48, other, user]
• Some improvements in file sizes from LHC experiments over the past 6 months but no major revolution expected
• Current average is 154MBytes per file
Per Tape Distribution
• Long tail up to 154,000 files per tape
• Only 25% of tapes have average file size >1 Gbyte
• Projected year end 2008 based on LHC usage
[Histogram: Distribution of files per tape (Tape Count Projected End 2008); x-axis: files per tape in 500-file bins from 0-499 up to 29500-29999; y-axis: tapes, 0-14000]
Castor tape formats
[Diagram: Castor tape layouts for files A, B, C. AUL: each file is preceded by header labels and followed by trailer labels, with tape marks (M) after each label set and each data file. NL: each file is simply followed by a tape mark (A M B M C M).]
File Marks: In AUL, these are written at the end of each label and each user data file. In NL, they are written at the end of each user file.
Labels: Metadata about the file contents. These are stored as full data files on the tape with a terminating file mark. Headers go in front of the user data, trailers at the end.
File size and performance
• AUL shows 7.3 seconds overhead per file
• NL shows 3.3 seconds overhead per file
• Tests used a low level tape to tape copy covering read/cksum/write
• Figures confirmed by running repack2 and Castor to aul and nl tapes
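The overheads above can be folded into a simple throughput model. This is a sketch only, assuming the overhead is a fixed additive cost per file and a 160MBytes/s drive (the new-drive speed quoted on slide 2); it is not the measurement method used for these figures.

```python
# Effective drive throughput as a function of file size, assuming a
# fixed additive per-file overhead (7.3 s for aul, 3.3 s for nl).

def effective_rate(file_mb, native_mbps, overhead_s):
    """MB/s actually achieved when each file costs `overhead_s` extra."""
    return file_mb / (file_mb / native_mbps + overhead_s)

# 154 MB average file on a 160 MB/s drive:
for label, overhead in [("aul", 7.3), ("nl", 3.3), ("il", 0.0)]:
    print(label, round(effective_rate(154, 160, overhead), 1), "MB/s")
```

With the current 154MB average file, this model puts aul well under 20MB/s on a nominally 160MB/s drive, which is why the label format dominates performance for small files.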
Repack in a year
• This is the number of drives which would need to be dedicated to complete the repack within 1 year
• The write performance varies with different output label types
• Includes projected data to year end 2008
• Drive costs around 35K CHF over 3 years
[Chart: drives required (0-60), split into Write and Read, by Output Tape Label Type (aul, nl, il)]
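As an order-of-magnitude cross-check of these drive counts, the following sketch assumes 160MBytes/s drives, the 15PB / 104 million file dataset from slide 2, and a fixed per-file overhead. It counts a single pass, so the read side and the write side each need roughly this many drives, and it ignores mount times and the projected 2008 growth; it is not the model behind the chart.

```python
# Order-of-magnitude estimate of drives kept busy for a year repacking
# 15 PB / 104 million files, given a fixed per-file overhead.

SECONDS_PER_YEAR = 365 * 24 * 3600

def drives_needed(total_pb, total_files, drive_mbps, overhead_s):
    total_mb = total_pb * 1e9                  # 1 PB ~ 1e9 MB
    busy_seconds = total_mb / drive_mbps + total_files * overhead_s
    return busy_seconds / SECONDS_PER_YEAR

for label, overhead in [("aul", 7.3), ("nl", 3.3), ("il", 0.0)]:
    n = drives_needed(15, 104e6, 160, overhead)
    print(f"{label}: ~{n:.0f} drives per pass")
```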
Ignore worst cases
• Determine drive requirements if we ignore the projected 6000 tapes with >10000 files
• Leave the worst cases in the robot unrepacked (i.e. a cost of 0.5MCHF for 3000 more tapes/slots/robots)
[Chart: drives required (0-40), split into Write and Read, by Output Tape Label Type (aul, nl, il)]
Repack using 20 drives
• The approach is to take easy tapes with large files first
• Repack using aul tapes would take over 3 years to complete
• Max80 figures reflect the performance if the engine is able to sustain reading at 80MBytes/s; Max50 for 50MBytes/s and Max25 for 25MBytes/s
• The ‘to migrate’ queue would be around 400,000 files at the end of processing if 20 drives are used
[Chart: Repack Completed (0-100%) vs Days Taken Using 20 Drives (0-700), for aul,max80; nl,max80; il,max80; aul,max50; aul,max25]
IL – Internal Label Format
• New format of data on tape to reduce the number of file marks
• Stores data located by block offset rather than file sequence number
• Tape mark only at the end of the migration stream rather than end of each file
• A simple prototype copy program has produced 85MBytes/s. Full drive speed can be achieved if shared buffers are used.
• This label format is new and is therefore not currently supported by Castor
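The block layout described above can be sketched in code. Everything below (the block size, field widths, and field ordering) is a hypothetical illustration of the idea, not the actual IL on-tape format:

```python
# Hypothetical sketch of packing one IL tape block: a small internal
# label (VID, checksum, Castor name, block number) followed by a data
# chunk that pads the block out to a fixed tape block size.
import struct
import zlib

TAPE_BLOCK = 256 * 1024          # assumed fixed tape block size
LABEL_FMT = "<6sI256sQ"          # VID, adler32, Castor name, block no.
LABEL_SIZE = struct.calcsize(LABEL_FMT)
CHUNK_SIZE = TAPE_BLOCK - LABEL_SIZE   # chunk + label = tape block

def pack_block(vid, name, block_no, chunk):
    assert len(chunk) <= CHUNK_SIZE
    label = struct.pack(LABEL_FMT,
                        vid.ljust(6).encode(),
                        zlib.adler32(chunk),
                        name.encode().ljust(256),
                        block_no)
    # Pad the final chunk so every block on tape is the same size.
    return label + chunk.ljust(CHUNK_SIZE, b"\0")

blk = pack_block("I12345", "/castor/cern.ch/user/t/timbell/f1", 0,
                 b"payload")
assert len(blk) == TAPE_BLOCK
```

Because every block is self-describing, a reader can locate a file by block offset alone, which is what removes the need for per-file tape marks.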
IL tape format
[Diagram: Castor tape layouts for files A, B, C. AUL: header/trailer labels and tape marks per file. NL: a tape mark after each file. IL: files written back-to-back with a single tape mark (M) at the end of the stream.]
Internal Label: Contains the VID, checksum, Castor name and block number. These are stored in the first few kilobytes of each tape block written.
User Data: Stored after the internal label, completing a full tape block. The Castor file is split into chunks sized so that chunk size + internal label size = tape block size. This unit repeats until end of data. No tape mark is written at the end of each file, since the internal label identifies which file a block belongs to.
File Marks: In AUL, these are written at the end of each label and each user data file. In NL, at the end of each user file. In IL, at the end of the migration stream.
Labels: Metadata about the file contents. These are stored as full data files on the tape with a terminating file mark. Headers go in front of the user data, trailers at the end.
Intermediate Conclusion
• Given the file sizes and drives currently being used, the label format is the limiting factor for performance
• The engine used for copying is a secondary performance factor. This factor becomes more important for label formats or file sizes which support higher speeds such as 50MB/s or more.
• Scanning tapes at full drive speed can be used to validate a complete repack before the commit to the name server
Option A – bulk repack
• Need a new low level label format using block addressing to write many castor files without tape marks
• Develop a new low level repack program which writes out in il format using direct tape to tape copy with two tape drives on a tape server
• Enhance Castor to support reading il format in the short term
• Writing il format requires modifications to rtcpd/rtcpclientd, as current writing is file-by-file and il requires a full stream. This is unlikely before the clustering implementation is done, so continue to write new data in aul format until clustering is complete, which will require rework in this area.
Option B - clustering
• Architecture task force recommended to cluster related data onto tape.
• One possible implementation would be to merge many related Castor files into a single large file when migrating to tape, to be recalled as a unit.
• Start using the repack2 engine at maximum speed and aul tapes on tapes with large files until clustering is available
• Once clustering is available, repack many tapes in parallel to allow related files to be grouped together on tape for more efficient recall.
• Need at least 30 disk servers for production repack service class to ensure reasonable clustering and drive performance.
• The clustering implementation needs to be architected and implemented, with policies defined and deployed, at the very latest by end 1Q 2009 to avoid delays in the repacking process.
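The grouping step in this option can be sketched as a simple bin-packing of related files into large migration units. Here "related" means "same directory" and the 50GB target size is an assumed policy; neither is part of the task force's actual design.

```python
# Illustrative sketch of clustering: pack related Castor files into
# large migration units so a recall mounts one tape and reads one
# contiguous region, instead of one mount per small file.

CLUSTER_BYTES = 50 * 1024**3          # assumed target unit size: 50 GB

def make_clusters(files):
    """files: list of (path, size_bytes) -> list of lists of paths."""
    # Sort so files from the same directory are adjacent.
    ordered = sorted(files, key=lambda f: f[0].rsplit("/", 1)[0])
    clusters, current, current_bytes = [], [], 0
    for path, size in ordered:
        if current and current_bytes + size > CLUSTER_BYTES:
            clusters.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        clusters.append(current)
    return clusters

demo = [("/castor/expA/run1/f1", 20 * 1024**3),
        ("/castor/expB/run9/g1", 20 * 1024**3),
        ("/castor/expA/run1/f2", 20 * 1024**3)]
units = make_clusters(demo)          # the two expA files pack together
```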
Option C – tape to tape copy
• Develop a new low level repack program which is able to write nl tape format output using direct tape to tape copy with two tape drives on a tape server
• Write in nl format and partial re-scan of tapes on completion to validate contents
• 80% of tapes (giving 14PB additional space) can be completed in 1 year with 25 drives, which may be sufficient for 2009 data
Costing
• Option A – bulk repack
  – Development for
    • Bulk repack tool
    • Support of new label format for read in Castor
    • Name server fields for block offset?
  – 22 drives for 1 year
• Option B – start repack2/aul then clustering
  – Development for
    • 2nd level disk hierarchy
    • Legacy cluster definitions
  – Hardware
    • 33 disk servers @ 8K CHF / disk server, dedicated for one year
    • Fat tape servers purchase required?
  – 33 drives for 1 year
• Option C – copy only good cases to nl
  – Development for
    • Bulk repack tool
  – 25 drives for 1 year
  – Purchase 3000 additional slots (0.5M CHF)
Points
• What tool for repack 2010/11?
  – Must repack all of the 50PB data in 2011 to new media
  – 10Gbit/s ethernet and drives at 160MBytes/s
  – Do we still need a low level tool anyway even if clustering can be used?
  – Can we avoid the repack2 restrictions on the number of concurrent files being processed and submitted to the stager?
• What risk with the new tape IL format?
  – Complete testing before EOY 2008
  – Nameserver/stager changes for block offset
• What risk with nl format?
  – If tapes are appended to, a tape drive malfunction may overwrite data
  – Write to the tapes once only, then scan and commit, to reduce the nl risk
  – Test a recovery program based on name server checksums
• What risk with the new bulk tool?
  – How can we test it? A scanning tool is also required for validation
• What risk for the clustering deliverable?
  – Architecture: will multiple user files per tape file be selected?
  – Additional hardware for the disk layer / fat tape layer
  – Define experiment and legacy clusters
  – Schedule is critical for repack success; otherwise, emergency orders for tape capacity
Points (contd)
• How many drives can we spare?
  – Need to get underway during low data recording periods
  – Further drive purchase? Use old drives for reading?
  – More drives means more load on the stager as queues grow longer
• Can we reduce read mounting in the future by repack/clustering?
  – Use repack as a rebalancing tool by reading in several tapes and re-clustering
  – What is the access frequency for older LHC data?
  – Is the disk layer large enough to be able to cluster effectively on repack?
• What are the relative efforts?
  – Developing new clustering solutions? Needs to be done anyway, but the repack requirements may bring time pressure
  – Investment to tune repack2 to get the necessary throughput and robustness will need to continue and occupy substantial development resources
  – The low level tool would require scripting and a method to track outstanding work, similar to that used for repack-1
Conclusion
• ?
Backup Slides
What is in an AUL label
Vol1 (Volume Label): Contains the Volume Serial Number (VSN) in 6 bytes; this is not necessarily the same as the volume identifier (VID). The VID is site dependent and is normally the number on the cartridge sticker. Vol1 also specifies whether label information on the tape is coded in EBCDIC or ASCII.
Hdr1 (Header 1 label): Contains the last 17 chars of the filename and the date/time of writing the file.
Hdr2 (Header 2 label): Contains a 5 character field for the block size in bytes used for the file. The 5 characters limit the blocksize to 99999, so for Castor tapes the "real" block size is held in uhl1. Hdr2 also contains the tape format: F for fixed block and U for unformatted. Castor uses an FS format, which is fixed block with the option that the last block of the file can be truncated.
uhl1 (User header label 1): uhl labels can be defined to hold any non standard data such as the full file name. In Castor, uhl1 holds the real block size, which can be greater than the 99999 five character value. These can be repeated several times in 80 character records.
eof1 (Trailer label 1): A trailer label separated by a tape mark from the data.
eof2 (Trailer label 2): A second trailer label.
utl1 (User trailer label 1): Like the user header labels, utl labels can be defined to hold any non standard data. In Castor, utl1 holds the "actual" number of blocks written to the file.
What is in AUL / UHL 1 ?
Field                        Example
User Header label            UHL (UTL also possible)
Header Label Number          1
Site                         CERN
Actual file sequence number  000012345
Actual record length         000262144
Tape mover hostname          TPSRV201
Drive manufacturer           STK
Drive model                  T9940B
Drive serial number          456000001642
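Label records such as uhl1 are fixed 80-character records. The following is a hedged sketch of how a record carrying a few of the fields in this table might be assembled; the field order, widths and offsets are illustrative assumptions, not Castor's real layout.

```python
# Hypothetical sketch: build an 80-character uhl1-style record from a
# few of the fields above. Offsets and widths are assumptions.

def make_uhl1(site, file_seq, record_len, host):
    body = (
        "UHL1"                      # label identifier + label number
        + site.ljust(8)             # site, e.g. CERN
        + f"{file_seq:09d}"         # actual file sequence number
        + f"{record_len:09d}"       # actual record length (block size)
        + host.ljust(10)            # tape mover hostname
    )
    return body.ljust(80)           # pad out to the fixed 80-char record

rec = make_uhl1("CERN", 12345, 262144, "TPSRV201")
assert len(rec) == 80 and rec.startswith("UHL1")
```

Note how the 9-digit zero-padded fields reproduce the example values in the table (000012345, 000262144).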
What is in AUL / UHL 2 ?
Field                         Example
User Header label             UHL (UTL also possible)
Header Label Number           2
Bit file ID (64 bits)         00000000000000376975
Name Server hostname          CASTORNS1
Absolute mode                 0644
Uid                           0000000395
Gid                           0000001028
File size in bytes (64 bits)  00000000010031553895
What is in AUL / UHL 3 ?
Field                    Example
User Header label        UHL (UTL also possible)
Header Label Number      3
User name                timbell
Experiment/Project name
Checksum algorithm       AD (adler32) or CS (cksum)
File checksum (32 bits)
Last modification (UTC)  2001/04/04 08:51:30
What is in AUL / UHL 4 ?
Field                       Example
User Header label           UHL (UTL also possible)
Header Label Number         4
Copy number                 00001
Segment number              00001
Segment size in bytes       00000000010031553895
Checksum algorithm          AD (adler32) or CS (cksum)
Segment checksum (32 bits)
Tape write timestamp (UTC)  2001/04/04 08:51:30
Number of blocks            0000002342
Repack using 20 drives
• Full extended timeline showing aul,max25 to completion
[Chart: Repack Completed (0-100%) vs Days Taken Using 20 Drives (0-1600), for aul,max80; nl,max80; il,max80; aul,max50; aul,max25]
Performance for large files
Tape-to-Tape repack?
[Diagram: repack copying directly from tape to tape within CERN, bypassing the disk server and stager network path]
• Tape-to-tape copy rather than copying through the stager avoids network bottleneck
• Initial tests indicate that the tape writing overheads are larger for our typical files
Tests to scale repack 2
• 3 disk servers
• 3 tape drives in
• 3 tape drives out
• File size of 2GB+
• Elapsed 3h for 1500GB, 46MBytes/s
• Around 60MBytes/s during steady state

• 6 disk servers
• 3 tape drives in
• 3 tape drives out
• File size of 500MB+
Tests to scale repack 2
• 3 disk servers
• 1 tape drive in
• 1 tape drive out
• Reaches Gigabit ethernet wire speeds
c2public small files
• Migrated 400,000 files in 18 days
• Two drives
• Two disk servers
• Using a mixture of nl and aul tapes on IBM drives
• Corresponds to a file per drive every 8 seconds
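The last figure is simple arithmetic, and can be checked directly:

```python
# Back-of-envelope check of the "file per drive every 8 seconds"
# figure: 400,000 files over 18 days shared across two drives.
files, days, drives = 400_000, 18, 2
sec_per_file = days * 86400 / (files / drives)
print(round(sec_per_file, 1))   # ≈ 7.8, i.e. roughly 8 s per file per drive
```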
File size and performance
Date          Alice   Atlas    CMS      LHCb
CCRC May ’08  322 MB  1291 MB  872 MB   1327 MB
March ’08     143 MB  230 MB   1490 MB  865 MB
CCRC Feb ’08  340 MB  320 MB   1470 MB  550 MB
Jan ’08       200 MB  250 MB   2000 MB  200 MB
[Chart: Typical Drive Performance: Drive Speed (MBytes/s, 0-100) vs File Size (MB, 0-3000), for alice, atlas, cms, lhcb]
Additional Information
• Repack Options
  – https://twiki.cern.ch/twiki/bin/view/FIOgroup/TapeBulkRepack
• Repack Performance Analysis
  – http://it-div-ds.web.cern.ch/it-div-ds/HO/repack_challenge.html
• Label Options
  – https://twiki.cern.ch/twiki/bin/view/FIOgroup/TapeLabelOptions