2. About the Institute
3. ~700 employees. Large-scale genomic research.
4. We have active cancer, malaria, pathogen and genomic variation studies. All data is made publicly available.
5. Previously... at BioIT Europe:
6. The Scary Graph
[Figure annotations: "Instrument upgrades", "Peak yearly capillary sequencing"]
7. The Scary Graph
8. Managing Growth
Moore's law will not save us.
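The arithmetic behind this claim can be sketched roughly as follows. This is an illustration only: it reads the 12-month figure quoted for sequencing as a doubling time, and the 18-month doubling time used for hardware is a commonly quoted Moore's-law value assumed here, not a number taken from the slides.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months` for a given doubling time."""
    return 2.0 ** (months / doubling_time_months)


def shortfall(months: float,
              data_doubling: float = 12.0,      # assumed reading of T_d for sequencing
              hardware_doubling: float = 18.0   # assumed Moore's-law figure
              ) -> float:
    """Ratio of data growth to hardware growth over the same period."""
    return (growth_factor(months, data_doubling)
            / growth_factor(months, hardware_doubling))


if __name__ == "__main__":
    for years in (1, 3, 6):
        m = 12 * years
        print(f"{years} y: data x{growth_factor(m, 12.0):.0f}, "
              f"hardware x{growth_factor(m, 18.0):.1f}, "
              f"gap x{shortfall(m):.1f}")
```

Under these assumptions the gap itself doubles every 36 months, so waiting for cheaper hardware never closes it.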
9. Sequencing cost: T_d = 12 months
10. Classic Sanger Stealth project
Not long after:
A fun summer was had by all!
12. What we learned...
13. Nobody stops to tidy up until they have no more disk space.
Data-triage:
14. BAM only.
Storage-Tax:
15. Historically sequencing and IT were budgeted separately.
16. Makes PIs aware of the IT costs, even if it does not cover 100%.
17. Flexible Infrastructure
18. Assume from day 1 we will be adding more.
19. Expand simply by adding more blocks.
Make storage visible from everywhere.
This allows us to move compute jobs between farms.
20. Currently using LSF to manage workflow.
[Diagram labels: LSF; Fast scratch disk; Archival / Warehouse disk; Network]
21. Our Modules:
22. Simple might not be so robust, but it is much simpler and faster to fix if it breaks. More reliable in practice.
Compute:
Bulk Storage:
23. 50-100TB chunks.
Fast Storage:
Reasonably successful.
24. Data management
  # df -h
  Filesystem         Size  Used  Avail  Use%  Mounted on
  lus02-mds1:/lus02  108T  107T  1T     99%   /lustre/scratch102

  # df -i
  Filesystem         Inodes     IUsed      IFree      IUse%  Mounted on
  lus02-mds1:/lus02  300296107  136508072  163788035  45%    /lustre/scratch102

25. Sequencing data flow.
[Diagram labels: Sequencer; Analysis / alignment; Internal repository; EGA / SRA (EBI); compute farm; High-performance storage; "Automated processing and data management"; "Manual data movement"]
26. Unmanaged data
Data is left in the wrong place.
Important data left in scratch areas, or high-IO analysis being run against slow storage.
Finding data is impossible.
Are we backing up the important stuff?
27. Are we keeping control of our private datasets?
28. Managing unstructured data
29. Works well for the pipelines where it is currently used.
Hard to get buy-in from our non-production users.
Our Breakthrough Moment:
Big benefits:
30. 50% reduction in disk utilisation.
Easy to do capacity planning.
31. Bottlenecks:
32. As data sizes increase, even small data groups get hit.
Money talks:
We do not want lots of distinct data tracking systems.
33. Groups need to exchange data.
34. Small groups do not have the manpower to hack something together.
We need something with a simple interface so it can easily support ad-hoc requests.
35. Sequencing data flow.
[Diagram labels: Sequencer; Analysis / alignment; Internal repository; EGA / SRA (EBI); compute farm; High-performance storage; "Automated processing and data management"; "Manual"; "Managed data movement"]
36. What are we using?
Successor to SRB.
HEP community has lots of lessons learned that we can benefit from.
37. iRODS
[Architecture diagram labels: ICAT catalogue database; Rule engine (implements policies); User interface (WebDAV, icommands, FUSE); iRODS server with data on disk; iRODS server with data in database; iRODS server with data in S3]
38. iRODS Features
Scalable:
39. Replicates data.
40. Fast parallel data transfers across local and wide area network links.
Extensible
Federated
41. First implementation
[Diagram labels: Sequencer; Analysis / alignment; Internal repository; EGA / SRA (EBI); compute farm; High-performance storage; "Automated processing and data management"; "Manual"]
42. First Implementation
43. Hold BAM files, and a small amount of metadata.
Rules:
44. Replicate:
Set access controls:
45. Example access:

  $ icd /seq/5307
  $ ils
  /seq/5307:
    5307_1.bam
    5307_2.bam
    5307_3.bam
  $ ils -l 5307_1.bam
    srpipe  0 res-g2  1987106409 2010-09-24.13:35 & 5307_1.bam
    srpipe  1 res-r2  1987106409 2010-09-24.13:36 & 5307_1.bam

46. Metadata

  $ imeta ls -d /seq/5307/5307_1.bam
  AVUs defined for dataObj /seq/5307/5307_1.bam:
  attribute: type
  value: bam
  units:
  ----
  attribute: sample
  value: BG81
  units:
  ----
  attribute: id_run
  value: 5307
  units:
  ----
  attribute: lane
  value: 1
  units:
  ----
  attribute: study
  value: TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE
  units:
  ----
  attribute: library
  value: BG81 449223
  units:

47. Query

  $ imeta qu -d study = "TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE"
  collection: /seq/5307
  dataObj: 5307_1.bam
  ----
  collection: /seq/5307
  dataObj: 5307_2.bam
  ----
  collection: /seq/5307
  dataObj: 5307_3.bam
  ----

48. So what...?
49. Next steps
[Diagram labels: Sanger iRODS; Datacentre 1; Datacentre 2; Replicate; EGA/ERA; Automated release/purge; Collaborator iRODS; Federate]
50. Wishlist: HPC Integration
Data is staged in/out to filesystem. [Diagram labels: Archive / Metadata system; Fast storage / POSIX filesystem; Compute farm]
System can do rule/metadata based ops and standard POSIX ops too. [Diagram labels: Fast storage / POSIX filesystem + Metadata system; Compute farm]
51. Managing Workflow
52. Modular Compute
53. Storage and servers spread across several locations.
[Diagram: Storage and CPU blocks joined by fast, medium and slow links]
54. How do we manage data and workflow?
How do we steer workload to where we want it?
55. LSF Data Aware Scheduler
56. LSF knows how much free space is available on each pool. Users can optionally register datasets as being on a particular storage pool.
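A minimal sketch of the placement decision described here, for illustration only. It does not use any LSF API; the `StoragePool` class, the pool names and the dataset name are all hypothetical, and the tie-breaking policy (prefer the pool with most free space) is an assumption rather than the scheduler's actual behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    name: str
    free_tb: float                              # free space known to the scheduler
    datasets: set = field(default_factory=set)  # datasets registered on this pool


def register_dataset(pools: dict, dataset: str, pool_name: str) -> None:
    """Optional user step: record that `dataset` lives on `pool_name`."""
    pools[pool_name].datasets.add(dataset)


def choose_pool(pools: dict, dataset: str, scratch_needed_tb: float):
    """Pick a pool for a job that reads `dataset` and needs scratch space.

    Prefer pools that already hold the dataset and have enough free space;
    otherwise fall back to any pool with enough space; None if nothing fits.
    """
    fits = [p for p in pools.values() if p.free_tb >= scratch_needed_tb]
    if not fits:
        return None
    local = [p for p in fits if dataset in p.datasets]
    candidates = local or fits
    return max(candidates, key=lambda p: p.free_tb).name


if __name__ == "__main__":
    pools = {"pool_a": StoragePool("pool_a", 5.0),
             "pool_b": StoragePool("pool_b", 50.0)}
    register_dataset(pools, "run_5307", "pool_a")
    print(choose_pool(pools, "run_5307", 1.0))   # data-local pool is chosen
    print(choose_pool(pools, "run_5307", 10.0))  # falls back when it is too full
```

Keeping registration optional means unregistered datasets still schedule; they simply lose the locality preference.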
57. Future Work
Let the system move data.
58. Hot datasets change over time.
59. Replicate/move the datasets to faster storage, or a greater number of storage pools.
Making LSF do data migration/replication will be hard.
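One way the "let the system move data" idea could look, as a rough sketch under stated assumptions: hotness is judged by a simple access count over a recent log, and the first fast pool is always chosen as the target. The function names, the threshold and the pool names are hypothetical, and nothing here is an LSF or iRODS call.

```python
from collections import Counter


def hot_datasets(access_log, threshold):
    """Datasets accessed at least `threshold` times, most-accessed first."""
    counts = Counter(access_log)
    return [name for name, n in counts.most_common() if n >= threshold]


def plan_replication(access_log, placement, fast_pools, threshold=3):
    """Propose (dataset, target_pool) copies for hot datasets not on fast storage.

    `placement` maps dataset name -> set of pools currently holding it.
    """
    if not fast_pools:
        return []
    plan = []
    for name in hot_datasets(access_log, threshold):
        current = placement.get(name, set())
        if current & set(fast_pools):
            continue  # already on at least one fast pool
        plan.append((name, fast_pools[0]))  # naive target choice (assumption)
    return plan


if __name__ == "__main__":
    log = ["run_a", "run_b", "run_a", "run_a", "run_c", "run_b"]
    placement = {"run_a": {"slow1"}, "run_b": {"slow1"}, "run_c": {"slow1"}}
    print(plan_replication(log, placement, ["fast1"], threshold=3))
```

Because hot datasets change over time, the log passed in would need to be a sliding window; re-running the plan on a fresh window is what lets yesterday's hot data age out.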
60. Acknowledgements
61. Phil Butcher
62. ISG
63. Gen-Tao Chiang
64. Pete Clapham
65. Simon Kelley
Platform Computing
66. Chris Duddington
67. Da Xu