Bio-IT World Asia, June 7, 2012
High Performance Data Management and Computational Architectures for Genomics Research at National and
International Scales
William K. Barnett, Ph.D.
Richard LeDuc, Ph.D.
National Center for Genome Analysis Support
National Center for Genome Analysis Support: http://ncgas.org
Summary
• Changing genomics analytical needs
• NCGAS and its mission
• NCGAS cyberinfrastructure
• The 100 Gigabit demonstration
• Scaling genomics analysis
• The NCGAS research model
• Outcomes for life sciences research
Changing genomics analytical needs
• Next Gen sequencers are generating more data and getting cheaper
• Sequencing is:
  - Becoming commoditized at large centers
  - Multiplying at individual labs
• Analytical capacity has not kept up:
  - Bioinformatics support
  - Computational support (thousand points solution)
  - Storage support
NCGAS widening the analytical bottleneck
• Funded by National Science Foundation (grant # ABI-1062432)
• Large memory clusters for assembly
• Bioinformatics consulting for biologists
• Optimized software for better efficiency
• Providing services at: http://ncgas.org
Making it easier for Biologists
• Galaxy interface provides a “user friendly” window to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction.
[Figure: axis "Computational Skills" running from LOW to HIGH, with the labels "Common" and "Rare"]
NCGAS Service Model
[Diagram: service layers, with columns for Public Cloud Providers and NCGAS]
NCGAS offering at each layer:
• Hardware Layer: Mason (512 GB/node)
• OS Layer: Systems Administration
• Services Layer: Galaxy, Parallelization
• Applications: Hardened Applications and Workflows
• Bioinformatics: Expert Consulting
• Network Layer: 100 Gbps I2
Diagram annotation: NEED APIs
NCGAS Galaxy Applications Model
Virtual box hosting Galaxy.Indiana.edu
The host for each tool is configured to meet IU needs
Quarry, Mason
Data Capacitor, RFS
Virtual box hosting Galaxy.NCGAS.org
The host for each tool is configured to meet National needs
Custom Site Hosting Galaxy.YourSite.???
The host for each tool is configured to meet Your needs
NCGAS Workflow Demo at SC 11
• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection to scan the alignment result for new polymorphisms
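The three steps above can be sketched as a toy pipeline. This is only an illustration of the data flow, not the software used in the SC 11 demonstration: the function names, the quality threshold, the ungapped mismatch-counting aligner, and the majority-vote SNP caller are all assumptions made for the example, and production workflows would use dedicated tools for each step.

```python
from collections import Counter

def trim_low_quality(seq, quals, min_q=20):
    """STEP 1 (toy): cut the read at the first base whose quality is below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq

def align(read, reference, max_mismatches=2):
    """STEP 2 (toy): ungapped placement with the fewest mismatches, or None."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos

def call_snps(placed_reads, reference, min_depth=2):
    """STEP 3 (toy): positions where the most common read base differs from the reference."""
    pileup = {}
    for read, pos in placed_reads:
        for i, base in enumerate(read):
            pileup.setdefault(pos + i, Counter())[base] += 1
    snps = []
    for p in sorted(pileup):
        base, count = pileup[p].most_common(1)[0]
        if count >= min_depth and base != reference[p]:
            snps.append((p, reference[p], base))
    return snps

def run_pipeline(raw_reads, reference, min_length=8):
    """Chain the three steps: trim, align, then call SNPs on the placed reads."""
    placed = []
    for seq, quals in raw_reads:
        read = trim_low_quality(seq, quals)
        if len(read) < min_length:
            continue  # too short to place reliably after trimming
        pos = align(read, reference)
        if pos is not None:
            placed.append((read, pos))
    return call_snps(placed, reference)

# Hypothetical data: two reads covering reference position 10, both carrying T instead of G.
reference = "ACGTACGTTAGCCGATAGGCTAAC"
raw_reads = [
    ("ACGTTATCCGAT", [30] * 12),
    ("GTTATCCGATAGNN", [30] * 12 + [5, 5]),  # low-quality tail is trimmed in step 1
]
print(run_pipeline(raw_reads, reference))  # [(10, 'G', 'T')]
```

Each step consumes only the output of the previous one, which is what lets such a workflow be split across sites connected by a fast network.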
[Diagram: Bloomington, IN and Seattle, WA; links labeled 10 Gbps and 100 Gbps; components: Mason, IU POD, Data Capacitor, NCBI Reference Data, Lustre WAN File System]
NCGAS Virtual Genomics Science Instrument
[Diagram labels: Large Sequencing Center; Smaller Sequencing Centers; International Collaborators via TransPAC, GÉANT; FTP]
[Chart: bandwidth comparison, axis from 0 to 100 Gbps]
• Commodity Internet: 1 Gbps, but highly variable
• Internet2: 100 Gbps
• NLR to Sequencing Centers: 10 Gbps per link
• IU Data Capacitor: 20 Gbps throughput
• Ultra SCSI 160 Disk: 1.28 Gbps (160 MBps)
• DDR3 SDRAM: 51.2 Gbps (6.4 GBps)
This Architecture Scales!
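To make the bandwidth comparison concrete, the sketch below converts byte rates to bit rates and estimates how long a dataset would take to move over each link. The link rates are the ones quoted on the slide; the 1 TB dataset size, the function names, and the assumption of full line-rate utilization (efficiency of 1.0) are illustrative only, and real transfers would typically be slower.

```python
# Link rates in Gbps, as quoted on the slide.
LINKS_GBPS = {
    "Commodity Internet": 1.0,
    "NLR to sequencing center (per link)": 10.0,
    "IU Data Capacitor": 20.0,
    "Internet2": 100.0,
}

def megabytes_per_s_to_gbps(mbytes_per_s):
    """Convert MB/s to Gbps using decimal units (1 byte = 8 bits)."""
    return mbytes_per_s * 8 / 1000

def transfer_seconds(size_terabytes, rate_gbps, efficiency=1.0):
    """Idealized transfer time; efficiency < 1.0 models protocol or contention overhead."""
    bits = size_terabytes * 1e12 * 8
    return bits / (rate_gbps * 1e9 * efficiency)

if __name__ == "__main__":
    dataset_tb = 1.0  # hypothetical dataset size, not a figure from the talk
    for name, rate in LINKS_GBPS.items():
        minutes = transfer_seconds(dataset_tb, rate) / 60
        print(f"{name}: {minutes:.1f} min for {dataset_tb} TB")
```

Under these assumptions, 1 TB needs 8000 seconds (a little over two hours) at 1 Gbps and 80 seconds at 100 Gbps.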
How would this work at scale?
1. Biologists anywhere use Galaxy
2. Sequence data transferred over Research Nets
3. Lustre WAN flows data into Data Capacitor
4. Data Capacitor mounts reference data
5. Results available on Data Capacitor for subsequent analyses (secure to HIPAA standards)
Outcomes for Life Sciences Research…
• National and international networks have the capacity to handle genomics data.
• Distributed workflow tools lower the bar for biologists to accomplish genomic science.
• NCGAS is an extensible model of a scaled and integrated infrastructure for biological research.
• This model can extend internationally.
Thank You
Questions?
Bill Barnett ([email protected])
Rich LeDuc ([email protected])