Bio-IT World Asia, June 7, 2012: High Performance Data Management and Computational Architectures for Genomics Research at National and International Scales. William K. Barnett, Ph.D.; Richard LeDuc, Ph.D.; National Center for Genome Analysis Support

Page 1

Bio-IT World Asia, June 7, 2012

High Performance Data Management and Computational Architectures for Genomics Research at National and International Scales

William K. Barnett, Ph.D.

Richard LeDuc, Ph.D.

National Center for Genome Analysis Support

Page 2

National Center for Genome Analysis Support: http://ncgas.org

Summary

• Changing genomics analytical needs

• NCGAS and its mission

• NCGAS cyberinfrastructure

• The 100 Gigabit demonstration

• Scaling genomics analysis

• The NCGAS research model

• Outcomes for life sciences research

Page 3

Changing genomics analytical needs

• Next Gen sequencers are generating more data and getting cheaper

• Sequencing is becoming commoditized at large centers and multiplying at individual labs

• Analytical capacity has not kept up: bioinformatics support, computational support (thousand points solution), and storage support

Page 4

NCGAS widening the analytical bottleneck

• Funded by National Science Foundation (grant # ABI-1062432)

• Large memory clusters for assembly

• Bioinformatics consulting for biologists

• Optimized software for better efficiency

• Providing services at: http://ncgas.org

Page 5

Making it easier for Biologists

• Galaxy interface provides a “user friendly” window to NCGAS resources

• Supports many bioinformatics tools

• Available for both research and instruction.

[Figure: axis labeled "Computational Skills" running from LOW to HIGH, with the labels "Common" and "Rare"]

Page 6

NCGAS Service Model

[Figure: layered service stack with columns for Public Cloud Providers and NCGAS, annotated "NEED APIs". NCGAS entries by layer:]

• Hardware Layer: Mason (512 GB/node)

• OS Layer: Systems Administration

• Services Layer: Galaxy, Parallelization

• Applications: Hardened Applications and Workflows

• Bioinformatics: Expert Consulting

• Network Layer: 100 Gbps I2

Page 7

NCGAS Galaxy Applications Model

Virtual box hosting Galaxy.Indiana.edu

The host for each tool is configured to meet IU needs

Quarry, Mason

Data Capacitor, RFS

Virtual box hosting Galaxy.NCGAS.org

The host for each tool is configured to meet National needs

Custom Site Hosting: Galaxy.YourSite.???

The host for each tool is configured to meet Your needs

Page 8

NCGAS Workflow Demo at SC 11

• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence

• STEP 2: sequence alignment to a known reference genome

• STEP 3: SNP detection to scan the alignment result for new polymorphisms
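The three steps above can be illustrated with a toy sketch. This is not the software used in the SC 11 demo, which the slides do not name. The `Read` type, the quality threshold of 20, the ungapped best-match alignment, and the `min_support` vote count are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Read:
    seq: str          # base calls
    quals: list       # per-base quality scores (hypothetical Phred-like values)


def preprocess(read, min_qual=20):
    """Step 1: trim low-quality bases from the end of a read."""
    end = len(read.seq)
    while end > 0 and read.quals[end - 1] < min_qual:
        end -= 1
    return Read(read.seq[:end], read.quals[:end])


def align(read, reference):
    """Step 2: ungapped alignment; return (offset, mismatches) of the best placement."""
    best = (-1, len(read.seq) + 1)
    for offset in range(len(reference) - len(read.seq) + 1):
        window = reference[offset:offset + len(read.seq)]
        mismatches = sum(a != b for a, b in zip(read.seq, window))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_snps(reads, reference, min_support=2):
    """Step 3: report (position, ref_base, alt_base) seen in at least min_support reads."""
    votes = {}
    for read in reads:
        clean = preprocess(read)
        if not clean.seq:
            continue  # nothing left after trimming
        offset, _ = align(clean, reference)
        if offset < 0:
            continue  # read is longer than the reference
        for i, base in enumerate(clean.seq):
            pos = offset + i
            if base != reference[pos]:
                votes[(pos, base)] = votes.get((pos, base), 0) + 1
    return sorted(
        (pos, reference[pos], base)
        for (pos, base), count in votes.items()
        if count >= min_support
    )


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"                    # made-up reference sequence
    reads = [
        Read("GTACCTTA", [30] * 8),               # carries a G>C change at position 6
        Read("ACCTTAGC", [30] * 8),               # same change, different placement
        Read("TACGTTAGA", [30] * 8 + [5]),        # low-quality last base is trimmed
    ]
    print(call_snps(reads, reference))
```

Production pipelines use indexed, gapped aligners and statistical variant callers; the sketch only shows how the output of each step feeds the next.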

Bloomington, IN; Seattle, WA

Page 9

NCGAS Virtual Genomics Science Instrument

[Figure: network diagram. Labels: Mason; IU POD; Data Capacitor; NCBI Reference Data; Lustre WAN File System; Large Sequencing Center; Smaller Sequencing Centers; International Collaborators via TransPAC, Geant; FTP. Link speeds shown: 10 Gbps and 100 Gbps.]

Page 10

[Figure: bar chart of throughput on a 0 to 100 Gbps axis]

• Commodity Internet (1 Gbps but highly variable)

• Internet2 (100 Gbps)

• NLR to Sequencing Centers (10 Gbps/link)

• IU Data Capacitor (20 Gbps throughput)

• Ultra SCSI 160 Disk (1.28 Gbps, 160 MBps)

• DDR3 SDRAM (51.2 Gbps, 6.4 GBps)

This Architecture Scales!
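The comparison behind this slide is unit arithmetic, which the following sketch makes explicit. It converts bytes per second to bits per second (160 MBps is 1.28 Gbps, 6.4 GBps is 51.2 Gbps) and estimates ideal transfer times at the quoted link rates. The 1 TB dataset size is a hypothetical value, not a figure from the talk, and the estimate ignores protocol overhead and contention.

```python
# Nominal rates quoted on the slide, in gigabits per second.
LINK_GBPS = {
    "Commodity Internet": 1.0,
    "NLR to Sequencing Centers (per link)": 10.0,
    "IU Data Capacitor": 20.0,
    "Internet2": 100.0,
}


def mbytes_per_s_to_gbps(mbytes_per_s):
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mbytes_per_s * 8 / 1000


def transfer_hours(size_tb, rate_gbps):
    """Ideal hours to move size_tb terabytes at rate_gbps, with no overhead."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (rate_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    dataset_tb = 1.0  # hypothetical dataset size, chosen for illustration
    for name, rate in LINK_GBPS.items():
        print(f"{name}: {transfer_hours(dataset_tb, rate):.2f} h for {dataset_tb} TB")
    print(f"Ultra SCSI 160: {mbytes_per_s_to_gbps(160):.2f} Gbps")
    print(f"DDR3 SDRAM: {mbytes_per_s_to_gbps(6400):.1f} Gbps")
```

Under these idealized assumptions, a 100 Gbps path moves the same data in one hundredth of the time a 1 Gbps path needs, so the network stops being the slowest component when compared with a single disk.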

Page 11

How would this work at scale?

1. Biologists anywhere use Galaxy

2. Sequence data transferred over Research Nets

3. Lustre WAN flows data into Data Capacitor

4. Data Capacitor mounts reference data

5. Results available on Data Capacitor for subsequent analyses (secure to HIPAA standards)

Page 12

Outcomes for Life Sciences Research…

• National and international networks have the capacity to handle genomics data.

• Distributed workflow tools lower the bar for biologists to accomplish genomic science.

• NCGAS is an extensible model of a scaled and integrated infrastructure for biological research.

• This model can extend internationally.

Page 13

Thank You

Questions?

Bill Barnett ([email protected])

Rich LeDuc ([email protected])
