Chambwe bosc2010

Post on 11-Jun-2015

THE GOBY FRAMEWORK: TOWARDS EFFICIENT NEXT-GENERATION SEQUENCING DATA ANALYSIS

Nyasha Chambwe, Kevin C. Dorff, Marko Srdanovic, Xutao Deng, Stuart J.D. Andrews, Fabien Campagne

The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine; Department of Physiology and Biophysics, Weill Medical College of Cornell University

http://goby.campagnelab.org

McPherson J.D. Nat Methods. 2009

Applications of Next Generation Sequencing

Next Generation Sequencers

Platform                        NGS Chemistry            Avg Read Length (bp)  Run Time (days)  Giga bases/run  Million reads/run
Roche/454 GS FLX Titanium       Pyrosequencing           330                   0.35             0.45            1.36
Illumina/Solexa GA IIe          Reversible Terminators   75                    4                18              240
Life Technologies SOLiD 3       Sequencing by ligation   50                    7                30              600
Helicos BioSciences Heliscope   Reversible Terminators   32                    8                37              1156

Source: Metzker, M.L. Nat Rev Genet. 2010

Next Generation Sequence Data Formats

Key Limitations

• Text based formats do not scale well to handle large amounts of data
• Naïve compression prevents semi-random access

File Format Wish List

• Structured schema/data representation
• Well specified and documented (not ambiguous)
• Fast parsing speed
• Language and operating system portability
• Backward and forward compatibility
• Compression
• Random access
• Streaming

The Goby Software Framework

[Diagram: the four layers of the framework]

• File Formats: reads, alignments, histograms
• Low Level APIs: Java, C++, Python; Readers, Writers, Iterators
• Tools/Utilities: File Format Conversions, Alignment Processing, Visualization
• Applications: RNA-Seq Pipeline, IGV Plug-in

Structured non-ambiguous representation

Goby uses Protocol Buffers (PB) to provide “a flexible, efficient, automated mechanism for serializing structured data” (PB website)

• PB generates parsers in different languages, e.g., Java, C++, Python, Perl, R, C, C#, Visual Basic, PHP, Objective C, Ruby, Common Lisp

• PB provides forward and backward compatibility

Goby compact formats

Data is represented by Protocol Buffers as a message defined by a .proto file
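One reason Protocol Buffers messages tend to be compact is that integer fields are written on the wire as base-128 variable-length integers (varints), so small values take a single byte. The sketch below illustrates that encoding in plain Python. It is a standalone illustration, not Goby code and not the protobuf library; the function names `encode_varint` and `decode_varint` are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (7 data bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode one varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

With this scheme a value such as a short read position below 128 occupies one byte, where a fixed-width 32-bit field would always occupy four.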

Goby compact formats

Chunking:
• Semi-random access
• Efficient parallel processing
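To illustrate how chunking can give semi-random access to compressed data, the sketch below compresses records in independent chunks and records the byte offset of each chunk, so a reader can seek to one chunk and decompress only that chunk. It uses Python's standard `zlib` module. The chunk size, the length-prefixed layout, the in-memory index, and all function names are assumptions made for illustration; they do not describe Goby's actual file layout.

```python
import io
import zlib


def write_chunked(records, stream, records_per_chunk=3):
    """Write records as independently compressed chunks.

    Returns an index: a list of (first_record_number, byte_offset) pairs.
    Records are strings that must not contain newlines.
    """
    index = []
    for start in range(0, len(records), records_per_chunk):
        chunk = records[start:start + records_per_chunk]
        payload = zlib.compress("\n".join(chunk).encode("utf-8"))
        index.append((start, stream.tell()))
        stream.write(len(payload).to_bytes(4, "big"))  # 4-byte length prefix
        stream.write(payload)
    return index


def read_chunk(stream, byte_offset):
    """Seek to one chunk and decompress only that chunk."""
    stream.seek(byte_offset)
    size = int.from_bytes(stream.read(4), "big")
    return zlib.decompress(stream.read(size)).decode("utf-8").split("\n")


def read_record(stream, index, record_number):
    """Fetch one record by locating the chunk that contains it."""
    first, offset = max(e for e in index if e[0] <= record_number)
    return read_chunk(stream, offset)[record_number - first]


records = ["read_%d" % i for i in range(10)]
buffer = io.BytesIO()
index = write_chunked(records, buffer)
print(read_record(buffer, index, 7))
```

Because each chunk decompresses on its own, separate chunks could also be handed to separate workers, which is the property that makes parallel processing straightforward.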

Goby File Size Comparisons

MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050) sequenced on four next-gen platforms

Alignment Iterator

Code fragment to:
1. Scan through two alignments (input1, input2)
2. Print information for each entry
3. Print information for chromosomes 1, 2, X only
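The three steps listed above can be sketched in Python as follows. This is not the Goby API: the `AlignmentEntry` record, its field names, the `iterate_alignments` helper, and the sample data are all hypothetical stand-ins chosen to show the iteration pattern (chain several alignments, filter by reference sequence, print each entry).

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for an alignment entry; not Goby's real entry type.
AlignmentEntry = namedtuple(
    "AlignmentEntry", ["query_index", "reference", "position", "score"]
)


def iterate_alignments(alignments, references=None):
    """Yield entries from several alignments in turn.

    If references is given, only entries whose reference name is in that
    set are yielded.
    """
    for entry in chain.from_iterable(alignments):
        if references is None or entry.reference in references:
            yield entry


# Made-up sample data standing in for the two input alignments.
input1 = [AlignmentEntry(0, "1", 1500, 35.0), AlignmentEntry(1, "7", 220, 33.0)]
input2 = [AlignmentEntry(2, "X", 9100, 36.0), AlignmentEntry(3, "2", 48, 30.0)]

# Steps 1-3: scan both alignments, keep chromosomes 1, 2 and X, print entries.
for entry in iterate_alignments([input1, input2], references={"1", "2", "X"}):
    print(entry.query_index, entry.reference, entry.position, entry.score)
```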

RNA-Seq Pipeline

• Objective: To determine levels of expression in samples and perform differential expression analysis
• Supports:
  - Mapping to full genome
  - Mapping to annotated cDNAs (reads match inside exons and across exon-exon boundaries)
• Sequencing platform independent
• Published normalization methods implemented (Mortazavi A et al. Nat Methods. 2008; Bullard JH et al. BMC Bioinformatics. 2010)
• Bias correction for platform specific biases (Hansen KD et al. Nucleic Acids Res. 2010)
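As an example of the kind of normalization cited above, Mortazavi et al. (2008) define RPKM, reads per kilobase of exon model per million mapped reads. The sketch below computes it directly from that definition. It is a minimal illustration, not the pipeline's own implementation; the function and parameter names are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    gene, N the total number of mapped reads in the sample, and L the length
    of the gene's exon model in base pairs.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp gene in a sample with 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Dividing by gene length and by sequencing depth makes values comparable across genes of different sizes and across samples sequenced to different depths.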

Sample RNA-Seq Results

Conclusion

• Goby file formats are efficient and non-ambiguous
• Alignments are about five times smaller than BAM alignments
• API makes it easy to write efficient code to handle large datasets
• Framework provides utilities and analysis pipelines for common NGS data analysis tasks

Acknowledgements

Campagne Lab: Fabien Campagne, Kevin C. Dorff, Marko Srdanovic, Stuart J.D. Andrews

Broad Institute: Jim Robinson

FDA/NCTR: Leming Shi

Sequencing Quality Control Project (SEQC): Helicos, Illumina, Life Technologies, Roche

http://goby.campagnelab.org
