Posted on 01-Apr-2015
Next-Gen Sequencing Bioinformatics Support
GPCL-BAC
Rick Jordan, Programmer/Analyst
J. Lyons-Weiler, Sci. Director
September 26, 2008
Process
• GPCL-BAC Director & Analyst meet w/PI
– Discuss Data Analysis Needs & Study Design
• PI Decides on Use of BAC or “Go It Alone”
– “Go It Alone” -> data (.sff files)
– “Use the BAC”
  • data analysis $ estimate
  • annotation, assembly, & analysis + data
• PI reviews Preliminary Research Report w/Analyst
• After final analysis, PI receives Report & Data
• Often the Analysis will be tailored to the application
de novo Analysis Flowchart
[Flowchart. Recoverable labels: 454 GS FLX; Image files; Image processing; Signal processing; Sequences; Sequence processing; dataRunParams.parse; analysisParams.parse; .sff files; 454RuntimeMetrics.csv; 454QualityFilterMetrics.csv; 454BaseCallerMetrics.csv; Data/Reads exported to data rig; Assembler (GS FLX System; GS or Lasergene); Analysis & Annotation]
Image processing
Lasergene SeqBuilder
• Reference sequence: E. coli K12
Signal Processing
de novo Genome Assembly
• Two software packages currently used:
– GS FLX Assembler (Newbler algorithm): can be used for all experiments
– Lasergene (SeqMan Pro): single-end experiments only
GS de novo Assembler
• Input: .sff files and per-base quality scores
• Output: Consensus sequence, assembled de novo
• Main processing steps:
– Identify pairwise overlaps between reads
– Construct multiple alignments of contigs
– Generate consensus basecalls of contigs
– Output contig consensus sequences and quality scores, along with ACE file of multiple alignments and assembly metrics files
From 454 Sequencing GS-FLX Data Analysis Software Manual, Dec 2007
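The main processing steps listed above (pairwise overlaps, alignment into contigs, consensus basecalls) can be illustrated with a toy example. This is a minimal sketch, not the Newbler algorithm: it assumes error-free reads, exact suffix/prefix overlaps, and a single linear contig. The function names and the three short reads are hypothetical.

```python
from collections import Counter


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def pairwise_overlaps(reads, min_len=3):
    """Step 1: find every ordered read pair (i, j) with an overlap >= min_len."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_len)
                if n > 0:
                    overlaps[(i, j)] = n
    return overlaps


def layout(reads, min_len=3):
    """Step 2: chain reads by best overlap; return (read, offset) placements."""
    overlaps = pairwise_overlaps(reads, min_len)
    has_incoming = {j for (_, j) in overlaps}
    # Start from a read that no other read overlaps into (fall back to read 0).
    current = next((i for i in range(len(reads)) if i not in has_incoming), 0)
    placed, used, offset = [(reads[current], 0)], {current}, 0
    while True:
        candidates = [(n, j) for (i, j), n in overlaps.items()
                      if i == current and j not in used]
        if not candidates:
            return placed
        n, nxt = max(candidates)
        offset += len(reads[current]) - n
        placed.append((reads[nxt], offset))
        used.add(nxt)
        current = nxt


def consensus(placed):
    """Step 3: majority vote over the bases stacked in each alignment column."""
    columns = {}
    for read, offset in placed:
        for k, base in enumerate(read):
            columns.setdefault(offset + k, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


reads = ["ACGTAC", "TACGGA", "GGATTC"]  # hypothetical reads
contig = consensus(layout(reads))
print(contig)
```

A production assembler also has to tolerate sequencing errors, handle repeats, and report per-base quality for the consensus, none of which this sketch attempts.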
e.g. Graphic Figure of the Assembly (Lasergene 7.2)
GS Reference Mapper
• Generates the consensus DNA sequence by mapping, or alignment, of the reads to a reference sequence
• Provides a list of high-confidence mutations (individual bases or blocks of bases that differ between the consensus DNA sequence of the sample and the reference sequence)
From 454 Sequencing GS-FLX Data Analysis Software Manual, Dec 2007
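The idea of listing individual bases or blocks of bases that differ between the sample consensus and the reference can be sketched as follows. This is an illustration only, not the GS Reference Mapper's method: it assumes the two sequences are already aligned, equal in length and free of gaps, and it does not model the confidence scoring. The function name and the example sequences are hypothetical.

```python
def list_differences(reference: str, consensus: str):
    """Return (start, ref_bases, sample_bases) for each run of mismatches.

    `start` is 0-based. A run of length 1 is a single-base difference;
    longer runs are blocks of adjacent differing bases.
    """
    if len(reference) != len(consensus):
        raise ValueError("this sketch assumes equal-length, gap-free sequences")
    diffs, start = [], None
    for pos, (ref_base, sample_base) in enumerate(zip(reference, consensus)):
        if ref_base != sample_base and start is None:
            start = pos  # a new run of mismatches begins here
        elif ref_base == sample_base and start is not None:
            diffs.append((start, reference[start:pos], consensus[start:pos]))
            start = None
    if start is not None:  # run extends to the end of the sequence
        diffs.append((start, reference[start:], consensus[start:]))
    return diffs


reference = "ACGTACGTAC"   # hypothetical reference
consensus = "ACCTACTAAC"   # hypothetical sample consensus
for start, ref_bases, sample_bases in list_differences(reference, consensus):
    print(start, ref_bases, "->", sample_bases)
```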
Genome Annotation (sequence functional classes)
Zuber et al. (2007)
Gene annotation with SeqManPro
Project
e.g. Diagrams
Smith et al. (2007)
Impacted Pathways
Rank  Pathway Name                           Impact Score  # Genes in Pathway  # Input Genes on Chip  # Pathway Genes on Chip  % Pathway Genes in Input  p-value   Corrected p-value
1     Phosphatidylinositol signaling system  10.508        55                  4                      46                       7.273                     0.007995  0.007995
2     ECM-receptor interaction               6.746         62                  4                      57                       6.452                     0.016721  0.016721
3     Wnt signaling pathway                  6.731         113                 6                      92                       5.31                      0.005133  0.005133
4     B cell receptor signaling pathway      6             59                  4                      47                       6.78                      0.008623  0.008623
5     Melanogenesis                          5.765         86                  4                      63                       4.651                     0.023291  0.023291
6     Gap junction                           5.422         76                  4                      63                       5.263                     0.023291  0.023291
7     GnRH signaling pathway                 5.026         84                  4                      68                       4.762                     0.029813  0.029813
8     Focal adhesion                         4.954         163                 5                      140                      3.067                     0.095078  0.095078
9     Long-term potentiation                 4.868         62                  3                      49                       4.839                     0.052339  0.052339
10    Olfactory transduction                 4.673         27                  2                      21                       7.407                     0.050233  0.050233
11    Calcium signaling pathway              4.644         164                 5                      131                      3.049                     0.076528  0.076528
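As a reading aid for the table, the "% Pathway Genes in Input" column appears to equal the number of input genes divided by the number of genes in the pathway, times 100. That relationship is inferred from the tabulated numbers, not taken from the analysis software's documentation. The short check below reproduces a few rows under that assumption; the function name is hypothetical.

```python
def percent_pathway_genes_in_input(n_input_genes: int, n_genes_in_pathway: int) -> float:
    """Assumed definition: share of a pathway's genes that appear in the input list."""
    return round(100 * n_input_genes / n_genes_in_pathway, 3)


# (pathway, genes in pathway, input genes, reported percentage) from the table
rows = [
    ("Phosphatidylinositol signaling system", 55, 4, 7.273),
    ("Wnt signaling pathway", 113, 6, 5.31),
    ("Focal adhesion", 163, 5, 3.067),
    ("Olfactory transduction", 27, 2, 7.407),
]

for name, in_pathway, in_input, reported in rows:
    computed = percent_pathway_genes_in_input(in_input, in_pathway)
    print(f"{name}: computed {computed}, reported {reported}")
```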
e.g. Pathway view
e.g. COGS table
Smith et al. (2007)
e.g. Sequencing statistics table
Marcy et al. (2007)
Base Caller Metrics
Quality Filter Metrics
Runtime Metrics
Quality measures by region
[Figure: eight histogram panels titled "Q Score - TCA - Region N" and "Q Score - ATG - Region N" for Regions 1 to 4. x-axis: Q Score; y-axis: Number of bases.]
Read lengths by region
[Figure: eight histogram panels titled "Length - TCA - Region N" and "Length - ATG - Region N" for Regions 1 to 4. x-axis: Length of read; y-axis: Number of reads.]
e.g. Blast results
e.g. Predicted nucleotide and protein alignment
Raymond et al. (2007)
e.g. Predicted protein alignment
Raymond et al. (2007)
Grant Text
Next Generation Sequence Bioinformatics Analysis
The Bioinformatics Analysis Core has the software and staff needed to analyze data from resequencing and de novo sequencing studies. Software acquisitions include the default Genome Sequencer modules and the recently acquired specialized Lasergene 7.2 software by DNASTAR. One BAC staff member is dedicated to the analysis of long-read NextGen sequencing data and is responsible for generating research reports for each project.
Genome Sequencer FLX System Software
The FLX System Software includes modules for each stage in the analysis. All raw data are accessible, and the system also offers a variety of third-party software packages for niche applications.
Data QA/QC
The Core uses a variety of data quality control measures, including consensus accuracy and quality scores reported per base (Q20+) and per genome (%Bases Q20+; the proportion of an assembled genome with base call accuracy of at least 99%).
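The link between a quality score and base call accuracy follows the standard Phred definition, Q = -10 log10(P_error), so Q20 corresponds to 99% accuracy. The sketch below shows that conversion and a %Bases Q20+ calculation. The function names and the example scores are hypothetical, and the Core's own reporting tools may compute these figures differently.

```python
def phred_to_accuracy(q: float) -> float:
    """Base call accuracy implied by a Phred quality score Q."""
    return 1 - 10 ** (-q / 10)


def percent_bases_at_least(quality_scores, threshold: int = 20) -> float:
    """Percentage of bases whose quality score is >= threshold (e.g. %Bases Q20+)."""
    if not quality_scores:
        raise ValueError("no quality scores supplied")
    passing = sum(1 for q in quality_scores if q >= threshold)
    return 100 * passing / len(quality_scores)


scores = [10, 20, 30, 40]  # hypothetical per-base consensus quality scores
print(f"Q20 accuracy: {phred_to_accuracy(20):.2%}")
print(f"Q40 accuracy: {phred_to_accuracy(40):.2%}")
print(f"%Bases Q20+: {percent_bases_at_least(scores):.1f}")
```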
The Core has also acquired the licenses required to run the full suite of Lasergene applications, rounding out the Core’s Genome Annotation capabilities. In addition to the sequence assembly and SNP discovery algorithms in SeqMan Pro and the visualization and sequence editing modules (SeqBuilder), the Lasergene suite adds the capacity for gene finding (GeneQuest) and protein structure analysis & prediction (Protean).
Handling the variety of file types that the Core is expected to process is greatly aided by Lasergene’s EditSeq and by the much-improved interoperability of SeqMan Pro (which can import .sff, .fna, .fas and .qual files).
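Of the file types mentioned, .fna and .qual are plain text: both use FASTA-style ">" header lines, with bases in the .fna file and space-separated integer quality scores in the .qual file. A minimal sketch of pairing the two is given below. It is independent of Lasergene and the GS software, the function names and read IDs are hypothetical, and binary .sff files need a dedicated parser that is not shown.

```python
import io


def read_records(handle):
    """Yield (read_id, payload_lines) from a FASTA-style text file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # The read ID is the first token of the header line.
            name, lines = line[1:].split()[0], []
        else:
            lines.append(line)
    if name is not None:
        yield name, lines


def load_reads(fna_handle, qual_handle):
    """Pair each sequence with its quality scores; return {read_id: (seq, scores)}."""
    seqs = {name: "".join(lines) for name, lines in read_records(fna_handle)}
    quals = {name: [int(q) for line in lines for q in line.split()]
             for name, lines in read_records(qual_handle)}
    reads = {}
    for name, seq in seqs.items():
        scores = quals.get(name)
        if scores is None or len(scores) != len(seq):
            raise ValueError(f"sequence/quality mismatch for read {name}")
        reads[name] = (seq, scores)
    return reads


# Hypothetical in-memory stand-ins for a .fna file and its matching .qual file.
fna_text = ">r1 length=4\nACGT\n>r2\nGG\nA\n"
qual_text = ">r1\n30 31 32 33\n>r2\n20 21\n22\n"
reads = load_reads(io.StringIO(fna_text), io.StringIO(qual_text))
for read_id, (seq, scores) in reads.items():
    print(read_id, len(seq), min(scores))
```

The resulting read lengths and per-base scores are the raw material for the read length and Q score distributions reported by region.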
Research Report Components
• Tables
– Base Call Metrics
– Quality Filter Metrics
– Run Time Metric Tables
– Quality Score
  • per base (Q40+)
  • per genome (%Bases Q40+; the proportion of an assembled genome with base call accuracy of at least 99.99%)
– Quality Measure Distributions (By region)
– Read Length Measure Distributions
– Overall Sequence Statistics Tables
– Blast tables
– COGs Table
• Figures
– Assembly Figures
– Alignment Diagrams
– Gene Functional Categories Diagrams
– Genome View Diagrams
– Nucleotide Alignment Diagrams
– Predicted Protein Alignment Diagrams
– Gene Ontology Functional Class Diagrams/Charts
– Pathway Views
– COGs Figures
• Methods Text
– Manuscripts
– Proposals
• Letter of Support
Application Areas
• Ancient DNA
• ChIP-seq/Methylation/Epigenetics
• Eukaryotic Whole Genome Sequencing
• Expression tags
• Genetic variation detection
• HIV sequencing
• Metagenomics and Microbial Diversity
• Mitochondria/viruses/plastids/plasmids
• Prokaryotic Whole Genome Sequencing
• Sequence Capture/Target Region Resequencing
• Small RNAs
• Somatic variation detection
• Transcriptome Sequencing
Roche 454/GS-FLX Web Site
Final Service Product
• Pre-analysis output files
– dataRunParams.parse
– 454BaseCallerMetrics.csv
– 454QualityFilterMetrics.csv
– 454RuntimeMetricsAll.csv
• Post-analysis output files
– .sff files (for each region)
– Research report (.ppt)
– Additional text editing