Bioinformatics in next generation sequencing...
Transcript of Bioinformatics in next generation sequencing...
Bioinformatics in next generation sequencing projects
Rickard SandbergAssistant ProfessorDepartment of Cell and Molecular BiologyKarolinska Institutet
May 2012
Thursday, May 24, 12
Standard sequence library generation
Thursday, May 24, 12
Illumina Sequencing Technology
Thursday, May 24, 12
Illumina (Solexa) Sequencing
Thursday, May 24, 12
Illumina paired-end and index-read sequencing
Thursday, May 24, 12
Once sequenced the problembecomes computational
Computational analyses is the bottleneck• Rapid improvement in sequencing• Still need for customized analysis for most projects
Thursday, May 24, 12
Overview of computational analyses
Image analysisBase calling
Primary Analyses: Mapping(Assembly)
Data typespeci!c analyses(e.g. peak calling,
calculate expression)
Custom projectspeci!c analyses
ChIP-Seq peak calling
RNA-Seq expression levelsgenome sequence
assembled contig
Thursday, May 24, 12
Preliminary Analyses
Raw Image (TB)
Platform-specific analysis using the vendors programs
Sequences and Quality scoresText File (GB)
Real Time Analysis
Thursday, May 24, 12
Sequenced reads
>EAS54_6_R1_2_1_413_324CCCTTCTTGTCTTCAGCGTTTCTCC
Fasta file:
Fastq file:
SOLiD
csfasta file>1_39_146_F3T22100200202311030112002022222002021>1_39_194_F3T11022322003020303320012223122202221
SOLiD, QV file>1_39_146_F314 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4 >1_39_194_F326 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG
aaab^_U_aa [U [ _Z ] a `WU_^X`GT^_ \ TM^ ^ \ ___ \ Z \ YQVVXUBBBB
Read identifier
Quality scores
@HWI - EAS269:1:120:1786:18#0/1
+HWI - EAS269:1:120:1786:18#0/1
Thursday, May 24, 12
Phred Quality Score, Q
Each base call has an estimate of the probability of being wrong (error probability, p)
Q = -10 * log10(p)
Phred Quality Score Probability of incorrect base call Base call accuracy10 1 in 10 90 %20 1 in 100 99 %30 1 in 1000 99.9 %40 1 in 10000 99.99 %50 1 in 100000 99.999 %
Thursday, May 24, 12
FastQ encodings
• Sanger FastQ: Phred score from 0-93 using the ASCII characters 33 – 126
• Solexa (+1.3 pipeline): Phred score from 0-62 using the ASCII characters 0-62
• Solexa (older pipelines): Solexa score using ASCII characters -5 to 62
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126
S - Sanger Phred+33, 41 values (0, 40) I - Illumina 1.3 Phred+64, 41 values (0, 40) X - Solexa Solexa+64, 68 values (-5, 62)
Thursday, May 24, 12
Fastq quality control (FastQC)
http://www.youtube.com/watch?v=bz93ReOv87YVideo tutorial:
Thursday, May 24, 12
Overview of computational analyses
Image analysisBase calling
Primary Analyses: MappingAssembly
Data typespeci!c analyses(e.g. peak calling,
calculate expression)
Custom projectspeci!c analyses
ChIP-Seq peak calling
RNA-Seq expression levelsgenome sequence
assembled contig
Thursday, May 24, 12
Short Read AssemblyVelvet and SOAPdenovode novo genomic assembler specially designed for short read sequencing technologies
Nature 2009
Thursday, May 24, 12
Human Genome Assembly
UCSC Genome Browser
Thursday, May 24, 12
Mapping of readsTask: Map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp ) or transcriptome
Computationally feasible
Mismatches (sequencing errors and SNPs)
Unique / Repetitive matches
Indels (Normal variation, CNVs)
Large rearrangements (translocations)
BLAST, BLAT tools not designed for these tasks
Thursday, May 24, 12
MAQ bowtie
Thursday, May 24, 12
Commonly used programs
Program Approach Comments
Bowtie / Bowtie2Burrow-Wheeler
Transformation (BWT) Illumina, (SOLiD), fast
MAQ Spaced Seed Indexing Illumina, (SOLiD), SNPs
BWA Burrow-Wheeler Transformation (BWT)
Illumina, (SOLiD), indels
NovoalignNeedleman-Wunch
AlignmentIllumina, indels, slower, free (single proc mode)
ZOOM Designed spaced seeds Illumina, fast, indels, not free
Mappers from Illumina (ELAND) and SOLiD (bioscope/mapreads)
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment
Thursday, May 24, 12
Storing mapped Alignments
Formats for storing alignments should include:
genomic coordinates
mismatches, insertion, deletions etc.
quality information
Thursday, May 24, 12
Samtools
Sequence Alignment Map (SAM)
Generic Alignment format
Supports long and short reads
Human readable, !exible and compact
Emerging standard
h"p://samtools.sourceforge.net/
Li6H.*,6Handsaker6B.*,6Wysoker6A.,6Fennell6T.,6Ruan6J.,6Homer6N.,6Marth6G.,6Abecasis6G.,6Durbin6R.6and610006Genome6Project6Data6Processing6Subgroup6(2009)6The6Sequence6alignment/map6(SAM)6format6and6SAMtools.6BioinformaScs,625,62078W9.6[PMID:619505943]
Thursday, May 24, 12
SAM Example
16 chr Y 616000 255 22M731N28M* 0 0 ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC#9 ; -7 +2@4 : 2=20 - 14= : ><?< ; : BB? : 4<BB?ABBBBABCBBBBC=BB NM: i : 0XS: A:-
Bit field, where 16 means reverse strand
Start position
Alignment structure. Here: 22 aligned bases,then 731 bases intron, then 28 aligned bases
HWI - EAS269:1:114:1242:1582#0
Thursday, May 24, 12
CIGAR Format
M, match/mismatch
I, insertion
D, deletion
S, softclip
...
Ref: GCATTCAGATGCAGTACGC
Read: ccTCAG--GCAGTAgtg
Pos: 5
CIGAR: 2S4M3D6M3S
Thursday, May 24, 12
Samtools for SAM/BAM !les
Library and software package (C, Java)
Creating, sorting, indexing SAM & BAM
Visualizing alignments in command
SNP calling
Short indel detection
BAM (Binary representation of SAM) ~25% "le size reduction
Thursday, May 24, 12
Overview of computational analyses
Image analysisBase calling
Primary Analyses: MappingAssembly
Data typespeci!c analyses(e.g. peak calling,
calculate expression)
Custom projectspeci!c analyses
ChIP-Seq peak calling
RNA-Seq expression levelsgenome sequence
assembled contig
Thursday, May 24, 12
Visualization
Integrated Genome Viewer (Broad Inst.)
Custom tracks at UCSC Genome Browser
Thursday, May 24, 12
Visualization
Thursday, May 24, 12
Integrated Genome Viewer
Imports many mentioned formats (SAM, BAM, BED etc)
Excellent for visualization of RNA-Sequencing or ChIP-sequencing data
Can also download/visualize data from public or private servers
Thursday, May 24, 12
UCSC Genome Browser
Recently introduced new formats for e#cient viewing of large data sets:
- BedGraph
- BigWig
Add as custom tracks (slower)
Thursday, May 24, 12
Peak characteristics di"er with signal
Thursday, May 24, 12
Peak characteristics di"er with signal
H3K4me3: Sharp promoter peaksH3K36me3: Broad transcription elongation signal
Thursday, May 24, 12
Important !le formats
Sequences: FastQ
Aligned reads: SAM/BAM
Genome annotations: Bed, G$
Coverage: Wig, (Tdf )
http://genome.ucsc.edu/FAQ/FAQformat.html
Thursday, May 24, 12
BED format
chrom6W6The6name6of6the6chromosome6(e.g.6chr3,6chrY,6chr2_random)6or6scaffold6(e.g.6scaffold10671).
chromStart6W6The6starSng6posiSon6of6the6feature6in6the6chromosome6or6scaffold.6The6first6base6in6a6chromosome6is6numbered60.
chromEnd6W6The6ending6posiSon6of6the6feature6in6the6chromosome6or6scaffold.6The6chromEnd6base6is6not6included6in6the6display6of6the6feature.6
For6example,6the6first61006bases6of6a6chromosome6are6defined6as6chromStart=0,6chromEnd=100,6and6span6the6bases6numbered60W99.
http://genome.ucsc.edu/FAQ/FAQformat.html
track name=pairedReads description="Clone Paired Reads" useScore=1chr22 1000 5000
Thursday, May 24, 12
BED continued
strand - Defines the strand - either '+' or '-'.thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.blockCount - The number of blocks (exons) in the BED line.blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
track name=pairedReads description="Clone Paired Reads" useScore=1chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
Thursday, May 24, 12
Variable step Fixed step
variableStep chrom=chr2300701 12.5300702 12.5300703 12.5300704 12.5300705 12.5is equivalent to:variableStep chrom=chr2 span=5300701 12.5
fixedStep chrom=chr3 start=400601 step=100112233
WIG format
Wiggle format (WIG) allows the display of continuous-valued data in a track format
Thursday, May 24, 12
Data Repositories
Short Read Archive (fastq) [discontinued!]http://www.ncbi.nlm.nih.gov/sraEuropean Nucleotide Archive
Gene Expression Omnibus (bed, wig, fastq)http://www.ncbi.nlm.nih.gov/geo/
Thursday, May 24, 12
SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics
http://seqanswers.com/Thursday, May 24, 12