IMGS 2012 Bioinformatics Workshop:
File Formats for Next Gen Sequence Analysis
[Figure: sequencing throughput (gigabases) and cost per Kb by year, 1990 to 2009. Credit: Lucinda Fulton, The Genome Center at Washington University]
Sequencing Technologies
http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png
Sequence “Space”
• Roche 454 – Flow space
  – Measures pyrophosphate released by a nucleotide when it is added to a growing DNA chain
  – Flow space describes sequence in terms of these base incorporations
  – http://www.youtube.com/watch?v=bFNjxKHP8Jc
• AB SOLiD – Color space
  – Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  – Each base is sequenced twice
  – http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
• Illumina/Solexa – Base space
  – Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  – Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
  – http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
• GenomeTV – Next Generation Sequencing (lecture)
  – http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
Flexible
  – Good: with rapidly changing data/tech
  – Poor: validation
Human Readable
  – Convenient for de-bugging
  – Computer doesn’t care!
Sequences: FASTA, FASTQ, SAM/BAM
Alignments: SAM/BAM, MAF
Annotations: BED, GTF, GFF3, GVF, VCF
http://genome.ucsc.edu/FAQ/FAQformat.html
http://www.sequenceontology.org/
FASTQ
FASTA
FASTQ: Data Format
• FASTQ
  – Text based
  – Encodes sequence calls and quality scores with ASCII characters
  – Stores minimal information about the sequence read
  – 4 lines per sequence
    • Line 1: begins with “@”, followed by the sequence identifier and an optional description
    • Line 2: the sequence
    • Line 3: begins with “+”, optionally followed by the sequence identifier and description
    • Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
  – http://maq.sourceforge.net/fastq.shtml
  – Cock et al. (2009). Nuc Acids Res 38:1767-1771.
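The four-line record layout described above can be sketched in Python as follows. This is a minimal illustration, not a full parser: it assumes every record occupies exactly four lines (no wrapped sequence or quality lines), and the names `FastqRecord` and `read_fastq`, as well as the example read, are made up for this sketch.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Hypothetical record, for illustration only.
example = "@read1 sample description\nGATTACA\n+\nIIIIHHG\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, records[0].sequence, records[0].quality)
```

The length check reflects the format's one-quality-character-per-base rule; real files with wrapped lines would need a more tolerant reader.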
Sequence data format
FASTQ Example
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
For analysis, it may be necessary to convert to the Sanger form of FASTQ.
FASTQ: Details
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
Q = Phred quality score
P = Base-calling error probability
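The rows of the table follow from the standard Phred definition, Q = -10 log10(P). A short Python sketch of the conversion in both directions is shown here; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Recompute the table rows from the definition.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.3f} %")
```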
Quality scores
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger        Phred+33,  raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
Format/Platform   Quality score type   ASCII encoding
Sanger            Phred: 0 to 93       33-126
Solexa            Solexa: -5 to 62     64-126
Illumina 1.3      Phred: 0 to 62       64-126
Illumina 1.5      Phred: 0 to 62       64-126
Illumina 1.8      Phred: 0 to 93       33-126   *** Sanger format!
Quality score encodings differ among the platforms
Most analysis tools require Sanger FASTQ quality score encoding
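Converting a Phred+64 quality string (Illumina 1.3 to 1.7) to the Sanger Phred+33 encoding amounts to shifting each character by the difference in offsets. The sketch below is illustrative: the function names and the example string are made up, and it does not cover Solexa+64 scores, which are on a different scale and need a score conversion rather than a simple shift.

```python
from typing import List


def illumina64_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for char in quality:
        score = ord(char) - 64  # decode with the Illumina 1.3+ offset
        if score < 0:
            raise ValueError(f"{char!r} is below the Phred+64 range")
        converted.append(chr(score + 33))  # re-encode with the Sanger offset
    return "".join(converted)


def decode_sanger(quality: str) -> List[int]:
    """Return the numeric Phred scores of a Phred+33 quality string."""
    return [ord(char) - 33 for char in quality]


# Hypothetical Phred+64 quality string, for illustration only.
sanger = illumina64_to_sanger("hhh@B")
print(sanger, decode_sanger(sanger))
```

The range check is a cheap guard against feeding the function a string that is already Phred+33, since such strings usually contain characters below ASCII 64.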
http://main.g2.bx.psu.edu/
SAM (Sequence Alignment/Map)
• SAM is the output of aligners that map reads to a reference genome
  – Tab delimited, with a header section and an alignment section
    • Header lines begin with “@” (the header section is optional)
    • Alignment section has 11 mandatory fields
  – BAM is the binary format of SAM
http://samtools.sourceforge.net/
Alignment data format
http://samtools.sourceforge.net/SAM1.pdf
Mandatory Alignment Fields
http://samtools.sourceforge.net/SAM1.pdf
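As a rough illustration of the 11 mandatory fields, the Python sketch below splits one tab-delimited alignment line into named columns. The field names follow the SAM specification (older versions of the specification call RNEXT, PNEXT, and TLEN by the names MRNM, MPOS, and ISIZE); the helper names and the example line are assumptions made for this sketch.

```python
from typing import Dict, List, Union

# The 11 mandatory columns, in file order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

SamRecord = Dict[str, Union[int, str, List[str]]]


def is_header(line: str) -> bool:
    """Header lines begin with '@'; everything else is an alignment line."""
    return line.startswith("@")


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into the mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record: SamRecord = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# Illustrative alignment line, not taken from a real data set.
line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
print(record["QNAME"], record["RNAME"], record["POS"], record["CIGAR"])
```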
Alignment Examples
Alignments in SAM format
CIGAR string -> 8M2I4M1D3M
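A CIGAR string such as 8M2I4M1D3M is a run-length encoding of alignment operations: here 8 aligned bases, a 2-base insertion, 4 aligned bases, a 1-base deletion, and 3 aligned bases. The Python sketch below (function names are illustrative) splits the string into operations and counts how many read bases and reference bases it covers, using the operation letters defined in the SAM specification.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def read_length(cigar: str) -> int:
    """Bases of the read consumed: M, I, S, = and X operations."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def reference_span(cigar: str) -> int:
    """Bases of the reference consumed: M, D, N, = and X operations."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


cigar = "8M2I4M1D3M"
print(parse_cigar(cigar), read_length(cigar), reference_span(cigar))
```

For this example the read contributes 8 + 2 + 4 + 3 = 17 bases while the alignment spans 8 + 4 + 1 + 3 = 16 reference positions.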
Annotation Formats
• Mostly tab-delimited files that describe the location of genome features (e.g., genes)
• Also used for displaying annotations on standard genome browsers
• Important for associating alignments with specific genome features and descriptions
• Knowing format details can be important when translating results!
  – BED is zero based
  – GTF/GFF are one based
GTF
http://useast.ensembl.org/info/website/upload/gff.html
Annotation data format
chr1   86114265  86116346  nsv433165
chr2   1841774   1846089   nsv433166
chr16  2950446   2955264   nsv433167
chr17  14350387  14351933  nsv433168
chr17  32831694  32832761  nsv433169
chr17  32831694  32832761  nsv433170
chr18  61880550  61881930  nsv433171

chr1  16759829  16778548  chr1:21667704   270866  -
chr1  16763194  16784844  chr1:146691804  407277  +
chr1  16763194  16784844  chr1:144004664  408925  -
chr1  16763194  16779513  chr1:142857141  291416  -
chr1  16763194  16779513  chr1:143522082  293473  -
chr1  16763194  16778548  chr1:146844175  284555  -
chr1  16763194  16778548  chr1:147006260  284948  -
chr1  16763411  16784844  chr1:144747517  405362  +
BED format
BED: zero based, start inclusive, stop exclusive
  Length = stop - start
GTF/GFF: one based, inclusive
  Length = stop - start + 1
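Because the two conventions differ only in how the start coordinate is counted, converting between them is a one-position shift of the start while the stop stays the same. The Python sketch below illustrates this with the first interval from the BED example (chr1 86114265 86116346); the function names are made up for this sketch.

```python
from typing import Tuple


def bed_to_gff(start: int, stop: int) -> Tuple[int, int]:
    """Zero-based, half-open (BED) to one-based, inclusive (GTF/GFF)."""
    return start + 1, stop


def gff_to_bed(start: int, stop: int) -> Tuple[int, int]:
    """One-based, inclusive (GTF/GFF) to zero-based, half-open (BED)."""
    return start - 1, stop


def bed_length(start: int, stop: int) -> int:
    return stop - start


def gff_length(start: int, stop: int) -> int:
    return stop - start + 1


bed_interval = (86114265, 86116346)
gff_interval = bed_to_gff(*bed_interval)
# The feature length must be the same under either convention.
print(gff_interval, bed_length(*bed_interval), gff_length(*gff_interval))
```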
GRCh37
NCBI36