Post on 26-May-2020
8/11/16
1
DB
IGV
SERVER/REMOTE
Personal Computer/Local
Terminal
Data files
Applications and Servers
Compute
SSH
WEB
WEB
App
Service
Data files
SCP
Window
NGSDataandSequenceAlignment
ManpreetS.KatariAug11,2016
Outline
• NGSData• FastA• FastQ• SAM• BAM• GFF
• SequenceAlignment• GlobalvsLocal• DynamicProgramming• BurrowWheeler’sAlgorithm.
ImportantfilestypesFASTA
FASTQ
SAMBAM
GFF
Sequencefiles
Alignmentfiles
Annotationfiles
Importantfiletypes:FASTAAsequenceinFASTAformatbeginswithasingle-linedescription,followedbylinesofsequencedata.Thedescriptionlineisdistinguishedfromthesequencedatabyagreater-than(">")symbolinthefirstcolumn.Thewordfollowingthe">"symbolistheidentifierofthesequence,andtherestofthelineisthedescription(bothareoptional).
Thereshouldbenospacebetweenthe">"andthefirstletteroftheidentifier.Itisrecommendedthatalllinesoftextbeshorterthan80characters.Thesequenceendsifanotherlinestartingwitha">"appears;thisindicatesthestartofanothersequence.
Importantfiletypes:FASTA>chrICCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
GCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACA
CACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT
CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTGGCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAA
TATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACACAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTAGCGT
8/11/16
2
Importantfiletypes:FASTA>seq0FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM>seq2EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK>seq4EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
Importantfiletypes:FASTQ
FASTQformatisatext-basedformatforstoringbothabiologicalsequence(usuallynucleotidesequence)anditscorrespondingqualityscores.BoththesequenceletterandqualityscoreareeachencodedwithasingleASCIIcharacterforbrevity
Fastq formatReadIdentifier
ReadSequence
ReadSequenceQuality
FASTQ:DataFormat• FASTQ
• Textbased• EncodessequencecallsandqualityscoreswithASCIIcharacters• Storesminimalinformationaboutthesequenceread• 4linespersequence
• Line1:beginswith@;followedbysequenceidentifierandoptionaldescription
• Line2:thesequence• Line3:beginswiththe“+”andisfollowedbysequenceidentifiersanddescription(bothareoptional)
• Line4:encodingofqualityscoresforthesequenceinline2
• References/Documentation• http://maq.sourceforge.net/fastq.shtml• Cocketal.(2009).Nuc AcidsRes38:1767-1771.
Sequencedataformat
Phred QualityScore Probabilityofincorrectbasecall Basecallaccuracy
10 1in10 90 %
20 1in100 99 %
30 1in1000 99.9 %
40 1in10000 99.99 %
50 1in100000 99.999 %
Q=Phred QualityScoresP=Base-callingerrorprobabilities
Qualityscores
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~| | | | | |33 59 64 73 104 126
S - Sanger Phred+33, raw reads typically (0, 40)X - Solexa Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)J - Illumina 1.5+ Phred+64, raw reads typically (3, 40)
with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
Format/Platform QualityScoreType ASCIIencodingSanger Phred:0-93 33-126Solexa Solexa:-5-62 64-126Illumina 1.3 Phred:0-62 64-126Illumina 1.5 Phred:0-62 64-126Illumina 1.8 Phred:0-62 33-126 ***Sangerformat!
Qualityscoreencodingdifferamongtheplatforms
MostanalysistoolsrequireSangerfastq qualityscoreencoding
8/11/16
3
http://en.wikipedia.org/wiki/FASTQ_format
FASTQQualityscores
http://en.wikipedia.org/wiki/FASTQ_format
FASTQQualityscores
CapitalLettersaregood!
SAM(SequenceAlignment/Map)
• SAMistheoutputofalignersthatmapreadstoareferencegenome
• Tabdelimitedw/headersectionandalignmentsection
• Headersectionsbeginwith@(areoptional)• Alignmentsectionhas11mandatoryfields
• BAMisthebinaryformatofSAM
http://samtools.sourceforge.net/
Alignmentdataformat
http://samtools.sourceforge.net/SAM1.pdf
MandatoryAlignmentFields
BitwiseFlag
0x20= 16^1*2+16^0*0 = 32What is 77? Find greatest value without going over77-64 = 13 4013- 8 = 5 8
5- 4 = 1 41- 1 = 0 1
What is 141?
12481632641282565121024
CIGARstring
29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M
8/11/16
4
SAMformat SAMformatgetsconvertedtoaBAMfile
AnnotationFormats
• Mostlytabdelimitedfilesthatdescribethelocationofgenomefeatures(i.e.,genes,etc.)
• Alsousedfordisplayingannotationsonstandardgenomebrowsers• Importantforassociatingalignmentswithspecificgenomefeatures• Descriptions
Columns:Seqid,Source,Type,Start,End,Score,Strand,Phase,Attribute(Identifier)
GFFformat
Columns:Seqid,Source,Type,Start,End,Score,Strand,Phase,Attribute(Identifier)
Globalandlocalapproachestoaligningsequences
24
GLOBAL: Attempt to “match” and assess similarity between two entire sequences
LOCAL: Find subsequences of high similarity
… and then possibly “stick” (chain, net, thread) together local alignments to
obtain an overall comparison of the original sequences.
The second approach is more meaningful
(especially for long sequences, of different lengths, like whole
genomes)
Two protein or DNA sequences are unlikely to present a straightforward overall “match”, even if they are closely
related.
Why? Substitutions are not the only process by which they diverge:
insertions, deletions and rearrangements
8/11/16
5
25
Dynamicprogramming:Pairwisesequencealignment
Createamatrix tableforcomparingsequenceswithonesequencealongeachaxis(sizem+1,n+1)
Fillinpartialalignmentscores untilscoreforentiresequencehasbeencalculated:- Assignscoreforeachpositioni,j,progressingfrom
toplefttobottomright
- Scorerule:takemaximumofthethreechoices1.Takevaluefrom left,assigngappenalty(gapalongleftaxis)
2.Takevaluefrom top,assigngappenalty(gapalongtopaxis)
3.Takevaluefrom diagonalaboveleft,assignmatch/m ismatchscore
- Repeat untiltableiscompleted.
Usetrace-back toobtainfullalignment:AC--TCG
ACAGTAG
BLAST: heuristic database search using local alignmentBasic Local Alignment Search Tool
26
AcartoonforBLAST
(parameters)
…TLSRDQHAWRLS……
QW(queryword),sizeW=3.
{(RDQ ,16)(RBQ ,14)…(REQ ,12)…(RDB,11)} ForeachQW,usethescoringmatrixtoformaneighborhoodNB={allwordsofsizeWwithascore> T=11}
…TLSRDQ HAWRLS………RLSREQ HTWRSS……
Findmatch(es)towordsbelongingtoNBinthesubject(target)sequence
Foreachmatch,usethescoringmatrixandgappenalties toproduceanHSP(HighscoringSegmentPair),thenextendalignmentonbothsides,until• dropis>X,or• scoregoesbelowSmin (minimumhitscore)
querysequence
Smin
X
(cumulative)score
AboutBlat(fromgenome.ucsc.edu)• “BLATonDNAisdesignedtoquicklyfindsequencesof95%andgreatersimilarityoflength25basesormore.“
• “Itmaymissmoredivergentorshortersequencealignments.Itwillfindperfectsequencematchesof20bases.“
• “BLATisnotBLAST.”
• “DNABLATworksbykeepinganindexoftheentiregenomeinmemory.Theindexconsistsofalloverlapping11-merssteppingby5exceptforthoseheavilyinvolvedinrepeats.”
• “Theindextakesupabout2gigabytesofRAM.Thegenomeitselfisnotkeptinmemory,allowingBLATtodeliverhighperformanceonareasonablypricedLinuxbox.“
• “Theindexisusedtofindareasofprobablehomology,whicharethenloadedintomemoryforadetailedalignment.”
HashTable(BLAT)
28
ShortReadApplications
Findingthealignmentsistypicallytheperformancebottleneck
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…GCGCCCTA
GCCCTATCGGCCCTATCG
CCTATCGGACTATCGGAAA
AAATTTGCAAATTTGC
TTTGCGGTTTGCGGTA
GCGGTATA
GTATAC…
TCGGAAATTCGGAAATTT
CGGTATAC
TAGGCTATA
GCCCTATCGGCCCTATCG
CCTATCGGACTATCGGAAA
AAATTTGCAAATTTGC
TTTGCGGT
TCGGAAATTCGGAAATTTCGGAAATTT
AGGCTATATAGGCTATATAGGCTATATGGCTATATG
CTATATGCG
…CC…CC…CCA…CCA…CCAT
ATAC…C…C…
…CCAT…CCATAG TATGCGCCC
GGTATAC…CGGTATAC
GGAAATTTG
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…ATAC……CC
GAAATTTGC
NewalignmentalgorithmsmustaddresstherequirementsandcharacteristicsofNGSreads
• Millionsofreadsperrun(30xofgenomecoverage)• ShortReads(asshortas36bp)• Differenttypesofreads(single-end,paired-end,mate-pair,etc.)• Base-callingqualityfactors• Sequencingerrors(~1%)• Repetitiveregions• Sequencingorganismvs.referencegenome• Mustadjusttoevolvingsequencingtechnologiesanddataformats
8/11/16
6
Indexing• Genomesandreadsaretoolargefordirectapproacheslikedynamicprogramming
• Indexing isrequired
• ChoiceofindexiskeytoperformanceSuffixtree Suffixarray Seedhashtables
Manyvariants,incl.spacedseeds
SuffixArray Find"ctat”inthereference
• InventedbyDavidWheelerin1983(BellLabs).Publishedin1994.“ABlockSortingLosslessDataCompressionAlgorithm”
SystemsResearchCenterTechnicalReportNo124.PaloAlto,CA:DigitalEquipmentCorporation,BurrowsM,WheelerDJ.1994
• Originallydevelopedforcompressinglargefiles(bzip2,etc.)
• Lossless,FullyReversible
• AlignmentToolsbasedonBWT:bowtie,BWA,SOAP2,etc.
• Approach:• Alignreadsonthetransformed referencegenome,usinganefficientindex(FMindex)• Solvethesimpleproblemfirst(alignonecharacter)andthenbuildonthatsolutiontosolveaslightly
harderproblem(twocharacters)etc.
• Resultsingreatspeedandefficiencygains(afewGigaByte ofRAMfortheentireH.Genome).OtherapproachesrequiretensofGigaBytes ofmemoryandaremuchslower.
NGSReadAlignmentBurrowsWheelerTransformation(BWT)
c t g a a a c t g g t $t g a a a c t g g t $ cg a a a c t g g t $ c ta a a c t g g t $ c t ga a c t g g t $ c t g aa c t g g t $ c t g a ac t g g t $ c t g a a at g g t $ c t g a a a cg g t $ c t g a a a c tg t $ c t g a a a c t gt $ c t g a a a c t g g$ c t g a a a c t g g t
Text= c t g a a a c t g g t $
Ø Introduce$attheendandconstructallcyclicpermutationsofText
BurrowsWheelerTransformation
$ c t g a a a c t g g ta a a c t g g t $ c t ga a c t g g t $ c t g aa c t g g t $ c t g a ac t g a a a c t g g t $c t g g t $ c t g a a ag a a a c t g g t $ c tg g t $ c t g a a a c tg t $ c t g a a a c t gt $ c t g a a a c t g gt g a a a c t g g t $ ct g g t $ c t g a a a c
BWT(Text)= t g a a $ a t t g g c cBurrowsWheelerMatrix
• Sortrowsalphabetically,keepingofwhichrowwentwhere
BurrowsWheelerTransformation
ExactMatchingwithFMIndex
• Inprogressiverounds,top &bot delimittherangeofrowsbeginningwithprogressivelylongersuffixesofQ