Post on 21-Dec-2015
Bioinformatic Techniques & Tools for SNP Analysis
Jill L Wegrzyn
Department of Plant Sciences University of California at Davis, Davis, CA 95616 USAE-mail: jlwegrzyn@ucdavis.edu
SNP Analysis Agenda
Sequence-Based SNP Identification
Common Bioinformatic SolutionsPhred, Phrap, Consed, Polyphred, and Polybayes
Re-Sequencing Project (ADEPT2)
High-Throughput SNP Identification Solution (PineSAP)
Finding SNPs: Sequence-based SNP MiningFinding SNPs: Sequence-based SNP Mining
Sequence Overlap - SNP DiscoverySequence Overlap - SNP Discovery
GTTACGCCAATACAGGTTACGCCAATACAGGGATCCAGGAGATTACCATCCAGGAGATTACCGTTACGCCAATACAGGTTACGCCAATACAGCCATCCAGGAGATTACCATCCAGGAGATTACC
Genomic Genomic
RRSRRSLibraryLibrary
ShotgunShotgunOverlapOverlap
BACBACLibraryLibrary
BACBACOverlapOverlap
DNADNASEQUENCINGSEQUENCING
mRNAmRNA
cDNAcDNALibraryLibrary
ESTESTOverlapOverlap
RandomRandomShotgunShotgun
Align toAlign toReferenceReference
Comprehensive SNP Discovery: ResequencingComprehensive SNP Discovery: Resequencing
• Overlapping PCR Amplicons across entire gene
• Make no assumptions about sequence function
• Sequence diversity and genetic structure for each gene is different
• Proper association studies can only be designed in this context
• Complete resequencing facilitates population genetics methods
Sequence each end
of the fragment.
Base-calling
Quality determination
Contig assembly
Final quality determination
Sequence viewing
Polymorphism tagging
Polymorphism reporting
Individual genotyping
Polymorphism detection
PolyPhred/Polybayes
Consed
Analysis
Sequence Phred PhrapAmplify DNA
5’ 3’
Sequence-based SNP Identification
Phylogenetic analysis
ATAGACG ATACACG ATAGACG ATACACG
ATAGACGATACACG
Homozygotes Heterozygote
What is Phred/Phrap/Consed ?
Phred/Phrap/Consed is a DNA sequence analysis package from University of Washington:
Trace file (chromatograms) reading
Quality (confidence) assignment to each individual base
Vector and repeat sequences identification and maskingSequence assembly
Assembly visualization and editing
Automatic finishing
Phred, Phrap, Consed, Polyphred, Polybayes
phred: Base calling and quality assignments
phrap: Contig formation and new quality assignments
consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces
polyphred: find polymorphisms in phrap contigs, quality calls, add data to phrap files to permit consed finding and visualization of polymorphisms.
polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.
Files and Program Execution
Input files: ABI chromat or SCF files
Order of Program execution: phred, then phrap, then consed or: phred, then phrap, then polyphred,
polybayes, then consed or: phred, then phrap, then consed, then
polyphred, polybayes, then consed
Trace File
High quality region – no ambiguities
Trace File
Medium quality region – some ambiguities
Trace File
Poor quality region – low confidence
What is Phred?
• Phred observes the base trace, makes base calls, and assigns quality values (qv) of bases in the sequence. • Writes base calls and qv to output files for Phrap assembly. • Quality values (phred scores) range from 0 to 60. 20 and above is considered a confident base call (1 in 100 chance that is has been called wrong ~99% accuracy)
• Useful for consensus sequence construction
ATGCATTC string1 CGTTCATGC string2 ATGC-TTCATGC superstring
• Here we have a mismatch ‘A’ and ‘G’, the qv will determine the dash in the superstring. The base with higher qv will replace the dash.
Phred value formula
q = - 10 x log10 (p)
whereq - quality valuep - estimated probability error for a base call
Examples:Examples:
qq = 20 means = 20 means pp = 10 = 10-2-2 (1 error in 100 bases) (1 error in 100 bases)
qq = 40 means = 40 means pp = 10 = 10-4-4 (1 error in 10,000 (1 error in 10,000 bases)bases)
Why Phred?
• Output sequence might contain errors.• Vector contamination• Dye-terminator reaction might not occur.• Weak or variable signal strength of peak
corresponding to a base.
The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.gABI_THUMBPRINT: PHRED_VERSION: 0.990722.gCALL_METHOD: phredQUALITY_LEVELS:99TIME: Thu May 24 00:18:58 2001TRACE_ARRAY_MIN_INDEX: 0TRACE_ARRAY_MAX_INDEX: 12153TRIM: CHEM: termDYE: big
END_COMMENT
BEGIN_DNAt 8 5c 13 17a 19 26c 19 32
t 6 11908t 6 11908a 6 11921a 6 11921g 6 11927g 6 11927t 6 11947t 6 11947c 6 11953c 6 11953a 6 11964a 6 11964g 6 11981g 6 11981c 4 11994c 4 11994n 4 12015n 4 12015c 4 12037c 4 12037n 4 12044n 4 12044n 4 12058n 4 12058n 4 12071n 4 12071n 4 12085n 4 12085n 4 12098n 4 12098n 4 12111n 4 12111n 4 12124n 4 12124c 4 12144c 4 12144n 4 12151n 4 12151END_DNAEND_DNA END_SEQUENCEEND_SEQUENCE
t 24 2221t 24 2221a 24 2232a 24 2232a 22 2245a 22 2245a 27 2261a 27 2261g 25 2272g 25 2272c 19 2286c 19 2286c 12 2302c 12 2302t 19 2314t 19 2314g 12 2324g 12 2324g 15 2331g 15 2331g 19 2346g 19 2346g 23 2363g 23 2363t 33 2378t 33 2378g 36 2390g 36 2390c 44 2404c 44 2404c 44 2419c 44 2419t 39 2433t 39 2433a 39 2446a 39 2446a 34 2460a 34 2460t 35 2470t 35 2470g 34 2482g 34 2482
t 16 8191t 16 8191g 19 8200g 19 8200t 13 8211t 13 8211c 13 8229c 13 8229g 4 8241g 4 8241n 4 8253n 4 8253c 4 8263c 4 8263t 10 8276t 10 8276t 9 8286t 9 8286c 12 8301c 12 8301t 16 8313t 16 8313c 12 8329c 12 8329c 12 8336c 12 8336c 15 8343c 15 8343t 19 8356t 19 8356c 9 8371c 9 8371g 13 8386g 13 8386g 14 8397g 14 8397a 7 8417a 7 8417g 9 8427g 9 8427g 4 8445g 4 8445
Phrap
• Phrap constructs the contig sequence as a mosaic of the highest quality parts of the reads rather than a statistically computed “consensus”.
• Avoids the complex algorithm issues associated with multiple alignment methods
• Speed and accuracy• Phrap is an assembler NOT an aligner
• The sequence produced by Phrap is quite accurate• Less than 1 error per 10 kb in typical datasets.
• Phrap considers sequence quality at a given position (determined by Phred)
Running phredPhrapConstruct directory structure
directory for the run, containing subdirectories chromat_dir, edit_dir, phd_dir, poly_dir
Move tracefiles to directory chromat_dir
Run phredPhrap from directory edit_dir Determine phredPhrap parameters
Default Adjusting Parameters
The greater the forcelevel, the more liberal the assembly (generally produces fewer contigs).
however results in more manual editing reviewing polymorphisms later
Phrap output files
• *.contigs*.contigs – fasta file containing the contigs– fasta file containing the contigs- Contigs with more than one readContigs with more than one read
- Singletons (single reads with a match to some other contig Singletons (single reads with a match to some other contig but could not be merged consistently)but could not be merged consistently)
• *.singlets*.singlets –– fasta file of the singlet readsfasta file of the singlet reads- Reads with no match to other readReads with no match to other read
• *.ace*.ace – for viewing the assembly w/ Consed– for viewing the assembly w/ Consed
• *.view*.view – for viewing the assembly w/ – for viewing the assembly w/ PhrapviewPhrapview
Consed
Consed is a program for viewing and editing Consed is a program for viewing and editing assemblies produced by Phrapassemblies produced by Phrap
a. Assembly viewer a. Assembly viewer - allows for visualization of - allows for visualization of contigs, assembly (aligned reads), quality contigs, assembly (aligned reads), quality values of reads and final sequence.values of reads and final sequence.
b. Trace file viewer b. Trace file viewer – single and multiple trace – single and multiple trace files can be visualized allowing for comparison files can be visualized allowing for comparison of a given sequence in several reads.of a given sequence in several reads.
c. Navigationc. Navigation – – identify and list regions which are identify and list regions which are below a given quality threshold, contain high below a given quality threshold, contain high quality discrepancies, single-strand coverage, quality discrepancies, single-strand coverage, etc.etc.
Phred/Phrap/Consed Pipeline
Chromat_dirChromat_dir
Phd_dirPhd_dir
Edit_dirEdit_dir
Directories:Directories:
Assembly viewing/editingConsed
AssemblyPhrapassembled contigs - seqs_fasta.screen.contigsassembly file - seqs_fasta.screen.ace#
Vector screening and maskingCross_Match (local alignment program) x vector.seqscreened/masked file - seqs_fasta.screen
Conversion - phd to fastaphd2fasta.plnucleotide sequences - seqs_fastaquality values - seqs_fasta.screen.qual
Quality (confidence) values assignmentPhredphd files - *.phd
Inputchromatogram files
Using PolyPhred to Visualize SNPs
•Compares sequences across traces obtained from different individuals to identify sites for SNPs. •Will occasionally miscall genotypes - frequency of such mistakes depends on the sequencing chemistry used to generate the trace. •To reduce the number of miscalled sites, ignores regions of poor quality & ends
Polyphred polyphred -ace gene_file.ace.1 -tag p -snp hom -indel -f 50 >
gene_file.polyphred.out
where gene_file.ace.1 is the .ace file present in the edit_dir directory. Note: this command will give polymorphisms of quality 'ranks' 1
through 3: 1 is the highest quality, 6 is the lowest quality The qualifier -f 50 is used to list 50bp of flanking sequence on each
side of the detected polymorphism The qualifier -indel is used to identify insertions or deletions The qualifier -snp hom is used to identify homozygous SNPs only The qualifier -tag p is used to list the tagged polymorphisms in the
polyphred output file *.polyphred.out.
It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.
Polyphred
Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.
Reads the PHD and POLY files associated with each trace.
During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence
The score indicates how well the trace at the site matches the expected pattern for a SNP.
Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.
Polybayes
Bayesian statistical model takes into account:
- depth of coverge- base quality values of the sequences-a priori expected rate of polymorphic sites in region
Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation
**Can also integrate into the poly files for viewing with Consed
Allele Discovery of Economic Pine Traits
Goals Re-sequence ~8,000 amplicons in
Loblolly Pine SNP Discovery Genotyping of 3,000 trees using Illumina
technology Phenotyping of same 3,000 trees Association studies
EST Databases(UGA, UMN)
Assemble Contigs
PrimerDesign
PrimerValidation
Diversity PanelDNA Extraction
Megagametophyte
Agencourt Sequencing18 individuals & 5 other
SNP IdenticationIllumina Genotyping
PhenotypingDrought, Dieases, & Wood
AssociationStudies
Population DNA Extraction(Needles)
Resequencing Project Pipeline
Primer Selection and Validation
Primer SelectionUsing OSP primer picking
Sequencing on validation DNA
Primer synthesis at Illumina
Successes - select 1 primer set per amplicon
BLASTing of validation DNA sequence back to EST database
Successes
Sequencing of diversity panel - 1 primer set per amplicon which has been successful at each step
Successes
Report to TreeGenes - several primers per amplicon will be stored in TreeGenes
Report to TreeGenes
Report to TreeGenes
Report to TreeGenes
Species Num. Successful Amplicons
Pinus taeda 7,424
Pinus radiata 6,249 (84%)
Pinus lamertiana
2,234 (30%)
Picea abies 1,024 (13%)
Pseudotsuga menziesii
750 (10%)
**312 successful in all 5 species
AlignmentCritical in the automation of base calls Commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for
alignments Many commonly used aligners work best with protein sequences or
with a reference sequence Preservation of quality scores for input into SNP identification programs Speed for high-throughput programs
Automated SNP Calls- Reference Sequence Required- Traditional approaches without reference sequence include “eSNPs”
(human, maize, and pine) -Very little redundancy outside of abundant genes-Overall high number of false positives (single pass reads)
- Not specific to frequencies observed in different organisms- High number of false positives in currently accepted methods
(Polybayes & Polyphred)
Alignment and SNP Calling PipelineChallenges in High-Throughput SNP Identification
PineSAPRe-Sequencing datafrom Agencourt:
Initial Processing
Base Calling Sequence Alignment
SNP Identification
Machine Learning
Data Storage & Release
Base Calling and Sequence Alignment
Modified PhredPhrapallows for trimming of bases from start and end of sequence based on trace quality
Ace2FASTAConverts native PhredPhrap output (ace file) into an unaligned FASTA file
ProbconsRNAOptimal DNA sequence alignment program
AlignedContig2ReadFASTAProvides single multifasta file with all reads aligned to the contig from PhredPhrap AND the contigs alignment to the other contigs from probconsRNA
FASTA2AceConverts resulting FASTA file back into ace file for SNP Identification
Number of contigs Percentage
1 23.67%
2 27.00%
3 19.67%
4 14.00%
5 6.00%
6 4.33%
7 2.67%
8 2.33%
Phrap assembly
Alignment Pipeline Success
Problems (out of 356 validation set)
Sequencing problems 3 0.84%
Problems in end or middle 4 1.12%
F&R don't overlap 2 0.56%
Alignment problems due to low quality 1 0.28%
Alignment problem due to wrong direction in original phrap alignment 2 0.56%
Summary (out of 356 validation set)
Correct alignments 344 96.63%
Alignments we are working on fixing +5 98.03%
QuickTime™ and aTIFF (LZW) decompressor
are needed to see this picture.
SNP Identification and Machine Learning
Algorithms applied Independently for SNP Identification:
Polyphred and Polybayes both utilizealignment and base quality information to identify potential polymorphisms.
Both programs yield a high # of false positives
Machine LearningUtilizes sequence based information and compares actual base calls to proposed calls to develop SNP identification filtersspecific to pine
Parameters under Consideration: Local and global quality scores, frequencies of major and minor alleles, local and global SNP frequences, relative positions in sequence to existing variation, and polyphred and polybayes scores
SNP Identification Overview
Examine features to improve the accuracy of SNP location prediction
Utilize machine learning to apply the features
Refine the accuracy of the learning algorithm through adjustments to feature representation (iterative process)
Utilize the final classifier against the large re-sequenced set to improve accuracy of SNP calls originating from Polybayes and Polyphred
Features Selected
Description RepresentationSequence Depth Continuous
Variation Type Categorical
Polybayes Score Continuous
Polyphred Score Continuous
Freq of major/minor alleles Continuous
Max quality of major/minor alleles Continuous
Local average quality Continuous
Overall average quality Continuous
Alignment Quality Continuous
Feature Representation Sequence depth
Count of number of sequences in the alignment at the position of variation.
All sequences in the alignment may not overlap at the position of variation;
number is different from the total number of the sequences in the alignment
Variation type Variation type can be SNP or INDEL
PolyBayes score PolyBayes program assigns a Bayesian posterior probability value for
each called SNP using the frequency priors given for observing a variation at that position
Polyphred score Polyphred assigns a score calculated primarily from sequence depth
and quality score.
Feature Representation
Base frequencies The number of occurrences of different bases at the
position of variation is important in determining a polymorphic position.
Frequencies of the first (major allele) and the second (minor allele) are represented as ratio to sequence depth.
Relative distance Sequence quality at the ends of the alignment tends to
be poor due to inherent limitations of current sequencing technology.
SNP position was represented as the ratio of the distance in the consensus sequence from the closest end, or the relative distance
Feature Representation
Sequence quality Variation is observed because of a poor quality base. Based on the base frequencies calculated:
maximum qualities of the major and minor alleles average qualities of major and minor alleles
Alignment quality Misalignment of bases caused by sequence alignment
programs sometimes result in an erroneous SNP call. In the neighborhood of the SNP (+/- 10 bases) all the
mismatches with the consensus sequence are given a penalty and the penalty is more if the mismatch is continuous
Classification Goal: Prediction
Software: Decision tree C4.5 program is open-source C code (WEKA)
At each point in the construction of the decision tree, C4.5 selects the feature to test based on maximum information gain.
The goal is to generate a minimum size tree that correctly classifies all the elements in the training set.
The size of the tree is the number of nodes (decision nodes) and the number of leaves are the classes (categories) that they are distributed to.
SNP Identification Datasets
Training set for loblolly pine was composed of a total of 300 validated sequences. Divided to represent the relative percentages of sequence source Testing set is composed of 120 validated sequence sets
Training set for poplar was composed of 42 validated sequences selected at random Testing set is composed of a total of 30 validated sequence sets.
Validation = manually observed FP, FN, TP, and TN SNP calls through observation of tracefiles in Consed.
Alignment and SNP Identification
Accuracy = (TP + TN)/total
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Alignment and SNP IdentificationEvaluation Criteria
Evaluation J48 Polyphred Polybayes
Accuracy 93.6 76.25 78.02
Sensitivity 88.21 83.22 86.54
Specificity 98.73 N/A N/A
Evaluation J48 Polyphred Polybayes
Accuracy 94.6 79.35 80.24
Sensitivity 90.54 85.01 88.14
Specificity 97.23 N/A N/A
Resulting Identifications SNPs have been called using PineSAP on
7424 amplicons representing 6924 Unigenes.
Average amplicon length is 445 bp Average of 5.5 SNPs/amplicons Pine is highly polymorphic! Total SNPS ~41,500 Distribution of SNPs/amplicon
1404 - 0 SNPs 1133 - 1 SNP 4887 - 2 or more SNPs
Custom lists can be exported into Illumina style input **option for adding IUPAC codes for SNPs in flanking
sequence
Alignment and SNP IdentificationSNP Formatting
Alignment and SNP IdentificationIllumina Design
QuickTime™ and aTIFF (LZW) decompressor
are needed to see this picture.
Data Storage and Release
SNP Identifcations
GENOTYPING
dbSNP NCBI SNP database
SNP Discovery: dbSNP database
SNP data submitted to dbSNPSNP data submitted to dbSNP
dbSNP processing of SNPsdbSNP processing of SNPs
SNPs submittedSNPs submittedby research communtiyby research communtiy((ssubmitted SNPs = ss#)ubmitted SNPs = ss#)
Unique mappingUnique mappingto a genome locationto a genome location((rreference eference SSNP = rs#)NP = rs#)
(by 2hit-2allele)(by 2hit-2allele)
Summary PineSAP improves
Inaccuracies introduced by using Phrap to align sequences
Time which would be required by using a aligner such as ProbconsRNA or ClustalW on its own
PineSAP has a 98% success rate when used to align loblolly resequencing data.
PineSAP identified a success list of features to enhance polymorphism predictions and generated a prediction accuracy of 93%
PineSAP provided a full alignment and polymorphism detection system that can be adapted to specific genomes
More Information…
Phred/Phrap/Consed: www.phrap.org/
Polyphred: droog.mbt.washington.edu/PolyPhred.html
Polybayes: genomeold.wustl.edu/groups/informatics/software/polybayes/