Sequence Analysis Hemant Kelkar Center for Bioinformatics University of North Carolina Chapel Hill,...
Date posted: 18-Dec-2015
Sequence Analysis
Hemant Kelkar
Center for Bioinformatics
University of North Carolina
Chapel Hill, NC 27599
Scope of Series
Talk I
• Overview and BLAST
Talk II
• Protein analysis/Sequence Alignment
Talk III
• Evolution
• Genomics and challenges
Bioinformatics
• Mathematical, statistical and computational methods used to solve biological problems
• Glue that holds the “omics” data together
Help …
• Is “my sequence” in the databases?
• Is it similar to any sequence in the DB?
• Does it have any known motifs/domains that can help in identification?
• Is there a structural homolog?
• Are there any polymorphisms?
• Genetic Map location?
Bioinformatics TOOLS!
Bioinformatics Tools
• Genetic Code
• Protein Structure
• Protein Evolution
Similarity search e.g. BLAST, FASTA
http://restools.sdsc.edu/biotools/biotools9.html
e.g. CLUSTALW, T-COFFEE, Phylip
Primary Sequence Databases
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
• PIR (http://pir.georgetown.edu/)
• Swiss-Prot (http://us.expasy.org/sprot/)
Sequence information as is generated in the laboratory
Derived Sequence Databases
• PFAM (http://www.sanger.ac.uk/Software/Pfam/) : Protein families based on HMM models
• InterPRO (http://www.ebi.ac.uk/interpro/) : Protein families and domains based on functional sites
• TransFac (http://www.gene-regulation.com/) transcription factor db
• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)
Databases based on functional or phylogenetic analysis
Derived Sequence Databases
• Flybase (http://www.flybase.org/) : Fly Genome
• Wormbase (http://www.wormbase.org/) : C. elegans
• Genome Browser (http://genome.ucsc.edu/) : Human and Mouse
• MGI (http://www.informatics.jax.org/) : Mouse
• Microbial Genome Resource : (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl)
Databases based on taxonomy
Sequence Alignments
• Provide a measure of relatedness between nucleotide or protein sequences
• This allows us to decipher:
Structural relationships
Functional relationships
Evolutionary relationships
Sequence Similarity Searches
• Information conserved evolutionarily
• DNA sequences NOT coding for proteins/rRNAs diverge rapidly
• When possible use protein sequences for similarity searches
• Non-homologous protein identification is much less reliable
• What is measured and what is inferred?
Similarity
• Is always based on an observable
• Usually expressed as % identity
• Quantifies the divergence of two sequences
• substitutions/insertions/deletions
• Residues crucial for structure and/or function
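As a small illustration of similarity as an observable, the sketch below computes % identity for two sequences that are already aligned. The function name and the choice to skip gapped columns are assumptions made here; other conventions for the denominator exist.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns.

    Assumption: columns with a gap ('-') in either sequence are skipped,
    so the denominator is the number of residue-vs-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gapped column: not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # The aligned pair used later in these slides.
    query = "SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE"
    sbjct = "TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS"
    print(f"{percent_identity(query, sbjct):.1f}% identity")
```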
Homology
• Homology always implies that the molecules share a common ancestor
• Absolute answer
• Molecules ARE or ARE NOT homologous
• No degrees
How to Find Similar Sequences
• Global Sequence Alignments
• Sequence comparison along entire length
• Homolog of similar length
• Local Sequence Alignments
• Similar regions in two sequences
• Regions outside the local alignment excluded
• Sequences of different length/similarity
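The difference between global and local alignment can be sketched with one dynamic-programming routine. The slides do not name an algorithm; the code below uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap cost. The match/mismatch/gap values are illustrative assumptions.

```python
def alignment_score(a: str, b: str, match: int = 5, mismatch: int = -4,
                    gap: int = -6, local: bool = False) -> int:
    """Best alignment score of a and b (score only, no traceback).

    local=False: global alignment, both sequences aligned end to end.
    local=True : local alignment, best-scoring pair of sub-regions.
    """
    # Row 0: global pays for leading gaps, local starts fresh at 0.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    x, y = "TTTACGTTT", "GGACGGG"
    print("global:", alignment_score(x, y))
    print("local :", alignment_score(x, y, local=True))
```

Regions outside the shared segment lower the global score but are ignored by the local score, which is why local alignment suits sequences of different length.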
Dotplot
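The slide gives only the title, so the following is a minimal sketch of the general dot plot idea rather than the figure shown in the talk: one sequence on each axis, with a mark wherever a window of residues matches exactly. The function name and the `window` parameter are assumptions.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 1) -> list[str]:
    """Return one text row per position in seq_a.

    '*' marks positions where the window starting in seq_a equals the
    window starting in seq_b; '.' marks everything else.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    for line in dotplot("GKSTLLSA", "GKSSLLSA", window=2):
        print(line)
```

Similar regions show up as diagonal runs of marks; a larger window suppresses random single-residue matches.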
Scoring Matrices
• Empirical weighting schemes
• Considers important biology
• Side chain chemistry/structure/function
• Functional/Structural Conservation
• Ile/Val – small and hydrophobic
• Ser/Thr – both polar
• Size/Charge/Hydrophobicity
Nucleotide Matrix
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
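A minimal sketch of how this matrix is applied: every aligned column contributes +5 for a match or -4 for a mismatch, and the column scores are summed. The function name is hypothetical and gaps are not handled here.

```python
BASES = "ACGT"
# Nucleotide matrix from the slide: 5 on the diagonal, -4 elsewhere.
NUC_MATRIX = {(x, y): (5 if x == y else -4) for x in BASES for y in BASES}


def ungapped_score(a: str, b: str) -> int:
    """Sum of per-column matrix scores for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(NUC_MATRIX[(x, y)] for x, y in zip(a.upper(), b.upper()))


if __name__ == "__main__":
    print(ungapped_score("ACGT", "ACGA"))  # three matches and one mismatch
```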
PAM Scoring Matrices
• Margaret Dayhoff (1978)
• Point accepted mutations (PAM)
• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments
• New side chains must function similarly
• 1 PAM = 1 AA change per 100 AA
• 1 PAM ~ 1 % Divergence
BLOSUM Matrices
• Henikoff and Henikoff (1992)
• Blocks Substitution Matrices
• Differences in conserved ungapped regions
• Directly calculated, no extrapolations
• Sensitive to structural/functional subs
• Generally perform better for local similarity searches
Scoring Matrix – BLOSUM62
BLOSUM n
• Calculated from sequences sharing no more than n% identity
• Sequences with more than n% identity are clustered and weighted to 1
• Reducing the value of “n” yields more divergent/distantly-related sequences
• BLOSUM62 used as default by many of the online search sites
Matrices and more
PAM Matrices (Altschul, 1991)
Matrix      Best use                          Similarity
PAM 40      Short alignments                  >70%
PAM 120                                       >50%
PAM 250     Longer, weaker local areas        >30%
BLOSUM Matrices (Henikoff, 1993)
Matrix      Best use                          Similarity
BLOSUM 90   Short alignments                  >60%
BLOSUM 80                                     >50%
BLOSUM 62   Commonly used                     >35%
BLOSUM 30   Longer, weaker local alignments
Gaps
• Compensate for insertions and deletions
• Improve alignments
• Must be kept to a reasonably small number
• 1 per 20 residues is logical
• Need a different scoring scheme
Gap Penalties
• Penalty for gap introduction
• Penalty for Gap extension
Deductions for Gap = G + Ln, where
                              Nuc   Prot
G = gap-opening penalty         5     11
L = gap-extension penalty       2      1
n = length of gap
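A worked instance of the deduction G + Ln. The sketch reads the slide's values as G = 5, L = 2 for nucleotides and G = 11, L = 1 for proteins; that reading of the flattened table is an assumption, and the names below are hypothetical.

```python
# (gap-opening penalty G, gap-extension penalty L), as read from the slide.
GAP_COSTS = {
    "nucleotide": (5, 2),
    "protein": (11, 1),
}


def gap_deduction(n: int, kind: str = "protein") -> int:
    """Deduction for one gap of length n: G + L * n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    gap_open, gap_extend = GAP_COSTS[kind]
    return gap_open + gap_extend * n


if __name__ == "__main__":
    for length in (1, 3, 10):
        print(length, gap_deduction(length, "nucleotide"), gap_deduction(length, "protein"))
```

Because opening costs more than extending, one long gap is penalized less than several short gaps of the same total length.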
BLAST
• Basic Local Alignment Search Tool
• Seeks high-scoring segment pair (HSP)
• Sequences that can be aligned without gaps
• Have a maximal aggregate score
• Score must be above the score threshold S
• Many HSPs reported for ungapped BLAST
BLAST Algorithms
Program   Query                  Target
BLASTN    Nucleotide             Nucleotide
BLASTP    Protein                Protein
BLASTX    Nucleotide (6-frame)   Protein
TBLASTN   Protein                Nucleotide (6-frame)
TBLASTX   Nucleotide (6-frame)   Nucleotide (6-frame)
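"6-frame" means the nucleotide sequence is translated in three reading frames on each strand before the protein-level comparison. A minimal sketch using the standard genetic code is shown below; the function names are hypothetical, and codons containing ambiguous bases are simply reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' is a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict[str, str]:
    """Translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frames("ATGGCCAAGTAA").items()):
        print(name, protein)
```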
Neighborhood Words
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
Query Word (W = 3)
STL 13 (= 4 + 5 + 4)
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.
Neighborhood Score Threshold (T = 8)
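The neighborhood of the query word STL can be sketched as follows: score every 3-letter word against STL and keep those at or above T. Only the S, T and L rows of BLOSUM62 are included, transcribed by hand for this illustration and restricted to the 20 standard amino acids (so the ambiguity code B from the slide is left out); check them against an authoritative copy of the matrix before reuse. Treating "at or above T" as the cut-off is also an assumption.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# BLOSUM62 rows for S, T and L only (hand-transcribed; verify before reuse).
BLOSUM62_ROWS = {
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
}


def word_score(query_word: str, candidate: str) -> int:
    """Sum of matrix scores, position by position."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring >= threshold against query_word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


if __name__ == "__main__":
    words = neighborhood("STL", threshold=8)
    for word, score in sorted(words.items(), key=lambda kv: -kv[1]):
        print(word, score)
```

The slide lists only substitutions at the middle position; the full neighborhood also contains words that differ at the first or third position.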
High-Scoring Segment Pairs
STL 13
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
Extension
Significance Decay
• Mismatches
• Gap penalties
[Figure: cumulative score plotted against extension, with X, S and T marked]
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
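The extension step can be sketched as an ungapped walk outward from a word hit that tracks the cumulative score and stops once it has fallen more than X below the best value seen. This sketch reads X as that drop-off, extends to the right only, assumes the seed word is an exact match, and uses the +5/-4 nucleotide scores from the earlier slide to stay short. The function name and parameters are hypothetical.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 word_len: int, x_drop: int,
                 match: int = 5, mismatch: int = -4) -> tuple[int, int]:
    """Extend an exact seed word to the right without gaps.

    Returns (best cumulative score, aligned length at that best score).
    Extension stops when the score drops more than x_drop below the best.
    """
    score = best = word_len * match   # the seed itself is all matches
    best_len = word_len
    i = word_len
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break                     # mismatches have eroded the score too far
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, word_len=4, x_drop=10))
```

A segment pair found this way would then be kept only if its best score reaches the threshold S.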
Karlin Altschul Equation
$E = k m N e^{-\lambda s}$
m     Number of letters in query
N     Number of letters in db
mN    Size of search space
λs    Normalized score
k     minor constant
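A small numeric sketch of the equation, showing that E grows linearly with the search space mN and falls exponentially with the score. The values of k and λ below are placeholders chosen only for illustration; real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(m: int, n: int, score: float, k: float, lam: float) -> float:
    """E = k * m * N * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    K, LAM = 0.1, 0.3            # hypothetical constants, not from the slides
    query_len = 250              # m: letters in the query
    db_len = 50_000_000          # N: letters in the database
    for s in (40, 60, 80, 100):
        print(s, f"{expect_value(query_len, db_len, s, K, LAM):.3g}")
```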
http://www.ncbi.nlm.nih.gov