Computational Biology Tools - University of California ... · Computational Biology Tools ......
Computational Biology Tools
Lecture 8: Protein Sequence Databases and Analysis Tools
Brian Kidd
October 21, 2010
see survey 04 on eCommons (tests and quizzes)
Questions/Concerns from Last Time
Overview
1. Protein Sequence Databases
• SwissProt, UniProt, NCBI
2. Protein Analysis Tools
• Linear sequence analysis
• 3D structure analysis
3. Finding Distant Homologs
Sequence Databases
SwissProt (ExPASy)
• highly curated, updated less frequently
TrEMBL (ExPASy)
• translated nucleotide sequences
• automatic translation, fast but less info
UniProt (EBI)
• Universal Protein Resource
• Combines SwissProt, TrEMBL, PIR sequences
Sequence Analysis Sites
For protein sequences and tools to analyze them, the two major centers are:
ExPASy: Expert Protein Analysis System
many tools – http://ca.expasy.org/tools
Databases: SwissProt, TrEMBL
PIR: Protein Information Resource (folded into UniProt consortium; no longer major resource site)
NCBI: Entrez Protein and Domains
More Sequence Databases
Non-redundant
NR (NCBI), UniRef (PIR/EBI)
Reference
RefSeq (NCBI) – reannotated by NCBI
Domains/Families
Pfam – protein families (Sanger center + mirror sites)
SMART – Simple Modular Architecture Research Tool
CDD – Conserved protein Domain Database (NCBI), combines Pfam, SMART, and COGs databases
InterPro – based on UniProt, at EMBL-EBI
Many others...
Linear Sequence Analysis
What can you learn from a (single) protein sequence?
Calculate its physical properties
• Molecular weight (MW), isoelectric point (pI), amino acid content, hydropathy (hydrophobic vs. hydrophilic regions)
• Does not take into account post-translational modifications, so calculations are usually not 100% accurate
Identify sequence motifs and families
• Signal sequences, transmembrane domains, coiled-coils, post-translational modification sites, secondary structure (non-homologous)
• Domains, functional motifs (homologous)
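A minimal Python sketch of the physical-property calculations named on this slide. It assumes standard average residue masses and the Kyte-Doolittle hydropathy scale; the function names and the window size of 9 are illustrative choices, not part of the lecture. Like the web tools, it ignores post-translational modifications.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq):
    """Sum of residue masses plus one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def composition(seq):
    """Count of each amino acid in the sequence."""
    return Counter(seq)


def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(seq, window=9):
    """Mean hydropathy in each sliding window along the sequence."""
    return [mean_hydropathy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]
```

Stretches of the profile with strongly positive values are candidate hydrophobic regions; peaks only suggest, and do not prove, a transmembrane segment.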
Protein Sequence Analysis Tools
ExPASy Proteomics Tools
Calculate physical properties
Predict sequence motifs
what ExPASy calls “Topology”: localization, TM domains
Signal sequences, post-translational modifications
Search pattern and profile collections
PredictProtein and Meta-PP
Meta-server providing access to many servers with one submission form
Structure Databases
Experimental
PDB: Protein Data Bank
Families:
SCOP, CATH, Dali Database, Homstrad
Models/Predictions
ModBase
SwissModel
NOTE: all of these databases are described in the January Database issue of Nucleic Acids Research (NAR)
Includes links to the databases
3D Structure Analysis
Visualization
• Domain structure, global fold, active sites, point mutations, SNPs, splice sites
Evaluate structure “quality”
Calculate physical properties
• Surface areas, distances, side-chain conformations, contact maps
Structural alignment (i.e. similarity to other structures)
Prediction
• Physical properties: binding affinity, pKa’s, stability, specificity
• 3D structure (homology modeling, fold recognition, de novo)
• Advanced: protein design, “docking” of two proteins, active site modeling
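As one illustration of the distance-based properties listed above, a contact map can be sketched in a few lines of Python. This assumes one representative coordinate per residue (for example the C-alpha atom) and an 8 Å cutoff; both are common conventions but are assumptions here, and the function name is hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a square boolean matrix: True where two residues lie within cutoff.

    coords is a list of (x, y, z) tuples, one per residue, in angstroms.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
            for i in range(n)]


# Three made-up residue positions on a line, for illustration only.
example = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(example)
```

In practice the coordinates would be read from a PDB entry; parsing is left out to keep the sketch self-contained.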
Secondary Structure Prediction
Three good methods:
Psipred
SAM-T02/T04/T06
PHD (PredictProtein)
Compare a couple of methods
Use the three-state predictions
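Comparing two methods can be as simple as lining up their three-state strings (H = helix, E = strand, C = coil). The sketch below assumes both predictions cover the same sequence and have equal length; the function name and the "?" marker for disagreement are illustrative choices.

```python
def compare_predictions(pred_a, pred_b):
    """Return (fraction of positions that agree, consensus string).

    Positions where the two three-state predictions differ are marked '?'.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same length")
    pairs = list(zip(pred_a, pred_b))
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    consensus = "".join(a if a == b else "?" for a, b in pairs)
    return agreement, consensus


agreement, consensus = compare_predictions("CHHHEC", "CHHEEC")
```

Regions where independent methods agree are generally the ones to trust most; the "?" positions deserve a closer look.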
Information Flow
Sequence ⇔ Structure ⇔ Function
Evolutionary selection operates on function
Structure is more closely linked to function than is sequence, so structure tends to be more conserved than sequence
Need to search farther in sequence space to find proteins with related structures and functions
Detecting Remote Similarities
Remote similarities can more easily be detected by comparing protein sequences
DNA sequences change faster than protein sequences (wobble position, redundant codons)
4 letter DNA code vs. 20 letter amino acid code means that matches by chance are more likely in DNA ➜ the protein code has more information in it!
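The alphabet-size argument can be made concrete with a little arithmetic. The sketch below assumes every letter is equally likely, which real sequences are not, so the numbers are only a rough guide; the function names are illustrative.

```python
import math


def chance_match_probability(alphabet_size, length=1):
    """Probability that two random strings of this length match exactly."""
    return (1.0 / alphabet_size) ** length


def bits_per_position(alphabet_size):
    """Maximum information content of one position, in bits."""
    return math.log2(alphabet_size)


dna_chance = chance_match_probability(4)       # 0.25
protein_chance = chance_match_probability(20)  # 0.05
dna_bits = bits_per_position(4)                # 2.0
protein_bits = bits_per_position(20)           # about 4.32
```

Under this uniform assumption a single matching position is five times more likely to arise by chance in DNA than in protein.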
Detecting Homology
Evolutionary Distance: NEAR → FAR
METHODS: BLASTn, BLASTp, PSIBLAST, Fold Recognition
SIMILARITY: DNA Sequence, Protein Sequence, Protein Structure
Similar Sequences Share Similar Structures
Compare all pairs of proteins in the same “family” (pairs for which homology is very probable)
Homologs do not necessarily share much sequence similarity
Proteins with > 30% sequence identity almost always share the same fold
Sauder et al., Proteins 40:6-22 (2000).
Figure labels: Family; All others; Immunoglobulins. Axis: More structural similarity
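The 30% rule of thumb refers to percent identity between two aligned sequences. A minimal sketch follows, assuming the sequences are already aligned with "-" for gaps and that identity is counted over columns where neither sequence has a gap. Other denominators are in use, so this definition is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


identity = percent_identity("ACDE-G", "ACEEFG")  # 4 matches over 5 columns
```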
PSIBLAST
Position-Specific Iterated BLAST
Creates a scoring matrix specialized for your sequence
Allows more distantly related sequences to be identified
Steps:
1. Use BLASTp to identify related sequences (E-value threshold)
2. Create a profile from related sequences
3. Search for related sequences using this profile
4. Repeat
BLASTing the Protein Universe
Evolution and the Protein Universe
PSIBLASTing the Protein Universe
Iteration 1
Iteration 2
Sequence Profiles
Align all sequences and count how often each amino acid occurs at every position
Combine with prior information about substitution frequencies using pseudo-counts from BLOSUM62
Convert to log odds score to give a Position-Specific Scoring Matrix (PSSM)
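The three steps above can be sketched in Python. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a uniform background frequency of 1/20, and a flat pseudo-count per amino acid in place of BLOSUM62-derived pseudo-counts. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Step 1: count how often each amino acid occurs in this column.
        counts = Counter(seq[col] for seq in aligned_seqs
                         if seq[col] in AMINO_ACIDS)
        # Step 2: add pseudo-counts (flat here, BLOSUM62-based in the lecture).
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        # Step 3: convert frequencies to log-odds scores in bits.
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in AMINO_ACIDS}
        pssm.append(row)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(row[aa] for row, aa in zip(pssm, seq))


pssm = build_pssm(["MKW", "MKW", "MRW"])
```

Residues seen often in a column score above zero and unseen residues score below zero, so a sequence resembling the alignment scores higher than an unrelated one.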
A Sample PSSM
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 M    -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
  2 K    -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
  3 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  4 V     0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
  5 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  6 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
  7 L    -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
  8 L    -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
  9 L    -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
 10 L    -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 11 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 12 A     5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 13 W    -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
 14 A     3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
 15 A     2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
 16 A     4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
 ...
 37 S     2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
 38 G     0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
 39 T     0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
 40 W    -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 41 Y    -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
 42 A     4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
From Pevsner, Bioinformatics and Functional Genomics. ISBN: 0-471-21004-8. © 2003 John Wiley & Sons, Inc.
PSSM Corruption
False positives can occur in a PSIBLAST search if the PSSM becomes corrupted
Once a “bad” sequence is included in the PSSM, the search veers off course and cannot be corrected
How do PSSMs become corrupted?
• One sequence that is not homologous to the query gets included in the alignment used to make the PSSM
• The PSSM now looks a bit like this spurious sequence and will match well to other similar spurious sequences
• The additional spurious sequences that are detected are included in the new alignment, amplifying the corrupting signal
Preventing PSSM Corruption
Filter biased-composition regions (low-complexity filter)
Use better methods to estimate the E-value (composition-based statistics)
Make the threshold for judging two sequences to be similar more stringent: lower the E-value cutoff from 0.001 (default) to a value such as 0.0001
Manually inspect the output from each iteration and remove suspicious hits
PHI-BLAST
Pattern-Hit-Initiated BLAST
What other proteins contain a particular sequence pattern and are similar in the vicinity of this pattern?
May filter out cases where pattern matches randomly and doesn’t indicate homology
Combines matching of regular expressions with local alignments surrounding the match
Pattern matching uses ScanProsite syntax
Sequence similarity search is like PSIBLAST
Syntax Rules for Patterns
[] any one of the listed characters allowed
{} any character except the listed ones allowed
x(n) n positions in which any residue is allowed
x(n,m) n to m positions in which any residue is allowed
Examples:
GXW[YF][EA][IVLM]
matches:
GTWFEL
GKWYAI
does not match:
GGWYFEI
GWYEI
E[LIV]X(0,3)PP[STG]
matches:
ELPPS
EVIPPG
does not match:
ELIVPPPPG
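These syntax rules map almost directly onto regular expressions, which makes the slide's examples easy to check. The converter below is a minimal sketch covering only the four rules listed here, plus optional "-" separators; the function names are hypothetical, and this is not how PHI-BLAST itself is implemented.

```python
import re


def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                      # optional element separator
            i += 1
        elif ch in "xX":                   # any residue
            out.append(".")
            i += 1
        elif ch == "[":                    # one of the listed residues
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":                    # anything except the listed residues
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                    # repeat count: (n) or (n,m)
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        else:                              # a literal residue
            out.append(ch)
            i += 1
    return "".join(out)


def pattern_matches(pattern, sequence):
    """True if the whole sequence matches the pattern."""
    return re.fullmatch(pattern_to_regex(pattern), sequence) is not None
```

For example, `pattern_to_regex("E[LIV]X(0,3)PP[STG]")` gives `E[LIV].{0,3}PP[STG]`. To find a pattern inside a longer protein, `re.search` would replace `re.fullmatch`.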
Gene Discovery with BLAST
Start with the sequence of a known protein
tblastn: search a DNA database (e.g., HTGS, dbEST, or genomic sequence from a specific organism)
Inspect: find matches...
• to DNA encoding known proteins
• to DNA encoding related (novel!) proteins
• to false positives
blastx or blastp against nr: search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene
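tblastn can compare a protein query against DNA because the nucleotide sequences are translated in all six reading frames. The Python sketch below shows that translation step only, using the standard genetic code; the function names are illustrative, and the search itself is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3 ('*' marks a stop)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Each of the six translations is then compared with the protein query, which is why a hit reports the frame in which the match was found.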
Essentials at this Point
Accessing literature and sequence information from various databases (NCBI and UCSC)
BLAST (all variants)
Pairwise sequence analysis tools and algorithms
Single-sequence analysis tools for DNA: EMBOSS, ORFs, restriction enzymes, and primers
Protein databases and analysis tools
PSI and PHI BLASTs
For Next Time
Reading
• B4D Chapter 9 – Building a Multiple Sequence Alignment
• B4D Chapter 10 – Editing and Publishing Alignments
Problem set
• Continue working on PS #2 (due Friday, October 29)
http://www.soe.ucsc.edu/classes/bme110/Fall10/calendar.html