Bioinformatics: Introduction and Methods
![Page 1: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/1.jpg)
Bioinformatics: Introduction and Methods Le Zhang
Computer Science Department, Southwest University
![Page 2: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/2.jpg)
Functional prediction of genetic variants
Le Zhang, Ph. D. Computer Science Department Southwest University
![Page 3: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/3.jpg)
Unit 1: Overview of the problem
![Page 4: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/4.jpg)
Do you think Angelina made the right decision to remove her breasts?
![Page 5: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/5.jpg)
Angelina Jolie has a genetic mutation in BRCA1.
How can we predict the likelihood of her getting breast cancer given this mutation?
• P(breast cancer | her mutation)
• P(breast cancer free | her mutation)
![Page 6: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/6.jpg)
The dawning of the age of personalized medicine
Next‐generation sequencing can sequence one person’s whole genome for ~$3,000.
Personal genomes hold promise for a future of personalized medicine.
![Page 7: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/7.jpg)
Where did your genetic variations come from?
• somatic mutations
• de novo mutations
• inherited from parents
Annapurna Poduri et al. Somatic Mutation, Genomic Variation, and Neurological Disease. Science, 5 July 2013: 341
![Page 8: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/8.jpg)
Types of genetic variations in a human genome
• Chromosomal aneuploidy • Structural Variations (SVs) • Copy Number Variations (CNVs) • Short insertion/deletions (indels) • Single Nucleotide Variations (SNVs)
Nomenclature: Mutation vs. polymorphism vs. variation vs. variant
![Page 9: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/9.jpg)
Structural Variation (SV) and Copy Number Variation (CNV)
• Insertion
• Deletion
• Inversion
• Translocation
• CNV
![Page 10: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/10.jpg)
Indel – short insertion/deletion
• Within intergenic/intronic regions
• Within coding regions: frameshifting or non‐frameshifting
![Page 11: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/11.jpg)
SNV – Single Nucleotide Variation
There are about 3 million SNVs in one person’s genome, equivalent to a frequency of ~1 per 1,000 bases.
![Page 12: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/12.jpg)
SNVs within coding regions
• Stop gain (nonsense)
• Stop loss
• Non‐synonymous (missense)
• Synonymous (silent)
• Affect splicing
[Figure: missense mutation; nonsense mutation]
![Page 13: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/13.jpg)
Missense (nonsynonymous) SNVs
Missense SNVs change the amino acid.
Missense SNVs account for ~2% of the genome but >50% of all mutations known to be
involved in human inherited diseases.
![Page 14: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/14.jpg)
BRCA1 vs. breast cancer
In 1990, DNA linkage studies on large families identified BRCA1 as the first gene associated with breast cancer.
BRCA1:
• is located on chromosome 17
• is 80,818 bp in length
• has 23 exons
• encodes a protein of 1,863 amino acids
• is a tumor suppressor gene that repairs damaged DNA and regulates cell growth and cell death
Approximately 5‐10% of breast cancers and 14% of ovarian cancers arise from a BRCA1 or BRCA2 genetic mutation.
![Page 15: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/15.jpg)
However, not all missense SNVs cause phenotype change. Some are pathogenic, but many are neutral.
A total of 238 known missense variations in BRCA1:
• 163 are present only in patients
• 62 are present only in healthy persons
• 13 are present in both patients and healthy persons
![Page 16: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/16.jpg)
On average, a healthy individual has

| Class | Number |
|---|---|
| SNP | 3,019,909 |
| Indel | 361,669 |
| Deletions | 15,893 |
| Duplications | 407 |
| Mobile element insertions | 4,775 |

Within protein‐coding regions,

| Class | Number |
|---|---|
| Synonymous SNPs | 60,157 |
| Non‐synonymous SNPs | 68,300 |
| Small in‐frame indels | 714 |
| Small frameshift indels | 954 |
| Stop losses | 77 |
| Stop‐introducing SNPs | 1,057 |
| Genes disrupted by large deletions | 147 |
| Total genes containing LOF variants | 2,304 |
| HGMD ‘damaging mutation’ SNPs | 671 |
![Page 17: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/17.jpg)
Still an unsolved problem with lots of active on‐going research!
• What features differentiate disease‐causing variants from neutral ones? • How can we predict whether a variation is disease‐causing?
![Page 18: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/18.jpg)
Unit 2: Databases of genetic variations
![Page 19: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/19.jpg)
![Page 20: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/20.jpg)
dbSNP
http://www.ncbi.nlm.nih.gov/SNP/
Created in September 1998 by the NCBI (National Center for Biotechnology Information) in collaboration with the NHGRI (National Human Genome Research Institute)
Its goal is to act as a single database that contains all identified genetic variation
![Page 21: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/21.jpg)
dbSNP
New information obtained by dbSNP becomes available to the public periodically in a series of “builds”
Contains a range of molecular variation:
• SNPs
• Indels
• multinucleotide polymorphisms
• microsatellite markers
• short tandem repeats
• heterozygous sequences
As of dbSNP build 138, it consists of variants from 131 organisms. For Homo sapiens:

| Statistic | Count |
|---|---|
| Number of Submissions (ss) | 232,952,851 |
| Number of RefSNP clusters (rs) | 62,676,337 |
| Validated rs | 44,278,189 |
| Number of rs in gene | 27,608,151 |
| Number of ss with genotype | 73,909,251 |
| Number of ss with frequency | 35,997,830 |
![Page 22: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/22.jpg)
dbSNP – Data increase
From dbSNP build 125 in 2005 to build 138 in 2013, for Homo sapiens
[Figure: counts by year (y-axis 0 to 250,000,000; x-axis 2005 to 2012) for three series: Number of Submissions (ss), Number of RefSNP Clusters (rs), Number of rs in gene]
![Page 23: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/23.jpg)
dbSNP- Record
![Page 24: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/24.jpg)
dbSNP- Record
![Page 25: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/25.jpg)
1000 Genomes
http://www.1000genomes.org/
The 1000 Genomes Project, launched in January 2008, is an international research effort to establish by far the most detailed catalogue of human genetic variation.
• Pilot ‐ In 2010, the project finished its pilot phase
• Phase I ‐ In October 2012, the sequencing of 1092 genomes was announced in a Nature publication
![Page 26: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/26.jpg)
1000 Genomes
![Page 27: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/27.jpg)
1000 Genomes
Sequencing technology used: Illumina, SOLiD, 454

| Phase I | Whole genome | Whole exome |
|---|---|---|
| Strategy | Low coverage whole genome sequencing | Deep sequencing of whole exome |
| Coverage | 2‐6X | 50‐100X |
| Sample number | 1,092 | 1,039 |
![Page 28: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/28.jpg)
OMIM
Online Mendelian Inheritance in Man
• A database that catalogues all the known diseases with a genetic component and links them to the relevant genes in the human genome
• Contains information on all known Mendelian disorders and over 12,000 genes
http://www.omim.org/
![Page 29: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/29.jpg)
OMIM
• Initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders, published as a book entitled Mendelian Inheritance in Man
• 12 book editions of MIM were published between 1966 and 1998
• The online version, OMIM, was created in 1985 and made generally available on the internet starting in 1987.
![Page 30: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/30.jpg)
OMIM Entry Statistics
![Page 31: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/31.jpg)
OMIM
![Page 32: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/32.jpg)
Human Gene Mutation Database (HGMD)
A comprehensive collection of germline mutations in nuclear genes that underlie, or are associated with, human inherited disease.
By 2013, the database contained over 141,000 different variants detected in over 5,700 different genes
Two versions:
• Professional – requires an annual subscription
• Public – freely available but permanently 3 years out of date, and does not contain any of the additional annotations or extra features present in HGMD Professional
![Page 33: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/33.jpg)
Human Gene Mutation Database (HGMD)
Created by biologist David N. Cooper and mathematician Michael Krawczak in 1996.
Originally established for the scientific study of mutational mechanisms in human genes causing inherited disease, but has since acquired a much broader utility as a central unified repository for germ‐line disease‐related functional variation.
All HGMD mutation data are manually curated from the scientific literature.
![Page 34: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/34.jpg)
HGMD
![Page 35: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/35.jpg)
HGMD 2013.2
![Page 36: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/36.jpg)
HGMD http://www.hgmd.cf.ac.uk/ac/index.php
![Page 37: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/37.jpg)
Locus specific databases (LSDBs)
Collect all known variants of each disease-related gene in a specific database
Annotate with complete and accurate information on genetic mutations
Most LSDBs are built on LOVD (Leiden Open Variation Database), a database framework for storing variant information
http://www.lovd.nl/3.0/home
![Page 38: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/38.jpg)
LSDBs
![Page 39: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/39.jpg)
Unit 3: Conservation-based and Rule-based methods: SIFT & PolyPhen
![Page 40: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/40.jpg)
Questions:
• What features differentiate disease‐causing variants from neutral ones?
• How can we predict whether a variation is disease‐causing?
![Page 41: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/41.jpg)
Phenotypical/functional “effects” of human genetic variations
• Disease vs. normal • Deleterious vs. neutral
• Personal trait differences (e.g., height)
Observations, not “truth”
Statistical and stochastic, not deterministic
• Animal model phenotypic changes • Cellular phenotypic changes
• Protein function changes
• Protein structure changes
• Protein sequence changes
![Page 42: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/42.jpg)
• Nonsense mutations are usually considered deleterious.
  • even though it is not always the case…
• Known deleterious mutations are enriched in nonsynonymous mutations.
  • ~50% of known mutations of Mendelian disorders are nonsynonymous mutations
  • ascertainment bias?
• Synonymous mutations, intronic mutations, and intergenic mutations are under‐studied.
  • According to GWAS studies, 88% of trait‐associated variants of weak effect are non‐coding.
• Most research so far has focused on nonsynonymous mutations.
![Page 43: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/43.jpg)
1999: Earliest attempt based on BLOSUM substitution matrix
• Assumption: if the substitution score between a variant residue and the wild type residue is positive, then the variant is neutral. If the substitution score is negative, then the variant is deleterious.
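The sign rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 1999 method itself: the dictionary holds a handful of BLOSUM62 entries typed in by hand for the example, the function names are made up here, and the treatment of a zero score (which the rule does not cover) is an assumption.

```python
# A hand-copied excerpt of BLOSUM62 scores, for illustration only.
# A real analysis would load the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    ("L", "I"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("A", "S"): 1,
    ("A", "T"): 0,
    ("G", "W"): -2,
    ("R", "W"): -3,
}


def substitution_score(wild_type, variant):
    """Look up a score; the matrix is symmetric, so try both orders."""
    key = (wild_type, variant)
    if key in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[key]
    return BLOSUM62_EXCERPT[(variant, wild_type)]


def blosum_sign_rule(wild_type, variant):
    """Positive score -> neutral, negative score -> deleterious."""
    score = substitution_score(wild_type, variant)
    if score > 0:
        return "neutral"
    if score < 0:
        return "deleterious"
    return "undetermined"  # assumption: the rule is silent on a score of 0


print(blosum_sign_rule("I", "L"))  # conservative swap, positive score
print(blosum_sign_rule("W", "R"))  # dissimilar residues, negative score
```

The rule looks only at the two residues, so the same substitution gets the same call at every position in every protein. The methods that follow address this limitation.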
![Page 44: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/44.jpg)
More successful methods
• Conservation‐based (e.g., SIFT)
• Rule‐based (e.g., PolyPhen)
• Classifier‐based (e.g., PolyPhen2, SAPRED)
![Page 45: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/45.jpg)
Sort Intolerant From Tolerant substitutions (SIFT)
Published in 2001 by Pauline C. Ng and Steven Henikoff
The first tool for predicting deleterious amino acid substitutions
Website: http://sift.jcvi.org/
![Page 46: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/46.jpg)
SIFT bets on evolution Important positions (such as active sites) tend to be conserved in the protein family across species. • Mutations at well‐conserved positions tend to be deleterious.
Some positions have a high degree of diversity across species. • Mutations at these positions tend to be neutral.
![Page 47: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/47.jpg)
SIFT is a multistep procedure
Given a protein sequence:
Step 1. Search for similar sequences
Sequence search database: SWISS‐PROT
PSI‐BLAST is run for four iterations to collect a pool of sequences similar to the query
Step 2. Choose closely related sequences that are likely to share similar function
The PSI‐BLAST results are grouped together if they are >90% identical in the aligned regions
![Page 48: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/48.jpg)
Step 3. Obtain the multiple alignment of these chosen sequences
![Page 49: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/49.jpg)
Step 4. Calculate normalized probabilities for all possible substitutions at each position in the alignment
If the SIFT score is less than 0.05, the SNV is considered to be deleterious. Otherwise, it is considered neutral.
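A rough sketch of Step 4 and the 0.05 cutoff, for one alignment column, might look like the following. It is a deliberate simplification and not SIFT's actual scoring: here a residue's score is just its count in the column divided by the count of the most frequent residue, with no pseudocounts or sequence weighting. The function names and the toy column are assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column):
    """Scale residue counts in one alignment column so that the most
    frequent residue scores 1.0 (a simplification of SIFT's Step 4)."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    if not counts:
        raise ValueError("column contains no amino acids")
    top = max(counts.values())
    return {aa: counts.get(aa, 0) / top for aa in AMINO_ACIDS}


def predict(column, variant, cutoff=0.05):
    """Apply the cutoff from the slide: score < 0.05 means deleterious."""
    score = normalized_probabilities(column)[variant]
    label = "deleterious" if score < cutoff else "neutral"
    return score, label


# Toy column: the aligned position holds L in 8 sequences and I in 2.
column = "LLLLLLLLII"
print(predict(column, "I"))  # seen in homologs, so tolerated
print(predict(column, "W"))  # never seen at this position
```

Because unseen residues score exactly zero here, this sketch is harsher than a method with pseudocounts would be, especially for shallow alignments.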
![Page 50: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/50.jpg)
![Page 51: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/51.jpg)
Prediction results
Score cutoff: 0.05
![Page 52: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/52.jpg)
Accuracy of SIFT
• False negative rate: 31%
• False positive rate: 20%
• Coverage: 60%
![Page 53: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/53.jpg)
| | Truth ("gold standard"): positive | Truth ("gold standard"): negative | |
|---|---|---|---|
| Test outcome: positive | True positive (hit) | False positive (false alarm) | Positive predictive value (PPV) = precision = TP/(TP+FP) |
| Test outcome: negative | False negative (miss) | True negative (correct rejection) | Negative predictive value (NPV) = TN/(TN+FN) |
| | Sensitivity = recall = TP/(TP+FN) | Specificity = TN/(TN+FP) | Accuracy = (TP+TN)/total |

False negative rate (β) = type II error = 1 - sensitivity = FN/(TP+FN)
False positive rate (α) = type I error = 1 - specificity = FP/(TN+FP)
False discovery rate (FDR) = 1 - precision = FP/(TP+FP)
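The definitions above translate directly into code. The sketch below computes them from the four cells of the confusion matrix. The function name and the example counts are hypothetical, with the counts chosen only so that the false negative and false positive rates come out at the 31% and 20% quoted for SIFT.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard rates from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),          # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),            # PPV
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (tp + fn),  # 1 - sensitivity
        "false_positive_rate": fp / (tn + fp),  # 1 - specificity
        "false_discovery_rate": fp / (tp + fp), # 1 - precision
    }


# Hypothetical counts: 100 truly deleterious and 100 truly neutral variants.
example = confusion_metrics(tp=69, fp=20, fn=31, tn=80)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```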
![Page 54: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/54.jpg)
Polymorphism Phenotyping (PolyPhen): a rule‐based method
Amino acid variants may impact folding, interaction sites, solubility or stability of the protein.
Changes in protein structure may affect protein function, which may lead to phenotype change.
PolyPhen predicts impact of amino acid allelic variants based on multi‐sequence alignment AND protein 3D structure features
![Page 55: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/55.jpg)
PolyPhen
![Page 56: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/56.jpg)
PolyPhen
1. Multi‐sequence alignment of homologous sequences
2. Structure‐based characterization of the substitution site
• DISULFIDE, THIOLEST or THIOETH bond, BINDING site, ACTIVE site etc.
• Whether the variant is located in transmembrane regions
• Whether the variant is located in coiled coil regions
• Whether the variant is located in signal peptide regions
![Page 57: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/57.jpg)
PolyPhen
3. Get the protein 3D structure, or use homology modeling to predict its structure
4. Calculate the 3D structure features of the substitution site
• Secondary structure
• Solvent accessible surface area
• Φ, Ψ dihedral angles
• Normalized B‐factor for the residue
• Loss of hydrogen bond
• Contacts with critical sites, ligands or other polypeptide chains
![Page 58: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/58.jpg)
PolyPhen uses empirically derived rules to predict whether an nsSNP is damaging or benign
![Page 59: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/59.jpg)
PolyPhen
Pros
• Improved prediction accuracy when protein 3D structure is available
Cons
• If 3D structure is not available, it can only depend on MSA.
• The rules are empirical.
![Page 60: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/60.jpg)
PolyPhen2
An improved version of PolyPhen, released in 2010
http://genetics.bwh.harvard.edu/pph2/
• Uses more predictive features
• Based on Naïve Bayes machine learning
![Page 61: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/61.jpg)
Improved performance compared with PolyPhen
![Page 62: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/62.jpg)
Unit 4: Classifier-based methods: SAPRED
![Page 63: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/63.jpg)
Formulate as a supervised classification problem
[Figure: workflow starting from positive (+) and negative (‐) training examples]
• Structural attributes & sequence attributes
• Attribute evaluation & subset selection: 60 attributes, 10 groups
• Build SVM classifier on training data
• Apply the classifier to newly identified SAPs
![Page 64: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/64.jpg)
Single Amino acid Polymorphisms disease‐association Predictor (SAPRED)
Currently SAPRED supports two types of predictions:
• One is based on both the structural and sequence information
• The other relies on the sequence information only
The former aims at higher prediction accuracy and more attributes with putative biological insights, while the latter can work with more queries whose structural models are not available.
![Page 65: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/65.jpg)
PDB – get protein 3D structure http://www.rcsb.org/pdb/home/home.do
![Page 66: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/66.jpg)
![Page 67: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/67.jpg)
Homology Modeling
http://swissmodel.expasy.org/
![Page 68: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/68.jpg)
Homology Modeling
![Page 69: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/69.jpg)
Biologically-Intuitive Attributes
Residue frequencies, conservation score,
Solvent accessibilities and Cβ density, secondary structure...
New attributes:
Structural neighbor profile
Nearby functional sites
Disordered regions
Hydrogen bonds change
β-aggregation
HLA family
![Page 70: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/70.jpg)
Residue frequencies in MSA
LacI 5-38
![Page 71: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/71.jpg)
Structural neighbor profile
Definition:
A 20-D vector: take the Cα of the SAP residue as the center and draw a sphere with a specific radius. The residues inside are counted to get the number for each of the 20 kinds of residues. Each number is a component of the vector.
$$N_{R,a_i} = \sum_{j=1}^{L} X_j$$
where $X_j = 1$ if residue $j$ is of type $a_i$ and $r_{j,c} < R$; otherwise, $X_j = 0$
R: radius
L: protein length
a_i: a specific residue type
r: distance between a residue and the center residue
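The counting in this definition can be sketched as below. The sketch makes several assumptions that the slide does not spell out: each residue is represented by a single coordinate (taken here to be its Cα atom), the center residue itself is not counted, and the function name, input format and toy coordinates are invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def structural_neighbor_profile(residues, center_index, radius):
    """Return the 20-D count vector of residue types within `radius`
    of the center residue.

    residues: list of (one_letter_code, (x, y, z)) tuples; the
    coordinates are assumed to be C-alpha positions in Angstroms.
    """
    center_xyz = residues[center_index][1]
    profile = {aa: 0 for aa in AMINO_ACIDS}
    for j, (aa, xyz) in enumerate(residues):
        if j == center_index:
            continue  # assumption: the center residue is not its own neighbor
        if aa in profile and math.dist(xyz, center_xyz) < radius:
            profile[aa] += 1
    # Components follow the alphabetical one-letter order in AMINO_ACIDS.
    return [profile[aa] for aa in AMINO_ACIDS]


# Toy structure: five residues with made-up coordinates.
toy = [
    ("H", (0.0, 0.0, 0.0)),   # center
    ("L", (3.0, 0.0, 0.0)),
    ("L", (0.0, 4.0, 0.0)),
    ("W", (20.0, 0.0, 0.0)),  # outside a 10 Angstrom sphere
    ("A", (0.0, 0.0, 9.9)),
]
vector = structural_neighbor_profile(toy, center_index=0, radius=10.0)
print(dict(zip(AMINO_ACIDS, vector)))
```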
![Page 72: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/72.jpg)
Structural neighbor profile
The center is H128, radius is 10 Angstroms. Neighbors are: 42-47: LLICTY
50-52: AGT 55: I 59: V
106-110: LKTHL 112: T
125-127: KFL
129-131: VAR 176-177: HV 180-181: WW 184: K
188-194: QILFLFY 197: I 208: V 211: F
![Page 73: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/73.jpg)
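Structural neighbor profile: vector

| a.a. | A | C | D | E | F | G | H | I | K | L |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 2 | 1 | 0 | 0 | 4 | 1 | 2 | 4 | 3 | 7 |

| a.a. | M | N | P | Q | R | S | T | V | W | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 0 | 0 | 0 | 1 | 1 | 0 | 4 | 4 | 2 | 2 |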
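As a quick check, the vector for H128 can be recomputed by tallying the neighbor segments listed on the previous slide. The segment strings are copied from that slide; the variable names are made up for this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Neighbor segments of H128 within 10 Angstroms, as listed on the slide.
NEIGHBOR_SEGMENTS = [
    "LLICTY", "AGT", "I", "V", "LKTHL", "T", "KFL",
    "VAR", "HV", "WW", "K", "QILFLFY", "I", "V", "F",
]

counts = Counter("".join(NEIGHBOR_SEGMENTS))
profile = [counts.get(aa, 0) for aa in AMINO_ACIDS]

for aa, n in zip(AMINO_ACIDS, profile):
    print(aa, n)
```

The tally covers 38 neighboring residues and matches the counts given for each amino acid type.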
![Page 74: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/74.jpg)
Structural neighbor profile
Predictive power of different structural neighbor profiles
[Figure: overall accuracy (y-axis, 0.66 to 0.78) against radius in Å (x-axis, 0 to 20) for three series: wildtype profile, variant profile, profile difference]
Different radii had different predictive power.
We selected 13 Angstroms as the optimal value of the radius.
![Page 75: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/75.jpg)
Nearby functional sites
Functional sites like ACT_SITE and METAL annotated in Swiss-Prot carry intuitive biological insight.
SAPs exactly on these sites would heavily disturb protein function, but such SAPs have only low coverage in the dataset.
We proposed that SAPs in the vicinity of functional sites are also more likely than others to affect protein function, which enlarged the coverage of these attributes in the dataset.
![Page 76: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/76.jpg)
Nearby functional sites
![Page 77: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/77.jpg)
Disordered Region
Of 122 SAPs in disordered regions, 114 (93%) are disease-associated.
From: http://ist.temple.edu/disprot/index.php
![Page 78: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/78.jpg)
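Hydrogen bond change

| Changed hydrogen bond | Disease | Polymorphism | Ratio |
|---|---|---|---|
| -6 | 1 | 0 | 1/0 |
| -5 | 12 | 1 | 12 |
| -4 | 44 | 2 | 22 |
| -3 | 114 | 16 | 7.25 |
| -2 | 230 | 55 | 4.18 |
| -1 | 403 | 213 | 1.89 |
| 0 | 1142 | 716 | 1.59 |
| 1 | 224 | 142 | 1.58 |
| 2 | 68 | 36 | 1.89 |
| 3 | 11 | 4 | 2.75 |
| 4 | 0 | 2 | 0 |
| 5 | 0 | 2 | 0 |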
![Page 79: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/79.jpg)
Other attributes
Of 52 SAPs in transmembrane regions, 49 (94%) are disease-associated
Of 194 SAPs with altered β-aggregation properties, 169 (87%) are disease-associated
Of 435 SAPs from HLA families, all except one are “polymorphism”.
![Page 80: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/80.jpg)
SVM classifier
SVM – support vector machine
• Separates transformed data with a hyperplane in a high‐dimensional space
• Kernel function – Radial Basis Function (RBF)
• Grid‐search to select proper parameter values
![Page 81: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/81.jpg)
Support Vector Machine (SVM) Classifier -- Grid-search for parameters
log2C = 1; log2g = -7
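To make the two parameters concrete, the sketch below writes out the RBF kernel and enumerates a grid of (C, g) candidates in powers of two. The exponent ranges are an assumption borrowed from common LIBSVM practice rather than taken from the slides, and the function names are made up. In a real grid search, each candidate pair would be scored by cross-validation, which is omitted here.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def parameter_grid(log2c_values, log2g_values):
    """All (C, gamma) pairs for the given base-2 exponents."""
    return [(2.0 ** c, 2.0 ** g) for c in log2c_values for g in log2g_values]


# Assumed search ranges: log2C in -5..15 and log2g in -15..3, step 2.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))

# The point reported on the slide: log2C = 1, log2g = -7.
selected_c, selected_gamma = 2.0 ** 1, 2.0 ** -7

print(len(grid), "candidate pairs")
print("selected C =", selected_c, "gamma =", selected_gamma)
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], selected_gamma))
```

A small g such as 2^-7 makes the kernel decay slowly with distance, so each training point influences a wide region of the attribute space.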
![Page 82: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/82.jpg)
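Five-fold cross-validation

| Part | Total proteins | Total SAP | Deleterious SAP | Neutral SAP |
|---|---|---|---|---|
| 1 | 105 | 686 | 449 | 237 |
| 2 | 104 | 688 | 450 | 238 |
| 3 | 105 | 688 | 450 | 238 |
| 4 | 105 | 688 | 450 | 238 |
| 5 | 103 | 688 | 450 | 238 |
| Total | 522 | 3438 | 2249 | 1189 |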
![Page 83: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/83.jpg)
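Accuracy: ACC and MCC

| SAP status | Predicted as disease-association (+) | Predicted as polymorphism (-) |
|---|---|---|
| Disease-association (+) | TP | FN |
| Polymorphism (-) | FP | TN |

Overall accuracy:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Matthews correlation coefficient:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}$$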
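Both measures are easy to compute from the four counts. The sketch below uses the standard definitions, written out in the docstrings. The function names and example counts are hypothetical, and returning 0 when the MCC denominator vanishes is a common convention assumed here.

```python
import math


def acc(tp, tn, fp, fn):
    """Overall accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient:
    (TP*TN - FP*FN) / sqrt((TN+FN)(TN+FP)(TP+FN)(TP+FP))."""
    denominator = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, not results from the SAPRED study.
print(acc(tp=90, tn=80, fp=20, fn=10))
print(mcc(tp=90, tn=80, fp=20, fn=10))
```

Unlike ACC, MCC stays informative when the two classes differ in size, which matters for a dataset with 2249 deleterious and 1189 neutral SAPs.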
![Page 84: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/84.jpg)
Predictive power
![Page 85: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/85.jpg)
SAPRED web server
http://sapred.cbi.pku.edu.cn/
![Page 86: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/86.jpg)
Run SAPRED
![Page 87: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/87.jpg)
Results
![Page 88: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/88.jpg)
Explanation of Results: Structural attributes
![Page 89: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/89.jpg)
Explanation of Results: sequence attributes
![Page 90: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/90.jpg)
Results using SAPRED_Seq
ACC=81.5% MCC=0.577
![Page 91: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/91.jpg)
Unit 5: Support Vector Machine (SVM)
![Page 92: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/92.jpg)
Machine learning model
[Figure: Training Data with variables Var1, Var2, Var3, …, VarN is used to build a Model, which makes a Prediction for New Data]
Methods: SVM, HMM, Bayesian, Decision tree, Neural Network, Random Forest, Ensemble learning, ……
Peking University
![Page 93: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/93.jpg)
Classification Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in.
![Page 94: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/94.jpg)
Introduction
An SVM is a supervised learning model that analyzes data and recognizes patterns; it is used for classification and regression analysis. It selects a small number of critical boundary instances, called support vectors, from each class and builds a linear discriminant function that separates them as widely as possible. SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
![Page 95: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/95.jpg)
What is a good Decision Boundary?
Consider a two-class, linearly separable classification problem.
Many decision boundaries! Are all decision boundaries equally good?
![Page 96: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/96.jpg)
Decision Boundary
Intuitively, the best hyperplane is the one that represents the largest separation, or margin, between the two classes, since the larger the margin is, the lower the generalization error of the classifier will be.
![Page 97: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/97.jpg)
Support Vector
The instances that are closest to the maximum-margin hyperplane (the ones with the minimum distance to it) are called support vectors.
![Page 98: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/98.jpg)
SVM - mathematics
The data point is denoted by $x$, which is an $n$-dimensional vector, and $y$ is 1 or -1 to represent the two different classes.
The hyperplane is
$w^T x + b = 0$
So the classification function is
$f(x) = w^T x + b$
And
$y = \begin{cases} 1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}$
![Page 99: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/99.jpg)
SVM - mathematics
The confidence of a classification can be measured by the functional margin, which is $|f(x)|$, and whether the classification is right can be determined by the consistency of the signs of $f(x)$ and $y$. In fact, for a correctly classified point, $|f(x)| = y\,f(x)$. So the functional margin of a point $(x_i, y_i)$ is:
$\hat{\gamma}_i = y_i (w^T x_i + b) = y_i f(x_i)$
The functional margin of a hyperplane is measured by
$\hat{\gamma} = \min_{i=1,\dots,n} \hat{\gamma}_i$
However, the functional margin can be scaled even if the hyperplane remains the same, for example when $w$ and $b$ are changed into $2w$ and $2b$.
![Page 100: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/100.jpg)
SVM - mathematics
An intuitive measurement can be obtained using the distance from the point to the hyperplane, which is called the geometrical margin:
$\gamma = \dfrac{y\,f(x)}{\|w\|} = \dfrac{\hat{\gamma}}{\|w\|}$
where $\hat{\gamma} = y\,f(x)$ is the functional margin.
In this maximum margin classifier, we want to maximize $\gamma$. Because the functional margin is scalable, we can assume $\hat{\gamma} = 1$ without influencing the optimal result.
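The difference between the two margins can be checked numerically. The following is a minimal sketch, assuming a hypothetical hyperplane $(w, b)$ and one labeled point: rescaling $(w, b)$ to $(2w, 2b)$ doubles the functional margin but leaves the geometrical margin unchanged.

```python
import math

def f(w, b, x):
    """Classification function f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def functional_margin(w, b, x, y):
    """y * f(x): positive when the point is classified correctly."""
    return y * f(w, b, x)

def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||: the signed distance to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm_w

# Hypothetical hyperplane and labeled point, chosen for easy arithmetic.
w, b = [3.0, 4.0], -5.0
x, y = [3.0, 4.0], 1

# Same hyperplane, rescaled by a factor of 2.
w2, b2 = [2.0 * wi for wi in w], 2.0 * b

print(functional_margin(w, b, x, y), functional_margin(w2, b2, x, y))
print(geometric_margin(w, b, x, y), geometric_margin(w2, b2, x, y))
```

This invariance is what allows the functional margin to be fixed to 1 when the optimization problem is set up.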
![Page 101: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/101.jpg)
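SVM - mathematics
So the objective function is
$\max \dfrac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n.$
Which is equivalent to
$\min \dfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n.$
This is an optimization model with constraints, and it can be solved by Quadratic Programming.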
![Page 102: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/102.jpg)
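SVM - mathematics
We can also solve this by Lagrange multipliers
$L(w, b, \alpha) = \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \quad \alpha_i \ge 0$
$\dfrac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$
$\dfrac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$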
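As a worked step, the standard hard-margin Lagrangian can be reduced to a function of the multipliers $\alpha_i$ alone by substituting the two stationarity conditions, $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. The derivation below is the usual textbook form and is given as a sketch, not as a transcription of the slide.

```latex
\begin{aligned}
L(w, b, \alpha)
  &= \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \\
  &= \tfrac{1}{2} w^T w - w^T \sum_{i=1}^{n} \alpha_i y_i x_i
     - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
  &= \sum_{i=1}^{n} \alpha_i
     - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
       \alpha_i \alpha_j y_i y_j \, x_i^T x_j
\end{aligned}
\qquad
\text{maximized subject to } \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i y_i = 0 .
```

The data enter this dual problem only through the inner products $x_i^T x_j$, which is what makes it possible to replace them with a kernel function.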
![Page 103: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/103.jpg)
SVM - mathematics
Finally the classification function can be rewritten as
$f(x) = \left( \sum_{i=1}^{n} \alpha_i y_i x_i \right)^T x + b = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b$
![Page 104: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/104.jpg)
SVM - kernel
The linear learning machine has very limited ability in practice, because the complexity of the real world calls for a more flexible hypothesis space. We can use a function $\phi$ to map $x$ to a higher-dimensional space, in which the points can become linearly separable.
![Page 105: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/105.jpg)
kernel
So the classification function can be extended as
$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
Here we get the kernel function:
$K(x, z) = \langle \phi(x), \phi(z) \rangle$
![Page 106: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/106.jpg)
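kernel
Take the points in the picture for example: the two classes can be separated by a circle
$a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0$
Then we can construct a 5-dimensional space, where
$Z_1 = X_1,\; Z_2 = X_1^2,\; Z_3 = X_2,\; Z_4 = X_2^2,\; Z_5 = X_1 X_2$
So the hyperplane in the new feature space is
$\sum_{i=1}^{5} a_i Z_i + a_6 = 0$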
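The circle example can be made concrete with a few lines of code. This is a minimal sketch under two assumptions: the feature map is ordered as $(X_1, X_1^2, X_2, X_2^2, X_1 X_2)$, and the circle (center and radius) is a hypothetical one chosen for illustration. Expanding $(X_1 - c_1)^2 + (X_2 - c_2)^2 = r^2$ gives the coefficients of a hyperplane in the 5-dimensional space.

```python
def phi(x1, x2):
    """Map a 2-D point to the assumed 5-D feature space (X1, X1^2, X2, X2^2, X1*X2)."""
    return (x1, x1 * x1, x2, x2 * x2, x1 * x2)

def circle_as_hyperplane(cx, cy, r):
    """Coefficients (a1..a5) and a6 such that sum(a_i * Z_i) + a6 = 0 is the circle."""
    a = (-2.0 * cx, 1.0, -2.0 * cy, 1.0, 0.0)
    a6 = cx * cx + cy * cy - r * r
    return a, a6

def side(a, a6, point):
    """Linear decision value in feature space: < 0 inside, 0 on, > 0 outside the circle."""
    z = phi(*point)
    return sum(ai * zi for ai, zi in zip(a, z)) + a6

# Hypothetical circle: center (1, 2), radius 3.
a, a6 = circle_as_hyperplane(1.0, 2.0, 3.0)
for p in [(1.0, 2.0), (4.0, 2.0), (5.0, 5.0)]:
    print(p, side(a, a6, p))
```

A boundary that is curved in the original two dimensions is therefore a linear function of the mapped coordinates, so a linear classifier in the new space can separate the two classes.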
![Page 107: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/107.jpg)
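Kernel function
Linear kernel: $K(x_1, x_2) = \langle x_1, x_2 \rangle$
Polynomial kernel: $K(x_1, x_2) = (\langle x_1, x_2 \rangle + R)^d$
Gauss kernel: $K(x_1, x_2) = \exp\left( -\dfrac{\|x_1 - x_2\|^2}{2\sigma^2} \right)$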
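The three kernels, and the way a trained SVM uses them, can be sketched as follows. The parameter names `R`, `d` and `sigma` follow the usual textbook forms and are assumptions here, and the "trained model" (support vectors, labels, multipliers, bias) is a hypothetical toy example rather than the output of a real solver.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x1, x2):
    return dot(x1, x2)

def polynomial_kernel(x1, x2, R=1.0, d=2):
    return (dot(x1, x2) + R) ** d

def gauss_kernel(x1, x2, sigma=1.0):
    sq_dist = sum((p - q) ** 2 for p, q in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the sign of f(x) is the predicted class."""
    return sum(alpha * y * kernel(sv, x)
               for sv, y, alpha in zip(support_vectors, labels, alphas)) + b

# Hypothetical toy model: two support vectors on the x-axis.
svs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(decision([2.0, 0.0], svs, ys, alphas, b, linear_kernel))
print(decision([2.0, 0.0], svs, ys, alphas, b, gauss_kernel))
```

Only the `kernel` argument changes between the linear and the non-linear classifier; the rest of the decision function stays the same.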
![Page 108: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/108.jpg)
SVM - example
[Figure: example decision boundaries obtained with a linear kernel and with a Gauss kernel.]
![Page 109: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/109.jpg)
Applications
SVM has been used successfully in many real-world problems:
• bioinformatics (mutation classification, cancer classification)
• text (and hypertext) categorization
• image classification (different types of sub-problems)
• hand-written character recognition
![Page 110: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/110.jpg)
Pros and Cons
With support vectors, the maximum-margin hyperplane is relatively stable.
Compared with other methods, even the fastest training algorithms for support vector machines are slow when applied in the nonlinear setting.
However, they often produce very accurate classifiers because subtle and complex decision boundaries can be obtained.
![Page 111: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/111.jpg)
Unit 6: Comparative Protein Structure Modeling of Genes and Genomes
Le Zhang, Ph. D.
Computer Science Department, Southwest University
![Page 112: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/112.jpg)
Catalogue
• What is comparative protein structure modeling?
• Why could we do comparative modeling?
• Why is comparative modeling important?
• How to do comparative modeling?
Fold assignment and template selection
Target – template alignment
Model building
Model evaluation
• The application of comparative modeling
• Comparative modeling in structural genomics
![Page 113: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/113.jpg)
1. What Is Comparative Protein Structure Modeling?
• Comparative protein structure modeling predicts the three-dimensional structure for a given protein sequence of unknown structure (the target) on the basis of sequence similarity to proteins of known structure (the templates).
![Page 114: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/114.jpg)
2. Why Could We Do Comparative Modeling?
• Small changes in the protein sequence usually result in small changes in its 3D structure. If similarity between two proteins is detectable at the sequence level, structural similarity can usually be assumed.
• The number of unique structural folds that proteins adopt is limited, and the number of experimentally determined new structures is increasing exponentially.
![Page 115: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/115.jpg)
3. Why Is Comparative Modeling Important?
• It is an efficient way to obtain useful information about the proteins of interest.
• Designing mutants to test hypotheses about a protein’s function
• Identifying active and binding sites
• Identifying, designing and improving ligands for a given binding site
• Modeling substrate specificity
• Predicting antigenic epitopes
• Simulating protein–protein docking
• Inferring function from a calculated electrostatic potential around the protein
• Facilitating molecular replacement in x-ray structure determination
• Refining models based on NMR constraints
• Testing and improving a sequence-structure alignment
• Confirming a remote structural relationship
• Rationalizing known experimental observations
![Page 116: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/116.jpg)
4. How To Do Comparative Modeling?
• Fold assignment and template selection
• Target – template alignment
• Model Building
• Model evaluation
![Page 117: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/117.jpg)
4.1 Fold Assignment And Template Selection
• Three main classes of protein comparison methods:
1. Comparing the target sequence with each of the database sequences independently. Programs: BLAST, FASTA, etc.
2. Using multiple sequence comparisons to improve the sensitivity of the search. Programs: PSI-BLAST, etc.
*especially useful when the sequence identity is below 25%
3. Threading or 3D template matching methods.
*especially useful when there are no sequences clearly related to the modeling target.
![Page 118: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/118.jpg)
4.1 Fold Assignment And Template Selection
• Template selection :
Criteria to consider: higher sequence similarity; the protein family; the quality of the template structure; the environment (solvent, pH, ligands, …).
• Potential problems:
Distantly related proteins used as templates (i.e., less than 25% sequence identity) may produce an unreliable model.
![Page 119: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/119.jpg)
4.1 Fold Assignment And Template Selection
• The databases and programs you may use in this step:
[Table not reproduced. Table footnotes: a: S, server; P, program. b: Some of the sites are mirrored on additional computers. c: (a) MolSoft Inc., San Diego; (b) Molecular Simulations Inc., San Diego; (c) Tripos Inc., St Louis; (d) ProCeryon Biosciences Inc., New York.]
![Page 120: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/120.jpg)
4.2 Target – Template Alignment
• Once templates have been selected, a specialized method should be used to align the target sequence with the template structures. Programs: CLUSTAL, etc.
• The alignment becomes difficult in the “twilight zone” of less than 30% sequence identity. (Only 20% of the residues are likely to be correctly aligned when two proteins share 30% sequence identity.)
• Note: the BLOSUM62 substitution matrix is derived from sequence blocks clustered at 62% identity; BLOSUM45 and BLOSUM80 are built the same way at about 45% and 80%.
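Whether a target-template pair falls in the twilight zone can be checked directly from an alignment. The sketch below assumes the two sequences are already aligned (for example, copied from CLUSTAL output) and counts identity over the columns where neither sequence has a gap; other definitions of percent identity use a different denominator. The sequence fragments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def in_twilight_zone(identity, threshold=30.0):
    """True when the identity is below the 30% threshold quoted on the slide."""
    return identity < threshold

# Hypothetical pre-aligned fragments, for illustration only.
target = "MKT-AYIAKQR"
template = "MRTLAYLA-QK"

pid = percent_identity(target, template)
print(f"{pid:.1f}% identity, twilight zone: {in_twilight_zone(pid)}")
```

A pair flagged by `in_twilight_zone` is a signal to bring in multiple sequences or structural information before trusting the alignment.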
![Page 121: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/121.jpg)
4.2 Target – Template Alignment
• In difficult cases, it is frequently beneficial to rely on multiple structure and sequence information. The information from structures helps to avoid gaps in secondary structure elements, in buried regions, or between two residues that are far apart in space.
• Potential problems: Even with the methods mentioned above, misalignment may occur, especially when the target-template sequence identity falls below 30%.
![Page 122: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/122.jpg)
4.2 Target – Template Alignment
• Programs and World Wide Web servers you may use in this step:
[Table not reproduced.]
![Page 123: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/123.jpg)
4.3 Model Building
• Three classes of methods can be used to construct a 3D model:
1. Modeling by Assembly of Rigid Bodies
Assemble a model from a small number of rigid bodies obtained from aligned protein structures.
2. Modeling by Segment Matching or Coordinate Reconstruction
Use a subset of atomic positions from template structures as “guiding” positions, then identify and assemble short, all-atom segments that fit these guiding positions.
3. Modeling by Satisfaction of Spatial Restraints
Generate many constraints or restraints on the structure of the target sequence, using its alignment to related protein structures as a guide.
![Page 124: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/124.jpg)
4.3 Model Building
• Programs and World Wide Web servers you may use in this step:
[Table not reproduced.]
![Page 125: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/125.jpg)
4.3.1 Loop Modeling
• Loops often determine the functional specificity of a given protein framework. They contribute to active and binding sites.
• Loop modeling can be seen as a mini protein folding problem, but loops are generally too short to provide sufficient information about their local fold.
• Three methods:
1) Ab initio methods
2) Database search techniques
3) A combination of both
![Page 126: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/126.jpg)
4.3.2 Sidechain Modeling
• Side chain conformations are predicted from similar structures and from steric or energetic considerations.
• Disulfide bridges are modeled using structural information from proteins in general and from equivalent disulfide bridges in related structures.
• Two effects on sidechain conformation:
1) The coupling between the main chain and side chains
2) The continuous nature of the distributions of side-chain dihedral angles
• Three different side-chain prediction methods:
1) The packing of backbone-dependent rotamers
2) The self-consistent mean-field approach to positioning rotamers based on their van der Waals interactions
3) The segment-matching method of Levitt
![Page 127: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/127.jpg)
4.3.3 Potential Problems
• A survey that analyzed the accuracy of three side-chain modeling methods found that they correctly predict only approximately 50% of χ1 angles and 35% of both χ1 and χ2 angles.
• Segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are the most difficult regions to model, especially when the insertion is more than 9 residues long.
• In some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region.
• As the sequences diverge, the packing of side chains in the protein core may change.
![Page 128: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/128.jpg)
4.4 Model Evaluation
• Typical errors in comparative models :
1. Errors in side-chain packing
2. Distortions and shifts in correctly aligned regions
3. Errors in regions without a template
4. Errors due to misalignments
5. Incorrect template
![Page 129: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/129.jpg)
4.4 Model Evaluation
• Typical errors in comparative models (continued):
[Figure not reproduced.]
![Page 130: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/130.jpg)
4.4 Model Evaluation
The criteria of evaluation
Having the correct fold or not
The target‐template sequences similarity
Distributions of many spatial features
The environment
Having good stereochemistry or not
![Page 131: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/131.jpg)
4.4 Model Evaluation
1) Having the correct fold or not
A model will have the correct fold if the correct template is picked and if that template is aligned at least approximately correctly with the target sequence.
The fold of a model can be assessed by a high sequence similarity with the closest template, an energy-based Z-score, or by conservation of the key functional or structural residues in the target sequence.
2) The target-template sequence similarity
Sequence identity above 30% is a relatively good predictor of the expected accuracy.
![Page 132: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/132.jpg)
4.4 Model Evaluation
Average model accuracy as a function of the template-target sequence similarity
EDN: human eosinophil neurotoxin, a ribonuclease with 3 α-helices and 2 three-stranded antiparallel β-sheets arranged in a single domain.
CRABPI: mouse cellular retinoic acid binding protein I, a single-domain protein composed of interacting α-helices packed at the edge of two orthogonal, 4- and 6-stranded antiparallel β-sheets. For the CRABPI model, 90% of Cα atoms superpose within 3.5 Å of their counterparts in the X-ray structure; the rms error is 1.31 Å.
NM23H2: human nucleoside diphosphate kinase, a single-domain protein consisting of a central 4-stranded antiparallel β-sheet surrounded by 8 α-helices. For the NM23H2 model, all but one Cα atom superpose within 3.5 Å of the X-ray structure; the rms difference is 0.41 Å.
![Page 133: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/133.jpg)
4.4 Model Evaluation
Average model accuracy as a function of the template-target sequence similarity
[Figure legend] Solid line: sample models. Dotted line: corresponding actual structures.
Percentage structure overlap is defined as the fraction of equivalent residues. Two residues are equivalent when their Cα atoms are within 3.5 Å of each other upon rigid-body, least-squares superposition of the two structures.
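The overlap definition above translates into a short calculation. The sketch below assumes that the model and the experimental structure have already been superposed (the rigid-body, least-squares fit itself is not implemented here) and that the two coordinate lists are in the same residue order. The coordinates are hypothetical, and the rms value is computed over all residue pairs, which is one of several possible conventions.

```python
import math

def structure_overlap(model_ca, native_ca, cutoff=3.5):
    """Return (percentage of equivalent residues, rms deviation in Å).

    Both inputs are lists of (x, y, z) Cα coordinates, already superposed
    and listed in the same residue order.
    """
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    dists = [math.dist(p, q) for p, q in zip(model_ca, native_ca)]
    equivalent = sum(1 for d in dists if d <= cutoff)
    overlap = 100.0 * equivalent / len(dists)
    rms = math.sqrt(sum(d * d for d in dists) / len(dists))
    return overlap, rms

# Hypothetical, already superposed Cα coordinates (Å).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
native = [(0.0, 0.0, 1.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0), (11.4, 0.0, 4.0)]

overlap, rms = structure_overlap(model, native)
print(f"overlap = {overlap:.1f}%, rms = {rms:.2f} Å")
```

In practice the superposition step would be done first with a structure-alignment tool, and the function above would then be applied to the fitted coordinates.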
![Page 134: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/134.jpg)
4.4 Model Evaluation
3) The environment
Example: some calcium-binding proteins undergo large conformational changes when bound to calcium. If a calcium-free template is used to model the calcium-bound state of the target, it is likely that the model will be incorrect.
4) Having good stereochemistry or not
This includes bond lengths, bond angles, peptide bond and side-chain ring planarities, chirality, main-chain and side-chain torsion angles, and clashes between nonbonded pairs of atoms.
5) Distributions of many spatial features
Such features include packing, formation of a hydrophobic core, residue and atomic solvent accessibilities, spatial distribution of charged groups, distribution of atom-atom distances, atomic volumes, and main-chain hydrogen bonding.
![Page 135: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/135.jpg)
4.4 Model Evaluation
• There are also methods for testing 3D models that implicitly take into account many of the criteria listed above. These methods are based on 3D profiles and statistical potentials of mean force.
• A physics‐based approach to deriving energy functions has been tested for use in protein structure evaluation (1999).
![Page 136: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/136.jpg)
4.4 Model Evaluation
• Programs and World Wide Web servers you may use in this step:
[Table not reproduced.]
![Page 137: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/137.jpg)
5. The Application of Comparative Modeling
• Three levels of model accuracy and some of the corresponding applications
Low accuracy: <30% sequence identity; less than 50% of their Cα atoms within 3.5 Å of their correct positions
Middle accuracy: 30-50% sequence identity; 85% of their Cα atoms within 3.5 Å of their correct positions
High accuracy: >50% sequence identity; approaches that of low resolution X-ray structures or medium resolution NMR structures
rw (van der Waals radius) of C atom = 1.70 Å
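The three accuracy levels can be turned into a simple lookup from target-template sequence identity. This is only a rule-of-thumb sketch based on the thresholds quoted in this unit; the function name and the wording of the use cases are illustrative assumptions, and real model accuracy also depends on alignment quality and template choice.

```python
# Typical uses per accuracy level, paraphrased from this unit.
TYPICAL_USE = {
    "low": "confirm or reject a match between remotely related proteins",
    "middle": "refine sequence-based functional predictions; design site-directed mutants",
    "high": "dock small ligands or whole proteins onto the protein",
}

def accuracy_level(seq_identity_percent):
    """Classify expected model accuracy from target-template sequence identity (%)."""
    if not 0.0 <= seq_identity_percent <= 100.0:
        raise ValueError("sequence identity must be between 0 and 100")
    if seq_identity_percent < 30.0:
        return "low"
    if seq_identity_percent <= 50.0:
        return "middle"
    return "high"

for identity in (22.0, 41.0, 68.0):
    level = accuracy_level(identity)
    print(f"{identity:.0f}% identity -> {level} accuracy: {TYPICAL_USE[level]}")
```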
![Page 138: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/138.jpg)
5. The Application of Comparative Modeling
• Applications 1: low accuracy models
• <30% sequence identity, having the correct fold
• Less than 50% of their Cα atoms within 3.5 Å of their correct positions
• Use: To confirm or reject a match between remotely related proteins
![Page 139: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/139.jpg)
5. The Application of Comparative Modeling
• Applications 2: middle accuracy models
• 30-50% sequence identity
• 85% of their Cα atoms within 3.5 Å of their correct positions
• Use: refinement of the functional prediction based on sequence; construction of site-directed mutants with altered or destroyed binding capacity; other problems...
![Page 140: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/140.jpg)
5. The Application of Comparative Modeling
• Applications 3: high accuracy models
• >50% sequence identity
• The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution NMR structures (10 distance restraints per residue)
• Use: For docking of small ligands or whole proteins onto a given protein.
![Page 141: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/141.jpg)
6. Comparative modeling in structural genomics
• The aim of structural genomics is to determine or accurately predict the 3D structure of all the proteins encoded in the genomes.
• This aim will be achieved by a focused, large‐scale determination of protein structures by X‐ray crystallography and NMR spectroscopy, combined efficiently with accurate protein structure modeling techniques.
![Page 142: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/142.jpg)
6. Comparative modeling in structural genomics
• For comparative modeling to contribute to structural genomics, automation of all the steps in the modeling process is essential.
• The automation of large‐scale comparative modeling involves assembling a software pipeline that consists of modules for fold assignment, template selection, target–template alignment, model generation, and model evaluation.
![Page 143: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/143.jpg)
6. Comparative modeling in structural genomics
• Two examples of large-scale comparative modeling for complete genomes:
SWISS-MODEL: The sequences encoded in the E. coli genome have been used to build models for 10–15% of the proteins using the SWISS-MODEL web server.
MODPIPE: MODPIPE produced models for five prokaryotic and eukaryotic genomes. This calculation resulted in models for substantial segments of 17.2%, 18.1%, 19.2%, 20.4%, and 15.7% of all proteins in the genomes of Saccharomyces cerevisiae (6218 proteins in the genome), Escherichia coli (4290 proteins), Mycoplasma genitalium (468 proteins), Caenorhabditis elegans (7299 proteins, incomplete), and Methanococcus jannaschii (1735 proteins), respectively.
![Page 144: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/144.jpg)
6. Comparative modeling in structural genomics
• Large-scale comparative modeling will extend opportunities to tackle a myriad of problems by providing many protein models for many genomes.
Protein evolution
Drug design
A facile comparison of ligand binding requirements and substitutions in and around important residues
......
A specific example:
The selection of a target protein for drug development!
![Page 145: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/145.jpg)
7. Conclusion
• Over the past few years, there has been a gradual increase in both the accuracy of comparative models and the fraction of protein sequences that can be modeled with useful accuracy.
• Further advances are necessary in recognizing weak sequence–structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model.
• It is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences.
• A major new challenge for comparative modeling is its integration with the torrents of data from genome sequencing projects as well as from functional and structural genomics.
![Page 146: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/146.jpg)
Reference
• Martí-Renom M A, Stuart A C, Fiser A, et al. Comparative protein structure modeling of genes and genomes[J]. Annual review of biophysics and biomolecular structure, 2000, 29(1): 291-325.
• Šali A, Potterton L, Yuan F, et al. Evaluation of comparative protein modeling by MODELLER[J]. Proteins: Structure, Function, and Bioinformatics, 1995, 23(3): 318-326.
• Fiser A, Do R K G, Šali A. Modeling of loops in protein structures[J]. Protein science, 2000, 9(9): 1753-1773.
• Sánchez R, Šali A. Comparative protein structure modeling in genomics[J]. Journal of Computational Physics, 1999, 151(1): 388-401.
![Page 147: Bioinformatics: Introduction and Methods](https://reader031.fdocuments.us/reader031/viewer/2022012021/61689840d394e9041f70f2e2/html5/thumbnails/147.jpg)
Bioinformatics: Introduction and Methods
Computer Science Department, Southwest University
Thank you