Iosif Vaisman
![Page 2: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/2.jpg)
NIH working definition of bioinformatics and computational biology (July 2000)
The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
![Page 3: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/3.jpg)
Bioinformatics bibliography (papers with the word “bioinformatics” in title or abstract)
[Figure: number of papers per year (axis 0 to 1000), 1988 to 2000; series: Medline, ISI, PNAS]
Liebman MN, Molecular modeling of protein structure and function: a bioinformatic approach.
J Comput Aided Mol Des 1988, 1(4):323-41
![Page 4: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/4.jpg)
Dynamics of Database Growth
[Figure: EMBL Sequence Database growth, logarithmic axis from 100 to 100,000,000, years 1983 to 2003]
![Page 5: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/5.jpg)
Comparative Sequence Sizes
• Yeast chromosome 3 350,000
• Escherichia coli (bacterium) genome 4,600,000
• Largest yeast chromosome now mapped 5,800,000
• Entire yeast genome 15,000,000
• Smallest human chromosome (Y) 50,000,000
• Largest human chromosome (1) 250,000,000
• Entire human genome 3,000,000,000
![Page 6: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/6.jpg)
The String Alignment Problem
string - a sequence of characters from some alphabet
given: two strings acbcdb and cadbd
one of possible alignments:
a c - - b c d b
- c a d b - d -
scoring function: exact match +2, mismatch -1, insertion -1
score: 3 · (2) + 5 · (-1) = 1
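The score above can be checked column by column. The following is a minimal sketch, assuming the two aligned rows are given as equal-length strings with "-" as the gap character; the function name `alignment_score` is introduced here only for illustration.

```python
def alignment_score(row_a, row_b, match=2, mismatch=-1, indel=-1):
    """Score an existing alignment column by column.

    Defaults follow the slide's scoring function:
    exact match +2, mismatch -1, insertion (gap) -1.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel      # a gap in either row
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # two different characters
    return score


# The alignment of acbcdb and cadbd shown on the slide.
print(alignment_score("ac--bcdb", "-cadb-d-"))
```

The same function can be applied to any other alignment of the two strings to compare candidate alignments under one scoring function.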
![Page 7: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/7.jpg)
The String Alignment Problem
given: two strings CTCATG and TACTTG
C T C A - T - G
  |   |   |   |
- T - A C T T G
score: 4 · (2) + 4 · (-1) = 4
C T C A T G
    |   | |
T A C T T G
score: 3 · (2) + 3 · (-1) = 3
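Rather than scoring alignments one at a time, the best score over all alignments can be found by dynamic programming. The sketch below is a standard Needleman-Wunsch style recurrence, not something taken from the slides; it assumes the same scoring function as above (match +2, mismatch -1, gap -1) and returns only the score, not the alignment itself.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap cost)."""
    # prev[j] holds the best score of aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned with a gap
            left = curr[j - 1] + gap  # b[j-1] aligned with a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("CTCATG", "TACTTG"))
```

Under this scoring the search can do at least as well as either alignment shown above, which illustrates why alignments are computed rather than drawn by hand.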
![Page 8: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/8.jpg)
Entropy and Redundancy of Language
CUR F W D DIS AND P
A SED IEND ROUGHT EATH EASE AIN
BLES FR B BR AND AG
![Page 9: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/9.jpg)
Entropy and Redundancy of Language
The sequences are 65% identical
A CURSED FIEND WROUGHT DEATH DISEASE AND PAIN
|| |||| ||||| ||||||| ||||| ||||| |||
A BLESSED FRIEND BROUGHT BREATH AND EASE AGAIN
** CUR**** F*****W******* D***** DIS*****AND P***
|| |||| ||||| ||||||| ||||| ||||| |||
**BLES****FR*****B*******BR*****AND ***** AG***
![Page 10: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/10.jpg)
Substitution Matrices
• Dayhoff (or MDM, or PAM) - Derived from global alignments of closely related sequences
PAM100 - number refers to evolutionary distance (Percentage of Acceptable point Mutations per 10^8 years)
[Figure: repeated PAM100 steps, PAM150 and PAM200 shown against a time axis marked at 100 million, 200 million and 300 million years]
![Page 11: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/11.jpg)
Substitution Matrices
• BLOSUM (BLOcks SUbstitution Matrix) - Derived from local, ungapped alignments of distantly related sequences
BLOSUM62 - number refers to the minimum percent identity
Reference: Henikoff & Henikoff Proteins 17:49, 1993
![Page 12: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/12.jpg)
Selecting a Matrix
• Compared sequences are related: 200 PAM or 250 PAM
• Database scanning: 120 PAM
• Local alignment search: 40 PAM, 120 PAM, 250 PAM
• Detection of related sequences using BLAST: BLOSUM 62
THERE IS NO “ONE SIZE FITS ALL” MATRIX !
Low PAM: short segments, high similarity
High PAM: long segments, low similarity
![Page 13: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/13.jpg)
Matrix Example
     A     B     C     D     E     F     G     H     I     K    ..
   1.5   0.2   0.3   0.3   0.3  -0.5   0.7  -0.1   0.0   0.0    ..  A
         1.1  -0.4   1.1   0.7  -0.7   0.6   0.4  -0.2   0.4    ..  B
               1.5  -0.5  -0.6  -0.1   0.2  -0.1   0.2  -0.6    ..  C
                     1.5   1.0  -1.0   0.7   0.4  -0.2   0.3    ..  D
                           1.5  -0.7   0.5   0.4  -0.2   0.3    ..  E
                                 1.5  -0.6  -0.1   0.7  -0.7    ..  F
                                       1.5  -0.2  -0.3  -0.1    ..  G
                                             1.5  -0.3   0.1    ..  H
                                                   1.5  -0.2    ..  I
                                                         1.5    ..  K
![Page 14: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/14.jpg)
Dayhoff’s Acceptable Point Mutations
Ala A
Arg R  30
Asn N 109  17
Asp D 154   0 532
Cys C  33  10   0   0
Gln Q  93 120  50  76   0
Glu E 266   0  94 831   0 422
Gly G 579  10 156 162  10  30 112
His H  21 103 226  43  10 243  23  10
Ile I  66  30  36  13  17   8  35   0   3
Leu L  95  17  37   0   0  75  15  17  40 253
Lys K  57 477 322  85   0 147 104  60  23  43  39
Met M  29  17   0   0   0  20   7   7   0  57 207  90
Phe F  20   7   7   0   0   0   0  17  20  90 167   0  17
Pro P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
Ser S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
Thr T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
Trp W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Tyr Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
Val V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr
![Page 15: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/15.jpg)
Search and alignment entropy
• Information content per position:
pam10 - 3.43 bits
pam120 - 0.98 bits
pam160 - 0.70 bits
pam250 - 0.38 bits
blosum62 - 0.70 bits
• Information requirements:
for search - 30 bits
for alignment - 16 bits
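The two lists above can be combined into a rough estimate of how long an aligned segment must be before it carries enough information. This is a minimal sketch, assuming the information simply adds up over aligned positions at the average rate quoted for each matrix; the function name and the dictionary are illustrative only.

```python
import math

# Average information content per aligned position, in bits (from the slide).
BITS_PER_POSITION = {
    "pam10": 3.43,
    "pam120": 0.98,
    "pam160": 0.70,
    "pam250": 0.38,
    "blosum62": 0.70,
}

# Information requirements, in bits (from the slide).
REQUIRED_BITS = {"search": 30, "alignment": 16}


def min_aligned_positions(task, matrix):
    """Smallest number of aligned positions that reaches the required bits."""
    return math.ceil(REQUIRED_BITS[task] / BITS_PER_POSITION[matrix])


for name in BITS_PER_POSITION:
    print(name, min_aligned_positions("search", name))
```

Read this way, a high-PAM matrix needs a much longer aligned region than a low-PAM matrix to reach the same number of bits.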
![Page 16: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/16.jpg)
Search and alignment entropy
Recommended matrices for different query lengths
Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)
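The recommendation table can be expressed as a small lookup. This is only a sketch: the table does not say which row the boundary lengths 50 and 85 belong to, so assigning them to the shorter range is an assumption made here, and `recommend_matrix` is a hypothetical helper name.

```python
def recommend_matrix(query_length):
    """Return (matrix name, gap costs) for a query length, per the table.

    Assumption: boundary values 50 and 85 fall into the shorter range.
    """
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


print(recommend_matrix(20))
print(recommend_matrix(120))
```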
![Page 17: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/17.jpg)
FASTA Algorithm
Step 1: First run (identities)
[Figure: Sequence A plotted against Sequence B]
![Page 18: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/18.jpg)
FASTA Algorithm
Step 2: Rescoring using PAM matrix
[Figure: Sequence A plotted against Sequence B, regions marked as high score or low score]
The score of the highest scoring initial region is saved as the init1 score.
![Page 19: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/19.jpg)
FASTA Algorithm
Step 3: Joining threshold - eliminates disjointed segments
[Figure: Sequence A plotted against Sequence B]
Non-overlapping regions are joined. The score equals the sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.
![Page 20: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/20.jpg)
FASTA Algorithm
Step 4: Alignment optimization using dynamic programming
[Figure: Sequence A plotted against Sequence B]
The score for this alignment is the opt score.
![Page 21: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/21.jpg)
FASTA Algorithm
FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair.
Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.
![Page 22: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/22.jpg)
FASTA Results
• When init1 = initn = opt: 100% homology over the matched stretch.
• When initn > init1: more than 1 matching region in the database with poorly matching separating regions.
• When opt > initn: the matching regions are greatly improved by adding gaps in one or both of the sequences.
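The three rules of thumb can be written as a small helper that takes the init1, initn and opt scores of one hit. This is a sketch for reading results, assuming the three scores have already been parsed from the FASTA output; `interpret_fasta_scores` is a hypothetical name and not part of the FASTA package.

```python
def interpret_fasta_scores(init1, initn, opt):
    """Return the slide's interpretations that apply to one FASTA hit."""
    notes = []
    if init1 == initn == opt:
        notes.append("100% homology over the matched stretch")
    if initn > init1:
        notes.append("more than one matching region, separated by poorly "
                     "matching regions")
    if opt > initn:
        notes.append("matching regions greatly improved by adding gaps")
    return notes


print(interpret_fasta_scores(120, 120, 120))
print(interpret_fasta_scores(60, 95, 140))
```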
![Page 23: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/23.jpg)
BLAST - Basic Local Alignment Search Tool
• Blast programs use a heuristic search algorithm. The programs use the statistical methods of Karlin and Altschul (1990,1993).
• Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.
![Page 24: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/24.jpg)
BLAST Algorithm
Step 1: Word list
Query sequence of length L
Maximum of L-w+1 words (typically w = 3 for proteins)
For each word from the query sequence find the list of words with high score using a substitution matrix (PAM or BLOSUM)
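The word-list step can be sketched as follows. The code enumerates the L-w+1 overlapping words of a query and, for one word, lists all words of the same length that score at or above a threshold. The scoring function `toy_score`, the threshold and the four-letter alphabet are placeholders chosen for illustration; a real search would use PAM or BLOSUM values and the full amino acid alphabet.

```python
from itertools import product


def query_words(query, w=3):
    """All overlapping words of length w; there are len(query) - w + 1."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Words of the same length scoring at least `threshold` against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup.
    return 2 if a == b else -1


words = query_words("ACDEAC", w=3)
print(words)
print(neighborhood("ACD", "ACDE", toy_score, threshold=3))
```

Exhaustive enumeration is only practical for short words, which is one reason w is kept small.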
![Page 25: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/25.jpg)
BLAST Algorithm
Step 2: Exact matches of words from the word list to the database sequences
[Figure: word list matched against database sequences]
![Page 26: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/26.jpg)
BLAST Algorithm
Step 3: Maximal Segment Pairs (MSPs)
For each exact word match, alignment is extended in both directions to find high score segments
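The extension step can be illustrated with an ungapped extension around a word hit. This is a simplified sketch: it extends left and right while tracking the best running score and stops once the score has dropped more than `x_drop` below that best. The drop-off rule, the parameter names and `toy_score` are assumptions for illustration and are not taken from the slides.

```python
def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup.
    return 2 if a == b else -1


def extend_hit(query, subject, q_pos, s_pos, w, score=toy_score, x_drop=5):
    """Extend a word hit of length w in both directions without gaps.

    Returns (segment score, query start, query end), end exclusive.
    """
    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    def extend(step, q, s):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score fell too far below the best seen so far
            q += step
            s += step
        return best, best_len

    right, right_len = extend(+1, q_pos + w, s_pos + w)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len


query, subject = "GGACDEFGG", "TTACDEFTT"
total, start, end = extend_hit(query, subject, q_pos=3, s_pos=3, w=3)
print(total, query[start:end])
```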
![Page 27: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/27.jpg)
Gapped BLAST
• The Gapped Blast algorithm allows gaps to be introduced into the alignments. That means that similar regions are not broken into several segments.
• This method reflects biological relationships much better.
![Page 28: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/28.jpg)
BLAST family of programs
• blastp - amino acid query sequence against a protein sequence database
• blastn - nucleotide query sequence against a nucleotide sequence database
• blastx - nucleotide query sequence translated in all reading frames against a protein database
• tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
• tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
![Page 29: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/29.jpg)
Database Searches
• Run Blast first, then depending on your results run a finer tool (Fasta, Smith-Waterman, etc.)
• Where possible use translated sequence.
• E() < 0.05 is statistically significant, usually biologically interesting. Check also 0.05 < E() < 10 because you might find interesting hits.
• Pay attention to abnormal composition of the query sequence; it usually causes biased scoring.
• Split large query sequences (if >1000 for DNA, >200 for protein).
• If the query has repeated segments, remove them and repeat the search.
![Page 30: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/30.jpg)
Documenting the Search
• Algorithm(s)
• Substitution matrix
• Gap penalty (FASTA)
• Name of database
• Version of database
• Computer used
![Page 31: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/31.jpg)
MULTIPLE SEQUENCE ALIGNMENT
![Page 32: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/32.jpg)
Computational complexity
Alignment of protein sequences with 200 amino acid residues:
# of sequences   CPU time
2                1 sec
3                200 sec
10               200^8 sec
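The three rows are consistent with the time growing by a factor equal to the sequence length for every sequence added beyond the second. The sketch below encodes that reading; the formula is inferred from the table rather than stated on the slide, and `estimated_cpu_seconds` is an illustrative name.

```python
def estimated_cpu_seconds(num_sequences, length=200, pair_seconds=1):
    """CPU time if each extra sequence multiplies the work by `length`."""
    if num_sequences < 2:
        raise ValueError("need at least two sequences")
    return pair_seconds * length ** (num_sequences - 2)


for k in (2, 3, 10):
    seconds = estimated_cpu_seconds(k)
    print(k, seconds, "sec, about", seconds / (3600 * 24 * 365), "years")
```

The jump from seconds to an astronomically long run for ten sequences is the motivation for the heuristic multiple alignment methods that follow.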
![Page 33: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/33.jpg)
Multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Column cost: the sum of costs for all possible pairs
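The column cost defined above is often called a sum-of-pairs cost. The sketch below computes it for columns of the eight-sequence alignment on this slide. The pair cost used here (0 for identical symbols, 1 otherwise, with the gap treated as an ordinary symbol) is a placeholder assumption; a real scoring scheme would use a substitution matrix and gap penalties.

```python
from itertools import combinations

ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWESNG--",
]


def toy_pair_cost(x, y):
    # Hypothetical pair cost: identical symbols cost 0, anything else 1.
    return 0 if x == y else 1


def column_cost(column, pair_cost=toy_pair_cost):
    """Sum of the pair costs over all possible pairs in one column."""
    return sum(pair_cost(x, y) for x, y in combinations(column, 2))


columns = list(zip(*ROWS))
print([column_cost(col) for col in columns])
```

The fully conserved cysteine column gets cost 0, while variable columns approach the maximum of 28 pairs for eight sequences.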
![Page 34: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/34.jpg)
Multiple alignment
A correct multiple alignment corresponds to an evolutionary history:
there is no way to determine the correct alignment directly
the practical way is to find an alignment with the maximum score
![Page 35: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/35.jpg)
Multiple sequence alignment
Given k (k > 2) sequences, s1,…, sk, each sequence consisting of characters from an alphabet A, a multiple alignment is a rectangular array, consisting of characters from the alphabet A’ (A + "-"), that satisfies the following 3 conditions:
1. There are exactly k rows.
2. Ignoring the gap character, row number i is exactly the sequence si.
3. Each column contains at least one character different from "-".
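The three conditions translate directly into a validity check. This is a minimal sketch, assuming the alignment is given as a list of row strings and the original sequences as a list in the same order; the function name is illustrative.

```python
def is_valid_multiple_alignment(rows, sequences, gap="-"):
    """Check the three conditions of the definition above."""
    # Condition 1: exactly k rows.
    if len(rows) != len(sequences):
        return False
    # The array must be rectangular (all rows of equal length).
    if len({len(row) for row in rows}) != 1:
        return False
    # Condition 2: removing gaps from row i gives back sequence i.
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # Condition 3: no column consists of gap characters only.
    return all(any(ch != gap for ch in column) for column in zip(*rows))


print(is_valid_multiple_alignment(["VSN-S", "-SNA-", "---AS"],
                                  ["VSNS", "SNA", "AS"]))
```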
![Page 36: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/36.jpg)
Consensus
Plurality - minimum number of votes for a consensus
Threshold - scoring matrix value below which a symbol may not vote for a coalition
Sensitivity - minimum score to select consensus
Profiles - blocks of prealigned sequences
![Page 37: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/37.jpg)
Multiple alignment algorithm
1. Pairwise alignments (progressive pairwise alignments)
2. Distance matrix calculation
3. Guide tree creation (hierarchical clustering)
4. New sequence addition
![Page 38: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/38.jpg)
Scoring system (distances)
$$D(ij) = -\ln\left(\frac{S_{real}(ij) - S_{rand}(ij)}{S_{iden}(ij) - S_{rand}(ij)}\right) \times 100$$
Sreal(ij) - observed similarity score for two aligned sequences i and j
Siden(ij) - average of the two scores for each sequence aligned with itself
Srand(ij) - average score determined from 100 global randomizations of the two sequences
The distances D(ij) are used to generate the distance matrix from which the approximate guide tree is generated.
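The distance can be computed from the three scores as sketched below. The placement of the factor 100 is read here as multiplying the logarithm, which is an assumption about the slide's layout; the function name and the example numbers are hypothetical.

```python
import math


def alignment_distance(s_real, s_iden, s_rand):
    """D(ij) = -ln((Sreal - Srand) / (Siden - Srand)) x 100.

    s_real: observed similarity score of the aligned pair
    s_iden: average of the two self-alignment scores
    s_rand: average score over randomized sequences
    """
    effective = (s_real - s_rand) / (s_iden - s_rand)
    if effective <= 0:
        raise ValueError("observed score is not above the random background")
    return -math.log(effective) * 100


# Hypothetical scores: identical sequences give distance 0,
# a pair halfway between random and identical gives about 69.
print(alignment_distance(100, 100, 20))
print(alignment_distance(60, 100, 20))
```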
![Page 39: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/39.jpg)
Multiple alignment
![Page 40: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/40.jpg)
Multiple alignment
Segment - line joining two vertices
Each unit m-dimensional cube in the lattice contains 2^m - 1 segments
[Figure: unit square for sequences A and B with vertices (0,0), (0,1), (1,0), (1,1); unit cube for sequences A, B and C from (0,0,0) to (1,1,1)]
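The count of segments leaving a lattice vertex can be verified by enumeration: each step either advances (1) or does not advance (0) in every sequence, and the all-zero step is excluded. A minimal sketch, with `unit_cube_steps` as an illustrative name:

```python
from itertools import product


def unit_cube_steps(m):
    """All non-zero 0/1 step vectors in m dimensions: 2**m - 1 of them."""
    return [step for step in product((0, 1), repeat=m) if any(step)]


for m in (2, 3, 4):
    steps = unit_cube_steps(m)
    print(m, len(steps), 2 ** m - 1)
```

For two sequences these are the three familiar moves of pairwise dynamic programming: (0,1), (1,0) and (1,1).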
![Page 41: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/41.jpg)
Multiple alignment
Alignment Path for 3 Sequences
(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2)
![Page 42: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/42.jpg)
Multiple alignment
Pairwise Projections of the Alignment
V S N - S
- S N A -
- - - A S
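The link between a lattice path and an alignment, and between an alignment and its pairwise projections, can be made concrete. The sketch below takes the path (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2) and the three sequences VSNS, SNA and AS read off the rows above, rebuilds the alignment, and then projects it onto pairs of sequences by dropping columns in which both rows have a gap. Function names are illustrative.

```python
def path_to_alignment(path, sequences, gap="-"):
    """Turn a lattice path into alignment rows, one column per path step."""
    rows = ["" for _ in sequences]
    for prev, curr in zip(path, path[1:]):
        for k, seq in enumerate(sequences):
            step = curr[k] - prev[k]
            if step not in (0, 1):
                raise ValueError("each coordinate may advance by 0 or 1")
            # Advance in sequence k: emit its next character, else a gap.
            rows[k] += seq[prev[k]] if step == 1 else gap
    return rows


def project_pair(rows, i, j, gap="-"):
    """Pairwise projection: keep columns where rows i and j are not both gaps."""
    kept = [(a, b) for a, b in zip(rows[i], rows[j]) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


PATH = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 0), (3, 3, 1), (4, 3, 2)]
rows = path_to_alignment(PATH, ["VSNS", "SNA", "AS"])
print(rows)
print(project_pair(rows, 1, 2))
```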
![Page 43: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/43.jpg)
Alignment statistics
    Rablpb Humcetp Rabcetp  Bovbpi Humlbpa  Ratlbp Maccetp  Humbpi
         1       2       3       4       5       6       7       8
       478     67%     65%     19%     19%     18%     42%     43%
1        0     82%     80%     39%     39%     36%     64%     65%
         0      1%      0%      5%      5%     12%      2%      2%
       327     483     58%     16%     16%     16%     39%     41%
2      400       0     75%     38%     38%     35%     62%     63%
         5       0      0%      5%      5%     12%      1%      1%
       318     284     482     18%     18%     17%     40%     43%
3      390     367       0     38%     38%     35%     64%     64%
         4       1       0      5%      5%     12%      1%      1%
        96      84      95     494     95%     74%     20%     21%
4      198     192     194       0     98%     84%     40%     41%
        30      29      28       0      0%      7%      6%      5%
![Page 44: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/44.jpg)
Alignment score
Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi
1 2 3 4 5 6 7 8
1 4077
2 5358 4129
3 5323 5650 4096
4 8103 8229 8112 4210
5 8109 8243 8118 4332 4219
6 8535 8672 8575 5511 5519 4261
7 6474 6531 6500 8103 8119 8572 4103
8 6392 6434 6378 8033 8035 8520 5508 4083
1 2 3 4 5 6 7 8
![Page 45: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/45.jpg)
Alignment visualization
          * * * * 50
Humlbpa : M---MGALARALPS-ILLALLLTSTPEALGA-NPGLVARITDKGLQYAAQEGLLALQ :  52
Rablpb  : M---MGTWARALLGSTLLSLLLAAAPGALGT-NPGLITRITDKGLEYAAREGLLALQ :  53
Ratlbp  : M---MKSATGPLLP-TLLGLLLLSIPRTQGV-NPAMVVRITDKGLEYAAKEGLLSLQ :  52
Humcetp : M---MLAATVLT---LALLGNAHACSKGTSH-EAGIVCRITKPALLVLNHETAKVIQ :  50
Maccetp : M---MLAATVLT---LALLGNVHACSKGTSH-KAGIVCRITKPALLVLNQETAKVIQ :  50
Rabcetp : -----------------------ACPKGASY-EAGIVCRITKPALLVLNQETAKVVQ :  33
Humbpi  : MRENMARGPCNAPRWVSLMVLVAIGTAVTAAVNPGVVVRISQKGLDYASQQGTAALQ :  57
Bovbpi  : M---MARGPDTARRWATLVVLAALGTAVTTT-NPGIVARITQKGLDYACQQGVLTLQ :  53
          m m l g66 RI3 L 2 6Q
[Second panel, end positions only: Humlbpa 130, Rablpb 131, Ratlbp 130, Humcetp 131, Maccetp 131, Rabcetp 114, Humbpi 135, Bovbpi 131]
Identity
Summary view
![Page 46: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/46.jpg)
Alignment visualization
          * * * * 50
Humlbpa : M---MGALARALPS-ILLALLLTSTPEALGA-NPGLVARITDKGLQYAAQEGLLALQ :  52
Rablpb  : M---MGTWARALLGSTLLSLLLAAAPGALGT-NPGLITRITDKGLEYAAREGLLALQ :  53
Ratlbp  : M---MKSATGPLLP-TLLGLLLLSIPRTQGV-NPAMVVRITDKGLEYAAKEGLLSLQ :  52
Humcetp : M---MLAATVLT---LALLGNAHACSKGTSH-EAGIVCRITKPALLVLNHETAKVIQ :  50
Maccetp : M---MLAATVLT---LALLGNVHACSKGTSH-KAGIVCRITKPALLVLNQETAKVIQ :  50
Rabcetp : -----------------------ACPKGASY-EAGIVCRITKPALLVLNQETAKVVQ :  33
Humbpi  : MRENMARGPCNAPRWVSLMVLVAIGTAVTAAVNPGVVVRISQKGLDYASQQGTAALQ :  57
Bovbpi  : M---MARGPDTARRWATLVVLAALGTAVTTT-NPGIVARITQKGLDYACQQGVLTLQ :  53
          m m l g66 RI3 L 2 6Q
Physico-chemical properties
          * * * * 50
Humlbpa : .---.G.LA...PS-...A...TST.EALG.-.....A...D............... :  52
Rablpb  : .---.GTWA....GST..S.....A.GALGT-.....T...D.......R....... :  53
Ratlbp  : .---.KS..GP..P-T..G...LSI..TQGV-..A..V...D.......K....S.. :  52
Humcetp : .---.L...VLT---.A..GNAH..S....H-EA........PA.LVLNH.TAKV.. :  50
Maccetp : .---.L...VLT---.A..GN.H..S....H-KA........PA.LVLN..TAKV.. :  50
Rabcetp : -----------------------.....A.Y-EA........PA.LVLN..TAKV.. :  33
Humbpi  : ......RGPCNAP...S......IGTAV.A.......V...Q...D..S...TA... :  57
Bovbpi  : .---..RGPDTAR..AT....A.LGTAV..T-.....A...Q...D..C.....T.. :  53
          m m l g66 RI3 L 2 6Q
Differences mode
![Page 47: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/47.jpg)
Alignment visualization (tree)
![Page 48: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/48.jpg)
Sequence Logos: a quantitative graphical display for binding sites and proteins
Reference: Schneider, T.D. Meth. Enzym 274:445, 1996
![Page 49: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/49.jpg)
Sequence Logos
![Page 50: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/50.jpg)
Sequence Logos
![Page 51: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/51.jpg)
Multiple Alignment Programs
• Pileup (GCG): Needleman and Wunsch algorithm for pairwise alignment and UPGMA method for tree construction
• CLUSTAL: Wilbur and Lipman algorithm for pairwise alignment (CABIOS 8:189, 1992)
• PIMA: pattern-matching based algorithm (PNAS 87:118, 1990)
• TreeAlign: phylogenetic algorithm (Meth. Enzymol. 18:626, 1990)
![Page 52: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/52.jpg)
Patterns in protein sequences
![Page 53: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/53.jpg)
Regular Expressions
Patterns described in a standard way are known as regular expressions
x     ANY
[ ]   OR            [ILV]    I or L or V
{ }   NOT           {DE}     not D or E
( )   repetitions   x(2,3)   x-x or x-x-x
-     separator
<     N-terminal
>     C-terminal
.     END
![Page 54: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/54.jpg)
Regular Expressions
[AC]-x-V-x(4)-{ED}.
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
...LKHVAYVFQALIYWIK...
...AVEMAGVKYLQVQHGS...
...LYTGAIVTNNDGPYMA...
...KEYKCKVEKELTDICN...
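A pattern written in this notation can be translated mechanically into a conventional regular expression and applied to the four sequence fragments above. The sketch below covers only the elements listed on the previous slide (x, [ ], { }, repetitions, separator, terminal anchors and the final period); `prosite_to_regex` is a hypothetical helper, not an official tool.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # N-terminal anchor
        at_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        # Split an optional repetition suffix such as (4) or (2,3).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                      # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # none of the listed residues
        else:
            token = core                     # [..] alternatives or one residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + token + ("$" if at_end else ""))
    return "".join(parts)


regex = re.compile(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
for fragment in ("LKHVAYVFQALIYWIK", "AVEMAGVKYLQVQHGS",
                 "LYTGAIVTNNDGPYMA", "KEYKCKVEKELTDICN"):
    hit = regex.search(fragment)
    print(fragment, hit.group() if hit else None)
```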
![Page 55: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/55.jpg)
PROSITE Database
Current version contains 1079 documentation entries that describe 1459 different patterns, rules and profiles/matrices
[ST]-x(2)-[DE] Casein kinase II phosphorylation site
[AG]-x(4)-G-K-[ST] ATP/GTP-binding site motif A (P-loop)
Y-x-[NQH]-K-[DE]-[IVA]-F-[LM]-R-[ED] Heat shock hsp90 proteins family signature
http://www.expasy.ch/prosite
![Page 56: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/56.jpg)
Blocks Database
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
N-6 Adenine-specific DNA methylases proteins
width=9 seqs=78
DMA_VIBCH|Q08318   ( 85)  SCTQWWPPF   77
HEMK_MYCLE|P45832  (181)  DLFVAQPTL  100
MT57_ECOLI|P25240  (111)  DGALGNPPF   13
MTC1_CHVN1|Q01511  (172)  NFVFLDPPY    8
MTC1_COREQ|P42828  ( 71)  QLSFSCPPF   49
MTH2_HAEHA|P00473  ( 32)  KIAFFDPQY   52
MTH3_HAEIN|P43871  ( 23)  HAIISDIPY   73
MTM1_MICAM|P50190  (306)  AAVLTNPPF   14
MTM2_MORBO|P23192  ( 25)  QLAVIDPPY   10
MTMU_MYCSP|P43641  ( 37)  QVIYADPPW   13
MTR1_RHOSH|P14751  ( 60)  QLIICDPPY    8
....................................
http://www.blocks.fhcrc.org/
![Page 57: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/57.jpg)
Pfam Database
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains
Zinc finger, C2H2 type
TYY1_HUMAN/383-407  YVCPF.DGCN...KKFAQSTNLKSHILT...H
ZG52_XENLA/61-83    YTCT...QCN...KQFSHSAQLRAHIST...H
KRUP_DROME/306-328  YTCE...ICD...GKFSDSNQLKSHMLV...H
YKQ8_CAEEL/78-102   YKCT...VCR...KDISSSESLRTHMFKQ.HH
DEFI_CHICK/268-292  YECP...NCK...KRFSHSGSYSSHISSK.KC
ZFH1_DROME/389-413  FGCD...NCG...KRFSHSGSFSSHMTSK.KC
YL57_CAEEL/42-65    YLCY...YCG...KTLSDRLEYQQHMLK..VH
ZFA_MOUSE/542-564   FKCD...ICL...LTFSDTKEVQQHALV...H
BASO_HUMAN/719-742  FQCD...ICK...KTFKNACSVKIHHKN..MH
HUNB_DROME/297-319  FQCD...KCS...YTCVNKSMLNSHRKS...H
SFP1_YEAST/598-623  FKCPV.IGCE...KTYKNQNGLKYHRLH..GH
ZG29_XENLA/62-84    FVCT...VCG...KTYKYKHGLNTHLHS...H
http://pfam.wustl.edu/
![Page 58: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/58.jpg)
Other Motif Databases
PRINTS : a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
DOMO : a protein domain database
http://www.infobiogen.fr/~gracy/domo/home.htm
ProDom : a protein domain database
http://protein.toulouse.inra.fr/prodom.html
![Page 59: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/59.jpg)
InterPro Database
InterPro : integrated resource for the commonly used signature databases - Pfam, PRINTS, PROSITE, ProDom and SWISS-PROT + TrEMBL.
Current release of InterPro (3.2) contains 3939 entries, representing 1009 domains, 2850 families, 65 repeats and 15 post-translational modification sites.
http://www.ebi.ac.uk/interpro
![Page 60: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/60.jpg)
InterPro Database
![Page 61: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/61.jpg)
From genes to proteins
[Figure: DNA → RNA → mRNA → PROTEIN, with the steps TRANSCRIPTION, SPLICING and TRANSLATION; marked features: PROMOTER ELEMENTS, SPLICE SITES, START CODON, STOP CODON]
![Page 62: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/62.jpg)
From genes to proteins
![Page 63: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/63.jpg)
![Page 64: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/64.jpg)
Chromosome 19 gene map
![Page 65: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/65.jpg)
Computational Gene Prediction
• Where are the genes unlikely to be located?
• How do transcription factors know where to bind a region of DNA?
• Where are the transcription, splicing, and translation start and stop signals?
• What do coding regions do (and non-coding regions do not)?
• Can we learn from examples?
• Does this sequence look familiar?
![Page 66: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/66.jpg)
Measures of Prediction Accuracy
Nucleotide Level
[Figure: REALITY and PREDICTION tracks of c and nc (coding and non-coding) nucleotides, with segments labeled TP, FP, TN and FN]
                    REALITY
                    c     nc
PREDICTION    c     TP    FP
              nc    FN    TN
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
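The nucleotide-level measures can be computed by comparing an annotated and a predicted labeling position by position. This is a minimal sketch, assuming each nucleotide is labeled "c" (coding) or "n" (non-coding) in two strings of equal length; the encoding and the function name are choices made for this example.

```python
def nucleotide_accuracy(reality, prediction, coding="c"):
    """Return (TP, FP, FN, TN, Sn, Sp) with Sn = TP/(TP+FN), Sp = TP/(TP+FP)."""
    if len(reality) != len(prediction):
        raise ValueError("labelings must have equal length")
    tp = fp = fn = tn = 0
    for real, pred in zip(reality, prediction):
        if pred == coding and real == coding:
            tp += 1   # predicted coding, really coding
        elif pred == coding:
            fp += 1   # predicted coding, really non-coding
        elif real == coding:
            fn += 1   # coding nucleotide that was missed
        else:
            tn += 1   # non-coding in both
    sn = tp / (tp + fn) if tp + fn else None
    sp = tp / (tp + fp) if tp + fp else None
    return tp, fp, fn, tn, sn, sp


# Hypothetical ten-nucleotide example.
print(nucleotide_accuracy("nnccccnncc", "ncccnnnncc"))
```

Note that Sp here follows the definition on the slide, TP / (TP + FP), which differs from the TN-based specificity used in some other fields.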
![Page 67: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/67.jpg)
Measures of Prediction Accuracy
Exon Level
[Figure: REALITY and PREDICTION exon tracks, with exons labeled WRONG EXON, CORRECT EXON and MISSING EXON]
Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
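At the exon level the same two ratios are computed over whole exons. The sketch below assumes an exon counts as correct only when both of its boundaries match an actual exon exactly, which is a common convention but is not stated on the slide; the coordinates are hypothetical.

```python
def exon_accuracy(actual_exons, predicted_exons):
    """Return (Sn, Sp) over exons given as (start, end) pairs.

    Assumption: a predicted exon is correct only if both ends match exactly.
    """
    actual, predicted = set(actual_exons), set(predicted_exons)
    correct = actual & predicted
    sn = len(correct) / len(actual) if actual else None
    sp = len(correct) / len(predicted) if predicted else None
    return sn, sp


# Hypothetical coordinates: two of three actual exons are recovered,
# and two of four predictions are correct.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (500, 600), (700, 800)]
print(exon_accuracy(actual, predicted))
```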
![Page 68: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/68.jpg)
Spliced Alignment (Procrustes)
•New genomic sequence
•Selection of candidate exons
AUG --- GU initial exons
AG --- GU internal exons
AG --- UAA or UAG or UGA terminal exons
•Filtration (based on the codon usage statistics)
•Construction of all possible chains of candidate exons
•Finding a chain with the maximum global similarity to the target protein
![Page 69: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/69.jpg)
Spliced Alignment (Procrustes)
![Page 70: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/70.jpg)
Predicted Exon Assembly (Procrustes)
![Page 71: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/71.jpg)
PCR Primers Prediction (GenePrimer)
Exon 1085..1182 (98) hit using first 2 primers
Exon 1628..1676 (49) missed
Exon 1900..2001 (102) hit using first 8 primers
Exon 2110..2184 (75) missed
Exon 2516..2722 (207) hit using first 4 primers
Exon 3385..3472 (88) missed
Exon 3546..3746 (201) hit using first primer
...
![Page 72: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/72.jpg)
GRAIL gene identification program
POSSIBLE EXONS
REFINED EXON POSITIONS
FINAL EXON CANDIDATES
![Page 73: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/73.jpg)
Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)
![Page 74: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/74.jpg)
GeneMark Accuracy Evaluation
![Page 75: Iosif Vaisman](https://reader033.fdocuments.us/reader033/viewer/2022061501/56815922550346895dc64af0/html5/thumbnails/75.jpg)
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
and
http://www-hto.usc.edu/software/procrustes/fans_ref/