BITS: Basics of sequence analysis
![Page 1: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/1.jpg)
Basic bioinformatics concepts, databases and tools
Module 3
Sequence analysis
Joachim Jacob
http://www.bits.vib.be
Updated Feb 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod3-intro_H1_2012_SeqAn.pdf
![Page 2: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/2.jpg)
In this third module, we will discuss the possible analyses of sequences
Module 1
Sequence databases and keyword searching
Module 2
Sequence similarity
Module 3
Sequence analysis: types, interpretation, results
![Page 3: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/3.jpg)
In this third module, we will discuss the possible analyses of sequences
![Page 4: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/4.jpg)
Sequence analysis tries to read sequences to infer biological properties
AGCTACTACGGACTACTAGCAGCTACCTCTCTG
- is this a coding sequence?
- can this sequence bind a certain TF?
- what is the melting temperature?
- what is the GC content?
- does it fold into a stable secondary structure?
…
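Two of the questions above (GC content and melting temperature) are plain calculations on the sequence. The sketch below is a minimal illustration, not the method of any particular tool: the function names are made up for this example, and the melting temperature uses the Wallace rule of thumb, 2(A+T) + 4(G+C), which is only a rough estimate for short oligonucleotides.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C).

    Rule of thumb for short oligonucleotides only (assumption of this sketch).
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


example = "AGCTACTACGGACTACTAGCAGCTACCTCTCTG"  # sequence shown on the slide
print(f"length = {len(example)}, GC = {gc_content(example):.1%}")
print(f"Wallace Tm of the first 12 bases = {wallace_tm(example[:12])} degrees C")
```

Dedicated primer design tools use more elaborate thermodynamic models, so the second number should be read as an order-of-magnitude check only.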
![Page 5: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/5.jpg)
Tools that can predict a biological feature are trained with examples
Automatic annotation
vs.
experimentally verified annotations
- Training dataset of sequences (← exp. verified)
- An algorithm defines parameters used for prediction
- The algorithm determines/classifies whether the sequence(s) contains the feature (→ automatic annotation)
![Page 6: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/6.jpg)
The assumption that biological function can be read from the sequence is the central paradigm
DNA → protein sequence → structure → activity (binding, enzymatic activity, regulatory,...)
So the premise of sequence analysis: biological function can be read from the (DNA) sequence.
Predictions always serve as a basis for further experiments.
![Page 7: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/7.jpg)
Analysis can be as simple as measuring properties or predicting features
Protein
− Metrics (e.g. how many alanines in my seq)
− Modifications and other predictions
− Domains and motifs
DNA/RNA
− Metrics (e.g. how many GC)
− Predicting: gene prediction, promoter, structure
![Page 8: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/8.jpg)
Simple protein sequence analysis
One might be interested in:
− pI (isoelectric point) prediction
− Composition metrics
− Hydrophobicity calculation
− Reverse translation (protein → DNA)
− Occurrence of simple patterns (e.g. does KDL occur, and how many times)
− ...
http://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-chart.html
http://en.wikipedia.org/wiki/Hydrophobicity_scales
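Several of these simple analyses fit in a few lines. The sketch below shows composition, occurrence of a short pattern such as KDL, and a mean hydropathy value. The protein sequence is made up, the function names are hypothetical, and the hydropathy values are the Kyte-Doolittle scale, which is one of several scales and is chosen here only as an example.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale (one of several published scales).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(protein: str) -> Counter:
    """Count how often each amino acid occurs."""
    return Counter(protein.upper())


def count_pattern(protein: str, pattern: str) -> int:
    """Count occurrences of a literal pattern, overlapping ones included."""
    protein, pattern = protein.upper(), pattern.upper()
    last_start = len(protein) - len(pattern)
    return sum(protein.startswith(pattern, i) for i in range(last_start + 1))


def mean_hydropathy(protein: str) -> float:
    """Average Kyte-Doolittle value over all residues."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


protein = "MKDLAAKDLGGKDEL"  # made-up example sequence
print(composition(protein).most_common(3))
print("KDL occurs", count_pattern(protein, "KDL"), "times")
print(f"mean hydropathy = {mean_hydropathy(protein):.2f}")
```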
![Page 9: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/9.jpg)
Protein sequence analysis tools are gathered on Expasy
http://www.expasy.org/tools (SIB)
Others: http://www.ebi.ac.uk/Tools/protein.html http://bioweb.pasteur.fr/protein/intro-en.html SMS2
![Page 10: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/10.jpg)
Never trust a tool's output blindly
Interpretation depends on the kind of output
When a prediction result is obtained, the question arises: 'Is it true?' (in a biological sense)
Programs giving a 'binary' result: 1 or 0, a hit or a miss.
Approach: compare different prediction programs for higher confidence.
E.g. SignalP for signal peptide prediction.
Programs giving a score/P-value result: the chance that the 'result' is 'not real' → the lower, the better
Approach: assess the P-value
E.g. ScanProsite for a motif
![Page 11: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/11.jpg)
The basis for the prediction of features is nearly always a sequence alignment
Based on experimentally verified sequence annotations, a multiple sequence alignment is constructed
Different methods exist to capture the information gained from this multiple sequence alignment
![Page 12: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/12.jpg)
Alignment reveals similar residues which can indicate identical structure
Most protein pairs with more than 25-30 out of 100 identical residues were found to be structurally similar.
Also proteins with <10% identity can have similar structure.
http://peds.oxfordjournals.org/content/12/2/85.long
Same structure, hence most likely same function
Chances are that the structure is not the same
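The "identical residues out of 100" figure is a percent identity computed over an alignment. A minimal sketch follows, assuming two already-aligned sequences of equal length and assuming that columns containing a gap are left out of the count; other definitions of the denominator exist, so this is one convention among several. The function name and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        identical += x == y
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


print(f"{percent_identity('ATPKAE', 'AKPKAA'):.1f}% identity")
print(f"{percent_identity('AKPKT-', 'AKPKAA'):.1f}% identity")
```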
![Page 13: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/13.jpg)
The structure of a protein sequence determines its biological function
Primary = AA chain
Feb 2012: ~ 535 000 in Swissprot
Secondary = structural entities
(helix, beta-strands, beta-sheets, loops)
Tertiary = 3D
Nov 2011: ~ 80 000 in PDB
Quaternary = interactions
Number of reported structures
http://en.wikipedia.org/wiki/Protein_structure
![Page 14: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/14.jpg)
Degree of similarity with other sequences varies over the length
Homologous Histone H1 protein sequences
More conserved
![Page 15: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/15.jpg)
Protein sequences can consist of structurally different parts
Domain
part of the tertiary structure of a protein that can exist, function and evolve independently of the rest, linked to a certain biological function
Motif
part (not necessarily contiguous) of the primary structure of a protein that corresponds to the signature of a biological function. Can be associated with a domain.
Feature
part of the sequence for which some annotation has been added. Some features correspond to domain or motif assignments.
![Page 16: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/16.jpg)
Based on motifs and domains, proteins are assigned to families
Nearly synonymous with gene family
Evolutionarily related proteins
Significant structural similarity of domains is reflected in sequence similarity, and is due to a common ancestral sequence part, resulting in domain families.
![Page 17: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/17.jpg)
Domains and motifs are represented by simple and complex methods
Motif/domain in silico can be represented by
1. Regular expression / pattern
2. Frequency matrix / profile
3. Machine learning techniques : Hidden Markov Model
Gapped alignment
domain
http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
![Page 18: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/18.jpg)
Regular expressions / patterns are the simplest way to represent motifs
A representation of all residues with equal probability.
Position: 123456
ATPKAE
KKPKAA
AKPKAK
TKPKPA
AKPKT-
AKPAAK
KLPKAD
AKPKAA
Consensus: AKPKAA (for every position, the most frequently occurring residue)
Pattern, per position 1. to 6.: [AKT] [AKLT] P [AK] [APT] [ADEKT-]
? Does this sequence match: AKPKTE → V V V V V V
? And this sequence: KKPETE → V V V X V V
? And what about this one: TLPATE → V V V V V V
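The pattern on this slide maps directly onto a regular expression. The sketch below copies the six character classes exactly as written on the slide (with "-" standing for a gap) and checks the three query sequences, both as a whole and position by position. The names `MOTIF`, `CLASSES`, `matches` and `per_position` are made up for this illustration.

```python
import re

# Character classes copied from the slide; "-" in the last class is a gap symbol.
CLASSES = ["AKT", "AKLT", "P", "AK", "APT", "ADEKT-"]

# The same pattern as one regular expression ("-" last in a class is literal).
MOTIF = re.compile(r"[AKT][AKLT]P[AK][APT][ADEKT-]")


def matches(seq: str) -> bool:
    """True if the whole sequence fits the pattern."""
    return MOTIF.fullmatch(seq) is not None


def per_position(seq: str) -> list:
    """One True/False per position, like the V/X marks on the slide."""
    return [residue in allowed for residue, allowed in zip(seq, CLASSES)]


for query in ("AKPKTE", "KKPETE", "TLPATE"):
    marks = " ".join("V" if ok else "X" for ok in per_position(query))
    print(query, marks, "match" if matches(query) else "no match")
```

A pattern gives a yes/no answer only: a sequence with one unusual residue is rejected outright, however well the other positions fit.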
![Page 19: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/19.jpg)
Frequency matrices or profiles include the chance of observing the residues
For every position of a motif, a list of all amino acids is made with their frequency. Position-specific weight/scoring matrix or profile. More sensitive way.
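Position: 123456
ATPKAE
KKPKAA
AKPKAK
TKPKPA
AKPKT-
AKPAAK
KLPKAD
AKPKAA
Consensus: AKPKA-
Profile:
| Residue | 1. | 2. | 3. | 4. | 5. | 6. |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0.625 | 0 | 0 | 1/8 | 6/8 | 3/8 |
| D | 0 | 0 | 0 | 0 | 0 | 1/8 |
| E | 0 | 0 | 0 | 0 | 0 | 1/8 |
| K | 0.25 | 6/8 | 0 | 7/8 | 0 | 2/8 |
| L | 0 | 1/8 | 0 | 0 | 0 | 0 |
| P | 0 | 0 | 1 | 0 | 1/8 | 0 |
| T | 1/8 | 1/8 | 0 | 0 | 1/8 | 0 |
| - | 0 | 0 | 0 | 0 | 0 | 1/8 |
| Sum | 1 | 1 | 1 | 1 | 1 | 1 |
? Query: AKPKTE
? Query: KKPETE
? Query: TLPATE
http://prosite.expasy.org/prosuser.html#meth2
Example: http://expasy.org/prosite/PS51092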
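The frequency matrix can be derived mechanically from the alignment columns. The following sketch counts the residues in each column of the eight example sequences and divides by the number of sequences. Exact fractions are used so the output can be compared with the slide; the names `ALIGNMENT` and `frequency_matrix` are made up for this example.

```python
from collections import Counter
from fractions import Fraction

# The eight aligned sequences from the slide ("-" is a gap).
ALIGNMENT = [
    "ATPKAE", "KKPKAA", "AKPKAK", "TKPKPA",
    "AKPKT-", "AKPAAK", "KLPKAD", "AKPKAA",
]


def frequency_matrix(alignment):
    """Return one {residue: frequency} dictionary per alignment column."""
    n = len(alignment)
    return [
        {residue: Fraction(count, n) for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


profile = frequency_matrix(ALIGNMENT)
for position, column in enumerate(profile, start=1):
    cells = ", ".join(f"{res}={freq}" for res, freq in sorted(column.items()))
    print(position, cells)
```

Residues that never occur in a column are simply absent from that column's dictionary, which corresponds to the zeros in the table.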
![Page 20: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/20.jpg)
How well a sequence matches a profile is reported with a score
Position: 123456
ATPKAE
KKPKAA
AKPKAK
TKPKPA
AKPKT-
AKPAAK
KLPKAD
AKPKAA
Consensus: AKPKA-
PSWM: scores
| Residue | 1. | 2. | 3. | 4. | 5. | 6. |
| --- | --- | --- | --- | --- | --- | --- |
| A | 2.377 | -2.358 | -2.358 | 0.257 | 2.631 | 1.676 |
| D | -2.358 | -2.358 | -2.358 | -2.358 | -2.358 | 0.257 |
| E | -2.358 | -2.358 | -2.358 | -2.358 | -2.358 | 0.257 |
| K | 1.134 | 2.631 | -2.358 | 2.847 | -2.358 | 1.134 |
| L | -2.358 | 0.257 | -2.358 | -2.358 | -2.358 | -2.358 |
| P | -2.358 | -2.358 | 3.035 | -2.358 | 0.257 | -2.358 |
| T | 0.257 | 0.257 | -2.358 | -2.358 | 0.257 | -2.358 |
? Query: AKPKTE Score = 11.4
? Query: KKPETE Score = 5.0
? Query: TLPATE Score = 4.3
http://prosite.expasy.org/prosuser.html#meth2
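A profile score is the sum of the per-position scores of the query residues. The slide does not say how its scores were derived. The values are consistent with log2(count + 0.195), where count is the number of times a residue occurs in a column of the eight-sequence alignment, so the sketch below uses that formula purely as an assumption. With it, the three query scores on the slide are reproduced to one decimal. All names are made up for this illustration.

```python
import math
from collections import Counter

ALIGNMENT = [
    "ATPKAE", "KKPKAA", "AKPKAK", "TKPKPA",
    "AKPKT-", "AKPAAK", "KLPKAD", "AKPKAA",
]

# Assumption: pseudocount inferred from the slide's numbers, not stated there.
PSEUDOCOUNT = 0.195


def profile_score(query: str, alignment=ALIGNMENT, pseudocount=PSEUDOCOUNT) -> float:
    """Sum of log2(count + pseudocount) over all positions of the query."""
    columns = [Counter(column) for column in zip(*alignment)]
    if len(query) != len(columns):
        raise ValueError("query and profile must have the same length")
    # Counter returns 0 for residues never seen in a column.
    return sum(math.log2(col[res] + pseudocount) for res, col in zip(query, columns))


for query in ("AKPKTE", "KKPETE", "TLPATE"):
    print(query, round(profile_score(query), 1))
```

Unlike the pattern, the profile still gives KKPETE a positive score: the unusual E at position 4 is penalised but does not disqualify the sequence.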
![Page 21: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/21.jpg)
A hidden Markov model also takes the gaps in an alignment into account
Schematic representation of an HMM
http://www.myoops.org/twocw/mit/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-895Fall-2005/E096327C-7C77-4D23-BEBA-C28B087A9280/0/lecture6.pdf
![Page 22: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/22.jpg)
Building an HMM from a multiple sequence alignment
![Page 23: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/23.jpg)
Use HMMER to search a protein database very sensitively with an HMM
You can search with a profile in a sequence database
![Page 24: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/24.jpg)
Some profile adjustments to the BLAST protocol exist for particular purposes
PSI-BLAST to identify distantly related proteins
PSI-BLAST (position specific iterated)
After a search result, a profile is made of the similar sequences, and this is used again to search a database
PHI-BLAST to find proteins that match a pattern
PHI-BLAST (pattern hit initiated): you provide a pattern, which all BLAST results should satisfy.
CSI-BLAST is more sensitive than PSI-BLAST in identifying distantly related proteins
PSI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=blastp&RUN_PSIBLAST=on
PHI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=blastp&RUN_PSIBLAST=on
CSI BLAST http://toolkit.tuebingen.mpg.de/cs_blast
![Page 25: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/25.jpg)
Many databases exist that keep patterns, profiles or models related to function
Motif / domain databases (see NCBI bookshelf for good overview)
http://www.ebi.ac.uk/interpro/ - integrated db
http://expasy.org/prosite/ (motifs)
PFAM – hidden markov profiles (domains)
CDD (Conserved domains database) (NCBI - integrated)
Prodom (domain) (automatic extraction)
SMART (domain)
PRINTS (motif) sets of local alignments without gaps, used as frequency matrices, made by searching manually made "seed alignments" against UniProt sequences
![Page 26: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/26.jpg)
Prosite is a database gathering patterns from sequence alignments
ScanProsite tool: search the Prosite database for a pattern (present or not)
Example : [DE](2)-H-S-{P}-x(2)-P-x(2,4)-C>
You can retrieve sequences that match a pattern: one you made up yourself, one observed in an alignment, or a known one. The syntax is specific, but not difficult: see the link below!
http://prosite.expasy.org/scanprosite/scanprosite-doc.html#pattern_syntax
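To see what the example pattern means, it can be translated into an ordinary regular expression. The converter below is a minimal sketch covering only the syntax elements used in the example: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` and `(n,m)` for repeats, and `<` / `>` for the sequence ends. The function name and the two test sequences are made up; it is not the ScanProsite implementation.

```python
import re

# One pattern element: a residue class, optionally followed by a repeat count.
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a (subset of) PROSITE pattern syntax into a Python regex."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except these
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex += prefix + core + repeat + suffix
    return regex


regex = prosite_to_regex("[DE](2)-H-S-{P}-x(2)-P-x(2,4)-C>")
print(regex)
print(bool(re.search(regex, "MKDEHSAGGPTTTC")))  # made-up sequence
print(bool(re.search(regex, "MKDEHSPGGPTTTC")))  # P right after H-S is excluded
```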
![Page 27: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/27.jpg)
Interpro classifies the protein data into families based on the domain and motifs
Interpro takes all existing motif and domain databases as input ('signatures'), and aligns them to create protein domain families. This reduces redundancy. Each domain is then given an identifier IPRxxxxxxx.
Uneven sizes of motifs and families are handled by 'relations':
parent - child and contains - found in
Families,... Regions, domains, ...
http://www.ebi.ac.uk/interpro/user_manual.html#type
![Page 28: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/28.jpg)
Interpro summarizes domains and motifs from about a dozen domain databases
ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html#2
http://www.ebi.ac.uk/interpro/databases.html
![Page 29: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/29.jpg)
InterPro entries are grouped in types
Family
Entries span complete sequence
Domain
Biologically functional units
Repeat
Region
Conserved site
Active site
Binding site
PTM site
![Page 30: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/30.jpg)
InterPro entries are grouped in types
![Page 31: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/31.jpg)
You can search your sequence for known domains on InterProScan
Interproscan http://www.ebi.ac.uk/Tools/pfa/iprscan/
![Page 32: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/32.jpg)
A sequence logo provides a visual summary of a motif
Creating a sequence logo
Create a nice-looking logo of a motif: the size of the letters indicates frequency.
Weblogo - a basic web application to create colorful logos
IceLogo - a powerful web application to create customized logos
![Page 33: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/33.jpg)
A sequence logo provides a visual summary of a motif
iceLogo
Position: 123456
ATPKAE
KKPKAA
AKPKAK
TKPKPA
AKPKT-
AKPAAK
KLPKAD
AKPKAA
Consensus: AKPKA-
http://www.bits.vib.be/wiki/index.php/Exercises_on_multiple_sequence_alignment#Sequence_logo
![Page 34: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/34.jpg)
There is always a chance that a prediction of a feature by a tool is false
[Figure: number of matches plotted against score, with a score threshold. Ideal situation: true negatives and true positives are cleanly separated by the threshold. Reality of the databases: true negatives and true positives overlap around the threshold, giving false positives and false negatives.]
![Page 35: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/35.jpg)
Assessing the performance of categorizing tools with sensitivity and specificity
“Confusion matrix”
| PREDICTION \ TRUTH | Sequence contains feature | Sequence does NOT contain feature |
| --- | --- | --- |
| Feature is predicted | True positive | False positive (“Type I error”) |
| Feature is NOT predicted | False negative (“Type II error”) | True negative |
![Page 36: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/36.jpg)
Assessing the performance of categorizing tools with sensitivity and specificity
Sensitivity = TP / (TP + FN)
“Confusion matrix” cells used: true positives (TP) and false negatives (FN), i.e. all sequences that contain the feature.
![Page 37: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/37.jpg)
Assessing the performance of categorizing tools with sensitivity and specificity
Selectivity or specificity = TN / (FP + TN)
“Confusion matrix” cells used: true negatives (TN) and false positives (FP), i.e. all sequences that do NOT contain the feature.
![Page 38: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/38.jpg)
Assessing the performance of categorizing tools with sensitivity and specificity
Error rate* = (FP + FN) / total
* misclassification rate
“Confusion matrix” cells used: false positives (FP) and false negatives (FN), relative to all sequences.
![Page 39: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/39.jpg)
Assessing the performance of categorizing tools with sensitivity and specificity
Accuracy = (TP + TN) / total
“Confusion matrix” cells used: true positives (TP) and true negatives (TN), relative to all sequences.
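The four measures from the confusion matrix slides can be computed together from the four cell counts. The sketch below is a minimal illustration. The function name and the example counts are made up, and the function assumes that both classes are present (otherwise a division by zero occurs).

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, error rate and accuracy from a confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),    # found among those that have the feature
        "specificity": tn / (fp + tn),    # rejected among those that lack it
        "error_rate": (fp + fn) / total,  # misclassification rate
        "accuracy": (tp + tn) / total,
    }


# Hypothetical benchmark: 45 sequences with the feature, 55 without.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Error rate and accuracy always add up to 1, so reporting sensitivity and specificity alongside one of them gives the more complete picture.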
![Page 40: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/40.jpg)
Protein sequences can be searched for potential modifications
http://www.expasy.org/tools/ e.g. modification (phosphorylation, acetylation,...)
To deal with the confidence in the results, try different tools, and make a graph (Venn diagram) to compare the results
E.g. predict secreted proteins with SignalP and RPSP, and combine the results in a Venn diagram
− http://bioinformatics.psb.ugent.be/webtools/Venn/
− http://www.cmbi.ru.nl/cdd/biovenn/
Overview of signal peptide prediction tools: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788353/
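The regions of a two-tool Venn diagram are plain set operations on the identifiers that each tool reports. The sketch below assumes that the output of each tool has already been reduced to a set of sequence identifiers. The identifiers and the function name are hypothetical.

```python
def venn_regions(hits_a: set, hits_b: set) -> dict:
    """Split the predictions of two tools into the three Venn regions."""
    return {
        "both": hits_a & hits_b,     # predicted by both tools: highest confidence
        "only_a": hits_a - hits_b,
        "only_b": hits_b - hits_a,
    }


# Hypothetical identifiers of proteins predicted as secreted by two tools.
tool_a_hits = {"prot1", "prot2", "prot3", "prot5"}
tool_b_hits = {"prot2", "prot3", "prot4"}

regions = venn_regions(tool_a_hits, tool_b_hits)
for name, members in regions.items():
    print(name, sorted(members))
```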
![Page 41: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/41.jpg)
Protein sequences can be searched for secondary structural elements
Based on known structures, machine learning models of secondary structure elements are made and can be searched for.
See http://bioinf.cs.ucl.ac.uk/psipred/
![Page 42: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/42.jpg)
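In case of multiple analyses on multiple sequences, mark instead of filter
Worse: starting set of sequences → Analysis filter 1 → Analysis filter 2 → Analysis filter 3 (each filter discards sequences before the next analysis)
Better: starting set of sequences, every sequence marked with the results of Analysis filter 1, Analysis filter 2 and Analysis filter 3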
After performing all analyses on all sequences, different filters on the results can be applied (e.g. secreted sequence, phosphorylated and containing a motif)!
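The "mark instead of filter" approach amounts to storing every analysis result for every sequence and selecting afterwards. In the sketch below the three analyses are trivial stand-ins (a real workflow would store tool predictions instead), and all names and sequences are made up for illustration.

```python
# Stand-in analyses (hypothetical); real ones would be predictions from tools.
def has_kdl_motif(seq: str) -> bool:
    return "KDL" in seq


def starts_with_met(seq: str) -> bool:
    return seq.startswith("M")


def is_long(seq: str) -> bool:
    return len(seq) >= 10


ANALYSES = {"motif": has_kdl_motif, "met_start": starts_with_met, "long": is_long}


def mark(sequences: dict) -> dict:
    """Run every analysis on every sequence and keep all results."""
    return {
        name: {label: analysis(seq) for label, analysis in ANALYSES.items()}
        for name, seq in sequences.items()
    }


def select(marks: dict, *required: str) -> list:
    """Names of the sequences for which all required marks are True."""
    return sorted(name for name, result in marks.items()
                  if all(result[label] for label in required))


sequences = {"seq1": "MKDLAAGGKDEL", "seq2": "AKDLGG", "seq3": "MAAAAAAAAAAA"}
marks = mark(sequences)
print(select(marks, "motif", "met_start"))
print(select(marks, "met_start", "long"))
```

Because nothing is discarded, a different combination of criteria can be tried later without rerunning any analysis.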
![Page 43: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/43.jpg)
NA sequences
![Page 44: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/44.jpg)
NA sequence analyses
GC% http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::geecee
Melting temperature
For primer development, such as with Primer3
Structure
Codon usage
Codon usage table with cusp
Codon adaptation index calculation with cai
...
A lot of tools can be found at the Mobyle Portal.
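As an illustration of what a codon usage table contains, the sketch below counts codons in a coding sequence read in frame from its first base. It is not the EMBOSS cusp program, only a hand-written approximation of the idea. The function names and the example sequence are made up.

```python
from collections import Counter


def codon_usage(cds: str) -> Counter:
    """Count codons in a coding sequence, read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_fractions(cds: str) -> dict:
    """Fraction of all codons in the sequence taken up by each codon."""
    counts = codon_usage(cds)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}


example_cds = "ATGGCTGCTGCATAA"  # made-up sequence: ATG GCT GCT GCA TAA
print(codon_usage(example_cds))
print(codon_fractions(example_cds)["GCT"])
```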
![Page 45: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/45.jpg)
Profiles and models are being used to model biological function in NA seqs
To detect transcription factor binding sites:
TRANSFAC : commercial (BIOBASE, Wolfenbüttel, Germany), started as work of Edgar Wingender, contains eukaryotic binding sites as consensus sequences and as PSSMs. Also TRANSCompel with modules of binding sites.
ooTFD : commercial (IFTI, Pittsburgh PA, USA), started as work of David Ghosh, contains prokaryotic and eukaryotic binding sites as consensus sequences and as PSSMs.
JASPAR : open access, only representative sets of higher eukaryote binding sites as PSSMs. Can be searched against sequence or sequence pair at Consite.
ORegAnno : open access, collection of individual eukaryotic binding sites with their localization in the genome
PAZAR : collection of open access TF databanks
![Page 46: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/46.jpg)
Sequence logos can give an insight into the important residues of binding sites
DNA: an entry from JASPAR: TATA box
A [ 61 16 352 3 354 268 360 222 155 56 83 82 82 68 77 ]
C [145 46 0 10 0 0 3 2 44 135 147 127 118 107 101 ]
G [152 18 2 2 5 0 10 44 157 150 128 128 128 139 140 ]
T [ 31 309 35 374 30 121 6 121 33 48 31 52 61 75 71 ]
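The count matrix above holds everything a sequence logo displays. The sketch below derives the most frequent base per position and the information content per position in bits, which is what sets the height of a logo column. The simple formula 2 minus the Shannon entropy is assumed here, without the small-sample correction that some logo tools apply. The function names are made up.

```python
import math

BASES = "ACGT"

# Count matrix of the JASPAR TATA box entry, as shown on the slide.
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268, 360, 222, 155, 56, 83, 82, 82, 68, 77],
    "C": [145, 46, 0, 10, 0, 0, 3, 2, 44, 135, 147, 127, 118, 107, 101],
    "G": [152, 18, 2, 2, 5, 0, 10, 44, 157, 150, 128, 128, 128, 139, 140],
    "T": [31, 309, 35, 374, 30, 121, 6, 121, 33, 48, 31, 52, 61, 75, 71],
}


def consensus(counts: dict) -> str:
    """Most frequent base at every position."""
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda base: counts[base][i]) for i in range(length))


def information_content(counts: dict) -> list:
    """Bits per position: 2 minus the Shannon entropy of the column."""
    bits = []
    for column in zip(*(counts[base] for base in BASES)):
        total = sum(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in column if c > 0)
        bits.append(2.0 - entropy)
    return bits


print(consensus(COUNTS))
print([round(b, 2) for b in information_content(COUNTS)])
```

The positions with the highest information content are the ones drawn tallest in the logo and spell out the TATA box itself.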
![Page 47: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/47.jpg)
The RNA world has the Vienna servers
http://rna.tbi.univie.ac.at/
− secondary structure prediction of ribosomal sequences
− siRNA design
![Page 48: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/48.jpg)
RNA families can be modeled by conserved bases and structure
RNA motifs (http://rfam.sanger.ac.uk/search)
Rfam is a databank of RNA motifs and families. It is made at the Sanger Centre (Hinxton, UK), from a subset of EMBL (well-annotated standard sequences excluding synthetic sequences + the WGS) using the INFERNAL suite of Sean Eddy. It contains local alignments with gaps, with secondary structure annotation included, + CMs (covariance models).
![Page 49: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/49.jpg)
Some interesting links
Nucleic acid structure
Unafold - program accessible through a web interface
After designing primers, you might want to check whether the primer product does (not) adopt a stable secondary structure.
Some collections of links
− Good overview at http://www.imb-jena.de/RNA.html
− European Ribosomal RNA database (VIB PSB)
![Page 50: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/50.jpg)
Prediction of genes in genomes relies on the integration of multiple signals
Signals surrounding the gene (transcription factor binding sites, promoters, transcription terminators, splice sites, polyA sites, ribosome binding sites,...)
→ profile matching
Differences in composition between coding and noncoding DNA (codon preference), the presence of an Open Reading Frame (ORF)
→ compositional analyses
Similarity with known genes, aligning ESTs and (in translation) similarity with known proteins and the presence of protein motifs
→ similarity searches
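The simplest of these signals, the presence of an open reading frame, can be searched for directly. The sketch below scans the three forward reading frames for stretches that start with ATG and end at the first in-frame stop codon. It ignores the reverse strand, introns and every other signal listed above, so it only illustrates the idea and is not a gene predictor. The function name, the `min_codons` parameter and the example sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start and end are 0-based slice positions; the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


example_dna = "CCATGAAATTTGGGTAACC"  # made-up sequence
for start, end, frame in find_orfs(example_dna):
    print(frame, start, end, example_dna[start:end])
```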
![Page 51: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/51.jpg)
Prediction of genes in genomes relies on the integration of multiple signals
Signals: e.g. potential methylation sites (profiles)
Composition: GC
Similarity: alignment of ESTs
![Page 52: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/52.jpg)
Software for predicting genes
EMBOSS
− simple software under EMBOSS : syco (codon frequency), wobble (%GC 3rd base), tcode (Fickett statistic : correlation between bases at distance 3)
Examples of software using HMM model of gene :
Wise2 : using also similarity with known proteins http://www.ebi.ac.uk/Tools/Wise2
GENSCAN : commercial (Chris Burge, Stanford U.) but free for academics, has models for human/A. thaliana/maize, used at EBI and NCBI for genome annotation http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::genscan
GeneMark : commercial (GeneProbe, Atlanta GA, USA) but free for academic users, developed by Mark Borodovsky, has models for many prokaryotic and eukaryotic organisms http://exon.gatech.edu
Tutorial on gene prediction http://www.embl.de/~seqanal/courses/spring00/GenePred.00.html
![Page 53: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/53.jpg)
Short addendum about downloading files
FTP, e.g. ftp://ftp.ebi.ac.uk/pub/databases/interpro/
– 'file transfer protocol'
– Most browsers have integrated ftp 'client'
– Free, easy to download files, possibility to resume after failures
HTTP, e.g. http://www.ncbi.nlm.nih.gov/entrez
– Standard protocol for internet traffic
– Slowest method
Aspera – for downloading large datasets (>10 GB)
– In use at the Short Read Archive (SRA)
– Fastest method currently available
![Page 54: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/54.jpg)
Conclusion
Prediction vs. experimentally verified
Different algorithms need to be compared
Predictions need to be validated by an independent method
Software <-> Databases
Questions? Get social!
→ www.seqanswers.com
→ http://biostar.stackexchange.com
Always only a basis for further wet-lab research
![Page 55: BITS: Basics of sequence analysis](https://reader034.fdocuments.us/reader034/viewer/2022042613/548532565806b5d6588b474c/html5/thumbnails/55.jpg)
Summary
In this third module, we will discuss the possible analyses of sequences
Sequence analysis tries to read sequences to infer biological properties
Tools that can predict a biological feature are trained with examples
The assumption that biological function can be read from the sequence is the central paradigm
Analysis can be as simple as measuring properties or predicting features
Protein sequence analysis tools are gathered on Expasy
Never trust a tool's output blindly
The basis for the prediction of features is nearly always a sequence alignment
Alignment reveals similar residues which can indicate identical structure
The structure of a protein sequence determines its biological function
Degree of similarity with other sequences varies over the length
Protein sequences can consist of structurally different parts
Based on motifs and domains, proteins are assigned to families
Domains and motifs are represented by simple and complex methods
Regular expressions / patterns are the simplest way to represent motifs
Frequency matrices or profiles include the chance of observing the residues
How well a sequence matches a profile is reported with a score
A hidden Markov model also takes the gaps in an alignment into account
Use HMMER to search a protein database very sensitively with an HMM
Some profile adjustments to the BLAST protocol exist for particular purposes
Many databases exist that keep patterns, profiles or models related to function
Prosite is a database gathering patterns from sequence alignments
Interpro classifies the protein data into families based on the domain and motifs
Interpro summarizes domains and motifs from about a dozen domain databases
You can search your sequence for known domains on InterProScan
A sequence logo can provide a visual summary of a motif
Protein sequences can be searched for potential modifications
Protein sequences can be searched for secondary structural elements
In case of multiple analyses on multiple sequences, mark instead of filter
Profiles and models are being used to model biological function in NA seqs
Sequence logos can give an insight into the important residues of binding sites
The RNA world has the Vienna servers
RNA families can be modeled by conserved bases and structure
Prediction of genes in genomes relies on the integration of multiple signals