Part I: Identifying sequences with …
Speaker: S. Gaj
Date: 11-01-2005
Annotation
Annotation:
• Best possible description available for a given sequence at the current time.
How to annotate?
• Combining:
– Alignment tools
– Databases
– Data mining (scripts)
Introduction
Global alignment
• Optimal alignment between two sequences, containing as many characters of the query as possible.
• Ex: predicting evolutionary relationships between genes, …
Local alignment
• Optimal alignment between two sequences, identifying identical area(s).
• Ex: identifying key molecular structures (S-bonds, α-helices, …)
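The difference between the two alignment types can be illustrated with a small scoring sketch. This is plain dynamic programming with one linear gap penalty, not the faster heuristic that BLAST uses. The function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration only.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score when both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: only gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: only gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of any pair of sub-sequences; cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Two made-up sequences that share only the short stretch "ACGT".
seq_a = "TTTTACGTCCCC"
seq_b = "GGGACGTAAA"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

The local score rewards the shared stretch alone, while the global score is pulled down by the unrelated flanks that it is forced to align.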
Introduction
Basic Local Alignment Search Tool (BLAST)
• Aligns an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
• Aim: obtaining structural or functional information on the unknown sequence.
Programs
• Different BLAST programs available:
Query \ Database   Nucleic   Protein
Nucleic            BlastN    BlastX
Protein            -         BlastP
• Usable criteria:
– E-Value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …
• Terms:
– Query: sequence which will be aligned
– Subject: sequence present in database
– Hit: alignment result
Common BLAST problems
• BlastN: sequencing error
Clone seq   C G A T A G C C C G C C A G G A T A T A
            | | | | | | | | |   | | | | | | | | | |
mRNA        C G A T A G C C C - C C A G G A T A T A
• Solution: low penalty for GOP and GEP = 1
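The effect of the gap penalties on such an alignment can be sketched as follows. The code assumes one common convention, in which a gap of length n costs GOP + n × GEP; other tools count the first gap position differently. The match and mismatch values and the function name are illustrative assumptions, not settings taken from the slides.

```python
def alignment_score(top, bottom, match=1, mismatch=-3, gop=5, gep=2):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    Assumed convention: a gap of length n costs gop + n * gep.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the currently open gap sits in
    for x, y in zip(top, bottom):
        side = "top" if x == "-" else "bottom" if y == "-" else None
        if side is None:
            score += match if x == y else mismatch
        elif side == gap_side:
            score -= gep          # extending an open gap
        else:
            score -= gop + gep    # opening a new gap
        gap_side = side
    return score


# The 20-column alignment from the slide: 19 matches and one single-base gap.
clone = "CGATAGCCCGCCAGGATATA"
mrna = "CGATAGCCC-CCAGGATATA"
print("strict penalties:", alignment_score(clone, mrna, gop=5, gep=2))
print("low penalties:   ", alignment_score(clone, mrna, gop=1, gep=1))
```

With low penalties a single inserted or deleted base from a sequencing error costs little, so the hit keeps a score close to that of a perfect match.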
Translation Problems
• 6-Frame translation
>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
+1   L A L * P S S Q H E G S H C S G A
+2   * H S D L A V N M K A L I V L G
+3, -1, -2, -3   (translations not shown)
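A six-frame translation can be sketched in a few lines. The sketch assumes the standard genetic code and writes stop codons as `*`, as on the slide; the function names are illustrative, and the sequence is the 51-nucleotide fragment shown above.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; unknown codons give 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


fragment = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
for name, protein in sorted(six_frames(fragment).items()):
    print(name, protein)
```

Frames +1 and +2 reproduce the two translations shown on the slide. Only one of the six frames can be the true reading frame, which is why a single inserted or deleted nucleotide is enough to shift a hit into a different frame.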
Common BLAST problems
• BlastX against protein sequence
• Clones are derived from mRNA, which consists of a coding region and a non-coding region.
• 3 possible hit-situations:
– Yields no protein hit.
– Aligns with protein in 1 of the 6 frames.
– Part perfect alignment (coding region or non-coding region).
Introduction
Primary database:
– DNA sequence (EMBL, GenBank, …)
– Amino acid sequence (SwissProt, PIR, …)
– Protein structure (PDB, …)
Secondary database:
– Derived from primary DB
– DNA sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)
Structure:
– Flat file databases
Primary Databases
EMBL:
– DNA sequence
– Human: 4.126.190.851 nucleotides in 292.205 entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admission.
Primary Databases
SwissProt:
– Amino acid sequence
– Human:
– Contains protein information
– SwissProt (EU), PIR (USA)
– Crosslinks to most informative DB (PDB, OMIM)
– Part of UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated
Secondary Databases
TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment: SpTrEMBL → SwissProt
Secondary Databases
UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus sequence
– Species-specific
Secondary Databases
LocusLink:
– Curated sequences
– Descriptive information about genetic loci
RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, Protein
– Stable reference for gene identification and characterization.
– High curation
Database Quality?
• EMBL (DNA, mRNA): Submitter → Database Manager
• SwissProt (Protein): Submitter → Database Manager → Curators
How to Annotate?
BlastN against random nucleotide DB
– ESTs
BlastN against structured nucleotide DB (UniGene, RefSeq)
– mRNA hits
– Sometimes not annotated at all
– Best information
What do we have?
• Probe sequence
• Alignment tools (e.g. BLAST)
• Databases
What to choose?
Possibilities?
1. Do it like everyone else does.
2. Make use of the curation of certain databases.
Goal: annotate as many genes with as much information as possible (e.g. SwissProt ID)
1st Approach - General
“Done by most array manufacturers”
Step-by-step approach:
– BLAST sequences against a nucleic database (preferably UniGene)
– Extract high quality (HQ) hits (>95%)
– For each HQ hit, search crosslinks.
– Find a well-described (SwissProt) ID for each sequence.
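The hit-extraction step of this approach could look roughly like the sketch below. It assumes BLAST results in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score). It also assumes that ">95%" refers to percent identity, which the slides do not state. The probe and cluster identifiers and the function name are made up for illustration.

```python
import csv
import io


def best_hq_hits(tabular_text, min_identity=95.0):
    """Return {query: (subject, percent_identity, bit_score)} for the
    highest-scoring hit per query whose identity exceeds `min_identity`."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, bit_score = float(row[2]), float(row[11])
        if identity <= min_identity:
            continue  # not a high quality hit
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, identity, bit_score)
    return best


# Hypothetical BLAST output: two hits for probe1, one weak hit for probe2.
example = "\n".join([
    "probe1\tclusterA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t360",
    "probe1\tclusterB\t99.00\t80\t1\t0\t1\t80\t10\t89\t1e-35\t150",
    "probe2\tclusterC\t91.20\t180\t16\t0\t1\t180\t5\t184\t1e-60\t300",
])
for probe, hit in best_hq_hits(example).items():
    print(probe, hit)
```

Choosing the hit with the highest bit score among the HQ hits is one possible tie-break; the crosslink search would then start from the subject identifier that is kept for each probe.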
2nd Approach - General
“Make use of present database curation”
Other way around:
– Use SwissProt to clean out EMBL
– Result: “cleaned” EMBL database (cEMBL) with direct SP crosslinks
– BLAST against cEMBL
– Extract high quality alignment hits (>95%)
– Convert EMBL ID to SP ID.
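One way to picture the "cleaning" step is shown below. This is a minimal sketch that assumes SwissProt flat-file records with an `AC` line for the accession and `DR   EMBL; …` lines for the EMBL cross-references, each record ending with `//`. All accessions in the example are invented and the function names are hypothetical; the slides do not describe the actual implementation.

```python
def embl_to_swissprot(flat_file_text):
    """Map every EMBL accession cross-referenced from a SwissProt record
    to the primary accession of that record."""
    mapping = {}
    primary = None
    for line in flat_file_text.splitlines():
        code, _, rest = line.partition("   ")
        if code == "AC" and primary is None:
            primary = rest.split(";")[0].strip()  # first accession is primary
        elif code == "DR" and rest.startswith("EMBL;") and primary:
            embl_accession = rest.split(";")[1].strip()
            mapping.setdefault(embl_accession, primary)
        elif line.startswith("//"):
            primary = None  # end of record
    return mapping


def clean_embl(embl_accessions, mapping):
    """Keep only EMBL entries that a SwissProt record points to (the cEMBL set)."""
    return [acc for acc in embl_accessions if acc in mapping]


# Invented records in SwissProt flat-file layout.
records = """ID   PROT_A
AC   SP0001; SP0009;
DR   EMBL; EM0001; CAA00001.1; -; mRNA.
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
//
ID   PROT_B
AC   SP0002;
DR   EMBL; EM0002; CAA00002.1; -; Genomic_DNA.
DR   EMBL; EM0003; CAA00003.1; -; mRNA.
//"""

sp_for_embl = embl_to_swissprot(records)
cembl = clean_embl(["EM0001", "EM0004", "EM0003"], sp_for_embl)
print(cembl)
print([sp_for_embl[acc] for acc in cembl])
```

Because every entry left in the cleaned set already carries a SwissProt accession, converting a high quality hit from EMBL ID to SP ID becomes a single dictionary lookup.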
Annotating Incyte Reporters
Total: 13.497
cEMBL-approach: 2.898 (21,47%) SP-IDs
DM approach: 10.013 (74,18%) UG-IDs, in which:
– M = 4.723 (34,9%) SP-IDs
– MR = 5.147 (38,1%) SP-IDs
– MRH = 6.641 (49,2%) SP-IDs
Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Total: 9.596 reporters
Old annotation: 9.370 (97,6%) UG-IDs, in which:
– Non-existing UG-IDs = 5.713 (59,5%)
– M = 1.939 (20,2%) SP-IDs
– MR = 2.096 (21,8%) SP-IDs
– MRH = 2.582 (26,9%) SP-IDs
Datamining approach: 8.532 (88,9%) UG-IDs, in which:
– M = 4.145 (43,2%) SP-IDs
– MR = 4.499 (38,1%) SP-IDs
– MRH = 5.576 (60,1%) SP-IDs
Custom EMBL-approach: 2.898 (30,2%) SP-IDs
Annotating Incyte Reporters
Combined methods, “Incyte Mouse UniGene 1” reporters
Total: 9.596 reporters
No annotation: 1.062 (11%) reporters
Annotated with SP-ID: 5.895 (61,3%) reporters, of which:
– 2.184 (22,7%) identical SP-IDs
– 532 (5%) reporters with improved SP-IDs by EMBL-method
– 174 (1,8%) reporters with different mouse SP-IDs
– 5 reporters found only by EMBL-method
Conclusions
• Annotation is much needed:
– Array sequences can point to different genes
• Direct translation into protein is not the best option:
– Sequencing errors
– Addition or deletion of nucleotides
– 6-frame window
• Public nucleotide databases are redundant:
– Sequencing errors
– Differences in sequence length
– Attachment of vector sequence