CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH
CDD – a conserved domain database
Aron Marchler-Bauer
NCBI, National Library of Medicine, NIH
DIMACS Workshop on Protein Domains: Identification, Classification and Evolution
February 27-28, 2003
CDD: a collection of domain multiple alignments linked to protein 3D structure
• imported alignment models mirrored ‘as-is’, sources are Pfam, Smart, COGs (close to 10,000)
• curated alignment models (about 300)
• part of NCBI’s Entrez query/retrieval system
• RPS-Blast to search PSSMs derived from alignment models
Entrez with CDD …
[Diagram of linked Entrez resources; labels: MEDLINE Abstracts, Protein Sequences, 3D Structures, Conserved Domains, BLAST Sequence Similarity, Term Frequency Statistics, VAST Structure Similarity, Related Conserved Domains, Domain Architecture Similarity, CD-protein Links]
Conserved Domains as part of Entrez to …
.. annotate three-dimensional structures
.. annotate protein sequences
.. neighbor proteins by domain architecture
Currently (CDD v1.60): ~5 million protein-CD links
RPS-Blast (Reverse Position-Specific Blast)
(Psi)-Blast:
Search query (and its PSSM) against sequence database
Lookup table holds possible word matches to query, database sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.
RPS-Blast:
Search query protein sequence against a database of PSSMs
Lookup table holds possible word matches to database PSSMs, query sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.
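The RPS-Blast scheme described above can be illustrated with a minimal Python sketch. It is not NCBI's implementation: the four-letter alphabet, the toy PSSM, the word size of 3, the threshold of 5 and the names `build_lookup` and `scan_query` are all assumptions made for illustration. It only shows the idea of precomputing high-scoring words from the PSSMs and then scanning a query for them; extension of the seeds and the statistics are left out.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACDE"  # toy alphabet (assumption); real PSSMs cover 20 amino acids


def build_lookup(pssms, word_size=3, threshold=5):
    """Map each word scoring >= threshold to the (pssm_id, position) pairs it matches."""
    lookup = defaultdict(list)
    for pssm_id, pssm in pssms.items():
        for pos in range(len(pssm) - word_size + 1):
            for letters in product(ALPHABET, repeat=word_size):
                # Word score = sum of the PSSM column scores for each letter.
                score = sum(pssm[pos + i][aa] for i, aa in enumerate(letters))
                if score >= threshold:
                    lookup["".join(letters)].append((pssm_id, pos))
    return lookup


def scan_query(query, lookup, word_size=3):
    """Return (query_pos, pssm_id, pssm_pos) seeds for every word hit in the query."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for pssm_id, ppos in lookup.get(word, []):
            hits.append((qpos, pssm_id, ppos))
    return hits


def column(favoured):
    """Hypothetical PSSM column: +3 for the favoured residue, -1 otherwise."""
    return {aa: (3 if aa == favoured else -1) for aa in ALPHABET}


# One toy model whose four columns favour A, C, D and E in turn.
toy_pssms = {"toy_model": [column(aa) for aa in "ACDE"]}
lookup = build_lookup(toy_pssms, word_size=3, threshold=5)
seeds = scan_query("EACDA", lookup)
```

In this toy case the two seeds fall on the same diagonal (query position minus PSSM position), which is the kind of multiple word match that would then be extended. Raising the threshold shrinks the lookup table and speeds up scanning at the cost of missed seeds.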
How does it compare?
Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs
23736 protein sequences used in alignments
14100 protein sequences from the initial Drosophila genome set.
The effect of the search heuristics can be measured directly against IMPALA, a similar program using the rigorous Smith-Waterman algorithm.
The effect of the search heuristics and the differences in alignment model encoding can be measured against HMMer
RPS-Blast vs. IMPALA, Speed vs. Sensitivity
[Plot: x-axis Word Score Threshold, 9.0 to 11.0; y-axes Sensitivity at 1e-4 (ticks 95 to 100) and Speed ratio (ticks 6 to 36); series: Sensitivity, Speed ratio, Sensitivity unscaled, Speed ratio unscaled, Sensitivity one-hit, Speed ratio one-hit]
Self-recognition: Fraction of sequence fragments used to build up the alignment model, which yield significant scores when compared with the search model.
Information content: Σ p•log(p/q), summed across aligned columns (p: residue frequency in the column, q: background frequency)
• the average alignment information content for the 568 models used in the test is 240 bits.
• for 26 families (about 5%), self-recognition works better with IMPALA (detectable heuristics effects).
• the average alignment information content for these 26 models is 100 bits.
• for 542 models (about 95%), we did not detect heuristics effects in a self-recognition test.
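The information content measure above can be sketched in a few lines of Python. This is a simplified illustration, not the calculation used for the CDD models: it assumes log base 2 (since the slides report bits), ignores gap characters, and applies no pseudocounts or sequence weighting. The toy alphabet, the uniform background and the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, background):
    """Relative entropy of one alignment column in bits: sum of p * log2(p / q)."""
    residues = [aa for aa in column if aa != "-"]  # skip gaps (assumption)
    counts = Counter(residues)
    total = len(residues)
    info = 0.0
    for aa, n in counts.items():
        p = n / total  # observed residue frequency in this column
        info += p * math.log2(p / background[aa])
    return info


def alignment_information(rows, background):
    """Sum the per-column information over all aligned columns."""
    return sum(column_information(col, background) for col in zip(*rows))


# Toy example: four aligned sequences of length 2, uniform background over 4 letters.
background = {aa: 0.25 for aa in "ACDE"}
rows = ["AC", "AD", "AC", "AE"]
total_bits = alignment_information(rows, background)
```

The fully conserved first column contributes 2 bits and the mixed second column 0.5 bits. Models at the low end of this scale are the ones where the slides report detectable differences between search methods.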
[Plots: self-recognition versus information (x-axis 25 to 250), one panel each for RPS-Blast, IMPALA and HMMer]
• In 65 families (~11%), self-recognition differs by more than 5% between HMMer and RPS-Blast
• Their mean information content is 65 bits
• In 503 families (~89%), self-recognition differs by less than 5%
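A comparison like the one summarized above could be tabulated roughly as follows. This is only a sketch under stated assumptions: the E-value cutoff of 1e-4 is taken from the plot label "Sensitivity at 1e-4", the 5% gap is read as an absolute difference in the recognized fraction, and the family names, E-values and function names are made up for illustration.

```python
def self_recognition(evalues, cutoff=1e-4):
    """Fraction of a model's own sequences that score significantly against it."""
    if not evalues:
        return 0.0
    return sum(1 for e in evalues if e <= cutoff) / len(evalues)


def families_that_differ(results_a, results_b, min_gap=0.05):
    """Families whose self-recognition differs by more than min_gap between two methods."""
    return sorted(
        fam for fam in results_a
        if abs(results_a[fam] - results_b[fam]) > min_gap
    )


# Hypothetical per-sequence E-values for two families and two search methods.
hmmer_evalues = {"famA": [1e-20, 1e-8, 1e-5, 1e-6], "famB": [1e-9, 1e-7, 1e-5, 1e-6]}
rps_evalues = {"famA": [1e-18, 1e-7, 1e-5, 1e-6], "famB": [1e-8, 0.5, 1e-5, 3.0]}

hmmer = {fam: self_recognition(ev) for fam, ev in hmmer_evalues.items()}
rps = {fam: self_recognition(ev) for fam, ev in rps_evalues.items()}
differing = families_that_differ(hmmer, rps)
```

Here only `famB` is flagged, because half of its sequences fall above the cutoff with the second method.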
Conclusions:
• differences, maybe not too surprising
• affecting a fairly small subset of the models at the lower end of the ‘informativeness’ spectrum
• can optimize PSSM calculation, but might see diminishing returns
• it may be more effective to deal with scope of models
Need to do something about the model collection
… curation of alignment models
Recording conserved features in CDD …
Conserved Features in CDs:
• catalytic, binding, interaction- and regulatory sites
• explain observed patterns of sequence conservation
• annotate if applicable to all aligned members
• annotate if evidence is available (3D structure, citation)
Collection has become redundant: Search results for 2SRC (Tyrosine Kinase)
Right now: about 9400 CD-CD links in Entrez
Collection has become redundant: Search results for 1G291 (MalK)
Many ATPase domains are sequence-similar to each other, and possibly related by descent from a common ancestor
How to explain this redundancy?
Curation:
• literature check
• examination of the conserved domain extent
• examination of the multiple alignment, identification of a core substructure, establishment of a block-based alignment in agreement with 3D-structure data
• Feature annotation and recording of evidence
• Investigation of ‘related’ domains and their apparent relationships, resolving and recording the family hierarchy
• Update of CD alignment models with new sequences and 3D-structure data
Curation needs to deal with:
• noise from sequence data (gene models, annotation)
• noise from alignments / alignment methods
Block alignment model and family hierarchies:
Parent alignment
Children:
• Membership consistency
• Alignment consistency
Rizzi and Schindelin,
Curr. Opin. Struct. Biol. 2002,
12:709-720
.. sequences used in the alignment hit a variety of models in CD-Search:
… examine domain architectures as recorded in CDART:
… validate sequences, validate alignment block structure, and examine sequence tree:
[Diagram: the same domain as modeled in three collections]
Pfam: PF0994 MoCF_biosynth, MoeA_N, MoeA_C
COGs: MoaB, CinA, MoeA
CDD: cd00758; cd00758_a (MoaB), cd00758_b (MoeA), cd00758_c (CinA)
Concept borrowed from COGs – pattern of phylogenetic distribution as evidence for functional divergence after gene duplication events
Principles for establishing CD-Hierarchies:
• Economy – too many families slow down search system
• Search performance – flat alignment models must be split
• Domain age – we’re primarily interested in sets of ancient conserved domains
• Domain architectures
• Subgroup-specific features
[Figure: tree of Plants, Animals, Archaea, Alpha-proteobacteria and Gram+ bacteria, with nodes labeled 1.7, 2.6 and 3.5 billion years]
Future directions: ability to describe complex hierarchies, which will allow modeling of fusion events
[Diagram: families ABC and DEFG; subfamilies ABC_1, ABC_2 and DEFG_1, DEFG_2; ABC_2 shown together with DEFG_2]
Credits:
Steve Bryant
Lewis Geer
Siqian He
David Hurwitz
Christopher Lanczycki
Charlie Liu
Tom Madej
Anna Panchenko
Ben Shoemaker
Vahan Simonyan
Paul Thiessen
Yanli Wang
John Anderson
Natalie Fedorova
John Jackson
Aviva Jacobs
Cynthia Liebert
Gabriele Marchler
Raja Mazumder
B. Sridhar Rao
Carol DeWeese-Scott
James Song
Sona Vasudevan
Roxanne Yamashita
Jodie Yin
PFAM
SMART
COGs
BLAST team
Entrez team
Taxonomy team
NCBI Help-Desk