CDD – a conserved domain database Aron Marchler-Bauer NCBI, National Library of Medicine, NIH


DIMACS Workshop on Protein Domains: Identification, Classification and Evolution

February 27-28, 2003

CDD: a collection of domain multiple alignments linked to protein 3D structure

• imported alignment models, mirrored ‘as-is’ from Pfam, Smart, and COGs (close to 10,000)

• curated alignment models (about 300)

• part of NCBI’s Entrez query/retrieval system

• RPS-Blast to search PSSMs derived from alignment models

Entrez with CDD …

[Figure: Entrez databases (Protein Sequences, MEDLINE Abstracts, 3D Structures, Conserved Domains) and the links between them: BLAST Sequence Similarity, Term Frequency Statistics, VAST Structure Similarity, Related Conserved Domains, Domain Architecture Similarity, CD-protein Links]

Conserved Domains as part of Entrez to …

… annotate three-dimensional structures

… annotate protein sequences

… neighbor proteins by domain architecture

Currently (CDD v1.60): ~5 million protein-CD links

RPS-Blast (Reverse Position-Specific Blast)

(Psi)-Blast:

• Search query (and its PSSM) against a sequence database.

• Lookup table holds possible word matches to the query; database sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.

RPS-Blast:

• Search query protein sequence against a database of PSSMs.

• Lookup table holds possible word matches to the database PSSMs; query sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.
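The lookup-table idea described for RPS-Blast can be sketched in a few lines. This is a toy illustration only, not the RPS-Blast implementation: the four-letter alphabet, the PSSM values, the word length of 3, and the threshold of 11 are all made-up assumptions, and the function names `build_lookup` and `scan_query` are hypothetical.

```python
from itertools import product

ALPHABET = "ACDE"  # toy alphabet (assumption); real protein searches use 20 residues


def build_lookup(pssm, word_len, threshold):
    """Map each word to the PSSM start columns where it scores >= threshold."""
    lookup = {}
    for start in range(len(pssm) - word_len + 1):
        for word in product(ALPHABET, repeat=word_len):
            score = sum(pssm[start + i][res] for i, res in enumerate(word))
            if score >= threshold:
                lookup.setdefault("".join(word), []).append(start)
    return lookup


def scan_query(query, lookup, word_len):
    """Return (query_offset, pssm_offset) seed hits; extension would follow."""
    hits = []
    for q in range(len(query) - word_len + 1):
        for p in lookup.get(query[q:q + word_len], []):
            hits.append((q, p))
    return hits


# Toy PSSM: one dict of residue scores per model column (made-up values).
toy_pssm = [
    {"A": 5, "C": -2, "D": -1, "E": 0},
    {"A": -1, "C": 6, "D": -3, "E": -2},
    {"A": 0, "C": -2, "D": 4, "E": 2},
    {"A": -2, "C": -1, "D": 1, "E": 5},
]
lookup = build_lookup(toy_pssm, word_len=3, threshold=11)
seeds = scan_query("EACDEA", lookup, word_len=3)
print(seeds)
```

Raising the threshold shrinks the lookup table and the number of seeds to extend, which is the speed versus sensitivity trade-off examined in the comparison with IMPALA.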

How does it compare?

Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs

23736 protein sequences used in alignments

14100 protein sequences from the initial Drosophila genome set.

The effect of the search heuristics can be measured directly against IMPALA, a similar program using the rigorous Smith-Waterman algorithm.

The effect of the search heuristics and the differences in alignment model encoding can be measured against HMMer


RPS-Blast vs. IMPALA, Speed vs. Sensitivity

[Figure: sensitivity at 1e-4 (axis 95 to 100) and speed ratio (axis 6 to 36) plotted against word score threshold (9.0 to 11.0); series: sensitivity unscaled, speed ratio unscaled, sensitivity one-hit, speed ratio one-hit]

Self-recognition: Fraction of sequence fragments used to build up the alignment model, which yield significant scores when compared with the search model.

Information content: Σ p · log(p/q), summed across aligned columns
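As a rough illustration of the information content formula, the sketch below sums p · log2(p/q) over the columns of a toy alignment. Several things here are assumptions not stated on the slide: p is taken as the observed residue frequency in a column, q as a background frequency, the logarithm is base 2 (since the results are quoted in bits), gaps are ignored, and no pseudocounts or sequence weights are applied. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def column_information(column, background):
    """Relative entropy of one alignment column in bits: sum of p * log2(p / q)."""
    residues = [r for r in column if r != "-"]  # ignore gap characters
    counts = Counter(residues)
    total = len(residues)
    info = 0.0
    for res, n in counts.items():
        p = n / total  # observed frequency in this column
        info += p * math.log2(p / background[res])
    return info


def alignment_information(rows, background):
    """Sum the per-column information over all aligned columns."""
    return sum(column_information(col, background) for col in zip(*rows))


# Toy data: uniform background over a four-letter alphabet (assumption).
toy_background = {r: 0.25 for r in "ACDE"}
toy_alignment = ["ACDE", "ACDA", "ACEC", "ADEE"]
bits = alignment_information(toy_alignment, toy_background)
print(round(bits, 3))
```

A fully conserved column contributes the most (2 bits with this toy background), while a column that matches the background distribution contributes nothing.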

• The average alignment information content for the 568 models used in the test is 240 bits.

• For 26 families (about 5%), self-recognition works better with IMPALA (detectable heuristics effects).

• The average alignment information content for these 26 models is 100 bits.

• For 542 models (about 95%), we did not detect heuristics effects in a self-recognition test.
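The self-recognition measure used in these counts can be sketched as a simple fraction. This is a minimal illustration under assumptions: significance is taken to mean an E-value at or below 1e-4, the E-values are invented, and the names `self_recognition` and `toy_evalues` are hypothetical.

```python
def self_recognition(evalues, cutoff=1e-4):
    """Fraction of model-building fragments that hit their own model significantly."""
    if not evalues:
        return 0.0
    return sum(1 for e in evalues if e <= cutoff) / len(evalues)


# Made-up E-values for the same four fragments scored by two programs.
toy_evalues = {
    "rps_blast": [1e-20, 3e-7, 2e-5, 0.5],
    "impala": [1e-22, 1e-8, 5e-6, 8e-5],
}
rates = {name: self_recognition(vals) for name, vals in toy_evalues.items()}
difference = rates["impala"] - rates["rps_blast"]
print(rates, difference)
```

In this toy family one fragment is missed by the heuristic search but found by the rigorous one, which is the kind of per-family difference the counts above summarize.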

[Figure: self-recognition plotted against information content (axis 25 to 250), one panel each for RPS-Blast, IMPALA, and HMMer]

• In 65 families (~11%), self-recognition differs by more than 5% between HMMer and RPS-Blast.

• Their mean information content is 65 bits.

• In 503 families (~89%), self-recognition differs by less than 5%.

Conclusions:

• differences, maybe not too surprising

• affecting a fairly small subset of the models at the lower end of the ‘informativeness’ spectrum

• can optimize PSSM calculation, but might see diminishing returns

• it may be more effective to deal with scope of models

Need to do something about the model collection

… curation of alignment models

Recording conserved features in CDD …

[Cn3D demonstration: cd00066]

Conserved Features in CDs:

• catalytic, binding, interaction- and regulatory sites

• explain observed patterns of sequence conservation

• annotate if applicable to all aligned members

• annotate if evidence is available (3D structure, citation)

Collection has become redundant: Search results for 2SRC (Tyrosine Kinase)

Right now: about 9400 CD-CD links in Entrez

Collection has become redundant: Search results for 1G291 (MalK)

Many ATPase domains are sequence-similar to each other, and possibly related by descent from a common ancestor

How to explain this redundancy?

Curation:

• literature check

• examination of the conserved domain extent

• examination of the multiple alignment, identification of a core substructure, establishment of a block-based alignment in agreement with 3D-structure data

• Feature annotation and recording of evidence

• Investigation of ‘related’ domains and their apparent relationships, resolving and recording the family hierarchy

• Update of CD alignment models with new sequences and 3D-structure data

Curation needs to deal with:

• noise from sequence data (gene models, annotation)

• noise from alignments / alignment methods

Block alignment model and family hierarchies:

Parent alignment

Children:

• Membership consistency

• Alignment consistency
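One possible reading of the two consistency requirements can be sketched as follows. The interpretation is an assumption, not something the slide spells out: membership consistency is taken to mean that every sequence in a child also appears in the parent, and alignment consistency that any two residues aligned in the parent are also aligned in the child (the child may align additional columns). The `BlockAlignment` class, the function names, and the residue positions are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class BlockAlignment:
    name: str
    # sequence id -> residue positions, one per aligned column (blocks are ungapped)
    columns: dict


def aligned_pairs(model, a, b):
    """Residue pairs that the model places in the same column for sequences a and b."""
    return set(zip(model.columns[a], model.columns[b]))


def membership_consistent(parent, child):
    """Every child sequence must also be a member of the parent."""
    return set(child.columns) <= set(parent.columns)


def alignment_consistent(parent, child):
    """Pairs aligned in the parent must stay aligned in the child."""
    shared = sorted(set(child.columns) & set(parent.columns))
    return all(
        aligned_pairs(parent, a, b) <= aligned_pairs(child, a, b)
        for a, b in combinations(shared, 2)
    )


parent = BlockAlignment("parent", {"s1": [10, 11, 12], "s2": [20, 21, 22], "s3": [5, 6, 7]})
child = BlockAlignment("child_a", {"s1": [10, 11, 12, 30, 31], "s2": [20, 21, 22, 44, 45]})
shifted = BlockAlignment("child_b", {"s1": [10, 11, 12], "s2": [21, 22, 23]})

print(membership_consistent(parent, child), alignment_consistent(parent, child))
print(alignment_consistent(parent, shifted))
```

Here `child_a` extends the parent with an extra block and stays consistent, while `child_b` shifts one sequence by a residue and is flagged.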

Rizzi and Schindelin, Curr. Opin. Struct. Biol. 2002, 12:709-720

… sequences used in the alignment hit a variety of models in CD-Search:

… examine domain architectures as recorded in CDART:

… validate sequences, validate alignment block structure, and examine sequence tree:

Pfam: PF00994 MoCF_biosynth, MoeA_N, MoeA_C

COGs: MoaB, CinA, MoeA

CDD: cd00758, with children cd00758_a (MoaB), cd00758_b (MoeA), cd00758_c (CinA)

[Cn3D demonstrations: cd00758_b, cd00758_a]

Concept borrowed from COGs – pattern of phylogenetic distribution as evidence for functional divergence after gene duplication events

Principles for establishing CD-Hierarchies:

• Economy – too many families slow down search system

• Search performance – flat alignment models must be split

• Domain age – we’re primarily interested in sets of ancient conserved domains

• Domain architectures

• Subgroup-specific features

[Figure: tree over Plants, Animals, Archaea, Alpha-proteobacteria, and Gram+, with nodes labeled 3.5 bio, 1.7 bio, and 2.6 bio]

Future directions: ability to describe complex hierarchies, which will allow modeling of fusion events

[Figure: families ABC and DEFG with children ABC_1, ABC_2 and DEFG_1, DEFG_2; a fusion combines ABC_2 and DEFG_2]

Credits:

Steve Bryant

Lewis Geer

Siqian He

David Hurwitz

Christopher Lanczycki

Charlie Liu

Tom Madej

Anna Panchenko

Ben Shoemaker

Vahan Simonyan

Paul Thiessen

Yanli Wang

John Anderson

Natalie Fedorova

John Jackson

Aviva Jacobs

Cynthia Liebert

Gabriele Marchler

Raja Mazumder

B. Sridhar Rao

Carol DeWeese-Scott

James Song

Sona Vasudevan

Roxanne Yamashita

Jodie Yin

PFAM

SMART

COGs

BLAST team

Entrez team

Taxonomy team

NCBI Help-Desk
