Pattern databases

Page 1: Pattern databases

Pattern databases

Gopalan Vivek

Page 2: Pattern databases

Pattern databases - topics

- Definition
- Applications
- Classifications
- Common Databases
- Conclusions

Page 3: Pattern databases

Pattern databases

- Definition
- Applications
- Classifications
- Common Databases
- Conclusions

Page 4: Pattern databases

Pattern databases – definition

Secondary databases derived from conserved regions (patterns) obtained by multiple sequence alignment of sequences from primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.

Page 5: Pattern databases

Primary databases (SWISS-PROT - protein; GenBank - DNA): millions of sequences

-> Pattern extraction by multiple sequence alignment

-> Pattern databases: thousands of patterns

Page 6: Pattern databases

Pattern databases

- Definition
- Applications
- Classifications
- Common Databases
- Conclusions

Page 7: Pattern databases

Pattern Databases - Applications

- Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%).
- Classification of protein sequences into families.
- Searching a pattern database takes less time than searching a primary database.
  – A pattern is a compact representation of features shared by many sequences.

Page 8: Pattern databases

Pattern databases

- Definition
- Applications
- Classifications
- Common Databases
- Conclusions

Page 9: Pattern databases

Multiple Sequence Alignment (MSA)

- Family based databases: consider the full MSA
- Motif based databases: consider local regions in the MSA (figure labels: Motif-1, Motif-3)

Page 10: Pattern databases

Pattern Databases – Protein

- Motif based: PROSITE, PRINTS, BLOCKS
- Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS

Page 11: Pattern databases

InterPro - Integrated resource of protein families and sites

Databases shown: PROSITE, PRINTS, BLOCKS, Pfam, ProDom

Page 12: Pattern databases

Pattern databases

- Definition
- Applications
- Classifications
- Common Databases
  – PROSITE, PRINTS, BLOCKS & SMART (motif based)
  – MetaFam, InterPro (integrated databases)
- Conclusions

Page 13: Pattern databases

Databases – General Tips

1. Source

2. Input formats & parameters

3. Output formats

4. Quality of the data

5. Other details – updates, coverage, speed, download, reference, methods etc.

Page 14: Pattern databases

Focus

- To search pattern databases for the enzyme “Alkaline phosphatase” using their text or keyword search options.
- To analyze the quality of the results from each of these databases.
  – Sensitivity, specificity.
- Sequence and pattern searches: in the afternoon’s practical.
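As a rough illustration of the two quality measures named on this slide, the sketch below computes sensitivity and specificity from hit counts. The function names and the counts are made up for the example; they are not results from any of the databases discussed here.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real family members that the pattern finds."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-members that the pattern correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for one pattern scanned against a sequence database.
tp, fn, tn, fp = 45, 5, 900, 100
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
```

A pattern that misses few true members has high sensitivity; one that rarely matches unrelated sequences has high specificity. Comparing databases on both numbers is one way to judge the result quality mentioned above.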

Page 15: Pattern databases

PROSITE http://www.expasy.org/prosite/

- Consists of biologically significant protein sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.
- Based on SWISS-PROT/TrEMBL

Page 16: Pattern databases

Text Search

Sequence Scanner

ID and text Search

http://www.expasy.org/prosite/

Page 17: Pattern databases
Page 18: Pattern databases

Result: PROSITE Documentation page

Details about the pattern/profile

PROSITE ID

PROSITE Pattern

[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
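A pattern written this way can be read as a regular expression. The sketch below assumes the usual PROSITE conventions (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats) and ignores the < and > anchors. The function name `prosite_to_regex` and the test sequence are invented for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    tokens = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = m.groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            token = core                       # literal residue or [..] class
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        tokens.append(token)
    return "".join(tokens)


regex = prosite_to_regex("[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T")
sequence = "MKVADSGAGATLL"  # made-up sequence, not a real protein
match = re.search(regex, sequence)
if match:
    print(regex, match.group(), match.start())
```

This only covers exact pattern matching; PROSITE profiles use weighted matrices and need a different kind of scoring.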

Page 19: Pattern databases

Detailed View - page 1

Numerical Results

PROSITE Pattern

Page 20: Pattern databases

Detailed View - page 2

True Positives

False Positives

View entry in raw text format (no links)

Page 21: Pattern databases

Raw Text Format – PROSITE Format

Page 22: Pattern databases

ID  Identification
AC  Accession number
DT  Date
DE  Short description
PA  Pattern
MA  Matrix/profile
RU  Rule
NR  Numerical results
CC  Comments
DR  Cross-references to SWISS-PROT
3D  Cross-references to PDB
DO  Pointer to the documentation file
//  Termination line
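The line codes listed on this slide make the raw text format easy to read programmatically. The following is a minimal sketch, assuming each line starts with its two-character code followed by whitespace and that `//` closes an entry. The sample record is invented (only its PA line reuses the alkaline phosphatase pattern shown earlier), and `parse_entries` is a hypothetical helper, not part of any PROSITE tool.

```python
from collections import defaultdict

# Line codes as listed on the slide.
LINE_CODES = {"ID", "AC", "DT", "DE", "PA", "MA", "RU", "NR", "CC", "DR", "3D", "DO"}


def parse_entries(text: str) -> list:
    """Group the lines of each entry by their two-character line code."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # termination line closes an entry
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        if code in LINE_CODES:
            current[code].append(value.strip())
    if current:                            # tolerate a missing final "//"
        entries.append(dict(current))
    return entries


# Made-up record for illustration only; the ID, AC and DE values are not real.
SAMPLE = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Made-up entry for illustration only.
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
//
"""

entries = parse_entries(SAMPLE)
print(entries[0]["ID"], entries[0]["PA"])
```

Values are kept as lists because codes such as DR or CC can repeat within one entry.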

Page 23: Pattern databases

PROSITE Profiles

Page 24: Pattern databases

- Highly degenerate protein structural and functional domains
  – immunoglobulin domains, SH2 and SH3 domains.
- Consensus sequences of repetitive DNA elements
  – SINEs, LINEs.
- Basic gene expression signals
  – promoter elements, RNA processing signals, translational initiation sites.
- DNA-binding protein motifs.
- Protein and nucleic acid compositional domains
  – glutamine-rich activation domains, CpG islands.

Page 25: Pattern databases

PROSITE - features

- Completeness
- High specificity
- Documentation
- Periodic reviewing
- Parallel update with SWISS-PROT (primary database)

Page 26: Pattern databases

1. Multiple Sequence Alignment

2. Find 4-5 functionally conserved residues (the motif), e.g.
   cydeggis
   cyedggis
   cyeeggit
   cyhgdggs
   cyrgdgnt

3. Core pattern: C-Y-x(2)-[DG]-G-x-[ST]

4. Scan SWISS-PROT with the core pattern

5. More false positives?
   YES: increase the sequence length of the pattern and scan again
   NO: add the pattern to the PROSITE DB
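The step from aligned segments to a core pattern on this slide can be sketched column by column. The version below assumes a simple rule that is not stated on the slide: a column with one residue becomes a literal, a column with up to `max_alternatives` residues becomes a bracketed set, and anything more variable becomes x. The five segments are the slide's concatenated string split into 8-residue pieces; `core_pattern` is a hypothetical name.

```python
ALIGNED = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]


def core_pattern(segments, max_alternatives=2):
    """Derive a PROSITE-style pattern from equal-length aligned segments."""
    elements = []
    for column in zip(*segments):
        residues = sorted({r.upper() for r in column})
        if len(residues) == 1:
            elements.append(residues[0])                  # fully conserved
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")                          # too variable
    # Collapse runs of x into x(n).
    collapsed, run = [], 0
    for element in elements + [None]:                     # None flushes the last run
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


print(core_pattern(ALIGNED))
```

With the default threshold this reproduces the core pattern given on the slide, written with the PROSITE repeat notation x(2). Raising `max_alternatives` makes the pattern more specific at the cost of missing family members that vary at those positions.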

Page 27: Pattern databases

PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/

- Protein fingerprint database
- Fingerprint: a set of motifs that represent the most conserved regions of a multiple sequence alignment.
- Better diagnostic reliability than single-motif methods
- Source: SWISS-PROT/TrEMBL

Page 28: Pattern databases

1. Multiple Sequence Alignment

2. Identification of ALL the conserved regions (motifs), e.g.
   cydeggis
   cyedggis
   cyeeggit
   cyhgdggs

3. Creation of frequency matrices, one per motif; the set of motifs forms the fingerprint

4. Iterative scanning of SWISS-PROT/TrEMBL with the frequency matrices until convergence

5. PRINTS DB
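The frequency-matrix idea and the scan-until-convergence loop can be illustrated with a toy version. This is only a sketch under loud assumptions: the score is a plain sum of relative residue frequencies, the "database" is a list of pre-cut 8-residue segments, and the threshold is arbitrary. The real PRINTS procedure scans full-length sequences with several motifs at once, and none of the names below come from PRINTS itself.

```python
from collections import Counter


def frequency_matrix(segments):
    """One Counter of residue counts per alignment column."""
    return [Counter(column) for column in zip(*segments)]


def score(matrix, candidate):
    """Sum of the relative frequencies of the candidate's residues."""
    depth = sum(matrix[0].values())  # number of aligned segments
    return sum(col[res] / depth for col, res in zip(matrix, candidate))


def scan_until_convergence(seed, database, threshold):
    """Add database segments scoring >= threshold, rebuild, and repeat."""
    members = list(seed)
    while True:
        matrix = frequency_matrix(members)
        new_hits = [s for s in database
                    if s not in members and score(matrix, s) >= threshold]
        if not new_hits:
            return members  # converged: a full scan found nothing new
        members.extend(new_hits)


seed = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]  # motif from the slide
toy_database = ["cyrgdgnt", "aaaaaaaa"]                  # made-up "database"
print(scan_until_convergence(seed, toy_database, threshold=3.5))
```

Each pass rebuilds the matrix from the enlarged member set, so later passes can pick up more distant relatives; the loop stops when a pass adds nothing.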

Page 29: Pattern databases

http://bioinf.man.ac.uk/dbbrowser/PRINTS/

Database ID, no. of motifs and text search

Motif scanner (for searching a sequence or pattern against the PRINTS database)

Page 30: Pattern databases
Page 31: Pattern databases

Page 1 for ‘alkaline phosphatase’ entry in PRINTS

Documentation, links & references

Page 32: Pattern databases

Page 2

Fingerprint details

Sequence Summary

Page 33: Pattern databases

Page 3

Motif no. 1

Motif no. 2

“Raw” motif

SWISS-PROT IDs

Start and interval between motifs in the fingerprint

Page 34: Pattern databases

BLOCKS http://blocks.fhcrc.org/blocks/

- Blocks are multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins.
- The BLOCKS database is a collection of blocks representing known protein families; a protein or DNA sequence can be compared against it to find matches to documented protein families.

Page 35: Pattern databases

Blocks Making

Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.

Page 36: Pattern databases

http://blocks.fhcrc.org/blocks/blocksdiag.jpg

Page 37: Pattern databases

http://blocks.fhcrc.org/blocks/

Sequence, no. of blocks and text searches

Blocks Maker

Page 38: Pattern databases

Page 1

Summary

Search methods using blocks

Page 39: Pattern databases

Page 2

BLOCK - 1

Represents the start position of the block

SWISS-PROT ID

Weak Blocks - Strength < 1100
Strong Blocks - Strength >= 1100

Page 40: Pattern databases

SMART http://smart.embl-heidelberg.de/

- Contains >500 domain families found in signaling, extracellular and chromatin-associated proteins.
- Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.

Page 41: Pattern databases

ID and text Search

ID & sequence Search

Domain & GO search

Alkaline Phosphatase

Page 42: Pattern databases
Page 43: Pattern databases
Page 44: Pattern databases

Results – Alkaline phosphatase “Signatures”

- PROSITE
  – Represented as a single motif.
- PRINTS
  – Represented as 5 motif regions.
- BLOCKS
  – Represented as 6 block regions.
- SMART
  – Represented as a single profile.

Page 45: Pattern databases

Composite Pattern Databases

- MetaFam
- InterPro
- CDD (Conserved Domain Database)
- iProClass

Page 46: Pattern databases

Metafam & PANAL

Metafam - http://metafam.ahc.umn.edu/

PANAL – Protein ANALysis tool page of Metafam http://mgd.ahc.umn.edu/panal/

Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, Prosite, ProDom, SBASE, SYSTERS.

Page 47: Pattern databases

PANAL

Page 48: Pattern databases

InterPro http://www.ebi.ac.uk/interpro

- Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, SWISS-PROT and TrEMBL
- Text- and sequence-based searches.

Page 49: Pattern databases
Page 50: Pattern databases

http://www.ebi.ac.uk/interpro/

Page 51: Pattern databases

Detailed View - page 1

PRINTS, PROSITE, Pfam, ProDom, SMART

Page 52: Pattern databases

Detailed View - page 2

BLOCKS database link

Page 53: Pattern databases

PR – PRINTS
PS – PROSITE
PF – Pfam
PD – ProDom
SM – SMART

Page 54: Pattern databases

Detailed View - page 2

Page 55: Pattern databases

T – True Positive
F – False Positive

Range of the motif

Page 56: Pattern databases

Pattern databases

- Definition
- Applications
- Classifications
- Common Databases
  – PROSITE, PRINTS & BLOCKS (motif based)
  – MetaFam, InterPro (integrated databases)
- Conclusions

Page 57: Pattern databases

CONCLUSION

- Pattern databases are diverse, ranging from small patterns to profiles to complex HMM models.
- Each has different strengths and weaknesses.
- Each has a different database format.
- It is best to combine and analyze results from several pattern databases.