ProRepeat: a comprehensive directory of exact tandem repeats in proteins (prorepeat.bioinformatics.nl)

Page 1

prorepeat.bioinformatics.nl

ProRepeat: a comprehensive directory of exact tandem repeats in proteins

Page 2

www.bioinformatics.nl

PolyQ and neurodegenerative diseases

9 diseases caused by polyQ repeats:
- HD
- DRPLA
- SCA 1, 2, 3, 6, 7, 17
- Kennedy’s disease (SBMA)

Page 3

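Androgen receptor (AR)

[Figure: domain diagram of the AR transcription factor between the NH3- and -COOH termini. Domain labels: transcriptional regulation, DNA binding, hormone binding. Also marked: T1, T2, T3 and Region 1, Region 2, Region 3.]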

polyQ tract length has important consequences:
■ shorter tracts: prostate cancer susceptibility
■ longer tracts: feminization syndromes
■ over 40 residues: SBMA (spinal and bulbar muscular atrophy), or Kennedy’s disease

9-35 residues, average of 20-25 depending on ethnic origin
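As an illustration of what "tract length" means here, the following minimal sketch measures the longest uninterrupted run of Q in a protein sequence and compares it with the figures quoted on this slide. The function name `longest_polyq` and the toy sequence are hypothetical; the sequence is not a real AR sequence.

```python
import re


def longest_polyq(protein: str) -> int:
    """Return the length of the longest uninterrupted run of Q in a protein sequence."""
    runs = re.findall(r"Q+", protein.upper())
    return max((len(run) for run in runs), default=0)


# Toy sequence (hypothetical): a 22-residue Q tract flanked by other residues.
toy = "MEVQLGLGRV" + "Q" * 22 + "ETSPRQQ"
length = longest_polyq(toy)

print("longest polyQ tract:", length)
print("within the 9-35 range quoted above:", 9 <= length <= 35)
print("over the 40-residue figure quoted above:", length > 40)
```

Only pure Q runs are counted, so a tract interrupted by another residue is reported as two shorter runs.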

Page 4

PolyQ in AR

Collection of polyQ repeats:
- 792 human individuals, available from earlier study (Edwards, 1992)
- 26 armadillo individuals, sequenced by CP
- 77 mammals and marsupials, from protein database

Céline Poux, RU

Page 5

What about repeats in other proteins?

ProRepeat database
- Data sources: UniProt and RefSeq
- Limited to exact tandem repeats
- Standard, linear-time suffix tree algorithm
- Stored in Oracle 10g
- Interface in PHP5

unit length    repetitions
1              ≥ 5
2              ≥ 4
3              ≥ 3
4 .. N         ≥ 2

Maarten van den Bosch, WUR
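The repetition thresholds in the table can be made concrete with a short sketch. Note that this is a brute-force scan, not the linear-time suffix tree algorithm named on the slide; it only illustrates which repeats would pass the thresholds. The function names and the `max_unit` cap are assumptions made for this example.

```python
from typing import List, Tuple


def min_repetitions(unit_length: int) -> int:
    """Minimum number of repetitions for a unit of this length (table above)."""
    return {1: 5, 2: 4, 3: 3}.get(unit_length, 2)


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit ('QQ' is not primitive)."""
    return unit not in (unit + unit)[1:-1]


def find_exact_tandem_repeats(seq: str, max_unit: int = 20) -> List[Tuple[int, str, int]]:
    """Return (start, unit, repetitions) for each maximal exact tandem repeat.

    start is 0-based. Brute force over all unit lengths and start positions.
    """
    hits = []
    n = len(seq)
    for k in range(1, min(max_unit, n // 2) + 1):
        for i in range(n - 2 * k + 1):
            # Skip starts where the same period-k run already began one position earlier.
            if i > 0 and seq[i - 1] == seq[i + k - 1]:
                continue
            unit = seq[i:i + k]
            if not is_primitive(unit):
                continue
            reps = 1
            while seq[i + reps * k:i + (reps + 1) * k] == unit:
                reps += 1
            if reps >= min_repetitions(k):
                hits.append((i, unit, reps))
    return sorted(hits)


# Toy sequence (hypothetical): six Q, four DE, and two KL (too few repetitions to count).
print(find_exact_tandem_repeats("MAQQQQQQDEDEDEDEKLKL"))
```

In the toy sequence, KL occurs only twice in a row, which is below the threshold of 4 for units of length 2, so it is not reported.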

Page 6

Simple query syntax:

e.g. “Q” or “DE”

DE is equivalent to ED; DEF is equivalent to EFD and FDE
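The equivalence stated above is equivalence under rotation of the repeat unit. One way to test it, sketched here, is to reduce each unit to a canonical rotation (the lexicographically smallest one) and compare. How ProRepeat itself implements this is not stated on the slides; `canonical_unit` and `same_repeat_unit` are hypothetical names.

```python
def canonical_unit(unit: str) -> str:
    """Return the lexicographically smallest rotation of a repeat unit."""
    if not unit:
        return unit
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def same_repeat_unit(a: str, b: str) -> bool:
    """True if b is a rotation of a, i.e. both describe the same tandem repeat."""
    return len(a) == len(b) and canonical_unit(a) == canonical_unit(b)


print(same_repeat_unit("DE", "ED"))    # rotations of each other
print(same_repeat_unit("DEF", "FDE"))  # rotations of each other
print(same_repeat_unit("DEF", "DFE"))  # same residues, but not a rotation
```

Storing the canonical form alongside each repeat would let a query for any rotation be answered with a single equality lookup.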

Page 7

Or use ProSite syntax:

e.g. “[DE]-{P}-X(0,1).”
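To show how such a pattern reads, here is a minimal sketch that translates a subset of ProSite syntax into a Python regular expression: `[..]` for allowed residues, `{..}` for excluded residues, `x` for any residue, `(n)` or `(n,m)` for repetition counts, `<` and `>` for terminal anchors, and `-` as separator. It is an illustration only; how the ProRepeat interface evaluates these patterns is not described on the slide, and `prosite_to_regex` is a hypothetical helper.

```python
import re

# One pattern element: optional '<', a residue spec, optional (n) or (n,m), optional '>'.
_ELEMENT = re.compile(
    r"(<?)([A-Za-z]|\[[A-Za-z]+\]|\{[A-Za-z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a ProSite-style pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")  # ProSite patterns end with a period
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, residue, low, high, end = match.groups()
        if residue in ("x", "X"):
            core = "."                                   # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1].upper() + "]"    # excluded residues
        else:
            core = residue.upper()                       # single residue or [..] set
        if low is not None:
            core += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[DE]-{P}-X(0,1).")
print(regex)
print(re.search(regex, "AEGK"))
```

For the example pattern, the translation reads: D or E, then any residue except P, then at most one further residue.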

Page 8

Taxonomic distributions of hits

Page 9

Page 10

Sorting/grouping options

Identifier, Repeat unit, Repetitions, Unit length, Length, Start location, End location, Protein, Taxonomy, Ontology

Page 11

Link to DNA data

- DNA coding sequences of available repeats also stored in the database
- Extracted from EMBL and/or RefSeq

Hong Luo, WUR
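Linking a protein repeat to its coding DNA comes down to coordinate arithmetic: residue number r is encoded by nucleotides 3(r-1)+1 to 3r of the coding sequence. The sketch below assumes a spliced, in-frame CDS whose first codon encodes residue 1, and 1-based inclusive start and end locations; `repeat_cds_slice` and the toy CDS are hypothetical.

```python
def repeat_cds_slice(cds: str, start: int, end: int) -> str:
    """Return the nucleotides encoding protein residues start..end (1-based, inclusive)."""
    if start < 1 or end < start or 3 * end > len(cds):
        raise ValueError("repeat location lies outside the coding sequence")
    return cds[3 * (start - 1):3 * end]


# Toy CDS (hypothetical): ATG, four glutamine codons (CAG CAG CAA CAG), then a stop codon.
toy_cds = "ATG" + "CAGCAGCAACAG" + "TAA"

# Residues 2..5 of the encoded protein (MQQQQ) form the Q repeat.
print(repeat_cds_slice(toy_cds, 2, 5))
```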

Page 12

Link to DNA data / errors

Approximately 3% of corresponding nucleotide sequences cannot be retrieved

Errors caused by:
- No links to nucleotide database (35%)
  • NO_ANNOTATED_CDS
  • No EMBL links
- Annotation errors in the nucleotide database (65%)

Page 13

Number of different units per unit size per proteome

[Figure: number of different units (y-axis, 0 to 900) against unit length (x-axis), one series per proteome: Hsapiens, Athaliana, Celegans, Scerevisiae, Ptroglodytes, Ggallus, Rnorvegicus, Mmusculus, Ecoli.]

Guido Kappé, RU

Page 14

Single amino acid (SAA) repeat length distribution in Homo sapiens

[Figure: percentage (y-axis, 0% to 100%) against total SAA repeat length in aa (x-axis, 5 to 20 and >20), one series per amino acid (legend A to Z). Labelled in the plot: S, Q, P, G, E, A, T.]

Page 15

Amino acid distribution Homo sapiens

[Figure: percentage (y-axis, 0 to 30%) per amino acid (x-axis, A to Z). Series: "All prot. - Rep.", "Rep. - SAA", "SAA".]

Page 16

Amino acid distribution Arabidopsis thaliana

[Figure: percentage (y-axis, 0 to 30%) per amino acid (x-axis, A to Z). Series: "All prot. - Rep.", "Rep. - SAA", "SAA".]

Page 17

Current work

- Annotation of repeats versus function
- Adding imperfect tandem repeats, a.k.a. approximate tandem repeats (ATR), to the database
- Offering remote access via web services (WSDL and BioMoby)
- Expansion of the analysis capabilities of the interface

Page 18

PolyQ in AR (reprise)

- Impure tracts (interrupted mainly by CAA, CCG, and CGG) are longer and more variable than pure CAG tracts
- Presence of other codons is better explained by codon duplication than by multiple point mutations
  • interrupting codons are part of the elongation process, rather than hampering its dynamics as proposed previously
- Negative correlation between the lengths of the different CAG tracts
  • suggests a maximal expansion length that the protein can handle without being deleterious

Céline Poux, RU
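The distinction between pure and impure tracts can be illustrated by splitting a tract into codons and counting everything that is not CAG. This is a minimal sketch under the assumption that the tract is given in frame; the function names and the example tract are hypothetical and do not reproduce the analysis behind this slide.

```python
from collections import Counter
from typing import List


def tract_codons(tract_dna: str) -> List[str]:
    """Split an in-frame nucleotide tract into codons."""
    tract_dna = tract_dna.upper()
    if len(tract_dna) % 3 != 0:
        raise ValueError("tract length must be a multiple of 3")
    return [tract_dna[i:i + 3] for i in range(0, len(tract_dna), 3)]


def interrupting_codons(tract_dna: str) -> Counter:
    """Count the codons in the tract that are not CAG."""
    return Counter(codon for codon in tract_codons(tract_dna) if codon != "CAG")


def is_pure_cag(tract_dna: str) -> bool:
    """True if every codon in the tract is CAG."""
    return not interrupting_codons(tract_dna)


# Toy tract (hypothetical): CAG CAG CAA CAG CCG
print(is_pure_cag("CAGCAGCAACAGCCG"))
print(interrupting_codons("CAGCAGCAACAGCCG"))
```

Note that CAA also encodes glutamine, so a tract interrupted only by CAA is impure at the DNA level while remaining a pure polyQ run in the protein.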

Page 19

Acknowledgements

Wageningen University and Research Centre
- Maarten van den Bosch
- Hong Luo
- Mark Kramer
- Harm Nijveen

Radboud University, Nijmegen
- Guido Kappé
- Céline Poux
- Wilfried W. de Jong

This work was supported in part by project grants from NWO/BMI (GK, CP) and the NBIC/BioAssist program (HN).

Page 20

prorepeat.bioinformatics.nl

Thank you for your attention!

See also our posters on phylogenetic domain visualisation (TreeDomViewer) and microarray (re)annotation at the ISMB.

Post-doc positions available: contact [email protected] or [email protected]