Thesis def

Data Marts Integrate the Proteome Jay Vyas


Jay's thesis defense on database federation, data marts, and the power of integrated protein bioinformatics.

Transcript of Thesis def

Page 1: Thesis def

Data Marts Integrate the Proteome

Jay Vyas

Page 2: Thesis def

The Information Content of the Proteome

1) cdc2+, cyclinB+, Mitosis
2) cdc2-, Arrest
3) cdc2 Binds Importin alpha/beta …

Knowledge

Information

Data

Page 3: Thesis def

Evolution of a Relational Proteome

Figure: timeline (1965 to 2005) marking Needleman-Wunsch, PDB, Smith-Waterman, NEWAT, PDGF-VSIS…, Atlas, SWISSPROT, NCBI, SCOP, HGP, REFSEQ, Protein Domains, and Insulin.

Page 4: Thesis def

http://bytesizebio.net/
http://www.dna.affrc.go.jp/growth/images/P-grwth-entrs.gif
PLoS Comput Biol. 2006 Aug 25;2(8):e114. Epub 2006 Jul 14.
Genome Res. 2008 March; 18(3): 449–461. doi: 10.1101/gr.6943508.
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-scop

Data vs. Knowledge

Structures/Functions

Sequences

Data > Information

Page 5: Thesis def

An Integrated Framework for building Molecular Biological Data Marts

Putting the model to use …

Page 6: Thesis def

Data Marts: Targeted Integration

Flat Data Repositories

structure

function

sequence

taxonomy
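A minimal sketch of the targeted-integration idea on this slide, assuming a data mart can be pictured as a small, purpose-built view joined from flat repositories. The table names, column names, and the `sh3_mart` view are hypothetical and are not the schema used in the thesis; the rows reuse the SH3 examples shown later in the slides.

```python
import sqlite3

# Toy rows from the SH3 examples on the slides. Table and column names are
# hypothetical illustrations, not the thesis schema.
STRUCTURES = [
    ("1AZG", "FYN", "PRPLPVAP"),
    ("1PRL", "FYN", "APPLPR"),
    ("1H3H", "GRB2", "SRSTK"),
]
TAXONOMY = [
    ("1AZG", "Human"),
    ("1PRL", "Chicken"),
    ("1H3H", "Mouse"),
]


def build_mart():
    """Load two flat 'repositories' and expose one integrated view."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, protein TEXT, peptide TEXT)"
    )
    con.execute("CREATE TABLE taxonomy (pdb_id TEXT PRIMARY KEY, species TEXT)")
    con.executemany("INSERT INTO structure VALUES (?, ?, ?)", STRUCTURES)
    con.executemany("INSERT INTO taxonomy VALUES (?, ?)", TAXONOMY)
    # The "mart": only the columns needed for one question, pre-joined.
    con.execute(
        """CREATE VIEW sh3_mart AS
           SELECT s.pdb_id AS pdb_id, t.species AS species,
                  s.protein AS protein, s.peptide AS peptide
           FROM structure AS s JOIN taxonomy AS t ON s.pdb_id = t.pdb_id"""
    )
    return con


def peptides_for(con, protein):
    """Return (species, peptide) pairs for one protein, ordered by PDB id."""
    rows = con.execute(
        "SELECT species, peptide FROM sh3_mart WHERE protein = ? ORDER BY pdb_id",
        (protein,),
    )
    return rows.fetchall()
```

Calling `peptides_for(build_mart(), "FYN")` answers a cross-repository question (which peptides, in which species) with a single query against the view.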

Page 7: Thesis def

A Family of Data Driven Molecular Biology Tools

Integrated structure calculation via NMR: hybrid methods, iterative processing, reproducibility.
spectra, sequence, chemical shifts -> structure

Automated detection of signaling/binding motifs in a candidate protein.
protein sequence -> biological activity

Filtration of “passenger” residues from specificity/functional residues on surfaces of protein structures.
sequence + structure -> function

“Multidimensional” sequence comparison.
sequence + taxonomy -> evolution

Page 8: Thesis def

Sequence + Spectrum -> Structure

Page 9: Thesis def

CONNJUR WB integrates format conversion, data inspection, and integrative processing ...

RNMRTK

NMRPipe

CONNJUR WB

J. Biomol. NMR, 2011

Page 10: Thesis def

Detection of functional subunits in proteins

TREMBL-SwissProt
SwissProt vs Uniprot vs TREMBL
Machine Learning

Spearmint (+)
(Nuclear) Bacterial Proteins

Xanthippe (-)
Snake proteins (can’t bind ATP)

Domain databases

Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7.
http://pir.georgetown.edu/pirwww/about/doc/tutorials/uniprot_struc.gif
Bioinformatics (2001) 17 (10): 920-926

?

Page 11: Thesis def

• How can we increase the size of our functional motif database without increasing the number of false positives predicted?

+

_

MinimotifMiner – a tool for predicting protein function via Short Sequence Peptide Motifs

Page 12: Thesis def

3000+ estimated motif publications / yr…

Variant Pubmed searches ‘x’

(("amino acid motif"[TIAB]) OR ("protein motif"[TIAB])) AND ("<x>"[PDAT] : "<x>"[PDAT])

Figure: estimated motif publications per year (vertical axis, 0 to 4500) by publication year (horizontal axis, 1975 to 2015).
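A small sketch of how the per-year PubMed search on this slide could be generated, assuming the placeholder ‘x’ stands for a publication year. The function names are hypothetical. The URL targets the NCBI E-utilities ESearch endpoint with `rettype=count`; the sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def motif_query(year):
    """Build the slide's title/abstract query restricted to one year.

    Assumption: the slide's <x> placeholder is a publication year.
    """
    return (
        '(("amino acid motif"[TIAB]) OR ("protein motif"[TIAB])) '
        f'AND ("{year}"[PDAT] : "{year}"[PDAT])'
    )


def esearch_count_url(year):
    """Return an ESearch URL that asks PubMed only for the hit count."""
    params = {"db": "pubmed", "term": motif_query(year), "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


# One URL per year on the chart's horizontal axis (1975 to 2015 is read
# off the slide; the counts themselves would come from PubMed).
urls = {year: esearch_count_url(year) for year in range(1975, 2016)}
```

Fetching each URL and plotting the returned counts against the year would give a curve of the kind shown on the slide; no counts are reproduced here.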

Page 13: Thesis def

Relational Model of Functional Data - A Precise Model of Protein Functional Semantics.

BMC Genomics, 2009

Page 14: Thesis def

RMSD = 0.9

NCBI_FEDERATED + Mimosa

BMC Genomics, 2009

Page 15: Thesis def

A Peptide Annotation Pipeline

BMC Bioinformatics 2010, 11:328

Page 16: Thesis def

Further (GO) integration controls for the degenerate nature of motif searches

PLOS One, 2010

~400

~400

~900

Page 17: Thesis def

Short Sequences are degenerate

… Can they be merged with structural and evolutionary information?

Chemistry & Biology, January 2000
BMC Genomics, 2009

Page 18: Thesis def

Venn: An Integrated Application for Database-Driven Homology Threading of Protein Structures …

Nucleic Acids Research, 2009
Trends in Plant Science, 2010

Page 19: Thesis def

VENN: "Twilight Zone" Sequence Homology Threading

NAR, 2009

Page 20: Thesis def

Left to right …

1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL YD … GDWPSNY
1PRL (Chicken FYN) APPLPR YD ... WPNY (not shown)
1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY

VENN-InterfaceMiner: How do different SH3 binding peptides functionally relate to one another?

Page 21: Thesis def

Standard BLAST Searches

Page 22: Thesis def

SSPEs reside in the “Twilight Zone”

J. Bacteriology 2011

Page 23: Thesis def

What happens when a sequence is inherently noisy?

max: 100-250
e val: 10E-3 ...
word size: 3-5
score matrix: 80, 62, 30
gap?: 0, 4
Q/N?

manskysktdvqqvkrqnqqsasgqgqygtefgsetdaqqvrkqnqsaeqnkqqns
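The "Q/N?" prompt can be made concrete by measuring how much of the slide's sequence is glutamine (Q) or asparagine (N). This is a minimal sketch; the function name is hypothetical, and reading "Q/N?" as a composition question is an assumption.

```python
from collections import Counter

# The sequence shown on the slide, copied verbatim.
SLIDE_SEQUENCE = "manskysktdvqqvkrqnqqsasgqgqygtefgsetdaqqvrkqnqsaeqnkqqns"


def residue_fraction(sequence, residues="QN"):
    """Fraction of the sequence made up of the given one-letter residues."""
    counts = Counter(sequence.upper())
    return sum(counts[r] for r in residues) / len(sequence)


qn = residue_fraction(SLIDE_SEQUENCE)  # 19 of 56 residues, about 0.34
```

A sequence this rich in two residue types can match many unrelated proteins by composition alone, which is one way a search becomes noisy.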

Page 24: Thesis def

Sequence mining in 2D

Page 25: Thesis def

Use a hypersensitive sequence search (+), and expand results into a 2nd dimension (-).

Combine with taxonomic information to pinpoint a first estimate of the gene’s appearance.

J. Bacteriology 2011
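One possible reading of "expand results into a 2nd dimension" is to re-run the search using each first-round hit as a new query. The sketch below illustrates that reading only; it is an assumption, not the published method. A shared k-mer test stands in for a real permissive BLAST search, and all function names and toy sequences are hypothetical.

```python
def kmers(seq, k=3):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def permissive_search(query, database, min_shared=1, k=3):
    """Stand-in for a hypersensitive search: any shared k-mer counts as a hit."""
    q = kmers(query, k)
    return {
        name
        for name, seq in database.items()
        if len(q & kmers(seq, k)) >= min_shared
    }


def expand_hits(query, database, min_shared=1, k=3):
    """First dimension: hits for the query.
    Second dimension: for each hit, the hits obtained when it is the query."""
    first = permissive_search(query, database, min_shared, k)
    grid = {}
    for name in sorted(first):
        second = permissive_search(database[name], database, min_shared, k)
        grid[name] = second - {name}
    return grid


# Toy database: "c" overlaps "b" but shares no 3-mer with the query.
toy_db = {
    "a": "MANSKYSK",
    "b": "SKYSKTDV",
    "c": "KTDVQQVK",
    "d": "GGGGGGGG",
}
grid = expand_hits("MANSKY", toy_db)
```

In the toy data, sequence "c" is invisible to the first search and only appears through hit "b", which is the kind of distant relative a second dimension is meant to surface. Taxonomy labels on each hit could then be attached to the grid.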

Page 26: Thesis def

R3: A prototypical method for improved structure calculation.

Page 27: Thesis def

R3: Convergence is generally improved by reseeding

Page 28: Thesis def

Availability

Sequence, Structure

Sequence, Function

Structure Sequence Taxonomy Function, Specificity

Sequence Taxonomy, Evolution

www.connjur.org

mnm.engr.uconn.edu

venn.vcell.uchc.edu

www.bio-toolkit.com

Page 29: Thesis def

RMSD = 0.9

NCBI_FEDERATED + EXPERT SYSTEM

BMC Genomics, 2009

Page 30: Thesis def

VENN: Fine-grained analysis.

Nucleic Acids Research, 2009

Page 31: Thesis def

Residue enrichment profiles.

NCBI_FEDERATED : Taxonomy, Domain, Homologene & Refseq.

Page 32: Thesis def

VENN: Fine-grained analysis of SH3-bound peptides reveals a similar interface for divergent sequences. Are the peptides similar to ?

Left to right …

1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL YD … GDWPSNY
1PRL (Chicken FYN) APPLPR YD ... WPNY
1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY

Page 33: Thesis def

Solution: Use a hypersensitive sequence search, and expand results into a 2nd dimension.

Combined with taxonomic information, this pinpoints a first estimate of the gene’s appearance.

Page 34: Thesis def

Gene Duplication, Domain Reuse, Functional Motifs, and Variance of Structural Specificity

- "Twilight Zone" homologies
- Structural Interfaces
- Binding Specificity
- Short Functional Motifs

Vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures. Nature 2001

Page 35: Thesis def

Doolittle

* Functional Protein Bioinformatics
    - CDD, MnM, Modular evolution of Proteins

* Database Normalization
    - "Archival" -> low S/N; unrepresentative

* Protein-centric sequence searching
    - Rous Sarcoma Discovery (DNA, lost in translation)

***** All done before modern computing/database theory.

Page 36: Thesis def

The Modern Age    

GenBank - archival

NCBI / EBI - sequence data curation

PDB/BMRB - structural data curation, deposition

GO - functional annotations 

...............................

Page 37: Thesis def

What is data modelling ?

- Ambiguity vs. Vagueness

- "Text" vs "Syntax"

- Biological Data: No clear "reference object". Solution: CONTEXT

Page 38: Thesis def

Integration Strategies

Database Federation

Architectures

Data Warehousing

Data Marts

Page 39: Thesis def

When To Federate ?

* New Genomes... Draft sequences.

* Reproducibility is less important than insight. 

Page 40: Thesis def

Stark et al.

Control of the G2/M Transition 2006

Page 41: Thesis def

Problem: Hundreds of native peptides contain subsequences that are predicted to bind SH3 domains. For example, [KR]..[KR] and P..P are known to interact with SH3 domains. But there is no method for comparing the structural binding mechanisms behind these variant peptides. Such a method is needed because the human genome encodes hundreds of SH3 domains, and the Protein Data Bank holds several diverse structures of them, which cannot be collectively analyzed by eye.
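The two patterns named in the problem statement, [KR]..[KR] and P..P, are already valid regular expressions, so a scan for them can be sketched directly. The function name is hypothetical. A lookahead is used so that overlapping occurrences are reported, which is a design choice of this sketch rather than something stated on the slide.

```python
import re

# The two SH3-related patterns quoted on the slide.
SH3_PATTERNS = ("[KR]..[KR]", "P..P")


def find_motifs(sequence, patterns=SH3_PATTERNS):
    """Return (pattern, start index, matched text) for every occurrence,
    including overlapping ones."""
    hits = []
    for pattern in patterns:
        # A zero-width lookahead lets matches overlap; group 1 holds the text.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((pattern, match.start(), match.group(1)))
    return hits


# Peptide from the 1PRL (Chicken FYN) example on the slides.
hits = find_motifs("APPLPR")  # [("P..P", 1, "PPLP")]
```

Because four-residue patterns like these match very often by chance, a hit from this scan is only a candidate; that degeneracy is the motivation for the structural comparison described in the solution.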

Results

Left to right …

1AZG (Human FYN) PRPLPVAP binds LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL binds YD … GDWPSNY
1PRL (Chicken FYN) APPLPR binds YD ... WPNY
1H3H (Mouse GRB2) SRSTK binds ENPSWWTLPANY

Solution: Use the VENN program for homology titration to extract molecular interfaces from SH3 bound peptides.

1) For each atom “a1” in each peptide chain of a structure:
   For each atom “a2” in a DIFFERENT chain of the same structure:
   Is “a1” close to “a2”?
   If yes, store a1, a2.
   If no, keep going.

2) Now, create a “synthetic structure” that contains only the residues associated with atoms stored in step (1) and ignores covalent peptide bonds entirely. This structure represents a molecular interface, where all non-interacting residues are considered to be “extraneous noise”.

3) To test the biological relevance of the molecular interface, apply it to varying species: Is the same signature generated from different structures?
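Steps 1 and 2 can be sketched as a pairwise distance filter over atoms. This is an illustration under stated assumptions, not the VENN implementation: the atom tuple layout, the function name, and the 4.0 Å cutoff for "close" are all hypothetical, since the slide does not give a threshold.

```python
from math import dist


def interface_residues(atoms, cutoff=4.0):
    """Return residues that have at least one atom within `cutoff` of an
    atom in a different chain.

    atoms: iterable of (chain_id, residue_number, residue_name, (x, y, z)).
    cutoff: assumed distance threshold in angstroms (not from the slide).
    """
    atoms = list(atoms)
    contacts = set()
    for chain1, num1, name1, xyz1 in atoms:
        for chain2, num2, name2, xyz2 in atoms:
            if chain1 == chain2:
                continue  # step 1 compares atoms from DIFFERENT chains only
            if dist(xyz1, xyz2) <= cutoff:
                # Step 2: keep the residues, not the covalent chain context.
                contacts.add((chain1, num1, name1))
                contacts.add((chain2, num2, name2))
    return sorted(contacts)


# Toy coordinates: only PRO 1 (chain A) and TRP 5 (chain B) are in contact.
toy_atoms = [
    ("A", 1, "PRO", (0.0, 0.0, 0.0)),
    ("A", 2, "ARG", (10.0, 0.0, 0.0)),
    ("B", 5, "TRP", (3.0, 0.0, 0.0)),
    ("B", 6, "TYR", (30.0, 0.0, 0.0)),
]
interface = interface_residues(toy_atoms)
```

The returned residue list plays the role of the "synthetic structure": it drops everything that is not in contact. For step 3, running the same function on structures from different species and comparing the resulting residue signatures would test whether the interface recurs. The double loop is quadratic in atom count; a spatial index would be the usual refinement for large structures.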

Conclusion: Although the W/P/N/Y residues in SH3 domains are far apart and variably spaced in sequence distance, they may have evolved to possess a common feature: conformance to a highly specific molecular interface.

Mouse GRB2 and Human FYN are completely different domains, in different species, which bind different peptides. Yet surprisingly, their binding sites conform to the same interface.

Venn is available at http://sbtools.uchc.edu/venn.

Page 42: Thesis def

Orthologous Homology Threading: Coarse-Grained Function ...

Page 43: Thesis def

Do canonical binding motifs in proteins exhibit structural specificity when unbound?

8000 distinct PDB chains (out of 35000 total structures).

• SH3 Bound non PXXPs
  o 1AZG PLPV 137
  o 1AZG PRPL 107
  o 1PRL PPLP 150
  o 1PRL PLPR 154

• Non SH3 complexed PXXPs
  o 2DJY PPPP 89
  o 1WA7 PGMP 111

• Non SH3 bound, non PXXP
  o 2ORU PATG 817

Page 44: Thesis def

history

Human Genome - 2001
SCOP - 1994
SwissProt/NCBI - 1986/1988
NEWAT - 1981
PDGF ~ v sis - 1983
Smith-Waterman - 1981
PDB - 1973
Needleman-Wunsch - 1970
ATLAS - 1965
Insulin Sequence - 1955
Double Helix - 1953