Can New Oracle10g Search Features Help Bridge the Biological Discovery Gap?
Jake Y. Chen, Ph.D., Head of Computational Proteomics & Principal Bioinformatics Scientist
Marcel Davidson, Head of Data Management
Myriad Proteomics

Messages
• New informatics challenges in protein interactomics R&D
  • Scale, integration, discovery issues
  • A data-driven, discovery-oriented framework
• "Enabling" features in 10g?
  • Biological data integration?
  • Biological data analysis integration?

Outline
• Data-driven Discovery-oriented Computational Framework
• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Why Myriad Maps Protein-Protein Interactions
[Figure: Conventional Drug Discovery vs. Post-Genomic Drug Discovery. Labels: nucleus; GPCR; enzyme; hormone receptor; non-specific targets; novel, more specific targets; novel, druggable targets; enhanced pre-validation target pool; target validation; lead discovery, optimization.]

Principle of the Yeast Two-Hybrid (Y2H) System
[Figure: the bait (human protein X fused to a DNA-binding domain) and the prey (a second human protein fused to an activation domain) sit on the DNA upstream of a reporter gene.]
Scenario A: Human Proteins X and Y do Interact
• Readout: Yeast colonies grow
Scenario B: Human Proteins X and Z do not Interact (no reporter gene activity)
• Readout: No growth of yeast colonies

Data Collected from Y2H System

Search Experiment  Bait Fragment                                                                     Prey Fragment
0001               TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC  TGTGAGCCGGGGACCATGCAGCCCGAAACCTCCAGTCACTGCGCCCGGCAGGAGTCAGGAGCCAGGGACTGTGCAGCCTG
0002               GCCGAGAAAGATCACACACAAGGCTGTCACTTCATACTTGGAGAGTTGCACAGCGGCGGGGCAGAGGAGCTCCTCACTTC  TATCAAATTGAAGAAGTATGACGGTCGAACCAAACCCATTCCAAA

Perform BLAST against the human REFSEQ DB:

Search Experiment  Bait Sequence  Prey Sequence
0001               NM_016333.2    NM_021962.1
0002               NM_016486.2    NM_003134.1

Protein Interaction Network (Snapshot of ~8,000 interactions)

Knowledge Discovery (KD) Challenges
Experimental Measurement
• >1.5 million sequence fragments
• ~250,000 search experiments performed
• several TB data storage
Protein Interaction Data
• ~80,000 unique interactions
• ~100 biological data sources
Distilled Information
• ~1000 relevant interactions for each pathway of interest
Marketable Knowledge
• 1-10 novel drug targets per disease
$$$, drugs, …
??
• Data-driven
• Discovery-oriented

KD in Interaction-based Proteomics
E-RDBMS
Represent interactions and pathways
• Domain-specific data modeling
Reduce data noise
• Data cleansing
• Statistical data analysis
Organize data in regulatory pathways
• DB querying
• Visualization
Select and validate drug targets
• Knowledge curation
Genomics/functional genomics data
• Data integration
Collect raw sequences and lab condition measurements
• LIMS programming

Bioinformatics DB Framework
[Figure: database components. Sequence Bank; Sequence Annotation DB; Lab Sequence DB; R2H Analysis DataMart; Y2H LIMS DB; Base-calling Results; BLAST Analysis Results; Domain Analysis Results; Curated Interaction DB; Protein Domain DB; Pubmed; Gene Ontology DB; Gene Expression DB; OMIM; Pronet DB; REFSEQ DB; Ensembl; LocusLink DB.]
• Annotation DB: RefSeq, LocusLink, GO, OMIM, CGAP, Protein Kinase DB, GPCR DB, Ensembl, Curation, …
• Y2H Data Processing and Analysis DB: Lab_Seq, Seq_Match, Y2H_Mart
• Y2H Interaction Data Mart: Y2h_Mart

A Schema Fragment to Manage Sequence Similarity Results
Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems Journal, 28(4), p287-310.

Interaction Matrix using Randomly Ordered Locus IDs
• 12,958 unique interactions
• 1955 bait loci
• 2766 prey loci
Jake Yue Chen, et al. (2003) Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.

Outline
• Data-driven Discovery-oriented Computational Framework
• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Oracle10g Regular Expressions: Powerful String Processing
• RE: new tools in Oracle10g
  • Search and manipulate data strings of arbitrary complexity
• Prior database solutions
  • SQL LIKE operator
  • Java stored procedures, C external libraries
• Prior non-database solutions: AWK, SED, GREP, PERL, etc.
• Done now inside the database
  • Facilitates rapid data-centric analysis

Case 1: Retrieving Protein Data from SGD (Saccharomyces Genome Database)
[Figure: SGD web page, with callouts marking the ORF identifier and the associated amino acid sequence.]

HTTP Raw Data
</script>
</head><body><body bgcolor='#FFFFFF'>
<table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded">
<input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" />
</form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table>
<table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table>
<p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4
MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS
WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK
VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS
VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES
GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK*
</pre><hr size="2" width="75%">
<table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"><input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a></form></td></tr></table></body></html>
Need to parse out the embedded AA sequence.

Function to Return AA Sequence Given ORF
create or replace function orf2seq (
  p_orf in varchar2                 -- parameterized ORF id
) return varchar2 is
  v_stream clob;
  strt     number;
begin
  -- Retrieve the HTTP stream (web site URL):
  v_stream := httpuritype.getclob(httpuritype.createuri(
    'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf||
    '&flankl=0&flankr=0&map=p3map')
  );

  -- Trim off the head of the stream:
  strt := dbms_lob.instr(v_stream, 'Submit', 1, 1);

  -- Strip out control characters, new lines, etc.
  -- (RegExp to remove control chars from the HTTP stream):
  v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt),
                             '[[:cntrl:]]',
                             '');

  -- Return the AA sequence (RegExp to extract the AA sequence).
  -- v_stream was already trimmed at strt above, so read it from position 1
  -- instead of applying the strt offset a second time:
  return(regexp_substr(dbms_lob.substr(v_stream, 4000, 1),
                       '[[:upper:]]{10,}'));
end;

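The two regular-expression steps in orf2seq can be tried outside the database. The following is a minimal Python sketch, not the authors' code: the function name extract_aa_sequence and the shortened sample page are hypothetical, and the ASCII classes [\x00-\x1f\x7f] and [A-Z] are assumed stand-ins for the POSIX classes [[:cntrl:]] and [[:upper:]].

```python
import re


def extract_aa_sequence(page, min_len=10):
    """Return the first run of at least min_len uppercase letters, or None."""
    # Drop ASCII control characters (newlines, tabs, ...) so that a sequence
    # wrapped over several lines becomes one contiguous string.
    cleaned = re.sub(r"[\x00-\x1f\x7f]", "", page)
    # The first long run of uppercase letters is taken to be the AA sequence.
    match = re.search(r"[A-Z]{%d,}" % min_len, cleaned)
    return match.group(0) if match else None


# Hypothetical, heavily shortened stand-in for the SGD page body.
sample_page = (
    "<h1>Sequence for a region of YDR099W/BMH2</h1>\n"
    "<pre>>YDR099W Chr 4\n"
    "MSQTREDSVYLAKLAEQAERYEEM\n"
    "VENMKAVASSGQELSVEERN*</pre>\n"
)

print(extract_aa_sequence(sample_page))
```

Short uppercase runs such as the ORF name are skipped because they fall below the ten-letter minimum, which is the same heuristic the PL/SQL function relies on.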
Amino Acid Sequence for ORF 'YDR099W'
SQL> select orf2seq('YDR099W') from dual;

ORF2SEQ('YDR099W')
--------------------------------------------------------------------------------
MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQ
VELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEI
ATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES
GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK

Elapsed: 00:00:01.24
(Elapsed time < 2 sec., network latency)

SQL> insert into pseq (orf_id, sequence)
  2  values ('YDR099W', orf2seq('YDR099W'));

Case 2: Motif Searching in Proteins
PROSITE database of protein sequence motifs:
ID   TYR_PHOSPHO_SITE; PATTERN.
AC   PS00007;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Tyrosine kinase phosphorylation site.
PA   [RK]-x(2,3)-[DE]-x(2,3)-Y.
CC   /TAXO-RANGE=??E?V;
CC   /SITE=5,phosphorylation;
CC   /SKIP-FLAG=TRUE;
DO   PDOC00007;
Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt
TKP pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y.
• R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA
Oracle10g regular expression equivalent: [RK].{2,3}[DE].{2,3}[Y]
• [RK]: 1 Arginine or Lysine
• .{2,3}: 2-3 of any AA
• [DE]: 1 Aspartate or Glutamate
• .{2,3}: 2-3 of any AA
• [Y]: 1 Tyrosine

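The PROSITE-to-regular-expression translation above is mechanical, so it can be sketched in a few lines of Python. This is an illustrative sketch only: prosite_to_regex is a hypothetical helper, and it covers just the element types shown on the slide plus {..} exclusions, not the full PROSITE syntax (anchors such as < and > are not handled).

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regular expression."""
    pieces = []
    # Elements are separated by '-'; the pattern ends with a period.
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # x means any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # {..} means none of these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(core)
    return "".join(pieces)


tkp_regex = prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y.")
print(tkp_regex)
```

The result writes the final tyrosine as Y rather than [Y]; the two forms match the same strings.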
SQL Example: Retrieving all Interacting Proteins with TKP
select distinct
  substr(a.refseq_id, 1, 9) refseq_id,
  length(a.seq_string_varchar) seq_length,
  -- regexp_instr returns the first 4 instances of TKP in each sequence
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3,
  regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4
from
  target_db a,
  y2h_interaction_p b
where
  a.refseq_id like 'NP%'
  -- regexp_like returns all rows with a TKP site
  and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]')
  and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) = b.prey_refseq);

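The four regexp_instr columns can be mimicked in Python to check individual sequences by hand. This is a sketch under stated assumptions: motif_offsets is a hypothetical helper, occurrences are taken as successive non-overlapping matches (which is how the fourth argument of regexp_instr is assumed to behave here), and the test string is the first 101 residues of the example sequence shown on the output slide, split so that the four motif hits stand out.

```python
import re

TKP = re.compile(r"[RK].{2,3}[DE].{2,3}Y")


def motif_offsets(sequence, pattern=TKP, count=4):
    """Return 1-based start offsets of the first `count` matches, 0-padded."""
    offsets = [m.start() + 1 for m in pattern.finditer(sequence)][:count]
    # A missing occurrence is reported as 0, as in the SQL output.
    return offsets + [0] * (count - len(offsets))


# First 101 residues of the example sequence; motif hits on the right.
example_prefix = (
    "MHHCKRY" "RSPEPDPY"
    "LSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRS" "RSHDRLPY"
    "QRRY" "RERRDSDTY"
    "RCEERSPSFGEDYYGPSRSRHR" "RRSRERGPY"
)

print(motif_offsets(example_prefix))
```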
SQL Example Output
REFSEQ_ID    SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS
------------ ---------- ----------- ----------- ----------- -----------
NP_003961          1465          14         202         347         537
NP_003968           330         241           0           0           0
NP_003983           490           8          50          62          93
NP_004001          3562        3085           0           0           0
...

Example sequence (length 490; offsets as in the NP_003983 row):
MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFG
EDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVG
NLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLG
KNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVAD
FGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHR
TRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEER
SFHTSRNPSR

Pattern: [RK].{2,3}[DE].{2,3}[Y]
• Motif #1 at offset 8: RSPEPDPY
• Motif #2 at offset 50: RSHDRLPY
• Motif #3 at offset 62: RERRDSDTY
• Motif #4 at offset 93: RRSRERGPY
Result: 702 (56%) interacting proteins with TKP site

Is 56% TKP in interacting proteins significant?

                                      All Curated Proteins  Curated Proteins w/ TKP  Percent with TKP
Total NP Entries                      16,908                6,991                    41%
Myriad Proteomics Interaction Subset  1,248                 702                      56%

Random sample test of all NP entries
• N = 33 random samples
• Sample size 7.4% (~1251)
• Sample mean = 515
• SD = 17.2
• Significance level < 1E-30

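The figures on this slide can be cross-checked with a few lines of arithmetic. The slide does not state how its significance level was computed, so the z-score below is only one simple reading (the observed count measured against the random-sample mean, in units of the sample SD) and is not a reproduction of the authors' test. All variable names are illustrative.

```python
# Counts taken from the table above.
total_np, total_tkp = 16908, 6991        # all curated NP entries
subset_np, subset_tkp = 1248, 702        # Myriad Proteomics interaction subset

# Random-sample statistics taken from the bullet list above.
sample_mean, sample_sd = 515, 17.2

background_rate = total_tkp / total_np   # fraction of all NP entries with TKP
subset_rate = subset_tkp / subset_np     # fraction of interacting proteins with TKP
sample_size = round(0.074 * total_np)    # 7.4% of all NP entries

# How many sample SDs the observed count lies above the random-sample mean.
z_score = (subset_tkp - sample_mean) / sample_sd

print(f"background rate: {background_rate:.1%}")
print(f"subset rate:     {subset_rate:.1%}")
print(f"sample size:     {sample_size}")
print(f"z-score:         {z_score:.2f}")
```

The random-sample mean of 515 out of roughly 1251 entries is consistent with the 41% background rate, while the observed 702 lies well outside that range.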
Outline
• Data-driven Discovery-oriented Computational Framework
• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Similarity Search (Sequence Comparison): A Routine Biology Task
[Figure: a query sequence is compared against n target sequences, giving k pair-wise comparison results.]
Similarity search has not been integrated into the DB system.

Using BLAST can be a laborious process & a data-management hell
• Custom setup of BLAST target database
• Iterate through query sequences: "Batch BLAST"
• Export/parse/filter/import data <-> DBMS
• Integration of results with external data

Case 1: Oracle 10g BLASTN as a sequence identification tool
SELECT t.t_seq_id, t.expect
FROM TABLE (
  BLASTN_MATCH (
    (select sequence FROM query_db where sequence_id = 100),
    CURSOR (select refseq_id, sequence_string FROM target_db
            where refseq_id like 'NM_%')
  ) ) t
WHERE t.expect < 1E-20;

-- A sequence fragment with a sequence_id = 100
-- Sequence is stored in the query_db table.
TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC

T_SEQ_ID          EXPECT
----------------- --------------
NM_016333.2       0

Case 2: Discovering "Interlogs"
[Figure: homology mapping between the Yeast Protein Interactome and the Human Protein Interactome; nodes labeled A, B, C, X, Y, Z.]
Interlogs: (A|X, B|Y) and (A|X, B|Z)

A Computationally Intensive Task
• Data to use
  • Yeast protein-protein interaction data
  • Yeast protein sequences
  • Human protein-protein interaction data
  • Human protein sequences & annotations
• Analysis to prepare
  • Homology search: yeast vs. human proteins
• Things to consider
  • Collect/parse public data from web
  • Import/export data for BLAST
  • Connect analysis result to internal data
Callouts: Missing data. Laborious. Traditional way? Or inside DBMS?

Pipelining Missing Data Directly into BLASTP Searches
insert into yeast_human_homolog
select
  'YDR099W'   Yeast_ORF_name,
  t.t_seq_id  Human_refseq,
  t.expect    E_Value
from TABLE (
  BLASTP_MATCH (                                  -- BLAST in DBMS
    (SELECT orf2seq('YDR099W') FROM dual),        -- online data integration
    CURSOR (SELECT refseq_id, sequence_string     -- BLAST target DB customization
            FROM target_db
            WHERE refseq_id LIKE 'NP_%')
  ) ) t
WHERE t.expect < 0.0001;
-- Note: Iterate through Yeast ORF Names to perform batch BLAST.

Mission Impossible: Accomplished
SELECT
  a.orf_1, a.orf_2, b.human_refseq, b.e_value, c.human_refseq, c.e_value
FROM
  yeast_interaction a,
  yeast_human_homolog b,
  yeast_human_homolog c,
  y2h_interaction_p d
WHERE
  a.orf_1 = b.yeast_ORF_name and
  a.orf_2 = c.yeast_ORF_name and
  (
    (b.human_refseq = d.bait_refseq and c.human_refseq = d.prey_refseq)
    or
    (b.human_refseq = d.prey_refseq and c.human_refseq = d.bait_refseq)
  )
;

ORF_1    ORF_2    HUMAN_REFSEQ  E_VALUE     HUMAN_REFSEQ  E_VALUE
-------  -------  ------------  ----------  ------------  ----------
YCR002C  YHR107C  NP_xxxxx1     5.9279E-44  NP_yyyyy1     3.7130E-46
YCR002C  YJR076C  NP_xxxxx2     5.9279E-44  NP_yyyyy2     1.7807E-48
YJR076C  YHR107C  NP_xxxxx3     1.9734E-39  NP_yyyyy3     3.7130E-46
YCR002C  YHR107C  NP_xxxxx4     2.3257E-48  NP_yyyyy4     7.4988E-39
YCR002C  YJR076C  NP_xxxxx5     2.3257E-48  NP_yyyyy5     1.9734E-39
YJR076C  YHR107C  NP_xxxxx6     1.7807E-48  NP_yyyyy6     7.4988E-39

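The interlog join can be illustrated on toy data that mirrors the A/B/X/Y/Z example from the "Interlogs" slide. This Python sketch only shows the join logic; it is not a replacement for the in-database query. find_interlogs and all data values are hypothetical, and E-values are left out for brevity.

```python
def find_interlogs(yeast_interactions, homologs, human_interactions):
    """Return (orf_1, orf_2, human_1, human_2) tuples that form interlogs.

    A yeast interaction counts as an interlog when a human homolog of each
    partner also interacts, in either bait/prey orientation.
    """
    # Unordered pairs, so bait/prey orientation does not matter.
    human_pairs = {frozenset(pair) for pair in human_interactions}
    interlogs = []
    for orf_1, orf_2 in yeast_interactions:
        for human_1 in homologs.get(orf_1, []):
            for human_2 in homologs.get(orf_2, []):
                if frozenset((human_1, human_2)) in human_pairs:
                    interlogs.append((orf_1, orf_2, human_1, human_2))
    return interlogs


# Toy data (hypothetical): yeast proteins A, B, C; human proteins W, X, Y, Z.
yeast_interactions = [("A", "B"), ("A", "C")]
homologs = {"A": ["X"], "B": ["Y", "Z"], "C": ["W"]}
human_interactions = [("X", "Y"), ("Z", "X")]   # (bait, prey)

for row in find_interlogs(yeast_interactions, homologs, human_interactions):
    print(row)
```

Storing the human interactions as unordered pairs plays the same role as the two OR branches in the SQL WHERE clause.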
Conclusion
• A data-driven, discovery-oriented bioinformatics framework demands rich bio-specific DBMS support
• 10g Regular Expression and BLAST in DBMS features benefit our scientific discovery tasks in interactome studies
• Additional enhancements

References
Jake Yue Chen, et al. (2003) Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain. Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.
Sudhir Sahasrabudhe and Jake Yue Chen (2003) Extracting Biological Information from System-scale Protein Interactome Data. Tutorial at the 11th International Conference on Intelligent Systems in Molecular Biology. Brisbane, Australia.
Jake Yue Chen and John Carlis (2003) Similar_Join: Extending DBMS with a Bio-specific Operator. Proceedings of the 2003 ACM Symposium on Applied Computing. Melbourne, Florida.
Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems, Vol. 28, Issue 4: p287-310.