Can New Oracle10g Search Features Help Bridge the Biological Discovery Gap?

Myriad Proteomics

Jake Y. Chen, Ph.D.
Head of Computational Proteomics & Principal Bioinformatics Scientist

Marcel Davidson
Head of Data Management

Messages

• New informatics challenges in protein interactomics R&D
  • Scale, integration, discovery issues
  • A data-driven, discovery-oriented framework
• "Enabling" features in 10g?
  • Biological data integration?
  • Biological data analysis integration?

Outline

• Data-driven, Discovery-oriented Computational Framework
• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Why Myriad Maps Protein-Protein Interactions

[Figure: Conventional Drug Discovery vs. Post-Genomic Drug Discovery. Labels: nucleus; GPCR; enzyme; hormone receptor; non-specific targets; novel, more specific targets; novel, druggable targets; enhanced pre-validation target pool; target validation; lead discovery, optimization.]

Principle of the Yeast Two-Hybrid (Y2H) System

[Figure: the bait is human protein X attached to a DNA-binding domain; the prey is a second human protein (Y or Z) attached to an activation domain; both sit at the DNA of a reporter gene.]

• Scenario A: Human proteins X and Y do interact. Readout: yeast colonies grow.
• Scenario B: Human proteins X and Z do not interact (no reporter gene activity). Readout: no growth of yeast colonies.

Data Collected from Y2H System

Search Experiment 0001
  Bait fragment: TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC
  Prey fragment: TGTGAGCCGGGGACCATGCAGCCCGAAACCTCCAGTCACTGCGCCCGGCAGGAGTCAGGAGCCAGGGACTGTGCAGCCTG

Search Experiment 0002
  Bait fragment: GCCGAGAAAGATCACACACAAGGCTGTCACTTCATACTTGGAGAGTTGCACAGCGGCGGGGCAGAGGAGCTCCTCACTTC
  Prey fragment: TATCAAATTGAAGAAGTATGACGGTCGAACCAAACCCATTCCAAA

Perform BLAST against human REFSEQ DB:

Search Experiment   Bait Sequence   Prey Sequence
0001                NM_016333.2     NM_021962.1
0002                NM_016486.2     NM_003134.1

Protein Interaction Network (Snapshot of ~8,000 interactions)

Knowledge Discovery (KD) Challenges

From raw measurements to marketable knowledge:

• Experimental Measurement: >1.5 million sequence fragments; ~250,000 search experiments performed; several TB of data storage
• Protein Interaction Data: ~80,000 unique interactions; ~100 biological data sources
• Distilled Information: ~1,000 relevant interactions for each pathway of interest
• Marketable Knowledge: 1-10 novel drug targets per disease ($$$, drugs, …)

Bridging these levels (marked "??" on the slide) calls for an approach that is:

• Data-driven
• Discovery-oriented

KD in Interaction-based Proteomics

[Figure: knowledge discovery tasks arranged around a central E-RDBMS.]

Tasks:

• Collect raw sequences and lab condition measurements
• Genomics/functional genomics data
• Represent interactions and pathways
• Reduce data noise
• Organize data in regulatory pathways
• Select and validate drug targets

Techniques:

• LIMS programming
• Data integration
• Domain-specific data modeling
• Data cleansing
• Statistical data analysis
• DB querying
• Visualization
• Knowledge curation

Bioinformatics DB Framework

[Figure: database framework diagram. Components shown: Sequence Bank, Sequence Annotation DB, Lab Sequence DB, R2H Analysis DataMart, Y2H LIMS DB, Base-calling Results, BLAST Analysis Results, Domain Analysis Results, Curated Interaction DB, Protein Domain DB, Pubmed, Gene Ontology DB, Gene Expression DB, OMIM, Pronet DB, REFSEQ DB, Ensembl, LocusLink DB.]

The diagram groups the components into three areas:

• Annotation DB: RefSeq, LocusLink, GO, OMIM, CGAP, Protein Kinase DB, GPCR DB, Ensembl, Curation, …
• Y2H Data Processing and Analysis DB: Lab_Seq, Seq_Match, Y2H_Mart
• Y2H Interaction Data Mart: Y2h_Mart

A Schema Fragment to Manage Sequence Similarity Results

Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems Journal, 28(4), p287-310.

Interaction Matrix using Randomly Ordered Locus IDs

• 12,958 unique interactions
• 1,955 bait loci
• 2,766 prey loci

Jake Yue Chen, et al. (2003) Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.

Outline

• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Oracle10g Regular Expressions: Powerful String Processing

• Regular expressions (RE) are new tools in Oracle10g
  • Search and manipulate data strings of arbitrary complexity
• Prior database solutions
  • SQL LIKE operator
  • Java stored procedures, C external libraries
• Prior non-database solutions: AWK, SED, GREP, PERL, etc.
• Done now inside the database
  • Facilitates rapid data-centric analysis
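To make the contrast with the LIKE operator concrete, here is a minimal sketch. It is written in Python rather than Oracle SQL so that it runs without a database, and the helper names `like_to_regex` and `like` are hypothetical, not part of any Oracle API. It illustrates that every LIKE pattern maps onto a regular expression, while a pattern with character classes and bounded repetition has no LIKE equivalent.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern (no ESCAPE clause) into a regex body."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)

def like(value: str, pattern: str) -> bool:
    """Emulate `value LIKE pattern` by matching the whole string."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None

# LIKE can express simple prefix filters ...
print(like("NP_003961", "NP%"))      # True
print(like("NM_016333.2", "NP%"))    # False

# ... but not a motif with character classes and bounded gaps.
motif = re.compile(r"[RK].{2,3}[DE].{2,3}Y")
print(motif.search("MHHCKRYRSPEPDPY") is not None)  # True
```

LIKE offers only the two wildcards handled above, which is why motif-style searches previously had to leave the database or rely on stored procedures.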

Case 1: Retrieving Protein Data from SGD (Saccharomyces Genome Database)

[Screenshot: SGD sequence page, with callouts marking the ORF identifier and the associated amino acid sequence.]

HTTP Raw Data

Abbreviated HTML source of the SGD page for YDR099W (navigation and footer markup omitted, marked [...]):

    [...]<form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded">
    <input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /></form>[...]
    <h1>Sequence for a region of YDR099W/BMH2</h1>[...]
    <font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font>[...]
    <pre>>YDR099W Chr 4
    MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS
    WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK
    VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS
    VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES
    GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK*</pre>[...]

Need to parse out the embedded AA sequence.

Function to Return AA Sequence Given ORF

    create or replace function orf2seq (
      p_orf in varchar2
    ) return varchar2 is
      v_stream clob;
      strt number;
    begin
      -- Retrieve the HTTP stream:
      v_stream := httpuritype.getclob(httpuritype.createuri(
        'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf||
        '&flankl=0&flankr=0&map=p3map')
      );

      -- Trim off the head of the stream:
      strt := dbms_lob.instr(v_stream, 'Submit', 1, 1);

      -- Strip out control characters, new lines, etc.:
      v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', '');

      -- Return the AA sequence:
      return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}'));
    end;

Callouts on the slide:

• Web site URL (the 'http://db.yeastgenome.org/...' string)
• Parameterized ORF Id (p_orf)
• RegExp to remove control chars from HTTP stream ('[[:cntrl:]]')
• RegExp to extract AA sequence ('[[:upper:]]{10,}')
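The three parsing steps in orf2seq (trim at a marker, strip control characters, take the first long run of uppercase letters) can be tried offline. The sketch below is a rough Python analogue, not the authors' code: the function name `extract_aa_sequence` and the shortened `sample_page` are assumptions made for illustration, no HTTP request is made, and the POSIX classes `[[:cntrl:]]` and `[[:upper:]]` are approximated with ASCII ranges.

```python
import re

def extract_aa_sequence(page, marker="Submit", window=4000):
    """Mimic the orf2seq parsing steps on an in-memory page string."""
    start = page.find(marker)             # trim off the head of the stream
    if start < 0:
        return None
    chunk = page[start:start + window]    # keep at most `window` characters
    chunk = re.sub(r"[\x00-\x1f\x7f]", "", chunk)   # strip control chars/newlines
    match = re.search(r"[A-Z]{10,}", chunk)         # first run of >= 10 capitals
    return match.group(0) if match else None

# Hypothetical, heavily shortened stand-in for the SGD page.
sample_page = (
    '<input type="submit" name="Submit" value="Submit" />\n'
    "<pre>>YDR099W Chr 4\n"
    "MSQTREDSVYLAKLAEQAERYEEMVENMKAV\n"
    "ASSGQELSVEERNLL*\n"
    "</pre>"
)

print(extract_aa_sequence(sample_page))
```

Removing the newlines before the final search is what lets the sequence, which the page splits across lines, come back as one contiguous string.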

Amino Acid Sequence for ORF 'YDR099W'

    SQL> select orf2seq('YDR099W') from dual;

    ORF2SEQ('YDR099W')
    --------------------------------------------------------------------------------
    MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQ
    VELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEI
    ATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES
    GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK

    Elapsed: 00:00:01.24

Elapsed time < 2 sec. (network latency)

    SQL> insert into pseq (orf_id, sequence)
      2  values ('YDR099W', orf2seq('YDR099W'));

Case 2: Motif Searching in Proteins

PROSITE database of protein sequence motifs:

    ID   TYR_PHOSPHO_SITE; PATTERN.
    AC   PS00007;
    DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
    DE   Tyrosine kinase phosphorylation site.
    PA   [RK]-x(2,3)-[DE]-x(2,3)-Y.
    CC   /TAXO-RANGE=??E?V;
    CC   /SITE=5,phosphorylation;
    CC   /SKIP-FLAG=TRUE;
    DO   PDOC00007;

Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt

• TKP pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y.
  • R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA
• Oracle10g regular expression equivalent: [RK].{2,3}[DE].{2,3}[Y]

Reading the TKP motif pattern left to right:

• [RK]: 1 Arginine or Lysine
• .{2,3}: 2 to 3 of any amino acid
• [DE]: 1 Aspartate or Glutamate
• .{2,3}: 2 to 3 of any amino acid
• [Y]: 1 Tyrosine
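The hand translation from PROSITE syntax to a regular expression is mechanical, so it can be sketched as a small converter. This is an illustrative Python sketch, not something shown in the deck: `prosite_to_regex` is a hypothetical helper that covers only the syntax elements handled in the code (single residues, `x`, `[...]` sets, `{...}` exclusions and `(n)` / `(n,m)` repeats), not the full PROSITE grammar.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into an equivalent regular expression."""
    body = pattern.strip().rstrip(".")          # drop the trailing period
    parts = []
    for element in body.split("-"):             # elements are dash-separated
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":                         # x = any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # {ABC} = anything but A, B, C
        if low is not None:                     # (n) or (n,m) repetition
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

print(prosite_to_regex("[RK]-x(2,3)-[DE]-x(2,3)-Y."))
```

The result writes the final residue as `Y` rather than `[Y]`; the two forms match the same strings.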

SQL Example: Retrieving all Interacting Proteins with TKP

    select distinct
      substr(a.refseq_id, 1, 9) refseq_id,
      length(a.seq_string_varchar) seq_length,
      regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1,
      regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2,
      regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3,
      regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4
    from
      target_db a,
      y2h_interaction_p b
    where
      a.refseq_id like 'NP%'
      and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]')
      and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) = b.prey_refseq);

Callouts on the slide:

• regexp_like: returns all rows with a TKP site
• regexp_instr: returns the first 4 instances of TKP in each sequence

SQL Example Output

    REFSEQ_ID    SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS
    ------------ ---------- ----------- ----------- ----------- -----------
    NP_003961          1465          14         202         347         537
    NP_003968           330         241           0           0           0
    NP_003983           490           8          50          62          93
    NP_004001          3562        3085           0           0           0
    ...

Sequence of NP_003983, in which the slide highlights the four matches of [RK].{2,3}[DE].{2,3}[Y]:

    MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFG
    EDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVG
    NLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLG
    KNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVAD
    FGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHR
    TRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEER
    SFHTSRNPSR

• Motif #1 at offset 8 (RSPEPDPY)
• Motif #2 at offset 50 (RSHDRLPY)
• Motif #3 at offset 62 (RERRDSDTY)
• Motif #4 at offset 93 (RRSRERGPY)

Result: 702 (56%) interacting proteins with TKP site
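The offset columns can be reproduced outside the database. The sketch below is an illustrative Python analogue of the four regexp_instr calls, not the authors' code: `motif_offsets` is a hypothetical helper, and `prefix` is the first 101 residues of the NP_003983 sequence as reconstructed from this transcript, so treat it as illustrative data. Like the slide's output, it reports 1-based offsets of successive non-overlapping matches and pads with 0 when fewer than four matches exist.

```python
import re
from itertools import islice

# Same pattern as on the slide; [Y] and Y are equivalent.
TKP = re.compile(r"[RK].{2,3}[DE].{2,3}Y")

def motif_offsets(seq, limit=4):
    """Return 1-based start offsets of the first `limit` matches, 0-padded."""
    offsets = [m.start() + 1 for m in islice(TKP.finditer(seq), limit)]
    return offsets + [0] * (limit - len(offsets))

# First 101 residues of the sequence shown for NP_003983 (reconstructed).
prefix = (
    "MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFG"
    "EDYYGPSRSRHRRRSRERGPY"
)

print(motif_offsets(prefix))     # [8, 50, 62, 93]
print(motif_offsets("AAAA"))     # [0, 0, 0, 0]
```

The offsets agree with the NP_003983 row of the output table, which is a useful sanity check on both the pattern and the reconstructed sequence.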

Is 56% TKP in Interacting Proteins Significant?

                                            All Curated Proteins   Curated Proteins w/ TKP   Percent with TKP
    Total NP Entries                        16,908                 6,991                     41%
    Myriad Proteomics Interaction Subset    1,248                  702                       56%

Random sample test of all NP entries:

• N = 33 random samples
• Sample size: 7.4% of entries (~1,251)
• Sample mean = 515
• SD = 17.2
• Significance level < 1E-30
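The arithmetic behind this comparison can be checked in a few lines. The slide does not say which statistical test produced the quoted significance level, so the sketch below (Python, illustrative only; the variable names are assumptions) recomputes just the two percentages and the number of standard deviations separating the observed count from the random-sample mean.

```python
# Figures taken from the table and the random-sample summary.
total_np, total_np_tkp = 16908, 6991   # all curated NP entries / with TKP
subset, subset_tkp = 1248, 702         # interaction subset / with TKP
sample_mean, sample_sd = 515, 17.2     # TKP counts over 33 random samples

background_rate = total_np_tkp / total_np   # fraction of all NP entries with TKP
subset_rate = subset_tkp / subset           # fraction of interacting proteins with TKP

# Distance of the observed count from the random-sample mean, in SD units.
z_score = (subset_tkp - sample_mean) / sample_sd

print(f"background: {background_rate:.1%}")
print(f"subset:     {subset_rate:.1%}")
print(f"z-score:    {z_score:.2f}")
```

The observed 702 lies roughly eleven standard deviations above the random-sample mean, which is the sense in which the enrichment is far from chance. Note that the random samples (~1,251 entries) are close to, but not exactly, the size of the 1,248-protein subset.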

Outline

• 10g Regular Expression Case Studies
• 10g BLAST Case Studies

Similarity Search (Sequence Comparison): A Routine Biology Task

[Figure: a query sequence is compared against n target sequences, yielding k pair-wise comparison results.]

Similarity search has not been integrated into the DB system.

Using BLAST can be a laborious process & a data-management hell

• Custom setup of BLAST target database
• Iterate through query sequences: "Batch BLAST"
• Export/parse/filter/import data <-> DBMS
• Integration of results with external data

Case 1: Oracle 10g BLASTN as a sequence identification tool

    -- A sequence fragment with a sequence_id = 100
    -- Sequence is stored in the query_db table.
    -- TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC

    SELECT t.t_seq_id, t.expect
    FROM TABLE (
      BLASTN_MATCH (
        (select sequence FROM query_db where sequence_id = 100),
        CURSOR (select refseq_id, sequence_string FROM target_db
                where refseq_id like 'NM_%')
      )
    ) t
    WHERE t.expect < 1E-20;

    T_SEQ_ID          EXPECT
    ----------------- --------------
    NM_016333.2       0

Case 2: Discovering "Interlogs"

[Figure: homology mapping between the yeast protein interactome and the human protein interactome; nodes are labeled A, B, C, X, Y, Z.]

Interlogs: (A|X, B|Y) and (A|X, B|Z)

A Computationally Intensive Task

• Data to use
  • Yeast protein-protein interaction data
  • Yeast protein sequences
  • Human protein-protein interaction data
  • Human protein sequences & annotations
• Analysis to prepare
  • Homology search: yeast vs. human proteins
• Things to consider
  • Collect/parse public data from web
  • Import/export data for BLAST
  • Connect analysis result to internal data

Callouts on the slide: "Missing Data" and "Laborious". Traditional way? Or inside DBMS?

Pipelining Missing Data Directly into BLASTP Searches

    insert into yeast_human_homolog
    select
      'YDR099W'   Yeast_ORF_name,
      t.t_seq_id  Human_refseq,
      t.expect    E_Value
    from TABLE (
      BLASTP_MATCH (
        (SELECT orf2seq('YDR099W') FROM dual),
        CURSOR (SELECT refseq_id, sequence_string
                FROM target_db
                WHERE refseq_id LIKE 'NP_%')
      )
    ) t
    WHERE t.expect < 0.0001;

    -- Note: Iterate through Yeast ORF Names to perform batch BLAST.

Callouts on the slide:

• BLAST in DBMS (the BLASTP_MATCH table function)
• Online data integration (the orf2seq call)
• BLAST target DB customization (the CURSOR subquery)

Mission Impossible: Accomplished

    SELECT
      a.orf_1, a.orf_2, b.human_refseq, b.e_value, c.human_refseq, c.e_value
    FROM
      yeast_interaction a,
      yeast_human_homolog b,
      yeast_human_homolog c,
      y2h_interaction_p d
    WHERE
      a.orf_1 = b.yeast_ORF_name and
      a.orf_2 = c.yeast_ORF_name and
      (
        (b.human_refseq = d.bait_refseq and c.human_refseq = d.prey_refseq)
        or
        (b.human_refseq = d.prey_refseq and c.human_refseq = d.bait_refseq)
      );

    ORF_1    ORF_2    HUMAN_REFSEQ  E_VALUE     HUMAN_REFSEQ  E_VALUE
    -------- -------- ------------- ----------- ------------- -----------
    YCR002C  YHR107C  NP_xxxxx1     5.9279E-44  NP_yyyyy1     3.7130E-46
    YCR002C  YJR076C  NP_xxxxx2     5.9279E-44  NP_yyyyy2     1.7807E-48
    YJR076C  YHR107C  NP_xxxxx3     1.9734E-39  NP_yyyyy3     3.7130E-46
    YCR002C  YHR107C  NP_xxxxx4     2.3257E-48  NP_yyyyy4     7.4988E-39
    YCR002C  YJR076C  NP_xxxxx5     2.3257E-48  NP_yyyyy5     1.9734E-39
    YJR076C  YHR107C  NP_xxxxx6     1.7807E-48  NP_yyyyy6     7.4988E-39
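The logic of this join can be sketched on toy data. The example below is in Python so it runs without a database, and everything in it is illustrative: `find_interlogs` is a hypothetical helper, and the proteins A, B, X, Y, Z are placeholders borrowed from the interlog diagram, not real identifiers. A yeast interaction counts as an interlog when a homolog of each partner forms a known human interaction, in either bait/prey orientation.

```python
def find_interlogs(yeast_interactions, homologs, human_interactions):
    """Return (orf_1, orf_2, human_1, human_2) for each conserved interaction."""
    # Store human pairs as unordered sets so bait/prey orientation is ignored.
    human_pairs = {frozenset(pair) for pair in human_interactions}
    hits = []
    for orf_1, orf_2 in yeast_interactions:
        for human_1 in homologs.get(orf_1, ()):
            for human_2 in homologs.get(orf_2, ()):
                if frozenset((human_1, human_2)) in human_pairs:
                    hits.append((orf_1, orf_2, human_1, human_2))
    return hits

# Toy data: yeast A interacts with B; A maps to X; B maps to Y and Z.
yeast_interactions = [("A", "B")]
homologs = {"A": ["X"], "B": ["Y", "Z"]}
human_interactions = [("X", "Y"), ("Z", "X")]   # second pair is prey/bait swapped

for hit in find_interlogs(yeast_interactions, homologs, human_interactions):
    print(hit)
```

Unlike the SQL join, which returns one row per matching row of y2h_interaction_p, this sketch reports each homolog pair once even if the human interaction was recorded in both orientations.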

Conclusion

• A data-driven, discovery-oriented bioinformatics framework demands rich bio-specific DBMS support
• 10g Regular Expression and BLAST in DBMS features benefit our scientific discovery tasks in interactome studies
• Additional enhancements

References

Jake Yue Chen, et al. (2003) Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain. Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.

Sudhir Sahasrabudhe and Jake Yue Chen (2003) Extracting Biological Information from System-scale Protein Interactome Data. Tutorial at the 11th International Conference on Intelligent Systems in Molecular Biology. Brisbane, Australia.

Jake Yue Chen and John Carlis (2003) Similar_Join: Extending DBMS with a Bio-specific Operator. Proceedings of the 2003 ACM Symposium on Applied Computing. Melbourne, Florida.

Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems, Vol. 28, Issue 4: p287-310.
