Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University.

Exploring and Exploiting the Biological Maze

Zoé Lacroix

Arizona State University

Data collection queries

Scientific protocol
– Must be able to reproduce the process

Involve multiple resources
– Data sources
– Applications

Expressing scientific protocols

Scientific protocols mix design and implementation

Design
– What the protocol does (tasks)
– Scientific objects involved

Implementation
– How the protocol is executed
– Data sources and applications

Expressing scientific protocols

Scientific protocols are driven by their implementation
– Scientists use the resources they know
• data (quality)
• access to data
• format, limits, etc.
– Scientists may not exploit better resources because they do not know them

Queries should be driven by the design; the implementation should meet the needs of the design

Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.

*Courtesy of Dr. Marta Janer, Institute for Systems Biology

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs

From GenBank, dbEST and the RIKEN Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.



Annotated elements: data sources, tools, tasks, scientific objects

Pipeline Selecting Target Proteins*

Resources shown: SMART, Swiss-Prot; BIND, DIP, CEY2H; sigpep; blast x D.mel

Step 1 = retrieve all proteins from SMART and Swiss-Prot with textual search with the keyword “apoptosis”
Step 2 = retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”
Step 3 = retrieve their binding partners from DIP, BIND and the C. elegans dataset
Step 4 = run through a signal peptide prediction program such as SigPep to check for the presence of signal peptides in each of the sequences
Step 5 = homology search using BLAST of the retrieved sequences with proteins predicted from the Drosophila melanogaster genome might yield additional candidates
Output = final set of signal peptide proteins involved in apoptosis

*Courtesy of Dr. Terry Gaasterland, The Rockefeller University

Design and implementation

Step / Task / Implementation

Input
  Task: Relevant keyword for which the proteins are required

Step 1
  Task: All proteins with the keyword and with a signal peptide feature must be retrieved
  Implementation: SMART, Swiss-Prot

Step 2
  Task: Binding partners of all of these proteins are retrieved
  Implementation: DIP, BIND

Step 3
  Task: The integrated final set is run through a signal peptide prediction program
  Implementation: SigPep

Step 4
  Task: Homology search of the retrieved sequences with proteins predicted from the specific genome yields additional candidates
  Implementation: BLAST
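The separation between task and implementation in the table above can be sketched in code. This is a minimal illustration in Python, not part of BioNavigation: the `Step` class, the `design` and `rebind` helpers, and the resource name `OtherPredictor` are all hypothetical. It shows that swapping the resource behind a step leaves the design untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One protocol step: the task is the design, the resources are the implementation."""
    name: str
    task: str
    resources: List[str] = field(default_factory=list)


# Tasks and resources paraphrased from the table above.
PROTOCOL = [
    Step("Step 1", "retrieve proteins with keyword and signal peptide feature",
         ["SMART", "Swiss-Prot"]),
    Step("Step 2", "retrieve binding partners", ["DIP", "BIND"]),
    Step("Step 3", "run signal peptide prediction", ["SigPep"]),
    Step("Step 4", "homology search against predicted proteins", ["BLAST"]),
]


def design(protocol):
    """Return only the tasks: what the protocol does, independent of resources."""
    return [step.task for step in protocol]


def rebind(protocol, step_name, resources):
    """Return a new protocol where one step uses different resources; tasks are unchanged."""
    return [
        Step(s.name, s.task, list(resources)) if s.name == step_name else s
        for s in protocol
    ]


# "OtherPredictor" is a made-up resource name used only for illustration.
alternative = rebind(PROTOCOL, "Step 3", ["OtherPredictor"])
print(design(alternative) == design(PROTOCOL))
```

Keeping the task list separate from the resource bindings is what lets a query be written once at the design level and then matched against whichever resources are available.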

Expressing scientific pipelines with BioNavigation

Queries are expressed at a conceptual level (design)

Conceptual level (scientific classes): DNA Seq., Disease, Gene, Citation, Protein Seq.

Conceptual graph

Labeled edges
– Scientifically meaningful edges

Graph nodes: Gene, NucleotideSequence, DNA, RNA, mRNA, Protein
Edge labels: isA, transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom

Conceptual graph

Graph nodes: Gene, NucleotideSequence, DNA, RNA, mRNA, Protein
Edge labels: isA, transcribesTo, isTranscribedFrom, translatesTo, isTranslatedFrom, IsRelatedTo
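A conceptual graph with labeled edges can be represented as a list of (class, label, class) triples. The sketch below is a hypothetical Python illustration. The node and label names come from the slides, but which classes each edge connects is an assumption, since the diagram layout is not recoverable from the text.

```python
from collections import defaultdict

# (source class, edge label, target class).
# The endpoints of each edge are assumed for illustration only.
EDGES = [
    ("DNA", "isA", "NucleotideSequence"),
    ("RNA", "isA", "NucleotideSequence"),
    ("mRNA", "isA", "RNA"),
    ("Gene", "isA", "DNA"),
    ("DNA", "transcribesTo", "RNA"),
    ("RNA", "isTranscribedFrom", "DNA"),
    ("mRNA", "translatesTo", "Protein"),
    ("Protein", "isTranslatedFrom", "mRNA"),
]


def build_graph(edges):
    """Index edges by source class as lists of (label, target) pairs."""
    graph = defaultdict(list)
    for src, label, dst in edges:
        graph[src].append((label, dst))
    return graph


def neighbors(graph, node, label=None):
    """Classes reachable from node in one step, optionally restricted to one edge label."""
    return [dst for lbl, dst in graph.get(node, []) if label is None or lbl == label]


def unlabeled(edges):
    """Collapse every edge label to IsRelatedTo, as in a graph without labeled links."""
    return [(src, "IsRelatedTo", dst) for src, _, dst in edges]


labeled_graph = build_graph(EDGES)
generic_graph = build_graph(unlabeled(EDGES))
print(neighbors(labeled_graph, "DNA", "transcribesTo"))
print(neighbors(generic_graph, "DNA", "IsRelatedTo"))
```

With labels, a query can ask specifically for what DNA transcribes to; with only IsRelatedTo, the same query returns every neighboring class.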

Mapping to physical resources

Physical level (data sources): OMIM, GenBank, PubMed, HUGO, NCBI Protein

Conceptual level (scientific classes): DNA Seq., Disease, Gene, Citation, Protein Seq.
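Mapping the conceptual level to the physical level amounts to expanding each scientific class into the data sources that implement it. The Python sketch below is hypothetical. The pairing of classes to sources is inferred from the names on the slide rather than stated there, and `physical_paths` and `OtherDiseaseDB` are illustrative names only.

```python
from itertools import product

# Assumed class-to-source mapping; the slide lists these names
# but the pairing shown here is an inference, not a stated fact.
SOURCES = {
    "Disease": ["OMIM"],
    "DNA Seq.": ["GenBank"],
    "Citation": ["PubMed"],
    "Gene": ["HUGO"],
    "Protein Seq.": ["NCBI Protein"],
}


def physical_paths(conceptual_path, sources=SOURCES):
    """Expand a sequence of scientific classes into every combination of data sources."""
    choices = [sources[cls] for cls in conceptual_path]
    return [list(combo) for combo in product(*choices)]


query = ["Disease", "Protein Seq.", "Citation"]
print(physical_paths(query))

# Adding a second (made-up) source for a class multiplies the alternatives.
extended = dict(SOURCES, Disease=["OMIM", "OtherDiseaseDB"])
print(len(physical_paths(query, extended)))
```

A query written over classes stays the same when a new source is registered; only the set of candidate physical paths grows.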

Exploring biological metadata

“Return all citations that are related to some disease or condition”

Diabetes: 11
Aging: 71
Cancer: 391

OMIM

NUCLEOTIDE PROTEIN

PUBMED

(P1)(P2) (P3)

• Link: Entrez provides an index with the Links in the display option from each entry
• Parse: Parsing each entry to retrieve its related entries
• All: Entrez provides an index with the Links in the display option, which allows looking at a set of entries at a time

Selecting biological resources

3 resources that look the same
– Are they the same?

3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?
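Alternative paths between two sources can be enumerated mechanically once the links between sources are known. The Python sketch below is hypothetical: the link set is assumed for illustration (chosen so that three paths reach PUBMED from OMIM) and is not taken from the slides, nor does it claim to identify which path is P1, P2 or P3.

```python
# Assumed links between sources; illustrative only.
LINKS = {
    "OMIM": ["PUBMED", "NUCLEOTIDE", "PROTEIN"],
    "NUCLEOTIDE": ["PUBMED"],
    "PROTEIN": ["PUBMED"],
    "PUBMED": [],
}


def simple_paths(links, start, goal, path=None):
    """Return every cycle-free path from start to goal, depth first."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for nxt in links.get(start, []):
        if nxt not in path:  # never revisit a source on the same path
            found.extend(simple_paths(links, nxt, goal, path))
    return found


paths = simple_paths(LINKS, "OMIM", "PUBMED")
for p in paths:
    print(" -> ".join(p))
```

Each enumerated path is a different way to answer the same conceptual query, which is why their results need to be compared.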

Results for the disease conditions diabetes, aging and cancer

            P1      P2      P3
Diabetes
  Link      43,890  42,969  59,959
  Parse     43,747  43,090  51,906
  All       44,037  43,581  49,719
Aging
  Link      48,393  51,712  60,129
  Parse     48,398  51,855  61,260
  All       48,393  51,474  60,938
Cancer
  Link      56,315  54,487  62,686
  Parse     56,315  54,607  63,367
  All       56,532  52,488  60,033

Overlap results for the disease condition diabetes

Link    P1      P2      P3
P1      100%    25.82%  21.95%
P2      25.28%  100%    70.00%
P3      29.98%  97.68%  100%

Parse   P1      P2      P3
P1      100%    23.93%  22.87%
P2      29.18%  100%    81.20%
P3      33.60%  97.81%  100%

All     P1      P2      P3
P1      100%    24.75%  24.29%
P2      24.64%  100%    79.49%
P3      27.42%  90.68%  100%
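The slides do not state how the overlap percentages are computed. As one plausible reading, the Python sketch below assumes the overlap of path A with path B is the share of A's retrieved entries that B also retrieves, which makes the matrix asymmetric. The function names and the toy entry identifiers are hypothetical.

```python
def overlap(a, b):
    """Percentage of entries of a that also appear in b (asymmetric by design)."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


def overlap_matrix(results):
    """Pairwise overlap for a dict mapping path name to retrieved entry identifiers."""
    names = list(results)
    return {
        row: {col: round(overlap(results[row], results[col]), 2) for col in names}
        for row in names
    }


# Toy identifiers, not real data.
toy = {
    "P1": {1, 2, 3, 4},
    "P2": {3, 4, 5},
    "P3": {3, 4, 5, 6, 7},
}
matrix = overlap_matrix(toy)
for row, cols in matrix.items():
    print(row, cols)
```

Under this definition a small result set fully contained in a larger one scores 100% in one direction and less in the other, so the two directions should be read together.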

Evaluating resources

Similar applications
– Different outputs

Similar data sources
– Different outputs

Number of resources
– Different outputs

Order of resources
– Different outputs

Exploiting semantics of resources

Number of entries
Characterization of entries (number of attributes)
Time

Exploiting the semantics of links

BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)

Conceptual graph
– No labeled links

Queries
– Regular expressions of concepts

ESearch
– Path cardinality – number of path instances in the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) where entry e1 of S1 is linked to entry e2 of S2.

– Target Object Cardinality – number of distinct objects retrieved from the final data source.

– Evaluation Cost – cost of the evaluation plan, which involves both the local processing cost and remote network access delays.
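Path cardinality and target object cardinality can be made concrete on a toy link structure. The Python sketch below is a hypothetical illustration of the two definitions above. The entry identifiers and function names are invented, and evaluation cost is left out because it depends on processing and network measurements the slides do not specify.

```python
def path_instances(link_tables, start_entries):
    """Enumerate path instances by following links from one source to the next.

    link_tables is a list of dicts, one per hop, mapping an entry to the
    entries it links to in the next source.
    """
    instances = [(entry,) for entry in start_entries]
    for links in link_tables:
        instances = [
            inst + (nxt,)
            for inst in instances
            for nxt in links.get(inst[-1], [])
        ]
    return instances


def path_cardinality(link_tables, start_entries):
    """Number of path instances in the result."""
    return len(path_instances(link_tables, start_entries))


def target_object_cardinality(link_tables, start_entries):
    """Number of distinct objects reached in the final data source."""
    return len({inst[-1] for inst in path_instances(link_tables, start_entries)})


# Toy links: S1 -> S2 and S2 -> S3.
s1_to_s2 = {"a1": ["b1", "b2"], "a2": ["b2"]}
s2_to_s3 = {"b1": ["c1"], "b2": ["c1", "c2"]}
start = ["a1", "a2"]

print(path_cardinality([s1_to_s2], start))                     # pairs (e1, e2)
print(path_cardinality([s1_to_s2, s2_to_s3], start))           # full path instances
print(target_object_cardinality([s1_to_s2, s2_to_s3], start))  # distinct S3 entries
```

The two measures diverge whenever several path instances end at the same object: here five instances reach only two distinct entries in the final source.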

Work in progress

Conceptual graph
– Labeled links

Queries
– Complex dataflows

Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications

Representing the conceptual graph in Protégé

Visualization Limitations in Protégé

Using the GraphViz plugin
– Shows only IsA hierarchy

TgiViz plugin

Conclusion

Scientists need support to select resources to express their protocols

Semantics of resources may be exploited to enhance the data collection process

Need for a repository of biological metadata (BioMetaDatabase)
