Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University.
Exploring and Exploiting the Biological Maze
Zoé Lacroix
Arizona State University
Data collection queries
Scientific protocol
– Must be able to reproduce the process
Involve multiple resources
– Data sources
– Applications
Expressing scientific protocols
Scientific protocols mix design and implementation
Design
– What the protocol does (tasks)
– Scientific objects involved
Implementation
– How the protocol is executed
– Data sources and applications
Expressing scientific protocols
Scientific protocols are driven by their implementation
– Scientists use the resources they know
• data (quality)
• access to data
• format, limits, etc.
– Scientists may not exploit better resources because they do not know them
Queries should be driven by the design; the implementation should meet the design needs
Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.
*Courtesy of Dr. Marta Janer, Institute for Systems Biology
Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
From GenBank, dbEST and the RIKEN Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.
[Slides repeat Step 2 four times, highlighting in turn: data sources, tools, tasks, scientific objects]
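The bookkeeping asked for in Step 2 ("which query amino acids correspond to which nucleotides") can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. It assumes the aligned query and translated-subject strings and their start coordinates have already been extracted from a tblastn report; the function name `map_residues_to_codons` and its parameters are hypothetical.

```python
def map_residues_to_codons(query_aln, subject_aln, query_start, subject_start, strand=1):
    """Map each aligned query residue to the three nucleotides that encode it.

    query_aln, subject_aln: aligned protein strings of equal length, '-' for gaps
        (the subject string is the translated DNA, as tblastn reports it).
    query_start: 1-based position of the first query residue in the alignment.
    subject_start: 1-based nucleotide coordinate of the first base of the
        first aligned subject codon.
    strand: +1 for a forward-frame hit, -1 for a reverse-frame hit.
    Returns {query_position: (nt1, nt2, nt3)}.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    step = 1 if strand > 0 else -1
    mapping = {}
    q_pos = query_start
    s_nt = subject_start
    for q_char, s_char in zip(query_aln, subject_aln):
        if q_char != "-" and s_char != "-":
            # Both sides aligned: record the codon under this residue.
            mapping[q_pos] = (s_nt, s_nt + step, s_nt + 2 * step)
        if q_char != "-":
            q_pos += 1          # consume one query residue
        if s_char != "-":
            s_nt += 3 * step    # consume one subject codon
    return mapping


# Toy alignment (made-up strings and coordinates, for illustration only).
example = map_residues_to_codons("MK-L", "M-QL", query_start=1, subject_start=10)
```

Query residues facing a gap in the subject get no entry, which keeps the map limited to positions where a codon is actually observed.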
Pipeline Selecting Target Proteins*
[Figure: pipeline diagram involving SMART, Swiss-Prot, BIND, DIP, CEY2H, SigPep, and BLAST against D. mel]
Step 1 = retrieve all proteins from SMART and Swiss-Prot by textual search with the keyword “apoptosis”
Step 2 = retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”
Step 3 = retrieve their binding partners from DIP, BIND and the C. elegans dataset
Step 4 = run the sequences through a signal peptide prediction program such as SigPep to check for the presence of signal peptides in each of them
Step 5 = homology search using BLAST of the retrieved sequences against proteins predicted from the Drosophila melanogaster genome, which might yield additional candidates
Output = final set of signal peptide proteins involved in apoptosis
*Courtesy of Dr. Terry Gaasterland, The Rockefeller University
Design and implementation
Step | Task | Implementation
Input | Relevant keyword for which the proteins are required |
Step 1 | All proteins with the keyword and with a signal peptide feature must be retrieved | SMART, Swiss-Prot
Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND
Step 3 | Integration into a final set, which is run through a signal peptide prediction program | SigPep
Step 4 | Homology search of the retrieved sequences against proteins predicted from the specific genome yields additional candidates | BLAST
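One way to picture the separation between the Task column and the Implementation column is to keep the tasks in one structure and bind resources to them separately. The sketch below is only an illustration under that assumption; the class names `Task` and `Protocol` and the helper `build_example` are hypothetical and are not part of BioNavigation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    """Design level: what a step does, with no mention of resources."""
    name: str
    description: str


@dataclass
class Protocol:
    tasks: List[Task]
    # Implementation level: task name -> resources chosen to carry it out.
    bindings: Dict[str, List[str]] = field(default_factory=dict)

    def bind(self, task_name: str, resources: List[str]) -> None:
        if task_name not in {t.name for t in self.tasks}:
            raise KeyError(f"unknown task: {task_name}")
        self.bindings[task_name] = list(resources)

    def unbound(self) -> List[str]:
        """Tasks that still have no resource assigned."""
        return [t.name for t in self.tasks if not self.bindings.get(t.name)]

    def plan(self) -> List[Tuple[str, List[str]]]:
        """Tasks in design order, each with its current resources."""
        return [(t.name, self.bindings.get(t.name, [])) for t in self.tasks]


def build_example() -> Protocol:
    """Rebuild the table above: design first, then one possible binding."""
    protocol = Protocol(tasks=[
        Task("Step 1", "retrieve proteins with keyword and signal peptide feature"),
        Task("Step 2", "retrieve binding partners"),
        Task("Step 3", "run signal peptide prediction on the integrated set"),
        Task("Step 4", "homology search against a predicted proteome"),
    ])
    protocol.bind("Step 1", ["SMART", "Swiss-Prot"])
    protocol.bind("Step 2", ["DIP", "BIND"])
    protocol.bind("Step 3", ["SigPep"])
    protocol.bind("Step 4", ["BLAST"])
    return protocol
```

Because the task list never mentions a resource, rebinding a step to a different source leaves the design untouched.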
Expressing scientific pipelines with BioNavigation
Queries are expressed at a conceptual level (design)
[Figure: conceptual level showing the scientific classes DNA Seq., Disease, Gene, Citation, and Protein Seq.]
Conceptual graph
Labeled edges
– Scientifically meaningful edges
[Figure: conceptual graph over the classes Gene, NucleotideSequence, DNA, RNA, mRNA, and Protein, with edges labeled isA, transcribesTo, isTranscribedFrom, translatesTo, and isTranslatedFrom]
Conceptual graph
[Figure: conceptual graph over Gene, NucleotideSequence, DNA, RNA, mRNA, and Protein with the isA, transcribesTo, isTranscribedFrom, translatesTo, and isTranslatedFrom edges, plus additional IsRelatedTo edges]
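A conceptual graph with labeled edges can be held as a set of (source, label, target) triples. The sketch below is a hypothetical illustration: the class `ConceptualGraph` is invented here, and the edge endpoints are assumptions, since the transcript preserves the node and label names but not which nodes each edge connects.

```python
from collections import defaultdict


class ConceptualGraph:
    """Directed graph whose edges carry a scientific label."""

    def __init__(self):
        self._out = defaultdict(set)  # source class -> {(label, target class)}

    def add_edge(self, source, label, target):
        self._out[source].add((label, target))

    def targets(self, source, label):
        """Classes reachable from `source` over one edge with this label."""
        return {t for (l, t) in self._out.get(source, ()) if l == label}

    def ancestors(self, cls):
        """All classes reachable from `cls` by following isA edges."""
        seen, stack = set(), [cls]
        while stack:
            current = stack.pop()
            for parent in self.targets(current, "isA"):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# ASSUMED edge endpoints, chosen to be biologically plausible; the slide's
# actual wiring is not recoverable from the transcript.
graph = ConceptualGraph()
graph.add_edge("DNA", "isA", "NucleotideSequence")
graph.add_edge("RNA", "isA", "NucleotideSequence")
graph.add_edge("mRNA", "isA", "RNA")
graph.add_edge("Gene", "isA", "DNA")
graph.add_edge("DNA", "transcribesTo", "RNA")
graph.add_edge("RNA", "isTranscribedFrom", "DNA")
graph.add_edge("mRNA", "translatesTo", "Protein")
graph.add_edge("Protein", "isTranslatedFrom", "mRNA")
```

Keeping the label on each edge is what lets a query distinguish, for example, translatesTo from a generic IsRelatedTo link between the same two classes.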
Mapping to physical resources
[Figure: scientific classes at the conceptual level (DNA Seq., Disease, Gene, Citation, Protein Seq.) mapped to data sources at the physical level (OMIM, GenBank, PubMed, HUGO, NCBI Protein)]
Exploring biological metadata
“Return all citations that are related to some disease or condition”
Diabetes: 11   Aging: 71   Cancer: 391
[Figure: paths (P1), (P2), (P3) over the sources OMIM, NUCLEOTIDE, PROTEIN, and PUBMED]
• Link: Entrez provides an index with the Links in the display option from each entry
• Parse: Parsing each entry to retrieve its related entries
• All: Entrez provides an index with the Links in the display option, which allows looking at a set of entries at a time
Selecting biological resources
3 resources that look the same
– Are they the same?
3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?
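Alternative paths between two sources can be enumerated mechanically once the links between sources are known. The sketch below is hypothetical: the link table `ASSUMED_LINKS` is an assumption made for illustration and does not claim to reproduce the paths P1, P2, and P3 of the slides, whose exact routes are not preserved in the transcript.

```python
def simple_paths(links, start, goal):
    """Yield every cycle-free path from start to goal as a tuple of source names."""
    stack = [(start,)]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == goal:
            yield path
            continue
        # Reverse-sorted push gives a deterministic, alphabetical visit order.
        for nxt in sorted(links.get(last, ()), reverse=True):
            if nxt not in path:  # never revisit a source on the same path
                stack.append(path + (nxt,))


# ASSUMED links between sources (illustration only).
ASSUMED_LINKS = {
    "OMIM": {"PUBMED", "NUCLEOTIDE", "PROTEIN"},
    "NUCLEOTIDE": {"PUBMED"},
    "PROTEIN": {"PUBMED"},
}

candidate_paths = list(simple_paths(ASSUMED_LINKS, "OMIM", "PUBMED"))
```

Each enumerated path is a candidate implementation of the same conceptual query, which is why their results need to be compared before choosing one.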
Results for the disease conditions diabetes, aging and cancer
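Condition | Method | P1 | P2 | P3
Diabetes | Link | 43,890 | 42,969 | 59,959
Diabetes | Parse | 43,747 | 43,090 | 51,906
Diabetes | All | 44,037 | 43,581 | 49,719
Aging | Link | 48,393 | 51,712 | 60,129
Aging | Parse | 48,398 | 51,855 | 61,260
Aging | All | 48,393 | 51,474 | 60,938
Cancer | Link | 56,315 | 54,487 | 62,686
Cancer | Parse | 56,315 | 54,607 | 63,367
Cancer | All | 56,532 | 52,488 | 60,033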
Overlap results for the disease condition diabetes
Method | Path | P1 | P2 | P3
Link | P1 | 100% | 25.82% | 21.95%
Link | P2 | 25.28% | 100% | 70.00%
Link | P3 | 29.98% | 97.68% | 100%
Parse | P1 | 100% | 23.93% | 22.87%
Parse | P2 | 29.18% | 100% | 81.20%
Parse | P3 | 33.60% | 97.81% | 100%
All | P1 | 100% | 24.75% | 24.29%
All | P2 | 24.64% | 100% | 79.49%
All | P3 | 27.42% | 90.68% | 100%
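An asymmetric overlap table of this kind can be produced by dividing the size of the intersection of two result sets by the size of one of them. The sketch below assumes the row path supplies the denominator; the slides do not state which direction was used, and the identifiers in `toy_results` are made up for illustration.

```python
def overlap_percent(a, b):
    """Percentage of the entries in `a` that also appear in `b`."""
    a, b = set(a), set(b)
    if not a:
        raise ValueError("overlap is undefined for an empty result set")
    return 100.0 * len(a & b) / len(a)


def overlap_matrix(results):
    """results: {path name: iterable of entry ids} -> nested dict of percentages."""
    names = list(results)
    return {
        row: {col: overlap_percent(results[row], results[col]) for col in names}
        for row in names
    }


# Toy result sets (made-up ids, not the data behind the table above).
toy_results = {
    "P1": {1, 2, 3, 4},
    "P2": {3, 4, 5},
    "P3": {3, 4, 5, 6, 7},
}
toy_matrix = overlap_matrix(toy_results)
```

With this definition the diagonal is always 100%, and a row close to 100% against another path indicates that its results are nearly contained in that path's results.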
Evaluating resources
Similar applications
– Different outputs
Similar data sources
– Different outputs
Number of resources
– Different outputs
Order of resources
– Different outputs
Exploiting semantics of resources
Number of entries
Characterization of entries (number of attributes)
Time
Exploiting the semantics of links
BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)
Conceptual graph
– No labeled links
Queries
– Regular expressions of concepts
ESearch
– Path cardinality – number of instances of paths of the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) of entries e1 of S1 linked to an entry e2 of S2.
– Target object cardinality – number of distinct objects retrieved from the final data source.
– Evaluation cost – cost of the evaluation plan, which involves both the local processing cost and remote network access delays.
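The first two measures can be computed directly from link tables between consecutive sources. The sketch below follows the definitions given above and extends the length-1 case to longer paths by chaining links; that extension, the function names, and the toy entry ids are assumptions for illustration, not the ESearch implementation.

```python
def path_instances(start_entries, link_steps):
    """Enumerate path instances as tuples (e1, e2, ..., en).

    start_entries: entries of the first source.
    link_steps: one dict per hop, mapping an entry to the set of entries
        it links to in the next source.
    """
    paths = [(entry,) for entry in sorted(start_entries)]
    for step in link_steps:
        paths = [
            path + (nxt,)
            for path in paths
            for nxt in sorted(step.get(path[-1], ()))
        ]
    return paths


def path_cardinality(start_entries, link_steps):
    """Number of path instances in the result."""
    return len(path_instances(start_entries, link_steps))


def target_object_cardinality(start_entries, link_steps):
    """Number of distinct objects reached in the final source."""
    return len({path[-1] for path in path_instances(start_entries, link_steps)})


# Toy links S1 -> S2 -> S3 (made-up ids).
s1_entries = {"o1", "o2"}
s1_to_s2 = {"o1": {"n1", "n2"}, "o2": {"n2"}}
s2_to_s3 = {"n1": {"p1"}, "n2": {"p1", "p2"}}
```

On this toy data the length-2 path has five instances but reaches only two distinct objects, which illustrates why the two measures are reported separately.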
Work in progress
Conceptual graph
– Labeled links
Queries
– Complex dataflows
Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications
Representing the conceptual graph in Protégé
Visualization Limitations in Protégé
Using the GraphViz plugin
– Shows only IsA hierarchy
TgiViz plugin
Conclusion
Scientists need support to select resources to express their protocols
Semantics of resources may be exploited to enhance the data collection process
Need for a repository of biological metadata (BioMetaDatabase)