Scripps bioinformatics seminar_day_2
Day 2 of Computing on the shoulders of giants: how existing knowledge is represented and applied in bioinformatics
Benjamin Good
Professor of the Department of Molecular and Experimental Medicine
Recap from Day 1
• Make things (articles, genes, antibodies, etc.) easier to find
• Answer questions
• Generate hypotheses
• Controlled vocabularies (MeSH)
• Ontologies (Gene Ontology)
• Knowledge graphs on the Web: the SPARQL query language
• Knowledge plus computation = inference, the ABC model
Computing with knowledge
• Challenges with knowledge graphs
  • Too much data ->> query, sort, visualize, interact
  • Not enough data ->> mine for more
• Goal for practical day: Go beyond PubMed!
  • Gain hands-on experience using a knowledge graph
  • either with tools built for the purpose or with your own code…
Assignment: knowledge graph to hypothesis
• Option 1: Coding
  • Implement and apply an ABC Model style hypothesis-generating program (you can adapt the example provided)
  • Explain its logic, explain how you used it to generate a hypothesis, and explain the hypothesis (provide a visual)
• Option 2: Non-coding
  • Use one or more knowledge discovery applications (list provided) to define a new hypothesis
  • If you can’t think of where to start, try to explain why Metformin may contribute to cancer survival
• Assignment deliverables: a document containing
  • the inputs you gave to your program or the online tool(s) you used
  • what was generated in response and the underlying logic
  • an image and text describing the results, especially any hypothesis you could derive
  • (for Option 1, also submit any code written or files generated as a tar or zip archive)
Online tools for knowledge discovery
• http://knowledge.bio (* we make this one…)
• http://www.biograph.be (this is a good tool, but often breaks down)
• http://epiphanet.uth.tmc.edu (also on the flaky side, but can be good)
• https://skr3.nlm.nih.gov/SemMed/ (works okay, requires a (free) account)
• http://arrowsmith.psych.uic.edu (ugly interface, but good tool)
Demos
• http://knowledge.bio
• http://www.biograph.be
• http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi
Example question: repurposing all drugs
http://tinyurl.com/hwm9388
[Figure: query pattern. ?drug interacts with protein; protein encoded by gene; gene has genetic association with ?disease; open question: does ?drug treat ?disease??]
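The pattern on this slide (drug interacts with protein, protein encoded by gene, gene genetically associated with disease) can be written down as a SPARQL query. The following is a minimal sketch, not the query behind the tinyurl link. It assumes the Wikidata property IDs P129 (physically interacts with), P702 (encoded by), and P2293 (genetic association); check these on wikidata.org before relying on them. The function name `build_repurposing_query` is hypothetical, and the function only builds the query text without contacting any endpoint.

```python
import re

# Assumed Wikidata property IDs (verify on wikidata.org before use).
INTERACTS_WITH = "P129"        # drug "physically interacts with" protein
ENCODED_BY = "P702"            # protein "encoded by" gene
GENETIC_ASSOCIATION = "P2293"  # gene "genetic association" disease


def build_repurposing_query(disease_qid=None, limit=1000):
    """Return SPARQL text for the drug -> protein -> gene -> disease pattern."""
    if disease_qid is not None and not re.fullmatch(r"Q[1-9]\d*", disease_qid):
        raise ValueError(f"not a Wikidata item ID: {disease_qid!r}")
    # Restrict to a single disease (A) when one is given; otherwise match all.
    values = f"VALUES ?disease {{ wd:{disease_qid} }}" if disease_qid else ""
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel ?disease ?diseaseLabel WHERE {{
  {values}
  ?drug wdt:{INTERACTS_WITH} ?protein .
  ?protein wdt:{ENCODED_BY} ?gene .
  ?gene wdt:{GENETIC_ASSOCIATION} ?disease .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()
```

The resulting text can be pasted into http://query.wikidata.org or sent with SPARQLWrapper. Any drug returned is only a candidate: the "treats" edge is the hypothesis, not something the query confirms.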
Example program (feel free to follow or adapt to your interest)
• Example
  • Input = a disease (A)
  • Output = a ranked list of drugs (C) that might be used for treatment
  • Render the results of your workflow as a Cytoscape network that illustrates the reasoning behind the predictions
• Implementation
  • Python
  • Use a SPARQL endpoint such as http://query.wikidata.org
  • + identify and use another endpoint (e.g. EBI, UniProt)
  • ++ access PubMed articles and MeSH indexing
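Whichever endpoint you use, the Python side needs to turn its response into rows you can rank. A minimal sketch, assuming the endpoint returns the standard SPARQL JSON results format; the helper name `bindings_to_rows` and the sample payload are made up for illustration and contain no real data.

```python
def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # Unbound (OPTIONAL) variables are simply absent from a binding.
        rows.append({var: binding[var]["value"] if var in binding else None
                     for var in variables})
    return rows


# Hand-made sample in the shape an endpoint would return (not real data).
sample = {
    "head": {"vars": ["drugLabel", "geneLabel", "diseaseLabel"]},
    "results": {"bindings": [
        {"drugLabel": {"type": "literal", "value": "drug X"},
         "geneLabel": {"type": "literal", "value": "gene 1"},
         "diseaseLabel": {"type": "literal", "value": "disease A"}},
        {"drugLabel": {"type": "literal", "value": "drug Y"},
         "diseaseLabel": {"type": "literal", "value": "disease A"}},
    ]},
}

rows = bindings_to_rows(sample)
```

With SPARQLWrapper, calling `setReturnFormat(JSON)` and then `query().convert()` should give a dict of this shape, which can be passed straight to the helper.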
Python setup
• pip install rdflib SPARQLWrapper pandas …
• Hopefully Jupyter is already installed; if not, install it: http://jupyter.readthedocs.io/en/latest/install.html
• Get the notebook from https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
• Go to the directory where you put the notebook
• Run it with
  • > jupyter notebook
• should be ready to run
The notebook
• will run a basic search for disease-gene-drug connections in Wikidata
• will sort the results by the number of intervening genes
• will export the data to a tab-delimited file you can view in Excel, a text editor, or load into Cytoscape
• Your job:
  • Run it and extend it by one or more of:
    • adapting the query
    • changing the way the results are sorted
    • working with the output in Cytoscape to produce an informative visualization
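The sort-and-export steps described above can be sketched with the standard library alone. This is not the notebook's code (the notebook uses pandas); the function names, column names, and toy rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def rank_drugs_by_gene_count(rows):
    """Count distinct intervening genes per drug and sort, most genes first."""
    genes_by_drug = defaultdict(set)
    for row in rows:
        genes_by_drug[row["drug"]].add(row["gene"])
    ranked = [(drug, len(genes)) for drug, genes in genes_by_drug.items()]
    # Sort by descending gene count, then by name so ties are reproducible.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


def write_tsv(ranked, handle):
    """Write the ranking as tab-delimited text (opens in Excel or Cytoscape)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["drug", "n_genes"])
    writer.writerows(ranked)


# Toy rows standing in for query results (not real data).
rows = [
    {"drug": "drug X", "gene": "gene 1"},
    {"drug": "drug X", "gene": "gene 2"},
    {"drug": "drug Y", "gene": "gene 2"},
    {"drug": "drug X", "gene": "gene 1"},  # duplicate row, counted once
]
ranked = rank_drugs_by_gene_count(rows)
buffer = io.StringIO()
write_tsv(ranked, buffer)
```

To write a real file, pass `open("results.tsv", "w", newline="")` instead of the in-memory buffer. Changing the `key` function is the simplest way to experiment with a different sort order.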
[Figure: example output rendered in Cytoscape]
Other queries from Day 1 (slides 48-54)
• Drugs that target a cancer and impact a specific biological process
  • http://tinyurl.com/j222k6g
• Drugs that might target a new disease that is linked, via a biological pathway with shared genes, to a disease the drug is already used to treat
  • http://tinyurl.com/gpfr9kj
Possible inputs for adaptations
• Browse and examine wikidata.org to see what you might make use of
• e.g.
  • Type of physical interaction between gene and drug
  • Gene Ontology annotation (what evidence codes?)
  • Disease ontology hierarchy
  • Drug characteristics
Other possible knowledge sources
• SPARQL
  • UniProt http://sparql.uniprot.org
  • EBI SPARQL https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
  • Look for unique identifiers on genes and proteins that you can use to link Wikidata content to their content
• Text
  • Use the NCBI E-utils API to programmatically access PubMed articles and MeSH indexing http://www.ncbi.nlm.nih.gov/books/NBK25501/
  • Can be used to build co-occurrence networks of e.g. MeSH terms
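A co-occurrence network of MeSH terms can be built by counting how often two terms are indexed on the same article. A minimal sketch, assuming you have already retrieved one list of terms per article (for example through E-utils); the function name and the toy input are illustrative only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(articles):
    """Count how many articles mention each unordered pair of terms together."""
    counts = Counter()
    for terms in articles:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Toy input: one list of MeSH-style terms per article (not real indexing data).
articles = [
    ["Metformin", "Neoplasms", "Survival"],
    ["Metformin", "Neoplasms"],
    ["Metformin", "Diabetes Mellitus"],
]
edges = cooccurrence_counts(articles)
```

Each key in `edges` is an edge between two terms and each value is its weight, so the result can be exported as an edge list for Cytoscape.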
Good luck! Ask questions!
ABC ranking algorithms
• Out of all C, which are most strongly related to A?
• Rank by N shared B concepts
  • c2: 4
  • c4: 3
  • c1: 1
  • c3: 1
  • c5: 1
  • c6: 1
• Next level: adjust to down-weight highly connected nodes
[Figure: A linked through B concepts to C nodes c1 through c6]
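Ranking by shared B concepts amounts to counting, for each C, how many of A's B concepts also link to it. The sketch below uses a toy graph chosen only so that the counts match the slide (c2: 4, c4: 3, the rest 1); the real figure's edges are not recoverable from the transcript. The down-weighting rule (each B contributes 1 divided by the number of C nodes it touches) is one illustrative choice, not the method prescribed in the lecture.

```python
def rank_by_shared_b(a_to_b, b_to_c):
    """Score each C by the number of B concepts linking it to A."""
    scores = {}
    for b in a_to_b:
        for c in b_to_c.get(b, ()):
            scores[c] = scores.get(c, 0) + 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def rank_down_weighted(a_to_b, b_to_c):
    """Same idea, but each B contributes 1 / (number of C nodes it touches)."""
    scores = {}
    for b in a_to_b:
        targets = b_to_c.get(b, ())
        for c in targets:
            scores[c] = scores.get(c, 0.0) + 1.0 / len(targets)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy graph (hypothetical edges): B concepts linked to A, and the C nodes
# each B concept links to.
a_to_b = {"b1", "b2", "b3", "b4"}
b_to_c = {
    "b1": {"c1", "c2", "c4"},
    "b2": {"c2", "c3", "c4"},
    "b3": {"c2", "c4", "c5"},
    "b4": {"c2", "c6"},
}

simple_ranking = rank_by_shared_b(a_to_b, b_to_c)
weighted_ranking = rank_down_weighted(a_to_b, b_to_c)
```

In this toy graph the down-weighted ranking moves c6 above c1, c3, and c5, because c6 is reached through a less widely connected B concept.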
ABC ranking algorithms – advanced (require large networks to be useful)
• Average Minimum Weight (AMW) (Wren)
  • http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
• Linking Term Count with Average Minimum Weight (LTC-AMW) (Yetisgen-Yildiz and Pratt)
  • https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
• Predicate inter-dependence (Rastegar-Mojarad)
  • https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
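As a rough illustration of the first two measures, the sketch below assumes that AMW for a candidate C is the mean, over the B terms linking A to C, of min(weight(A,B), weight(B,C)), and that LTC-AMW ranks by linking term count and uses AMW to break ties. Check the linked papers for the exact definitions and for how edge weights are computed; the weights and function names here are made up.

```python
def amw(a_weights, bc_weights, c):
    """Average over linking B terms of min(weight(A,B), weight(B,C))."""
    mins = [min(w_ab, bc_weights[(b, c)])
            for b, w_ab in a_weights.items() if (b, c) in bc_weights]
    return sum(mins) / len(mins) if mins else 0.0


def ltc(a_weights, bc_weights, c):
    """Linking term count: number of B terms connecting A to this C."""
    return sum(1 for b in a_weights if (b, c) in bc_weights)


def rank_ltc_amw(a_weights, bc_weights):
    """Rank C terms by LTC, using AMW to break ties (assumed definition)."""
    candidates = {c for (_, c) in bc_weights}
    scored = [(c, ltc(a_weights, bc_weights, c), amw(a_weights, bc_weights, c))
              for c in candidates]
    return sorted(scored, key=lambda s: (-s[1], -s[2], s[0]))


# Hypothetical edge weights: A-B weights keyed by B, B-C weights keyed by (B, C).
a_weights = {"b1": 0.9, "b2": 0.4}
bc_weights = {("b1", "c1"): 0.5, ("b2", "c1"): 0.8,
              ("b1", "c2"): 0.7, ("b2", "c3"): 0.6}

ranking = rank_ltc_amw(a_weights, bc_weights)
```

Here c1 ranks first because it has two linking terms, while c2 and c3 tie on count and are separated by their AMW values.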