Semantic Publishing with Nanopublications
Tobias Kuhn
http://www.tkuhn.ch
@txkuhn
ETH Zurich
Research Colloquium, Stanford Center for Biomedical Informatics Research
12 March 2015
Information Overload: >1M New Scientific Articles per Year
• How can we help scientists to stay up to date?
• How can we efficiently retrieve and aggregate published scientific results?
• How should we design the system for the future of scientific communication?
Image from: Kuhn et al. Inheritance patterns in citation networks reveal scientific memes. Physical Review X 4. 2014.
Tobias Kuhn, ETH Zurich Semantic Publishing with Nanopublications 2 / 30
Nanopublications: Provenance-Aware Semantic Publishing
assertion
provenance
publication info
nanopublication
http://nanopub.org / @nanopub_org
Nanopublications: Provenance-Aware Semantic Publishing
Nanopub0001
Assertion:
ns1:mosquito ns3:transmission ns2:malaria
Provenance:
opm:wasDerivedFrom d:DataSourceX
Publication Information:
dc:created “2013-01-01”
pav:createdBy p:Isabelle_Dubois
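The three-part structure of this example can be sketched as a small data structure. The Python below is only an illustration: the class `Nanopublication`, the `Triple` alias, and subject names such as `Nanopub0001#assertion` are assumptions made for this sketch and do not come from any nanopublication library. The triple order (mosquito, transmission, malaria) is one reading of the slide's diagram.

```python
from dataclasses import dataclass

# A triple is (subject, predicate, object), all written as prefixed names.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container for the three parts named on the slide."""
    uri: str
    assertion: tuple[Triple, ...]    # the scientific claim itself
    provenance: tuple[Triple, ...]   # where the assertion comes from
    pubinfo: tuple[Triple, ...]      # metadata about the nanopublication


np1 = Nanopublication(
    uri="Nanopub0001",
    assertion=(("ns1:mosquito", "ns3:transmission", "ns2:malaria"),),
    provenance=(
        # Subject name is an assumption; the slide shows only predicate and object.
        ("Nanopub0001#assertion", "opm:wasDerivedFrom", "d:DataSourceX"),
    ),
    pubinfo=(
        ("Nanopub0001", "dc:created", "2013-01-01"),
        ("Nanopub0001", "pav:createdBy", "p:Isabelle_Dubois"),
    ),
)

for part in ("assertion", "provenance", "pubinfo"):
    print(part, getattr(np1, part))
```

Freezing the dataclass mirrors the idea that a published nanopublication is not edited in place.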
Example: Nanopublication Converted from BEL
sub:assertion {
  sub:_3 a rdf:Statement ;
    rdf:subject schem:Adenosine%20triphosphate ;
    rdf:predicate belv:decreases ; rdf:object sub:_1 ;
    occursIn: obo:UBERON_0001134 , species:9606 .
  sub:_1 a go:0003824 ; hasAgent: sub:_2 .
  sub:_2 a Protein: ; geneProductOf: hgnc:12517 .
  sub:assertion rdfs:label "a(SCHEM:\"Adenosine triphosphate\") -| cat(p(HGNC:UCP1))" .
}
sub:provenance {
  sub:assertion prov:hadPrimarySource pubmed:9703368 ;
    prov:wasDerivedFrom beldoc: , sub:_4 .
  beldoc: dce:description "Approximately 61,000 statements." ;
    dce:rights "Copyright (c) 2011-2012, Selventa. All rights reserved." ;
    dce:title "BEL Framework Large Corpus Document" ;
    pav:authoredBy sub:_5 ; pav:version "20131211" .
  sub:_4 prov:value "UCP1 contains six potential transmembrane a-helices (72) and acts under the form of a homodimer (73). Its uncoupling activity is increased by FFA (7477) and by long chain fatty acyl CoA esters (78, 79), and decreased by purine nucleotide di- or tri-phosphates (12, 74)." ;
    prov:wasQuotedFrom pubmed:9703368 .
  sub:_5 rdfs:label "Selventa" .
}
sub:pubinfo {
  this: dct:created "2014-07-03T14:34:13.226+02:00"^^xsd:dateTime ;
    pav:createdBy orcid:0000-0001-6818-334X , orcid:0000-0002-1267-0234 .
}
https://github.com/tkuhn/bel2nanopub
Vision: Changing Scholarly Communication
Now: Narrative articles at the center
Future: Nanopublications at the center
Images from Mons et al. The value of data. Nature genetics, 43(4):281–283, 2011
Open Problems
Three important open problems for nanopublications are:
• Identifiability: How can we ensure the immutability of nanopublications and provide verifiable identifiers?
• Persistence: How can we ensure that published nanopublications can be reliably retrieved by others and remain permanently available?
• Representation: How can we deal with results that have no straightforward RDF representation?
Problem: Replication and Re-Use of Research Results
Example situation: Sue publishes a script that should allow everybody to replicate her scientific analysis:
# Download data:
wget http://some-third-party.org/dataset/1.4
# Analyze data:
...
Problems: What if the third party silently changes that version of the dataset? What if the resource becomes unavailable at this location? What if the web site later gets hacked and the data manipulated?
Identifiability Problem
http://some-third-party.org/dataset/1.4
Given a URI for a digital artifact, there is no reliable standard procedure for checking whether a retrieved file really represents the correct and original state of that artifact.
Trusty URIs
Basic idea: Use cryptographic hash values calculated on digital artifacts.
Requirements:
• To allow for the verification of entire reference trees, the hash should be part of the reference (i.e. the URI)
• To allow for meta-data, digital artifacts should be allowed to contain self-references (i.e. their own URI)
• Format-independent hash for different kinds of content
• The complete approach should be decentralized and open
• We want to use them right away
Example: http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
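The basic idea of putting a hash into the URI can be illustrated with a deliberately simplified sketch. It hashes raw bytes with SHA-256 and appends the base64url-encoded digest to a base URI. This is not the encoding defined in the cited paper: it does not cover the format-independent hashing or the self-references listed in the requirements above, and the function names `make_hash_uri` and `verify_hash_uri` are hypothetical.

```python
import base64
import hashlib


def make_hash_uri(base_uri: str, content: bytes) -> str:
    """Append an unpadded base64url SHA-256 digest of the content to base_uri."""
    digest = hashlib.sha256(content).digest()
    code = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{base_uri}.{code}"


def verify_hash_uri(uri: str, content: bytes) -> bool:
    """Check that the content matches the hash embedded in the URI."""
    base_uri, _, _ = uri.rpartition(".")
    return make_hash_uri(base_uri, content) == uri


data = b"dataset version 1.4"
uri = make_hash_uri("http://example.org/r1", data)

print(verify_hash_uri(uri, data))                      # content unchanged
print(verify_hash_uri(uri, b"dataset version 1.4 !"))  # content modified
```

Because the check needs only the URI and the retrieved bytes, anyone holding the URI can run it without contacting a central authority.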
Trusty URIs: Range of Verifiability
With the hash as a part of the URI, the “range of verifiability” extends to referenced artifacts (if they also use trusty URIs):
[Figure: artifacts that reference each other by URI. Some have trusty URIs (http://...RAcbjcRI..., http://...RAQozo2w..., http://...RABMq4Wc..., http://...RAUx3Pqu..., http://...RARz0AX-...), others have plain URIs (http://.../resource23, http://.../resource55). An outline marks the range of verifiability.]
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
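How the range of verifiability follows references can be sketched with a small traversal. The code below rests on several assumptions that are not from the talk: a URI counts as hash-bearing when it ends in "." plus 43 base64url characters, artifacts are plain strings kept in a dictionary named `store`, and links are found with a simple regular expression. Self-references inside an artifact are not handled.

```python
import base64
import hashlib
import re

# Assumed convention for this sketch: a URI carries a hash if it ends in
# "." followed by 43 base64url characters (an unpadded SHA-256 digest).
TRUSTY_SUFFIX = re.compile(r"\.([A-Za-z0-9_-]{43})$")
URI_PATTERN = re.compile(r"http://\S+")


def content_code(content: str) -> str:
    digest = hashlib.sha256(content.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def verifiable_range(start_uri: str, store: dict) -> set:
    """Return every URI reachable from start_uri whose content matches its hash."""
    verified = set()
    pending = [start_uri]
    while pending:
        uri = pending.pop()
        match = TRUSTY_SUFFIX.search(uri)
        if uri in verified or match is None:
            continue  # already checked, or a plain URI outside the range
        content = store.get(uri)
        if content is None or content_code(content) != match.group(1):
            continue  # missing or modified artifact: cannot be verified
        verified.add(uri)
        pending.extend(URI_PATTERN.findall(content))
    return verified


leaf = "measurement data"
leaf_uri = "http://example.org/leaf." + content_code(leaf)
root = f"see {leaf_uri} and http://example.org/resource23"
root_uri = "http://example.org/root." + content_code(root)
store = {
    leaf_uri: leaf,
    root_uri: root,
    "http://example.org/resource23": "content that may change at any time",
}

in_range = verifiable_range(root_uri, store)
print(sorted(in_range))
```

In this toy store the plain URI http://example.org/resource23 is reachable but stays outside the verified set, which is the boundary the figure describes.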
Verifiable — Immutable — Permanent
Whether or not a given resource is the one a given trusty URI is supposed to represent can be verified with perfect confidence.
(assuming that the trusty URI for the required artifact is known, e.g. because another artifact contains it as a link)
Kuhn, Dumontier. Making Digital Artifacts on the Web Verifiable and Reliable. IEEE TKDE. To appear. / Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
Verifiable — Immutable — Permanent
A trusty URI artifact is immutable, as any change in the content also changes its URI, thereby making it a new artifact.
(as soon as your trusty URI has been picked up by third parties, e.g. cached or linked from other resources, every change will be noticed)
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
Verifiable — Immutable — Permanent
Trusty URI artifacts are permanent, as they can be retrieved from the cache of third-party websites if otherwise no longer available.
(if there are search engines and web archives regularly crawling and caching the artifacts on the web)
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
Calculating Trusty URIs is Fast and Reliable
Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
Nanopublication Server Network
Nanopublications with Trusty URIs
Publication
Retrieval
Propagation / Archiving
http://npmonitor.inn.ac
Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
Decentralized — Open — Real-time
• No central authority: Everybody can set up a server and join the network
• No restrictions on publication: Everybody can upload nanopublications
• No delay between submission and publication: Nanopublications are made public immediately
• No updates: If a nanopublication is modified, that makes it a new nanopublication (enforced by trusty URIs)
• No queries: Only simple identifier-based lookup
Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
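The properties listed above (no updates, identifier-based lookup only, propagation between servers) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the behavior of the actual server software: the `Server` class and `artifact_code` function are hypothetical, identifiers are plain SHA-256 codes, and propagation is modeled as an immediate push to peers, which may differ from how the real network replicates content.

```python
import base64
import hashlib


def artifact_code(content: bytes) -> str:
    """Identifier derived from the content itself (simplified hash code)."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


class Server:
    """Toy server: append-only store with lookup by identifier."""

    def __init__(self, name: str):
        self.name = name
        self.store = {}   # identifier -> content; entries are never overwritten
        self.peers = []   # other Server objects this server pushes to

    def publish(self, content: bytes) -> str:
        code = artifact_code(content)
        if code not in self.store:      # already known: nothing to do
            self.store[code] = content
            for peer in self.peers:     # assumed push-style propagation
                peer.publish(content)
        return code

    def lookup(self, code: str):
        """Identifier-based retrieval; there is no query interface."""
        return self.store.get(code)


a, b, c = Server("a"), Server("b"), Server("c")
a.peers = [b]
b.peers = [a, c]

original = a.publish(b"malaria is transmitted by mosquitoes")
modified = a.publish(b"malaria is transmitted by female mosquitoes")

print(c.lookup(original))
print(original != modified)
```

A modified text gets a different identifier, so "updating" an entry simply adds a second one and leaves the first retrievable.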
High Performance and High Availability
Defining Datasets with Nanopublication Indexes (which are themselves Nanopublications)
[Figure: six example index structures, (a) to (f), built from three relations: has element, has sub-index, appends.]
Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
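One way to read the three relations in the figure (has element, has sub-index, appends) is as a recursive definition of a dataset. The sketch below assumes that reading: an index contributes its own elements, everything in the index it appends to, and everything in its sub-indexes. The `Index` class, the `resolve` function, and all identifiers are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Index:
    """Hypothetical index record using the three relations from the figure."""
    elements: tuple = ()            # "has element": nanopublication identifiers
    sub_indexes: tuple = ()         # "has sub-index": identifiers of other indexes
    appends: Optional[str] = None   # "appends": identifier of the index it extends


def resolve(index_id: str, indexes: dict, seen: Optional[set] = None) -> set:
    """Collect all nanopublication identifiers covered by an index."""
    seen = set() if seen is None else seen
    if index_id in seen:            # guard against visiting an index twice
        return set()
    seen.add(index_id)
    index = indexes[index_id]
    result = set(index.elements)
    if index.appends is not None:
        result |= resolve(index.appends, indexes, seen)
    for sub_id in index.sub_indexes:
        result |= resolve(sub_id, indexes, seen)
    return result


indexes = {
    "idx-a": Index(elements=("np1", "np2")),
    "idx-b": Index(elements=("np3",), appends="idx-a"),
    "idx-c": Index(elements=("np4",)),
    "idx-top": Index(sub_indexes=("idx-b", "idx-c")),
}

dataset = resolve("idx-top", indexes)
print(sorted(dataset))
```

Since each index would itself be a nanopublication with a hash-based identifier, citing the top-level index pins down the whole set.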
Using Nanopublication Datasets
Once published in the network, nanopublication indexes can be cited:
[7] Nanopubs converted from OpenBEL’s Small and Large Corpus 20131211. Nanopublication index, 4 March 2014, http://np.inn.ac/RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
Researchers can then fetch and reuse the data in a reliable and perfectly reproducible manner:
# Download data:
GetNanopub.sh -c RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
# Analyze data:
...
Existing data can be recombined in new indexes, and researchers can unambiguously refer to the datasets used for new results:
this: prov:wasDerivedFrom nps:RAR5dwELYLKGSfrOclnWhjOj-2nGZN_8BW1JjxwFZINHw
Kuhn et al. Publishing without Publishers: a Decentralized Approach to Dissemination, Retrieval, and Archiving of Data. arXiv:1411.2749.
Real-Time and Replicable Data Mining
[Figure: workflow with six numbered steps around an "ocean of nanopublications". Labels: authors, users, curators, bots; structured data sources, unstructured data sources; nanopublication portals; in silico experiments, network analyses, aggregated views, hypothesis generation, ...]
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
How can we Represent Actual Scientific Results in RDF?
Example:
[...] the risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial, with the majority of patients developing Parkinson disease and Lewy body dementia. [PubMed 19109537]
This is difficult to represent in RDF, but it would correspond to two nanopublication assertions:
• “The risk of developing neurodegenerative disease in idiopathic REM sleep behavior disorder is substantial.”
• “The majority of patients with idiopathic REM sleep behavior disorder who develop a neurodegenerative disease develop Parkinson disease and Lewy body dementia.”
Nanopublication provenance and metadata, on the other hand, would be relatively easy to produce.
English Sentences as Anchors to Structure Scientific Discourse
[Figure: sentences as nodes, linked to each other and to studies.]
Sentences:
• Malaria is transmitted by mosquitoes.
• Mosquitoes transmit malaria.
• Malaria is transmitted by female mosquitoes.
• Malaria is transmitted by moscitos.
Relations between sentences: same meaning; more specific meaning of; corrected version of
Relations from studies (study A, study B, study C, study D) to sentences: provides evidence for; provides counter-evidence against
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
AIDA Sentences: Criteria
Single English sentences that are:
• Atomic: a sentence describing one thought that cannot be further broken down in a practical way
• Independent: a sentence that can stand on its own, without external references like “this effect” or “we”
• Declarative: a complete sentence ending with a full stop that could in theory be either true or false
• Absolute: a sentence describing the core of a claim ignoring the uncertainty about its truth and how it was discovered (no “probably” or “evaluation showed”); typically in present tense
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
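Some of the criteria above lend themselves to rough surface checks. The heuristic below is only a sketch: it uses the cue phrases quoted in the criteria ("this effect", "we", "probably", "evaluation showed") plus a full-stop test, and the function name `aida_issues` is hypothetical. It cannot judge atomicity or meaning, so a sentence that passes is not thereby a valid AIDA sentence.

```python
import re

# Cue phrases taken from the criteria; real sentences need human judgment.
CONTEXT_CUES = re.compile(r"\b(we|this effect)\b", re.IGNORECASE)
UNCERTAINTY_CUES = re.compile(r"\b(probably|evaluation showed)\b", re.IGNORECASE)


def aida_issues(sentence: str) -> list[str]:
    """Return a list of surface-level problems found in the sentence."""
    text = sentence.strip()
    issues = []
    if not text.endswith("."):
        issues.append("declarative: does not end with a full stop")
    if re.search(r"[.!?]\s+\S", text):
        issues.append("single sentence: looks like more than one sentence")
    if CONTEXT_CUES.search(text):
        issues.append("independent: refers to external context")
    if UNCERTAINTY_CUES.search(text):
        issues.append("absolute: mentions uncertainty or method")
    return issues


for candidate in (
    "Malaria is transmitted by mosquitoes.",
    "We found that this effect is probably caused by mosquitoes",
):
    print(candidate, "->", aida_issues(candidate))
```

A check like this could serve as a first filter before manual review, not as a replacement for it.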
Controlled Natural Language (CNL)
AIDA is a kind of Controlled Natural Language.
More about that in my talk next week (20 March 2014) in the Protege Seminar.
AIDA Nanopublications
Nanopublications can be combined with AIDA sentences to allow for informal and underspecified assertions:
[Figure: nanopublication assertions at different degrees of formality, from the AIDA sentence alone ("Malaria is transmitted by mosquitoes.") to the sentence combined with RDF terms (ns1:mosquito, ns3:transmission, ns2:malaria).]
They can be linked regardless of their degree of formality:
[Figure: an assertion using the RDF terms ns1:mosquito, ns3:transmission, and ns2:malaria, linked to the AIDA sentence "Malaria is transmitted by mosquitoes."]
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
AIDA Sentences can be Efficiently Created
Manual creation of AIDA sentences from existing abstracts (163 sentences in total):
• perfect: 114 (70%)
• typo etc.: 7 (4%)
• inaccurate: 10 (6%)
• not AIDA: 32 (20%)
Automatic extraction of AIDA sentences from existing GeneRIF dataset (250 sentences in total):
• perfect: 177 (71%)
• typo etc.: 8 (3%)
• not AIDA: 65 (26%)
Kuhn et al. Broadening the Scope of Nanopublications. ESWC 2013.
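The percentages on this slide follow directly from the counts. The short check below recomputes them; the counts are taken from the slide, while the function name `shares` and the dictionary layout are just choices for this sketch.

```python
def shares(counts: dict) -> dict:
    """Convert category counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


manual = {"perfect": 114, "typo etc.": 7, "inaccurate": 10, "not AIDA": 32}
automatic = {"perfect": 177, "typo etc.": 8, "not AIDA": 65}

print(sum(manual.values()), shares(manual))
print(sum(automatic.values()), shares(automatic))
```

Both evaluations put the share of perfect sentences at about 70%, for manual creation and for automatic extraction alike.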
Examples: AIDA Sentences for Claims by Famous Computer Scientists
• Computers are more flexible and more powerful if data and program instructions are both stored side by side in the same memory. [John von Neumann]
• The problem of determining the provability of arbitrary propositions in first-order logic is undecidable. [Alonzo Church / Alan Turing]
• There is no general algorithm that can decide for all possible programs and inputs whether the program will terminate or run forever. [Alan Turing]
• Automated reasoning can be suitably implemented as a search tree with the hypothesis at the root and by applying heuristics to trim branches. [Herbert Simon and Allen Newell]
• Single-layer perceptrons cannot learn the XOR relation. [Marvin Minsky and Seymour Papert]
• The use of GOTO statements in programming is harmful to program structure. [Edsger Dijkstra]
• Hypertext representations are more flexible, more general, and more natural than classical text structure for computer-supported information systems. [Ted Nelson]
https://github.com/tkuhn/aida/
Summary
• With Trusty URIs, digital artifacts on the Web can be made immutable and verifiable
• With a decentralized network of nanopublication servers, we can build a reliable and trustworthy infrastructure for data publishing
• With AIDA sentences, we can formally represent provenance and metadata of informal or underspecified assertions
Thank you for your attention!
Questions?