Semantic Web in Physical Science

The Semantic Web in Physical Science/Engineering Peter Murray-Rust University of Cambridge Open Knowledge Foundation Culham Laboratory, 2013-09-11, UK

description

Maths, chemistry and physics are very well suited to the Semantic Web, but very poorly represented on it. Here I show how valuable it can be and what (relatively little) needs to be done.

Transcript of Semantic Web in Physical Science

Page 1: Semantic Web in Physical Science

The Semantic Web in Physical Science/Engineering

Peter Murray-Rust, University of Cambridge

Open Knowledge Foundation

Culham Laboratory, 2013-09-11, UK

Page 2: Semantic Web in Physical Science

Themes

To make the complete scientific literature accessible to machines and humans

• The Semantic Web
• The power and need for Open
• Building Communities
• Multidisciplinarity

Funding includes JISC, Unilever, EPSRC.

Page 3: Semantic Web in Physical Science

The Semantic Web

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."

Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001

Page 4: Semantic Web in Physical Science

The scientist’s amanuensis
• "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C)

Example: A semantic amanuensis could
• Give me a daily digest of zeolite papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication

The semantic web is having a personal amanuensis

Page 5: Semantic Web in Physical Science

Linked Open Data – the world’s knowledge

Very little physical science.

[Figure: LOD Cloud Diagram as of September 2011, http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png. Labelled regions: DBPedia, BIO, Comp, Lib, PDB, Ontologies, GOV, GOV.uk, Music/Art/Literature, Social, Knowledgebases. Content: RDF triples.]

Page 6: Semantic Web in Physical Science

“Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?”

Open Crystallography?
“Which countries where tropical diseases are endemic have published structures of chiral natural products?”

Linked Open data from Wikipedia

CC-BY-SA from Wikipedia
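Questions like the ones on this slide combine a relationship ("flows into") with a numeric filter ("longer than 50 km"), which is what a store of RDF triples can answer. A minimal Python sketch of the idea, assuming a made-up in-memory list of triples: the predicate names `flowsInto` and `lengthKm` and all river data are invented placeholders, not DBpedia's actual vocabulary or content.

```python
# Toy triple store: every fact is a (subject, predicate, object) tuple.
# All river names and lengths below are invented placeholders, not real data.
TRIPLES = [
    ("river:A", "flowsInto", "river:Rhine"),
    ("river:A", "lengthKm", 120.0),
    ("river:B", "flowsInto", "river:Rhine"),
    ("river:B", "lengthKm", 35.0),
    ("river:C", "flowsInto", "river:Danube"),
    ("river:C", "lengthKm", 300.0),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def tributaries_longer_than(target, min_km, triples=TRIPLES):
    """Subjects that flow into `target` and have a lengthKm above `min_km`."""
    candidates = [s for s, p, o in triples if p == "flowsInto" and o == target]
    return sorted(
        s for s in candidates
        if any(length > min_km for length in objects(s, "lengthKm", triples))
    )


print(tributaries_longer_than("river:Rhine", 50))
```

A real deployment would send an equivalent query to a SPARQL endpoint. The point of the sketch is only that once facts are stored as linked triples, a two-condition question becomes a short, mechanical lookup.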

Page 7: Semantic Web in Physical Science

Semantics: (Things Take Time)*

• 1994 1st WWW Conference
• 1994 Chemical MIME, Chemical Markup Language (Henry Rzepa, PMR)
• 2001 UK eScience programme, eMinerals
• 2005 Materials Grid (Martin Dove group)
• 2006 Blue Obelisk (Open Source chemistry)
• 2011 PNNL (US) meetings and visit
• 2012 Semantic Physical Science (Cambridge)

*TTT: Piet Hein

Page 8: Semantic Web in Physical Science

Componentised approach liberates

Individual, manual, unreusable, flaky

Commodity, standard, reliable, re-usable

Page 9: Semantic Web in Physical Science

Representing Semantics

Interoperating approaches:

Markup Languages (“hardcoded objects”)
• MathML, G(eo)ML, CellML, S(ys)B(io)ML
• CML (Chemistry and numeric science):
  1. Molecules (atoms, bonds, coordinates)
  2. Reactions
  3. Spectra
  4. Solid state
  5. Computation

RDF (relationships, annotations, linking)

Ontologies (Dictionaries)

Page 10: Semantic Web in Physical Science

Humans and machines use different languages

Page 11: Semantic Web in Physical Science

Scalable Vector Graphics (SVG)

<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1280" height="640" viewBox="0 0 30240 15120">
  <defs id="defs6">
    <polygon points="0,-9 1.735535,-3.6038755 7.0364833,-5.6114082 3.8997116,-0.89008374 8.7743512,2.0026884 3.1273259,2.4939592 3.9049537,8.1087198 0,4 -3.9049537,8.1087198 -3.1273259,2.4939592 -8.7743512,2.0026884 -3.8997116,-0.89008374 -7.0364833,-5.6114082 -1.735535,-3.6038755 0,-9 " id="Star7"/>
  </defs>
  <path d="M 0,0 L 30240,0 L 30240,15120 L 0,15120 L 0,0 z" style="fill:#00008b"/>
  <use transform="matrix(252,0,0,252,7560,11340)" id="Commonwealth_Star" style="fill:#fff" xlink:href="#Star7"/>
  <use transform="matrix(120,0,0,120,22680,12600)" id="Star_Alpha_Crucis" style="fill:#fff" xlink:href="#Star7"/>
  <!-- snipped: … 217,2520 L 10080,2520 L 15120,0 z" id="Red_Diagonals" style="fill:red"/> -->
  <use transform="matrix(-1,0,0,-1,15120,7560)" id="Red_Diagonals_Rotated" style="fill:red" xlink:href="#Red_Diagonals"/>
</svg>

Human-friendly

Machine-friendly

Automatic!

Page 12: Semantic Web in Physical Science

Mathematics Markup Language

Energy of c.c.p. lattice of argon

4 pages clipped

Human-friendly

Machine-friendly

Many editors and tools exist

We used MathWeaver

Automatic!

MathML

Page 13: Semantic Web in Physical Science

CML (Chemical Markup Language)

Human-friendly Machine-friendly

Automatic!
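The slide's own CML example is not reproduced in this transcript. As a rough illustration of what "machine-friendly" means here, the following Python sketch builds a small CML-style molecule and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`) and the namespace URI are the commonly used CML ones; the molecule (water, without coordinates) and the helper names are chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace URI


def water_as_cml():
    """Build a small CML <molecule> for water; coordinates are omitted."""
    ET.register_namespace("", CML_NS)
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": "m1"})
    atoms = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for atom_id, element in (("a1", "O"), ("a2", "H"), ("a3", "H")):
        ET.SubElement(atoms, f"{{{CML_NS}}}atom",
                      {"id": atom_id, "elementType": element})
    bonds = ET.SubElement(molecule, f"{{{CML_NS}}}bondArray")
    for pair in ("a1 a2", "a1 a3"):
        ET.SubElement(bonds, f"{{{CML_NS}}}bond",
                      {"atomRefs2": pair, "order": "1"})
    return molecule


def element_counts(molecule):
    """Machine side: read the element types back out of the markup."""
    counts = {}
    for atom in molecule.iter(f"{{{CML_NS}}}atom"):
        symbol = atom.get("elementType")
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts


cml = water_as_cml()
print(ET.tostring(cml, encoding="unicode"))
print(element_counts(cml))
```

Because atoms and bonds are explicit elements rather than pixels in a drawing, a program can derive a formula, a picture or a search index from the same file without human help.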

Page 14: Semantic Web in Physical Science

Non-semantic data

Data extraction difficult and incomplete

Human readers

Current scientific information flow … is broken for data-rich science

PDF

Lineprinter output

Text files

Human input

Page 15: Semantic Web in Physical Science

Semantic network closes the loop

Data mined from document

Data available for e-science and re-use

Computation

Measurement

Semantic Authoring

Community

Analysis

Page 16: Semantic Web in Physical Science

The network grows autonomously

Machine-machine

Machine-human

Human-machine

Human-human

Page 17: Semantic Web in Physical Science

• Example: Materials 2012, 5, 27-46; doi:10.3390/ma5010027

Page 18: Semantic Web in Physical Science

CHEMICAL STRUCTURES

Page 19: Semantic Web in Physical Science

REACTIONS

ABBREVIATIONS

“… electron donor (ED), such as an electron rich, metal-based light absorber (LA), and electron acceptor (EA) sites.”

Page 20: Semantic Web in Physical Science

SPECTRA

Page 21: Semantic Web in Physical Science

TABLES

Page 22: Semantic Web in Physical Science

PROPERTIES (NAME-VALUE-UNITS)

[Figure: text excerpt with each property annotated as Name (N), Value (V) and Units (U)]

Note CML supports value ranges and errors
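A name-value-units triple, with an optional error or range, can be sketched in Python as below. This is an illustrative sketch only: the `Property` class, the regular expression and the sample sentence are hypothetical and are not how CML or the tools in this talk do it. The pattern recognises only a handful of unit symbols and two notations ("120-125" for a range, "78.5 ± 0.3" for an error).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Property:
    """One name-value-units triple; `error` and `maximum` are optional."""
    name: str
    value: float
    units: str
    error: Optional[float] = None    # e.g. 78.5 ± 0.3
    maximum: Optional[float] = None  # upper end of a range, e.g. 120-125


# Hypothetical pattern: "<name> <value>[ ± error | -maximum] <units>"
PATTERN = re.compile(
    r"(?P<name>[A-Za-z ]+?)\s*"
    r"(?P<value>\d+(?:\.\d+)?)"
    r"(?:\s*±\s*(?P<error>\d+(?:\.\d+)?)|\s*-\s*(?P<max>\d+(?:\.\d+)?))?"
    r"\s*(?P<units>°C|K|mg|mL|g|%)"
)


def extract_properties(text):
    """Return every name-value-units triple the pattern finds in `text`."""
    found = []
    for m in PATTERN.finditer(text):
        found.append(Property(
            name=m["name"].strip(),
            value=float(m["value"]),
            units=m["units"],
            error=float(m["error"]) if m["error"] else None,
            maximum=float(m["max"]) if m["max"] else None,
        ))
    return found


for prop in extract_properties("melting point 120-125 °C; yield 78.5 ± 0.3 %"):
    print(prop)
```

Real scientific text is far messier than this sample, which is why the talk treats content mining as a separate problem from representation.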

Page 23: Semantic Web in Physical Science

Mathematics

CML is being integrated with computable (content) MathML

Page 24: Semantic Web in Physical Science

Materials Search Challenge

• What would you like a “Google for materials” to find for you in the scientific literature?

Page 25: Semantic Web in Physical Science

Tim Berners-Lee’s Open data http://5stardata.info

★ make your stuff available on the Web (whatever format) under an OPEN license

★★ make it available as structured data (i.e. NOT PDF)

★★★ use non-proprietary formats (e.g., CSV)

★★★★ use URIs to denote things, so that people can point at your stuff

★★★★★ link your data to other data to provide context

CIF, DIC, ACS, IUCr

CRYSTALEYE
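The five-star scheme can be read as a cumulative checklist. A small Python sketch of that reading follows. It assumes each star only counts if all lower levels are met, and the `Dataset` fields and `star_rating` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    open_licence: bool     # on the Web under an open licence
    structured: bool       # available as structured data, not e.g. PDF
    non_proprietary: bool  # non-proprietary format such as CSV
    uses_uris: bool        # things are denoted by URIs
    linked: bool           # links out to other data for context


def star_rating(d: Dataset) -> int:
    """Count stars cumulatively: a level only counts if all lower ones hold."""
    stars = 0
    for satisfied in (d.open_licence, d.structured, d.non_proprietary,
                      d.uses_uris, d.linked):
        if not satisfied:
            break
        stars += 1
    return stars


# A hypothetical dataset: open, structured, CSV, but without URIs.
print(star_rating(Dataset(True, True, True, False, True)))
```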

Page 26: Semantic Web in Physical Science
Page 28: Semantic Web in Physical Science

Creating semantic content

1. Authoring tools for humans
2. Program output
3. Chemical databases
4. Content mining and Natural Language Processing (Text) (NLP)
5. Community

Page 29: Semantic Web in Physical Science

Semantic authoring IUCr
• http://blogs.ch.cam.ac.uk/pmr/2012/01/23/brian-mcmahon-publishing-semantic-crystallography-every-science-data-publisher-should-watch-this-all-the-way-through/
• 1:08 CIF
• 3:36 CIF Syntax and dataTypes
• 4:30 Publishing with CIF
• 6:41 Demonstration: CheckCIF
• 12:02 Interactive Chemical validation
• 14:42 Linking data to journal article and search for novelty of data
• 15:08 Jmol display applet
• 21:03 Supplementary data
• 21:47 PublCIF, a tool to merge data and text and annotate them

Page 30: Semantic Web in Physical Science

Semanticizing Logfiles: JumboConverters

LOGFILE

Page 31: Semantic Web in Physical Science

QUIXOTE: Semantic KnowledgeBase for Computational Chemistry

Page 32: Semantic Web in Physical Science

http://wwmm.ch.cam.ac.uk/chemicaltagger

Typical chemical synthesis

Content Mining of Chemistry

Page 33: Semantic Web in Physical Science

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 34: Semantic Web in Physical Science

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 35: Semantic Web in Physical Science

Machines can interpret phylogenetic trees

Open Content Mining of FACTS

>100,000 diagrams in literature; cost 1,000,000,000 hours

Unusable FACT

Re-usable FACT

Page 36: Semantic Web in Physical Science

Crowdcrafting and Hackathons

Page 37: Semantic Web in Physical Science

Crowdcrafting for AEgIS/CERN
• Does antimatter fall down or up?
• Help the AEgIS experiment at CERN to work out how antimatter is affected by gravity. Just join the dots!
• Antimatter
• The observable universe is composed almost entirely of matter but we can produce stuff called antimatter in the lab. Antimatter is material composed of antiparticles.
• Antiparticles have the same mass as normal matter particles but the opposite charge. When an antiparticle collides with an ordinary matter particle they both annihilate, producing a burst of other particles and radiation.
• Antiparticles should interact gravitationally just like particles of ordinary matter because Einstein's weak equivalence principle states that gravity doesn't depend on composition. But if they don't then gravity is much more complicated than our current understanding indicates.
• http://crowdcrafting.org http://crowdcrafting.org/antimatter

Page 38: Semantic Web in Physical Science

Crowdcrafting for CERN/AEgIS

Page 39: Semantic Web in Physical Science

[at Research Data Alliance, we are entering a new “era of open science”, which will be “good for citizens, good for scientists and good for society”. She explicitly highlighted the transformative potential of open access, open data, open software and open educational resources, mentioning the EU’s policy requiring open access to all publications and data resulting from EU funded research.

http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neelie-kroes/#sthash.3SWDXDE6.dpuf

RCUK, Wellcome, ERC, NSF … require fully OPEN

Page 40: Semantic Web in Physical Science

Open Definition
• “A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

OPEN                    NOT OPEN
PDB, COD, Crystaleye    CCDC, ICSD
RSC/ACS/IUCr CIFs       Elsevier/Wiley/Springer CIFs
Acta Cryst E            Acta Cryst ABCD (default)
CIF dictionaries
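The Open Definition quoted on this slide permits at most attribution and share-alike conditions. A minimal Python sketch of that test for Creative Commons-style licence codes follows. The function name `is_open` and the string format it expects are assumptions made for illustration, and an unknown or unstated licence is treated as not open.

```python
# Licence conditions that the Open Definition tolerates.
ALLOWED_CONDITIONS = {"BY", "SA"}  # attribution, share-alike


def is_open(licence: str) -> bool:
    """True if a Creative Commons-style code adds only BY and/or SA conditions.

    Expects codes such as "CC0", "CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-ND".
    """
    code = licence.upper().strip()
    if code in {"CC0", "CCZERO"}:
        return True  # public-domain dedication: no conditions at all
    parts = code.split("-")
    if parts[0] != "CC" or len(parts) < 2:
        return False  # unknown or unstated licence: treat as not open
    return set(parts[1:]) <= ALLOWED_CONDITIONS


for code in ("CC0", "CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-ND", ""):
    print(repr(code), is_open(code))
```

Under this test the non-commercial (NC) and no-derivatives (ND) variants fail, because each adds a restriction beyond attribution and share-alike.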

Page 41: Semantic Web in Physical Science

From Saulius Grazulis

Page 42: Semantic Web in Physical Science

Crystaleye
• A database of 200,000 crystal structures scraped from publications' CIF supplemental information
• CML molecules and name-value pairs
• Re-usable as fragment base

Nick Day, Jim Downing, Sam Adams, N. W. England and Peter Murray-Rust*, J. Appl. Cryst. (2012). 45, 316–323, doi:10.1107/S0021889812006462

http://wwmm.ch.cam.ac.uk/crystaleye
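To give a feel for the "name-value pairs" that can be read from a CIF, here is a minimal Python sketch. It handles only single-line `_name value` items and strips the bracketed standard uncertainty from numbers; loops and multi-line text fields are skipped. The CIF fragment is a made-up placeholder, and this is not the code Crystaleye itself uses.

```python
import re

# Hypothetical CIF fragment; the numbers are placeholders, not a real structure.
CIF_TEXT = """\
data_example
_cell_length_a    5.4310(2)
_cell_length_b    5.4310(2)
_cell_angle_alpha 90
_symmetry_space_group_name_H-M 'F d -3 m'
"""

# A number optionally followed by a standard uncertainty in brackets.
NUMBER_WITH_SU = re.compile(r"^(-?\d+(?:\.\d+)?)(?:\((\d+)\))?$")


def parse_simple_items(cif_text):
    """Collect single-line `_name value` items; loops and text blocks are skipped."""
    items = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) < 2 or not parts[0].startswith("_"):
            continue  # not a data item, or its value is on a later line
        name, raw = parts[0], parts[1].strip().strip("'\"")
        match = NUMBER_WITH_SU.match(raw)
        # Numeric values drop the bracketed uncertainty: 5.4310(2) -> 5.431
        items[name] = float(match.group(1)) if match else raw
    return items


for name, value in parse_simple_items(CIF_TEXT).items():
    print(name, "=", value)
```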

Page 43: Semantic Web in Physical Science

ACS

IUCr

RSC

Supplemental Information (CIFs) harvested from Publications

ELS

Page 44: Semantic Web in Physical Science

As-Cl Bond lengths

Long

Short
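A bond length such as As-Cl can be computed directly from the cell parameters and fractional coordinates stored in a CIF. The Python sketch below uses the standard metric-tensor formula, d² = Δᵀ G Δ. It ignores symmetry operations and periodic images, and the `classify` threshold is a caller-supplied placeholder, not a value taken from the slides.

```python
import math


def bond_length(cell, frac1, frac2):
    """Distance between two atoms given in fractional coordinates.

    cell = (a, b, c, alpha, beta, gamma): lengths in Å, angles in degrees.
    Uses the metric tensor G so that d^2 = delta^T G delta for any cell shape.
    """
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    g = [
        [a * a,      a * b * cg, a * c * cb],
        [a * b * cg, b * b,      b * c * ca],
        [a * c * cb, b * c * ca, c * c],
    ]
    delta = [p - q for p, q in zip(frac1, frac2)]
    d2 = sum(delta[i] * g[i][j] * delta[j] for i in range(3) for j in range(3))
    return math.sqrt(d2)


def classify(length, threshold):
    """Label a bond 'short' or 'long' relative to a caller-chosen threshold."""
    return "short" if length < threshold else "long"


# Placeholder cubic cell and coordinates, chosen only to make the arithmetic easy.
d = bond_length((10, 10, 10, 90, 90, 90), (0, 0, 0), (0.1, 0.2, 0.2))
print(round(d, 3), classify(d, threshold=2.5))
```

Running the same calculation over every structure in an open collection is what makes a bond-length distribution, and its outliers, visible.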

Page 45: Semantic Web in Physical Science

Long

Page 46: Semantic Web in Physical Science

Short

Page 47: Semantic Web in Physical Science

Link to Journal

Page 48: Semantic Web in Physical Science

COD Letter to Editors 2012

[We] have become aware of growing concerns regarding the publication, preservation and quality maintenance of crystallographic data. … However, we believe that completely open deposition of data and multiple checks can ensure the quality and wide availability of scientific data

[Please] recommend to your authors that, they also deposit their supplementary crystallographic data into the COD when they submit scientific papers to your journals.

Being open by its design, the COD enables the creation of multiple mirrors and backup copies. It provides, thus, archival storage of scientific data with adequate reliability. … services for reviewers and editors to facilitate the peer-review. …since our database follows the Open Access model, all material deposited into the COD is available to other databases. The COD team actually encourages the use of our data collection for any possible scientific or industrial application by putting the database into the public domain

Page 49: Semantic Web in Physical Science

Recommendations for Open Crystallography

• Require Open Crystal Data for all publications
• Deposition of Open Data in COD
• Integrate CIF dictionaries as RDF into Linked Open Data
• Integrate COD into Linked Open Data Cloud
• CCDC/ICSD to publish RAW author CIFs Openly

Page 50: Semantic Web in Physical Science

Most “Open Access” is not re-usable

[Chart: licences of “Open Access” articles against price per article (USD 0 to 6000). Labels: CC-BY / Reusable; Restricted by licence or lack of clarity; Nothing/unclear; CC-ND; CC-NC.]

Ross Mounce, Panton Fellow 2012

Page 51: Semantic Web in Physical Science

Panton Principles for Open Data in Science

Why? Wanted to avoid the mess in OA
• Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John Wilbanks
• 2008 -> 2010 (launch) at Panton Arms
• Panton Fellowships (2012)

[Photo captions: Jenny Molloy, Jordan Hatcher, Rufus Pollock, John Wilbanks, Cameron Neylon, Peter Murray-Rust. Launch 2010]

“Licence STM Data as CC0”

Page 52: Semantic Web in Physical Science

Panton Fellows

Ross Mounce & Sophie Kershaw (Support from Open Society Foundations)

Page 53: Semantic Web in Physical Science

• Data should be open
• Make your wishes clear
• Use an appropriate licence

Page 54: Semantic Web in Physical Science

Open Mining Manifesto

1. Define ‘open content mining’ in a broad and useful manner

‘Open Content Mining’ means the unrestricted right of subscribers to extract, process and republish content manually or by machine in whatever form (without prior specific permissions and subject only to community norms of responsible behaviour in the electronic age).
• Text
• Numbers
• Tables
• Diagrams
• Graphical representations of relationships between variables
• Images and video and audio when it is the means of expressing a fact
• Semantics (XML, RDF)

Page 55: Semantic Web in Physical Science

2. Urge publishers and institutional repositories to adhere to the following principles:

Principle 1: Right of Legitimate Accessors to Mine
We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes. The right to read is the right to mine.

Principle 2: Lightweight Processing Terms and Conditions
Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required. Users and providers should encourage machine processing.

Principle 3: Use
Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts and context text as fair use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited. Facts don’t belong to anyone.

Page 56: Semantic Web in Physical Science

3. Strategies

Assert the above rights by:
• Educating researchers and librarians about the potential of content mining and the current impediments to doing so, including alerting librarians to the need not to cede any of the above rights when signing contracts with publishers
• Compiling a list of publishers and indicating what rights they currently permit, in order to highlight the gap between the rights here being asserted and what is currently possible
• Urging governments and funders to promote and aid the enjoyment of the above rights.

Page 57: Semantic Web in Physical Science

Take-away messages

• Lost/unused STM* data costs 30-100 billion/yr [1]
• Licence: DATA as CCZero and TEXT as CC-BY
• Content Mining for DATA is a RIGHT
• Apathy is our worst enemy
• Trust and empower young people

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

“Une donnée est ouverte, si chacun est libre de l'utiliser, de la réutiliser et de la redistribuer” (French: “A piece of data is open if everyone is free to use, reuse and redistribute it”)

*Scientific Technical Medical
[1] PMR: submission to UK Hargreaves process

Page 58: Semantic Web in Physical Science

To make the complete scientific literature accessible to machines and humans