Open Knowledge: Reproducibility in Cheminformatics with Open Data, Open Source, and Open Standards

Post on 27-Jan-2015

My presentation at the "Open Drug Discovery and Open Notebook Science" session at the GDCh-Wissenschaftsforum Chemie 2009 in Frankfurt.

Open Knowledge: Reproducibility in Cheminformatics with Open Data, Open Source and Open Standards

Egon Willighagen <http://chem-bla-ics.blogspot.com/>

Bioclipse & Proteochemometric Group (Prof. Wikberg)

Department of Pharmaceutical Biosciences

Uppsala University

2009-08-31

Problem

Solution

Results

Discussions

Conclusion

The Setting...

1998: Organic chemistry... beautiful science!

But ... why, how, what, ...

PJJA Buijnsters et al., Eur. J. Org. Chem., 2002, 1397–1406

2009-08-31 Bioclipse & Proteochemometric Group - 2 - Egon Willighagen | chem-bla-ics.blogspot.com

Reliable Knowledge: Trust

How to build Trust
• track record

Knowledge: Trust

How to build Trust
• track record
• transparency: citation

Knowledge: Trust

How to build Trust
• track record
• transparency: citation
• reproducibility: details

Knowledge: Trust

How to build Trust
• track record
• transparency: citation
• reproducibility: details

Open {Data|Standards|Source|...}

Knowledge Representation...

What are the organic normal conditions?

The Problem: Reproducibility...

Where reproducibility is severely hampered:
• recalculate basic atom and bond properties
• access to QSAR/QSPR data
• well-defined algorithms
• publications destroy information

Solutions...

Openness
• license that allows modification and redistribution
• hiding behind public domain is not helpful

Semantic Web
• be explicit in what you mean
• both in facts and in algorithms

Reproducibility needs ODOSOS

Open Data
• No Intellectual Monopoly

Open Source
• algorithms are complex
• implementations even more
• strong interaction with representation

Open Standards
• Semantic Web
• formats
• unique identifiers

http://en.wikipedia.org/wiki/Glyn_Moody

Jmol

Started in 1997 by Dan Gezelter (Notre Dame)

Leaders: Bradley Smith, me, Miguel Howard, Bob Hanson

E.L. Willighagen, M. Howard, Nature Precedings, 2005

http://www.jmol.org/

The Chemistry Development Kit

A Family of Projects
• CDK-Taverna (chemoinformatics workflows)
• JChemPaint (semantic 2D editor)
• ChemoJava (GPL-ed extension)

Goals
• library of cheminformatics algorithms
• educational

Usage
• CDK 2003: 75+ times cited in literature
• Bioclipse, KNIME, Jumbo (CML), AMBIT, ...

C. Steinbeck et al., J. Chem. Inf. Comput. Sci., 2003

C. Steinbeck et al., Curr. Pharm. Design, 2006

CDK: an Open Project

Features
• open mailing list and bug tracker
• open source repository
• release soon, release often

Offer Review
• senior developers review patches

Bioclipse

O. Spjuth et al., BMC Bioinformatics 2007, 8:59

Integration

Services
• databases: PubChem
• web services
• Google Spreadsheets
• MyExperiment.org: Bioclipse Scripting Language
• Twitter, ...
• journals, ...

Techniques
• SOAP, REST, XMPP, ...
• Resource Description Framework
• dedicated APIs

MyExperiment: Bioclipse Scripting Language

XMPP

XMPP
• Jabber protocol
• Alternative to HTTP
• XML-based: improved semantics

Features
• Asynchronous
• XML-based: improved semantics

J. Wagener et al., BMC Bioinformatics, 2009, in production
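The "asynchronous" point can be illustrated with a minimal sketch. This is not XMPP itself and uses no XMPP library; it only shows the calling pattern, where a client submits several requests and collects the answers when they are ready instead of blocking on each one. The function `compute_descriptor` and its placeholder result are hypothetical.

```python
import asyncio


async def compute_descriptor(smiles: str) -> tuple:
    """Hypothetical long-running service call; the sleep stands in for remote work."""
    await asyncio.sleep(0.01)
    # Placeholder result (string length), not a real chemical descriptor.
    return smiles, len(smiles)


async def main() -> list:
    jobs = [compute_descriptor(s) for s in ("c1ccccc1", "CCO", "CC(=O)O")]
    # All requests are in flight at once; gather returns results in request order.
    return await asyncio.gather(*jobs)


results = asyncio.run(main())
print(results)
```

With a blocking request/response exchange, the three calls would run one after another; here they overlap, which matters when a single calculation takes minutes rather than milliseconds.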

Resource Description Framework

Facts as Triples
• subject
• predicate (relation)
• object

Examples
• wp:Benzene chem:hasSMILES "c1ccccc1"
• wp:Benzene owl:sameAs chemspider:123
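The triple model can be sketched in a few lines of plain Python. This is an illustration only, not an RDF library: the two facts are the slide's examples, the prefixes are kept as plain strings rather than expanded to full URIs, and the helper `match` is a hypothetical name.

```python
# Facts as (subject, predicate, object) triples, using the slide's two examples.
triples = {
    ("wp:Benzene", "chem:hasSMILES", "c1ccccc1"),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Ask for the SMILES of benzene by fixing subject and predicate.
smiles = [o for _, _, o in match(triples, "wp:Benzene", "chem:hasSMILES")]
print(smiles)
```

Because every fact has the same three-part shape, one generic query function covers any question that can be phrased as a pattern over subject, predicate and object.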

OpenMolecules RDF

http://rdf.openmolecules.net/

Blue Obelisk

R Guha et al., J.Chem.Inf.Model.,2006

Which License?

Choice
• GPL v2 or v3, LGPL v2 or v3, Apache, BSD, MIT, ...
• FDL, CC0, PDDL
• Important: redistribution, modification

Bad Practice
• not explicitly stating your intentions
• Public Domain

Mixing Data?

License Incompatibility
• Ask about the copyright holder's intention!

Use Open Standard Interfaces
• Resource Description Framework
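Why a shared triple format helps when mixing data can be sketched as follows, assuming two hypothetical sources that describe the same compound under different identifiers. The predicate `ex:hasLabel` and the helpers `merge` and `facts_about` are made up for illustration; only the benzene facts come from the slides. Licensing questions are not addressed by this sketch.

```python
# Source A uses a Wikipedia-style identifier and links it to a ChemSpider one.
source_a = {
    ("wp:Benzene", "chem:hasSMILES", "c1ccccc1"),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
}
# Source B (hypothetical) only knows the ChemSpider identifier.
source_b = {
    ("chemspider:123", "ex:hasLabel", "benzene"),
}


def merge(*sources):
    """Merging triple sets is a plain set union; no schema mapping is needed."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def facts_about(store, resource):
    """Collect facts about a resource and everything linked to it via owl:sameAs."""
    same = {resource}
    changed = True
    while changed:  # follow owl:sameAs in both directions until nothing new is found
        changed = False
        for s, p, o in store:
            if p == "owl:sameAs" and (s in same) != (o in same):
                same |= {s, o}
                changed = True
    return sorted((p, o) for s, p, o in store if s in same and p != "owl:sameAs")


merged = merge(source_a, source_b)
print(facts_about(merged, "wp:Benzene"))
```

The explicit `owl:sameAs` link is what lets the label from the second source be found when asking about `wp:Benzene`; without unique identifiers and such links, the two datasets would sit side by side without connecting.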

Conclusions

No Intellectual Monopoly Achieved

Jmol, CDK, JChemPaint, Bioclipse
• A huge success!

Open Data in chemistry is still way behind
• Open Access trap
• Public Domain trap

Semantics is showing up
• in RDF
• in Publishing

The Details

http://www.citeulike.org/user/egonw/tag/papers

http://chem-bla-ics.blogspot.com

mailto:egon.willighagen@farmbio.uu.se
