ContentMine.org and...
![Page 1: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/1.jpg)
The Culture of Research Data
Peter Murray-Rust, ContentMine.org and University of Cambridge
LEARN, London, UK 2016-01-29
The technology for Managing Research Data is already here…
…but we need a change of culture
Open Notebook Science
Publishers must be forced to serve us, not tyrannize us
![Page 2: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/2.jpg)
Just read the big letters
He’s got zillions of slides…
![Page 3: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/3.jpg)
My European Heroes
Young People (ContentMine)
NEELIE KROES
![Page 4: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/4.jpg)
The Right to Read is the Right to Mine
http://contentmine.org
![Page 5: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/5.jpg)
Themes
• Highly domain-dependent (chem, cryst, phylo)
• Requires community and centrality
• University repositories are NOT the solution
• Openness makes it dramatically easier/better
• The publisher-academic complex is a major problem.
• Infrastructure must be open and under our control
![Page 6: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/6.jpg)
WE pay for scholarly publications that WE can’t read
The Publisher-Academic complex[1]
[Diagram labels: Publishers; Academia; Taxpayer; Student; Researcher; flows marked “$$”, “in-kind”, “$$, MS review” and “Glory+?”]
[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)
![Page 7: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/7.jpg)
Elsevier wants to control Open Data
[asked by Michelle Brook]
![Page 8: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/8.jpg)
Some topics
• Github / software mgt informs data mgt
• Open notebook science
• Open source malaria + LabTrove
• Open phylogenetics
• Computational chemistry
• Crystallography
• Early career researchers can change the world, if we let them.
• Are “publishers” tyrants or servants?
![Page 9: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/9.jpg)
Every Research Data Manager
should be using me
![Page 10: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/10.jpg)
Why I reposit software in GitHub
I WANT TO!!!
BETTER QUICKER SECURE AUDIT, BACKTRACKABLE
EASY get collaborators
Most early career software creators have repos
How many people have USED Git?
![Page 11: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/11.jpg)
Free/Open Software Development
[Diagram labels: CODE REPOSITORY; World community; CODE rewrite; validate; CODE fork; CODE Re-use; inspires; Github, BitBucket, StackOverflow, Apache; OSI]
Example: ContentMine at http://github.com/ContentMine/quickscrape
BORN-OPEN-SOURCE
NO WALLS
![Page 12: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/12.jpg)
GIT housekeeps AUTOMATICALLY, eternally
Daily record of commits and Merges. Can backtrack to ANY Previous version
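The idea of a recorded history that can be rewound to any earlier version can be sketched in a few lines. This is a toy illustration only, not how Git is implemented; `TinyHistory` and its method names are hypothetical and chosen just for this example.

```python
import hashlib


class TinyHistory:
    """Toy append-only history: every committed version is kept forever."""

    def __init__(self):
        # Each entry is (commit_id, message, content), oldest first.
        self._commits = []

    def commit(self, content: str, message: str) -> str:
        # The id is derived from the parent id plus the content, so it
        # depends on everything that was committed before.
        parent = self._commits[-1][0] if self._commits else ""
        digest = hashlib.sha1((parent + content).encode("utf-8")).hexdigest()
        commit_id = digest[:10]
        self._commits.append((commit_id, message, content))
        return commit_id

    def log(self):
        # Newest first, like a commit log.
        return [(cid, msg) for cid, msg, _ in reversed(self._commits)]

    def checkout(self, commit_id: str) -> str:
        # Return the content exactly as it was at that commit.
        for cid, _, content in self._commits:
            if cid == commit_id:
                return content
        raise KeyError(commit_id)


history = TinyHistory()
first = history.commit("mass = 1.0\n", "initial measurement")
second = history.commit("mass = 1.2\n", "recalibrated balance")
restored = history.checkout(first)  # the earlier version is still there
```

Nothing is ever overwritten: a later commit adds a new entry, and `checkout` can return any earlier one by its id.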
![Page 13: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/13.jpg)
Community involvement
https://github.com/ContentMine/quickscrape/pulls
Contributions from People “outside project”
![Page 14: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/14.jpg)
Continuous Integration (Jenkins)
Every time I commit a change 50 projects are recompiled and tested. Impossible to do this manually!
[Dashboard legend: Compile Fail; Inactive; Fail Tests; Pass Tests]
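A minimal sketch of what such a commit-triggered check does, using the four statuses shown on the slide. It does not use Jenkins or its API; `Project`, `check` and `on_commit` are hypothetical names, and the build and test steps are stand-in functions that simply return success or failure.

```python
from enum import Enum
from typing import Callable, Dict, NamedTuple


class Status(Enum):
    INACTIVE = "Inactive"
    COMPILE_FAIL = "Compile Fail"
    FAIL_TESTS = "Fail Tests"
    PASS_TESTS = "Pass Tests"


class Project(NamedTuple):
    active: bool
    build: Callable[[], bool]      # stand-in build step, True on success
    run_tests: Callable[[], bool]  # stand-in test step, True on success


def check(project: Project) -> Status:
    # Inactive projects are skipped; tests only run if the build succeeds.
    if not project.active:
        return Status.INACTIVE
    if not project.build():
        return Status.COMPILE_FAIL
    return Status.PASS_TESTS if project.run_tests() else Status.FAIL_TESTS


def on_commit(projects: Dict[str, Project]) -> Dict[str, Status]:
    # Re-check every project after each commit and report one status each.
    return {name: check(project) for name, project in projects.items()}


projects = {
    "parser": Project(True, lambda: True, lambda: True),
    "legacy": Project(False, lambda: True, lambda: True),
    "viewer": Project(True, lambda: False, lambda: True),
    "miner": Project(True, lambda: True, lambda: False),
}
report = on_commit(projects)
```

With 50 projects the loop is the same; only the dictionary grows, which is why this is left to a machine.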
![Page 15: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/15.jpg)
Software management Is a success!
Research DATA management Is a mess.
![Page 16: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/16.jpg)
Traditional Research and Publication
“Lab” work paper/thesis
Write
rewrite
Re-experiment
publish
???
Validation??
DATA
output “belongs” to publisher
Every process is LOSSY
![Page 17: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/17.jpg)
How NOT to publish data
HT Henry Rzepa
From Henry Rzepa: this article, http://doi.org/10.1126/science.aad6252, provides a 22 Mbyte PDF of data (mostly bitmaps of NMR spectra) and comes in at 404 pages long. [1] But this one, http://doi.org/10.1021/jacs.5b05902 [comp chem], is 505 pages long (the current record holder?)
[1] DATA behind paywall
![Page 18: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/18.jpg)
The 505-page PDF was a machine-readable log file that could and should have been in a repo
Computational Chemistry
![Page 19: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/19.jpg)
MORE of the PDF DATA Destruction
Blind humans and Machines cannot read this
![Page 20: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/20.jpg)
ALWAYS put your (computational, instrumental, observational) data directly into a repository
![Page 21: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/21.jpg)
Let’s see some visionaries
![Page 22: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/22.jpg)
JD Bernal’s 1965 vision
However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means. (Bernal, 1965)
Quoted by PMR in http://journals.iucr.org/d/issues/1998/06/01/ba0011/ba0011.pdf
![Page 23: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/23.jpg)
PMR’s Tribute
Planned Memorial Meeting July 14th 2014 Cambridge
OPEN NOTEBOOK SCIENCE
![Page 24: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/24.jpg)
https://en.wikipedia.org/wiki/Bermuda_Principles
• Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours).
• Immediate publication of finished annotated sequences.
• Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
HUMAN GENOME project used Open Notebooks
![Page 25: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/25.jpg)
Open is FASTER, BETTER, MORE EFFICIENT
![Page 26: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/26.jpg)
Open is FASTER, BETTER, MORE, MORE EFFICIENT
![Page 27: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/27.jpg)
Open Notebook Science, ONS
Jean-Claude Bradley 2006
All data immediately available to all. NO INSIDER INFORMATION.
![Page 28: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/28.jpg)
TOOLS
Open Notebook Science
Open engineered repository
World community
INSTRUMENT
validate
merge
MODEL CODE
DATA
DATA knowledge
calibrate
Problems are solved communally; nothing is needlessly duplicated; “publication” is continuous; data are SEMANTIC
Machines and humans Working together
![Page 29: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/29.jpg)
Here are three examples
![Page 30: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/30.jpg)
Mat Todd (Sydney) and MANY collaborators
http://opensourcemalaria.org/ (Chrome for interactivity)
Mat Todd, Univ Sydney, runs an Open Notebook community to create new antimalarials.
![Page 31: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/31.jpg)
Notebook managed on Git.
![Page 32: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/32.jpg)
Interactive OPEN chemical search tool from cheminfo.org
![Page 33: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/33.jpg)
Interactive OPEN molecular display Jmol (Bob Hanson et al)
![Page 34: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/34.jpg)
![Page 35: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/35.jpg)
Interactive OPEN chemical search tool from cheminfo.org
![Page 36: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/36.jpg)
data is associated with the proposed scientific endeavour prior to or at the point of creation rather than by annotating the data with commentary after the experiment has taken place
University of Southampton
![Page 37: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/37.jpg)
Data thrives on Community
![Page 38: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/38.jpg)
Henry Rzepa does Open Notebook Computational Chemistry…
http://www.rzepa.net/blog/?p=14272
This is a current open notebook discussion, http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552 (see comments, currently 67).
… on his blog
![Page 39: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/39.jpg)
![Page 40: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/40.jpg)
![Page 41: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/41.jpg)
COMMUNITY INVOLVEMENT
![Page 42: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/42.jpg)
Crystallography – a model for Data Management
• Pro-active, friendly international community
• Committed active International Union (IUCr)
• Data publication valued (1960-present)
• Community develops semantics/dictionaries
• Committed volunteer software innovators
• Heavily Open approach
• Massive and valuable re-use of data
• Culture of validation/reproducibility
• Respect and credit for tool development
![Page 43: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/43.jpg)
IUCr DICTIONARIES
![Page 44: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/44.jpg)
IUCr VALIDATION CRITERIA/TOOLS
![Page 45: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/45.jpg)
DATA
![Page 46: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/46.jpg)
PUBLICLY VALIDATED TRUSTABLE SCIENCE
![Page 47: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/47.jpg)
Where to reposit published crystallography?
Proteins -> PDB, Open
BUT
Inorganics -> ICSD Closed
Organics -> Cambridge (CCDC) Closed
SO
The community has built a Crystallography Open Database
![Page 48: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/48.jpg)
Restrictions on Re-use of Crystallographic data
NOTE: The CCDC is based on data contributed by scientists as part of publication and validation
Crystallographic data from publications now belongs to CCDC
![Page 50: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/50.jpg)
Interactive OPEN crystal search tool
![Page 51: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/51.jpg)
![Page 52: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/52.jpg)
Panton Fellows (Early Career Researchers)
Panton Principles of Open Scientific Data 2010
Publish data openly (CC0) and record your wishes
![Page 53: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/53.jpg)
Sophie Kershaw, Panton Fellow: Doctoral Training in Oxford
![Page 54: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/54.jpg)
Sophie Kershaw, Panton Fellow
![Page 55: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/55.jpg)
Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication permitted between groups
• Attempt to reproduce existing literature
• Deliver a coherent research story by the end of Phase 1
Phase 2: Successor
• Communication between groups still prohibited
• Validate and develop the inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open science culture & techniques
• First-hand application to own research work
• Version control using GitHub
• Daily group supervision
![Page 56: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/56.jpg)
So first-year grad students should be trained by…
… third-year graduate students
![Page 57: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/57.jpg)
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
![Page 58: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/58.jpg)
Authors don’t deposit data (Ross Mounce)
![Page 59: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/59.jpg)
http://www.slideshare.net/rossmounce/the-pluto-project-ievobio-2014
![Page 60: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/60.jpg)
And we did it as Open Notebook Science
all data and code on Github
Discussion on public Discourse Tool
![Page 61: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/61.jpg)
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
![Page 62: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/62.jpg)
4300 images in Github
![Page 63: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/63.jpg)
“Root”
We analysed every pixel
![Page 64: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/64.jpg)
Copyright, Open Access, and Human Rights, March 13, 2015, Kevin Smith, J.D.
The United Nations Human Rights Council … a report from Farida Shaheed, who is a “Special Rapporteur” in the area of “cultural rights.” … its frank recognition that intellectual property laws are in tension with the fundamental human right of access to science and culture … WIPO is charged, whether effectively or not, to find ways to facilitate open access to science and culture … this is not just a “what’s best for academia and for my interests” issue, but a true human rights issue. Ms. Shaheed’s report makes this case in a concise and compelling way …
Source: http://blogs.library.duke.edu/scholcomm/2015/03/13/copyright-open-access-and-human-rights/#sthash.6CQiMoiV.dpuf
Many diagrams had author errors
![Page 65: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/65.jpg)
Supertree created from 4300 papers
![Page 66: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/66.jpg)
Aves
Apterygidae
Marsupialia
Monotremata
Mammalia
Reptilia
Amphibia
Arthropoda
Myriapodia
Okapia johnstoni
Pyrus
Stuffed Tree of Life
![Page 67: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/67.jpg)
Supertree for 924 species
Tree
![Page 68: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/68.jpg)
Can we mine for animals?
Yes, with the Phylogeny Community [*]
[*] overlaps with “Tree of Life”, “Evolutionary Biology” , “Taxonomy”, “Species”
![Page 69: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/69.jpg)
So now we can legally mine the
whole literature in the UK
Yes! And we are starting to do it…
NORMA
![Page 70: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/70.jpg)
So why not Git for Data?
![Page 71: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/71.jpg)
DAT is Git for Data!!
![Page 72: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/72.jpg)
DAT! Queen Mary UL reposits DNA
![Page 73: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/73.jpg)
The John S. and James L. Knight Foundation is an American private, non-profit foundation dedicated to supporting "transformational ideas that promote quality journalism, advance media innovation, engage communities and foster the arts."
DAT supports public data
![Page 74: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/74.jpg)
@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:
"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research Chris Hartgerink
![Page 75: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/75.jpg)
I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress. To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].
In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the number of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30 GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021 GB/min, 0.125 GB/h, 3 GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.
[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2
Chris Hartgerink’s blog post
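The load figures quoted in the blog post can be checked with a few lines of arithmetic. What follows is a minimal Python sketch that uses only the approximate numbers given in the post (30 GB, 10 days, 9 papers per minute). The variable names are illustrative, and the even spacing of requests is an assumption, not something the post states.

```python
# Approximate figures quoted in the blog post.
total_gb = 30.0          # data downloaded from Sciencedirect
days = 10.0              # duration of the download
papers_per_minute = 9    # self-imposed cap on downloads

# Average server load at three time scales.
gb_per_day = total_gb / days
gb_per_hour = gb_per_day / 24
gb_per_minute = gb_per_hour / 60

# Assumption: requests are spread evenly, so the cap implies this gap.
seconds_between_papers = 60 / papers_per_minute

print(f"{gb_per_day:.1f} GB/day")
print(f"{gb_per_hour:.3f} GB/h")
print(f"{gb_per_minute:.4f} GB/min")
print(f"one paper every {seconds_between_papers:.1f} s (if evenly spaced)")
```

The three averages agree with the 3 GB/day, 0.125 GB/h and 0.0021 GB/min stated in the post.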
![Page 76: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/76.jpg)
Some Children of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded (WP). J-C promoted these ideas with UNDERGRADUATE scientists.
[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge.
Sophie Kay
![Page 77: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/77.jpg)
![Page 78: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/78.jpg)
![Page 79: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/79.jpg)
OPEN CLOSED
Zenodo Figshare
Git
Dat
OpenOffice Word, PPT
LabTrove, cheminfo.org Chemdraw
CrystallographyOpenDB Cambridge Cryst data Centre
WriteLatex / Overleaf
ReadCube, Symplectic,
![Page 80: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/80.jpg)
Henry,
Where and what is your latest repository and can I demonstrate it? This will be better than pointing to some dead Quixote site. And any blog posts would be useful. Happy to talk today if you are free.
Chempound is still running, e.g. http://chempound.ch.ic.ac.uk:8090/content/f0705698-39fa-4279-b736-f2fdca571e7b/
Timed out... Is it running?
There have been firewall problems in the past. I thought they were fixed. Will check.
Do you have blog posts which show either (a) how the repository is set up or (b) an Open Notebook approach to a project, where you discuss a problem before it is formally published? Both of these would be very useful.
• This is a current open notebook discussion: http://www.ch.imperial.ac.uk/rzepa/blog/?p=15552 (see comments, currently 67).
• This is an earlier one: http://www.rzepa.net/blog/?p=14272 (with 86 comments); it also incorporates JSmol to visualise all the data.
• This one starts discussion as an open notebook: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12115 with the resulting formal publication at 10.1002/jcc.23985
• This was the original open notebook post: http://www.ch.imperial.ac.uk/rzepa/blog/?p=984 with the resulting formal publication at 10.1038/NCHEM.596
• This one incorporates open data into its citation list: http://www.ch.imperial.ac.uk/rzepa/blog/?p=15505 and is also an open notebook follow-up to my PhD thesis work, formally published in 1975 or so, thus operating in reverse to the above.
• This shows some end outcomes: http://www.ch.imperial.ac.uk/rzepa/blog/?p=15313
• This shows the principles: http://www.ch.imperial.ac.uk/rzepa/blog/?p=10972
• This is an introductory tutorial: http://www.ch.imperial.ac.uk/rzepa/blog/?p=14454
• This is a critique: http://www.ch.imperial.ac.uk/rzepa/blog/?p=13826
• This is a “convincing case”: http://www.ch.imperial.ac.uk/rzepa/blog/?p=13248
• This is about metadata: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12932
• And its use: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12526
• You have seen this data nightmare before: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12728
• This is about ORCID: http://www.ch.imperial.ac.uk/rzepa/blog/?p=12513
![Page 81: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/81.jpg)
![Page 82: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/82.jpg)
Open Source software inspires Open Science
Jean-Claude Bradley 2006
![Page 83: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/83.jpg)
![Page 84: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/84.jpg)
![Page 85: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/85.jpg)
![Page 86: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/86.jpg)
![Page 87: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/87.jpg)
![Page 88: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/88.jpg)
Ross Mounce (Bath), Panton Fellow
• Sharing research data: http://www.slideshare.net/rossmounce
• How-to figures from PLOS/One [link]: Ross shows how to bring figures to life:
• PLOSOne at http://bit.ly/PLOStrees
• PLOS at http://bit.ly/phylofigs (demo)
![Page 89: ContentMine.org and UniversityOfCambridgelearn-rdm.eu/wp-content/uploads/reasearchdata-160129185840-2.pdf · Themes •Highly domain-dependent (chem, cryst, phylo) •Requires community](https://reader035.fdocuments.us/reader035/viewer/2022071114/5feb25ae7c3fbe04bd6a3971/html5/thumbnails/89.jpg)