My data, your data, our data - increasing data value through reuse (Eurocris2014 keynote)
My Data, Our Data, Your Data: data reuse through data management
Kevin Ashley, Digital Curation Centre
www.dcc.ac.uk @kevingashley
Reusable with attribution: CC-BY
The DCC is supported by Jisc
A summary
• Why data reuse?
• What stops us?
• How data management helps
• Harmonising the goals of research administration and research
• Barriers again
• The case for reuse - again
2014-05-14 Kevin Ashley – Eurocris2014 - CC-BY
My home – the DCC
• Mission – to increase capability and capacity for research data services in UK institutions
• Not just a UK problem – an international one
• Training, shared services, guidance, policy, standards, futures
What is data curation?
• “Maintaining, preserving and adding value to research data throughout its lifecycle”
• More than preservation:
– Active management – dealing with change
• Less than preservation:
– Lifecycle sometimes involves destruction
DCC guidance
SWEDEN
DENMARK
CANADA
Data reuse stories
• The palaeontologist who saved years of work with archaeological data
What a palaeontologist looks at
[Timeline figure: now back to 100 million years ago, marked at 1m, 25m, 50m and 75m years; a second scale expands the most recent 1 million years, marked at 100,000, 500,000 and 750,000 years]
What an archaeologist looks at
[Timeline figure: now back to 1 million years ago, marked at 100,000, 500,000 and 750,000 years; a second scale expands the most recent 100,000 years, marked at 25,000, 50,000 and 75,000 years]
Data reuse stories
• The 19th-century ships’ logs that help us model climate change
The Old Weather project
Data for research, not from research
Data reuse stories
• The ‘noise’ from research radar that mapped dust from Eyjafjallajökull
Data reuse - messages
• Often your data tells stories that your publications do not
• Not all data comes from other researchers
• One person’s noise is another person’s signal
• Discipline-bounded data discovery doesn’t give us all we need or want
Why care?
• Data is expensive – an investment
• Reuse:
– More research
– Teaching & Learning
– Planning
• Impact – with or without publication
• Accountability
• Legal & regulatory requirements
Why does this matter?
• Research quality
– How close can we get to the truth?
• Research speed
– How quickly can we get to the truth?
• Research finance
– How much does the truth cost?
• Improving one or more of these is of interest to all actors:
– Researchers as data creators
– Researchers as data reusers
– Research institutions
– Funders – hence government and society
G8 UK
• Endorses OA
• Open Data Charter
• Policy Paper, 18 June 2013
Funder requirements
• UK
• USA – NSF, NEH, NIH
• Europe
• Most place burden on researcher – some on the institution
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx
RCUK policy - The 1-minute version
• Research data are a public good – make openly available in timely & responsible way
• Have policies & plans. Data with long-term value should be preserved & usable
• Metadata for discovery & reuse. Link publications & data
• Sometimes law, ethics get in the way. We understand.
• Limited embargos OK. Recognition is important – always cite data sources
• OK to use public money to do this. Do it efficiently.
EPSRC policy points
• Awareness of regulatory environment
• Data access statement
• Policies and processes
• Data storage
• Structured metadata descriptions
• DOIs for data
• Securely preserved for a minimum of 10 years from last use
Compliance expected by 2015
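A minimal sketch of the "10 years from last use" rule above, assuming "last use" can be reduced to a single recorded date. The function names and the handling of 29 February are illustrative assumptions, not part of the EPSRC policy text.

```python
from datetime import date

# From the slide: "a minimum of 10 years from last use".
RETENTION_YEARS = 10


def earliest_disposal_date(last_use: date, years: int = RETENTION_YEARS) -> date:
    """Return the first date on which the minimum retention period has elapsed."""
    try:
        return last_use.replace(year=last_use.year + years)
    except ValueError:
        # last_use was 29 February and the target year is not a leap year;
        # assumption: roll forward to 1 March rather than back to 28 February.
        return date(last_use.year + years, 3, 1)


def may_be_disposed(last_use: date, today: date) -> bool:
    """True once the minimum retention period since the last use has passed."""
    return today >= earliest_disposal_date(last_use)
```

Because the clock restarts whenever the data is used, the date has to be recomputed from the latest recorded use rather than fixed at deposit.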
DCC Policy Summary
http://www.dcc.ac.uk/resources/policy-and-legal
Findable, citable data has value
• Important to link publications to data (and vice versa)
• Increases citations – of data & publication
• Increases reuse (hence value)
• But effects exist even without publication, if data is:
– Archived
– Citable
– Discoverable
MORAL: build a data registry
What stops data reuse
• Loss
• Destruction
• Pride
• Gluttony
• Ineptitude
• Concealment
• Bureaucracy
• Complexity
• Procrastination
• Lack of potential
“Departments don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss”
Incremental Project Report, June 2010
http://www.flickr.com/photos/mattimattila/3003324844/
How people talk about data
• I put my data in figshare and I got a DOI for it
• Not our data; the university’s data; my funder’s data; the data; the people’s data; your data.
Data ownership – it’s messy
• You need ownership to make data free
• Governments may assert this
• Industrial collaborators – understanding role of public funding
• Research admin tracks the rules
ON METADATA
Disciplines – current state
• Typically specialised
• Focussed on discipline-specific concerns
• Frequently embedded – hence processing required to expose independently
• Historic failure to express generic concepts generically
– Place
– Time
Understanding Data Requirements
http://www.dcc.ac.uk/
Data centres are good value!
• See Jisc reports on ADS, BADC, UKDA:
• Returns on investment between 400% and 1200%
Integrity
• Not everyone publishes here
• Almost all fraud connected to unavailable data
• People suffer & die due to research fraud
• When your research is reproducible – it gets cited
Integrity – not without data
• Cyril Burt
– Twin studies on intelligence.
– Questioned 1976; now discredited
• Duke case
– Data hiding leads to wasted treatments, clinical trials, probable death & huge lawsuits
• Dutch cases
– Stapel – 55 publications – “fictitious data”
– Poldermans – fabricated data or negligence?
“The case for open data: the Duke Clinical Trials” – blog post, Kevin Ashley, http://www.dcc.ac.uk/news/case-open-data-duke-clinical-trials
“Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud?” – Doorn, Dillo, van Horik, IJDC 8(1); doi:10.2218/ijdc.v8i1.256
Citability
• Making data available increases citations
• Everyone – academic, funder, institution – loves citations
• Want evidence?
– Alter, Pienta, Lyle – 240%, social sciences *
– Piwowar, Vision – 9% (microarray data) †
– Henneken, Accomazzi – 20% (astronomy) #
* Pienta A, Alter G, Lyle J. (2010) The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. http://hdl.handle.net/2027.42/78307
† Piwowar H, Vision TJ. (2013) Data reuse & the open data citation advantage. PeerJ PrePrints 1:e1v1 http://dx.doi.org/10.7287/peerj.preprints.1v1
# Henneken E, Accomazzi A. (2011) Linking to Data - Effect on Citation Rates in Astronomy. http://arxiv.org/abs/1111.3618
How to cite data
What data to keep
The Data Deluge is upon us
Sensors’ ability to produce data outstrips IT’s ability to process it
Roles and Responsibilities
What data to keep
Excuses – and responses
• “People will ask questions”
– So use a data centre or repository
• “It will be misinterpreted”
– Stuff happens. Also, openness encourages correction
• “It’s not interesting”
– Let others be the judge – your noise is my signal
• “I might get another paper out of it”
– Up to a point. We might get more research out of it
• “I don’t have permission”
– A real problem. But solvable at senior level
• “It’s too bad/complicated”
– See above
• “It’s not a priority”
– Unfortunately, funders are making it so. But if you looked at the evidence, it would be your priority as well
See e.g. Carly Strasser’s blog: http://datapub.cdlib.org/2013/04/24/closed-data-excuses-excuses/
Should all data be open?
• NO
• Many reasons – most to do with human subjects
• But data existence should always be open
• Allows discovery & negotiation on use
• Avoids pointless replication
Some conundrums
• Releasing genome data is OK when it’s:
– An identified human subject
– An anonymous human subject
– Your pet dog
– Another mammal
– An insect
– A plant
– A virus
It’s amazing what people will share…
Data reuse from Hubble
Pimp your data – make it findable & reusable
2014-04-25 Kevin Ashley, DCC – SocSciScot14 - CC-BY
Gking.harvard.edu/data
Data is variable
• Not always textual
• Not always tabular
• Not always fixed – continual change
• Not always clearly authored – think of archival provenance
• Not always associated with publication
• Often with indistinct boundaries
• Multi-dimensional and non-linear
Some messages for you
• Some things we need to know about data:
– When/where/what is it about?
– Who owns it
– What rights apply
– What it is derived from & how
– What software may be associated
– What data management plan applies
– How do I gain access?
– Where is it?
– When was/will it be destroyed?
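The questions above can be sketched as a single registry record. This is an illustrative Python structure, not a CERIF or DCC schema: every field name and the example values are assumptions chosen to mirror the list.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One registry entry; field names are hypothetical, one per question."""
    title: str
    subject: str                              # what is it about?
    coverage_time: Optional[str] = None       # when is it about?
    coverage_place: Optional[str] = None      # where is it about?
    owner: Optional[str] = None               # who owns it
    rights: Optional[str] = None              # what rights apply
    derived_from: List[str] = field(default_factory=list)  # source datasets
    derivation_method: Optional[str] = None   # how it was derived
    software: List[str] = field(default_factory=list)      # associated software
    data_management_plan: Optional[str] = None
    access_procedure: Optional[str] = None    # how do I gain access?
    location: Optional[str] = None            # where is it?
    destruction_date: Optional[str] = None    # when was/will it be destroyed?


def missing_fields(record: DatasetRecord) -> List[str]:
    """List the questions this record leaves unanswered."""
    return [name for name, value in asdict(record).items()
            if value in (None, [], "")]


# Hypothetical example values, for illustration only.
example = DatasetRecord(
    title="Example logbook transcriptions",
    subject="Weather observations",
    owner="Example University",
)

# A record is plain data, so it can be exposed in bulk as JSON.
example_json = json.dumps(asdict(example), indent=2)
```

Calling `missing_fields(example)` shows which of the questions still need an answer before the record is complete.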
What about your data?
• If administrative data isn’t freely available, why not?
• Expose it in bulk – not just as a web page
• Gain the value from your overheads!
What about collaboration?
• Collaborate within the university
• Collaborate with partners
• Collaborate with regional, national services
• Not everything can be done well locally
• Some examples…
http://dataintelligence.3tu.nl/en/home/
http://www.sheffield.ac.uk/is/research/projects/rdmrose
Choice of RDM training materials for librarians
Up-skilling for data
http://datalib.edina.ac.uk/mantra/libtraining.html
My message to researchers
• The credit belongs to you
• The data belongs to all of us
• Share, and we all reap the benefits