Claudia Bauzer Medeiros Digital preservation – caring for our data to foster knowledge discovery...

Page 1: Claudia Bauzer Medeiros  Digital preservation – caring for our data to foster knowledge discovery and dissemination

Digital preservation – caring for our data to foster knowledge discovery and dissemination

Claudia Bauzer Medeiros

Institute of Computing

UNICAMP

Page 2:

Prae-servare

(Before) – (Save)

= save before it disappears

Page 3:

Maintain

Manu-tenere

= being able to get/find it

Page 4:
Page 5:

Dec 2008

Feb 2010

Page 6:

Data deluge

• By the end of 2011, the information created and replicated exceeded 1.8 zettabytes

• 90% of data was created in the last 2 years

• A 5-hour flight: 240 terabytes

• Facebook: 200 million users, more than 70 languages

• Each person in England is filmed 300 times a day

• Teenagers in the US send an average of 110 text messages a day

=> We need to build arks during the deluge: PRESERVATION

Page 7:

Outline

• Why preserve?

• What to preserve?

• How to preserve?

• Where to preserve?

And a few associated challenges

Page 9:

WHY PRESERVE

• Costly to produce

• Contribute to progress of science

• Intrinsic value: culture, science, sustainability

Page 10:

WHY PRESERVE

• Costly to produce

– Infrastructure, power, software, models, visualization, people

– Hardware, Software, Peopleware

• Contribute to progress of science

– Reproducibility and reusability

– Publication and sharing

– Quality

• Intrinsic value: culture, science, sustainability

– Digital humanities

– Domesday project

– Fonoteca Neotropical Jacques Vielliard

Page 13:

The Domesday Project 1086-1986

• Digital decay

• Equipment obsolescence

• Software obsolescence

Page 14:

Domesday reloaded

Page 15:

Fonoteca Neotropical Jacques Vielliard

Page 16:
Page 17:
Page 18:

Outline

• Why preserve?

• What to preserve?

• How to preserve?

And associated challenges

Page 21:

What to preserve?

• Data

• BUT what is “data”?

– Files and records

– Models, documentation, annotations, sketches, experiments, recordings

• Only data?

– How it was produced: workflows, devices, methodologies, materials and methods, reasoning, logs (provenance)
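
The provenance items listed above (workflows, devices, methods, logs) can be kept next to a data file as a small structured record. The following is a minimal sketch in Python, not a standard: the class name `ProvenanceRecord`, its field names, and the sample values are all hypothetical and chosen only to mirror the bullet above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProvenanceRecord:
    """Hypothetical record describing how a data file was produced."""
    data_file: str
    workflow: str
    devices: list = field(default_factory=list)
    methods: list = field(default_factory=list)
    log_files: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to plain JSON text so the record can outlive this program.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def load_record(text: str) -> ProvenanceRecord:
    # Rebuild the record from the JSON text produced by to_json().
    return ProvenanceRecord(**json.loads(text))


# Illustrative values only; none of these names come from the talk.
record = ProvenanceRecord(
    data_file="recording_0001.wav",
    workflow="field-recording-v1",
    devices=["portable recorder"],
    methods=["manual annotation"],
    log_files=["session.log"],
)
restored = load_record(record.to_json())
```

Storing the record as plain text rather than in a tool-specific binary format is one way to limit the software obsolescence risk raised earlier in the talk.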

Page 22:

What to preserve?

• Data

• Environment in which it was produced

• The data needed for preservation occupies more space than the data itself

• Preservation means storing more than the object itself

Page 23:

What about our research data? (slide adapted from Jim Gray)

[Diagram labels: Questions, Answers; “Collaboratory”, Data-driven science; Models, Simulations, Papers, Files, Experiments, Instruments; DATA]

Page 24:

Data sources?

Table of Product Characteristics

id          Property name   Value
MilkProd    productsrep     MilkA
MilkProd    quantity        10000
MilkProd    validity date   10/06/2006
CheeseProd  productsrep     Minas
CheeseProd  quantity        2000
CheeseProd  validity date   12/02/2006
CheeseProd  shape           Circular
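
A table like this one stores each property as a separate (id, property name, value) row. A minimal Python sketch of regrouping such rows into one record per product follows. The function name `pivot` and the tuple layout are assumptions made for illustration, and all values are kept as the strings shown on the slide.

```python
from collections import defaultdict

# Rows copied from the slide as (id, property name, value) triples.
ROWS = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "Minas"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot(rows):
    """Group property/value pairs by id into one dict per product."""
    products = defaultdict(dict)
    for product_id, prop, value in rows:
        products[product_id][prop] = value
    return dict(products)


products = pivot(ROWS)
```

Only `CheeseProd` carries a `shape` property. The property-per-row layout lets products have different properties, but a later reader can interpret them only if the property names themselves are documented.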

Page 25:

Environmental Science

• Direct and indirect observations

Page 26:

Data sources

Page 27:

Page 28:

We are DATASCOPE engineers

Software is the device/tool

Page 29:

Outline

• Why preserve?

• What to preserve?

• How to preserve?

And associated challenges

Page 30:

How to preserve?

How to construct the ark during the deluge?

Prae-servare, Manu-tenere and Share

Page 36:

How to preserve?

• To ensure retrievability and sharing

– Index structures

– Ontologies, metadata, keywords, standards

– Workflows

• To ensure longevity

– Media decay, software decay, hardware decay

– PEOPLE DECAY

• To ensure quality

– Curation procedures, metadata, standards

• To afford maintenance costs

– Cloud? CAP theorem? => WHERE
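
One common way to detect media decay is to store a checksum with each file and recompute it later. The sketch below uses Python's standard `hashlib`. The function names and the choice of SHA-256 are assumptions for illustration, not something the talk prescribes.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_digest):
    """True if the file still matches the digest recorded at ingest time."""
    return sha256_of(path) == recorded_digest


# Demonstration on a temporary file: record a digest, then simulate decay.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.dat")
    with open(sample, "wb") as handle:
        handle.write(b"preserved bytes")
    recorded = sha256_of(sample)
    ok_before = is_intact(sample, recorded)

    with open(sample, "wb") as handle:
        handle.write(b"preserved bytez")  # one corrupted byte
    ok_after = is_intact(sample, recorded)
```

A check like this only reports that bytes changed. Recovering them still requires a second copy stored elsewhere, which ties back to the question of where to preserve.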

Page 37:

Sharing and open access

NSF Data Management Policy

Paper and data publication

Page 38:
Page 39:

Sharing of Data Leads to Progress on Alzheimer’s

By GINA KOLATA, The New York Times

Published: August 12, 2010

In 2003, a group of scientists and executives from the National Institutes of Health, the Food and Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups joined in a project that experts say had no precedent: a collaborative effort to find the biological markers that show the progression of Alzheimer’s disease in the human brain.

share all the data, making every single finding public immediately, available to anyone with a computer anywhere in the world

=> AVAILABILITY and REUSE

Page 40:

• Data must be properly curated throughout its life-cycle and released with the appropriate high-quality metadata.

• Medical Research Council UK

Page 41:

• Research data should be made available for use by other researchers. Researchers must retain research data, including electronic data, in a durable, indexed and retrievable form.

• Australian Government National Health and Medical Research Council

Page 42:

Microsoft Academic Search

40M publications

19M authors

75 publishers (Wiley, Springer, ACM, IEEE …)

Page 43:

Google Scholar Citations

Page 44:

• Citing data is as important as citing papers

• For researchers, publishers, data centers

• Over 1M DOIs, several major national research libraries

– Germany, France, Korea, Netherlands, Australia, USA...

• Present manager: German National Library of Science and Technology

Page 45:

Publish on the Cloud

Add metadata

Pre-print sharing

Page 46:

FNJV

proj.lis.ic.unicamp.br/fnjv

• Sharing by publishing on the Web

• Retrievability by extending metadata
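
Retrievability through metadata can be pictured as an inverted index from metadata values to records. The following Python sketch is an assumption about how such an index might look, not a description of the FNJV system. The record identifiers, field names, and values are invented for the example.

```python
from collections import defaultdict

# Hypothetical metadata records; identifiers and fields are illustrative only.
RECORDS = {
    "rec-1": {"species": "bird A", "location": "forest", "year": "1980"},
    "rec-2": {"species": "bird B", "location": "forest", "year": "1995"},
    "rec-3": {"species": "bird A", "location": "wetland", "year": "1995"},
}


def build_index(records):
    """Map every (field, value) pair to the set of record ids carrying it."""
    index = defaultdict(set)
    for record_id, metadata in records.items():
        for field_name, value in metadata.items():
            index[(field_name, value)].add(record_id)
    return index


def search(index, **criteria):
    """Return ids of records that match all given field=value criteria."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(RECORDS)
forest_1995 = search(index, location="forest", year="1995")
```

Adding a metadata field to each record adds new keys to the index, so more queries become answerable without touching the recordings themselves.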

Page 47:
Page 48:
Page 49:

CURATION AND USE OF STANDARDS

Page 50:

Workflows and model preservation

Page 51:
Page 52:

Workflows and model preservation

[Diagram: Comb-e-Chem. Labels: X-Ray e-Lab, Diffractometer, Analysis, Properties, Properties e-Lab, Simulation, Video, Grid Middleware, Structures Database]

Page 53:

The cloud and CAP

Page 54:

Outline

• Why preserve?

• What to preserve?

• How to preserve?

• Where to preserve?

And a few associated challenges

PRE-SAVE and MANU-TENERE

Page 55:

Outline

• Why preserve?

– Costly to produce (hardware, software, peopleware)

– Contribute to progress of science

– Value: culture, science, sustainability

• What to preserve?

– Data [WHAT IS DATA?]

– Context of production and use

• How to preserve?

– Accessibility and sharing: standards, metadata, ontologies

– Integrity and quality: context to use (hw, sw), standards

Page 56:

References

Page 57:

References

NSF – CISE Data management policy

The Domesday Project, http://www.atsf.co.uk/dottext/domesday.html

The CLARIN Project (languages)

Eigenfactor.org

Altmetrics movement

Page 58:

Thank you!!!!