NIH Data Catalog - Updated Results

Building an NIH Data Catalog: Bit by Bit. Kevin Read, Board of Regents, September 11, 2013.


Page 1: NIH Data Catalog - Updated Results

Building an NIH Data Catalog: Bit by Bit

Kevin Read

Board of Regents

September 11, 2013

Page 2: NIH Data Catalog - Updated Results


Page 3: NIH Data Catalog - Updated Results

Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central

Page 4: NIH Data Catalog - Updated Results

NIH-funded articles for 2011: 113,089

Exclusion steps: Non-PMC Articles, Non-research Articles, Molecular Sequence Data MH, SI Field, PMC Acknowledgements, XML

Article counts shown: 88,592; 78,901; 75,441; 71,913; 71,680; 69,857

Remaining articles with orphaned data sets

Page 5: NIH Data Catalog - Updated Results

SI Field Exclusions

Bar chart of excluded articles (axis 0 to 1600): ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM

Page 6: NIH Data Catalog - Updated Results

PMC Acknowledgements Exclusions

Bar chart of excluded keywords (axis 0 to 800): PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase

Page 7: NIH Data Catalog - Updated Results

XML Keyword Exclusions

Bar chart of excluded keywords (axis 0 to 600): GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords

FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database

GenBank:PDB
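
The exclusion slides name repositories whose mention removes an article from the pool of orphaned data sets. The slides do not say how the matching was performed, so the Python below is only a minimal sketch under stated assumptions: the keyword list is copied from the charts and may be incomplete, matching is whole-word and case-insensitive, and the names `find_repositories` and `is_orphan_candidate` are hypothetical.

```python
import re

# Repository names copied from the exclusion charts (possibly incomplete).
REPOSITORY_KEYWORDS = [
    "PDB", "ClinicalTrials.gov", "GenBank", "GEO", "dbSNP", "dbGaP", "SRA",
    "FlyBase", "WormBase", "MGI", "RGD", "DIP", "BioGRID", "VectorBase",
]


def find_repositories(text):
    """Return the repository keywords mentioned in text, in list order.

    Assumption: whole-word, case-insensitive matching. Short acronyms
    such as DIP can therefore produce false positives on ordinary words.
    """
    found = []
    for keyword in REPOSITORY_KEYWORDS:
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


def is_orphan_candidate(text):
    """True when no known repository is mentioned in text."""
    return not find_repositories(text)


if __name__ == "__main__":
    # Made-up acknowledgement sentences, for illustration only.
    print(find_repositories("Coordinates were deposited in the PDB and sequences in GenBank."))
    print(is_orphan_candidate("We thank the staff for technical help."))
```

An article with more than one hit would correspond to the "Multiple Keywords" category on the XML slide; a real pipeline would need manual review of short acronyms.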

Page 8: NIH Data Catalog - Updated Results

NIH-sponsored data repositories have now been added to PubMed and PMC search indexes

Page 9: NIH Data Catalog - Updated Results

383

What category of data set was used for the research described in the article?

Were live human or animal subjects used in the collection of the data?

What were the subject(s) of study (from which or whom the data was collected)?

If new data set(s) were created, what type(s) of data were collected?

What existing data set(s), if any, were used?

How many data sets are there in each article?

Page 10: NIH Data Catalog - Updated Results


Measuring blood pressure in mice

Measuring left hemisphere of brain for growth factor

Staining and imaging

Analysis of images using software

Page 11: NIH Data Catalog - Updated Results

Phase One Results

Page 12: NIH Data Catalog - Updated Results

Average number of data sets per article:

2.92

Page 13: NIH Data Catalog - Updated Results

% of data sets that use live subjects

54%

H: 51%

Animal: 49%

Page 14: NIH Data Catalog - Updated Results

% of new data: 87%

% of data created using pre-existing data sets: 13%

Page 15: NIH Data Catalog - Updated Results

Data Types

Image

Genetic or Genomic

Chemical

Biochemical

Electrical (Electrophysiological)

Optical – non-image

Behavioral

Computational Simulation or model

Magnetic Resonance – non-image

Structural

Physiological

Questionnaire/Survey

Clinical Measures

Geospatial

INSUFFICIENT

Page 16: NIH Data Catalog - Updated Results

Inter-rater Reliability:

Bar chart of the total number of datasets found per 25 articles (axis 0 to 800), with series Total # of datasets (High) and Total # of datasets (Low)

43%

Page 17: NIH Data Catalog - Updated Results

How do we define a data set?


DATA SET

Page 18: NIH Data Catalog - Updated Results

How do we define a data set?


DATA SETS

Page 19: NIH Data Catalog - Updated Results

How do we define a data set?


DATA SETS

Page 20: NIH Data Catalog - Updated Results

Where in the collection/processing pipeline should data be described?

Page 21: NIH Data Catalog - Updated Results

How do we assign data types to NIH-funded data sets?

Page 22: NIH Data Catalog - Updated Results

What data should be shared in an NIH Data Catalog?

Data sets that can be repurposed

Data sets that make an article easier to understand

Page 23: NIH Data Catalog - Updated Results

Acknowledgements

Project Sponsors: Jerry Sheehan & Mike Huerta

Special Thanks: Lou Knecht & Jim Mork

Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott

Support: Kathel Dunn & David Gillikin

Library Operations: Joyce Backus & Dianne Babski

NLM Leadership: Donald Lindberg & Betsy Humphreys

All images are CC

Page 24: NIH Data Catalog - Updated Results

Minimal Metadata

Common Metadata Elements: Proposed Definition

Data Unique Identifier: A unique ID string that identifies a data set within the catalog

Author: Individuals involved in producing or contributing to data

Affiliation: Affiliation of each author associated with the appropriate author occurrence

Data Title: Name or title by which the data set is known

Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number

Date: The year, month, and day when the data was made available

Data Description (structured narrative): Structured narrative description for efficient indexing

Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)

PMID: Identifier that will link the data set to its associated article(s) and be provided for the data catalog entry

Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it

Award Number: Grant/award numbers associated with the data set

Related Data: Data that was used in the creation of the new data set
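
The minimal metadata elements above could be represented as a simple record type. The Python below is a sketch only, not a schema from the presentation: the class and field names (`DataCatalogEntry`, `Author`, `empty_elements`, and so on) are hypothetical, the choice of types is an assumption, and every value in the example entry is made up.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Dict, List


@dataclass
class Author:
    name: str
    affiliation: str  # affiliation tied to this author occurrence


@dataclass
class DataCatalogEntry:
    data_unique_identifier: str       # unique ID within the catalog
    authors: List[Author]
    data_title: str
    data_location: str                # holding entity plus accession number
    date_available: date              # year, month, and day of release
    data_description: str             # structured narrative
    data_descriptors: Dict[str, str]  # controlled labels, e.g. Organism
    pmids: List[str]                  # links to associated article(s)
    availability: str                 # whether and how the data can be accessed
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)

    def empty_elements(self) -> List[str]:
        """Names of elements that were left empty in this entry."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Entirely made-up values, for illustration only.
example = DataCatalogEntry(
    data_unique_identifier="EXAMPLE-0001",
    authors=[Author(name="A. Researcher", affiliation="Example University")],
    data_title="Example blood pressure measurements in mice",
    data_location="Example Repository, accession EX000001",
    date_available=date(2013, 9, 11),
    data_description="Blood pressure readings collected from mice.",
    data_descriptors={"Organism": "Mus musculus"},
    pmids=["00000000"],
    availability="Available on request",
    award_numbers=["EXAMPLE-AWARD-01"],
)

print(example.empty_elements())  # ['related_data']
```

Keeping Author and Affiliation together in one nested record is one way to honor the requirement that each affiliation stay attached to the right author occurrence.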