Building an NIH Data Catalog: Bit by Bit

Kevin Read, NLM Associate Fellowship Presentation, July 24, 2013

description

OBJECTIVE: The purpose of the project was to a) develop a set of core, minimal metadata elements that would be used to describe data sets, and b) carry out a study to identify data sets in NIH-funded articles from PubMed and PubMed Central (PMC) that do not indicate that their data is stored in a specific place such as a repository or registry. These efforts will inform the BD2K initiative and a planned NIH Data Catalog.

METHODS: The metadata schemas of all NIH data repositories were analyzed. Commonalities among these repositories were identified, mapped to existing data-specific metadata standards from DataCite and Dryad, and then integrated into MEDLINE XML metadata in an attempt to establish a sustainable and integrated metadata schema. The second phase of the project identified data sets in articles from PubMed and PMC by searching specifically for NIH-funded articles from the year 2011. After excluding articles that mention data sets being deposited in existing repositories, thirty staff members from NLM and BD2K were recruited to analyze a random sample of the results and identify how many and what types of data sets were created per article.

RESULTS: A preliminary set of minimal metadata elements was developed that could sufficiently describe NIH-funded data sets and be integrated within MEDLINE’s schema with minor additions. Results of the second phase, the analysis of PubMed and PMC articles for data sets, are pending until all submissions from NLM staff are complete.

CONCLUSION: The efforts to develop a minimal set of metadata elements and to identify the number and types of data sets produced in NIH-funded articles will inform the BD2K initiative's work to build an NIH Data Catalog.

Transcript of Building an NIH Data Catalog: Bit by Bit

Page 1: Building an NIH Data Catalog: Bit by Bit

Building an NIH Data Catalog: Bit by Bit

Kevin Read

NLM Associate Fellowship Presentation

July 24, 2013

Page 2: Building an NIH Data Catalog: Bit by Bit

NIH Big Data to Knowledge

Facilitating Broad Use of Biomedical Big Data

Page 3: Building an NIH Data Catalog: Bit by Bit

NIH Data Catalog

What is it designed to do?

Page 4: Building an NIH Data Catalog: Bit by Bit

NIH Data Catalog

Data sets are CITABLE

Data sets are DISCOVERABLE

Data sets are LINKED TO THE LITERATURE

Data sets are PART OF THE RESEARCH ECOSYSTEM

Page 5: Building an NIH Data Catalog: Bit by Bit

NIH Data Catalog

What do we need to know in order to build it?

Minimal Metadata Elements

How do current data repositories describe their data?

Orphaned Data sets

How many data sets are not currently represented in a data repository?

Page 6: Building an NIH Data Catalog: Bit by Bit

Finding Common Metadata Elements

Exploring how NIH Data Repositories describe their data


Page 7: Building an NIH Data Catalog: Bit by Bit


Page 8: Building an NIH Data Catalog: Bit by Bit

Categorizing Metadata Descriptors

Common Metadata Elements

Authorship

Data Description

Title Information


Page 9: Building an NIH Data Catalog: Bit by Bit

Identifying Metadata Variations

Date: Study Date, Date Processed, Release Date, Completion Date, Last Updated Date, Prepared on Date

Authorship: Authors, Creators, Data Provider, Principal Investigator(s), Contributors, Data Authors
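One way to handle these variations is a lookup table that collapses each repository-specific label onto its common element. The sketch below is illustrative only: the labels come from this slide, while the `VARIANT_TO_COMMON` table and the `normalize_label` helper are hypothetical names, not part of the project.

```python
from typing import Optional

# Hypothetical crosswalk: labels are taken from the slide, names are illustrative.
VARIANT_TO_COMMON = {
    "study date": "Date",
    "date processed": "Date",
    "release date": "Date",
    "completion date": "Date",
    "last updated date": "Date",
    "prepared on date": "Date",
    "authors": "Authorship",
    "creators": "Authorship",
    "data provider": "Authorship",
    "principal investigator(s)": "Authorship",
    "contributors": "Authorship",
    "data authors": "Authorship",
}


def normalize_label(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unmapped."""
    # Lowercase and collapse whitespace so minor formatting differences still match.
    key = " ".join(label.lower().split())
    return VARIANT_TO_COMMON.get(key)


print(normalize_label("Principal Investigator(s)"))
print(normalize_label("  Release   Date "))
print(normalize_label("Funding"))
```

Labels that are not in the table return `None`, which flags them for manual review rather than guessing a mapping.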

Page 10: Building an NIH Data Catalog: Bit by Bit

Mapping Metadata Commonalities to Existing Standards

Common Metadata Elements


Page 11: Building an NIH Data Catalog: Bit by Bit

Mapping Metadata to MEDLINE

Common Metadata Element: Proposed Definition

Data Unique Identifier: A unique ID string that identifies a data set within the catalog

Author: Individuals involved in producing or contributing to the data

Affiliation: Affiliation of each author, associated with the appropriate author occurrence

Data Title: Name or title by which the data set is known

Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number

Date: The year, month, and day when the data was made available

Data Description (structured narrative): Structured narrative description for efficient indexing

Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)

PMID: Identifier that will link the data set to its associated article(s) and be provided for the data catalog entry

Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it

Award Number: Grant/award numbers associated with the data set

Related Data: Data that was used in the creation of the new data set
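The element list above could be represented as a simple record type. The following is a minimal sketch, not the catalog's actual schema: the class name, field names, and field types are assumptions, and the example values are taken from the sample citation elsewhere in this deck. Elements the deck gives no value for are left empty.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class DataCatalogRecord:
    """Hypothetical container for the proposed minimal metadata elements."""
    data_unique_identifier: str
    data_title: str
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    data_location: str = ""          # holding entity plus accession number
    date: str = ""                   # when the data was made available
    data_description: str = ""       # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)


def missing_elements(record: DataCatalogRecord) -> List[str]:
    """List the elements that are still empty on a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DataCatalogRecord(
    data_unique_identifier="DUID00001",
    data_title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    data_location="dbGaP/pht002543.v2.p1",
    pmids=["22123456"],
)
print(missing_elements(record))
```

A completeness check like `missing_elements` is one way to see which elements a source repository does not supply.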

Page 12: Building an NIH Data Catalog: Bit by Bit

Data Catalog Citation

Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.

SI: dbGaP/pht002543.v2.p1

Labeled citation components:

Author

Data Title

Data Description Location

Date of NIH Data Catalog issue

NIH Data Catalog Volume (Issue)

Data Unique Identifier

PMID Assigned to NIH Data Catalog Record

Secondary source ID (Link to actual dataset)
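The sample citation can be assembled from its labeled parts. This is a sketch under assumptions: the `format_citation` function and its parameter names are hypothetical, and the way the labels are matched to pieces of the string is a reading of the slide, not a documented NLM format.

```python
from typing import List


def format_citation(authors: List[str], title: str, issue_date: str,
                    volume: str, issue: str, duid: str,
                    pmid: str, secondary_id: str) -> str:
    """Build a two-line citation in the style shown on the slide."""
    main = (
        f"{', '.join(authors)}. {title}. NIH Data Catalog. "
        f"{issue_date};{volume}({issue}):{duid}. PubMed PMID: {pmid}."
    )
    # The SI line carries the secondary source ID that links to the data set itself.
    return main + "\n" + f"SI: {secondary_id}"


citation = format_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    issue_date="2014 Jan",
    volume="1",
    issue="1",
    duid="DUID00001",
    pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```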

Page 13: Building an NIH Data Catalog: Bit by Bit

Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central

Page 14: Building an NIH Data Catalog: Bit by Bit

NIH-funded articles for 2011

Funnel figure: successive exclusions (Non-PMC Articles, Non-research Articles, Molecular Sequence Data MH, SI Field, PMC Acknowledgements, XML) leading to the remaining articles with orphaned data sets

Article counts shown on the figure: 113,089; 88,592; 78,901; 75,441; 71,913; 71,680; 69,857
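The exclusion process on this slide amounts to applying a series of filters and recording how many articles survive each one. The sketch below illustrates that idea on a handful of made-up records. The article fields (`in_pmc`, `is_research`, `mesh`, `si`, `ack_mentions_repo`, `xml_mentions_repo`), the order of the steps, and the toy data are all assumptions for illustration and do not reproduce the actual search strategy or counts.

```python
from typing import Callable, Dict, List, Tuple

Article = Dict[str, object]


def make_article(pmid: str, **overrides) -> Article:
    """Build a toy article record; every field here is hypothetical."""
    base: Article = {
        "pmid": pmid, "in_pmc": True, "is_research": True, "mesh": [],
        "si": [], "ack_mentions_repo": False, "xml_mentions_repo": False,
    }
    base.update(overrides)
    return base


# Each step pairs a label from the slide with a predicate that marks an article for exclusion.
STEPS: List[Tuple[str, Callable[[Article], bool]]] = [
    ("Non-PMC Articles", lambda a: not a["in_pmc"]),
    ("Non-research Articles", lambda a: not a["is_research"]),
    ("Molecular Sequence Data MH", lambda a: "Molecular Sequence Data" in a["mesh"]),
    ("SI Field", lambda a: bool(a["si"])),
    ("PMC Acknowledgements", lambda a: bool(a["ack_mentions_repo"])),
    ("XML", lambda a: bool(a["xml_mentions_repo"])),
]


def apply_exclusions(articles: List[Article]):
    """Apply each exclusion in turn and record the count remaining after it."""
    remaining = list(articles)
    counts = [("start", len(remaining))]
    for name, should_exclude in STEPS:
        remaining = [a for a in remaining if not should_exclude(a)]
        counts.append((name, len(remaining)))
    return remaining, counts


toy = [
    make_article("A"),
    make_article("B", in_pmc=False),
    make_article("C", mesh=["Molecular Sequence Data"]),
    make_article("D", si=["placeholder accession"]),
    make_article("E", xml_mentions_repo=True),
]
remaining, counts = apply_exclusions(toy)
for name, n in counts:
    print(f"{name}: {n}")
```

Articles that pass every filter are the candidates for "orphaned" data sets.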

Page 15: Building an NIH Data Catalog: Bit by Bit

SI Field Exclusions

Bar chart (axis 0 to 1,600), Excluded Articles by SI field source: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM

Page 16: Building an NIH Data Catalog: Bit by Bit

PMC Acknowledgement Exclusions

Bar chart (axis 0 to 800), Excluded keywords: PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, Flybase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase

Page 17: Building an NIH Data Catalog: Bit by Bit

XML Keyword Exclusions

Bar chart (axis 0 to 600), Excluded keywords: GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, Flybase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords

FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database

GenBank:PDB
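A keyword exclusion of this kind can be approximated with a whole-word search over the article text, placing articles that match more than one repository name in a "Multiple Keywords" bucket. This is a rough sketch: the keyword list is copied from the chart, but the matching rules (whole-word match, case-sensitive for all-uppercase acronyms) and the function names are assumptions, not the method actually used.

```python
import re
from collections import Counter
from typing import List, Optional

KEYWORDS = [
    "GenBank", "PDB", "GEO", "dbSNP", "ClinicalTrials.gov", "RGD", "FlyBase",
    "SRA", "DIP", "dbGaP", "WormBase", "MGI", "BioGRID", "VectorBase",
]


def find_keywords(text: str) -> List[str]:
    """Return the repository keywords that appear as whole words in the text."""
    hits = []
    for kw in KEYWORDS:
        # All-uppercase acronyms (e.g. DIP) are matched case-sensitively to avoid
        # hitting ordinary words; mixed-case names are matched case-insensitively.
        flags = 0 if kw.isupper() else re.IGNORECASE
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=flags):
            hits.append(kw)
    return hits


def bucket(hits: List[str]) -> Optional[str]:
    """Map a list of hits to a chart category, or None if nothing matched."""
    if not hits:
        return None
    return "Multiple Keywords" if len(hits) > 1 else hits[0]


def tally(texts: List[str]) -> Counter:
    """Count excluded articles per chart category."""
    buckets = (bucket(find_keywords(t)) for t in texts)
    return Counter(b for b in buckets if b is not None)


texts = [
    "Coordinates were deposited in the PDB and sequences in GenBank.",
    "Data are available from flybase.",
    "There was a dip in enrollment during the study.",
]
print(tally(texts))
```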

Page 18: Building an NIH Data Catalog: Bit by Bit

Total # of articles collected for 2011 after exclusion:

69,657

Random sample with 95% confidence interval:

383
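A sample of 383 is consistent with the standard sample-size formula with a finite population correction. The slide states only the 95% level, so the 5% margin of error and the 0.5 expected proportion used below are assumptions, as is the `sample_size` function name.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction, rounded up."""
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correct for the finite number of articles available to sample from.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(69657))
```

With these assumed parameters the uncorrected size is about 384.2, and the correction for a population of 69,657 brings it to about 382.1, which rounds up to 383.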

Page 19: Building an NIH Data Catalog: Bit by Bit

383

What category of data set was used for the research described in the article?

Were live human or animal subjects used in the collection of the data?

What were the subject(s) of study (from which or whom the data was collected)?

If new data set(s) were created, what type(s) of data were collected?

What existing data set(s) were used, if any?

How many data sets are there in each article?

Page 20: Building an NIH Data Catalog: Bit by Bit


Measuring blood pressure in mice

Measuring left hemisphere of brain for growth factor

Staining and imaging

Analysis of images using software

Page 21: Building an NIH Data Catalog: Bit by Bit

Preliminary Results

‘Orphaned’ Data

50 articles

Page 22: Building an NIH Data Catalog: Bit by Bit

Average number of data sets per article:

5.84


Page 23: Building an NIH Data Catalog: Bit by Bit

% of data sets that use live subjects: 51%

H: 60%

Animal: 40%

Page 24: Building an NIH Data Catalog: Bit by Bit

% of data sets that were considered to be new: 74%

% of data sets that used existing data with mods or added value: 12%

% of data sets that used existing data as is: 13%

% with no data: 1%

Page 25: Building an NIH Data Catalog: Bit by Bit

% of articles that collected only new data: 56%

% of articles that used only existing data: 32%

% of articles that used a combination of data: 8%

% of articles that used no data: 4%
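Article-level categories like these can be derived from per-data-set annotations. The sketch below shows one plausible rule; the label names, the classification rule, and the toy annotations are all hypothetical and are not the project's data or its documented method.

```python
from collections import Counter
from typing import Dict, List, Tuple

EXISTING = {"existing_modified", "existing_as_is"}


def classify_article(labels: List[str]) -> str:
    """Assign an article-level category from its data-set labels."""
    unknown = set(labels) - EXISTING - {"new"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        return "no data"
    has_new = "new" in labels
    has_existing = any(label in EXISTING for label in labels)
    if has_new and has_existing:
        return "combination"
    return "only new" if has_new else "only existing"


def summarize(articles: Dict[str, List[str]]) -> Tuple[Dict[str, float], float]:
    """Return percentage of articles per category and mean data sets per article."""
    total = len(articles)
    counts = Counter(classify_article(labels) for labels in articles.values())
    shares = {category: 100 * n / total for category, n in counts.items()}
    mean_sets = sum(len(labels) for labels in articles.values()) / total
    return shares, mean_sets


# Made-up annotations for four articles, for illustration only.
toy = {
    "article_1": ["new", "new"],
    "article_2": [],
    "article_3": ["new", "existing_as_is"],
    "article_4": ["existing_modified"],
}
shares, mean_sets = summarize(toy)
print(shares, mean_sets)
```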

Page 26: Building an NIH Data Catalog: Bit by Bit

Data Types

INSUFFICIENT

Page 27: Building an NIH Data Catalog: Bit by Bit

Building an NIH Data Catalog

Questions to Consider


Page 28: Building an NIH Data Catalog: Bit by Bit

What do we consider to be a data set?

All of the data created within a paper?

Multiple data sets of different data types within a paper?

Every individual collection of data within a paper?


Page 29: Building an NIH Data Catalog: Bit by Bit

Where in the collection/processing pipeline should data be described?

Page 30: Building an NIH Data Catalog: Bit by Bit

Is there a convenient way to point to data sets within an article?

Abstract? Labeled area? Reference list?

Page 31: Building an NIH Data Catalog: Bit by Bit

How do we adequately describe data sets so that they are discoverable?

Develop a strategy to create appropriate data descriptors

Page 32: Building an NIH Data Catalog: Bit by Bit

How do we adequately describe data sets so that they are discoverable?

Is there a convenient way to point to data sets within an article?

Where in the data collection/processing pipeline should data be described?

What do we consider to be a data set?

Page 33: Building an NIH Data Catalog: Bit by Bit

Acknowledgements

Project Sponsors: Jerry Sheehan & Mike Huerta

Special Thanks: Lou Knecht & Jim Mork

Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott

Support: Kathel Dunn & David Gillikin

Library Operations: Joyce Backus & Dianne Babski

NLM Leadership: Donald Lindberg & Betsy Humphreys

All images are CC