Building an NIH Data Catalog: Bit by Bit
Kevin Read
NLM Associate Fellowship Presentation
July 24, 2013
1
NIH Big Data to Knowledge
Facilitating Broad Use of Biomedical Big Data
2
NIH Data Catalog
What is it designed to do?
3
NIH Data Catalog
Data sets are CITABLE
Data sets are DISCOVERABLE
Data sets are LINKED TO THE LITERATURE
Data sets are PART OF THE RESEARCH ECOSYSTEM
4
NIH Data Catalog
What do we need to know in order to build it?
Minimal Metadata Elements
How do current data repositories describe their data?
Orphaned Data sets
How many data sets are not currently represented in a data repository?
5
Finding Common Metadata Elements
Exploring how NIH Data Repositories describe their data
6
7
Categorizing Metadata Descriptors
Common Metadata Elements
Authorship
Data Description
Title Information
8
Identifying Metadata Variations
Date
Study Date
Date Processed
Release Date
Completion Date
Last Updated Date
Prepared on Date
Authorship
Authors
Creators
Data Provider
Principal Investigator(s)
Contributors
Data Authors
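One way to picture how these variations collapse into common elements is a simple lookup table. The following is a minimal sketch, not the method used in the project: the names `VARIANT_TO_ELEMENT` and `normalize_field` are hypothetical, and the variant labels are only the ones listed on this slide.

```python
from typing import Optional

# Hypothetical lookup from repository-specific labels (lower-cased)
# to the common element each one is a variation of.
VARIANT_TO_ELEMENT = {
    "study date": "Date",
    "date processed": "Date",
    "release date": "Date",
    "completion date": "Date",
    "last updated date": "Date",
    "prepared on date": "Date",
    "authors": "Authorship",
    "creators": "Authorship",
    "data provider": "Authorship",
    "principal investigator(s)": "Authorship",
    "contributors": "Authorship",
    "data authors": "Authorship",
}


def normalize_field(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unknown."""
    return VARIANT_TO_ELEMENT.get(label.strip().lower())
```

Labels that are not in the table come back as `None`, which makes it easy to spot descriptors that still need to be categorized by hand.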
9
Mapping Metadata Commonalities to Existing Standards
Common Metadata Elements
10
11
Mapping Metadata to MEDLINE
Common Metadata Element: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a data set within the catalog
Author: Individuals involved in producing or contributing to the data
Affiliation: Affiliation of each author, associated with the appropriate author occurrence
Data Title: Name or title by which the data set is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month, and day when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)
PMID: Identifier that links the data set to its associated article(s) and is also provided for the data catalog entry
Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the data set
Related Data: Data that was used in the creation of the new data set
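The proposed elements can be read as the fields of a single catalog record. The sketch below is only an illustration under stated assumptions: the class name `CatalogRecord`, the field names, the field types, and the `missing_elements` helper are all hypothetical, and the example values are taken from the sample citation in this deck.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class CatalogRecord:
    """One catalog entry; field names and types are assumptions."""
    data_unique_identifier: str
    data_title: str
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    data_location: Optional[str] = None       # holding entity plus accession number
    date: Optional[str] = None                # when the data was made available
    data_description: Optional[str] = None    # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)
    pmid: Optional[str] = None
    availability: Optional[str] = None
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)


def missing_elements(record: CatalogRecord) -> List[str]:
    """List the elements that are still empty on a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


example = CatalogRecord(
    data_unique_identifier="DUID00001",
    data_title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    data_location="dbGaP/pht002543.v2.p1",
    pmid="22123456",
)
```

Calling `missing_elements(example)` lists the elements that the sample citation alone does not supply, such as the description and the award numbers.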
Data Catalog Citation
Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.
SI: dbGaP/pht002543.v2.p1
Citation components:
Author
Data Title
Data Description Location
Date of NIH Data Catalog issue
NIH Data Catalog Volume (Issue)
Data Unique Identifier
PMID Assigned to NIH Data Catalog Record
Secondary source ID (Link to actual dataset)
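To make the citation layout concrete, the sample citation can be split back into its labelled components. This is a rough sketch that assumes every citation follows exactly the punctuation of the example; the regular expression, the group names, and the `parse_citation` helper are hypothetical, and reading "NIH Data Catalog" as the "Data Description Location" component is an interpretation of the slide rather than something it states.

```python
import re

CITATION = (
    "Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. "
    "Dental Caries: Whole Genome Association and Gene x Environment Studies. "
    "NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456."
)
SECONDARY_ID = "SI: dbGaP/pht002543.v2.p1"

# Assumed layout:
# Authors. Title. Source. YYYY Mon;Volume(Issue):DUID. PubMed PMID: N.
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\. "
    r"(?P<title>.+?)\. "
    r"(?P<source>.+?)\. "
    r"(?P<issue_date>\d{4} [A-Za-z]{3});"
    r"(?P<volume>\d+)\((?P<issue>\d+)\):"
    r"(?P<duid>DUID\d+)\. "
    r"PubMed PMID: (?P<pmid>\d+)\.$"
)


def parse_citation(citation: str, secondary_id: str) -> dict:
    """Split a catalog citation and its SI line into named components."""
    match = CITATION_RE.match(citation)
    if match is None:
        raise ValueError("citation does not follow the assumed layout")
    parts = match.groupdict()
    parts["authors"] = parts["authors"].split(", ")
    # The SI line is read as "<repository>/<accession>".
    repository, _, accession = secondary_id.split(": ", 1)[1].partition("/")
    parts["repository"] = repository
    parts["accession"] = accession
    return parts


parsed = parse_citation(CITATION, SECONDARY_ID)
```

A stricter parser would be needed for titles that themselves contain a period followed by a space, which this sketch does not handle.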
12
Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central
13
[Figure: exclusion funnel]
NIH-funded articles for 2011: 113,089
Exclusion steps shown: Non-PMC Articles, Non-research Articles, Molecular Sequence Data MH, SI Field, PMC Acknowledgements, XML
Article counts shown across the steps: 88,592; 78,901; 75,441; 71,913; 71,680; 69,857
Result: Remaining articles with orphaned data sets
14
SI Field Exclusions
[Bar chart: excluded articles (axis 0 to 1,600) by SI field source: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM]
15
16
PMC Acknowledgement Exclusions
[Bar chart: excluded articles (axis 0 to 800) by excluded keyword: PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase]
17
XML Keyword Exclusions
[Bar chart: excluded articles (axis 0 to 600) by excluded keyword: GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords]
FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database
GenBank:PDB
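The keyword exclusions amount to dropping any article whose text names a known repository. The slides do not say how the matching was implemented, so the following is only a sketch under assumptions: the keyword list is the one from the XML exclusion chart, matching is case-sensitive on whole tokens, and the names `REPOSITORY_KEYWORDS`, `repository_mentions`, and `is_orphan_candidate` are hypothetical.

```python
import re
from typing import List

# Repository names taken from the XML keyword exclusion chart.
REPOSITORY_KEYWORDS = [
    "GenBank", "PDB", "GEO", "dbSNP", "ClinicalTrials.gov", "RGD", "FlyBase",
    "SRA", "DIP", "dbGaP", "WormBase", "MGI", "BioGRID", "VectorBase",
]

# Case-sensitive, and not allowed to sit inside a longer alphanumeric token,
# so that short names such as "DIP" or "GEO" do not match ordinary words.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])("
    + "|".join(re.escape(k) for k in REPOSITORY_KEYWORDS)
    + r")(?![A-Za-z0-9])"
)


def repository_mentions(text: str) -> List[str]:
    """Return the sorted, de-duplicated repository names found in the text."""
    return sorted({m.group(1) for m in _PATTERN.finditer(text)})


def is_orphan_candidate(text: str) -> bool:
    """An article stays in the pool only if no repository is mentioned."""
    return not repository_mentions(text)
```

An article that mentions more than one repository, as in the examples above, would fall into the "Multiple Keywords" bar under this reading.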
Total number of articles collected for 2011 after exclusion: 69,657
Random sample size at a 95% confidence level: 383
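The slide does not show how the sample size was derived. One common approach that yields the same figure is Cochran's formula with a finite population correction, sketched below. The 5% margin of error and the 0.5 expected proportion are assumptions, not values given in the presentation, and `sample_size` is a hypothetical helper name.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction.

    z      -- z-score for the confidence level (1.96 for 95%)
    margin -- margin of error (assumed to be 5% here)
    p      -- expected proportion (0.5 is the most conservative choice)
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)


articles_after_exclusion = 69657
required = sample_size(articles_after_exclusion)  # 383 under these assumptions
```

Under these assumptions the uncorrected size is about 384.16 and the corrected size about 382.06, which rounds up to the 383 articles reported.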
18
383
What category of data set was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new data set(s) were created, what type(s) of data were collected?
What existing data set(s) were used, if any?
How many data sets are there in each article?
19
20
Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
Preliminary Results
‘Orphaned’ Data
50 articles
21
Average number of data sets per article:
5.84
22
% of data sets that use live subjects: 51%
Human: 60%
Animal: 40%
23
% of data sets that were considered to be new: 74%
% of data sets that used existing data with modifications or added value: 12%
% of data sets that used existing data as is: 13%
% with no data: 1%
24
25
% of articles that collected only new data: 56%
% of articles that used only existing data: 32%
% of articles that used a combination of data: 8%
% of articles that used no data: 4%
Data Types
INSUFFICIENT
26
Building an NIH Data Catalog
Questions to Consider
27
What do we consider to be a data set?
All of the data created within a paper?
Multiple data sets of different data types within a paper?
Every individual collection of data within a paper?
28
Where in the collection/processing pipeline should data be described?
29
Is there a convenient way to point to data sets within an article?
Abstract?
Labeled area?
Reference list?
30
How do we adequately describe data sets so that they are discoverable?
Develop a strategy to create appropriate data descriptors
31
How do we adequately describe data sets so that they are discoverable?
Is there a convenient way to point to data sets within an article?
Where in the data collection/processing pipeline should data be described?
What do we consider to be a data set?
32
Acknowledgements
Project Sponsors: Jerry Sheehan & Mike Huerta
Special Thanks: Lou Knecht & Jim Mork
Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott
Support: Kathel Dunn & David Gillikin
Library Operations: Joyce Backus & Dianne Babski
NLM Leadership: Donald Lindberg & Betsy Humphreys
All images are CC
33