Neurosciences Information Framework (NIF): An example of community Cyberinfrastructure for the...
Maryann E. Martone, Ph. D. University of California, San Diego
“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
• In that same issue of Science
  – Asked peer reviewers from last year about the availability and use of data
• About half of those polled store their data only in their laboratories, not an ideal long-term solution.
• Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
• And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial, 2011
Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
Whole brain data (20 um microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases
http://neuinfo.org
• Current web is designed to share documents
  – Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
• NIF has developed a production technology platform for researchers to discover, share, analyze, and integrate neuroscience-relevant information
• Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
• Cost-effective and innovative strategy for managing data assets
“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online
NIF is poised to capitalize on the new tools and emphasis on big data and open science
http://neuinfo.org June 10, 2013 dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database Federation
Registry
• NIF Registry: A catalog of neuroscience-relevant resources
• > 6000 currently listed
• > 2200 databases
• And we are finding more every day
“Of relevance to neuroscience” is very broad
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry: requires no special skills; site map available for local hosting
• NIF Data Federation
• DISCO interop
• Requires some programming skill
Low barrier to entry
• Extended over time
  – Parent resource
  – Supporting agency
  – Grant numbers
  – Accessibility
  – Related to
  – Organism
  – Disease or condition
  – Last updated
First catalog: SFN Neuroscience Database Gateway NIF 0.5 NIF 1.0+
Simple metadata model
Name, description, type, URL, other names, keywords, unique identifier
~2003 2006 2008
• NIF Registry is hosted on the Semantic MediaWiki platform Neurolex
  – Community can add, review, edit without special privileges
  – Searchable by Google
  – Integrated with NIF ontologies
  – Graph structure
Semantic wiki: A wiki with semantics; pages are linked through relationships
NIF is creating the linked data graph of resources
– NIF employs an automated link checker
– Last analysis: 478/6100 invalid URLs (~8%)
– 199 could not be located at another university or location, or were out of service (~3%)
– Bigger issue: Many resources are no longer updated or maintained
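The percentages quoted for the link check follow directly from the counts. The following is a minimal sketch of how such a check could be organized, not NIF's actual link checker: the function names and the injected `fetch_status` callable are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Optional


def find_invalid(urls: Iterable[str],
                 fetch_status: Callable[[str], Optional[int]]) -> List[str]:
    """Return the URLs that fail to resolve.

    fetch_status is a hypothetical callable supplied by the caller. It
    returns an HTTP status code, returns None, or raises OSError when the
    host cannot be reached.
    """
    invalid = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError:
            status = None
        # Treat unreachable hosts and 4xx/5xx responses as invalid links.
        if status is None or status >= 400:
            invalid.append(url)
    return invalid


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


# Counts reported on the slide: 478 invalid and 199 unrecoverable out of 6100.
invalid_share = percent(478, 6100)
unrecoverable_share = percent(199, 6100)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise without network access.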
[Figure: resources added and resources last updated, by year, 1996 to 2014]
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
• The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
  – New resources continue to be created and found
  – Relatively few disappear altogether
  – Many more grow stale, although their value may still be significant
  – Maintaining up-to-date curation requires frequent updating
NIF Registry provides insight into the state of digital resources on the web
• The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
• Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
  • Early: testing different types of resources
  • Thematic areas
  • Volunteers
NIF provides access to the largest aggregation of neuroscience-relevant information on the web
• NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
• NIF is supported by a contract that specified the number of resources to be added per year
  – Designed to be populated rapidly; set up process for progressive refinement
  – No budget was allocated to retrofit existing resources; had to work with them in their current state
  – We designed a system that required little to no cooperation or work from providers
  – Supports many formats: relational, XML, RDF
Current Planned
DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
[Figure: number of federated databases and number of federated records (millions, log scale) over time]
NIF searches the largest collation of neuroscience-relevant data on the web
DISCO
Results categorized by data type and level of nervous system
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
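The query expansion in the search example (a term rewritten as a Boolean OR over its synonyms) can be sketched as follows. This is an illustrative assumption about the mechanics, not NIF's implementation: the `SYNONYMS` table and `expand_query` function are hypothetical, and the table contains only the synonyms shown on the slide.

```python
from typing import Dict, List

# Hypothetical synonym table; in NIF the synonyms come from the ontology.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str, synonyms: Dict[str, List[str]]) -> str:
    """Expand a search term into a Boolean OR query over its synonyms."""
    terms = [term] + synonyms.get(term.lower(), [])
    # Multi-word terms are quoted so they are searched as phrases.
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)


query = expand_query("Hippocampus", SYNONYMS)
print(query)
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a search.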
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates Projects to Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
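The distinction between exact matches and synonym matches across databases can be illustrated with a small sketch. The database names, term sets, and synonym table below are toy data invented for illustration, and `match_terms` is a hypothetical helper, not the analysis code used for the counts above.

```python
from collections import Counter
from typing import Callable, Dict, Set, Tuple


def match_terms(term_sets: Dict[str, Set[str]],
                synonyms: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (exact matches, synonym-only matches) across databases.

    synonyms maps a lowercase variant to its preferred lowercase label.
    """
    def shared(normalize: Callable[[str], str]) -> Set[str]:
        counts: Counter = Counter()
        for terms in term_sets.values():
            # Count each normalized label at most once per database.
            counts.update({normalize(t) for t in terms})
        return {label for label, n in counts.items() if n > 1}

    def canonical(term: str) -> str:
        return synonyms.get(term.lower(), term.lower())

    exact = shared(str.lower)
    # Labels shared only after synonym normalization.
    by_synonym = shared(canonical) - {canonical(t) for t in exact}
    return exact, by_synonym


# Toy data, invented for illustration.
databases = {
    "db_a": {"Hippocampus", "Striatum"},
    "db_b": {"Cornu Ammonis", "striatum"},
    "db_c": {"Cerebellum"},
}
synonym_table = {"cornu ammonis": "hippocampus", "ammon's horn": "hippocampus"}

exact, by_synonym = match_terms(databases, synonym_table)
```

Partonomy matches would need a further step that walks part-of relations in the ontology, which this sketch leaves out.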
– You (and the machine) have to be able to find it
  • Accessible through the web
  • Annotations
– You have to be able to access and use it
  • Data type specified and in a usable form
– You have to know what the data mean
  • Some semantics: “1”
  • Context: Experimental metadata
  • Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD
Organism
NS Function Molecule Investigation Subcellular structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality Anatomical Structure
NIF capitalizes on the growing set of community ontologies available in biomedical science
Purkinje Cell
Axon Terminal
Axon Dendritic Tree
Dendritic Spine
Dendrite
Cell body
Cerebellar cortex
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
Brain
Cerebellum
Purkinje Cell Layer
Purkinje cell
neuron
has a
has a
has a
is a
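The hierarchy above can be written as subject-relation-object triples and traversed by a machine. In this sketch the pairing of each label with a "has a" or "is a" relation is an assumption read off the diagram, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Assumed reading of the diagram: three "has a" links and one "is a" link.
TRIPLES: List[Triple] = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje cell layer"),
    ("Purkinje cell layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "Neuron"),
]


def parts_of(entity: str, triples: List[Triple]) -> Set[str]:
    """All entities reachable from `entity` by following has_a links."""
    found: Set[str] = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "has_a" and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found


def has_part_of_type(entity: str, type_name: str, triples: List[Triple]) -> bool:
    """True if some part of `entity` is declared to be a `type_name`."""
    parts = parts_of(entity, triples)
    return any(subject in parts and relation == "is_a" and obj == type_name
               for subject, relation, obj in triples)


brain_has_neurons = has_part_of_type("Brain", "Neuron", TRIPLES)
```

With the relations explicit, a data set annotated only with "Purkinje cell" can be returned for a query about neurons in the brain.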
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
  – Branch of philosophy: a theory of what is
  – e.g., Gene ontologies
• Provide universals for navigating across different data sources
  – Semantic “index”
• Provide the basis for concept-based queries to probe and mine data
  – Perform reasoning
  – Link data through relationships not just one-to-one mappings
“Search computing”
What genes are upregulated by drugs of abuse in the adult mouse?
Morphine
Increased expression
Adult Mouse
Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system
http://neurolex.org Stephen Larson
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
200,000 edits, 150 contributors
• NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
Databases come in many shapes and sizes
• Primary data:
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF Analytics: The Neuroscience Landscape
NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
Striatum Hypothalamus Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source
Vadim Astakhov, Kepler Workflow Engine
Diseases of nervous system
Adding more semantics
The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways
Neurodegenerative
Seizure disorders
Neoplastic disease of nervous system
NIH Reporter
NIF data federated sources
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
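Because the two sources report change against opposite baselines, one direction has to be flipped before claims can be compared. The sketch below shows that step on toy data; the gene labels are invented, the function names are hypothetical, and it assumes both sources have already been keyed on a shared gene identifier, which the differing identifier schemes listed above make non-trivial in practice.

```python
from typing import Dict, List


def flip(direction: str) -> str:
    """Reverse a direction of change ('up' <-> 'down')."""
    return {"up": "down", "down": "up"}[direction]


def compare_claims(gemma: Dict[str, str],
                   drg: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify Gemma claims against DRG after aligning the baselines.

    Gemma directions are relative to chronic morphine and DRG directions
    are relative to saline, so each Gemma direction is flipped first.
    """
    result: Dict[str, List[str]] = {"consistent": [], "opposite": [], "missing": []}
    for gene, direction in gemma.items():
        if gene not in drg:
            result["missing"].append(gene)
        elif flip(direction) == drg[gene]:
            result["consistent"].append(gene)
        else:
            result["opposite"].append(gene)
    return result


# Toy data with invented gene labels.
toy_gemma = {"geneA": "up", "geneB": "down", "geneC": "up"}
toy_drg = {"geneA": "down", "geneB": "down"}
outcome = compare_claims(toy_gemma, toy_drg)

# Share of the 1370 Gemma statements reported as consistent with DRG.
consistent_share = round(100 * 617 / 1370, 1)
```

The consistent share works out to about 45%, which matches the slide's statement that over half of the claims were not confirmed.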
Relatively simple standards would make life easier
NIF favors a hybrid, tiered, federated system
• Domain knowledge
  – Ontologies
• Claims, models and observations
  – Virtuoso RDF triples
  – Model repositories
• Data
  – Data federation
  – Spatial data
  – Workflows
• Narrative
  – Full text access
Neuron Brain part Disease Organism Gene
Caudate projects to Snpc
Grm1 is upregulated in chronic cocaine
Betz cells degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
Technique People
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
  – NIF Registry vs NIF data federation
  – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
  – Effective search across semantically diverse sources
    • NIFSTD ontologies
• 2009-2011: Strategy for data integration
  – Unified views across common sources
  – Mapping of content to NIF vocabularies
• 2011-present: Data analytics
  – Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services
NIF provides a strategy and set of tools applicable to all domains grappling with multiple sources of diverse data (i.e., pretty much everything)
• Search semantics
• Ranking
• Resources supported by NIH Blueprint Institutes are more thoroughly covered
• Data types, e.g., Brain activation foci
SciCrunch
NIF MONARCH
Community Services dkCOIN
Shared Resources
Undiagnosed Disease Program
Phenotype RCN
3D Virtual Cell
National Institute on Aging
One Mind for Research
BIRN
International Neuroinformatics Coordinating Facility
Model Organism Databases
Community Outreach
DELSA
(not just a data catalog)
• 3dVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship
SciCrunch
SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch
Community database: beginning
Community database: end
“How do I share my data/tool?”
“There is no database for my data”
Institutional repositories
Cloud
INCF: Global infrastructure
Government
Education
Industry
NIF is designed to leverage existing investments in resources and infrastructure
Tool repositories
• No one can be stopped from doing what they need to do
• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
  – If the market can support 11 MRI databases, fine
  – Some consolidation, coordination is warranted though
• Big, broad and messy beats small, narrow and neat
  – Without trying to integrate a lot of data, we will not know what needs to be done
  – A lot can be done with messy data; neatness helps though
  – Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
  – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
  – No source, not even NIF, is THE source; we are all a source
• Several powerful trends should change the way we think about our data: one to many
  – Many data
    • Generation of data is getting easier
    • Shared data
    • Data space is getting richer: more –omes every day
    • But...compared to the biological space, still sparse
  – Many eyes
    • Wisdom of crowds
    • More than one way to interpret data
  – Many algorithms
    • Not a single way to analyze data
  – Many analytics
    • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
Are you exposing or burying your work?
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11