Neurosciences Information Framework (NIF): An example of community Cyberinfrastructure for the...

Maryann E. Martone, Ph. D. University of California, San Diego

“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....

However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”

Akil et al., Science, Feb 11, 2011

•  In that same issue of Science
–  Asked peer reviewers from last year about the availability and use of data

•  About half of those polled store their data only in their laboratories, not an ideal long-term solution.

•  Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving

•  And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.

“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial, 2011

Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community

•  Whole brain data (20 um microscopic MRI)
•  Mosaic LM images (1 GB+)
•  Conventional LM images
•  Individual cell morphologies
•  EM volumes & reconstructions
•  Solved molecular structures

No single technology serves these all equally well.

Multiple data types; multiple scales; multiple databases

http://neuinfo.org

•  Current web is designed to share documents – Documents are unstructured data

•  Much of the content of digital resources is part of the “hidden web”

•  Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

•  NIF has developed a production technology platform for researchers to discover, share, analyze and integrate neuroscience-relevant information

•  Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web

•  Cost-effective and innovative strategy for managing data assets

“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online

NIF is poised to capitalize on the new tools and emphasis on big data and open science

http://neuinfo.org June 10, 2013 dkCOIN Investigator's Retreat

•  A portal for finding and using neuroscience resources
–  A consistent framework for describing resources
–  Provides simultaneous search of multiple types of information, organized by category
–  Supported by an expansive ontology for neuroscience
–  Utilizes advanced technologies to search the “hidden web”

UCSD, Yale, Cal Tech, George Mason, Washington Univ

Literature

Database Federation

Registry

• NIF Registry: A catalog of neuroscience-relevant resources
• > 6000 currently listed
• > 2200 databases
• And we are finding more every day

“Of relevance to neuroscience” is very broad


• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Site map available for local hosting

NIF Data Federation
• DISCO interop
• Requires some programming skill

Low barrier to entry

•  Extended over time
–  Parent resource
–  Supporting agency
–  Grant numbers
–  Accessibility
–  Related to
–  Organism
–  Disease or condition
–  Last updated

First catalog: SFN Neuroscience Database Gateway (~2003); NIF 0.5 (2006); NIF 1.0+ (2008)

Simple metadata model

Name, description, type, URL, other names, keywords, unique identifier
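A minimal sketch of how the registry's metadata model might be written down as a record type, assuming Python dataclasses. The field names follow the slide (the simple model plus the fields added over time); the class name `RegistryEntry`, the `matches` helper and the example values are hypothetical and are not NIF's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegistryEntry:
    """One catalog record; hypothetical layout based on the fields on the slide."""

    # Simple metadata model
    identifier: str
    name: str
    description: str
    resource_type: str
    url: str
    other_names: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)

    # Fields added as the model was extended over time
    parent_resource: Optional[str] = None
    supporting_agency: Optional[str] = None
    grant_numbers: List[str] = field(default_factory=list)
    accessibility: Optional[str] = None
    related_to: List[str] = field(default_factory=list)
    organism: Optional[str] = None
    disease_or_condition: Optional[str] = None
    last_updated: Optional[str] = None  # e.g. an ISO date string

    def matches(self, text: str) -> bool:
        """Case-insensitive substring match against name, other names and keywords."""
        needle = text.lower()
        haystack = [self.name, *self.other_names, *self.keywords]
        return any(needle in item.lower() for item in haystack)


# Hypothetical example record, for illustration only.
entry = RegistryEntry(
    identifier="example-0001",
    name="Example Brain Atlas",
    description="A made-up resource used to illustrate the record layout.",
    resource_type="database",
    url="http://example.org/atlas",
    keywords=["atlas", "mouse"],
)
```

Keeping the later fields optional mirrors the slide's point that the model started simple and was extended without invalidating older records.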


•  NIF Registry is hosted on the Semantic MediaWiki platform Neurolex
–  Community can add, review, edit without special privileges
–  Searchable by Google
–  Integrated with NIF ontologies
–  Graph structure

Semantic wiki: A wiki with semantics; pages are linked through relationships

NIF is creating the linked data graph of resources

–  NIF employs an automated link checker
–  Last analysis: 478/6100 invalid URLs (~8%)
–  199 can’t locate at another university or location, out of service (~3%)
–  Bigger issue: Many resources are no longer updated or maintained
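The link-check tally above can be sketched as follows. This is an illustrative sketch, not NIF's link checker: `check_links` and `http_reachable` are hypothetical names, and the reachability test is passed in as a function so the tally can be exercised without network access. For reference, 478 of 6100 is about 7.8% and 199 of 6100 is about 3.3%, which agree with the rounded figures on the slide.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds (assumed definition of a valid URL)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, malformed URLs and timeouts all count as unreachable here.
        return False


def check_links(urls: Iterable[str], is_reachable: Callable[[str], bool]) -> Dict[str, object]:
    """Tally invalid URLs and the invalid share as a percentage."""
    url_list = list(urls)
    invalid = [url for url in url_list if not is_reachable(url)]
    total = len(url_list)
    invalid_pct = 100.0 * len(invalid) / total if total else 0.0
    return {"total": total, "invalid": invalid, "invalid_pct": invalid_pct}


if __name__ == "__main__":
    # Offline demonstration with a stubbed reachability function.
    status = {"http://a.example": True, "http://b.example": False}
    print(check_links(status, lambda url: status[url]))
```

In practice one would call `check_links(urls, http_reachable)`; a failed check only shows that a URL did not answer, so a moved resource still needs a curator to find its new location.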

[Figure: resources added and resources last updated, by year (1996-2014)]

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., Mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review

• The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
–  New resources continue to be created and found
–  Relatively few disappear altogether
–  Many more grow stale, although their value may still be significant
–  Maintaining up to date curation requires frequent updating

NIF Registry provides insight into the state of digital resources on the web

• The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
• Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
  • Early: testing different types of resources
  • Thematic areas
  • Volunteers

NIF provides access to the largest aggregation of neuroscience-relevant information on the web

•  NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
•  NIF is supported by a contract that specified the number of resources to be added per year
–  Designed to be populated rapidly; set up process for progressive refinement
–  No budget was allocated to retrofit existing resources; had to work with them in their current state
–  We designed a system that required little to no cooperation or work from providers
–  Supports many formats: relational, XML, RDF
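To illustrate what working with sources "in their current state" can look like, here is a small sketch that reads rows from an XML export and from a relational table into the same plain-dictionary shape. It is not DISCO or any NIF component: the function names and the sample column names are hypothetical, only the Python standard library is used, and RDF is left out because it would need a third-party parser.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict, List

Record = Dict[str, object]


def records_from_xml(xml_text: str, row_tag: str) -> List[Record]:
    """Turn each <row_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]


def records_from_sqlite(conn: sqlite3.Connection, table: str) -> List[Record]:
    """Read every row of a table as a dict of column name -> value."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [dict(row) for row in rows]
```

Once both sources yield the same record shape, later steps such as indexing or mapping to shared vocabularies do not need to know which format a record came from.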

Current Planned

DISCO Dashboard Functions
•  Ingest Script Manager
•  Public Script Repository
•  Data & Event Tracker
•  Versioning System
•  Curator Tool
•  Data Transformer Manager

Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University

[Figure: number of federated databases and number of federated records (millions, log scale) over time]

NIF searches the largest collation of neuroscience-relevant data on the web

DISCO


Results categorized by data type and level of nervous system

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

Query expansion: Synonyms and related concepts

Boolean queries
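A minimal sketch of synonym-based query expansion, using the hippocampus example from the slide. The synonym table and the function name `expand_query` are hypothetical; NIF draws its synonyms from its ontologies, which this toy dictionary only stands in for.

```python
from typing import Dict, List

# Toy synonym table; the entries are the ones shown on the slide.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str, synonyms: Dict[str, List[str]]) -> str:
    """Join a term and its known synonyms into a Boolean OR query."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus", SYNONYMS))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is passed through unchanged, so expansion never narrows what the user asked for.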

Data sources categorized by “data type” and level of nervous system

Common views across multiple sources

Tutorials for using full resource when getting there from NIF

Link back to record in original source

Connects to

Synapsed with

Synapsed by

Input region

innervates

Axon innervates Projects to Cellular contact

Subcellular contact

Source site

Target site

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
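The exact and synonym match counts above come from comparing the term lists of the databases. A rough sketch of that kind of comparison follows, under stated assumptions: terms are normalized by lowercasing, the synonym table maps a variant to one preferred label, and partonomy matching is omitted. All names and sample vocabularies are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

Vocabularies = Dict[str, Set[str]]


def terms_in_multiple(vocabularies: Vocabularies) -> Set[str]:
    """Return terms that occur in more than one database vocabulary."""
    counts = Counter(term for terms in vocabularies.values() for term in terms)
    return {term for term, n in counts.items() if n > 1}


def match_report(vocabularies: Vocabularies, synonyms: Dict[str, str]) -> Dict[str, Set[str]]:
    """Split shared terms into exact matches and matches found only via synonyms."""
    normalized = {db: {t.strip().lower() for t in terms} for db, terms in vocabularies.items()}
    exact = terms_in_multiple(normalized)
    # Replace each known variant by its preferred label, then compare again.
    canonical = {db: {synonyms.get(t, t) for t in terms} for db, terms in normalized.items()}
    via_synonym = terms_in_multiple(canonical) - exact
    return {"exact": exact, "synonym": via_synonym}


# Hypothetical vocabularies, for illustration only.
vocabularies = {
    "db_a": {"Hippocampus", "Striatum"},
    "db_b": {"Ammon's horn", "striatum"},
    "db_c": {"Caudate nucleus"},
}
synonyms = {"ammon's horn": "hippocampus", "cornu ammonis": "hippocampus"}
report = match_report(vocabularies, synonyms)
```

Here `report["exact"]` contains only "striatum", while "hippocampus" is recovered through the synonym table, which is the pattern the slide's numbers point to: few exact matches, more once synonyms are considered.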

–  You (and the machine) have to be able to find it
   •  Accessible through the web
   •  Annotations
–  You have to be able to access and use it
   •  Data type specified and in a usable form
–  You have to know what the data mean
   •  Some semantics: “1”
   •  Context: Experimental metadata
   •  Provenance: Where did the data come from?

Reporting neuroscience data within a consistent framework helps enormously

Knowledge in space and spatial relationships (the “where”)

Knowledge in words, terminologies and logical relationships (the “what”)

•  NIF covers multiple structural scales and domains of relevance to neuroscience
•  Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

NIFSTD modules: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure

NIF capitalizes on the growing set of community ontologies available in biomedical science

Purkinje Cell

Axon Terminal

Axon

Dendritic Tree

Dendritic Spine

Dendrite

Cell body

Cerebellar cortex

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

Brain has a Cerebellum
Cerebellum has a Purkinje Cell Layer
Purkinje Cell Layer has a Purkinje cell
Purkinje cell is a neuron

•  Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine readable form
–  Branch of philosophy: a theory of what is
–  e.g., Gene ontologies
•  Provide universals for navigating across different data sources
–  Semantic “index”
•  Provide the basis for concept-based queries to probe and mine data
–  Perform reasoning
–  Link data through relationships not just one-to-one mappings
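As a small illustration of concept-based querying, the sketch below follows "has a" relationships so that a search for Cerebellum also returns records annotated with Purkinje cell. The relationships are the Brain, Cerebellum, Purkinje Cell Layer and Purkinje cell chain shown earlier in the deck; the triple layout, function names and sample records are hypothetical, and a real ontology reasoner does considerably more than this traversal.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

RELATIONS: List[Triple] = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def parts_of(whole: str, relations: List[Triple]) -> List[str]:
    """Collect everything reachable from `whole` by following has_a links."""
    children: Dict[str, List[str]] = defaultdict(list)
    for subject, predicate, obj in relations:
        if predicate == "has_a":
            children[subject].append(obj)
    found: List[str] = []
    stack = [whole]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


def concept_search(concept: str, records: List[dict], relations: List[Triple]) -> List[dict]:
    """Return records annotated with the concept or with any of its parts."""
    wanted = {concept, *parts_of(concept, relations)}
    return [record for record in records if record["structure"] in wanted]


# Hypothetical annotated records, for illustration only.
RECORDS = [
    {"id": 1, "structure": "Purkinje cell"},
    {"id": 2, "structure": "Striatum"},
    {"id": 3, "structure": "Cerebellum"},
]
hits = concept_search("Cerebellum", RECORDS, RELATIONS)
```

A plain keyword search for "Cerebellum" would miss record 1; linking through relationships rather than one-to-one string matches is what brings it back.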

“Search computing”: What genes are upregulated by drugs of abuse in the adult mouse?

Morphine

Increased expression

Adult Mouse

Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system

http://neurolex.org Stephen Larson

• Provide a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

• 200,000 edits
• 150 contributors

•  NIF can be used to survey the data landscape
•  Analysis of NIF shows multiple databases with similar scope and content
•  Many contain partially overlapping data
•  Data “flows” from one resource to the next
–  Data is reinterpreted, reanalyzed or added to
•  Is duplication good or bad?

Databases come in many shapes and sizes

•  Primary data:
–  Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
•  Secondary data
–  Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
•  Tertiary data
–  Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
•  Registries:
–  Metadata
–  Pointers to data sets or materials stored elsewhere
•  Data aggregators
–  Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
•  Single source
–  Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF Analytics: The Neuroscience Landscape

NIF is in a unique position to answer questions about the neuroscience landscape

Where are the data?

Striatum Hypothalamus Olfactory bulb

Cerebral cortex

Brain

Brain region

Data source

Vadim Astakhov, Kepler Workflow Engine

Diseases of nervous system

Adding more semantics

The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways

Neurodegenerative

Seizure disorders

Neoplastic disease of nervous system

NIH Reporter

NIF data federated sources

•  Gemma: Gene ID + Gene Symbol
•  DRG: Gene name + Probe ID
•  Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
•  Analysis:
   • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
   • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
   • Results for 1 gene were opposite in DRG and Gemma
   • 45 did not have enough information provided in the paper to make a judgment
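The baseline problem described above can be sketched in a few lines. This is an illustrative simplification, not the analysis that was run: it assumes a two-condition comparison (chronic morphine versus saline), so a claim stated against the other baseline is simply flipped before comparing. The gene labels and claims are placeholders, not data from Gemma or DRG. For reference, 617 of 1370 is about 45%, which is why over half of the claims count as not confirmed.

```python
from collections import Counter
from typing import Dict, Tuple

Claim = Tuple[str, str]  # (direction of change, baseline the change is relative to)
OPPOSITE = {"up": "down", "down": "up"}


def normalize_direction(claim: Claim, reference_baseline: str) -> str:
    """Express a claim relative to the reference baseline, flipping it if needed."""
    direction, baseline = claim
    return direction if baseline == reference_baseline else OPPOSITE[direction]


def compare_claims(
    source: Dict[str, Claim],
    target: Dict[str, Claim],
    reference_baseline: str = "saline",
) -> Counter:
    """Tally source claims as consistent, opposite, or not found in the target."""
    tally: Counter = Counter()
    for gene, claim in source.items():
        if gene not in target:
            tally["not_found"] += 1
            continue
        a = normalize_direction(claim, reference_baseline)
        b = normalize_direction(target[gene], reference_baseline)
        tally["consistent" if a == b else "opposite"] += 1
    return tally


# Placeholder claims, for illustration only.
gemma_like = {
    "GeneA": ("down", "chronic morphine"),
    "GeneB": ("up", "chronic morphine"),
    "GeneC": ("up", "chronic morphine"),
}
drg_like = {"GeneA": ("up", "saline"), "GeneB": ("up", "saline")}
tally = compare_claims(gemma_like, drg_like)
```

If each source recorded its baseline explicitly, as the `Claim` tuple does here, the flip could be applied mechanically instead of being rediscovered by reading the paper.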

Relatively simple standards would make life easier

NIF favors a hybrid, tiered, federated system

•  Domain knowledge
–  Ontologies
•  Claims, models and observations
–  Virtuoso RDF triples
–  Model repositories
•  Data
–  Data federation
–  Spatial data
–  Workflows
•  Narrative
–  Full text access

Neuron Brain part Disease Organism Gene

Caudate projects to Snpc

Grm1 is upregulated in chronic cocaine

Betz cells degenerate in ALS

NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science

Technique People

•  2006-2008: A survey of what was out there
•  2008-2009: Strategy for resource discovery
–  NIF Registry vs NIF data federation
–  Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
–  Effective search across semantically diverse sources
   •  NIFSTD ontologies
•  2009-2011: Strategy for data integration
–  Unified views across common sources
–  Mapping of content to NIF vocabularies
•  2011-present: Data analytics
–  Uniform external data references
•  2012-present: SciCrunch: unified biomedical resource services

NIF provides a strategy and set of tools applicable to all domains grappling with multiple sources of diverse data (i.e., pretty much everything)

•  Search semantics
•  Ranking
•  Resources supported by NIH Blueprint Institutes are more thoroughly covered
•  Data types, e.g., Brain activation foci



SciCrunch

NIF MONARCH

Community Services dkCOIN

Shared Resources

Undiagnosed Disease Program

Phenotype RCN

3D Virtual Cell

National Institute on Aging

One Mind for Research

BIRN

International Neuroinformatics Coordinating Facility

Model Organism Databases

Community Outreach

DELSA

(not just a data catalog)


• 3dVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship

SciCrunch

SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch

Community database: beginning

Community database: end

“How do I share my data/tool?”

“There is no database for my data”

Institutional repositories

Cloud

INCF: Global infrastructure

Government

Education

Industry

NIF is designed to leverage existing investments in resources and infrastructure

Tool repositories

•  No one can be stopped from doing what they need to do
•  Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
–  If the market can support 11 MRI databases, fine
–  Some consolidation, coordination is warranted though
•  Big, broad and messy beats small, narrow and neat
–  Without trying to integrate a lot of data, we will not know what needs to be done
–  A lot can be done with messy data; neatness helps though
–  Progressive refinement; addition of complexity through layers
•  Be flexible and opportunistic
–  A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
•  Think globally; act locally:
–  No source, not even NIF, is THE source; we are all a source

•  Several powerful trends should change the way we think about our data: One → Many
–  Many data
   •  Generation of data is getting easier → shared data
   •  Data space is getting richer: more -omes every day
   •  But... compared to the biological space, still sparse
–  Many eyes
   •  Wisdom of crowds
   •  More than one way to interpret data
–  Many algorithms
   •  Not a single way to analyze data
–  Many analytics
   •  “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting

Are you exposing or burying your work?

Jeff Grethe, UCSD, Co Investigator, Interim PI

Amarnath Gupta, UCSD, Co Investigator

Anita Bandrowski, NIF Project Leader

Gordon Shepherd, Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen, Washington University

Erin Reid

Paul Sternberg, Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli, George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark, Harvard University

Paolo Ciccarese

Karen Skinner, NIH, Program Officer (retired)

Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11
