Data-knowledge transition zones within the biomedical research ecosystem

Data - knowledge transition zones within the biomedical research ecosystem Maryann E. Martone, Ph. D. University of California, San Diego


• NIF is an initiative of the NIH Blueprint consortium of institutes
  – What types of resources (data, tools, materials, services) are available to the neuroscience community?
  – How many are there?
  – What domains do they cover? What domains do they not cover?
  – Where are they?
    • Web sites
    • Databases
    • Literature
    • Supplementary material
    • PDF files
    • Desk drawers
  – Who uses them?
  – Who creates them?
  – How can we find them?
  – How can we make them better in the future?

http://neuinfo.org

NIF has been surveying, cataloging and tracking the neuroscience resource landscape since before 2008

BD2K: Big Data to Knowledge

• BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement
• BD2K aims to develop the new approaches, standards, methods, tools, software and competencies that will enhance the use of biomedical Big Data by:
  – Facilitating broad use of biomedical digital assets by making them discoverable, accessible and citable
  – Conducting research and developing the methods, software and tools needed to analyze biomedical Big Data
  – Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
  – Supporting a data ecosystem that accelerates discovery as part of a digital enterprise

http://bd2k.nih.gov

How many resources are there?

How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Manual and semi-automated updates

NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard

Low barrier to entry, incremental refinement (Marenco et al. 2010, 2014)

Registry vs. Federation: metadata about resource vs. metadata/data in database

What resources are available for GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY

[Figure: registry entries by resource type (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource) plotted against years]

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

• Automated text mining is used to look for “web page last updated” or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
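The automated check described above can be sketched with two regular expressions over page text. This is only an illustration under stated assumptions, not the NIF pipeline: the patterns, the function names `latest_year` and `is_stale`, and the year-level granularity are all hypothetical.

```python
import re

# Hypothetical patterns; the slides do not describe the rules NIF actually uses.
UPDATED_RE = re.compile(r"last\s+updated.{0,30}?\b((?:19|20)\d{2})\b", re.IGNORECASE)
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright)\D{0,20}((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}))?",
    re.IGNORECASE,
)


def latest_year(page_text):
    """Return the most recent year in a 'last updated' or copyright notice, or None."""
    years = [int(m.group(1)) for m in UPDATED_RE.finditer(page_text)]
    for m in COPYRIGHT_RE.finditer(page_text):
        # A copyright range such as 2009-2013 contributes both years.
        years.extend(int(g) for g in m.groups() if g)
    return max(years) if years else None


def is_stale(page_text, current_year, max_age_years=2):
    """True if the newest date found is more than max_age_years old; None if no date was found."""
    year = latest_year(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years
```

A page for which `is_stale` returns `None` carries no recognizable date and would still need the kind of manual review listed in the second bullet.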

What happens to these resources?

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources, > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: search results for the query “Addiction” refined by facet/filter; labels: Index, Category, Source; Registry, Data, Literature; Gene; Gemma, Geo, Integrated Expression; columns Gene, Organism, Expression level]

More effective to start with a general query and use the navigation to refine search

Concept Mapper: Alignment and weighting

Find: gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum

Find: gene Anatomy:cerebellum
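The kind of query described above can be sketched over a toy in-memory stand-in for federated sources. Everything here is an assumption made for illustration: the source names, the column-to-concept mappings, the function `find_sources`, and the reading that a qualified keyword such as Anatomy:cerebellum restricts the match to columns mapped to that concept. It does not reflect how the NIF Concept Mapper is implemented.

```python
# Toy stand-in for federated sources: a column-to-concept mapping plus rows.
# All names and values are invented for illustration.
SOURCES = {
    "expression_db": {
        "concepts": {"symbol": "gene", "region": "anatomy"},
        "rows": [{"symbol": "Grm1", "region": "cerebellum"}],
    },
    "antibody_db": {
        "concepts": {"target": "gene", "vendor": "organization"},
        "rows": [{"target": "Grm1", "vendor": "Cerebellum Reagents Inc"}],
    },
    "atlas_db": {
        "concepts": {"structure": "anatomy"},
        "rows": [{"structure": "cerebellum"}],
    },
}


def find_sources(sources, concept, keyword, keyword_concept=None):
    """Names of sources with a column mapped to `concept` whose rows mention `keyword`.

    If `keyword_concept` is given, the keyword must occur in a column mapped to it.
    """
    keyword = keyword.lower()
    hits = []
    for name, source in sources.items():
        mapping = source["concepts"]
        if concept not in mapping.values():
            continue  # no column is mapped to the requested concept
        columns = [col for col, mapped in mapping.items()
                   if keyword_concept is None or mapped == keyword_concept]
        if any(keyword in str(row.get(col, "")).lower()
               for row in source["rows"] for col in columns):
            hits.append(name)
    return hits
```

In this toy data the unqualified keyword also matches a vendor name, while the qualified form keeps only the source where cerebellum appears in an anatomy column.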

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Figure: SciCrunch Shared Resources and community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]

Shared curation, shared expertise

Making use of community

[Figure: facet/filter diagram; labels: Index, Category, Source; Community, Community resources, SciCrunch data (all), Literature; Gene; Gemma, Geo, Integrated Expression; columns Gene, Organism, Expression level]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA: GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources (0, 1-10, 11-100, > 101) by brain region: Forebrain, Midbrain, Hindbrain]

Revealing biases in the dataspace

S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such, especially in tasks involving learning and memory.

Therefore the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs. # of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al. 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
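A count of labels shared across databases, first by exact match and then through a synonym table, might be computed along the following lines. This is a minimal sketch with invented data: the function names, the toy term lists and the synonym table are hypothetical, and it does not reproduce the numbers on the slide or cover partonomy matches.

```python
from collections import Counter


def normalize(term):
    """Case- and whitespace-insensitive form of a brain region label."""
    return " ".join(term.lower().split())


def shared_terms(term_sets, synonyms=None):
    """Labels used by more than one database, optionally after mapping synonyms.

    term_sets: {database name: iterable of labels}
    synonyms:  {variant label: preferred label} (a hypothetical, hand-made table)
    """
    synonyms = {normalize(k): normalize(v) for k, v in (synonyms or {}).items()}
    counts = Counter()
    for terms in term_sets.values():
        labels = {normalize(t) for t in terms}
        # Count each preferred label at most once per database.
        counts.update({synonyms.get(label, label) for label in labels})
    return {label for label, n in counts.items() if n > 1}


# Invented example data, not taken from the seven databases.
TERM_SETS = {
    "db_a": ["Hippocampus", "Caudate nucleus"],
    "db_b": ["hippocampus", "Caudate"],
    "db_c": ["Ammon's horn"],
}
```

Calling `shared_terms(TERM_SETS)` counts exact matches only; passing a synonym table such as `{"caudate": "caudate nucleus"}` adds the labels that agree once variants are mapped to a preferred form.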

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in NeurolexWiki

[Figure: bar chart (0 to 300) of number of entries (total, red links, easy fixes, hard fixes) for soma location, dendrite location and axon location]

Social networks and community sites let us learn from the collective behavior of contributors; they show the limits in our knowledge and in our knowledge representations

[Figure: information types linked by search and discovery: Domain Knowledge (Ontologies, Atlases/Maps); Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data (Databases, Data sets); Literature]

We cannot shoehorn everything into a single representation or system; instead we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: metadata about resource vs. metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 passing between resources, labeled Analyzed, Curated, Analyzed, Mirrored]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem: value is added

But… even our standards need standards

GSE13732

E-GEOD-13732

GEO:GSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
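Mapping the spellings shown above onto one form could look like the following sketch. The canonical form `GEO:GSE<number>` and the function name `normalize_geo_series` are assumptions chosen for illustration; the slides do not say which format the federation settled on.

```python
import re

# Accepts GSE13732, GEO:GSE13732, GEOGSE13732 and the E-GEOD-13732 spelling.
_GEO_SERIES_RE = re.compile(r"^(?:GEO:?)?GSE(\d+)$|^E-GEOD-(\d+)$", re.IGNORECASE)


def normalize_geo_series(identifier):
    """Map known spellings of a GEO series accession to one form, or None if unrecognized."""
    match = _GEO_SERIES_RE.match(identifier.strip())
    if match is None:
        return None
    # Exactly one of the two alternatives captured the numeric part.
    number = match.group(1) or match.group(2)
    return f"GEO:GSE{number}"
```

Returning `None` for anything unrecognized, rather than guessing, leaves those cases for the text-mining or curation step.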

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine, DRG with respect to saline, so direction of change is opposite in the 2 databases
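Before two such statements can be compared, they need to be expressed against the same baseline. A minimal sketch of that step follows; the tuple representation, the names `harmonize` and `compare`, and the assumption of a simple two-condition comparison are all hypothetical.

```python
FLIP = {"up": "down", "down": "up"}


def harmonize(direction, reference, common_reference="saline"):
    """Express a reported change relative to common_reference.

    direction: "up" or "down" as reported by the source.
    reference: the condition the source used as its baseline.
    With only two conditions, a different baseline means the sign is flipped.
    """
    return direction if reference == common_reference else FLIP[direction]


def compare(statement_a, statement_b):
    """Classify two (direction, reference) statements about the same gene."""
    a = harmonize(*statement_a)
    b = harmonize(*statement_b)
    return "consistent" if a == b else "opposite"
```

For example, "down" relative to chronic morphine and "up" relative to saline describe the same change, so `compare` reports them as consistent even though the raw labels disagree.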

Chronic vs. acute morphine in striatum

• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014: > 30 journals participating
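An identifier that is machine processible can be pulled out of article text with a simple pattern. The sketch below assumes identifiers are written as `RRID:` followed by a registry prefix, a separator and an accession; the pattern, the function name `extract_rrids` and the identifiers in the test strings are illustrative placeholders, not records from the portal.

```python
import re

# Assumed syntax: "RRID:" followed by a registry prefix, a separator and an accession.
RRID_RE = re.compile(r"\bRRID:\s*([A-Za-z]+[_:\-][A-Za-z0-9_:\-]*[A-Za-z0-9])")


def extract_rrids(text):
    """Return the distinct identifiers cited in text, in order of first appearance."""
    seen = []
    for accession in RRID_RE.findall(text):
        if accession not in seen:
            seen.append(accession)
    return seen
```

Because the pattern only needs the plain text of a methods section, the same check can run on material outside the paywall and in the same way for every journal.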

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph

This is your brain on computers

bull NIF is an initiative of the NIH Blueprint consortium of institutesndash What types of resources (data tools materials services) are available to the

neuroscience community

ndash How many are there

ndash What domains do they cover What domains do they not cover

ndash Where are theybull Web sites

bull Databases

bull Literature

bull Supplementary material

ndash Who uses them

ndash Who creates them

ndash How can we find them

ndash How can we make them better in the future

httpneuinfoorg

bull PDF files

bull Desk drawers

NIF has been surveying

cataloging and tracking the

neuroscience resource

landscape since lt 2008

BD2K Big Data to Knowledgebull BD2K - a trans-NIH initiative established to enable biomedical research as a

digital research enterprise to facilitate discovery and support new knowledge and to maximize community engagement

bull BD2K aims to develop the new approaches standards methods tools software and competencies that will enhance the use of biomedical Big Data by

ndash Facilitating broad use of biomedical digital assets by making them discoverable accessible and citable

ndash Conducting research and developing the methods software and tools needed to analyze biomedical Big Data

ndash Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science

ndash Supporting a data ecosystem that accelerates discovery as part of a digital enterprise

httpbd2knihgov

How many resources are there

How do resources get added to the NIFbullNIF curatorsbullNomination by the communitybullSemi-automated text mining pipelines

NIF RegistryRequires no special skillsManual and semi-automated updates

bullNIF Data FederationbullDISCO interopbullRequires some programming skillbullOpen Source Brain lt 2 hrbullAutomated update via NIF DISCO dashboard

Low barrier to entry incremental refinementMarenco et al 2010 2014

Registry vs Federation Metadata about resource vsmetadatadata in database

What resources are available for GRM1

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY

Database

Software Application

Data Analysis Service

Topical Portal

Core Facility

Ontology

Software Resource

Years

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates

ndash Identified for 570 resources

ndash 373 were not updated within the last 2

years (65)

bull Manual review of ~200 resources

ndash 38 not updated within the past 2 years

(~20)

ndash 8 migrated to new addresses or institutions

ndash 7 are no longer in service (~3)

ndash 3 were deemed no longer appropriate

What happens to these resources

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content

bullChange namebullChange scope

bullgt 7000 updates to the registry last year

Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views

250 sourcesgt 800 M records

What do you mean by dataDatabases come in many shapes and sizes

bull Primary data

ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)

bull Secondary datandash Data features extracted through

data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)

bull Tertiary datandash Claims and assertions about the

meaning of data

bull Eg gene upregulationdownregulation brain activation as a function of task

bull Registriesndash Metadata

ndash Pointers to data sets or materials stored elsewhere

bull Data aggregatorsndash Aggregate data of the same

type from multiple sources eg Cell Image Library SUMSdb Brede

bull Single sourcendash Data acquired within a single

context eg Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF A search engine for data

NIF Information Framework Query and alignment

bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology

bull Available as services through NIF and BioPortal

NIFSTD

Organism

NS FunctionMolecule InvestigationSubcellularstructure

Macromolecule Gene

Molecule Descriptors

Techniques

Reagent Protocols

Cell

Resource Instrument

Dysfunction QualityAnatomical Structure

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 experts worldwide
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g. NeuroElectro

[Figure: data set GSE13732 passing between resources, labeled Mirrored, Curated, and Analyzed]

Stable identifiers and annotations allow us to "track" data as it moves

Data flows throughout the ecosystem; value is added

But… even our standards need standards

GSE13732
E-GEOD-13732
GEO:GSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
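
As an illustration only, the spellings listed above could be folded into one canonical form roughly as follows. The choice of the bare GEO series accession as the canonical form, the function name, and the regular expression are assumptions made for this sketch; they are not NIF's actual normalization rules.

```python
import re
from typing import Optional

# Assumed pattern: an optional "GEO" prefix (with or without a separator),
# then either a GEO series prefix ("GSE") or an ArrayExpress-style
# prefix ("E-GEOD-"), then the numeric part of the accession.
_GEO_SERIES = re.compile(r"^(?:GEO[:_ ]?)?(?:GSE|E-GEOD-)(\d+)$", re.IGNORECASE)


def normalize_geo_series(raw: str) -> Optional[str]:
    """Return a canonical 'GSE<number>' string, or None if the input is not recognized."""
    match = _GEO_SERIES.match(raw.strip())
    if match is None:
        return None
    return "GSE" + match.group(1)


if __name__ == "__main__":
    for spelling in ["GSE13732", "E-GEOD-13732", "GEOGSE13732", "GEO:GSE13732"]:
        print(spelling, "->", normalize_geo_series(spelling))
```

Returning None for unrecognized strings, rather than guessing, leaves the inconsistent cases for text mining or manual curation.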

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  - Machine processable (i.e. unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  - Software/databases
  - Antibodies
  - Genetically modified organisms

Launched February 2014; > 30 journals participating
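
A rough sketch of why machine-processable identifiers matter: once resources are cited in a uniform way, they can be pulled out of article text with a few lines of code. The `RRID:PREFIX_NUMBER` shape assumed here, the function name, and the sample sentence (with made-up identifiers) are illustrative assumptions, not the initiative's formal specification.

```python
import re

# Assumed citation shape: "RRID:" followed by a registry prefix, an underscore,
# and an alphanumeric local identifier (for example "RRID:AB_000001").
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text: str) -> list:
    """Return the distinct resource identifiers cited in a piece of text, sorted."""
    return sorted(set(RRID_PATTERN.findall(text)))


if __name__ == "__main__":
    # Hypothetical methods-section sentence; the identifiers are invented.
    methods = (
        "Sections were stained with an anti-GFAP antibody (RRID:AB_000001) "
        "and analyzed with image software (RRID:SCR_000002). "
        "All measurements used the same software, RRID: SCR_000002."
    )
    print(extract_rrids(methods))
```

Because the identifier sits in the text of the methods section, the same extraction works on any journal's articles without publisher-specific parsing.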

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers


How many resources are there?

How do resources get added to the NIF?

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry

• Requires no special skills
• Manual and semi-automated updates

NIF Data Federation

• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard

Low barrier to entry, incremental refinement (Marenco et al. 2010, 2014)

Registry vs. Federation: Metadata about resource vs. metadata/data in database

What resources are available for GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY

[Figure: resource types plotted against years: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource]

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

• Automated text mining is used to look for "web page last updated" or copyright dates
  - Identified for 570 resources
  - 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  - 38 not updated within the past 2 years (~20%)
  - 8 migrated to new addresses or institutions
  - 7 are no longer in service (~3%)
  - 3 were deemed no longer appropriate
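
A minimal sketch of the kind of check described above, assuming the page text has already been fetched. The regular expressions, the function names, and the sample page text are assumptions for illustration; only the two-year threshold and the idea of mining "last updated" and copyright dates come from the slide.

```python
import re
from typing import Optional

_YEAR = r"((?:19|20)\d{2})"

# Assumed phrasings: "Last updated ... 2011" and "Copyright (c) 2005-2010".
# For a copyright range, the later year is the one captured.
_PATTERNS = [
    re.compile(r"last\s+(?:updated|modified)[^\n]{0,40}?" + _YEAR, re.IGNORECASE),
    re.compile(
        r"(?:copyright|\(c\)|\u00a9)[^\n]{0,20}?(?:(?:19|20)\d{2}\s*[-\u2013]\s*)?" + _YEAR,
        re.IGNORECASE,
    ),
]


def latest_year_mentioned(page_text: str) -> Optional[int]:
    """Return the most recent year found in an update or copyright notice, if any."""
    years = [int(m.group(1)) for p in _PATTERNS for m in p.finditer(page_text)]
    return max(years) if years else None


def is_stale(page_text: str, current_year: int, max_age_years: int = 2) -> Optional[bool]:
    """True if the newest date found is older than max_age_years; None if no date was found."""
    year = latest_year_mentioned(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years


if __name__ == "__main__":
    page = "Example Lab Database. Last updated: March 3, 2011. Copyright (c) 2005-2010."
    print(latest_year_mentioned(page), is_stale(page, current_year=2014))
```

Pages with no recognizable date return None rather than a verdict, which matches the slide's point that automated mining identified dates for only some resources and the rest needed manual review.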

What happens to these resources?

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g. mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It's a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources, > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  - Data available for reanalysis, e.g. microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  - Data features extracted through data processing and sometimes normalization, e.g. brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  - Claims and assertions about the meaning of data
  - E.g. gene upregulation/downregulation, brain activation as a function of task
• Registries
  - Metadata
  - Pointers to data sets or materials stored elsewhere
• Data aggregators
  - Aggregate data of the same type from multiple sources, e.g. Cell Image Library, SUMSdb, Brede
• Single source
  - Data acquired within a single context, e.g. Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g. Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• "What genes are upregulated by chronic morphine?"
  - It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  - Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  - Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: progressive refinement of the query "Addiction". Diagram labels: Index, Category, Source, Facet/Filter; Registry, Data, Literature; Gene; Gemma, Geo, Integrated Expression; Gene, Organism, Expression level]

More effective to start with a general query and use the navigation to refine search

Concept Mapper: Alignment and weighting

Findgene cerebellum = find all sources with column mapped to gene that also contain keyword cerebellum

Findgene Anatomycerebellum

"Data trails": Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A "social network" for resources

• NIF is a general search engine across all of neuroscience
  - Very powerful for discovery and general browsing
  - Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  - Specialized for their domain
  - Restrict the particular sources
  - Organize the data according to their needs
  - Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: chart with a logarithmic axis (1 to 10,000,000,000) and the categories Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It's like a Mendeley for resources

[Figure: SciCrunch shared resources and associated communities: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]

Shared curation, shared expertise

Making use of community

[Figure: search refinement diagram with a Community level. Diagram labels: Facet/Filter, Source, Category, Index, Community; Community resources, SciCrunch data (all); Gene; Gemma, Geo, Integrated Expression; Gene, Organism, Expression level; Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA: GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: data sources by brain region (Forebrain, Midbrain, Hindbrain), binned as 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

"Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature."

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]

antibodyregistry.org: Trish Whetzel and Anita Bandrowski

"The Data Homunculus"

Data-Knowledge Mismatch

Dutkowski et al. 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  - Brain Architecture Management System (rodent)
  - Temporal-lobe.com (rodent)
  - Connectome Wiki (human)
  - Brain Maps (various)
  - CoCoMac (primate cortex)
  - UCLA Multimodal database (human fMRI)
  - Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
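
A small sketch of how counts like "exact terms used in more than one database" and "synonym matches" could be computed. The function names, the normalization rule, and the toy term lists and synonym table are all invented for illustration; they are not drawn from the seven databases or from NIF's actual analysis.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def terms_shared_across_sources(
    terms_by_source: Dict[str, Iterable[str]],
    synonyms: Optional[Dict[str, str]] = None,
) -> Set[str]:
    """Return the terms that appear in more than one source.

    If a synonym table (normalized term -> preferred term) is given, terms are
    mapped to their preferred form before counting.
    """
    synonyms = synonyms or {}
    counts = Counter()
    for terms in terms_by_source.values():
        # Use a set per source so a term repeated within one source counts once.
        canonical = {synonyms.get(normalize(t), normalize(t)) for t in terms}
        counts.update(canonical)
    return {term for term, n in counts.items() if n > 1}


if __name__ == "__main__":
    toy_terms = {  # invented example data
        "db_a": ["Hippocampus", "CA1", "Ammon's horn"],
        "db_b": ["hippocampus", "Cornu ammonis", "Striatum"],
        "db_c": ["Caudate nucleus"],
    }
    print(terms_shared_across_sources(toy_terms))
    print(terms_shared_across_sources(toy_terms, {"ammon's horn": "cornu ammonis"}))
```

Exact matching finds only the terms spelled the same way in two sources; adding the synonym table recovers more overlap, which mirrors the gap between the exact and synonym counts on the slide.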

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g. neurons or glia), but it is currently unknown exactly how many classes exist"

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  - Neuron Registry
    • Neurolex.org
    • Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in the Neurolex wiki

[Figure: number of red links for soma location, dendrite location, and axon location (axis 0 to 300), shown as total red links, easy fixes, and hard fixes]

Social networks and community sites let us learn things from the collective behavior of contributors, and show limits in our knowledge and our knowledge representations

[Figure: diagram with the labels Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims and assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases/Data sets, Literature, Search and Discovery]

Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: Community engagement piece
• What should a data discovery index do?
  - Task Forces
  - Pilot projects
• How should it be built? http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER



How many resources are there

How do resources get added to the NIFbullNIF curatorsbullNomination by the communitybullSemi-automated text mining pipelines

NIF RegistryRequires no special skillsManual and semi-automated updates

bullNIF Data FederationbullDISCO interopbullRequires some programming skillbullOpen Source Brain lt 2 hrbullAutomated update via NIF DISCO dashboard

Low barrier to entry incremental refinementMarenco et al 2010 2014

Registry vs Federation Metadata about resource vsmetadatadata in database

What resources are available for GRM1

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY

Database

Software Application

Data Analysis Service

Topical Portal

Core Facility

Ontology

Software Resource

Years

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates

ndash Identified for 570 resources

ndash 373 were not updated within the last 2

years (65)

bull Manual review of ~200 resources

ndash 38 not updated within the past 2 years

(~20)

ndash 8 migrated to new addresses or institutions

ndash 7 are no longer in service (~3)

ndash 3 were deemed no longer appropriate

What happens to these resources

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content

bullChange namebullChange scope

bullgt 7000 updates to the registry last year

Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views

250 sourcesgt 800 M records

What do you mean by dataDatabases come in many shapes and sizes

bull Primary data

ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)

bull Secondary datandash Data features extracted through

data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)

bull Tertiary datandash Claims and assertions about the

meaning of data

bull Eg gene upregulationdownregulation brain activation as a function of task

bull Registriesndash Metadata

ndash Pointers to data sets or materials stored elsewhere

bull Data aggregatorsndash Aggregate data of the same

type from multiple sources eg Cell Image Library SUMSdb Brede

bull Single sourcendash Data acquired within a single

context eg Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF A search engine for data

NIF Information Framework Query and alignment

bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology

bull Available as services through NIF and BioPortal

NIFSTD

Organism

NS FunctionMolecule InvestigationSubcellularstructure

Macromolecule Gene

Molecule Descriptors

Techniques

Reagent Protocols

Cell

Resource Instrument

Dysfunction QualityAnatomical Structure

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)

• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  – Anyone can contribute their terms, concepts, things
  – Anyone can edit
  – Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
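The "sets of triples" idea can be illustrated with a minimal sketch. It only shows what subject-predicate-object statements look like and how they can be queried; it is not how Semantic MediaWiki stores data, and the property names and entries below are illustrative placeholders rather than actual Neurolex records.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# All names are illustrative placeholders, not actual Neurolex content.
TRIPLES = [
    ("Cerebellum Purkinje cell", "is a", "Neuron"),
    ("Cerebellum Purkinje cell", "soma located in", "Purkinje cell layer of cerebellar cortex"),
    ("Cerebellum Purkinje cell", "neurotransmitter released", "GABA"),
    ("Cerebellum granule cell", "is a", "Neuron"),
]


def index_by_subject(triples):
    """Group triples into {subject: {predicate: [objects]}} for page-like lookup."""
    index = defaultdict(dict)
    for subject, predicate, obj in triples:
        index[subject].setdefault(predicate, []).append(obj)
    return index


def subjects_with(triples, predicate, obj):
    """Return, in sorted order, every subject that has the given predicate/object pair."""
    return sorted({s for s, p, o in triples if p == predicate and o == obj})
```

Because every statement has the same three-part shape, anyone can add a new property without changing a schema, which is the "low barrier" the slide emphasizes.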

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 passing between resources; labels: Analyzed, Curated, Analyzed, Mirrored]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem; value is added

[Figure: the same GSE13732 diagram; labels: Analyzed, Curated, Analyzed, Mirrored]

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
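A minimal sketch of what normalizing the three spellings above to one identifier could look like. The patterns cover only the forms shown on the slide, and the function name and the choice of `GSE<number>` as the canonical form are assumptions for illustration, not NIF's actual rules.

```python
import re
from typing import Optional

# Patterns for the three spellings shown on the slide (assumed, not NIF's rule set).
_GEO_SERIES_PATTERNS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),       # GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),   # E-GEOD-13732
    re.compile(r"^GEO:?GSE(\d+)$", re.IGNORECASE),  # GEOGSE13732, with or without a colon
]


def normalize_geo_series(raw: str) -> Optional[str]:
    """Map a known spelling of a GEO series accession to 'GSE<number>'.

    Returns None when the string matches none of the known spellings,
    so that unrecognized identifiers can be routed to manual review.
    """
    token = raw.strip()
    for pattern in _GEO_SERIES_PATTERNS:
        match = pattern.match(token)
        if match:
            return "GSE" + str(int(match.group(1)))
    return None
```

Records from different sources can then be grouped by the normalized value, which is what makes it possible to follow one data set across resources.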

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis:
  – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  – Results for 1 gene were opposite in DRG and Gemma
  – 45 did not have enough information provided in the paper to make a judgment
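The comparison above only makes sense once both sources are put on the same baseline. A minimal sketch of that step, assuming each source is reduced to a gene-to-direction mapping; the gene keys, function name, and category names are hypothetical and do not reproduce the actual NIF analysis.

```python
# Direction labels flip when the reference condition is swapped.
FLIP = {"up": "down", "down": "up"}


def compare_directions(gemma, drg):
    """Compare direction-of-change calls after aligning the baselines.

    gemma: gene -> "up"/"down", reported relative to one baseline
    drg:   gene -> "up"/"down", reported relative to the opposite baseline
    Gemma's calls are flipped first, so "consistent" means the two sources
    agree once both are expressed against the same reference condition.
    """
    summary = {"consistent": [], "opposite": [], "not_in_drg": []}
    for gene, direction in sorted(gemma.items()):
        if gene not in drg:
            summary["not_in_drg"].append(gene)
        elif FLIP[direction] == drg[gene]:
            summary["consistent"].append(gene)
        else:
            summary["opposite"].append(gene)
    return summary
```

Skipping the flip would make every agreeing gene look contradictory, which is why the baseline has to be recorded alongside the result.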

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study, such that they are:
  – Machine processible (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014; > 30 journals participating

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635
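Because the identifiers are meant to be machine processible, they can be pulled out of article text with a simple pattern. The sketch below assumes the commonly written `RRID:<prefix>_<id>` form; the regular expression, the function names, and the placeholder identifiers in the usage example are assumptions for illustration, not an official grammar.

```python
import re
from collections import Counter

# Assumed shape: "RRID:" followed by a letter prefix, an underscore, and an ID.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9:_-]+)")


def extract_rrids(text):
    """Return every RRID-like token in the text, without the 'RRID:' label."""
    return RRID_PATTERN.findall(text)


def count_by_prefix(rrids):
    """Tally identifiers by prefix (the part before the first underscore)."""
    return Counter(rrid.split("_", 1)[0] for rrid in rrids)
```

For example, `extract_rrids("stained with an antibody (RRID:AB_0000001) and analyzed with a tool (RRID:SCR_0000002)")` yields the two placeholder identifiers, and `count_by_prefix` groups them by resource type. A check of this kind run before and after copyediting would flag identifiers that disappear in production.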

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph

This is your brain on computers

How do resources get added to the NIF?

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry

• Requires no special skills
• Manual and semi-automated updates

NIF Data Federation

• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard

Low barrier to entry; incremental refinement (Marenco et al., 2010, 2014)

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

What resources are available for GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY

[Figure: registry resources by type (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource); axis: Years]

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

• Automated text mining is used to look for “web page last updated” or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
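A minimal sketch of the kind of check described above: look for a year near a "last updated" or copyright notice and flag the page as stale if that year is more than two years old. The regular expression, function names, and the two-year default are assumptions for illustration; the actual NIF text mining pipeline is not described in the slides.

```python
import re
from datetime import date

# A year (optionally a range such as 2009-2013) shortly after a date hint.
_DATE_HINT = re.compile(
    r"(?:last\s+updated|copyright|©).{0,40}?((?:19|20)\d{2})"
    r"(?:\s*[-–]\s*((?:19|20)\d{2}))?",
    re.IGNORECASE,
)


def latest_year(page_text):
    """Return the most recent year found near a date hint, or None."""
    years = [int(y) for match in _DATE_HINT.findall(page_text) for y in match if y]
    return max(years) if years else None


def is_stale(page_text, today, max_age_years=2):
    """True if the newest date hint is older than max_age_years.

    Returns None when no date hint is found, so such pages can be
    sent to manual review instead of being silently classified.
    """
    year = latest_year(page_text)
    if year is None:
        return None
    return today.year - year > max_age_years
```

A heuristic like this only produces candidates; the manual review figures above show why a human pass is still needed to tell an abandoned resource from one that simply does not update its footer.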

What happens to these resources?

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources; > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: search interface for the query “Addiction”; labels shown: Facet/Filter, Source, Category, Index, Registry, Data, Gene, Gemma, Organism, Expression level, Geo, Integrated Expression, Literature]

More effective to start with a general query and use the navigation to refine search

Concept Mapper: Alignment and weighting

Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum

Findgene Anatomycerebellum

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  – Very powerful for discovery and general browsing
  – Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  – Specialized for their domain
  – Restrict the particular sources
  – Organize the data according to their needs
  – Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart on a log scale (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Figure: SciCrunch Shared Resources at the center, with community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]

Shared curation, shared expertise

Making use of community

[Figure: search interface with community facets; labels shown: Facet/Filter, Source, Category, Index, Community, Community resources, SciCrunch data (all), Gene, Gemma, Organism, Expression level, Geo, Integrated Expression, Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA: GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: data sources by brain region (Forebrain, Midbrain, Hindbrain); legend: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; bins: 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  – Brain Architecture Management System (rodent)
  – Temporal lobe.com (rodent)
  – Connectome Wiki (human)
  – Brain Maps (various)
  – CoCoMac (primate cortex)
  – UCLA Multimodal database (human fMRI)
  – Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
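The counts above come from comparing term lists across databases. A minimal sketch of the first two kinds of match, exact and synonym, assuming each database is reduced to a set of region names and that a synonym table is available; the function name and the toy terms in the usage note are hypothetical, and partonomy matching is not covered.

```python
from collections import Counter


def _norm(term):
    """Case-fold and collapse whitespace so trivial spelling variants match."""
    return " ".join(term.lower().split())


def shared_terms(term_sets, synonyms=None):
    """Return the terms that occur in more than one database.

    term_sets: {database name: set of brain region terms}
    synonyms:  optional {variant: preferred label}; when given, variants are
               mapped to their preferred label before counting, so the result
               includes synonym matches as well as exact ones.
    """
    mapping = {_norm(k): _norm(v) for k, v in (synonyms or {}).items()}
    counts = Counter()
    for terms in term_sets.values():
        # Count each term at most once per database.
        counts.update({mapping.get(_norm(t), _norm(t)) for t in terms})
    return {term for term, n in counts.items() if n > 1}
```

With toy inputs such as `{"db_a": {"Hippocampus", "CA1"}, "db_b": {"hippocampus", "Ammon's horn"}, "db_c": {"Cornu ammonis", "Cerebellum"}}`, the exact pass finds only the hippocampus; adding the synonym pair Ammon's horn / Cornu ammonis recovers a second shared structure. The gap between the two results is the nomenclature problem the slide describes.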

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
    • Neurolex.org
    • Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki

[Figure: bar chart (0 to 300) for soma location, dendrite location, and axon location; labels: Number, Total, redlinks, easy fixes, hard fixes]

Social networks and community sites let us learn things from the collective behavior of contributors and show limits in our knowledge and our knowledge representations

[Figure: information flow between Domain Knowledge (ontologies, atlases/maps, annotation, claims, assertions), Registries, Derived data (models and simulations, analyses), Data (databases, data sets), Literature, and Search and Discovery]

We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

Registry vs Federation Metadata about resource vsmetadatadata in database

What resources are available for GRM1

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES RESOURCE REGISTRY

Database

Software Application

Data Analysis Service

Topical Portal

Core Facility

Ontology

Software Resource

Years

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

bull Automated text mining is used to look for ldquoweb page last updatedrdquo or copyright dates

ndash Identified for 570 resources

ndash 373 were not updated within the last 2

years (65)

bull Manual review of ~200 resources

ndash 38 not updated within the past 2 years

(~20)

ndash 8 migrated to new addresses or institutions

ndash 7 are no longer in service (~3)

ndash 3 were deemed no longer appropriate

What happens to these resources

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content

bullChange namebullChange scope

bullgt 7000 updates to the registry last year

Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views

250 sourcesgt 800 M records

What do you mean by dataDatabases come in many shapes and sizes

bull Primary data

ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)

bull Secondary datandash Data features extracted through

data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)

bull Tertiary datandash Claims and assertions about the

meaning of data

bull Eg gene upregulationdownregulation brain activation as a function of task

bull Registriesndash Metadata

ndash Pointers to data sets or materials stored elsewhere

bull Data aggregatorsndash Aggregate data of the same

type from multiple sources eg Cell Image Library SUMSdb Brede

bull Single sourcendash Data acquired within a single

context eg Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF A search engine for data

NIF Information Framework Query and alignment

bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology

bull Available as services through NIF and BioPortal

NIFSTD

Organism

NS FunctionMolecule InvestigationSubcellularstructure

Macromolecule Gene

Molecule Descriptors

Techniques

Reagent Protocols

Cell

Resource Instrument

Dysfunction QualityAnatomical Structure

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
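Because the two databases report change against opposite baselines, a comparison has to flip one direction before checking agreement. The sketch below illustrates that step under stated assumptions: the function, the three outcome labels, and the example statements are hypothetical and are not the analysis actually run on Gemma and DRG.

```python
def compare_direction(gemma_change, drg_change, baselines_opposite=True):
    """Classify agreement between two reported directions of change.

    Each argument is 'up', 'down', or None (not enough information).
    When the sources use opposite baselines, one direction is flipped
    before the comparison.
    """
    if gemma_change is None or drg_change is None:
        return "insufficient"
    flip = {"up": "down", "down": "up"}
    aligned = flip[gemma_change] if baselines_opposite else gemma_change
    return "consistent" if aligned == drg_change else "opposite"


# Placeholder statements; the gene names and values are made up.
statements = [
    ("geneA", "up", "down"),
    ("geneB", "down", "down"),
    ("geneC", "up", None),
]

tally = {}
for gene, gemma, drg in statements:
    outcome = compare_direction(gemma, drg)
    tally[outcome] = tally.get(outcome, 0) + 1
print(tally)
```

Keeping "insufficient" as its own outcome mirrors the slide's separate count of statements that could not be judged.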

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processable (i.e. a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014; > 30 journals participating

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

What resources are available for GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY

[Figure: Resource Registry entries by type (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource); axis: Years]

Anita Bandrowski and Burak Ozyurt

Population, Coverage, and Linkage of Resource Registry

• Automated text mining is used to look for “web page last updated” or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
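The automated check described above can be sketched roughly as follows. This is a simplified illustration, not the NIF pipeline: the keyword list, the 40-character window, and the function names are all assumptions, and real pages would need far more robust date handling.

```python
import re
from datetime import date

# Assumed heuristics: look for a year shortly after "last updated" or a
# copyright marker. The window size of 40 characters is arbitrary.
KEYWORD_RE = re.compile(r"(last\s+updated|copyright|©)(.{0,40})",
                        re.IGNORECASE | re.DOTALL)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_year(page_text):
    """Return the most recent year found near a keyword, or None."""
    years = []
    for _keyword, window in KEYWORD_RE.findall(page_text):
        years.extend(int(y) for y in YEAR_RE.findall(window))
    return max(years) if years else None


def is_stale(page_text, today, max_age_years=2):
    """True/False when a date was found, None when nothing was detected."""
    year = latest_year(page_text)
    if year is None:
        return None
    return today.year - year > max_age_years


today = date(2014, 6, 1)
print(is_stale("Page last updated: March 12, 2011", today))
print(is_stale("Copyright 2009-2012 Example Lab", today))
print(is_stale("Welcome to our database", today))
```

Pages with no detectable date return None, which corresponds to the resources the slide leaves for manual review. As a check on the slide's arithmetic, 373 of 570 is about 65%.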

What happens to these resources?

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g. mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources, > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g. microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g. brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
  – E.g. gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g. Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g. Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g. Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: facet/filter navigation for the query “Addiction”; labels: Index, Category, Source, Registry, Data, Literature, Gene, Gemma, Gene, Organism, Expression level, GeoIntegrated Expression]

More effective to start with a general query and use the navigation to refine search

Concept Mapper: Alignment and weighting

• Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
• Findgene Anatomycerebellum
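The idea behind this kind of query (a concept constraint on the columns plus a keyword constraint on the content) can be sketched as below. The transcript does not preserve the exact Concept Mapper syntax, so the data structure, the function name, and the source names here are hypothetical and stand in only for the behavior the slide describes.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    column_concepts: dict                      # column name -> mapped concept (hypothetical)
    rows: list = field(default_factory=list)   # each row: dict of column name -> value


def find_sources(sources, concept, keyword):
    """Names of sources with a column mapped to `concept` that also contain `keyword`."""
    keyword = keyword.lower()
    hits = []
    for source in sources:
        if concept not in source.column_concepts.values():
            continue  # no column aligned to the requested concept
        if any(keyword in str(value).lower()
               for row in source.rows
               for value in row.values()):
            hits.append(source.name)
    return hits


# Placeholder sources; names and rows are invented for the example.
sources = [
    Source("source_a", {"symbol": "gene", "region": "anatomy"},
           [{"symbol": "geneX", "region": "Cerebellum"}]),
    Source("source_b", {"symbol": "gene"},
           [{"symbol": "geneY"}]),
    Source("source_c", {"region": "anatomy"},
           [{"region": "cerebellum"}]),
]
print(find_sources(sources, "gene", "cerebellum"))
```

Only the first source satisfies both constraints: the second has a gene column but no mention of cerebellum, and the third mentions cerebellum but has no column mapped to gene.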

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  – Very powerful for discovery and general browsing
  – Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  – Specialized for their domain
  – Restrict the particular sources
  – Organize the data according to their needs
  – Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart by data category on a log-scale axis from 1 to 10000000000; categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through “More Resources”

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Figure: SciCrunch shared resources and connected communities; labels: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]

Shared curation, shared expertise

Making use of community

[Figure: facet/filter navigation with community facets; labels: Index, Category, Source, Community, Community resources, SciCrunch data (all), Gene, Gemma, Gene, Organism, Expression level, GeoIntegrated Expression, Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA: GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain), binned as 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs # of search results from antibodyregistry.org, binned as 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
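The binning used in this kind of gap analysis can be sketched as below. The bin edges follow the labels recoverable from the figure (0, 1-10, 11-100, 101-1000, 1001+); the gene names and counts are placeholders, not registry results, and the function names are assumptions.

```python
from collections import Counter

# Bin edges taken from the figure labels; the last bin is open-ended.
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"), (101, 1000, "101-1000")]


def bin_label(count):
    """Map a non-negative hit count to its bin label."""
    if count < 0:
        raise ValueError("count must be non-negative")
    for low, high, label in BINS:
        if low <= count <= high:
            return label
    return "1001+"


def histogram(counts_by_gene):
    """Count how many genes fall into each bin."""
    return Counter(bin_label(n) for n in counts_by_gene.values())


# Placeholder data: gene -> number of search results (invented values).
example = {"gene1": 0, "gene2": 7, "gene3": 250, "gene4": 5000, "gene5": 3}
print(histogram(example))
```

The "0" bin is the interesting one for a gap analysis, since it lists the genes for which the index returns nothing at all.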

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal-lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
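The first two kinds of match counted above (exact terms shared by more than one database, and terms that meet only through a synonym) can be sketched as follows. The normalization rule, the synonym table, and the database names are assumptions for illustration. The actual NIF comparison drew on its ontologies and also counted partonomy matches, which this sketch does not attempt.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def exact_shared_terms(terms_by_db):
    """Normalized terms that appear in more than one database."""
    seen = {}
    for db, terms in terms_by_db.items():
        for term in {normalize(t) for t in terms}:
            seen.setdefault(term, set()).add(db)
    return {term for term, dbs in seen.items() if len(dbs) > 1}


def synonym_shared_concepts(terms_by_db, synonyms):
    """Concepts reached from more than one database via a synonym table.

    `synonyms` maps a normalized term to a preferred concept label.
    """
    seen = {}
    for db, terms in terms_by_db.items():
        for term in terms:
            concept = synonyms.get(normalize(term))
            if concept is not None:
                seen.setdefault(concept, set()).add(db)
    return {concept for concept, dbs in seen.items() if len(dbs) > 1}


# Placeholder inputs; database names and the synonym table are invented.
terms_by_db = {
    "db_a": ["Hippocampus", "Caudate nucleus"],
    "db_b": ["hippocampus", "Caudate"],
    "db_c": ["Cerebellum"],
}
synonyms = {"caudate": "caudate nucleus", "caudate nucleus": "caudate nucleus"}

print(exact_shared_terms(terms_by_db))
print(synonym_shared_concepts(terms_by_db, synonyms))
```

In this toy input, "hippocampus" is an exact match across two databases, while "Caudate" and "Caudate nucleus" only line up once the synonym table is consulted.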

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g. neurons or glia), but it is currently unknown exactly how many classes exist”

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki

[Figure: red links in the Neuron Registry for soma location, dendrite location, and axon location; axis: Number (0 to 300); series: Total, redlinks, easy fixes, hard fixes]

Social networks and community sites let us learn things from the collective behavior of contributors and show limits in our knowledge and our knowledge representations

[Figure labels: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims and assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature, Search and Discovery]

We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs Federation: Metadata about resource vs metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

Database

Software Application

Data Analysis Service

Topical Portal

Core Facility

Ontology

Software Resource

Years

Anita Bandrowski and Burak Ozyurt

Population Coverage and Linkage of Resource Registry

• Automated text mining is used to look for "web page last updated" or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
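The automated check described above can be sketched roughly as follows. This is a minimal Python illustration, not the NIF pipeline: the cue phrases, the 40-character search window, and the names `latest_year` and `is_stale` are all assumptions made for the example, and real pages phrase their update stamps in many more ways.

```python
import re

# Assumed cue phrases that tend to precede a date on a web page.
CUE_RE = re.compile(r"last\s+updated|copyright|\(c\)", re.IGNORECASE)
# Four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_year(page_text, window=40):
    """Return the most recent year found shortly after any cue phrase, or None."""
    years = []
    for cue in CUE_RE.finditer(page_text):
        snippet = page_text[cue.end():cue.end() + window]
        years.extend(int(y) for y in YEAR_RE.findall(snippet))
    return max(years) if years else None


def is_stale(page_text, current_year, max_age_years=2):
    """True if the newest date is older than max_age_years; None if no date was found."""
    year = latest_year(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years


print(is_stale("Web page last updated: March 3, 2011", 2014))
print(is_stale("Copyright (c) 2009-2012 Example Lab", 2014))
print(is_stale("Welcome to our lab", 2014))
```

The check works at year granularity only, and a `None` result corresponds to the resources for which no date could be identified, which is why a manual review sample is still needed.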

What happens to these resources?

The Registry provides a persistent identifier and metadata record for what once existed but no longer does

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year

It's a challenge to keep the registry up to date: sitemaps, curation, ontologies, community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources; > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
  – E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD ontology modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• "What genes are upregulated by chronic morphine?"
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

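[Figure: search interface for the query "Addiction", showing facets/filters (Source, Category, Index) over Registry, Data, and Literature results; example sources Gemma and GEO/Integrated Expression with Gene, Organism, and Expression level columns]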

More effective to start with a general query and use the navigation to refine search

Concept Mapper: Alignment and weighting

Find "gene cerebellum" = find all sources with a column mapped to gene that also contain the keyword cerebellum. Find "gene Anatomy:cerebellum"
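One way to read the two query forms above is sketched below in Python. Everything here is hypothetical: the `Source` structure, the `find_sources` function, the toy sources, and the reading of `Anatomy:cerebellum` as "keyword restricted to columns mapped to Anatomy" are assumptions for illustration, not a description of how the Concept Mapper is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    column_concepts: dict                      # column name -> concept label
    rows: list = field(default_factory=list)   # each row: column name -> value


def find_sources(sources, concept, keyword, keyword_concept=None):
    """Names of sources with a column mapped to `concept` whose rows mention `keyword`.

    If keyword_concept is given, the keyword must occur in a column mapped
    to that concept; otherwise any column counts.
    """
    needle = keyword.lower()
    hits = []
    for src in sources:
        if concept not in src.column_concepts.values():
            continue
        if keyword_concept is None:
            columns = None  # search every column
        else:
            columns = {c for c, k in src.column_concepts.items() if k == keyword_concept}
        for row in src.rows:
            values = [v for c, v in row.items() if columns is None or c in columns]
            if any(needle in str(v).lower() for v in values):
                hits.append(src.name)
                break
    return hits


# Toy sources invented for this example.
a = Source("SourceA", {"symbol": "gene", "region": "anatomy"},
           [{"symbol": "GeneX", "region": "cerebellum"}])
b = Source("SourceB", {"symbol": "gene"},
           [{"symbol": "GeneY", "notes": "expressed outside cerebellum"}])
c = Source("SourceC", {"region": "anatomy"}, [{"region": "cerebellum"}])

print(find_sources([a, b, c], "gene", "cerebellum"))
print(find_sources([a, b, c], "gene", "cerebellum", keyword_concept="anatomy"))
```

The second call is stricter: a free-text mention of the keyword no longer counts unless it sits in a column aligned to the named concept.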

"Data trails": Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A "social network" for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: number of records per data category on a log scale from 1 to 10,000,000,000; categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It's like a Mendeley for resources

[Figure: SciCrunch shared resources (shared curation, shared expertise) and the community portals built on them: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]

Making use of community

[Figure: community portal search interface, showing facets/filters (Source, Category, Index) over Community resources, SciCrunch data (all), and Literature; example sources Gemma and GEO/Integrated Expression with Gene, Organism, and Expression level columns]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources per brain region (forebrain, midbrain, hindbrain); legend bins: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

S.W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

"Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature."

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; legend bins: 0, 1-10, 11-100, 101-1000, 1001+. Credit: antibodyregistry.org, Trish Whetzel and Anita Bandrowski]
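The binning behind a coverage plot like this one is simple to reproduce. The Python sketch below uses the bin edges from the legend (0, 1-10, 11-100, 101-1000, 1001+); the function names and the gene counts are invented for illustration and are not data from the Antibody Registry.

```python
from collections import Counter

# Bin labels taken from the figure legend.
BIN_LABELS = ["0", "1-10", "11-100", "101-1000", "1001+"]


def bin_label(count):
    """Map a non-negative result count to its legend bin."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1001+"


def coverage_histogram(results_per_gene):
    """Count how many genes fall into each bin, keeping empty bins visible."""
    tally = Counter(bin_label(n) for n in results_per_gene.values())
    return {label: tally.get(label, 0) for label in BIN_LABELS}


# Hypothetical search-result counts per gene.
example = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 10, "GENE_D": 250, "GENE_E": 5000}
print(coverage_histogram(example))
```

Keeping the "0" bin explicit matters for a gap analysis, since the genes with no results are exactly the ones the index is meant to expose.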

"The Data Homunculus"

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal-lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (human fMRI)
  • Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
  • Number of exact terms used in > 1 database: 42
  • Number of synonym matches: 99
  • Number of 1st order partonomy matches: 385
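The exact and synonym match counts above come from comparing term lists across databases. A minimal Python sketch of that comparison follows; the function name `shared_terms`, the case-insensitive matching, and the tiny term lists are assumptions for illustration, and partonomy matching (which needs a part-of hierarchy) is left out.

```python
from collections import defaultdict


def shared_terms(terms_by_db, synonyms=None):
    """Return the terms that occur in more than one database.

    Terms are compared case-insensitively. `synonyms` optionally maps a
    lowercase term to a preferred lowercase label so that synonym pairs
    are counted as one term.
    """
    synonyms = synonyms or {}
    dbs_by_term = defaultdict(set)
    for db, terms in terms_by_db.items():
        for term in terms:
            key = term.strip().lower()
            key = synonyms.get(key, key)
            dbs_by_term[key].add(db)
    return {term for term, dbs in dbs_by_term.items() if len(dbs) > 1}


# Hypothetical databases and term lists.
terms = {
    "DB1": ["Hippocampus", "Ammon's horn"],
    "DB2": ["hippocampus", "cornu ammonis"],
    "DB3": ["Striatum"],
}
print(shared_terms(terms))
print(shared_terms(terms, {"ammon's horn": "cornu ammonis"}))
```

Even with a synonym table, only terms that someone has already aligned can match, which is consistent with how few of the 1800 terms line up across databases.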

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist."

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki

[Figure: bar chart of red links in the Neuron Registry for soma location, dendrite location, and axon location; y-axis: number, 0 to 300; series: total red links, easy fixes, hard fixes]

Social networks and community sites let us learn things from the collective behavior of contributors: they show limits in our knowledge and our knowledge representations

[Figure: information flow between domain knowledge (ontologies, atlases/maps, annotation, claims and assertions, registries), derived data (models and simulations, analyses), data (databases, data sets), and literature, connected through search and discovery]

We cannot try to shoe-horn everything into a single representation or system, but must figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11

BD2K-K2BD Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs Federation: Metadata about resource vs metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 as it moves between resources; steps labeled Analyzed, Curated, Mirrored]

Stable identifiers and annotations allow us to "track" data as it moves

Data flows throughout the ecosystem; value is added

But... even our standards need standards

GSE13732
E-GEOD-13732
GEO:GSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
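Mapping the three spellings above onto one standard form can be sketched as below. This is a minimal Python illustration, assuming (as the slide implies) that all three refer to the same GEO series; the function name `normalize_geo_series` and the choice of the bare `GSE` accession as the canonical form are assumptions, not the format NIF actually adopted.

```python
import re

# Assumed variant spellings of a GEO series accession:
#   GSE13732, GEO:GSE13732 / GEOGSE13732, and the mirrored form E-GEOD-13732.
_VARIANTS = [
    re.compile(r"^(?:GEO[:_ ]?)?GSE(\d+)$", re.IGNORECASE),
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),
]


def normalize_geo_series(identifier):
    """Return the canonical 'GSE<number>' form, or None if the input is not recognized."""
    cleaned = identifier.strip()
    for pattern in _VARIANTS:
        match = pattern.match(cleaned)
        if match:
            return "GSE" + str(int(match.group(1)))
    return None


for raw in ["GSE13732", "E-GEOD-13732", "GEO:GSE13732", "GEOGSE13732"]:
    print(raw, "->", normalize_geo_series(raw))
```

Returning `None` for anything unrecognized keeps unknown identifiers visible for curation instead of silently guessing.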

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
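Before two sources can be compared, their claims have to be expressed against the same baseline. The Python sketch below shows that re-referencing step under simple assumptions: each claim is reduced to a gene, a condition, a baseline, and an up/down direction, and swapping condition and baseline flips the direction. The `Claim` type, the function names, and the gene label are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    gene: str
    condition: str   # the group being described
    baseline: str    # the group it is compared against
    direction: str   # "up" or "down"


_FLIP = {"up": "down", "down": "up"}


def relative_to(claim, baseline):
    """Re-express a claim against the requested baseline, flipping direction if needed."""
    if claim.baseline == baseline:
        return claim
    if claim.condition == baseline:
        return Claim(claim.gene, claim.baseline, claim.condition, _FLIP[claim.direction])
    raise ValueError("claim does not involve baseline " + repr(baseline))


def consistent(a, b, baseline):
    """True if both claims agree once expressed against the same baseline."""
    return relative_to(a, baseline) == relative_to(b, baseline)


# One source reports saline relative to chronic morphine, the other the reverse.
gemma_style = Claim("GeneX", "saline", "chronic morphine", "down")
drg_style = Claim("GeneX", "chronic morphine", "saline", "up")
print(relative_to(gemma_style, "saline"))
print(consistent(gemma_style, drg_style, "saline"))
```

Without this step the two claims look contradictory even though they describe the same change, which is the situation the slide describes.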



Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

Keeping content up to date

Connectome

Tractography

Epigenetics

bullNew tags come into existencebullNew resource types come into existence eg Mobile appsbullResources add new types of content

bullChange namebullChange scope

bullgt 7000 updates to the registry last year

Itrsquos a challenge to keep the registry up to date sitemaps curation ontologies community review

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views

250 sourcesgt 800 M records

What do you mean by dataDatabases come in many shapes and sizes

bull Primary data

ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)

bull Secondary datandash Data features extracted through

data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)

bull Tertiary datandash Claims and assertions about the

meaning of data

bull Eg gene upregulationdownregulation brain activation as a function of task

bull Registriesndash Metadata

ndash Pointers to data sets or materials stored elsewhere

bull Data aggregatorsndash Aggregate data of the same

type from multiple sources eg Cell Image Library SUMSdb Brede

bull Single sourcendash Data acquired within a single

context eg Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF A search engine for data

NIF Information Framework Query and alignment

bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology

bull Available as services through NIF and BioPortal

NIFSTD

Organism

NS FunctionMolecule InvestigationSubcellularstructure

Macromolecule Gene

Molecule Descriptors

Techniques

Reagent Protocols

Cell

Resource Instrument

Dysfunction QualityAnatomical Structure

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately, even though Federation comprises views of Registry entries

• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources

• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding

• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through “More Resources”

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Figure: SciCrunch shared resources and community portals. Labels: Shared Resources, Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]

Shared curation, shared expertise

Making use of community

[Figure: search results screenshot. Labels: Facet/Filter, Source, Category, Index, Community, Community resources, SciCrunch data (all), Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: “Data Sources” maps for Forebrain, Midbrain, and Hindbrain; legend bins 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.

• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.

• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein-coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; legend bins 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
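The legend bins on this slide (0, 1-10, 11-100, 101-1000, 1001+) suggest a simple way to summarize coverage: count search results per gene, then assign each count to a bin. A minimal sketch, assuming made-up gene names and counts; no real registry numbers are reproduced here.

```python
from collections import Counter


def bin_label(n_results):
    """Map a search-result count to the bins shown in the chart legend."""
    if n_results == 0:
        return "0"
    if n_results <= 10:
        return "1-10"
    if n_results <= 100:
        return "11-100"
    if n_results <= 1000:
        return "101-1000"
    return "1001+"


# Hypothetical result counts per gene; not real registry numbers.
example_counts = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 250, "GENE_D": 4200}

# Number of genes falling into each bin.
histogram = Counter(bin_label(n) for n in example_counts.values())
```

The "0" bin is the one that matters for gap analysis: it lists the genes for which a comprehensive index returns nothing at all.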

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)

• Total: 1800 unique brain terms (excluding Avian)
  • Number of exact terms used in > 1 database: 42
  • Number of synonym matches: 99
  • Number of 1st order partonomy matches: 385
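The "exact terms used in > 1 database" count above can be computed by indexing each term by the databases that use it. The sketch below is illustrative only: the database names and term lists are hypothetical, and treating matches as case-insensitive is an assumption (the original analysis may have been stricter). Synonym and partonomy matching would need an ontology and are not shown.

```python
from collections import defaultdict

# Hypothetical term lists per database; the real term sets are not reproduced.
TERMS = {
    "DB1": ["Hippocampus", "CA1", "Caudate nucleus"],
    "DB2": ["hippocampus", "Caudate", "Cerebellum"],
    "DB3": ["Hippocampal formation", "cerebellum"],
}


def normalize(term):
    """Case-insensitive, whitespace-trimmed form used for exact matching."""
    return term.strip().lower()


def shared_exact_terms(terms_by_db):
    """Return terms used in more than one database, with the databases using them."""
    seen_in = defaultdict(set)
    for db, terms in terms_by_db.items():
        for term in terms:
            seen_in[normalize(term)].add(db)
    return {term: dbs for term, dbs in seen_in.items() if len(dbs) > 1}


shared = shared_exact_terms(TERMS)
unique_terms = {normalize(t) for terms in TERMS.values() for t in terms}
```

In this toy data "Caudate" and "Caudate nucleus" do not match exactly, which is the kind of near miss that the synonym count on the slide captures.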

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki

[Figure: bar charts of red links by property (Soma location, Dendrite location, Axon location); axis “Number” from 0 to 300; series: Total redlinks, easy fixes, hard fixes]

Social networks and community sites let us learn from the collective behavior of contributors: they show the limits of our knowledge and of our knowledge representations

[Figure: Search and Discovery across the ecosystem. Labels: Domain Knowledge, Ontologies, Atlases/Maps, Annotation, Claims, assertions, Registries, Derived data, Models and simulations, Analyses, Data, Databases, Data sets, Literature]

We cannot shoehorn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece

• What should a data discovery index do?
  – Task Forces
  – Pilot projects

• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources

• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)

• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep

• Two-way validation of knowledge to data

Registry vs. Federation: Metadata about resource vs. metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape

• The data space is not uniform

• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to

• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 moving between resources; steps labeled Analyzed, Curated, Mirrored]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem; value is added

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
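The three spellings above appear to name the same GEO series, so a federation needs a rule that maps each variant to one canonical form. A minimal sketch, assuming a hypothetical canonical format `GEO:GSE<number>` and handling only the spellings shown on the slide plus a colon-separated variant; real sources would likely need more patterns.

```python
import re

# Accepts "GSE13732", "GEOGSE13732", "GEO:GSE13732", and "E-GEOD-13732".
# The canonical output format is an assumption chosen for illustration.
_GEO_SERIES = re.compile(
    r"^(?:GEO:?\s*)?GSE(\d+)$|^E-GEOD-(\d+)$",
    re.IGNORECASE,
)


def normalize_geo_series(identifier):
    """Return a canonical GEO series identifier, or None if unrecognized."""
    match = _GEO_SERIES.match(identifier.strip())
    if match is None:
        return None
    number = match.group(1) or match.group(2)
    return f"GEO:GSE{number}"


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {normalize_geo_series(v) for v in variants}
```

Returning `None` for unrecognized input keeps unknown identifiers visible for curation instead of silently coercing them.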

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID

• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
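Before comparing the two resources, results have to be expressed against the same baseline. For a two-condition comparison reported as a log ratio, swapping the baseline flips the sign, since log(a/b) = -log(b/a). The sketch below assumes that results are available as log2 fold changes; the function names and the numeric values are hypothetical.

```python
def reorient(log2_fold_change, reported_baseline, target_baseline):
    """Express a two-condition log2 fold change relative to `target_baseline`.

    Swapping which condition is the baseline flips the sign of the log ratio.
    """
    if reported_baseline == target_baseline:
        return log2_fold_change
    return -log2_fold_change


def direction(log2_fold_change):
    """Summarize a log2 fold change as a direction of change."""
    if log2_fold_change > 0:
        return "up"
    if log2_fold_change < 0:
        return "down"
    return "unchanged"


# Hypothetical values for one gene, reported against different baselines.
gemma_value = reorient(-1.2, reported_baseline="chronic morphine", target_baseline="saline")
drg_value = reorient(1.1, reported_baseline="saline", target_baseline="saline")
agree = direction(gemma_value) == direction(drg_value)
```

Without the reorientation step the two values would look contradictory even though they describe the same change.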

Chronic vs. acute morphine in striatum

• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers

• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014; > 30 journals participating
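A machine-processible identifier is one that a script can pull out of a methods section without human help. Resource identifiers of this kind are cited with an `RRID:` prefix; the sketch below assumes a simple pattern for what follows the prefix, and the sample sentence and identifiers are made up for illustration.

```python
import re

# Assumed pattern: the literal prefix "RRID:" followed by an identifier made of
# letters and digits joined by "_", "-", or ":". Real identifiers may vary.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z0-9]+(?:[_:-][A-Za-z0-9]+)*)")


def extract_rrids(text):
    """Return the identifiers cited with an RRID prefix, in order of appearance."""
    return RRID_PATTERN.findall(text)


# Hypothetical methods text with placeholder identifiers.
methods = (
    "Sections were stained with anti-X antibody (RRID:AB_0000001) and "
    "analyzed with ExampleTool (RRID:SCR_0000002)."
)
found = extract_rrids(methods)
```

A uniform prefix is what makes this work across journals: the same pattern applies regardless of publisher or citation style.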

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635


Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

DATA FEDERATION

NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

250 sources; > 800 M records

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO, brain images from XNAT, microscopic images (CCDB/CIL)

• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)


NIF data federation

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources providing deep query of the contents and unified views

250 sourcesgt 800 M records

What do you mean by dataDatabases come in many shapes and sizes

bull Primary data

ndash Data available for reanalysis eg microarray data sets from GEO brain images from XNAT microscopic images (CCDBCIL)

bull Secondary datandash Data features extracted through

data processing and sometimes normalization eg brain structure volumes (IBVD) gene expression levels (Allen Brain Atlas) brain connectivity statements (BAMS)

bull Tertiary datandash Claims and assertions about the

meaning of data

bull Eg gene upregulationdownregulation brain activation as a function of task

bull Registriesndash Metadata

ndash Pointers to data sets or materials stored elsewhere

bull Data aggregatorsndash Aggregate data of the same

type from multiple sources eg Cell Image Library SUMSdb Brede

bull Single sourcendash Data acquired within a single

context eg Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF A search engine for data

NIF Information Framework Query and alignment

bull Aggregate of community ontologies with some extensions for neuroscience eg Gene Ontology Chebi Protein Ontology

bull Available as services through NIF and BioPortal

NIFSTD

Organism

NS FunctionMolecule InvestigationSubcellularstructure

Macromolecule Gene

Molecule Descriptors

Techniques

Reagent Protocols

Cell

Resource Instrument

Dysfunction QualityAnatomical Structure

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in NeuroLex Wiki

[Figure: bar chart of the number of total red links, easy fixes, and hard fixes for soma location, dendrite location, and axon location; y-axis 0 to 300]
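A red link is a wiki link whose target page does not exist yet. A red-link analysis like the one charted above can be sketched as follows. The slide does not define "easy" and "hard" fixes, so the rule used here (easy means the value matches an existing page once case and whitespace are ignored) is an assumption, and all page names are hypothetical.

```python
def classify_links(values, existing_pages):
    """Count property values that resolve, are easy red links, or are hard red links."""
    # Index existing pages by a case- and whitespace-insensitive key.
    index = {" ".join(p.casefold().split()) for p in existing_pages}
    counts = {"resolved": 0, "easy": 0, "hard": 0}
    for value in values:
        if value in existing_pages:
            counts["resolved"] += 1   # link already points at a page
        elif " ".join(value.casefold().split()) in index:
            counts["easy"] += 1       # assumed "easy fix": spelling variant of a page
        else:
            counts["hard"] += 1       # no candidate page; needs expert attention
    return counts

# Hypothetical wiki pages and soma-location values, for illustration only.
pages = {"Neocortex layer 5", "Hippocampus CA1", "Cerebellar cortex"}
soma_locations = ["Neocortex layer 5", "hippocampus  CA1", "Layer 5b of barrel field"]
print(classify_links(soma_locations, pages))
```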

Social networks and community sites let us learn from the collective behavior of contributors; they show the limits of our knowledge and of our knowledge representations

[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]

We cannot shoehorn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: Metadata about resource vs. metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 as it is curated, analyzed, and mirrored across resources]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem: value is added

But…even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
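The three spellings above all point at the same GEO series. One way to reconcile them is a small normalizer, sketched below under stated assumptions: the function name is hypothetical, the pattern covers only the variants shown on the slide (plus an optional colon or space after "GEO"), and this is not presented as NIF's actual text-mining pipeline.

```python
import re
from typing import Optional

# Accepts "GSE13732", "GEOGSE13732", "GEO:GSE13732", and "E-GEOD-13732".
_GEO_SERIES = re.compile(r"^(?:(?:GEO[:\s]?)?GSE|E-GEOD-)(\d+)$", re.IGNORECASE)

def canonical_geo_series(raw: str) -> Optional[str]:
    """Return the identifier in the form GSE<number>, or None if it is not recognized."""
    match = _GEO_SERIES.match(raw.strip())
    return f"GSE{match.group(1)}" if match else None

for variant in ["GSE13732", "E-GEOD-13732", "GEOGSE13732", "13732"]:
    print(variant, "->", canonical_geo_series(variant))
```

Returning None for unrecognized input, rather than guessing, keeps unknown spellings visible for curation.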

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
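Because the two databases report change against different baselines, a comparison has to put both on the same footing first. The sketch below assumes each claim is reduced to a direction ("up" or "down") plus its baseline, and that genes have already been mapped to a shared key; the gene names and record layout are hypothetical.

```python
FLIP = {"up": "down", "down": "up"}

def relative_to_saline(direction: str, baseline: str) -> str:
    """Express a direction of change as if saline were the baseline."""
    if direction not in FLIP:
        raise ValueError(f"unknown direction: {direction}")
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        return FLIP[direction]  # opposite baseline, so the sign flips
    raise ValueError(f"unknown baseline: {baseline}")

def compare(gemma: dict, drg: dict) -> dict:
    """Label each Gemma claim as consistent with, opposite to, or absent from DRG."""
    verdicts = {}
    for gene, (direction, baseline) in gemma.items():
        if gene not in drg:
            verdicts[gene] = "not in DRG"
            continue
        drg_direction, drg_baseline = drg[gene]
        same = relative_to_saline(direction, baseline) == relative_to_saline(drg_direction, drg_baseline)
        verdicts[gene] = "consistent" if same else "opposite"
    return verdicts

# Hypothetical claims, for illustration only.
gemma = {"geneA": ("down", "chronic morphine"), "geneB": ("up", "chronic morphine"), "geneC": ("up", "chronic morphine")}
drg = {"geneA": ("up", "saline"), "geneB": ("up", "saline")}
print(compare(gemma, drg))
```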

Chronic vs. acute morphine in striatum

• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014; > 30 journals participating
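What "machine processible" buys us can be illustrated by pulling identifiers out of article text. The slide does not spell out the identifier syntax, so the pattern below is an assumption based on the commonly used "RRID:" prefix followed by an authority code and accession (for example AB_ for antibodies or SCR_ for software). The sample sentence and its accession numbers are made up.

```python
import re

# Assumed syntax: "RRID:" + uppercase authority prefix + "_" + accession.
_RRID = re.compile(r"RRID:\s*([A-Z]+_[A-Za-z0-9:_-]*[A-Za-z0-9])")

def extract_rrids(text: str) -> list:
    """Return the distinct identifiers found in text, in order of first appearance."""
    found = []
    for accession in _RRID.findall(text):
        if accession not in found:
            found.append(accession)
    return found

# Made-up methods sentence; the accession numbers are placeholders.
methods = (
    "Sections were stained with an anti-CART antibody (RRID:AB_0000001) "
    "and analyzed with ImageTool (RRID: SCR_0000002). "
    "The same antibody (RRID:AB_0000001) was used for blots."
)
print(extract_rrids(methods))
```

Tolerating a space after the colon is a small hedge against the kind of copyediting changes that can alter identifiers in print.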

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph

This is your brain on computers

What do you mean by data? Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

NIF: A search engine for data

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Diagram: NIFSTD modules: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: search results for the query “Addiction”, refined by facet/filter (Source, Category, Index); labels include Registry, Data, Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, Literature]

More effective to start with a general query and use the navigation to refine search
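Progressive refinement can be sketched as two operations: count the values of a facet across the current results, then filter on one of them. The record fields ("source", "category") echo the facet names on the slide, but the records themselves and the function names are hypothetical; this is not the NIF search implementation.

```python
from collections import Counter

def facet_counts(records: list, facet: str) -> Counter:
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records if facet in r)

def refine(records: list, **filters) -> list:
    """Keep only the records that match every facet=value filter."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

# Hypothetical results for a general query, for illustration only.
results = [
    {"title": "record 1", "source": "Gemma", "category": "Gene"},
    {"title": "record 2", "source": "Gemma", "category": "Gene"},
    {"title": "record 3", "source": "Registry", "category": "Resource"},
    {"title": "record 4", "source": "Literature", "category": "Literature"},
]

print(facet_counts(results, "category"))   # overview after the general query
gene_hits = refine(results, category="Gene")
print(facet_counts(gene_hits, "source"))   # next level of refinement
```

Starting broad and reading the facet counts first shows where the data actually are before any filter hides them.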

Concept Mapper: Alignment and weighting

Find: gene cerebellum = find all sources with column mapped to gene that also contain keyword cerebellum

Find: gene Anatomy:cerebellum

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART

[Figure: bar chart by data category on a logarithmic axis from 1 to 10,000,000,000; categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Diagram: SciCrunch shared resources and community portals: Undiagnosed Disease Program; Phenotype RCN; One Mind for Research; Consortia-Pedia; Faster Cures; Model Organism Databases; Community Outreach; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF Earthcube]

Shared curation, shared expertise

Making use of community

[Figure: facet/filter view (Source, Category, Index) scoped to a community; labels include Community resources, SciCrunch data (all), Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources per brain region, grouped as Forebrain, Midbrain, Hindbrain; bins: 0, 1-10, 11-100, > 101]

Revealing biases in the dataspace

SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014: > 30 journals participating
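A machine-processible identifier makes it possible to pull resource mentions out of a paper automatically. The sketch below assumes an identifier shape of `RRID:` followed by an authority prefix, an underscore, and a local identifier; the pattern, the function name, and the placeholder identifiers in the sample text are assumptions for illustration, not the initiative's specification.

```python
import re

# Assumed shape: "RRID:" + authority prefix + "_" + local identifier,
# e.g. "RRID:AB_0000000" (a placeholder, not a real record).
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9:_-]*[A-Za-z0-9])")


def extract_rrids(text):
    """Return the distinct RRIDs mentioned in a passage, in order of first appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Hypothetical methods text with placeholder identifiers.
methods = (
    "Sections were stained with an anti-CART antibody (RRID:AB_0000000) "
    "and images were analyzed with an imaging package (RRID:SCR_0000000). "
    "The same antibody (RRID: AB_0000000) was used for controls."
)
print(extract_rrids(methods))
```

Because the identifier sits in the text itself, it remains visible outside the paywall and can be harvested the same way across journals.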

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635


NeuroLex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

NIF Information Framework: Query and alignment

• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF uses ontologies to enhance search and discovery but is not constrained by them

Find clinical trials that have data available

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
  – Much value has been added to individual data sets

Facets and filters: Progressive refinement of search

[Figure: search interface for the query “Addiction”, with Facet/Filter, Source, Category, and Index controls; result labels include Registry, Data, Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, and Literature]

More effective to start with a general query and use the navigation to refine search
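Progressive refinement can be thought of as counting results per facet and then narrowing by one facet value at a time. The following is a minimal sketch with hypothetical result records; the field names reuse the Index, Category, and Source labels from the slide, but the records and functions are illustrative, not the NIF search API.

```python
from collections import Counter

# Hypothetical results for a general query; field names follow the slide labels.
RESULTS = [
    {"index": "Data", "category": "Gene", "source": "Gemma"},
    {"index": "Data", "category": "Gene", "source": "Geo"},
    {"index": "Data", "category": "Expression", "source": "Gemma"},
    {"index": "Registry", "category": "Resource", "source": "Registry"},
    {"index": "Literature", "category": "Article", "source": "Literature"},
]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records)


def refine(records, **filters):
    """Keep only the records that match every requested facet value."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]


step1 = RESULTS                         # start with the general query
step2 = refine(step1, index="Data")     # narrow to one index
step3 = refine(step2, category="Gene")  # then to one category

print(facet_counts(step1, "index"))
print(facet_counts(step3, "source"))
```

Showing the counts at each step lets the user see where the data are before committing to a narrower query.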

Concept Mapper: Alignment and weighting

Findgene cerebellum = find all sources with column mapped to gene that also contain keyword cerebellum

Findgene Anatomycerebellum
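The query described above combines a concept constraint on a column with a keyword constraint on the content. A rough sketch of that logic follows; the source descriptions, the `find` function, and its `keyword_concept` parameter are hypothetical and do not reproduce the Concept Mapper's actual syntax or weighting.

```python
# Hypothetical source descriptions: each column is mapped to a concept,
# and each source lists the values held in its columns.
SOURCES = {
    "source_a": {"columns": {"gene_symbol": "gene", "region": "anatomy"},
                 "values": {"gene_symbol": ["GENE_X"],
                            "region": ["cerebellum", "striatum"]}},
    "source_b": {"columns": {"symbol": "gene"},
                 "values": {"symbol": ["GENE_X"]}},
    "source_c": {"columns": {"structure": "anatomy"},
                 "values": {"structure": ["cerebellum"]}},
}


def find(concept, keyword=None, keyword_concept=None):
    """Return sources with a column mapped to `concept` that also contain `keyword`.

    If `keyword_concept` is given, the keyword must appear in a column
    mapped to that concept rather than anywhere in the source.
    """
    hits = []
    for name, source in SOURCES.items():
        if concept not in source["columns"].values():
            continue
        if keyword is not None:
            columns = [col for col, mapped in source["columns"].items()
                       if keyword_concept is None or mapped == keyword_concept]
            if not any(keyword in source["values"].get(col, []) for col in columns):
                continue
        hits.append(name)
    return hits


print(find("gene", "cerebellum"))
print(find("gene", "cerebellum", keyword_concept="anatomy"))
```

In this toy data both queries return only `source_a`: `source_b` has a gene column but no mention of cerebellum, and `source_c` mentions cerebellum but has no gene column.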

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

Put dkNET here

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) by data category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical trials, Brain activation foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through “More Resources”

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Figure: SciCrunch shared resources (shared curation, shared expertise) at the center, surrounded by community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube]

Making use of community

[Figure: search interface with Facet/Filter, Source, Category, and Index controls; results are split into Community resources and SciCrunch data (all), with labels Gene, Gemma, Gene, Organism, Expression level, Geo, Integrated Expression, and Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA: GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: heat map of brain regions (Forebrain, Midbrain, Hindbrain) against data sources; legend bins: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace
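A gap analysis of this kind amounts to counting records per brain region and data source, binning the counts, and listing the empty cells. The sketch below uses hypothetical counts and source names; the slide writes the last legend bin as ">101", which is treated here as "more than 100", an assumption.

```python
def legend_bin(count):
    """Map a record count onto the legend bins (last bin assumed to mean > 100)."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"


# Hypothetical record counts per (brain region, data source).
COUNTS = {
    ("Forebrain", "source_a"): 250,
    ("Forebrain", "source_b"): 12,
    ("Midbrain", "source_a"): 3,
    ("Hindbrain", "source_b"): 40,
}
REGIONS = ["Forebrain", "Midbrain", "Hindbrain"]
DATA_SOURCES = ["source_a", "source_b"]


def gap_matrix(counts, regions, sources):
    """Build a region-by-source table of legend bins; missing pairs count as 0."""
    return {region: {source: legend_bin(counts.get((region, source), 0))
                     for source in sources}
            for region in regions}


def gaps(matrix):
    """List the (region, source) cells that hold no data at all."""
    return [(region, source)
            for region, row in matrix.items()
            for source, bin_label in row.items()
            if bin_label == "0"]


matrix = gap_matrix(COUNTS, REGIONS, DATA_SOURCES)
print(matrix)
print(gaps(matrix))
```

The regions come from the knowledge side (an ontology or atlas) rather than from the data, which is what allows cells with no data to show up at all.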

S. W. Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs number of search results from antibodyregistry.org; legend bins: 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski

“The Data Homunculus”

Data-Knowledge Mismatch

Dutkowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal-lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
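Counts like these can be produced by pooling the term lists of each database and checking which terms occur in more than one, first by direct match and then through a synonym table. The following is a minimal sketch with hypothetical term lists and a hypothetical, purely illustrative synonym entry; matching here ignores case and extra whitespace, which may be looser than the "exact" matching used for the numbers above, and partonomy matching is not shown.

```python
from collections import defaultdict

# Hypothetical term lists for three connectivity databases.
TERMS = {
    "db_one": ["Hippocampus", "Caudate nucleus"],
    "db_two": ["hippocampus", "Caudoputamen"],
    "db_three": ["Striatum", "caudate nucleus"],
}

# Hypothetical synonym table (variant -> preferred label), for illustration only.
SYNONYMS = {"caudoputamen": "striatum"}


def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def shared_terms(term_lists, synonyms=None):
    """Return the terms used by more than one database, with the databases using them."""
    synonyms = synonyms or {}
    used_by = defaultdict(set)
    for database, terms in term_lists.items():
        for term in terms:
            key = normalize(term)
            key = synonyms.get(key, key)
            used_by[key].add(database)
    return {term: sorted(dbs) for term, dbs in used_by.items() if len(dbs) > 1}


direct = shared_terms(TERMS)
with_synonyms = shared_terms(TERMS, SYNONYMS)
print(direct)
print(with_synonyms)
```

Even in this toy example, one of the three shared concepts is only found once synonyms are taken into account.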

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Transition Zones: Neurons and their properties

[Figure panels: Location of cell soma; Location of dendrites; Location of local axon arbor]

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  – > 30 experts worldwide
  – Fill out neuron pages in NeuroLex Wiki

[Figure: bar chart (axis 0 to 300) for soma location, dendrite location, and axon location; labels: Number, Total, red links, easy fixes, hard fixes]
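In a wiki, a red link is a link to a page that does not exist yet, so red links in neuron property fields mark values with no matching entry in the lexicon. The sketch below shows one way such an analysis could be run; the page names and neurons are hypothetical, and the rule that an "easy fix" differs from an existing page only by capitalization is an assumption, not the criterion used in the original analysis.

```python
# Hypothetical wiki content, for illustration only.
EXISTING_PAGES = {"Cerebellum", "Hippocampus", "CA1 stratum radiatum"}

NEURON_PAGES = {
    "Neuron A": {"soma_location": ["Cerebellum"],
                 "dendrite_location": ["Cerebellar molecular layer"],
                 "axon_location": ["cerebellum"]},
    "Neuron B": {"soma_location": ["Hippocampus"],
                 "dendrite_location": ["CA1 stratum radiatum"],
                 "axon_location": ["Somewhere unspecified"]},
}


def red_links(neuron_pages, existing_pages):
    """Report property values that do not resolve to an existing page.

    Assumption: a value that matches an existing page apart from
    capitalization is an "easy fix"; anything else is a "hard fix".
    """
    lowercase_titles = {page.lower() for page in existing_pages}
    report = {}
    for neuron, properties in neuron_pages.items():
        for prop, values in properties.items():
            for value in values:
                if value in existing_pages:
                    continue
                kind = "easy fix" if value.lower() in lowercase_titles else "hard fix"
                report.setdefault(prop, []).append((neuron, value, kind))
    return report


report = red_links(NEURON_PAGES, EXISTING_PAGES)
for prop, problems in report.items():
    print(prop, problems)
```

Hard fixes are the interesting cases: they may point to a term missing from the lexicon or to a region that the existing parcellation does not represent.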

Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations

[Figure: layers of the ecosystem. Domain Knowledge: Ontologies, Atlases/Maps, Annotation. Claims and assertions: Registries, Derived data, Models and simulations, Analyses. Data: Databases, Data sets, Literature. Search and Discovery]

Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPC’s for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs Federation: Metadata about resource vs metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

Current challenge With so much available how do I find what I need

bull ldquoWhat genes are upregulatedby chronic morphinerdquondash It depends

bull Most often use cases require connecting a researcher to relevant data sets and appropriate toolsndash Depending upon the data and

tools the answers may differ

bull Many databases have tool bases and workflows that they supportndash Much value has been added to

individual data sets

Facets and filters Progressive refinement of search

FacetFilter

Source

Category

Index

Query Addiction

Registry Data

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

More effective to start with a general query and use the navigation to refine search

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: community engagement piece
• What should a data discovery index do?
  - Task forces
  - Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  - Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03
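The "sets of triples" idea on this slide can be illustrated with a few lines of code. This is a minimal sketch, assuming each statement is a (subject, predicate, object) tuple; the neuron, property, and region names are hypothetical and do not come from Neurolex:

```python
# Each statement is a (subject, predicate, object) triple. All names are illustrative.
triples = [
    ("Example neuron A", "has soma location", "Example region X"),
    ("Example neuron A", "has neurotransmitter", "Example transmitter T"),
    ("Example neuron B", "has soma location", "Example region X"),
    ("Example region X", "is part of", "Example region Y"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Which neurons have their soma in "Example region X"?
neurons_in_x = [s for s, _, _ in match(triples, predicate="has soma location", obj="Example region X")]
```

Because every statement has the same three-part shape, contributors can add new properties without a schema change, which is part of what makes the approach lightweight.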

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Diagram: data set GSE13732 passing between resources, where it is curated, analyzed, and mirrored]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem: value is added

But... even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
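The three spellings above all point at the same GEO series. A minimal sketch of mapping them to one form, assuming regular expressions are enough for these cases. The canonical form "GEO:GSE<number>" and the function name are choices made here for illustration, not a NIF convention:

```python
import re

# Patterns for the spellings shown on the slide. E-GEOD-<n> is the ArrayExpress
# accession style for series imported from GEO.
_PATTERNS = [
    re.compile(r"^(?:GEO:?)?GSE(\d+)$", re.IGNORECASE),  # GSE13732, GEOGSE13732, GEO:GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),        # E-GEOD-13732
]

def canonical_geo_series(identifier):
    """Map known spellings of a GEO series accession to one string, or None if unrecognized."""
    cleaned = identifier.strip()
    for pattern in _PATTERNS:
        found = pattern.match(cleaned)
        if found:
            return f"GEO:GSE{found.group(1)}"
    return None

variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {canonical_geo_series(v) for v in variants}
```

Returning None for anything unrecognized, rather than guessing, leaves the inconsistent cases visible for curation or text mining.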

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
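When two resources report the same comparison against opposite baselines, a change reported as "up" in one appears as "down" in the other. A minimal sketch of putting both on a common baseline before comparing, assuming each resource reports a log2 fold change. The record fields, gene name, and values are hypothetical:

```python
def to_common_baseline(records, flip_sources):
    """Re-express log2 fold changes so that every source uses the same baseline.

    records:      list of dicts with keys "source", "gene", "log2fc"
    flip_sources: names of sources whose baseline is the opposite condition
    """
    harmonized = []
    for rec in records:
        # Swapping numerator and denominator of a ratio negates its logarithm.
        log2fc = -rec["log2fc"] if rec["source"] in flip_sources else rec["log2fc"]
        harmonized.append({**rec, "log2fc": log2fc})
    return harmonized

def same_direction(a, b):
    """True when two log2 fold changes have the same nonzero sign."""
    return a * b > 0

# Hypothetical values for one gene in the two resources.
records = [
    {"source": "Gemma", "gene": "GeneX", "log2fc": -1.2},
    {"source": "DRG", "gene": "GeneX", "log2fc": 1.0},
]
harmonized = to_common_baseline(records, flip_sources={"Gemma"})
consistent = same_direction(harmonized[0]["log2fc"], harmonized[1]["log2fc"])
```

Without the flip, these two records would look contradictory even though they describe the same effect.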

Chronic vs. acute morphine in striatum

• Analysis
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  - Machine processible (i.e., unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  - Software/databases
  - Antibodies
  - Genetically modified organisms

Launched February 2014; > 30 journals participating
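"Machine processible" means a program can pull the identifiers out of a methods section without human help. A minimal sketch, assuming identifiers are written as "RRID:" followed by a registry prefix, an underscore, and an alphanumeric local ID. Real RRIDs have more variants than this pattern covers, and the identifiers and text below are made up:

```python
import re

# Assumed shape: RRID:<PREFIX>_<local id>, with an optional space after the colon.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+_[A-Za-z0-9]+)")

def extract_rrids(text):
    """Return the RRIDs mentioned in a passage, in order of first appearance."""
    seen = []
    for found in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + found.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen

# Hypothetical methods text with made-up identifiers.
methods = (
    "Sections were stained with an example antibody (Vendor, Cat# 000, RRID:AB_0000001) "
    "and analyzed with ExampleTool (RRID: SCR_0000002). Counts used RRID:AB_0000001 again."
)
rrids = extract_rrids(methods)
```

A single uniform pattern only works if the syntax is the same across journals and publishers, which is the point of the third requirement above.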

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

Facets and filters: Progressive refinement of search

[Diagram: the query "Addiction" run against the index and refined by facet/filter, category, and source. Labels shown: Registry, Data, Literature; Gene; Gemma (Gene, Organism, Expression level); GEO; Integrated Expression.]

More effective to start with a general query and use the navigation to refine search
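Progressive refinement can be pictured as a broad keyword search followed by counting and filtering on facets. A minimal sketch, assuming records are dictionaries with "source", "category", and "text" fields. The records and field names are hypothetical, not the NIF index:

```python
from collections import Counter

def search(records, keyword):
    """Broad first pass: keep records whose text mentions the keyword."""
    kw = keyword.lower()
    return [rec for rec in records if kw in rec["text"].lower()]

def facet_counts(records, field):
    """How many hits fall under each value of a facet, e.g. per category or per source."""
    return Counter(rec[field] for rec in records)

def refine(records, **filters):
    """Narrow the hits by requiring exact values for one or more facets."""
    return [rec for rec in records if all(rec.get(k) == v for k, v in filters.items())]

# Hypothetical index content.
records = [
    {"source": "Gemma", "category": "Data", "text": "Gene expression in an addiction model"},
    {"source": "GEO", "category": "Data", "text": "Addiction-related microarray series"},
    {"source": "Registry", "category": "Registry", "text": "Addiction research resource"},
    {"source": "Literature", "category": "Literature", "text": "Review of cerebellum development"},
]
hits = search(records, "addiction")
by_category = facet_counts(hits, "category")
gemma_data = refine(hits, category="Data", source="Gemma")
```

The facet counts show where the hits are before any filter is chosen, which is why starting general and then narrowing tends to work better than a very specific first query.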

Concept Mapper: Alignment and weighting

Findgene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum

Findgene Anatomycerebellum
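The query described above combines a column-to-concept mapping with a keyword match. A minimal sketch, assuming each source declares which concept each of its columns maps to. The data layout, source names, and the `keyword_concept` parameter are hypothetical; the second form, which restricts the keyword to anatomy columns, is only one possible reading of the second query on the slide:

```python
def find_sources(sources, concept, keyword, keyword_concept=None):
    """Names of sources with a column mapped to `concept` that also contain `keyword`.

    If keyword_concept is given, the keyword must occur in a column mapped to that concept.
    """
    kw = keyword.lower()
    matches = []
    for name, source in sources.items():
        mapping = source["column_concepts"]
        if concept not in mapping.values():
            continue
        # Columns in which the keyword is allowed to appear.
        columns = [col for col, mapped in mapping.items() if keyword_concept in (None, mapped)]
        if any(kw in str(row.get(col, "")).lower() for row in source["rows"] for col in columns):
            matches.append(name)
    return matches

# Hypothetical sources.
sources = {
    "SourceA": {"column_concepts": {"symbol": "gene", "region": "anatomy"},
                "rows": [{"symbol": "GeneX", "region": "Cerebellum"}]},
    "SourceB": {"column_concepts": {"structure": "anatomy"},
                "rows": [{"structure": "cerebellum"}]},
    "SourceC": {"column_concepts": {"gene_name": "gene"},
                "rows": [{"gene_name": "GeneY"}]},
    "SourceD": {"column_concepts": {"gene": "gene", "notes": "free text"},
                "rows": [{"gene": "GeneZ", "notes": "expressed in cerebellum"}]},
}
keyword_hits = find_sources(sources, "gene", "cerebellum")
anatomy_hits = find_sources(sources, "gene", "cerebellum", keyword_concept="anatomy")
```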

“Data trails”: Linking data and analysis tools

Query across Registry and Federation

• Registry and Federation were treated separately even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A “social network” for resources

• NIF is a general search engine across all of neuroscience
  • Very powerful for discovery and general browsing
  • Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  • Specialized for their domain
  • Restrict the particular sources
  • Organize the data according to their needs
  • Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) by category: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It's like a Mendeley for resources

[Diagram: SciCrunch shared resources, with community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF EarthCube. Caption: Shared curation, shared expertise.]

Making use of community

[Diagram: search refined by facet/filter, category, and source, with community added as a facet. Labels shown: Community; Community resources; SciCrunch data (all); Gene; Gemma (Gene, Organism, Expression level); GEO; Integrated Expression; Literature.]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources by brain region, grouped under forebrain, midbrain, and hindbrain. Legend bins: 0, 1-10, 11-100, >101.]

Revealing biases in the dataspace

SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases, such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org. Bins: 0, 1-10, 11-100, 101-1000, 1001+.]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
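The figure groups genes by how many antibody records a search returns. A minimal sketch of that binning, using the bin boundaries from the slide. The gene names and counts are hypothetical, and the registry lookup itself is not shown:

```python
import bisect
from collections import Counter

# Inclusive upper edges of the first four bins: 0, 1-10, 11-100, 101-1000, then 1001+.
EDGES = [0, 10, 100, 1000]
LABELS = ["0", "1-10", "11-100", "101-1000", "1001+"]

def bin_label(count):
    """Return the bin label for a non-negative number of search results."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return LABELS[bisect.bisect_left(EDGES, count)]

def histogram(counts_by_gene):
    """Number of genes falling in each bin."""
    return Counter(bin_label(n) for n in counts_by_gene.values())

# Hypothetical search-result counts per gene.
counts_by_gene = {"GENE_A": 0, "GENE_B": 3, "GENE_C": 10, "GENE_D": 250, "GENE_E": 5000}
bins = histogram(counts_by_gene)
```

The "0" bin is the interesting one for gap analysis, because it lists the proteins for which the index finds no antibody at all.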

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

Concept Mapper Alignment and weighting

Findgene cerebellum=find all sources with column mapped to gene that also contain keyword cerebellum Findgene Anatomycerebellum

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

ldquoData trailsrdquo Linking data and analysis tools

Query across Registry and Federation

bull Registry and Federation were treated separately even though Federation comprises views of Registry entries

bull Experimenting with new combined index

SciCrunch A ldquosocial networkrdquo for resources

bull NIF is a general search engine across all of neurosciencebull Very powerful for discovery

and general browsingbull Can perform analytics across

the spectrum of biomedical resources

bull Many communities want to create more focused portalsbull Specialized for their domainbull Restrict the particular sourcesbull Organize the data according

to their needsbull Use their own branding

bull How do we create a system that satisfies community needs without creating another silo

Put dkNET here

httpdknetorg

Autogenerated snippets

Where can I find validated antibodies against CART

1 100 10000 1000000 10000000010000000000

SOFTWARE

PROTOCOLS

PHENOTYPE

PATHWAYS

MULTIMEDIA

MOLECULE

MICROARRAY

IMAGES

GENE

DRUGS

DATASET

CLINICAL TRIALS

BRAIN ACTIVATION FOCI

ATLAS

ANNOTATION

All databases in the SciCrunchFederation become immediately available through More Resources

Breaking down silos Community enrichment

Itrsquos like a Mendeley for resources

SciCrunch

SharedResources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-PediaFaster Cures

Model Organism Databases

Community Outreach

Shared curation shared expertiseResource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

FacetFilter

Source

Category

Index

Community Community

Community resources

SciCrunchdata (all)

Gene

Gemma

Gene OrganismExpression

level

GeoIntegrated Expression

Literature

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem Where are the data

Data Sources

Bringing knowledge to data Gap analysis

Forebrain

Midbrain

Hindbrain

0

1-10

11-100

gt101

Data Sources

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  - Brain Architecture Management System (rodent)
  - Temporal lobe.com (rodent)
  - Connectome Wiki (human)
  - Brain Maps (various)
  - CoCoMac (primate cortex)
  - UCLA Multimodal database (human fMRI)
  - Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
  - Number of exact terms used in > 1 database: 42
  - Number of synonym matches: 99
  - Number of 1st order partonomy matches: 385
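A count of terms shared across databases could be computed along these lines. This is a minimal sketch under stated assumptions: the database names, term lists, and synonym table are hypothetical, case and whitespace are normalized before comparison (the original analysis may have been stricter), and partonomy matching is not covered.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(term.lower().split())


def terms_in_multiple_sources(
    terms_by_source: Dict[str, Set[str]],
    synonyms: Optional[Dict[str, str]] = None,
) -> Set[str]:
    """Return terms that occur in more than one source.

    `synonyms` maps a normalized term to its preferred label; omit it
    to count exact matches only.
    """
    synonyms = synonyms or {}
    sources_by_term = defaultdict(set)
    for source, terms in terms_by_source.items():
        for term in terms:
            key = normalize(term)
            key = synonyms.get(key, key)
            sources_by_term[key].add(source)
    return {term for term, sources in sources_by_term.items() if len(sources) > 1}


# Hypothetical term lists, for illustration only.
example_sources = {
    "database_a": {"Hippocampus", "CA1", "Caudate nucleus"},
    "database_b": {"hippocampus", "Caudate", "Amygdala"},
}
exact_matches = terms_in_multiple_sources(example_sources)
synonym_matches = terms_in_multiple_sources(
    example_sources, synonyms={"caudate": "caudate nucleus"}
)
```

Even in this toy case, the synonym table doubles the overlap, which is the same pattern the slide reports at scale: few exact matches, more once synonyms and part-of relations are considered.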

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist."

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition zones: neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
  - Neuron Registry
• Neurolex.org
• Semantic MediaWiki
  - > 30 experts worldwide
  - Fill out neuron pages in NeuroLex Wiki

[Bar chart: number of total red links, easy fixes, and hard fixes for soma location, dendrite location, and axon location; axis from 0 to 300]

Social networks and community sites let us learn from the collective behavior of contributors; they show the limits of our knowledge and of our knowledge representations.

[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]

We cannot shoehorn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  - Dr. Lucila Ohno-Machado, PI
  - FORCE11: community engagement piece
• What should a data discovery index do?
  - Task forces
  - Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNET, 3DVC, and FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  - Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
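"Sets of triples" can be pictured with a very small sketch. Everything here is hypothetical: the neuron, region, and property names are invented, and the `red_links` helper reflects one plausible reading of a wiki "red link" (a page that is linked to but does not exist yet), not the actual NeuroLex implementation.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical content; page and property names are invented for illustration.
TRIPLES: List[Triple] = [
    ("Example neuron A", "has soma location", "Example region X"),
    ("Example neuron A", "has dendrite location", "Example region Y"),
    ("Example neuron B", "has soma location", "Example region X"),
]


def objects_of(triples: Iterable[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def red_links(triples: Iterable[Triple], existing_pages: Set[str]) -> Set[str]:
    """Return objects that are linked to but have no page of their own."""
    return {o for _, _, o in triples if o not in existing_pages}


soma_of_a = objects_of(TRIPLES, "Example neuron A", "has soma location")
missing_pages = red_links(TRIPLES, existing_pages={"Example region X"})
```

The appeal of this representation is that contributors only need to fill in simple property values, while queries such as "which linked regions have no page yet" fall out of the same data.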

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

Data flows throughout the ecosystem: value is added

[Diagram: data set GSE13732 is mirrored, curated, and analyzed as it moves between resources]

Stable identifiers and annotations allow us to "track" data as it moves

But... even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
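Mapping the three spellings above onto one canonical form could look like the sketch below. The patterns are assumptions written to cover exactly the variants shown on the slide (a bare `GSE` accession, an `E-GEOD-` form, and a `GEO`-prefixed form with or without a separator); a real federation would need a curated rule set per source.

```python
import re
from typing import Optional

# Assumed patterns covering the variants shown above.
_GSE_PATTERN = re.compile(r"^(?:GEO[:_ ]?)?GSE(\d+)$", re.IGNORECASE)
_E_GEOD_PATTERN = re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE)


def canonical_geo_series(identifier: str) -> Optional[str]:
    """Return the identifier as 'GSE<number>', or None if unrecognized."""
    cleaned = identifier.strip()
    match = _GSE_PATTERN.match(cleaned) or _E_GEOD_PATTERN.match(cleaned)
    if match is None:
        return None
    return "GSE" + match.group(1)


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {canonical_geo_series(v) for v in variants}
```

Returning `None` for anything unrecognized, rather than guessing, keeps unknown identifiers visible so that they can be routed to text mining or manual curation.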

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis
  - 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  - 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  - Results for 1 gene were opposite in DRG and Gemma
  - 45 did not have enough information provided in the paper to make a judgment
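The baseline problem described above can be made concrete with a small sketch: before two statements about the same gene are compared, both are re-expressed against the same reference condition. The `Statement` type, the gene name, and the fold-change values are hypothetical and only illustrate the sign flip; they are not data from Gemma or DRG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    gene: str
    log2_fold_change: float  # change of the other condition relative to `baseline`
    baseline: str            # condition used as the reference


def relative_to(statement: Statement, baseline: str, other: str) -> Statement:
    """Re-express a two-condition comparison against `baseline`."""
    if statement.baseline == baseline:
        return statement
    if statement.baseline == other:
        # Swapping the reference condition flips the sign of the change.
        return Statement(statement.gene, -statement.log2_fold_change, baseline)
    raise ValueError(f"unknown baseline: {statement.baseline!r}")


def same_direction(a: Statement, b: Statement) -> bool:
    """True when both statements report a change in the same direction."""
    return (a.log2_fold_change > 0) == (b.log2_fold_change > 0)


# Hypothetical values, for illustration only.
from_gemma = Statement("GENE_X", -1.2, baseline="chronic morphine")
from_drg = Statement("GENE_X", 0.8, baseline="saline")

naive_agreement = same_direction(from_gemma, from_drg)
harmonized = relative_to(from_gemma, baseline="saline", other="chronic morphine")
real_agreement = same_direction(harmonized, from_drg)
```

Compared naively, the two statements appear to disagree; once both use saline as the reference they agree, which is why the baseline has to be recorded alongside every expression claim.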

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  - Machine processible (i.e., unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
  - Software/databases
  - Antibodies
  - Genetically modified organisms

Launched February 2014; > 30 journals participating
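"Machine processible" means a script should be able to pull the identifiers out of a methods section. The sketch below assumes the common `RRID:<prefix>_<number>` citation style; the pattern is deliberately permissive, and the sentence and identifier values are placeholders invented for illustration, not real registry entries.

```python
import re
from typing import List

# Assumed citation style: "RRID:" followed by a prefix, a separator, and an ID.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+[_-][A-Za-z0-9_:-]*[A-Za-z0-9])")


def extract_rrids(text: str) -> List[str]:
    """Return every RRID found in the text, in order of appearance."""
    return RRID_PATTERN.findall(text)


# Placeholder identifiers, for illustration only.
methods_text = (
    "Sections were stained with an antibody (RRID:AB_0000001) and "
    "images were analyzed with a software tool (RRID:SCR_0000002)."
)
found = extract_rrids(methods_text)
```

Running the same extraction on a submitted manuscript and on the published version is one simple way to detect identifiers that were lost or altered during copyediting.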

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

Query across Registry and Federation

• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index

SciCrunch: A "social network" for resources

• NIF is a general search engine across all of neuroscience
  - Very powerful for discovery and general browsing
  - Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
  - Specialized for their domain
  - Restrict the particular sources
  - Organize the data according to their needs
  - Use their own branding
• How do we create a system that satisfies community needs without creating another silo?

http://dknet.org

Autogenerated snippets

Where can I find validated antibodies against CART?

[Bar chart, logarithmic axis from 1 to 10,000,000,000: records by data type: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It's like a Mendeley for resources

[Diagram labels: SciCrunch; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; One Mind for Research; Consortia-Pedia / Faster Cures; Model Organism Databases; Community Outreach; Shared curation, shared expertise; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF Earthcube]

Making use of community

[Diagram labels: Facet; Filter; Source; Category; Index; Community; Community resources; SciCrunch data (all); Gene; Gemma; Gene; Organism; Expression level; Geo; Integrated Expression; Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: number of data sources per brain region (forebrain, midbrain, hindbrain); bins: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

– Brain Architecture Management System (rodent)

– Temporal lobe.com (rodent)

– Connectome Wiki (human)

– Brain Maps (various)

– CoCoMac (primate cortex)

– UCLA Multimodal database (Human fMRI)

– Avian Brain Connectivity Database (Bird)

• Total: 1800 unique brain terms (excluding Avian)

• Number of exact terms used in > 1 database: 42

• Number of synonym matches: 99

• Number of 1st order partonomy matches: 385
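The overlap counts above come from comparing the vocabularies of several databases. A minimal sketch of that kind of comparison follows. It is not the NIF pipeline: the database names, terms, and synonym table are invented for illustration, `normalize` and `shared_terms` are hypothetical helpers, and "exact" here means equal after case folding and whitespace cleanup. Partonomy matches would need a part-of hierarchy and are not covered.

```python
from collections import defaultdict


def normalize(term):
    """Lower-case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def shared_terms(terms_by_db, synonyms=None):
    """Return the terms used in more than one database.

    terms_by_db: dict mapping a database name to an iterable of term strings.
    synonyms:    optional dict mapping a normalized synonym to a preferred label.
    """
    synonyms = synonyms or {}
    seen_in = defaultdict(set)
    for db, terms in terms_by_db.items():
        for term in terms:
            key = normalize(term)
            key = synonyms.get(key, key)  # fold synonyms onto one label
            seen_in[key].add(db)
    return {term: dbs for term, dbs in seen_in.items() if len(dbs) > 1}


# Toy vocabularies, invented for illustration.
toy = {
    "db_one": ["Hippocampus", "CA1", "Caudate nucleus"],
    "db_two": ["hippocampus", "Field CA1", "Striatum"],
}
toy_synonyms = {"field ca1": "ca1"}

exact = shared_terms(toy)
with_synonyms = shared_terms(toy, toy_synonyms)
print(sorted(exact))          # exact matches only
print(sorted(with_synonyms))  # exact plus synonym matches
```

Even in this toy case the synonym table doubles the number of shared terms, which is the same pattern the slide reports at scale: very few exact matches, more once synonyms and part-of relations are taken into account.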

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Transition Zones: Neurons and their properties

[Figure panels: Location of cell soma; Location of dendrites; Location of local axon arbor]

Analysis of Red Links in the Neuron Registry

• INCF Project

– Neuron Registry

• Neurolex.org

• Semantic MediaWiki

– > 30 experts worldwide

– Fill out neuron pages in Neurolex Wiki

[Figure: bar chart of red links by property (Soma location, Dendrite location, Axon location); series: total red links, easy fixes, hard fixes; axis: Number, 0 to 300]

Social networks and community sites let us learn things from the collective behavior of contributors, and they show the limits of our knowledge and of our knowledge representations

[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, Data sets; Literature; Search and Discovery]

We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE

– Dr. Lucila Ohno-Machado, PI

– FORCE11: Community engagement piece

• What should a data discovery index do?

– Task Forces

– Pilot projects

• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI

Amarnath Gupta, UCSD, Co-Investigator

Anita Bandrowski, NIF Project Leader

Gordon Shepherd, Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen, Washington University

Erin Reid

Paul Sternberg, Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli, George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark, Harvard University

Paolo Ciccarese

Karen Skinner, NIH, Program Officer (retired)

Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD Data Discovery Index

• Accounting of what is available

– Comprehensive resource registry

– UPCs for research resources

• Information framework

– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)

• Community-based portals into shared data resources

– Share expertise

– Metrics of trust

– Shared curation and upkeep

• Two-way validation of knowledge to data

Registry vs. Federation: Metadata about resource vs. metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned: Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape

• The data space is not uniform

• Data “flows” from one resource to the next

– Data is reinterpreted, reanalyzed or added to

• Currently very difficult to track data as it moves across the landscape

– Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki

• Provide a simple interface for defining the concepts required

• Lightweight semantics: sets of triples

• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

• Community based

• Anyone can contribute their terms, concepts, things

• Anyone can edit

• Anyone can link

• Accessible: searched by Google

• Growing into a significant knowledge base for neuroscience
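"Sets of triples" can be made concrete with a small example. The sketch below stores statements as (subject, predicate, object) tuples and lists "red links", taken here to mean objects that are referenced but have no page of their own. The neuron and region names are placeholders, the property names follow the soma, dendrite, and axon location properties used in these slides, and `Triple`, `objects_of`, and `red_links` are hypothetical names, not part of Semantic MediaWiki or NeuroLex.

```python
from collections import namedtuple

# One statement: subject --predicate--> obj
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


def objects_of(triples, subject, predicate):
    """All objects asserted for a given subject and property."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


def red_links(triples, existing_pages):
    """Objects that are referenced in a triple but have no page of their own."""
    return sorted({t.obj for t in triples if t.obj not in existing_pages})


# Placeholder content for illustration only.
triples = [
    Triple("Example neuron", "soma location", "Example region layer 2"),
    Triple("Example neuron", "dendrite location", "Example region layer 1"),
    Triple("Example neuron", "axon location", "Another region"),
]
pages = {"Example neuron", "Example region layer 2", "Another region"}

print(objects_of(triples, "Example neuron", "soma location"))
print(red_links(triples, pages))
```

A red link is informative in its own right: it marks a place where a contributor needed a concept that the shared vocabulary did not yet contain.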

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd

• > 30 worldwide experts

• Simple set of properties

• Consistent naming scheme

• Integrated with Structural Lexicon

• Used for annotation in other resources, e.g., NeuroElectro

Data flows throughout the ecosystem: value is added

[Diagram: data set GSE13732 passing between resources, with steps labeled Curated, Analyzed, and Mirrored]

Stable identifiers and annotations allow us to “track” data as it moves

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
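The three spellings above all point at the same data set. One way to reconcile them is to map each known variant onto a single canonical form. The sketch below is an assumption-laden illustration, not NIF's actual text-mining code: the patterns cover only the three spellings shown here (plus an optional colon after `GEO`), and `normalize_geo_series` is a hypothetical helper name.

```python
import re

# Hypothetical patterns for the spellings shown on the slide.
# Each captures the numeric part of the series accession.
_VARIANTS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),        # GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),    # E-GEOD-13732
    re.compile(r"^GEO:?GSE(\d+)$", re.IGNORECASE),   # GEOGSE13732 or GEO:GSE13732
]


def normalize_geo_series(raw):
    """Return the canonical 'GSE<number>' form, or None if the input is not recognized."""
    text = raw.strip()
    for pattern in _VARIANTS:
        match = pattern.match(text)
        if match:
            return "GSE" + match.group(1)
    return None


for variant in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
    print(variant, "->", normalize_geo_series(variant))
```

Returning `None` for unrecognized input, rather than guessing, keeps unknown variants visible so that they can be reviewed and added to the pattern list.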

Same data, different analysis

• Gemma: Gene ID + Gene Symbol

• DRG: Gene name + Probe ID

• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis

– 1370 statements from Gemma regarding gene expression as a function of chronic morphine

– 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis

– Results for 1 gene were opposite in DRG and Gemma

– 45 did not have enough information provided in the paper to make a judgment
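Because the two resources report change against opposite baselines, their statements can only be compared after one side is flipped. The sketch below illustrates that bookkeeping under stated assumptions: genes have already been mapped to a shared key (the two resources use different identifiers, so this step is not trivial), direction is coded as +1 or -1, and `None` marks a statement without enough information. The gene names and values are placeholders, and `compare_directions` is a hypothetical helper, not the analysis actually run.

```python
def compare_directions(gemma, drg):
    """Tally agreement between two sets of direction-of-change statements.

    gemma: dict gene -> +1 / -1, reported relative to a chronic-morphine baseline.
    drg:   dict gene -> +1 / -1 / None, reported relative to a saline baseline.
    The Gemma direction is negated before comparison to put both on one baseline.
    """
    counts = {"consistent": 0, "opposite": 0, "insufficient": 0, "not_in_drg": 0}
    for gene, gemma_dir in gemma.items():
        if gene not in drg:
            counts["not_in_drg"] += 1
            continue
        drg_dir = drg[gene]
        if drg_dir is None:
            counts["insufficient"] += 1
        elif drg_dir == -gemma_dir:
            counts["consistent"] += 1
        else:
            counts["opposite"] += 1
    return counts


# Placeholder statements for illustration only.
gemma_toy = {"GENE_A": +1, "GENE_B": -1, "GENE_C": +1, "GENE_D": -1}
drg_toy = {"GENE_A": -1, "GENE_B": -1, "GENE_C": None}

print(compare_directions(gemma_toy, drg_toy))
```

Without the sign flip, every agreeing pair would be scored as a contradiction, which is why the baseline has to be recorded alongside the result.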

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are

– Machine processible (i.e., unique identifier that resolves to a single resource)

– Outside of the paywall

– Uniform across journals and publishers

• Pilot project: SciCrunch portal serving identifiers for

– Software/databases

– Antibodies

– Genetically modified organisms

Launched February 2014; > 30 journals participating
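"Machine processible" means a program can pull the identifiers out of running text. A minimal sketch follows, assuming identifiers are written with an `RRID:` prefix as the initiative asks of authors. The accepted character set is an assumption, the identifiers in the sample sentence are made-up placeholders rather than real registry entries, and `extract_rrids` is a hypothetical helper name.

```python
import re

# Assumed shape: "RRID:" followed by letters, digits, underscores or hyphens,
# optionally with further colon-separated parts (e.g. a repository prefix).
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z0-9_\-]+(?::[A-Za-z0-9_\-]+)*)")


def extract_rrids(text):
    """Return the identifiers that follow an 'RRID:' prefix, in order of appearance."""
    return RRID_PATTERN.findall(text)


# Sample methods text with placeholder identifiers (not real resources).
methods = (
    "Sections were stained with an example antibody (RRID:AB_0000001) and "
    "analyzed with example software (RRID: SCR_0000002). Mice were an "
    "example strain, RRID:IMSR_JAX:0000003, housed under standard conditions."
)

print(extract_rrids(methods))
```

A check of this kind run on the published version of an article would also catch identifiers that were present at submission but lost during copyediting.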

What studies have used

• > 200 articles have appeared to date

• > 30 journals

• Data set being made available to community

• > 650 RRIDs

• ~10 disappeared after copyediting

• 5 were in error

Database available at https://www.force11.org/node/5635


Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base; NIF Cell Graph

This is your brain on computers

Where can I find validated antibodies against CART

[Figure: bar chart on a log-scale axis (1 to 10,000,000,000) with categories Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical trials, Brain activation foci, Atlas, Annotation]

All databases in the SciCrunch Federation become immediately available through More Resources

Breaking down silos: Community enrichment

It’s like a Mendeley for resources

[Diagram: SciCrunch Shared Resources connected to community portals: Undiagnosed Disease Program; Phenotype RCN; One Mind for Research; Consortia-Pedia (Faster Cures); Model Organism Databases; Community Outreach; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF Earthcube]

Shared curation, shared expertise

Making use of community

[Diagram labels: Facet/Filter; Source; Category; Index; Community; Community resources; SciCrunch data (all); Gene; Gemma (Gene, Organism, Expression level); Geo; Integrated Expression; Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Bringing knowledge to data: Gap analysis

[Figure: brain regions (Forebrain, Midbrain, Hindbrain) vs. Data Sources; legend bins: 0, 1-10, 11-100, >101]



Revealing biases in the dataspace

SW Oh et al Nature 000 1-8 (2014) doi101038nature13186

Adult mouse brain connectivity matrix revenge of the midbrain

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE

– Dr. Lucila Ohno-Machado, PI

– FORCE11: Community engagement piece

• What should a data discovery index do?

– Task Forces

– Pilot projects

• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI

Amarnath Gupta, UCSD, Co-Investigator

Anita Bandrowski, NIF Project Leader

Gordon Shepherd, Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen, Washington University

Erin Reid

Paul Sternberg, Caltech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli, George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark, Harvard University

Paolo Ciccarese

Karen Skinner, NIH, Program Officer (retired)

Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNET, 3DVC, FORCE11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available

– Comprehensive resource registry

– UPCs for research resources

• Information framework

– Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)

• Community-based portals into shared data resources

– Share expertise

– Metrics of trust

– Shared curation and upkeep

• Two-way validation of knowledge to data

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape

• The data space is not uniform

• Data “flows” from one resource to the next

– Data is reinterpreted, reanalyzed, or added to

• Currently very difficult to track data as it moves across the landscape

– Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki

• Provide a simple interface for defining the concepts required

• Lightweight semantics: sets of triples

• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

• Community based

• Anyone can contribute their terms, concepts, things

• Anyone can edit

• Anyone can link

• Accessible: searched by Google

• Growing into a significant knowledge base for neuroscience
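To make the "sets of triples" idea concrete, here is a minimal sketch of how wiki-style triples could be scanned for red links, that is, property values that point to a page nobody has created yet. The data model, function name, and page names are hypothetical; this is not the Neurolex or Semantic MediaWiki implementation.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject page, property, object page)


def find_red_links(
    triples: Iterable[Triple],
    existing_pages: Set[str],
    link_properties: Set[str],
) -> Dict[str, Set[str]]:
    """Group objects that have no page of their own by the property that uses them."""
    red_links: Dict[str, Set[str]] = {}
    for _subject, prop, obj in triples:
        if prop in link_properties and obj not in existing_pages:
            red_links.setdefault(prop, set()).add(obj)
    return red_links


# Toy data, for illustration only.
TOY_TRIPLES = [
    ("Neuron A", "soma location", "Region X"),
    ("Neuron A", "dendrite location", "Region Y layer 2"),
    ("Neuron B", "axon location", "Region X"),
]
TOY_PAGES = {"Neuron A", "Neuron B", "Region X"}
TOY_PROPERTIES = {"soma location", "dendrite location", "axon location"}
```

Grouping by property is what allows a per-property comparison such as soma vs. dendrite vs. axon location.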

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd

• > 30 worldwide experts

• Simple set of properties

• Consistent naming scheme

• Integrated with Structural Lexicon

• Used for annotation in other resources, e.g., NeuroElectro

[Figure: data set GSE13732 as it is analyzed, curated, and mirrored across resources]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem: value is added

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
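As an illustration of what a "standard identifier format" step might look like, the sketch below maps the three spellings shown on this slide onto a single canonical form. The choice of `GSE<number>` as the canonical form, the function name, and the patterns are assumptions, not NIF's actual normalization rules.

```python
import re
from typing import Optional

# Each pattern captures the numeric part of one spelling seen on the slide.
_SERIES_PATTERNS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),       # GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),   # E-GEOD-13732
    re.compile(r"^GEO:?GSE(\d+)$", re.IGNORECASE),  # GEOGSE13732, with or without a colon
]


def normalize_series_id(identifier: str) -> Optional[str]:
    """Return the canonical 'GSE<number>' form, or None if the spelling is unknown."""
    cleaned = identifier.strip()
    for pattern in _SERIES_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return "GSE" + match.group(1)
    return None
```

Returning `None` for unrecognized spellings, rather than guessing, leaves those cases for a later text-mining or curation step.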

Same data, different analysis

• Gemma: Gene ID + Gene Symbol

• DRG: Gene name + Probe ID

• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis:

• 1370 statements from Gemma regarding gene expression as a function of chronic morphine

• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis

• Results for 1 gene were opposite in DRG and Gemma

• 45 did not have enough information provided in the paper to make a judgment
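Because one resource reports change relative to chronic morphine and the other relative to saline, directions have to be re-expressed against a common baseline before claims can be compared. The following is a minimal sketch under that assumption; the `Claim` type, its field names, and the gene label are hypothetical and do not reflect either database's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One expression claim: direction of `condition` relative to `baseline`."""
    gene: str
    direction: int  # +1 = increased, -1 = decreased
    condition: str
    baseline: str


def re_express(claim: Claim, baseline: str) -> Claim:
    """Return the same claim stated relative to `baseline`."""
    if claim.baseline == baseline:
        return claim
    if claim.condition == baseline:
        # Swapping condition and baseline flips the direction of change.
        return Claim(claim.gene, -claim.direction, claim.baseline, claim.condition)
    raise ValueError(f"claim does not involve baseline {baseline!r}")


def consistent(a: Claim, b: Claim, baseline: str = "saline") -> bool:
    """True if two claims about the same gene agree once aligned to one baseline."""
    if a.gene != b.gene:
        raise ValueError("claims refer to different genes")
    return re_express(a, baseline).direction == re_express(b, baseline).direction


# Hypothetical example: raw directions differ only because the baselines differ.
claim_vs_morphine = Claim("GeneA", -1, condition="saline", baseline="chronic morphine")
claim_vs_saline = Claim("GeneA", +1, condition="chronic morphine", baseline="saline")
```

Only after this alignment does an "opposite" result, like the one gene noted above, indicate a real disagreement rather than a difference in reporting convention.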

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are

– Machine processible (i.e., unique identifier that resolves to a single resource)

– Outside of the paywall

– Uniform across journals and publishers

• Pilot project: SciCrunch portal serving identifiers for

– Software/databases

– Antibodies

– Genetically modified organisms

Launched February 2014; > 30 journals participating
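"Machine processible" implies that identifiers can be pulled out of article text automatically. The sketch below shows one way that could work, assuming identifiers are written as `RRID:` followed by a registry prefix, an underscore, and an accession. The regular expression and the example identifiers are illustrative assumptions, not the initiative's official syntax.

```python
import re
from typing import List

# Assumed shape: "RRID:" + letters + "_" + accession characters.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9_:\-]+)")


def extract_rrids(text: str) -> List[str]:
    """Return the distinct RRID-style identifiers in `text`, in order of first appearance."""
    found: List[str] = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


# Made-up identifiers, for illustration only.
SAMPLE_METHODS_TEXT = (
    "Sections were stained with an antibody (RRID:AB_0000001) and analyzed "
    "with image software (RRID: SCR_0000002); the antibody (RRID:AB_0000001) "
    "was reused for the second cohort."
)
```

Running the same extraction before and after copyediting is one way identifiers lost in production could be detected.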

What studies have used

• > 200 articles have appeared to date

• > 30 journals

• Data set being made available to community

• > 650 RRIDs

• ~10 disappeared after copyediting

• 5 were in error

Database available at https://www.force11.org/node/5635


Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

SciCrunch

Shared Resources

Undiagnosed Disease Program

Phenotype RCN

One Mind for Research

Consortia-Pedia

Faster Cures

Model Organism Databases

Community Outreach

Shared curation, shared expertise

Resource Identification Portal

Aging

Neuroscience

dkNET

Phenotypes

NSF Earthcube

Making use of community

[Figure: SciCrunch search interface with facets and filters (source, category, index) over community resources and SciCrunch data (all); sources shown: Gene, Gemma (gene, organism, expression level), GEO, Integrated Expression, Literature]

Brings expertise of community to understanding how to work with data

KNOWLEDGE TO DATA GAP ANALYSIS

Looking across the ecosystem: Where are the data?

Data Sources

Bringing knowledge to data: Gap analysis

[Figure: heat map of data sources by brain region (forebrain, midbrain, hindbrain); legend bins: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.

• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.

• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail.

• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.

Therefore, the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; legend bins: 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
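The antibody coverage chart groups genes by how many search results each one returns, using the bins 0, 1-10, 11-100, 101-1000, and 1001+. A minimal sketch of that binning follows; the function names and the input mapping are hypothetical, and no real registry counts are included.

```python
from collections import Counter
from typing import Dict

BIN_LABELS = ["0", "1-10", "11-100", "101-1000", "1001+"]


def legend_bin(n_results: int) -> str:
    """Map a search-result count to the legend bin it falls into."""
    if n_results < 0:
        raise ValueError("result count cannot be negative")
    if n_results == 0:
        return "0"
    if n_results <= 10:
        return "1-10"
    if n_results <= 100:
        return "11-100"
    if n_results <= 1000:
        return "101-1000"
    return "1001+"


def coverage_histogram(results_per_gene: Dict[str, int]) -> Dict[str, int]:
    """Count how many genes fall into each bin, keeping empty bins visible."""
    counts = Counter(legend_bin(n) for n in results_per_gene.values())
    return {label: counts.get(label, 0) for label in BIN_LABELS}
```

Keeping the "0" bin explicit matters here, since genes with no antibodies at all are exactly the gaps a comprehensive index is meant to expose.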

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology



The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
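As a rough illustration of what a standard identifier format involves, the sketch below maps the three spellings shown on this slide to one canonical form. The regular expression and the choice of "GSE<number>" as the canonical form are assumptions made for this example, not a description of NIF's actual pipeline.

```python
import re

# Assumed pattern: an optional source prefix ("GEO", "GEO:" or "E-"), then
# "GSE" or "GEOD", an optional hyphen, and the numeric part of the accession.
_GEO_SERIES = re.compile(r"^(?:GEO:?|E-)?(?:GSE|GEOD)-?(\d+)$", re.IGNORECASE)


def normalize_geo_series(raw):
    """Map a known spelling variant to 'GSE<number>', or return None if unrecognized."""
    match = _GEO_SERIES.match(raw.strip())
    return f"GSE{match.group(1)}" if match else None


# The three variants from the slide collapse to a single key.
variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {normalize_geo_series(v) for v in variants}
```

Returning None for anything unrecognized, rather than guessing, leaves the hard cases for curation or text mining.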

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases
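The baseline issue above can be made concrete: when two resources report the same comparison against opposite reference conditions, one result has to be re-expressed before the directions are compared. The sketch below assumes results are stored as log2 fold changes, where swapping condition and baseline simply negates the value. The record layout, gene name, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    gene: str
    log2_fold_change: float
    condition: str  # condition being measured
    baseline: str   # condition it is measured against


def rebaseline(result, baseline):
    """Express `result` against `baseline`; swapping the two conditions negates the log ratio."""
    if result.baseline == baseline:
        return result
    if result.condition == baseline:
        return Result(result.gene, -result.log2_fold_change, result.baseline, baseline)
    raise ValueError("result does not involve the requested baseline")


def same_direction(a, b, baseline):
    """True when both results change in the same direction relative to `baseline`."""
    x = rebaseline(a, baseline).log2_fold_change
    y = rebaseline(b, baseline).log2_fold_change
    return (x > 0) == (y > 0)


# Hypothetical values: the raw signs differ, yet the findings agree once re-expressed.
gemma_like = Result("GeneA", 1.0, "saline", "chronic morphine")
drg_like = Result("GeneA", -0.5, "chronic morphine", "saline")
agree = same_direction(gemma_like, drg_like, baseline="saline")
```

Comparing raw signs without this step would flag every consistent result as a contradiction.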

Chronic vs. acute morphine in striatum

• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
– Software/databases
– Antibodies
– Genetically modified organisms

Launched February 2014; > 30 journals participating

What studies have used

• >200 articles have appeared to date
• >30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635
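To show what "machine processible" can mean in practice, here is a minimal sketch that pulls RRID-style identifiers out of a methods paragraph. The pattern is an assumption about the general shape of such identifiers (the prefix "RRID:" followed by an authority-specific ID). The sample sentence and the IDs in it are made up for illustration.

```python
import re

# Assumed shape: the literal prefix "RRID:" followed by an authority-specific
# identifier made of letters, digits, underscores, hyphens, or colons.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+[_:-][A-Za-z0-9_:-]*[A-Za-z0-9])")


def extract_rrids(text):
    """Return the distinct RRIDs found in `text`, in order of first appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Made-up methods text; none of these identifiers refers to a real resource.
methods = (
    "Sections were stained with anti-X (Vendor Cat# 123, RRID:AB_0000001) and "
    "analyzed in ToolY (RRID:SCR_0000002). Mice (RRID:IMSR_JAX:000001) were "
    "used; staining was repeated with RRID:AB_0000001."
)
found = extract_rrids(methods)
```

A check like this could be run before and after copyediting to catch identifiers that disappear or get altered along the way.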


Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

Looking across the ecosystem: Where are the data?

Data Sources

Bringing knowledge to data: Gap analysis

[Figure: number of data sources for forebrain, midbrain, and hindbrain; legend bins: 0, 1-10, 11-100, >101]

Revealing biases in the dataspace

SW Oh et al., Nature 000, 1-8 (2014), doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons, tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcortical structures.
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas, used with neuroimaging analysis programs such as FreeSurfer, truncate the caudate at the body and completely exclude the tail.
• A final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such, especially in tasks involving learning and memory.
Therefore the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature.”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

Importance of comprehensive indices: For how many proteins are there antibodies?

[Figure: human protein coding genes (Entrez Gene) vs. number of search results from antibodyregistry.org; legend bins: 0, 1-10, 11-100, 101-1000, 1001+]

Antibodyregistry.org: Trish Whetzel and Anita Bandrowski
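The legend bins for this figure suggest a simple coverage tally: count the search hits per gene, then count how many genes fall into each bin. The sketch below uses the bin edges from the legend. The gene names and hit counts are hypothetical, and no real registry query is performed.

```python
from collections import Counter


def coverage_bin(count):
    """Assign a hit count to one of the legend bins: 0, 1-10, 11-100, 101-1000, 1001+."""
    if count < 0:
        raise ValueError("hit count cannot be negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1001+"


def coverage_histogram(hits_per_gene):
    """Count how many genes fall into each legend bin."""
    return Counter(coverage_bin(n) for n in hits_per_gene.values())


# Hypothetical hit counts keyed by placeholder gene names.
hits = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 100, "GENE_D": 101, "GENE_E": 5000}
histogram = coverage_histogram(hits)
```

The "0" bin is the informative one here: only a comprehensive index can say that nothing exists for a gene, as opposed to nothing having been found.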

“The Data Homunculus”

Data-Knowledge Mismatch

Dutowski et al., 2013, Nature Biotechnology

The scourge of neuroanatomical nomenclature

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
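The counts above come from comparing term lists across databases. A minimal sketch of the first two kinds of match, exact and synonym, might look like the following. The database names, terms, and synonym table are hypothetical, and normalization is limited to case and whitespace. This is not the procedure NIF used.

```python
from collections import Counter


def normalize(term):
    """Lowercase and collapse whitespace so trivial spelling differences do not count."""
    return " ".join(term.lower().split())


def shared_exact_terms(terms_by_database):
    """Return normalized terms that appear in more than one database."""
    counts = Counter()
    for terms in terms_by_database.values():
        counts.update({normalize(t) for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


def shared_after_synonyms(terms_by_database, synonyms):
    """Return terms shared only after mapping each synonym to its preferred label."""
    canonical = {normalize(k): normalize(v) for k, v in synonyms.items()}
    mapped = {
        db: [canonical.get(normalize(t), normalize(t)) for t in terms]
        for db, terms in terms_by_database.items()
    }
    exact = set(shared_exact_terms(terms_by_database))
    return sorted(set(shared_exact_terms(mapped)) - exact)


# Hypothetical term lists and synonym table.
databases = {
    "db_a": ["Hippocampus", "Caudate nucleus"],
    "db_b": ["hippocampus", "Caudate"],
    "db_c": ["Striatum"],
}
synonym_table = {"Caudate": "Caudate nucleus"}
exact_matches = shared_exact_terms(databases)
synonym_only = shared_after_synonyms(databases, synonym_table)
```

Partonomy matches would need a part-of hierarchy on top of this, which is where an ontology earns its keep.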

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

[Figure labels: location of cell soma; location of dendrites; location of local axon arbor]

Transition Zones: Neurons and their properties

Analysis of Red Links in the Neuron Registry

• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki

[Figure: bar chart of red links for soma location, dendrite location, and axon location; axis: Number (0 to 300); series: total redlinks, easy fixes, hard fixes]

Social networks and community sites let us learn things from the collective behavior of contributors; they show limits in our knowledge and our knowledge representations

[Figure labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, data sets; Literature; Search and Discovery]

We cannot try to shoehorn everything into a single representation or system, but must figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
– Dr. Lucila Ohno-Machado, PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH Program Officer (retired)
Jonathan Pollock, NIH Program Officer

And my colleagues in Monarch, dkNet, 3DVC, FORCE11

BD2K-K2BD Data Discovery Index

bull Accounting of what is availablendash Comprehensive resource registry

ndash UPCrsquos for research resources

bull Information frameworkndash Major concepts contained in data but also accounting of what happens to

data as it flows through the ecosystem (provenance)

bull Community-based portals into shared data resourcesndash Share expertise

ndash Metrics of trust

ndash Shared curation and upkeep

bull Two way validation of knowledge to data

Registry vs Federation Metadata aboutresource vs metadatadata in database

With the thousands of databases and other information sources available simple descriptive metadata will not suffice

What have we learned Grabbing the long tail of small data

bull NIF is in a unique position to ask questions against the data resource landscape

bull The data space is not uniform

bull Data ldquoflowsrdquo from one resource to the next

ndash Data is reinterpreted reanalyzed or added to

bull Currently very difficult to track data as it moves across the landscape

ndash Makes it difficult to learn from combined efforts

Working with and extending ontologies Neurolexorg

httpneurolexorg Larson et al Frontiers in Neuroinformatics in press

bullSemantic MediWiki

bullProvide a simple interface for defining the concepts required

bullLight weight semantics-sets of triples

bullGood teaching tool for learning about semantic integration and the benefits of a consistent semantic framework

bullCommunity based

bullAnyone can contribute their terms concepts things

bullAnyone can edit

bullAnyone can link

bullAccessible searched by Google

bullGrowing into a significant knowledge base for neuroscience

Demo D03

Neuron Lexicon Gauging the state of knowledge in neuroscience

bull Led by Dr Gordon Shepherd

bull gt 30 world wide experts

bull Simple set of properties

bull Consistent naming scheme

bull Integrated with Structural Lexicon

bull Used for annotation in other resources eg NeuroElectro

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative Linking resources to literature

bull Have authors supply appropriate identifiers for key resources used within a study such that they arendash Machine processible (ie unique

identifier that resolves to a single resource)

ndash Outside of the paywallndash Uniform across journals and

publishers

bull Pilot project SciCrunch portal serving identifiers forndash Softwaredatabasesndash Antibodiesndash Genetically modified organisms

Launched February 2014 gt 30 journals participating

What studies have used

bullgt200 articles have appeared to date

bullgt30 journals

bullData set being made available to community

bullgt 650 RRIDrsquos

bull~10 disappeared after copyediting

bull5 were in error

Database available at httpswwwforce11orgnode5635

C

Neurolex gt 1 million triples

Dr Yi Zeng Chinese neural knowledge baseNIF Cell Graph

This is your brain on computers

The tale of the tailldquoHuman neuroimaging typically is performed on a whole brain basis However for several reasons tail of the caudate activity can easily be missed

bullOne reason is limitations in the normalization algorithms that typically are optimized to maximize accuracy for cortical rather than subcorticalstructures

bullA second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body and completely exclude the tail

bullA final reason is that the tail of the caudate is close to the hippocampus and could be misidentified as such especially in tasks involving learning and memory

Therefore the tail of the caudate may be recruited in additional cognitive tasks but yet not have been properly identified and reported in the neuroimaging literaturerdquo

Seger CA The visual corticostriatal loop through the tail of the caudate circuitry and function Front Syst Neurosci 2013 Dec 67104 doi 103389fnsys201300104 eCollection 2013

Importance of comprehensive indices For how many proteins are there antibodies

0

1-10

11-100

101-1000

1001+

Human protein coding genes (Entrez Gene) vs of search results from the antibodyregistryorg

AntibodyregistryorgTrish Whetzel and Anita Bandrowski

ldquoThe Data Homunculusrdquo

Data-Knowledge Mismatch

Dutowski et al 2013 Nature Biotechnology

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

[Figure: bar chart of red links in the Neuron Registry for soma location, dendrite location, and axon location; y-axis: number (0-300); series: total red links, easy fixes, hard fixes]
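A red link in a wiki is a link to a page that does not yet exist. The tally behind a chart like this one can be sketched as follows; the page names and property layout are hypothetical, not an export from NeuroLex, and the easy/hard distinction is not modeled because the slides do not define it:

```python
def count_red_links(neuron_pages, existing_pages):
    """Count, per property, how many linked values have no wiki page of their own."""
    counts = {}
    for properties in neuron_pages.values():
        for prop, targets in properties.items():
            missing = [t for t in targets if t not in existing_pages]
            counts[prop] = counts.get(prop, 0) + len(missing)
    return counts


# Hypothetical neuron page with three location properties.
neuron_pages = {
    "Example pyramidal cell": {
        "soma location": ["CA1 stratum pyramidale"],
        "dendrite location": ["CA1 stratum radiatum", "CA1 stratum oriens"],
        "axon location": ["Subiculum"],
    },
}
# Assumed set of pages that already exist in the wiki.
existing_pages = {"CA1 stratum pyramidale", "Subiculum"}

red_links = count_red_links(neuron_pages, existing_pages)
print(red_links)
```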

Social networks and community sites let us learn from the collective behavior of contributors; they show the limits of our knowledge and of our knowledge representations

[Diagram labels: Domain Knowledge; Ontologies; Atlases/Maps; Annotation; Claims, assertions; Registries; Derived data; Models and simulations; Analyses; Data; Databases, data sets; Literature; Search and Discovery]

We cannot shoehorn everything into a single representation or system; instead we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.

SciCrunch: Creating a Data and Resource Discovery Environment

BD2K: Creating a Data Discovery Index

• BioCADDIE
  – Dr. Lucila Ohno-Machado, PI
  – FORCE11: Community engagement piece
• What should a data discovery index do?
  – Task Forces
  – Pilot projects
• How should it be built?

http://biocaddie.org

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

• Accounting of what is available
  – Comprehensive resource registry
  – UPCs for research resources
• Information framework
  – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  – Share expertise
  – Metrics of trust
  – Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)

• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
  – Anyone can contribute their terms, concepts, things
  – Anyone can edit
  – Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience

Demo D03
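To make "sets of triples" concrete: each statement is a (subject, predicate, object) tuple, and a query is a pattern with wildcards. The sketch below is only an illustration; the property names are assumptions and not NeuroLex's actual vocabulary.

```python
# Each fact is a (subject, predicate, object) triple. Property names are hypothetical.
triples = [
    ("Cerebellum Purkinje cell", "has soma located in", "Purkinje cell layer of cerebellar cortex"),
    ("Cerebellum Purkinje cell", "has neurotransmitter", "GABA"),
    ("Purkinje cell layer of cerebellar cortex", "is part of", "Cerebellar cortex"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Everything asserted about one neuron type.
for triple in match(triples, subject="Cerebellum Purkinje cell"):
    print(triple)
```

Because every page contributes statements in the same shape, terms added by different contributors can be linked without agreeing on a heavier schema first.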

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

[Diagram: data set GSE13732 passing between resources, where it is mirrored, curated, and analyzed]

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem: value is added

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
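One way to picture a "standard identifier format" is a small normalizer that maps the variants above onto a single canonical form. This is a sketch, not NIF's implementation: it assumes the three strings denote the same GEO series and that `E-GEOD-<n>` is the ArrayExpress-style rendering of `GSE<n>`; the function name and patterns are hypothetical.

```python
import re

# Accepted spellings of a GEO series accession (assumed variants).
_GSE = re.compile(r"^(?:GEO:?)?GSE(\d+)$", re.IGNORECASE)
_E_GEOD = re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE)


def canonical_geo_series(identifier: str):
    """Return the identifier as 'GSE<n>', or None if it is not a recognized variant."""
    text = identifier.strip()
    for pattern in (_GSE, _E_GEOD):
        found = pattern.match(text)
        if found:
            return f"GSE{int(found.group(1))}"
    return None


for raw in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
    print(raw, "->", canonical_geo_series(raw))
```

Anything the patterns do not recognize returns `None`, so unknown formats can be routed to curation or text mining rather than being silently merged.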

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to a chronic morphine baseline; DRG presented them with respect to saline, so the direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis
  – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  – Results for 1 gene were opposite in DRG and Gemma
  – 45 did not have enough information provided in the paper to make a judgment
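Because the two resources report change against different baselines, statements have to be put on a common footing before they are compared. A minimal sketch of that step follows; the gene names, record layout, and function names are all hypothetical and do not reflect how Gemma or DRG actually store results.

```python
FLIP = {"up": "down", "down": "up"}


def relative_to(direction, baseline, reference="saline"):
    """Express a direction of change relative to the reference condition."""
    return direction if baseline == reference else FLIP[direction]


def classify(gemma, drg):
    """Label each Gemma statement as consistent with, opposite to, or missing from DRG."""
    labels = {}
    for gene, (direction, baseline) in gemma.items():
        if gene not in drg:
            labels[gene] = "missing"
            continue
        drg_direction, drg_baseline = drg[gene]
        same = relative_to(direction, baseline) == relative_to(drg_direction, drg_baseline)
        labels[gene] = "consistent" if same else "opposite"
    return labels


# Hypothetical statements: (direction, baseline the change is measured against).
gemma = {
    "GeneA": ("down", "chronic morphine"),
    "GeneB": ("up", "chronic morphine"),
    "GeneC": ("up", "chronic morphine"),
}
drg = {
    "GeneA": ("up", "saline"),
    "GeneB": ("up", "saline"),
}

labels = classify(gemma, drg)
print(labels)
```

Without the baseline flip, GeneA would be miscounted as a disagreement even though both resources describe the same change.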

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
  – Machine processible (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms

Launched February 2014; > 30 journals participating

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

The scourge of neuroanatomical nomenclature

bullNIF Connectivity 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions

bullBrain Architecture Management System (rodent)bullTemporal lobecom (rodent)bullConnectome Wiki (human)bullBrain Maps (various)bullCoCoMac (primate cortex)bullUCLA Multimodal database (Human fMRI)bullAvian Brain Connectivity Database (Bird)

bullTotal 1800 unique brain terms (excluding Avian)

bullNumber of exact terms used in gt 1 database 42bullNumber of synonym matches 99bullNumber of 1st order partonomy matches 385

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ1 Uylings HB Brain Struct Funct 2014 Mar219(2)433-59 doi 101007s00429-013-0630-

How many neuron types are

there

NIH funding announcement BRAIN Initiative Transformative

Approaches for Cell-Type Classification in the Brain

ldquoThe mammalian brain contains a vast number of cells These cells are

generally grouped within broad classes (eg neurons or glia) but it is

currently unknown exactly how many classes existrdquo

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Transition Zones Neurons and their properties

Analysis of Red Links in the Neuron Registry

bull INCF Project

ndash Neuron Registrybull Neurolexorg

bull Semantic MediaWiki

ndash gt 30 experts worldwide

ndash Fill out neuron pages in NeurolexWiki

Soma location

Dendrite location

Axon location

0

50

100

150

200

250

300

NumberTotal

redlinkseasy fixes

hard fixes

Soma location

Dendrite location

Axon location

Social networks and community sites let us learn things from the collective behavior of contributors show limits in our knowledge and our knowledge representations

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11

BD2K-K2BD: Data Discovery Index

• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also an accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: metadata about a resource vs. metadata/data in a database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org; Larson et al., Frontiers in Neuroinformatics, in press

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
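The "sets of triples" idea can be made concrete with a very small in-memory store. This is a minimal sketch only, not how Neurolex or Semantic MediaWiki stores its data. The predicate names (`is_a`, `soma_located_in`, and so on) and the helper `match` are made up for illustration.

```python
# Each statement is a (subject, predicate, object) triple.
# Predicate names are hypothetical, chosen only for readability.
triples = {
    ("Cerebellum Purkinje cell", "is_a", "Neuron"),
    ("Cerebellum Purkinje cell", "soma_located_in", "Purkinje cell layer"),
    ("Cerebellum Purkinje cell", "neurotransmitter", "GABA"),
    ("Purkinje cell layer", "part_of", "Cerebellar cortex"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Anyone can contribute: adding knowledge is just adding another triple.
triples.add(("Cerebellum Purkinje cell", "dendrites_located_in", "Cerebellar molecular layer"))

purkinje_facts = match(triples, subject="Cerebellum Purkinje cell")
gaba_cells = match(triples, predicate="neurotransmitter", obj="GABA")
```

Because every statement has the same three-part shape, linking two pages amounts to reusing the same name as the object of one triple and the subject of another, as "Purkinje cell layer" does here.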

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

Stable identifiers and annotations allow us to “track” data as it moves

Data flows throughout the ecosystem; value is added

[Diagram: data set GSE13732 moving between resources, with the steps labeled Analyzed, Curated, Analyzed, Mirrored]
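One way to picture tracking by stable identifier is an event log keyed on the data set's accession. The sketch below is illustrative only. The resource names are placeholders, since the slides do not say which resource performed which step, and `trace` is a hypothetical helper rather than part of NIF.

```python
from collections import defaultdict

# Hypothetical event log: (data set identifier, resource, action).
# The actions mirror the diagram labels; the resource names are placeholders.
events = [
    ("GSE13732", "resource A", "mirrored"),
    ("GSE13732", "resource B", "curated"),
    ("GSE13732", "resource B", "analyzed"),
    ("GSE13732", "resource C", "analyzed"),
]


def trace(log, dataset_id):
    """Group the recorded actions for one data set by the resource that performed them."""
    history = defaultdict(list)
    for identifier, resource, action in log:
        if identifier == dataset_id:
            history[resource].append(action)
    return dict(history)


history = trace(events, "GSE13732")
```

The lookup only works if every resource records the same identifier string for the data set.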

But… even our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
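Reducing these three spellings to one standard form could look like the sketch below. It assumes, as the slide implies, that all three strings name the same GEO series and that the canonical form is `GSE` followed by the number. The function name and the pattern are illustrative, not NIF's actual normalization rules.

```python
import re

# Matches a series number written as "GSE<digits>" or "E-GEOD-<digits>",
# optionally preceded by other text such as "GEO" or "GEO:".
_GEO_SERIES = re.compile(r"(?:GSE|E-GEOD-)(\d+)$", re.IGNORECASE)


def normalize_geo_series(identifier):
    """Return the canonical 'GSE<number>' form, or None if the string is not recognized."""
    cleaned = identifier.strip().replace(" ", "")
    found = _GEO_SERIES.search(cleaned)
    return f"GSE{found.group(1)}" if found else None


variants = ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]
canonical = {normalize_geo_series(v) for v in variants}
```

A rule like this only handles the variants someone has already seen, which is presumably why the slide pairs the standard format with text mining to catch the rest.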

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to a chronic morphine baseline; DRG, with respect to saline, so the direction of change is opposite in the 2 databases
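Before statements from the two databases can be compared, they have to be expressed against the same baseline. A minimal sketch of that step follows, assuming each statement records a direction ("up" or "down") and the baseline it was measured against. The record layout and the gene name are hypothetical.

```python
FLIP = {"up": "down", "down": "up"}


def to_saline_baseline(direction, baseline):
    """Express a change relative to saline, inverting it if it was
    reported relative to chronic morphine."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline!r}")


# Hypothetical statements about the same (made-up) gene from the two sources.
gemma = {"gene": "GeneX", "direction": "down", "baseline": "chronic morphine"}
drg = {"gene": "GeneX", "direction": "up", "baseline": "saline"}

agree = (
    to_saline_baseline(gemma["direction"], gemma["baseline"])
    == to_saline_baseline(drg["direction"], drg["baseline"])
)
```

The sketch leaves out the other mismatch on the slide: the two sources key their records differently (Gene ID + Gene Symbol versus Gene name + Probe ID), so a real comparison would also need an identifier mapping between them.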

Chronic vs. acute morphine in striatum

• Analysis
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
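The "over half" claim can be checked with quick arithmetic, assuming the 617 consistent statements are counted out of the 1370 Gemma statements. The variable names are mine; the numbers are the ones on the slide.

```python
total_statements = 1370   # statements from Gemma
consistent = 617          # consistent with DRG
opposite = 1              # opposite direction in DRG and Gemma
insufficient_info = 45    # not enough information in the paper to judge

consistent_fraction = consistent / total_statements
not_confirmed_fraction = 1 - consistent_fraction

# Statements not covered by any of the three categories listed on the slide.
unaccounted = total_statements - consistent - opposite - insufficient_info
```

Under that assumption about 45% of the statements were consistent, so about 55% were not confirmed. The slide does not break down the remaining 707 statements.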

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
– Software/databases
– Antibodies
– Genetically modified organisms

Launched February 2014; > 30 journals participating
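"Machine processible" means a script can pull the identifiers out of a methods section without a human reading it. The sketch below assumes an identifier syntax of `RRID:` followed by an authority prefix, an underscore and a local ID (for example `RRID:AB_...` for antibodies or `RRID:SCR_...` for software). The slides do not spell out the syntax, so the pattern, the sample sentence and the identifiers in it are illustrative only.

```python
import re

# Assumed syntax: "RRID:" + authority prefix + "_" + local identifier.
# Identifiers that use other prefix styles would need a broader pattern.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9_:-]+)")


def extract_rrids(text):
    """Return the unique RRIDs mentioned in text, in order of first appearance."""
    seen = []
    for found in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + found.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Illustrative methods text; the identifiers are used only as examples of the format.
methods = (
    "Sections were stained with an antibody (RRID:AB_2298772) and "
    "images were analyzed with software (RRID:SCR_003070; RRID: SCR_003070)."
)
rrids = extract_rrids(methods)
```

A pattern this simple works only because the identifiers are uniform across journals.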

What studies have used

• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to community
• > 650 RRIDs
• ~10 disappeared after copyediting
• 5 were in error

Database available at https://www.force11.org/node/5635

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

6 parcellation schemes of mouse prefrontal cortex based on Nissl alone

Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-

How many neuron types are there?

NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain

“The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia), but it is currently unknown exactly how many classes exist.”

Domain Knowledge

Ontologies

AtlasesMaps

Annotation

Claims assertions

Registries

Derived data

Models and simulations

Analyses

Data

Databases Data sets

Literature

Search and Discovery

Cannot try to shoe-horn everything into a single representation or system but figure out how information (data + knowledge) can flow between them Knowledge is fluid and will continually update

SciCrunch Creating a Data and Resource Discovery Environment

BD2K Creating a Data Discovery Index

bull BioCADDIE

ndash Dr Lucila Ohno-Machado PI

ndash FORCE11 Community engagement piece

bull What should a data discovery index do

ndash Task Forces

ndash Pilot projects

bull How should it be built httpbiocaddieorg

BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER

NIF team (past and present)

Jeff Grethe UCSD Co Investigator Interim PI

Amarnath Gupta UCSD Co Investigator

Anita Bandrowski NIF Project Leader

Gordon Shepherd Yale University

Perry Miller

Luis Marenco

Rixin Wang

David Van Essen Washington University

Erin Reid

Paul Sternberg Cal Tech

Arun Rangarajan

Hans Michael Muller

Yuling Li

Giorgio Ascoli George Mason University

Sridevi Polavarum

Fahim Imam

Larry Lui

Andrea Arnaud Stagg

Jonathan Cachat

Jennifer Lawrence

Svetlana Sulima

Davis Banks

Vadim Astakhov

Xufei Qian

Chris Condit

Mark Ellisman

Stephen Larson

Willie Wong

Tim Clark Harvard University

Paolo Ciccarese

Karen Skinner NIH Program Officer (retired)

Jonathan Pollock NIH Program Officer

And my colleagues in Monarch dkNet 3DVC Force 11

BD2K-K2BD Data Discovery Index

• Accounting of what is available
  - Comprehensive resource registry
  - UPCs for research resources
• Information framework
  - Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
  - Share expertise
  - Metrics of trust
  - Shared curation and upkeep
• Two-way validation of knowledge to data

Registry vs. Federation: Metadata about resource vs. metadata/data in database

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.

What have we learned? Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  - Data is reinterpreted, reanalyzed, or added to
• Currently very difficult to track data as it moves across the landscape
  - Makes it difficult to learn from combined efforts

Working with and extending ontologies: Neurolex.org

http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
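
The "sets of triples" idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming a triple is a plain (subject, predicate, object) tuple; the term names and predicates below are hypothetical and are not taken from Neurolex or its interface.

```python
from collections import defaultdict

# Hypothetical triples for illustration only; not actual Neurolex content.
TRIPLES = [
    ("Cerebellum Purkinje cell", "is_a", "Neuron"),
    ("Cerebellum Purkinje cell", "located_in", "Cerebellum"),
    ("Cerebellum Purkinje cell", "has_neurotransmitter", "GABA"),
    ("Cerebellum", "part_of", "Brain"),
]


def index_by_subject(triples):
    """Group (predicate, object) pairs under each subject."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return dict(index)


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    print(objects_of(TRIPLES, "Cerebellum Purkinje cell", "located_in"))
```

Because each statement is an independent tuple, contributors can add, edit, or link terms one assertion at a time, which is what keeps the semantics lightweight.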

Demo D03

Neuron Lexicon: Gauging the state of knowledge in neuroscience

• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

Data flows throughout the ecosystem; value is added

[Diagram labels: GSE13732; Curated; Analyzed; Mirrored]

Stable identifiers and annotations allow us to "track" data as it moves

But... even our standards need standards

GSE13732
E-GEOD-13732
GEOGSE13732

Standard identifier format for all data federation sources; text mining to deal with inconsistencies
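
One way to picture a "standard identifier format" is a small normalizer that maps the spellings listed on this slide to a single canonical form. The sketch below is an illustration under stated assumptions, not NIF's actual pipeline: the function name, the choice of the bare "GSE<digits>" accession as the canonical form, and the extra "GEO:GSE..." spelling are all hypothetical.

```python
import re

# Patterns for the spellings shown on the slide. Choosing the bare GEO series
# accession ("GSE13732") as the canonical form is an assumption made here.
_PATTERNS = [
    re.compile(r"^GSE(\d+)$", re.IGNORECASE),            # GSE13732
    re.compile(r"^E-GEOD-(\d+)$", re.IGNORECASE),        # E-GEOD-13732
    re.compile(r"^GEO[:_\s]?GSE(\d+)$", re.IGNORECASE),  # GEOGSE13732, GEO:GSE13732 (assumed)
]


def normalize_geo_series(identifier):
    """Map a known spelling of a GEO series accession to 'GSE<digits>'.

    Returns None when the identifier matches none of the known spellings.
    """
    cleaned = identifier.strip()
    for pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return "GSE" + match.group(1)
    return None


if __name__ == "__main__":
    for raw in ["GSE13732", "E-GEOD-13732", "GEOGSE13732"]:
        print(raw, "->", normalize_geo_series(raw))
```

Returning None for unrecognized spellings, rather than guessing, leaves the unresolved cases for curation or text mining.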

Same data, different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so the direction of change is opposite in the 2 databases

Chronic vs. acute morphine in striatum

• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
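
The baseline issue noted for Gemma and DRG can be sketched as follows: before two sets of claims are compared, each direction of change is re-expressed against a common reference condition. This is a minimal sketch, not the analysis actually performed. It assumes a simple two-condition comparison (so switching the reference just flips the sign) and assumes gene identifiers have already been matched across sources; the gene names and values are hypothetical.

```python
def align_direction(change, reference, common_reference):
    """Flip the sign of a direction of change when it was reported
    relative to a different reference condition than the common one."""
    return change if reference == common_reference else -change


def compare_claims(claims_a, claims_b, ref_a, ref_b, common_reference):
    """Classify genes as consistent, opposite, or unmatched.

    Each claims dict maps a gene symbol to +1 (up) or -1 (down).
    Genes missing from either source are reported as 'unmatched'.
    """
    result = {"consistent": [], "opposite": [], "unmatched": []}
    for gene in sorted(set(claims_a) | set(claims_b)):
        if gene not in claims_a or gene not in claims_b:
            result["unmatched"].append(gene)
            continue
        a = align_direction(claims_a[gene], ref_a, common_reference)
        b = align_direction(claims_b[gene], ref_b, common_reference)
        result["consistent" if a == b else "opposite"].append(gene)
    return result


if __name__ == "__main__":
    # Hypothetical data: source A reports relative to chronic morphine,
    # source B relative to saline.
    source_a = {"GENE_A": +1, "GENE_B": -1, "GENE_C": +1}
    source_b = {"GENE_A": -1, "GENE_B": -1}
    print(compare_claims(source_a, source_b,
                         "chronic morphine", "saline", "saline"))
```

Without the alignment step, GENE_A in this toy example would be counted as a contradiction even though both sources agree.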

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use?

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  - Machine processable (i.e., unique identifier that resolves to a single resource)
  - Outside of the paywall
  - Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for:
  - Software/databases
  - Antibodies
  - Genetically modified organisms
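
To illustrate what "machine processable" can mean in practice, the sketch below pulls RRID-style identifiers out of free text with a regular expression. The pattern is an approximation assumed for illustration (the literal prefix "RRID:" followed by letters, digits, "_", "-" or ":"), not an official grammar, and the identifiers in the example are made up.

```python
import re

# Approximate pattern, assumed for illustration: the literal prefix "RRID:"
# followed by an identifier made of letters, digits, "_", "-" and ":".
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z][A-Za-z0-9_:\-]*)")


def extract_rrids(text):
    """Return the unique RRIDs found in `text`, in order of first appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        # Drop trailing separators that belong to the sentence, not the ID.
        rrid = "RRID:" + match.group(1).rstrip(":-")
        if rrid not in seen:
            seen.append(rrid)
    return seen


if __name__ == "__main__":
    # Hypothetical methods-section text with made-up identifiers.
    methods = ("Sections were stained with antibody X (RRID:AB_0000001) and "
               "analyzed with software Y (RRID: SCR_0000002).")
    print(extract_rrids(methods))
```

A check like this could be run on both the submitted and the copyedited text of an article to flag identifiers that were dropped along the way.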

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Stable identifiers and annotations allow us to ldquotrackrdquo data as it moves

Data flows throughout the ecosystemvalue is added

Analyzed

Curated

GSE13732

Analyzed

Mirrored

Buthellipeven our standards need standards

GSE13732

E-GEOD-13732

GEOGSE13732

Standard identifier format for all data federation sources text mining to deal with inconsistencies

Same data different analysis

bull Gemma Gene ID + Gene Symbolbull DRG Gene name + Probe ID

bull Gemma presented results relative to baseline chronic morphine DRG with respect to saline so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

bull Analysisbull1370 statements from Gemma regarding gene expression as a function of chronicmorphine

bull617 were consistent with DRG over half of the claims of the paper were not confirmed in this analysis

bullResults for 1 gene were opposite in DRG and Gemma

bull45 did not have enough information provided in the paper to make a judgment

NIF is working to make it easier to find where data has gone and what has been done with it

How many do we use

These resources themselves need to be citable

Resource Identification Initiative: Linking resources to literature

• Have authors supply appropriate identifiers for key resources used within a study such that they are
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
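As a rough sketch of what "machine processible" can mean in practice, the snippet below pulls identifiers out of article text and reports any that are present in a submitted version but missing from the published one. The `RRID:` prefix followed by an accession is assumed as the citation syntax (the slides do not spell it out), and the accession values in the example are placeholders, not real resources.

```python
import re
from typing import List, Set

# Assumed syntax: "RRID:" followed by an accession made of letters,
# digits, underscores, or hyphens.
RRID_PATTERN = re.compile(r"\bRRID:\s*([A-Za-z0-9_\-]+)")


def extract_rrids(text: str) -> List[str]:
    """Return the accessions cited in `text`, in order of appearance."""
    return RRID_PATTERN.findall(text)


def lost_identifiers(submitted: str, published: str) -> Set[str]:
    """Accessions present in the submitted text but absent from the published text."""
    return set(extract_rrids(submitted)) - set(extract_rrids(published))


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    submitted = ("Example antibody (RRID:AB_0000001); "
                 "analysis software (RRID:SCR_0000002).")
    published = "Example antibody (RRID:AB_0000001); analysis software."
    print(extract_rrids(submitted))
    print(lost_identifiers(submitted, published))
```

A comparison like this is one way identifiers that disappear during copyediting could be flagged automatically.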
