Science Bioinformatics Data Resources

  • 8/12/2019 Science Bioinformatics Data Resources

    1/40

    June 10/11, 2014

    Michelle Hudson, Science & Social Science Data Librarian
    Kristin Bogdan, Science & Social Science Data Librarian
    Kayleigh Bohmier, Science Research Support Librarian for Astronomy, Geology & Geophysics, and Physics
    Rolando Garcia-Milian, Biomedical Sciences Research Support

    Science Data Resources: From Astronomy to Bioinformatics

    A Brief Overview of Data in the Sciences

    Examples of data questions

    Where would you find these?

    • Spectroscopy on jars found in wrecks
    • Spectra of M31
    • ApoE structure
    • Ice cores
    • Genomic sequences for extinct mammals

    Types of Data

    • Observational: data captured in real time, irreplaceable (sensor readings, telescope images, geologic samples)
    • Experimental: data from lab equipment, expensive to reproduce (gene sequences)
    • Simulation: data generated from models (models are more important than the data)
    • Derived or compiled: data put together from other information (3D models, compiled databases)

    Formats of data

    Documents, spreadsheets, lab notebooks, questionnaires, survey responses, health indicators, audio recordings, video recordings, protein and gene sequences, images, films, spectra, slides, artifacts, specimens, samples, models, algorithms, scripts, software code, etc.

    General resources

    • data.gov: http://www.data.gov/
    • DataONE: http://www.dataone.org/
    • NCBI: http://www.ncbi.nlm.nih.gov/
    • EBI: http://www.ebi.ac.uk/
    • FigShare: http://figshare.com/
    • Dryad: http://datadryad.org/
    • PLOS ONE: http://www.plosone.org/
    • Data journals: http://mlibrarydata.wordpress.com/2014/05/09/data-journals/
    • Research guide: http://guides.library.yale.edu/sciencedata

    Astronomy, or, Massively Open Online Archives

    Astronomy data

    • Self-collected
    • Data.NASA.gov
    • US Virtual Observatory (US VO)
    • ApJ supplement
    • Research centers and collaborations
    • Figshare
    • Astronomy Dataverse
    • GitHub

    Screenshot: Virtual Observatory Data Explorer search for M31

    Government Data

    NASA data processing levels, ranked 0-4:

    • Level 0 is raw data
    • Level 1 has been processed and error-corrected
        This is, incidentally, where conspiracy theories come from
    • Level 2 data may contain derived parameters
    • Levels 3+ have further processing

    Links:

    • http://data.nasa.gov/
    • http://archive.eso.org/cms.html
    • http://vao.stsci.edu/portal/Mashup/Clients/Portal/DataDiscovery.html
    • http://archive.stsci.edu/
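
    As a small illustration of the processing levels on this slide, the Python sketch below keeps the slide's descriptions in a lookup table. The names NASA_PROCESSING_LEVELS and describe_level are hypothetical, and the wording paraphrases the slide rather than NASA's official level definitions.

```python
# Descriptions paraphrased from the slide; not NASA's official wording.
NASA_PROCESSING_LEVELS = {
    0: "raw data",
    1: "processed and error-corrected",
    2: "may contain derived parameters",
    3: "further processing",
    4: "further processing",
}


def describe_level(level: int) -> str:
    """Return the slide's description for a processing level from 0 to 4."""
    if level not in NASA_PROCESSING_LEVELS:
        raise ValueError(f"processing levels are ranked 0-4, got {level}")
    return NASA_PROCESSING_LEVELS[level]


if __name__ == "__main__":
    for level in sorted(NASA_PROCESSING_LEVELS):
        print(f"Level {level}: {describe_level(level)}")
```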

    Data from Researchers

    • Astrophysical Journal Supplement
        http://dx.doi.org/10.1088/0067-0049/212/2/26
        http://dx.doi.org/10.1088/0067-0049/212/2/19
        http://dx.doi.org/10.1088/0067-0049/212/1/6
    • Project web pages (e.g., Kepler)
    • Intermediary data products remain a problem (e.g., code, analyzed data sets)

    Olausen, S. A., & Kaspi, V. M. (2014). Table 2 from The McGill Magnetar Catalog. The Astrophysical Journal Supplement Series, 212(1), 1-22. doi:10.1088/0067-0049/212/1/6

    Harvard's Astronomy Dataverse

    A solution for intermediate-stage data products

    Physics: A Little Less Open, But Everyone Knows Where To Find It

    Reference Data

    • National Nuclear Data Center at Brookhaven National Laboratory: http://www.nndc.bnl.gov/
    • Department of Energy Data Explorer: http://www.osti.gov/dataexplorer/
    • MCPlots (Monte Carlo plots reference for HEP): http://mcplots.cern.ch/

    Monte Carlo plot reference from MCPlots

    Experimental Data

    • Durham HepData Project
        Reactions database
        Data from active experiments
        Data reviews
        http://durpdg.dur.ac.uk/HEPDATA/REAC
        Example: http://durpdg.dur.ac.uk/view/ins1297226
    • Experimental data releases
        IceCube: http://icecube.wisc.edu/science/data

    and then, we have the data grids.

    Geoscience Data Resources

    Kinds of Geoscience Data

    • Geospatial
    • Rocks and Minerals
    • Economic Geology
    • Paleobiology
    • Climate History
    • Geochemistry
    • Physical Samples

    Image credit: USGS, via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Seismograph_Pinatubo.jpg

    Physical Samples as Data

    • Identified as data in NSF guidelines
    • Different analyses = new data
    • New techniques developed over time
    • Repositories for samples
    • Specific metadata required

    Géry, Parent. 7 May 2011. Peronopsis interstrictus. Retrieved from the Wikimedia Commons at http://commons.wikimedia.org/wiki/File%3APeronopsis_interstrictus_White%2C_1874_2.jpg

    Geo/Paleo Sample Repositories/Registries

    • International Geo Sample Number (IGSN) - http://www.geosamples.org/
    • Peabody Museum - http://peabody.yale.edu/collections/search-collections
    • PaleoBioDB - http://paleobiodb.org/#/

    Other Resources for Geoscience

    • USGS Earth Explorer - http://earthexplorer.usgs.gov/
    • Data.Gov - http://www.data.gov/
    • GeoGratis - http://geogratis.cgdi.gc.ca/
    • Morphobank - http://morphobank.org/
    • EarthCube - http://earthcube.org/
    • CINERGI - http://workspace.earthcube.org/cinergi

    Problem: Rapid Growth of Biomedical Data

    GenBank Statistics: http://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/

    [Figure: Samples Submitted to Gene Expression Omnibus Database, in millions, by year, 2000-2012. Compiled from GEO historic data: http://www.ncbi.nlm.nih.gov/geo/summary/?type=history]

    Problem: Growth of the Biomedical Literature

    [Figure: Number of Records in PubMed (Biomedical Literature), in millions, by year, 1940-2020. Compiled from PubMed: http://www.ncbi.nlm.nih.gov/pubmed]

    • Huge volume (PubMed: 23,132,342 citations)
    • High diversity
    • High quality (peer review)
    • Users overwhelmed by long lists of search results
    • 1/3 of PubMed queries result in 100 or more citations (Islamaj, 2009)

    Problem: Querying the Biomedical Literature

    Querying the biomedical literature becomes more difficult

    • Medical Subject Headings
    • Filters
    • Boolean operators
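
    To make the three ingredients on this slide concrete, here is a minimal Python sketch that assembles a PubMed query from Medical Subject Headings, a filter, and Boolean operators, and then builds a request URL for NCBI's E-utilities ESearch service. The helper names (mesh, build_query, esearch_url) and the example search terms are illustrative assumptions, not part of the presentation. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh(term: str) -> str:
    """Tag a term so PubMed matches it as a Medical Subject Heading."""
    return f'"{term}"[MeSH Terms]'


def build_query(all_of, any_of=(), none_of=()):
    """Combine already-tagged terms with the Boolean operators AND, OR, NOT."""
    if not all_of:
        raise ValueError("at least one required term is needed")
    query = " AND ".join(all_of)
    if any_of:
        query += " AND (" + " OR ".join(any_of) + ")"
    if none_of:
        query += " NOT (" + " OR ".join(none_of) + ")"
    return query


def esearch_url(query: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical search: two MeSH headings, a language filter, reviews excluded.
    query = build_query(
        [mesh("Prostatic Neoplasms"), mesh("Vitamin D")],
        any_of=["english[Language]"],
        none_of=["review[Publication Type]"],
    )
    print(query)
    print(esearch_url(query))
```

    Even this small query needs the user to know the exact heading names and field tags, which is one reason such searches get harder as the literature grows.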

    Information Retrieval vs Information Extraction

    Query: EGFR

    • Information Retrieval: retrieves documents/records
    • Information Extraction: extracts facts from records
        T14D inhibited EGF receptor internalization
        EGFR regulates tumor cell proliferation
        EGFR is expressed in SCCHN

    Modified from OpenHelix
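
    The contrast on this slide can be sketched in a few lines of Python, using the slide's three example sentences as the records. This is a toy illustration under loud assumptions: the relation phrases in RELATIONS are hand-picked for these sentences, and the function names are hypothetical. Real literature-mining tools rely on natural language processing and ontologies rather than a single regular expression.

```python
import re

# Example sentences taken from the slide.
RECORDS = [
    "T14D inhibited EGF receptor internalization",
    "EGFR regulates tumor cell proliferation",
    "EGFR is expressed in SCCHN",
]

# Hypothetical, hand-picked relation phrases for this toy example only.
RELATIONS = ["inhibited", "regulates", "is expressed in"]
RELATION_GROUP = "|".join(re.escape(phrase) for phrase in RELATIONS)
FACT_PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+(?P<relation>" + RELATION_GROUP + r")\s+(?P<object>.+)$"
)


def retrieve(query, records):
    """Information retrieval: return whole records that mention the query."""
    return [record for record in records if query.lower() in record.lower()]


def extract(records):
    """Information extraction: return (subject, relation, object) facts."""
    facts = []
    for record in records:
        match = FACT_PATTERN.match(record)
        if match:
            facts.append((match["subject"], match["relation"], match["object"]))
    return facts


if __name__ == "__main__":
    print(retrieve("EGFR", RECORDS))
    for fact in extract(RECORDS):
        print(fact)
```

    Here retrieve("EGFR", RECORDS) returns only the two records that contain the literal string, and misses the sentence that says "EGF receptor" instead. Plain string matching knows nothing about synonyms, which is the gap that controlled vocabularies are meant to close.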

    Alternative Tools for Mining the Biomedical Literature

    Alternative tools for mining the biomedical literature combine:

    • Statistical methods
    • Ontologies / controlled vocabularies
    • Natural language processing tools
    • Visualization tools

    Reduced time for discovering meaningful results.

    Alternative Mining Tools for the Biomedical Literature

    Alternative Tools for Mining the Biomedical Literature

    Main gene query

    Protein/gene associated

    Synonym

    Medical terminology (MeSH)

    Alternative Tools for Mining the Biomedical Literature

    Linked to Entrez Gene and OMIM database

    Workshop - Novel Online Tools for Mining the Biomedical Literature

    Case 1: Few Results in the Biomedical Literature

    Searching for novel genes

    Case 2: Few Results in the Biomedical Literature

    Searching for side effects of drugs: Cerebyx respiratory failure

    Case 2: Few Results in the Biomedical Literature

    Phenotypic information can be used to infer molecular interactions and to hint at new uses of marketed drugs (Campillos, 2008)

    Data Annotation / Integration / Visualization Tools: Genome Browsers

    Workshop - Novel Online Tools for Mining the Biomedical Literature

    Contextualizing Data/Results in the Biomedical Knowledge

    Resulting list of upregulated genes after treatment of prostate cancer cells with VitD

    Microarray data obtained from the Gene Expression Omnibus repository were analyzed with GEO2R statistical software
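
    As a rough sketch of how a list of upregulated genes might be pulled from a GEO2R results table, the Python below filters a tab-delimited export by log fold change and adjusted p-value. Everything specific here is an assumption: the column names (logFC, adj.P.Val, Gene.symbol) should be checked against the actual download, the thresholds are arbitrary, the rows are hypothetical placeholders rather than real results, and the sign of logFC depends on how the sample groups were defined in GEO2R.

```python
import csv
import io

# Hypothetical placeholder rows in the shape of a GEO2R export; not real results.
EXAMPLE_TSV = (
    "ID\tadj.P.Val\tlogFC\tGene.symbol\n"
    "probe_1\t0.001\t2.4\tGENE_A\n"
    "probe_2\t0.200\t3.1\tGENE_B\n"
    "probe_3\t0.010\t-1.8\tGENE_C\n"
    "probe_4\t0.030\t1.2\tGENE_D\n"
)


def upregulated_genes(tsv_text, min_log_fc=1.0, max_adj_p=0.05):
    """Return (gene symbol, logFC) pairs passing both thresholds, largest logFC first.

    Assumes a positive logFC means higher expression in the treated group.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = [
        (row["Gene.symbol"], float(row["logFC"]))
        for row in rows
        if float(row["logFC"]) >= min_log_fc
        and float(row["adj.P.Val"]) <= max_adj_p
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for symbol, log_fc in upregulated_genes(EXAMPLE_TSV):
        print(symbol, log_fc)
```

    With the placeholder table, only GENE_A and GENE_D pass: GENE_B has a large fold change but fails the adjusted p-value cutoff, and GENE_C is downregulated.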

    Contextualizing Data/Results in the Biomedical Knowledge

    References

    Campillos M*, Kuhn M*, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008 Jul 11;321(5886):263-6. http://www.ncbi.nlm.nih.gov/pubmed/18621671

    Islamaj Dogan R, Murray GC, Névéol A, Lu Z. (2009) Understanding PubMed user search behavior. Database (Oxford). http://www.ncbi.nlm.nih.gov/pubmed/20157491

    Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun;12(6):996-1006. http://www.ncbi.nlm.nih.gov/pubmed/12045153

    Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6:343. Epub 2010 Jan 19. http://sideeffects.embl.de/drugs/56338/

    Rindflesch, T.C. et al. (2011) Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use, 31, 15-21. http://lhncbc.nlm.nih.gov/system/files/pub-lhncbc-2011-109.pdf
