Bioschemas presentation at ECCB 2016, The Hague

Posted on 21-Jan-2017



Bioschemas.org

Structured data for Life Sciences using Schema.org

Niall Beard

Scientific Web Technologist, University of Manchester

ELIXIR: European infrastructure for biological information

Data infrastructure for Europe’s life-science research:

www.elixir-europe.org

@ELIXIREurope

Data

Interoperability

Tools

Compute

Training

Marine metagenomics

Human data

Crop and forest plants

Rare diseases

• 20 Members • 1 Observer

ELIXIR Hub based alongside EMBL-EBI in Hinxton


FAIR

Findable

Accessible

Interoperable

Reusable

Finding resources – Search engine index

Resource Resource Resource

Finding resources – Catalogues

bio.tools

tess.elixir-uk.org

Discover resources by filtering metadata

Finding resources – Content Integration platforms

Training Resource

Training Resource

Training Resource

Tool Resource

Tool Resource

Tool Resource

bio.tools

tess.elixir-uk.org

Programmatically aggregated

Bio.tools XSD

https://github.com/bio-tools/biotoolsxsd

Metadata model, i.e. Recipe type

<div itemscope itemtype="http://schema.org/Recipe">
  <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
</div>

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Recipe",
  "name": "Potato Salad",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "144 kcal"
  },
  "recipeIngredient": ["800g small new potato", "3 shallot"]
  . . .
}
</script>
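As a rough illustration of why the JSON-LD form is convenient for machines, the sketch below pulls every application/ld+json block out of a page using only the Python standard library. The page content is a completed, syntactically valid variant of the recipe example (the slide elides part of it), and the names JsonLdExtractor and extract_jsonld are made up for this sketch; it is not how any particular search engine or portal does it.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Illustrative page: the recipe from the slide, completed so that it parses.
PAGE = """
<html><body>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Recipe",
  "name": "Potato Salad",
  "nutrition": {"@type": "NutritionInformation", "calories": "144 kcal"},
  "recipeIngredient": ["800g small new potato", "3 shallot"]
}
</script>
</body></html>
"""


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in the HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


recipe = extract_jsonld(PAGE)[0]
print(recipe["name"], recipe["nutrition"]["calories"], recipe["recipeIngredient"])
```

Because the markup is plain JSON, no page-specific scraping rules are needed once the script block has been located.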

Search engine readable = optimized

Content Content Content

Schema.org Schema.org Schema.org

Search engines favour websites containing schema.org in their search results

Content integration aggregation

Training Resource

Training Resource

Training Resource

Schema.org Schema.org Schema.org

tess.elixir-uk.org

Minimum information

Controlled vocabularies

Cardinality

Data model

New properties
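One way to picture minimum information, cardinality and controlled vocabularies working together is a small profile check like the sketch below. Everything in it is an assumption made for illustration: the profile contents, the difficultyLevel property (standing in for a "new property") and the validate function are hypothetical and are not taken from any published Bioschemas specification.

```python
# Hypothetical profile for a training material:
# property -> (minimum count, maximum count or None for unbounded,
#              set of allowed values or None for free text)
TRAINING_MATERIAL_PROFILE = {
    "name": (1, 1, None),
    "description": (1, 1, None),
    "keywords": (1, None, None),
    # "difficultyLevel" is an invented property with an invented vocabulary.
    "difficultyLevel": (0, 1, {"beginner", "intermediate", "advanced"}),
}


def as_list(value):
    """Normalise a missing, single or repeated value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def validate(record, profile):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for prop, (min_n, max_n, vocab) in profile.items():
        values = as_list(record.get(prop))
        if len(values) < min_n:
            problems.append(f"{prop}: expected at least {min_n}, found {len(values)}")
        if max_n is not None and len(values) > max_n:
            problems.append(f"{prop}: expected at most {max_n}, found {len(values)}")
        if vocab is not None:
            for value in values:
                if value not in vocab:
                    problems.append(f"{prop}: '{value}' is not in the controlled vocabulary")
    return problems


GOOD = {
    "name": "Introduction to RNA-Seq",
    "description": "Hands-on tutorial",
    "keywords": ["RNA-Seq", "transcriptomics"],
    "difficultyLevel": "beginner",
}
BAD = {"name": ["A", "B"], "keywords": [], "difficultyLevel": "expert"}

print(validate(GOOD, TRAINING_MATERIAL_PROFILE))
for problem in validate(BAD, TRAINING_MATERIAL_PROFILE):
    print(problem)
```

Keeping the profile as data rather than code means each content type could carry its own minimum set while sharing the same checker.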

BioSchemas.org

minimal, maximal, extensible

Training materials

Events

Organizations

Data

Standards

Software

Minimum information for one content type

Training materials

Events

Organizations

Data

Software

Standards

Common properties among content types

More depth to a broad-reach technology

Depth

DATS

Reach

Use case 1: TeSS, ELIXIR Training Portal - Aggregates Life Science Training Materials

Large Training Sites
• Well-formed APIs
• XML Dumps
• RSS feeds

Medium/Small Sites
• No structured data

The long tail, collections sets and small science

Slide courtesy of Todd Vision, Dryad

http://www.france-bioinformatique.fr/en/training_material

https://search.google.com/structured-data/testing-tool

Applied Drupal 7 schema.org extension

Took about 2 hours

Included in TeSS in an hour

BioSamples entry (Diabetic mouse strain)

isAbout

Diabetes term EFO_0000400

Defined by

Experimental Factor Ontology

Courtesy of Tony Burdett and Simon Jupp

Use case 2: Mapping data to ontologies

Organization
- name

MedicalEntity
- name
- description

MedicalCode
- codeValue
- codingSystem

MedicalCode
- name
- url
- alternateName
- description
- codeValue
- codingSystem
…

CreativeWork
- about
- name
- description
- url
- datePublished
…

Data Term Ontology

Courtesy of Tony Burdett and Simon Jupp

Use case 2: Mapping data to ontologies
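The data, term, ontology chain from this use case could be expressed in JSON-LD roughly as in the sketch below, which uses the schema.org types and properties listed on the slide (CreativeWork.about, MedicalEntity.name, MedicalCode.codeValue and codingSystem) plus MedicalEntity.code to link the term to its code. The function name annotate_with_term and the sample name and description are placeholders invented here; only the EFO_0000400 identifier and the ontology name come from the slides.

```python
import json


def annotate_with_term(name, description, term_label, term_id, ontology_name):
    """Build a schema.org description of a data entry that is 'about' an ontology term."""
    return {
        "@context": "http://schema.org",
        "@type": "CreativeWork",            # the data entry
        "name": name,
        "description": description,
        "about": {
            "@type": "MedicalEntity",       # the term the entry is about
            "name": term_label,
            "code": {
                "@type": "MedicalCode",     # the term's identifier in its ontology
                "codeValue": term_id,
                "codingSystem": ontology_name,
            },
        },
    }


entry = annotate_with_term(
    name="Diabetic mouse strain sample",           # placeholder text
    description="Illustrative sample entry",       # placeholder text
    term_label="Diabetes",
    term_id="EFO_0000400",
    ontology_name="Experimental Factor Ontology",
)
print(json.dumps(entry, indent=2))
```

The printed object can be embedded in a page inside a script element of type application/ld+json.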

Use case 3.1: Dataset Markup, Citation

• Dataset Citation
• Mapping to JATS (Journal Article Tag Suite) Data extension*
• Metadata for data citation

Google, Bing, Yahoo, Yandex

Training materials

Events

Organizations

Data

Software

Standards

*Daniel Mietchen et al., Adapting JATS to support data citation, Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015, Bethesda (MD): National Center for Biotechnology Information, 2015.

Use case 3.2: Dataset Markup, Samples

• Biobank Samples
• Limited number of simple key properties
• Disease, gender, age and sample type, data available
• Cross-walk MIABIS: Minimum Information About BIobank data Sharing

Google, Bing, Yahoo, Yandex

Training materials

Events

Organizations

Data

Software

Standards

Cataloging 400 UK Biobanks

Value for content providers

• More exposure through search engines and portals
  • Favoured in search results

• Low barrier for adoption
  • Embedding schema.org in pages can be done with an off-the-shelf CMS
  • Tools for most frameworks and web scripting languages

• Longevity of Standard
  • Standard is open to the wider community and will survive past funding
  • Less chance of the schema being deprecated after implementation

Value for content integration platforms

• Good benefits to persuade providers to structure their data

• Lots of tooling available for parsing structured data
  • Many open RDFa, JSON-LD, and microdata parsers available on GitHub

• Wider community engaged in construction
  • Schema.org is a public forum so not limited to just the people you know

• Much more scalable than scraping
  • Scraping relies on bespoke scripts that accumulate technical debt
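To make the contrast with scraping concrete, here is a minimal sketch of reading microdata with the Python standard library. It collects the text of leaf itemprop elements into a flat dictionary and deliberately ignores nesting, so it is far less complete than the open parsers mentioned above; the class and function names are invented for this example, and an element containing a nested tag of the same name would be cut short.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the text of leaf itemprop elements into a flat dict of lists."""

    def __init__(self):
        super().__init__()
        self.props = defaultdict(list)
        self._name = None   # itemprop currently being read, if any
        self._tag = None    # tag that opened it
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Elements that open a nested item (itemscope) are skipped in this flat sketch.
        if self._name is None and "itemprop" in attrs and "itemscope" not in attrs:
            self._name, self._tag, self._text = attrs["itemprop"], tag, []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._name is not None and tag == self._tag:
            self.props[self._name].append("".join(self._text).strip())
            self._name = self._tag = None


def collect_itemprops(html):
    """Return {itemprop: [text, ...]} for the leaf properties in the HTML string."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.props)


SNIPPET = """
<div itemscope itemtype="http://schema.org/Recipe">
  <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>
  </div>
  Ingredients:
  <span itemprop="recipeIngredient">800g small new potato</span>
  <span itemprop="recipeIngredient">3 shallot</span>
</div>
"""

print(collect_itemprops(SNIPPET))
```

The same collector works on any page that uses these attributes, whereas a scraper would need selectors written for each site's layout.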

Development Process

Acknowledgements


• TeSS: Niall Beard

• BioSharing: SA Sansone, A Gonzalez-Beltran, P McQuilton, P Rocca-Serra

• NIH BD2K bioCADDIE: SA Sansone, A Gonzalez-Beltran, Jeff Grethe

• Community: Premysl Velek

• Event: Martin Cook

• Training materials: Aleksandra Nenadic & Gabriella Rustici

Organization representatives

Group chairs

BioSchemas community

• ELIXIR: Premysl Velek

• Pistoia Alliance: Richard Holland

• GOBLET: Terri Attwood

• BBMRI: Michaela Mayrhofer

• Organization: Richard Holland & Rafael C Jimenez

• Person: Niall Beard

• Standard: A Gonzalez-Beltran & P McQuilton

Contributors

• Aleksandra Nenadic • Adam Hospital • Gabriella Rustici • Carlos Horro • Martin Cook • Niall Beard • Rafael C Jimenez • Andy Jenkinson • Manuel Corpas • Roberto Preste • Richard Holland • Alejandra Gonzalez-Beltran • Andrew Lonie • Carole Goble • Peter McQuilton • Premysl Velek • Ian Dunlop • Jeff Grethe • Milo Thurston • Niklas Blomberg

• Isabelle Perseil • Jaap Heringa • Jon Ison • John Hancock • Simon Jupp • John (Jack) D. Van Horn • Ivana Krenkova • Laura Furlong • Morris Swertz • Mateusz Kuzak • Mario Alberich • Mark Thompson • Maria Martin • Mikael Borg • Montserrat González • Norman Morrison • Núria Queralt-Rosinach • Olivier Sallou • Robert Pergl • Pedro Fernandes

• Yasset Perez-Riverol • Sarala Wimalaratne • Nick Juty • Jose Luis Ambite • Brane Leskošek • Celia van Gelder • Christa Janko • Christine Staiger • Dan Brickley • Daniel Faria • Dmitry Repchevsky • Daniel Sobral • Daniel Vaughan • Ian Fore • Frederik Coppens • Josep Ll. Gelpi • ChuQiao Gong • Hedi Peterson • Hervé Ménager • Nina Hrtonova

• Pierre Larmande • Rob Finn • Renzo Kottmann • Rodrigo Lopez • Sameer Velankar • Sara Light • Carol Shreffler • Silvano Squizzato • Susanna Sansone • Tony Burdett • Terri Attwood • Cath Brooksbank • Luc Deltombe • Michaela Mayrhofer • Philippe Rocca-Serra

Upcoming Bioschemas Activities

• Biosoftware description using bio.tools and schema.org - NETTAB, 24th October

• Bioschemas AGM on 8th-9th November in Rothamsted, UK
  • See: https://goo.gl/hu7uYK

• Implementation study proposal being drafted

• Develop more content types for life sciences:
  • Data repository
  • Dataset
  • Sample
  • Phenotype
  • Protein annotations

http://bioschemas.org

@BioSchemas

Thank you!

Mailing List: all@bioschemas.org