BioMart

BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005


Page 1: BioMart

BioMart

Federated Database Architecture

Arek Kasprzyk, EBI, 9 June 2005

Page 2: BioMart

BioMart

• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a simple and scalable data management system capable of integrating distributed data sources.

Page 3: BioMart

Challenges

• Data sources
  – Large
  – Distributed
  – Different data

Page 4: BioMart

Requirements

• User
  – All data accessible through a single set of interfaces
  – Suitable for power biologists and bioinformaticians
• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation
• Architecture
  – Distributed
  – Domain agnostic
  – Platform independent

Page 5: BioMart

Query Engine

Federated architecture

Page 6: BioMart

BioMart

Data mart

User interfaces

Data sources

Page 7: BioMart

Data mart and dataset

Dataset

Page 8: BioMart

Data mart, dataset and schema

Schema

Page 9: BioMart

Dataset Configuration

XML

XML

XML

Page 10: BioMart

BioMart abstractions

• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point
  – e.g. gene name
• Filter
  – An operation on an attribute
  – e.g. ‘Chromosome = 1’
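The three abstractions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BioMart's implementation: the class names mirror the slide, but the `query` method, the in-memory rows and the gene names `GENE_A` to `GENE_C` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Attribute:
    """A single data point, e.g. a gene name (one column of a dataset)."""
    name: str


@dataclass(frozen=True)
class Filter:
    """An operation on an attribute, e.g. chromosome = 1."""
    attribute: str
    test: Callable[[Any], bool]


@dataclass
class Dataset:
    """A subset of data; here simply a named list of rows (hypothetical)."""
    name: str
    rows: List[Row]

    def query(self, attributes: List[Attribute], filters: List[Filter]) -> List[tuple]:
        # Keep rows that pass every filter, then project the requested attributes.
        selected = [r for r in self.rows
                    if all(f.test(r[f.attribute]) for f in filters)]
        return [tuple(r[a.name] for a in attributes) for r in selected]


# Made-up rows, for illustration only.
genes = Dataset("gene", [
    {"gene_display_id": "GENE_A", "chromosome": "1", "gene_start": 100},
    {"gene_display_id": "GENE_B", "chromosome": "2", "gene_start": 200},
    {"gene_display_id": "GENE_C", "chromosome": "1", "gene_start": 300},
])

on_chr1 = Filter("chromosome", lambda value: value == "1")
result = genes.query([Attribute("gene_display_id"), Attribute("gene_start")], [on_chr1])
print(result)  # [('GENE_A', 100), ('GENE_C', 300)]
```

Attributes choose what comes back and filters choose which rows qualify, so a user can compose a query without knowing the underlying tables.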

Page 11: BioMart

Datasets, Attributes and Filters

GENE

gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description

Mart

Dataset

Attribute

Filter

Page 12: BioMart

Examples

Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder

Name, chromosome position, description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes

Page 13: BioMart

Data model

[Diagram: tables connected by primary keys (PK) and foreign keys (FK)]

Page 14: BioMart

Data model

[Diagram: tables connected by primary keys (PK) and foreign keys (FK)]

Page 15: BioMart

Data model

[Diagram: tables connected by primary keys (PK) and foreign keys (FK)]

Page 16: BioMart

Data model - ‘reversed star’

[Diagram: main tables 1 and 2 with keys PK1 and PK2; dimension (dm) tables carry foreign keys FK1 and FK2]
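One reading of the ‘reversed star’ labels is that each dimension table carries a foreign key pointing at the main table's primary key. The sketch below shows that reading with Python's built-in `sqlite3` module. The table names, column names and rows are hypothetical; they loosely follow the `dataset__content__type` naming shown on the table naming convention slide.

```python
import sqlite3

# In-memory database with one main table and one dimension table.
# All table names, column names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demo__gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
CREATE TABLE demo__xref__dm (
    gene_id_key INTEGER REFERENCES demo__gene__main(gene_id_key),
    accession   TEXT
);
INSERT INTO demo__gene__main VALUES (1, 'GENE_A', '1'), (2, 'GENE_B', '2');
INSERT INTO demo__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
""")

# Filter on the main table, then pick up the one-to-many values
# from the dimension table through the shared key.
rows = conn.execute("""
    SELECT m.gene_name, d.accession
    FROM demo__gene__main AS m
    JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = '1'
    ORDER BY d.accession
""").fetchall()
print(rows)  # [('GENE_A', 'ACC001'), ('GENE_A', 'ACC002')]
conn.close()
```

Because every dimension table joins directly to the main table, any attribute is one join away from the central key.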

Page 17: BioMart

Dataset

Fixed schema transformation

A

B

TA

TB

C

Page 18: BioMart

BioMart abstractions

• Link
  – ‘Common currency’ between two datasets
  – e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Page 19: BioMart

Exportables, Importables and Links

Dataset 1

Dataset 2

Links

Page 20: BioMart

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac

Links

Page 21: BioMart

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end

Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)

Links
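The `genomic_region` example can be read as follows: the exportable's attribute values from dataset 1 become the filter values of the importable in dataset 2, paired by position. The sketch below assumes that reading. The `link` function, the dictionary layout and the sample rows are hypothetical and are not the BioMart API.

```python
import operator
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Operators named on the slide.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

# Exportable of dataset 1: the attributes it can hand over.
exportable = {"name": "genomic_region",
              "attributes": ["chr_name", "chr_start", "chr_end"]}

# Importable of dataset 2: the filters (with operators) that consume them,
# assumed here to be listed in the same order as the exported attributes.
importable = {"name": "genomic_region",
              "filters": [("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")]}


def link(source_rows: List[Row], target_rows: List[Row],
         exp: Dict[str, Any], imp: Dict[str, Any]) -> List[Tuple[Row, Row]]:
    """Pair each source row with the target rows that pass the imported filters."""
    if exp["name"] != imp["name"]:
        raise ValueError("exportable and importable names must match to form a link")
    pairs = list(zip(exp["attributes"], imp["filters"]))
    linked = []
    for src in source_rows:
        for tgt in target_rows:
            # Each filter compares the target's value with the exported value.
            if all(OPS[op](tgt[flt], src[attr]) for attr, (flt, op) in pairs):
                linked.append((src, tgt))
    return linked


# Made-up rows, for illustration only.
regions = [{"region": "R1", "chr_name": "1", "chr_start": 100, "chr_end": 500}]
snps = [
    {"snp": "SNP_A", "chr_name": "1", "chr_start": 150, "chr_end": 150},
    {"snp": "SNP_B", "chr_name": "1", "chr_start": 600, "chr_end": 600},
    {"snp": "SNP_C", "chr_name": "2", "chr_start": 200, "chr_end": 200},
]

names = [(s["region"], t["snp"]) for s, t in link(regions, snps, exportable, importable)]
print(names)  # [('R1', 'SNP_A')]
```

With a single attribute and an `=` filter, the same function covers the simpler `uniprot_id` link from the previous slide.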

Page 22: BioMart

Building BioMart databases

Source databases

Mart

Transformation

MartBuilder

Configuration

XML

MartEditor

Page 23: BioMart

MartEditor

Page 24: BioMart

Table naming convention

Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
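A naming convention like this lets software discover the structure of a mart from table and column names alone. The sketch below shows how such names could be parsed. The helper functions and the `demo__...` names are hypothetical, and treating any `meta_` prefix as a meta table is an assumption based on the `meta_content` pattern above.

```python
from typing import List, NamedTuple


class MartTable(NamedTuple):
    dataset: str
    content: str
    kind: str  # "main" or "dm"


def parse_data_table(name: str) -> MartTable:
    """Split a dataset__content__type table name into its three parts."""
    parts = name.split("__")
    if len(parts) != 3 or parts[2] not in ("main", "dm"):
        raise ValueError(f"not a dataset__content__type table name: {name!r}")
    return MartTable(*parts)


def is_meta_table(name: str) -> bool:
    """Assumption: meta tables are recognised by the meta_ prefix."""
    return name.startswith("meta_")


def key_columns(columns: List[str]) -> List[str]:
    """Key columns are recognised by the _key suffix."""
    return [c for c in columns if c.endswith("_key")]


table = parse_data_table("demo__gene__main")
print(table)                                   # MartTable(dataset='demo', content='gene', kind='main')
print(parse_data_table("demo__xref__dm").kind)  # dm
print(key_columns(["gene_id_key", "gene_name", "chromosome"]))  # ['gene_id_key']
```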

Page 25: BioMart

Retrieval

myDatabase

SNP

Vega

Ensembl

UniProt

myMart

MSD

BioMart API

JAVA Perl

MartExplorer MartShell MartView

Schema transformation

MartBuilder

XML

MartEditor

Configuration

Databases

Public data (local or remote)

BioMart architecture

Page 26: BioMart

MartView

Page 27: BioMart

MartExplorer

Page 28: BioMart

MartShell

Using = dataset

Get = attribute

Where = filter

Page 29: BioMart

Mart Query Language (MQL)

● Mart Query Language (MQL) syntax:
    using <dataset> get <attributes> where <filters>

● Can join datasets together:
    using Dataset1 get Attribute1 where Filter1=var1 as q;
    using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q

● Can script and pipe:
    martshell.sh -E MQLscript.mql > results.txt
    martshell.sh -E MQLscript.mql | wc
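Because MQL is plain text, scripts for `martshell.sh -E` can be generated programmatically. The sketch below assembles the two joined queries from the slide. `build_mql` is a hypothetical helper, and joining several attributes with commas is an assumption: the slide only shows a single attribute per query and `and` between filters.

```python
from typing import List, Optional


def build_mql(dataset: str, attributes: List[str],
              filters: Optional[List[str]] = None,
              alias: Optional[str] = None) -> str:
    """Assemble 'using <dataset> get <attributes> where <filters>'."""
    # Assumption: multiple attributes are comma-separated.
    query = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        # The slide joins filters with 'and'.
        query += " where " + " and ".join(filters)
    if alias:
        # Name the query so that a later query can refer to it with 'in'.
        query += f" as {alias};"
    return query


first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], alias="q")
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
print(first)   # using Dataset1 get Attribute1 where Filter1=var1 as q;
print(second)  # using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
```

Writing the two lines to a file such as `MQLscript.mql` would give a script in the form the slide passes to MartShell.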

Page 30: BioMart

Third party software

• Bioconductor (biomaRt) – BioMart schema

• Taverna – BioMart Java library

• DAS ProServer – BioMart Perl library

Page 31: BioMart

biomaRt

Page 32: BioMart

Taverna

Page 33: BioMart

ProServer

• No programming
• DAS requests and responses defined by Exportables and Importables and configured by MartEditor
• DAS1

Page 34: BioMart

BioMart deployers

• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Page 35: BioMart

EBI

Uniprot

MSD

SANGER

Ensembl

SNP

Vega

Sequence

WWW

Hinxton example

Page 36: BioMart

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Page 37: BioMart

WormBase

Page 38: BioMart

Ensembl

Page 39: BioMart

ArrayExpress

Page 40: BioMart

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

Page 41: BioMart

dbsnp HapMap Ensembl

Give me frequency data from dbsnp

Give me genotype and frequency data from HapMap

Give me SNPs location on gene/transcript

Give me frequency, genotype, location on gene/transcript from dbsnp, HapMap, Ensembl, RefSeq, AceView and Vega

Java graphical user interface

WWW web browser

GMIA_SNP_mart_database

RefSeq

SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
…

AceView Vega

Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

Page 42: BioMart

… what next ?

Page 43: BioMart

BioMart model

• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Variety of ‘in house’ projects
• In development
  – HapMap

Page 44: BioMart

Summary

• BioMart interface
  – Batch queries
  – ‘Data mining’
  – Large annotation
• BioMart software
  – Set up your own database
  – Make your database scalable and responsive
  – Federate with other data

Page 45: BioMart

Where are we?

• 0.2 released in February
• 0.3 to be released in June
  – Platforms
    • MySQL
    • Oracle
    • Postgres

Page 46: BioMart

Acknowledgments

• BioMart
  – Damian Smedley (EBI)
  – Darin London (EBI)
  – Will Spooner (CSHL)
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
