Susanna-Assunta Sansone (Toxicogenomics project coordinator)
Microarray Informatics Team, EMBL-EBI (European Bioinformatics Institute)
Transcriptome Symposium, April 2002
CHU Pitié-Salpêtrière, Université Paris VI

MIAME and ArrayExpress – a standard for microarray gene expression data and the public database at EBI



EMBL-EBI: a centre for research and services in bioinformatics that builds and maintains public databases:

• EMBL Nucleotide Sequence, SWISS-PROT, Ensembl, MSD, etc.

Practical reasons:
• Easy data access
• Resolves local storage issues
• Common data exchange formats can be developed

Scientific reasons:
• Curation can be applied
• Annotation can be controlled
• Information that is missing from publications can be stored
• Improves data comparison!

Public standards can be applied

Why have a public database?

MIAME standard

MIAME annotation challenge:
• MGED BioMaterial Ontology

Uses of MIAME concepts:
• ArrayExpress: a public repository for gene expression data
• MIAMExpress: submission and annotation tool

Talk structure

Talk structure

MIAME standard

Standard for microarray data - Why?

• Size of datasets
• Different platforms - nylon, glass
• Different technologies - oligos, spotted
• References to external databases are not stable!
• Array annotation
• Sample annotation

Data sharing needs a standardized way to annotate and record the information!

Microarray Gene Expression Data (MGED) Group: EBI + the world's largest microarray labs and companies (Sanger, Stanford, TIGR, Université d'Aix-Marseille II, Affymetrix, Agilent, NCBI, DDBJ, etc.)

MGED Group aims to:
• Facilitate adoption of standards for:
  – Experiment annotation
  – Data representation
• Introduce standards for:
  – Experimental controls
  – Data normalization methods

Standard for microarray data - MGED Group

Minimum information about a microarray experiment

NOT a formal specification BUT a set of guidelines

Sufficient information must be recorded to:
• Correctly interpret and verify the results
• Replicate the experiments

Structured information must be recorded to:
• Query and correctly retrieve the data
• Analyse the data

MIAME - Brazma et al., Nature Genetics, 2001

General MIAME principles

Array
• Array design information
• Location of each element
• Description of each element

Sample
• Sample source
• Sample treatments
• Extraction protocol
• Labeling protocol

Hybridisation
• Hybridization protocol

• Image
• Scanning protocol
• Software specifications

• Quantification matrix
• Analysis protocol
• Software specifications

MIAME 6 parts of a microarray experiment

Experiment

[Diagram: Experiment containing repeated Array / Sample / Hybridisation units]

Normalisation
• Strategy
• Algorithm
• Control array elements

Final data
• 3 data processing levels
• Lack of gene expression measurement units!

MIAME 6 parts of a microarray experiment
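The "sufficient and structured" idea behind the six parts can be illustrated with a small checklist. This is a minimal sketch, not an official MIAME schema: the dictionary keys, the item names, the single "description" item for the experiment, and the placement of the quantification items under final data are assumptions made for illustration, loosely following the slide bullets.

```python
# Hypothetical checklist: part -> items that should be recorded.
# Names follow the slide bullets; they are NOT an official MIAME schema.
MIAME_CHECKLIST = {
    "experiment": ["description"],
    "array": ["design_information", "element_locations", "element_descriptions"],
    "sample": ["source", "treatments", "extraction_protocol", "labeling_protocol"],
    "hybridisation": ["hybridization_protocol"],
    "final_data": ["quantification_matrix", "analysis_protocol"],
    "normalisation": ["strategy", "algorithm", "control_array_elements"],
}


def missing_items(record):
    """Return {part: [items not supplied]} for every incomplete part."""
    missing = {}
    for part, items in MIAME_CHECKLIST.items():
        supplied = record.get(part, {})
        # An item counts as missing if it is absent or empty.
        absent = [item for item in items if not supplied.get(item)]
        if absent:
            missing[part] = absent
    return missing


# Toy submission: the sample is fully described, everything else is absent.
submission = {
    "sample": {
        "source": "liver",
        "treatments": "fenofibrate",
        "extraction_protocol": "protocol A",
        "labeling_protocol": "protocol B",
    }
}
report = missing_items(submission)
print(sorted(report))
```

A structured record of this kind is what makes an automated completeness check possible; free-text descriptions would not support it.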

Annotation implementations are required!
• Avoid/reduce free text descriptions
• Use of controlled terms
• Definitions and sources for each term
• Removal of synonyms, or use of synonym mappings
• Data curation at source (LIMS)
• Integration of controlled terms in query interfaces

Facilitates data queries and analysis

MIAME – Annotation challenge
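A synonym mapping of the kind listed above could look like the following. This is a minimal sketch under stated assumptions: the vocabulary, its terms and the function names are hypothetical and are not taken from any MGED resource.

```python
# Hypothetical controlled vocabulary: approved term -> known synonyms.
CONTROLLED_TERMS = {
    "liver": {"hepatic tissue", "hepar"},
    "kidney": {"renal tissue"},
}


def build_synonym_map(vocabulary):
    """Map every approved term and synonym (lower-cased) to its approved term."""
    mapping = {}
    for approved, synonyms in vocabulary.items():
        mapping[approved.lower()] = approved
        for synonym in synonyms:
            mapping[synonym.lower()] = approved
    return mapping


def normalise(term, synonym_map):
    """Return the approved term, or None when the term is not controlled."""
    return synonym_map.get(term.strip().lower())


SYNONYM_MAP = build_synonym_map(CONTROLLED_TERMS)
print(normalise(" Hepar ", SYNONYM_MAP))   # liver
print(normalise("spleen", SYNONYM_MAP))    # None
```

Returning None for unknown terms, rather than passing them through, lets a curator or submission tool flag free text that still needs a controlled term.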

[Figure: gene expression matrix. Rows: genes and transcription units. Columns: samples. Cells: gene expression levels.]

• Array description:
  - Gene annotations
• Sample annotations:
  - Source
  - Treatment

A gene expression database from the data analyst’s point of view
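The analyst's view described above, a matrix of expression levels with annotations on both axes, can be sketched as a small data structure. The class, its method names and all values below are invented for illustration; they are not ArrayExpress structures or real measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExpressionMatrix:
    genes: List[Dict[str, str]]     # row annotations (one dict per gene)
    samples: List[Dict[str, str]]   # column annotations (one dict per sample)
    levels: List[List[float]]       # levels[row][column]

    def __post_init__(self) -> None:
        if len(self.levels) != len(self.genes):
            raise ValueError("one row of levels is needed per gene")
        if any(len(row) != len(self.samples) for row in self.levels):
            raise ValueError("one level is needed per sample in every row")

    def sample_columns(self, **criteria: str) -> List[int]:
        """Indices of samples whose annotations match all criteria."""
        return [
            index
            for index, sample in enumerate(self.samples)
            if all(sample.get(key) == value for key, value in criteria.items())
        ]

    def levels_for(self, gene_id: str, **criteria: str) -> List[float]:
        """Levels of one gene, restricted to the matching samples."""
        columns = self.sample_columns(**criteria)
        for gene, row in zip(self.genes, self.levels):
            if gene.get("id") == gene_id:
                return [row[column] for column in columns]
        raise KeyError(gene_id)


# Toy values, invented for the example.
matrix = ExpressionMatrix(
    genes=[{"id": "gene_a"}, {"id": "gene_b"}],
    samples=[
        {"source": "liver", "treatment": "fenofibrate"},
        {"source": "liver", "treatment": "none"},
    ],
    levels=[[2.0, 1.0], [0.5, 0.6]],
)
print(matrix.levels_for("gene_a", treatment="fenofibrate"))  # [2.0]
```

The query only works because the sample annotations are structured key-value pairs, which is the point the slides make about annotation.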

MIAME - Gene annotation

Unambiguous identification

Synonyms!
• Community approved names
• Alternative to gene names

Usable external sources, e.g.:
• EMBL-GenBank - sequence accession numbers
• Jackson Lab - approved mouse gene names
• HUGO - approved human gene names
• GO categories - function, process, location

MIAME - Sample annotation

Gene expression data only have a meaning in the context of detailed sample descriptions!

Usable external sources, e.g.:
• NCBI Taxonomy - organisms
• Jackson Lab - mouse strain names
• Mouse Anatomical Dictionary - mouse anatomy
• ChemID - compounds
• ICD-9 - disease classification

More is needed...

Annotation – implementations required!

Need an ontology to describe the sample:
• Defining controlled vocabularies, and
• Using existing external ontologies

Integrate the ontology in LIMS and databases:
• Develop a browser or interface for the ontology
• Develop internal editing tools for the ontology

However, some free text description is unavoidable

Talk structure

MIAME standard

MIAME annotation challenge:
• MGED BioMaterial Ontology

What are a CV and an ontology?

Controlled Vocabulary (CV):
• A restricted set of terms used to describe something; in the simplest case it could be a list

An ontology is more than a CV:
• Describes the relationships between the terms in a structured way, provides semantics and constraints
• Captures knowledge and makes it machine processable
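The difference between the two can be shown with a toy example. This is a sketch only: the terms and the single "is a" relation below are hypothetical and far simpler than a real ontology, which also carries definitions, several relation types and constraints.

```python
# A controlled vocabulary in the simplest case: a flat set of allowed terms.
SEX_CV = {"female", "male", "unknown"}

# A toy ontology fragment: each term points to its parent via an "is a" relation.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "organism part",
}


def ancestors(term, is_a):
    """Follow the "is a" links upwards and return every more general term."""
    chain = []
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain


def is_kind_of(term, category, is_a):
    """True if the term equals the category or is a descendant of it."""
    return term == category or category in ancestors(term, is_a)


print("female" in SEX_CV)                          # True: a CV only answers "is this term allowed?"
print(ancestors("liver", IS_A))                    # ['organ', 'organism part']
print(is_kind_of("kidney", "organism part", IS_A)) # True: an ontology supports inference
```

With the relations in place, a query for "organism part" can retrieve samples annotated as "liver" or "kidney", which a flat list cannot do.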

Sample annotation – MGED BioMaterial Ontology

• Under construction by Chris Stoeckert (Univ. of Penn.) and MGED members
• Uses OilEd (rdf, daml and html files available)
• Motivated by MIAME and guided by ‘case scenarios’
• Defines terms, provides constraints, develops CVs for sample annotation
• Links also to external CVs and ontologies
• Will be extended to other parts of a microarray experiment that need to be described

Sample annotation – MGED BioMaterial Ontology: an example

Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references:

“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared...”

MGED BioMaterial Ontology class | Instance | External reference
BioMaterialDescription | |
Biosource Property | |
Organism | | NCBI Taxonomy: Mus musculus musculus, id: 39442
Age | 7 weeks after birth |
DevelopmentStage | | Mouse Anatomical Dictionary: Stage 28
Sex | Female |
StrainOrLine | | International Committee on Standardized Genetic Nomenclature for Mice: C57BL/6
BiosourceProvider | Charles River, Japan |
OrganismPart | | Mouse Anatomical Dictionary: Liver
BioMaterialManipulation | |
EnvironmentalHistory | |
CultureCondition | |
Temperature | 22 ± 2 °C |
Humidity | 55 ± 5% |
Light | 12 hours light/dark cycle |
PathogenTests | Specified pathogen free conditions |
Water | ad libitum |
Nutrients | MF, Oriental Yeast, Tokyo, Japan |
Treatment | |
CompoundBasedTreatment | |
(Compound) | | ChemIDplus: Fenofibrate, CAS 49562-28-9
(Treatment_application) | in vivo, oral gavage |
(Measurement) | 100 mg/kg body weight |
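Part of the fenofibrate example above can be written as a machine-processable record. This is a sketch only: the `Annotation` class and the two helper functions are hypothetical, while the class names, values and sources are copied from the slide. Only a subset of the annotated classes is included.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Annotation:
    ontology_class: str            # MGED BioMaterial Ontology class named on the slide
    value: str                     # instance or externally defined term
    source: Optional[str] = None   # external reference, when the term comes from one


SAMPLE_ANNOTATION = [
    Annotation("Organism", "Mus musculus musculus", "NCBI Taxonomy"),
    Annotation("Age", "7 weeks after birth"),
    Annotation("Sex", "Female"),
    Annotation(
        "StrainOrLine",
        "C57BL/6",
        "International Committee on Standardized Genetic Nomenclature for Mice",
    ),
    Annotation("OrganismPart", "Liver", "Mouse Anatomical Dictionary"),
    Annotation("Compound", "Fenofibrate, CAS 49562-28-9", "ChemIDplus"),
]


def externally_referenced(annotations: List[Annotation]) -> Dict[str, str]:
    """Classes whose value is defined by an external resource."""
    return {a.ontology_class: a.source for a in annotations if a.source}


def local_instances(annotations: List[Annotation]) -> List[str]:
    """Classes whose value was supplied directly by the submitter."""
    return [a.ontology_class for a in annotations if a.source is None]


print(local_instances(SAMPLE_ANNOTATION))  # ['Age', 'Sex']
```

Keeping the source next to each value records where a term is defined, so that the same term can be resolved consistently across submissions.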

Talk structure

MIAME standard

Sample annotation:
• MGED BioMaterial Ontology

Uses of MIAME concepts:
• ArrayExpress: a public repository for gene expression data
• MIAMExpress: submission and annotation tool

Specifies the content of the information:
• Sufficient
• Structured

Uses of MIAME concepts

Uses:
• Creation of MIAME-compliant LIMS or databases, e.g. ArrayExpress
• Development of submission/annotation tools for generating MIAME-compliant information, e.g. MIAMExpress

[Diagram: ArrayExpress data flow. Labels: Users; EBI Web server; Browse-Query; ArrayExpress (Central database, Data warehouse); Curation database; Image server; Update; MAGE-ML; Output; Loader; Submission; MIAMExpress; LIMS]

ArrayExpress – data flow

[Diagram labels: ArrayExpress; Central database; Data warehouse]

Implementation in ORACLE of the MAGE-OM model:
• Microarray Gene Expression - Object Model
• OMG approved standard (MGED and Rosetta, 2001)
• Model developed in UML

Object model-based query mechanism:
• Automatic mapping to SQL

Independent of:
• Experimental platform
• Image analysis method
• Normalization method

MAGE-ML data loader:
• Microarray Gene Expression - Mark-up Language, generated from the model

ArrayExpress - details
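The idea of querying through an object model that is mapped automatically to SQL can be illustrated in miniature. This is a sketch under stated assumptions, not the ArrayExpress implementation: `BioSample` is a hypothetical class rather than a MAGE-OM class, the helper functions are invented for the example, and Python's built-in `sqlite3` stands in for ORACLE.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class BioSample:  # hypothetical class, NOT part of MAGE-OM
    identifier: str
    organism: str
    organism_part: str


def create_table_sql(cls):
    """Derive a CREATE TABLE statement from the class attributes."""
    columns = ", ".join(f"{f.name} TEXT" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({columns})"


def insert(conn, obj):
    """Store one object as a row in the table named after its class."""
    placeholders = ", ".join("?" for _ in fields(obj))
    conn.execute(
        f"INSERT INTO {type(obj).__name__} VALUES ({placeholders})", astuple(obj)
    )


def query(conn, cls, **criteria):
    """Translate attribute=value criteria into SQL and return objects."""
    unknown = set(criteria) - {f.name for f in fields(cls)}
    if unknown:
        raise ValueError(f"not attributes of {cls.__name__}: {sorted(unknown)}")
    sql = f"SELECT * FROM {cls.__name__}"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in criteria)
    rows = conn.execute(sql, tuple(criteria.values()))
    return [cls(*row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(BioSample))
insert(conn, BioSample("s1", "Mus musculus", "liver"))
insert(conn, BioSample("s2", "Mus musculus", "kidney"))
print(query(conn, BioSample, organism_part="liver"))
```

The caller works only with classes and attributes; the SQL is generated from the model, which is what keeps such a query layer independent of platform-specific details.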

[Diagram: the MIAME 6 parts of a microarray experiment (Experiment, Array, Sample, Hybridisation, Normalisation, Final data)]

ArrayExpress – conceptual model

ArrayExpress – simplified model

• Classes are represented by boxes

• Classes describe objects

• Related classes are grouped together in packages

• MAGE-OM has 16 packages, ~ 150 tables

Currently:
• Human data - EMBL (ironchip)
• Yeast data - EMBL
• S. pombe - Sanger Institute
• Available as example annotated and curated data sets

Near future:
• Array descriptions - TIGR
• Array description - Affymetrix
• Mouse data - TIGR and HGMP
• Anopheles data - EMBL
• Direct pipeline - Sanger Institute LIMS
• Data - DESPRAD partners
• Toxicogenomics data - ILSI HESI

ArrayExpress – data (via MAGE-ML)

ArrayExpress – query interface

First release: 12 January 2002

[Diagram: Expression Profiler components linked to ArrayExpress expression data]
• EPCLUST: expression data
• GENOMES: sequence, function, annotation
• SPEXS: discover patterns
• PATMATCH: visualise patterns
• SEQLOGO
• URLMAP: provide links
• EP:GO: GeneOntology
• EP:PPI: protein-protein interactions
• External data, tools: pathways, function, etc.

ArrayExpress – link to Expression Profiler

User support and help documentation:
• Ontologies and CVs
• Minimize free text, removal of synonyms
• Help on MAGE-ML format and MAGE-OM

MIAME compliance check

Curation at source (LIMS)

To provide high-quality, well-annotated data and allow automated data analysis

ArrayExpress – curation effort

Submission and annotation tool:
• Curators will monitor the submissions

Based on MIAME concepts:
• Experiment, Array and Protocol submissions
• Generates MIAME-compliant information

Uses MGED BioMaterial Ontology terms:
• Terms and required fields are explained

Allows user-driven ontology development:
• Users can provide new terms and their sources

Allows browsing:
• Array descriptions
• Protocols

MIAMExpress - details

Version 1 launch in December 2002

Expected users:
• Limited local bioinformatics support
• No LIMS on site
• Small-scale users with custom-made arrays

Can be installed as a local version:
• As a lab-book to annotate your experiment
• As part of a LIMS

Interfaces:
• Version 1 is general
• Future versions: application-specific interfaces
  - Species specific
  - Toxicogenomics specific (ILSI HESI)

MIAMExpress - details

Load public data into ArrayExpress:
• TIGR, EMBL, ILSI HESI, DESPRAD partners

Improve query interfaces

Launch MIAMExpress v.1 (Dec. 2002)

MIAMExpress v.2:
• Extended according to the user needs
• Integrated MGED ontology
• Increased usability, flexibility and scalability

Develop curation tools

ArrayExpress - future

Acknowledgments

Microarray Informatics Team at EBI (19 members):

• Alvis Brazma (Team Leader and MGED President)

• Helen Parkinson (Curation Coordinator)

• Mohammad Shojatalab (MIAMExpress Database Programmer)

• Ugis Sarkans (ArrayExpress Database development coordinator)

• Jaak Vilo (Expression Profiler)

• Curators and Programmers.

MGED members and working groups:
• Alvis Brazma (MGED President, MIAME)
• Chris Stoeckert, U. Penn. (MGED Ontology Working Group)

Open source resources:
• ArrayExpress and MIAMExpress schemas, access to code
• MIAME document and glossary
• MAGE-ML DTD and annotation examples
• MGED Ontology and other resources

www.mged.org / www.ebi.ac.uk/[email protected]

Be aware of MIAME!
• Nature and Lancet have already expressed their interest
• Funding agencies

Join MGED meetings, tutorials and mailing lists:
• MGED-5 meeting in Japan (Sept. 2002)
• Ontology for BioSample description, EBI (Nov. 2002)

Resources and ... messages