Development and Use of Controlled Vocabularies at the Arabidopsis Information Resource (TAIR)

Sue Rhee
Carnegie Institution, Dept. Plant Biology
rhee@acoma.stanford.edu


TAIR

http://arabidopsis.org

• A model organism database for Arabidopsis thaliana

• Current major data types:
  • community (~11,000 people, ~4,000 labs)
  • literature (~12,000 articles, ~450 reviews)
  • genes and proteins (~29,000 genes, ~28,000 proteins)
  • alleles and polymorphisms (~150,000)
  • germplasms (~150,000, ~1,000 mutant, ~800 ecotypes)
  • ‘expert’ gene families (~450 containing ~4,000 genes)
  • microarray data (~130 experiments, ~600 hybridizations)
  • metabolic pathways (~170 pathways, ~1,000 reactions)


Controlled Vocabularies

Existing:
• GO function, process, component
• Arabidopsis anatomy, developmental stages

Under development:
• experimental methods
• environmental factors
• PO anatomy, developmental stages

Planned:
• PO trait

Needed:
• chemical
• values? (qualitative and quantitative)


Developing controlled vocabs

• Anatomy

• Developmental Stages

• Methodology

Using controlled vocabs

• Gene and gene product functional annotation

• Community

• Microarray experiments and array elements

• Alleles

• Germplasm


Purpose for Anatomy and Developmental Stages Ontologies

To describe things like:

• where is a gene expressed in the plant

• at what stage of development was the plant when the RNA sample was taken

• from what tissues was the protein sample derived

• what part(s) of the plant are affected in a mutant line


Anatomy and Developmental Stages

• anatomical parts (295)
• developmental stages (69)

• Sources
  • Katherine Esau (1960) The Anatomy of Seed Plants
  • John Bowman (1994) Arabidopsis An Atlas of Morphology and Development
  • Meyerowitz & Somerville ed. (1994) Arabidopsis An Atlas of Morphology and Development
  • Numerous primary articles and websites on development/anatomy
  • Stanley Letovsky, Cereon Genomics
  • Doug Boyes, Paradigm Genetics

Leonore Reiser (TAIR), Jonathan Clarke (JIC)


Rules for Anatomy and Developmental Stages Ontology Development

• Terms are taken from literature and textbooks that describe anatomy and development (364 terms; 221 definitions).

• For anatomy, the terms must describe parts that are found in Arabidopsis (limited scope).

• Developmental stages should be based on morphological features, regardless of a time component, because different accessions reach the same stage at different times. An example is the floral developmental stages defined by John Bowman.


Methods for Anatomy and Developmental Stages Ontology Development

• Created separate anatomy and developmental stage ontologies as directed acyclic graphs.

• Tried to make the graphs orthogonal in order to generate cross products easily.

• Creating cross products between stages and anatomy (what parts exist at what stage?)

• Creating cross products with developmental process terms.

• Used DAGEditor (BDGP) to build the ontologies and, eventually, to make the cross products.
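
The idea of crossing two orthogonal graphs can be illustrated with a minimal Python sketch. It is not how DAGEditor builds cross products; the term names, the parent-list representation, and the `cross_product` helper are all hypothetical and chosen only for illustration. Each crossed term is an (anatomy, stage) pair whose parents are reached by moving up one level in exactly one of the two source graphs.

```python
from itertools import product

# Hypothetical miniature DAGs: each term maps to the list of its parent terms.
anatomy = {
    "plant": [],
    "flower": ["plant"],
    "petal": ["flower"],
}
stages = {
    "floral development": [],
    "floral stage 9": ["floral development"],
}


def cross_product(dag_a, dag_b):
    """Return a DAG whose terms are (a, b) pairs.

    The parents of a pair are found by stepping up one level in exactly
    one of the two source graphs, so the result is acyclic whenever both
    inputs are acyclic.
    """
    crossed = {}
    for a, b in product(dag_a, dag_b):
        parents = [(pa, b) for pa in dag_a[a]] + [(a, pb) for pb in dag_b[b]]
        crossed[(a, b)] = parents
    return crossed


crossed = cross_product(anatomy, stages)
print(len(crossed))
print(crossed[("petal", "floral stage 9")])
```

In this sketch every pair is generated. Answering "what parts exist at what stage?" would still require pruning the pairs that do not occur in the plant.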


Ontologies for Anatomy and Development


Crossing Anatomy and Developmental Stages


Current Status

• 221/364 terms are defined.

• Terms (definitions and relationships) are checked for accuracy (external review and literature).

• Being used to annotate genes and products

• Files available on GOBO and TAIR ftp sites

• In collaboration with MaizeDB and Gramene on sharing the ontologies to build a common plant ontology (probably flowering plant…)


Developing Methods Ontology
(An ontology of experimental techniques)

• Sources:
  • short, semi-controlled description of experimental information during annotation (102)
  • protocols from the research community (152)
  • microarray experiments (129)

• Current status:
  • DAG structure
  • 195 terms and 3 definitions
  • more structure revision needed

Leonore Reiser, Margarita Garcia-Hernandez, Gabe Lander


Developing controlled vocabs

• Anatomy

• Developmental Stages

• Methodology

Using controlled vocabs

• Gene and gene product functional annotation

• Community

• Microarray experiments and array elements

• Alleles

• Germplasm


Currently Annotated Data Types

• Genes and gene products (2931)

– molecular function (2599 genes to 296 terms)

– biological process (536 genes to 269 terms)

– subcellular location (695 genes to 104 terms)

– anatomy & devel. stages (117 genes to 50 terms)

– spatial and temporal expression pattern (110 genes to 52 terms)

• Community (2415 comm. to 2892 terms)

– research interest (2737 terms)

– organism of interest (192 terms)


Basic Process of Literature Curation

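[Diagram: Subject term, relationship type, Object term, with the Paper as the source of the statement]

• Subject term: data object (e.g. gene)

• Object term: controlled vocabulary term or data object

• Relationship types (currently 20 types): binds to, involved in, functions as, expressed in, is subunit of, related to, required for, located in, interacts with, regulates, more…

[Diagram labels: automatic, automatic, manual]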
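
One way to picture a curated statement is as a small record that ties a subject term to an object term through a relationship type, with the paper as evidence. The Python sketch below is only an illustration under that assumption; the `Annotation` class, its field names, and the example values (`GeneX`, `PaperY`) are hypothetical and do not describe the TAIR or PubSearch schema. Only the ten relationship types named on the slide are listed, since the full set of 20 is not given.

```python
from dataclasses import dataclass

# Relationship types named on the slide; the remaining types are not listed there.
RELATIONSHIP_TYPES = {
    "binds to", "involved in", "functions as", "expressed in",
    "is subunit of", "related to", "required for", "located in",
    "interacts with", "regulates",
}


@dataclass(frozen=True)
class Annotation:
    subject: str       # data object, e.g. a gene
    relationship: str  # one of the controlled relationship types
    obj: str           # controlled vocabulary term or another data object
    paper: str         # reference that supports the statement

    def __post_init__(self):
        # Reject free-text relationships so that annotations stay queryable.
        if self.relationship not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship!r}")


# Hypothetical example values, for illustration only.
ann = Annotation("GeneX", "expressed in", "petal", "PaperY")
print(ann)
```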


Pubsearch Statistics (10/02)

| Data type | Unique matched | Unique papers | % |
| Genes | 3182 | 7485 | 8.9% |
| GO process terms | 653 | 8439 | 9.5% |
| GO function terms | 830 | 6537 | 15.2% |
| GO component terms | 266 | 5686 | 23.1% |
| Anatomy/develop terms | 213 | 5583 | 58.5% |

TOP TEN LISTS

GO process terms:

| development | 2531 |
| growth | 1900 |
| transcription | 1114 |
| biosynthesis | 865 |
| flowering | 697 |
| transduction | 684 |
| transport | 659 |
| signal transduction | 625 |
| germination | 455 |
| metabolism | 425 |

GO function terms:

| binding | 1494 |
| enzyme | 1184 |
| kinase | 637 |
| receptor | 433 |
| beta-glucuronidase | 413 |
| protein kinase | 309 |
| hormone | 302 |
| DNA binding | 299 |
| transcription factor | 269 |
| transporter | 230 |

GO component terms:

| cell | 2487 |
| membrane | 1031 |
| chromosome | 842 |
| chloroplast | 604 |
| plasma membrane | 408 |
| cell wall | 291 |
| plastid | 270 |
| nucleus | 258 |
| intracellular | 236 |
| host | 230 |
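
Counts of this kind could come from matching vocabulary terms against paper text. The Python sketch below assumes simple case-insensitive whole-phrase matching against abstracts; it is not PubSearch's actual matching method, and the `match_terms` helper and the sample abstracts are hypothetical. The percentage column appears to be the share of terms in each vocabulary that matched at least one paper (213 of the 364 anatomy/development terms is 58.5%), and the sketch computes the same kind of ratio.

```python
import re
from collections import defaultdict


def match_terms(terms, abstracts):
    """Map each term to the set of paper ids whose abstract mentions it."""
    hits = defaultdict(set)
    for term in terms:
        # Whole-phrase, case-insensitive match (an assumption of this sketch).
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for paper_id, text in abstracts.items():
            if pattern.search(text):
                hits[term].add(paper_id)
    return hits


# Hypothetical sample data, for illustration only.
terms = ["flowering", "signal transduction", "plastid"]
abstracts = {
    "p1": "Flowering time is controlled by light signal transduction.",
    "p2": "Mutants with delayed flowering were isolated.",
}

hits = match_terms(terms, abstracts)
matched = [t for t in terms if hits[t]]
unique_papers = set().union(*(hits[t] for t in matched))
percent_matched = 100 * len(matched) / len(terms)

print(len(matched), len(unique_papers), round(percent_matched, 1))
```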


PubSearch Annotation User Interface


Expanding Data Objects to Annotate

• microarray experiments

• RNA samples

• array elements (ESTs, oligos, PCR products)

• alleles (natural variant & mutant forms)

• germplasm (ecotypes & mutant lines)


Microarray Data Annotation

• Experiments
  • goals (e.g. GO process)
  • variables (e.g. anatomy, environment, chemical)
    • need a qualifier (e.g. values ontology?)
  • type/category (e.g. methods)

• RNA Samples
  • germplasm
  • biomaterial (e.g. anatomy, devel. stages)
  • external conditions (e.g. methods, environment, chemical)

• Array elements
  • affected by/in XXX (e.g. GO process, anat, dev)
  • induced during/in XXX
  • reduced during/in XXX


Phenotype Annotation

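[Diagram: a phenotype annotation combines Germplasm, Trait (Anatomy, Biol. Process, etc…), TraitValue, Methodology, Condition (environment, chemical), and ConditionValue]

Example:

• Germplasm: hy4-1 mutant line

• Trait: height (Anatomy: hypocotyl)

• TraitValue: long

• Methodology: measure with ruler

• Condition: light

• ConditionValue: absence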
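
The components of the phenotype diagram can be read as the fields of a single record. The Python sketch below is one possible representation, not TAIR's data model; the `Condition` and `PhenotypeAnnotation` classes and their field names are hypothetical, and the values are the hy4-1 example from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    factor: str  # environment or chemical term, e.g. "light"
    value: str   # ConditionValue, e.g. "absence"


@dataclass
class PhenotypeAnnotation:
    germplasm: str
    anatomy: str       # part of the plant the trait refers to
    trait: str
    trait_value: str
    methodology: str
    conditions: list = field(default_factory=list)

    def summary(self):
        """Render the annotation as one readable line."""
        conds = ", ".join(f"{c.factor}: {c.value}" for c in self.conditions)
        return (f"{self.germplasm}: {self.anatomy} {self.trait} is "
                f"{self.trait_value} ({self.methodology}; {conds})")


example = PhenotypeAnnotation(
    germplasm="hy4-1 mutant line",
    anatomy="hypocotyl",
    trait="height",
    trait_value="long",
    methodology="measure with ruler",
    conditions=[Condition("light", "absence")],
)
print(example.summary())
```

Keeping the condition and its value separate from the trait and its value lets the same germplasm carry several annotations that differ only in the conditions under which the trait was scored.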


Expanding Types of Annotations

By using more relationship types rather than more terms in an ontology.

For example:

• Gene to gene family

– Relationship type: is a member of

• Molecular interactions

– Relationship types: represses, activates, binds to

• Genetic interactions

– Relationship types: suppresses, enhances


A model of control of flowering in Arabidopsis

From “Molecular Genetics of Plant Development”


Generating the image from the database

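| Gene1 | Relationship | Gene2 |
| ELF3 | Represses | GA1 |
| ELF3 | Represses | SPY |
| ELF3 | Activates | EMF1 |

[Diagram: each relationship type (Represses, Activates) is drawn with its own arrow symbol, and the graph is built up row by row: ELF3 to GA1; then ELF3 to GA1 and SPY; then ELF3 to GA1, SPY, and EMF1]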
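
Turning relationship rows into a drawing can be sketched by emitting Graphviz DOT text, with one edge per row and an arrow style chosen by relationship type. This is an illustration only; the slides do not say which tool generates the image. The `to_dot` helper and the choice of a blunt "tee" arrowhead for repression and an ordinary arrowhead for activation are assumptions of this sketch.

```python
# Rows as shown on the slide: (Gene1, Relationship, Gene2).
rows = [
    ("ELF3", "Represses", "GA1"),
    ("ELF3", "Represses", "SPY"),
    ("ELF3", "Activates", "EMF1"),
]

# Assumed drawing convention: blunt end for repression, arrow for activation.
ARROWHEADS = {"Represses": "tee", "Activates": "normal"}


def to_dot(rows):
    """Build Graphviz DOT source with one directed edge per relationship row."""
    lines = ["digraph interactions {"]
    for gene1, relationship, gene2 in rows:
        head = ARROWHEADS[relationship]
        lines.append(f'  "{gene1}" -> "{gene2}" [arrowhead={head}];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(rows)
print(dot)
```

The resulting text can be passed to a DOT renderer. Adding a new relationship type then only requires a new entry in the style table, not a new kind of term.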


Genetic Interaction / Transcriptional Regulation Pathways


Acknowledgements

• Leonore Reiser
• Tanya Berardini
• Suparna Mundodi
• Margarita Garcia-Hernandez
• Eva Huala
• Lukas Mueller
• Peifen Zhang
• Aisling Doyle, J. Yoon, Gabe Lander
• Danny Yoo, Iris Xu
• Jonathan Clarke (John Innes Institute)
• GO, TIGR, Monsanto, MaizeDB, Gramene, SRI International


Where to get our stuff

• ontologies and annotations (ftp site)
  • ftp://ftp.arabidopsis.org/home/tair/Ontologies/

• annotations (search & download)
  • http://www.arabidopsis.org/info/ontologies/

• literature curation software, PubSearch (download)
  • http://www.gmod.org/


Sources of Vocabularies

• Literature
  • primary research articles (~12,000)
  • textbooks (~10)
  • protocols (~150)
  • web sites and databases (~50)

• Community
  • individual database submission (e.g. research interest)
  • collaboration (e.g. JIC, MaizeDB, Gramene)
  • bulk contribution (e.g. Monsanto)