Ontology application and use at the ENCODE DCC


Page 1: Ontology application and use at the encode dcc

Ontology application and use at the ENCODE DCC

Venkat Malladi, Data Wrangler, ENCODE DCC, Department of Genetics, Stanford University School of Medicine

Venkat Malladi ENCODE DCC

Page 2: Ontology application and use at the encode dcc

Overview

• Intro to ENCODE and the DCC

• Metadata model

• Ontologies

• Search

• Future directions

Page 3: Ontology application and use at the encode dcc

What is ENCODE?

Figure modified from PLoS Biol 9:e1001046, 2011 (M. Pazin)

Approximately 30 different assays

Page 4: Ontology application and use at the encode dcc

Role of the Data Coordination Center

Diagram: production labs and analysis groups submit data files and metadata to the DCC; the scientific community accesses them through the ENCODE portal (DCC), the Genome Browser, and integrative websites.

Role: Data generation
Tasks: Perform assays; perform analyses; validate data; submit data files; submit metadata

Role: Data organization
Tasks: Data processing & validation; data file storage; metadata curation

Role: Data access
Tasks: Web-based searches; data downloads

Page 5: Ontology application and use at the encode dcc

Challenge: Find common biosamples from data generated by two consortia

356 terms: http://encodeproject.org/ENCODE/cellTypes.html

314 terms: GEO characteristics (common_name, tissue_type, cell_type, lines)

Projects are internally consistent …

Page 6: Ontology application and use at the encode dcc

Simple text match

360 terms: Cell type

314 terms: GEO

… but only 3 biosample names match exactly between projects: IMR90, PBMC, Th17
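The simple text match above can be sketched as a set intersection. This is a minimal illustration, not the DCC's actual procedure: the function name `exact_matches` is hypothetical, and the two short lists are illustrative subsets built from names that appear on these slides (which project each non-matching name belongs to is an assumption).

```python
def exact_matches(terms_a, terms_b):
    """Return the names that appear verbatim in both term lists."""
    return sorted(set(terms_a) & set(terms_b))


# Illustrative subsets only; the real lists hold several hundred terms each.
encode_terms = ["IMR90", "PBMC", "Th17", "HCF", "HCM"]
geo_terms = ["IMR90", "PBMC", "Th17", "Fetal Heart", "Right Atrium"]

print(exact_matches(encode_terms, geo_terms))  # ['IMR90', 'PBMC', 'Th17']
```

Exact matching is sensitive to case, spacing, and punctuation, so two free-text names for the same biosample will not match unless they are spelled identically.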

Page 7: Ontology application and use at the encode dcc

Metadata annotation using Ontologies

Page 8: Ontology application and use at the encode dcc

An ontology is a set of words and relationships. All relationships must be true.

Diagram: the terms cell, nucleus, mitochondrion, chromosome, and mitochondrial chromosome, connected from child term to parent term by part_of and is_a relationships. One part_of relationship is marked with an X (not true).

Page 9: Ontology application and use at the encode dcc

An ontology is a set of words and relationships. Relationships must be true because inferences can be based upon them.

Diagram: the same terms and relationships as on the previous slide, with two additional inferred part_of relationships, one labeled True and one labeled False (marked with an X).

http://www.geneontology.org/GO.ontology.relations.shtml
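Why a single untrue relationship matters can be sketched with a small inference routine. This is a minimal sketch under stated assumptions: the edge list is one reading of the flattened diagram (it follows the example in the Gene Ontology relations documentation), and the names `TRUE_EDGES`, `FALSE_EDGE`, and `inferred_part_of` are hypothetical. The rule applied is that any chain of is_a and part_of steps containing at least one part_of step yields an inferred part_of.

```python
from collections import deque

# Assumed reading of the diagram: (child term, relationship, parent term).
TRUE_EDGES = frozenset({
    ("nucleus", "part_of", "cell"),
    ("mitochondrion", "part_of", "cell"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
})

# The relationship assumed to be the one marked X: not every chromosome
# is part of a nucleus.
FALSE_EDGE = ("chromosome", "part_of", "nucleus")


def inferred_part_of(term, edges):
    """Return every term that `term` is inferred to be part_of."""
    found = set()
    seen = {(term, False)}
    queue = deque(seen)
    while queue:
        node, crossed_part_of = queue.popleft()
        for child, relation, parent in edges:
            if child != node:
                continue
            # A path counts as part_of once it has used a part_of step.
            state = (parent, crossed_part_of or relation == "part_of")
            if state[1]:
                found.add(parent)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return found


print(inferred_part_of("mitochondrial chromosome", TRUE_EDGES))
print(inferred_part_of("mitochondrial chromosome", TRUE_EDGES | {FALSE_EDGE}))
```

With only the true edges, the mitochondrial chromosome is inferred to be part of the mitochondrion and the cell. Adding the false edge also places it in the nucleus, which is the kind of wrong inference the slide warns about.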

Page 10: Ontology application and use at the encode dcc

Why use ontologies?

Reason 1: A consistent way of describing biological concepts.

Reason 2: Consistent language makes related data easy to identify.

Reason 3: Consistent data analysis: relationships between terms allow flexible grouping while everyone uses the same set of metadata.

Page 11: Ontology application and use at the encode dcc

What metadata is annotated with ontologies?

1. the biological sample serving as input (Biosample)

2. the reagents and conditions applied to the biological input (Treatment)

3. the set of methods and conditions to survey the biological input (Assay)

Page 12: Ontology application and use at the encode dcc

Page 13: Ontology application and use at the encode dcc

Biosample ontologies

1. Uber-anatomy ontology (Uberon) - structure, location, and heterogeneous mixtures of cells

2. Cell Ontology (CL) - primary cells or stem cells

3. Experimental Factor Ontology (EFO) - no directly corresponding anatomical structure or physiological cell type (for example, immortalized cell lines)
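Term identifiers from these three ontologies carry the ontology as a prefix, so the source ontology can be read off the identifier. The sketch below assumes the common `PREFIX:digits` form; the names `ONTOLOGY_BY_PREFIX`, `parse_term_id`, and `ontology_for` are hypothetical, and the example identifiers in the comments are illustrative and should be checked against the current ontology releases.

```python
ONTOLOGY_BY_PREFIX = {
    "UBERON": "Uber-anatomy ontology",
    "CL": "Cell Ontology",
    "EFO": "Experimental Factor Ontology",
}


def parse_term_id(term_id):
    """Split 'PREFIX:0000000' into (PREFIX, local id), or raise ValueError."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not prefix or not local_id.isdigit():
        raise ValueError(f"malformed term ID: {term_id!r}")
    return prefix.upper(), local_id


def ontology_for(term_id):
    """Return the name of the biosample ontology a term ID belongs to."""
    prefix, _ = parse_term_id(term_id)
    if prefix not in ONTOLOGY_BY_PREFIX:
        raise ValueError(f"not a biosample ontology prefix: {prefix!r}")
    return ONTOLOGY_BY_PREFIX[prefix]


print(ontology_for("UBERON:0000948"))  # heart (a tissue)
print(ontology_for("CL:0000746"))      # cardiac muscle cell (a cell type)
print(ontology_for("EFO:0001196"))     # IMR-90 (a cell line)
```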

Page 14: Ontology application and use at the encode dcc

Page 15: Ontology application and use at the encode dcc

Challenge: Find all heart-related tissues?

Heart_OC, HCF, HCFaa, HCM, others?

Fetal Heart, Heart, Right Atrium, Right Ventricle, others?

Page 16: Ontology application and use at the encode dcc

Searching ENCODE metadata

Page 17: Ontology application and use at the encode dcc

Ontology-driven search
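The idea behind an ontology-driven search can be sketched as follows: a query for a parent term also returns biosamples annotated with any of its descendant terms. This is a minimal sketch, not the ENCODE portal's implementation. The `PARENTS` graph and the biosample annotations are hypothetical, use term labels instead of identifiers for readability, and collapse is_a and part_of into a single "parent" relationship.

```python
# Hypothetical ontology slice: term -> direct parent terms.
PARENTS = {
    "right atrium": ["heart"],
    "right ventricle": ["heart"],
    "cardiac muscle cell": ["heart"],
    "fibroblast of lung": ["lung"],
}

# Hypothetical annotations: biosample name -> ontology term.
BIOSAMPLES = {
    "Heart": "heart",
    "Right Atrium": "right atrium",
    "Right Ventricle": "right ventricle",
    "HCM": "cardiac muscle cell",
    "IMR90": "fibroblast of lung",
}


def ancestors(term, parents):
    """Return all direct and indirect parent terms of `term`."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def search(query, biosamples, parents):
    """Return biosamples annotated with `query` or any of its descendants."""
    return sorted(
        name
        for name, term in biosamples.items()
        if term == query or query in ancestors(term, parents)
    )


print(search("heart", BIOSAMPLES, PARENTS))
# ['HCM', 'Heart', 'Right Atrium', 'Right Ventricle']
```

A plain text search for "heart" over the biosample names would find only the first-level match, while the ontology walk also returns samples whose names never mention the word.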

Page 18: Ontology application and use at the encode dcc

Future directions

• Additional ontologies

• Ontology-based data validations

Page 19: Ontology application and use at the encode dcc

Additional ontologies

• Protein Ontology (PRO, http://pir.georgetown.edu/pro/pro.shtml)
  o transforming growth factor beta-1 (human): PR:P01137

• EDAM Ontology (EDAM, http://edamontology.org)
  o FASTQ: format:1930, BAM: format:2572
  o sequence alignment: data:0863

Page 20: Ontology application and use at the encode dcc

Ontology-based validations
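One conceivable ontology-based validation checks that the format annotation on a submitted file agrees with the file itself. The sketch below is an assumption about what such a check could look like, not a description of the DCC's validators: it uses only the two EDAM format identifiers quoted on the previous slide, infers the file type from the extension, and the names `EDAM_FORMAT_IDS` and `validate_file_format` are hypothetical.

```python
# EDAM format identifiers as quoted on the "Additional ontologies" slide.
EDAM_FORMAT_IDS = {"fastq": "format:1930", "bam": "format:2572"}


def validate_file_format(file_name, edam_id):
    """Return a list of error messages; an empty list means the check passed."""
    name = file_name.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # ignore gzip compression when reading the extension
    extension = name.rsplit(".", 1)[-1] if "." in name else ""
    expected = EDAM_FORMAT_IDS.get(extension)
    if expected is None:
        return [f"{file_name!r}: unrecognized file type"]
    if edam_id != expected:
        return [f"{file_name!r}: expected {expected}, got {edam_id}"]
    return []


print(validate_file_format("reads.fastq.gz", "format:1930"))  # []
print(validate_file_format("aligned.bam", "format:1930"))
```

Returning a list of messages instead of raising on the first problem lets a submission report every failed check at once.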

Page 21: Ontology application and use at the encode dcc

Acknowledgments

Data Wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, J. Seth Strattan

Software Engineers: Nikhil Podduturi, Laurence Rowe, Forrest Tanaka

QA, administration, biocuration: Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho

Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz

National Institute of General Medical Sciences of the U.S. National Institutes of Health (GM10331601); U41 grant from the National Human Genome Research Institute at the U.S. National Institutes of Health (HG006992)