GOSt a Gene Ontology mining tool Jüri Reimand


GOSt, a Gene Ontology mining tool

Jüri Reimand

Overview

• Introduction, bioinformatics
• Gene Ontology (GO)
• GOSt, a Gene Ontology mining tool
• Statistics and thresholds
• Ordered gene lists
• Extending GO

Introduction

• Bioinformatics
  – Analysis of experimental data
• Genes encode proteins
  – Proteins: building blocks of living organisms
  – Gene expression: protein production from genetic code
• Microarray experiments measure gene expression
  – Thousands of genes simultaneously
  – Expression levels over time
  – Different biological conditions
  – Comparison of healthy and diseased cells

[Figure labels: measures over time; cluster similar profiles]

Introduction

• Biological experiments give large amounts of data
• Groups of similar genes:
  – top “most active” genes
  – similar expression profiles over time
• Many genes have some available annotations
  – Previous knowledge from databases
• How to describe the group as a whole?
  – What are the common features?
  – Which features are significantly overrepresented?

“steroid metabolism”, “biosynthesis”, “iron ion binding”

Gene Ontology (GO)

• GO: Directed Acyclic Graph (DAG)
  – Vertices: terms
  – Edges: relations between general and specific terms
• Hierarchically structured vocabulary
  – 3 DAGs: processes, components, functions
• Annotations to vocabulary terms
  – Association between a gene g and a property t (GO term t)
  – Based on biological discoveries
  – Genes of many genomes are annotated to GO
• Annotation sets: for a fixed organism
  – All genes associated with GO term t

GO example

• Graph fragment with some terms related to organ development
• Vocabulary is general to living organisms
• Gene annotations are organism-specific
• True Path Rule: hierarchical annotations

[Figure labels: ENSG00000163217, ENSG00000161202]
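
The True Path Rule can be illustrated with a small sketch: a gene annotated to a specific term is implicitly annotated to every more general term above it in the DAG. The term names, gene names, and the helpers `ancestors` and `propagate` are hypothetical and chosen only for illustration; this is not GOSt's own code.

```python
from collections import defaultdict

# Hypothetical DAG fragment: each term maps to its more general parents.
PARENTS = {
    "heart development": ["organ development"],
    "kidney development": ["organ development"],
    "organ development": ["development"],
    "development": [],
}

# Hypothetical direct annotations: gene -> terms it is annotated to.
DIRECT = {
    "geneA": ["heart development"],
    "geneB": ["kidney development"],
    "geneC": ["organ development"],
}


def ancestors(term, parents):
    """Return the term together with all terms reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct, parents):
    """Build annotation sets (term -> set of genes) under the True Path Rule."""
    annotation_sets = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors(term, parents):
                annotation_sets[t].add(gene)
    return dict(annotation_sets)


ANNOTATION_SETS = propagate(DIRECT, PARENTS)
```

In this toy fragment the annotation set of "organ development" holds all three genes, while "heart development" holds only `geneA`.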


GOSt – Gene Ontology Statistics

• GO annotations to groups of genes
• Statistical significance of results
• Thresholds for distinguishing significant results
• Analysing ordered lists of genes
• Visualisation methods, WWW interface
• Command line toolset for large-scale analysis

GOSt example

[Figure: GOSt output for a query of 45 mouse genes. Labels: 338 GO; Genes; GO terms; P-value; Evidence codes]

Annotations to gene groups

• Result: term t matches query Q

[Figure: three diagrams of gene sets Gq (Query) and Gt (GO Term, e.g. heart development)]

Statistical significance

• Is intersection Q∩T significant?
• Fisher's one-tailed test
  – Cumulative hypergeometric probability
  – Get observed or more genes in intersection Q∩T
  – P ( pick k or more white balls out of K white and N-K black balls )
• Multiple testing
  – Every query results in a number of p-values
  – Matching GO terms are not independent
  – Increased rate of false positive matches
• Which p-values are significant?
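
The cumulative hypergeometric probability can be written out in a few lines. This is a minimal sketch using only the standard library; the function name `hypergeom_tail`, the parameter `n` for the query size, and the toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): at least k annotated genes in a query of n genes,
    when K of the N genes of the organism are annotated to the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers (not real data): 10 genes, 4 annotated, query of 3, 2 matches.
P_EXAMPLE = hypergeom_tail(k=2, n=3, K=4, N=10)
```

With these toy numbers the tail covers the cases of 2 and 3 matches, 36 + 4 = 40 of the 120 possible queries, so the p-value is 1/3.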

Experimental thresholds

• Simulation experiment
  – Fix some gene query size k
  – Repeat 1000 times:
    • Generate synthetic query Q with k elements: random subset of organism's genes
    • Observe best p-value p for query Q
    • Store p-value, p --> P
  – Choose p', 50th smallest p-value from P
  – Threshold p': top 5% of p-values for random queries of size k
• Calculate for query lengths k = [1,1000]
• Compare with standard multiple testing corrections
  – Bonferroni (1936), Benjamini-Hochberg (1995)
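
The simulation procedure above can be sketched as follows. The vocabulary here is random toy data, "best p-value" is taken to mean the smallest hypergeometric p-value over all terms that overlap the query, and all names (`best_p_value`, `simulated_threshold`, `GENES`, `TERMS`) are hypothetical; how GOSt implements this step is not shown in the slides.

```python
import random
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a query of n genes, K of N genes annotated to the term."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return total / comb(N, n)


def best_p_value(query, annotation_sets, n_genes):
    """Smallest p-value over all terms sharing at least one gene with the query."""
    best = 1.0
    for term_genes in annotation_sets.values():
        overlap = len(query & term_genes)
        if overlap:
            p = hypergeom_tail(overlap, len(query), len(term_genes), n_genes)
            best = min(best, p)
    return best


def simulated_threshold(k, genes, annotation_sets, repeats=1000, alpha=0.05, seed=0):
    """Best p-value reached by only a fraction alpha of random queries of size k."""
    rng = random.Random(seed)
    best = sorted(
        best_p_value(set(rng.sample(genes, k)), annotation_sets, len(genes))
        for _ in range(repeats)
    )
    rank = max(1, round(alpha * repeats))  # 50 when repeats=1000, alpha=0.05
    return best[rank - 1]


# Toy organism with 200 genes and five random annotation sets (not real data).
_rng = random.Random(1)
GENES = [f"g{i}" for i in range(200)]
TERMS = {
    f"term{j}": set(_rng.sample(GENES, size))
    for j, size in enumerate([5, 10, 20, 40, 80])
}
THRESHOLD = simulated_threshold(10, GENES, TERMS, repeats=200)
```

A fixed seed keeps the sketch reproducible; repeating the call for each query length k gives one threshold per length.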

Analytical thresholds

• Analytical approach to simulated thresholds
  – Fix gene query size k
  – Observe all sizes and frequencies of GO annotation sets T
  – Presume events with different T independent
  – Observe possible p-values p with query of k elements
  – Always correct p by constant c=0.97 (set dependencies!)
  – Find such threshold p', that gives p ~= 0.95
• Repeat for query lengths k = [1,1000]
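
One possible reading of the independence assumption is a product over annotation sets. The formula below is a sketch of that reading and not necessarily the authors' exact derivation. The notation $p_T(Q)$, the p-value of a random query $Q$ of size $k$ against annotation set $T$, is introduced here for illustration, and the constant $c = 0.97$ is left out because the slide does not state where it enters.

```latex
\[
  \Pr\Bigl[\min_{T} p_T(Q) > p'\Bigr]
  \;=\; \prod_{T} \Bigl(1 - \Pr\bigl[p_T(Q) \le p'\bigr]\Bigr)
  \;\approx\; 0.95
\]
```

Under this reading, p' is the threshold that a random query beats on at least one term in about 5% of cases.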

Significance thresholds


Ordered lists of genes

• Gene groups may be ordered
  – Interesting gene and few most similar genes
  – Top “most active” genes
  – Increasing distance from cluster centre
• Top of the list, but how many?
  – Compare list with GO term
  – Which portion gives best p-value?
  – Peak significance of ordered query

GOSt algorithms

• Unordered query
  – Intersections with all annotation sets T
• Exhaustive algorithm for ordered queries:
  – intersections with all Qi (the first i genes of the query) and annotation sets T
• Approximate algorithm for ordered queries:
  – for every annotation set T, view only list portions that give local p-value extremes
    • local best p: list ends with matching gene
    • local worst p: list ends just before matching gene
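
The two strategies for ordered queries can be compared on a single annotation set. This is a sketch under assumptions: Qi is read as the first i genes of the list, only the "local best" positions are evaluated in the shortcut, and the names `exhaustive_peak`, `match_only_peak`, `QUERY` and `TERM` are hypothetical. The algorithm in GOSt may differ in detail.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a query of n genes, K of N genes annotated to the term."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return total / comb(N, n)


def exhaustive_peak(ordered_query, term_genes, n_genes):
    """Evaluate every prefix Qi; return (best p-value, prefix length)."""
    best = (1.0, 0)
    matches = 0
    for i, gene in enumerate(ordered_query, start=1):
        matches += gene in term_genes
        p = hypergeom_tail(matches, i, len(term_genes), n_genes)
        best = min(best, (p, i))
    return best


def match_only_peak(ordered_query, term_genes, n_genes):
    """Evaluate only prefixes that end with a matching gene (local best p)."""
    best = (1.0, 0)
    matches = 0
    for i, gene in enumerate(ordered_query, start=1):
        if gene in term_genes:
            matches += 1
            p = hypergeom_tail(matches, i, len(term_genes), n_genes)
            best = min(best, (p, i))
    return best


# Toy data (not real genes): 20 genes in total, 4 annotated to the term.
TERM = {"g0", "g1", "g2", "g15"}
QUERY = ["g0", "g1", "g5", "g2", "g6", "g7", "g8", "g9", "g15"]
PEAK = exhaustive_peak(QUERY, TERM, 20)
```

For a single term, appending a non-matching gene cannot improve the p-value, so both functions find the same peak. In this toy list it is at the first 4 genes.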

Example: Ordered list analysis

[Figure: p-value plotted against query length; peak significance at ordered list of 28 genes]

[Figure: ordered list query output with the list of genes and matches for “Biosynthesis of steroids”. Columns: Genes; GO categories; P-value; Evidence codes]

Algorithm speed comparison

24 sec

2.8 sec

GOSt features

• Command line interface (C/C++ and Perl)
• Graphical user interface in web: http://bioinf.ebc.ee/GOST
  – SWOG (Graphics language, Jaanus Hansen 2005)
• Data for multiple organisms
  – yeast, chicken, cow, mouse, rat, human...
• Wrappers for parallel applications (GRID, MPI)
• Pipelines for gene expression data analysis

Extending GO ( i )

• Pathway: a network of interacting genes and proteins
  – metabolism pathways, disease pathways, ..
• Include pathway data in GO vocabulary
  – KEGG Pathway database
  – pathways as vocabulary terms
  – related genes as annotations to terms
• KEGG terms independent of GO vocabulary

[Figure: term listing, labelled “GO”]
GO:0003674 molecular_function
GO:0005575 cellular_component
GO:0008150 biological_process
KEGG:00000 KEGG pathways
KEGG:05010 - Alzheimer's disease
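
A combined vocabulary of this kind can be sketched as a plain dictionary. The term IDs are the ones listed on the slide. Placing KEGG:05010 directly under KEGG:00000 is an assumption, and the gene names and the helpers `namespace` and `annotation_set` are hypothetical placeholders.

```python
# Root terms as listed on the slide; KEGG sits beside the three GO roots.
VOCABULARY = {
    "GO:0003674": {"name": "molecular_function", "parents": []},
    "GO:0005575": {"name": "cellular_component", "parents": []},
    "GO:0008150": {"name": "biological_process", "parents": []},
    "KEGG:00000": {"name": "KEGG pathways", "parents": []},
    # Assumed parent link, for illustration only.
    "KEGG:05010": {"name": "Alzheimer's disease", "parents": ["KEGG:00000"]},
}

# Hypothetical annotations: gene names are placeholders, not real data.
ANNOTATIONS = {
    "KEGG:05010": {"geneA", "geneB"},
}


def namespace(term_id):
    """Source vocabulary of a term, taken from the ID prefix."""
    return term_id.split(":", 1)[0]


def annotation_set(term_id, vocabulary, annotations):
    """Genes annotated to the term or to any of its descendants."""
    genes = set(annotations.get(term_id, ()))
    for child, info in vocabulary.items():
        if term_id in info["parents"]:
            genes |= annotation_set(child, vocabulary, annotations)
    return genes
```

Because pathway terms look like any other vocabulary term, the same significance test can be applied to them without special handling.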

Extending GO ( ii )

• Gene expression started by transcription factors (TF)
• TFs bind to certain patterns in DNA
  – Transcription Factor Binding Sites (TFBS)
  – Often found in regions close to gene (1k bp)
• Include TFBS data from TRANSFAC
  – Patterns (putative TFBS) as vocabulary terms
  – annotations to genes near patterns

ATATAATAAAGATGAGGCGAATATAAATATACCGGCCCTTAGCGCGAAGCAATTCATCATATAAGCGAGAGAGGCCAATATGCAATCTTCGACAGCAT

[Figure labels: gene; TF binding site; Transcription factor]

TRANSFAC motifs

• Motifs added in a hierarchy
  – according to PWM score
  – 5 levels:
    • near_threshold
    • ...
    • near_MAX_score
• Work in progress
  – Hedi Peterson

[Figure: term listing, labelled “GO”]
GO:0003674 molecular_function
GO:0005575 cellular_component
GO:0008150 biological_process
KEGG:00000 KEGG pathways
TF:M00000 TRANSFAC motifs

[Figure: motif terms, labelled “depth in hierarchy”]
TF:M00431_4 TTTSGCGS:4
TF:M00431_3 TTTSGCGS:3
TF:M00431_2 TTTSGCGS:2
TF:M00431_1 TTTSGCGS:1
TF:M00431_0 TTTSGCGS:0
TF:M00328_4 NCNNTNNTGCRTGANNNN:4
TF:M00328_3 NCNNTNNTGCRTGANNNN:3
TF:M00328_2 NCNNTNNTGCRTGANNNN:2

Summary

• We investigated means for finding GO annotations to groups of genes, and statistical methods for determining significance of results.
• We combined GO vocabulary with various types of biological data, such as KEGG pathways and TRANSFAC regulatory elements.
• We proposed analytical thresholds for distinguishing significant results from structured and partly dependent GO annotations, and verified thresholds with simulation experiments.
• We proposed a novel concept of analyzing GO annotations for ordered lists of genes, and implemented fast algorithms for the purpose.
• The practical result of our work is GOSt, a GO mining tool. Command line interface is suitable for large-scale automatic analysis, while graphical web interface enables highly visualized and interactive analysis.

Sneak preview

• GO analysis of hierarchical clustering tree
  – Cluster genes according to expression similarity and ..
  – .. “Wrap up” nodes that show no significant annotations in GO
• Work in progress
  – Meelis Kull
  – Darja Krushevskaja

Acknowledgments

Jaak Vilo

BIIT group
Hedi Peterson, Raivo Kolde, Meelis Kull, Konstantin Tretjakov, Jaanus Hansen, Pavlos Pavlidis, Priit Adler, Asko Tiidumaa, Ilja Livenson, Darja Krushevskaja

FunGenES Consortium