Analyzing digital gene expression data in Galaxy Supervisors: Peter-Bram A.C. ’t Hoen Kostas...

Post on 19-Jan-2016


Analyzing digital gene expression data in Galaxy

Supervisors:

Peter-Bram A.C. ’t Hoen

Kostas Karasavvas

Students:

Ilya Kurochkin

Ivan Rusinov

Galaxy

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

Adding a new tool in Galaxy

To add a new tool to Galaxy you need:

• A tool definition file in XML format

• The tool script

...
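As a rough illustration of the second item, a tool script is an ordinary command-line program whose input and output paths are passed in by Galaxy according to the tool definition file. The sketch below is not one of the tools from this project; the function name `count_records` and the two positional arguments are hypothetical.

```python
import argparse


def count_records(in_path, out_path):
    """Count non-empty lines in in_path and write the total to out_path."""
    n = 0
    with open(in_path) as handle:
        for line in handle:
            if line.strip():
                n += 1
    with open(out_path, "w") as out:
        out.write(f"records\t{n}\n")
    return n


def main(argv=None):
    # The argument order here is an assumption; it has to match the command
    # line declared in the tool definition file.
    parser = argparse.ArgumentParser(description="Hypothetical Galaxy tool script")
    parser.add_argument("input", help="path of the input dataset")
    parser.add_argument("output", help="path where the output dataset is written")
    args = parser.parse_args(argv)
    count_records(args.input, args.output)
```

In a real script `main()` would be called under an `if __name__ == "__main__":` guard, so that Galaxy can run the file directly.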

SAGE

• Sequence and count short tags representative of a transcript

• Absolute abundance of transcripts

Existing pipeline for analyzing DeepSAGE data

GAPSS: General analysis pipeline for second generation sequencers

Implemented in Galaxy

Some final steps were missing:

- Gene annotation (ENSEMBL/BioMart) and summarization

- Statistical analysis of differential gene expression

Existing workflow

Gene annotation and summarization

Tool for counting DeepSAGE tags in ENSEMBL annotated exons.

Tool for automatically obtaining a BioMart format file.

Obtain BioMart format file

Count DeepSAGE tags in annotated exons

Input files:

1) BioMart format file:

2) SAM format file:
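For the SAM input, the counter only needs the chromosome, strand and genomic position of each mapped tag. A minimal sketch of extracting them is given below, relying on the standard SAM columns (FLAG in column 2, RNAME in column 3, POS in column 4). The function name `parse_sam_line` is hypothetical and this is not the project's own parser.

```python
def parse_sam_line(line):
    """Return (chromosome, strand, position) for a mapped read, or None."""
    if line.startswith("@"):            # header line
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:                # SAM has 11 mandatory columns
        return None
    flag = int(fields[1])
    if flag & 0x4:                      # read is unmapped
        return None
    strand = "-" if flag & 0x10 else "+"
    # POS is the 1-based leftmost mapped position of the read.
    return fields[2], strand, int(fields[3])
```

Note that POS is always the leftmost coordinate; for reverse-strand tags a real tool may prefer the other end of the alignment.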

Count DeepSAGE tags in annotated exons

Output file:

Count DeepSAGE tags in annotated exons

1. For each line in the SAM file, the whole BioMart file is read. (~1 second/line)

2. The BioMart file is loaded into a dictionary, with data split by chromosome name and strand. (50 seconds for 10,000 lines)

3. The SAM file is loaded into a dictionary, with data split by chromosome name, strand and genomic position. (16 seconds for 10,000 lines)

4. Support for several SAM files.

5. Both files are loaded into dictionaries. (16 seconds for 10,000 lines; ~16 minutes for 7,768,787 lines)

6. The BioMart dictionary is sorted by exon coordinates; overlapping and repeated exons are a problem.

7. Binary search for each SAM position in the sorted list of exon coordinates was implemented. (77 seconds for 7,768,787 lines)
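The final design (dictionaries keyed by chromosome and strand, sorted exon coordinates, binary search per tag) might look roughly like the sketch below. It is not the tool's actual code. All function names are hypothetical, and merging overlapping or repeated exons into disjoint blocks is just one possible way to handle the problem mentioned in step 6. A tag falling in a merged block is credited to every gene that contributed an exon to it.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exons):
    """exons: iterable of (chrom, strand, start, end, gene_id), inclusive coordinates.

    Returns {(chrom, strand): (starts, blocks)}, where blocks are disjoint,
    sorted (start, end, gene_ids) tuples made by merging overlapping exons.
    """
    grouped = defaultdict(list)
    for chrom, strand, start, end, gene in exons:
        grouped[(chrom, strand)].append((start, end, gene))
    index = {}
    for key, items in grouped.items():
        items.sort()
        merged = []
        for start, end, gene in items:
            if merged and start <= merged[-1][1]:
                last = merged[-1]       # overlaps the previous block: extend it
                merged[-1] = (last[0], max(last[1], end), last[2] | {gene})
            else:
                merged.append((start, end, {gene}))
        index[key] = ([block[0] for block in merged], merged)
    return index


def find_genes(index, chrom, strand, pos):
    """Binary search for the block containing pos; empty set if there is none."""
    starts, blocks = index.get((chrom, strand), ([], []))
    i = bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i >= 0 and pos <= blocks[i][1]:
        return blocks[i][2]
    return set()


def count_tags(index, tags):
    """tags: iterable of (chrom, strand, pos). Returns {gene_id: tag count}."""
    counts = defaultdict(int)
    for chrom, strand, pos in tags:
        for gene in find_genes(index, chrom, strand, pos):
            counts[gene] += 1
    return dict(counts)
```

Each lookup costs O(log n) in the number of exon blocks on that chromosome and strand, instead of a scan over the whole annotation, which is consistent with the speed-up reported between steps 5 and 7.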

About R/Bioconductor

• R is a language and environment for statistical computing and graphics.

• Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Statistical analysis of differential gene expression

Tool for examining differential expression of replicated count data using the edgeR package of Bioconductor

Tool for estimating the variance in count data and testing for differential expression using the DESeq package of Bioconductor

Analysis of differentially expressed genes (edgeR)

Input files:

1. Output file of the DeepSAGE tags in annotated exons counter

2. Metadata file

Design matrix and contrast vector (1, -1, 0)

Generalized linear model
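To make the role of the contrast vector concrete: with one model coefficient per group, a contrast of (1, -1, 0) selects the difference between the first and second coefficients and ignores the third. The sketch below only illustrates that arithmetic in plain Python and does not call edgeR. The group labels and coefficient values are made up for the example.

```python
def design_matrix(groups, levels):
    """One indicator column per level, no intercept (one row per sample)."""
    return [[1 if g == level else 0 for level in levels] for g in groups]


def apply_contrast(contrast, coefficients):
    """Linear combination of fitted coefficients selected by the contrast."""
    return sum(c * b for c, b in zip(contrast, coefficients))


groups = ["A", "A", "B", "B", "C", "C"]   # hypothetical sample groups
levels = ["A", "B", "C"]
X = design_matrix(groups, levels)

beta = [2.0, 0.5, 1.0]                    # made-up log-scale coefficients for one gene
contrast = [1, -1, 0]                     # compare group A with group B
log_fc = apply_contrast(contrast, beta)   # beta_A - beta_B
```

Under these assumptions `log_fc` is the estimated log fold change between groups A and B for that gene, and testing whether it differs from zero is the differential expression test.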

Analysis of differentially expressed genes (edgeR)

Output file:

Analysis of differentially expressed genes (DESeq)

Test for differences between the base means of two levels

Input files:

1. Output file of the DeepSAGE tags in annotated exons counter

2. Metadata file

Create a CountDataSet object

Estimate the effective library size for a CountDataSet

Estimate the variance functions for a CountDataSet
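The effective library size step can be illustrated with the median-of-ratios idea that DESeq uses for its size factors: each sample's counts are compared with the per-gene geometric mean across samples, and the median ratio is taken. The Python sketch below only illustrates that calculation and is not the DESeq implementation (DESeq itself is an R package). The function name and input layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: {gene_id: [count per sample]}. Returns one size factor per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:               # geometric mean would be zero: skip gene
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Median of the ratios, computed on the log scale for numerical stability.
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing each sample's counts by its size factor puts the samples on a common scale before variances are estimated. The sketch raises an error if every gene contains a zero count, a case a real tool would need to handle.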

Analysis of differentially expressed genes (DESeq)

Output file:

Comparison of results obtained by edgeR and DESeq

Full workflow

Thank you for your attention

Any questions?