Microarray data analysis with Chipster 22.9.2008

Jarno Tuimala

Program – an analysis workflow

• Basic functionality of Chipster
• Data import
• Quality control
• Normalization
• Describing the experiment
• Filtering and missing value considerations
• Statistical testing
• Clustering and visualization
• Annotation

Introduction to Chipster

Chipster

Goal: Easy access to leading analysis tools, such as those developed in the R/Bioconductor project

Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses

http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
http://chipster.csc.fi

How does it work?

[Figure: architecture diagram. Labels: desktop client (Java Web Start installs and updates the client automatically); internet, SSL; CSC front end, security; analyser on Corona/Murska; data, tools, visualization; analysis and visualisation; international Web Services via SOAP]

Phenodata – describing your experiment

Phenodata file is created during normalization

Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work

If you bring in previously created normalized data and phenodata:
• Choose "import directly" in the import tool
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"

If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"
• Fill in the group column

Visualizing the data

Data visualization panel
• Maximize and redraw for better viewing

Two types of visualizations

1. Interactive visualizations produced by the client program
• Select the visualization method from the pulldown menu of the data visualization panel
• Save by right clicking on the image

2. Static images produced by R/Bioconductor, Weeder, etc.
• Select from Analysis tools/ Visualisation
• View by double clicking on the image file
• Save by right clicking on the file name and choosing "Export"

Interactive visualizations by the client

• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image
• Venn diagram

Available actions:
• Change titles, colors etc.
• Zoom in/out

Static images produced by R/Bioconductor

• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering

Automatic tracking of analysis history

Running many analyses simultaneously

You can have at most 5 analysis jobs running at the same time

Use Task manager to
• view parameters, status,…
• cancel jobs

Workspace – continue later/elsewhere

Saving your workspace allows you to continue later
• File/ Save workspace
• File/ Load workspace

Currently it is possible to have only one workspace saved at a time

If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location on that computer, for example
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot

Importing files

Affymetrix CEL-files are imported to Chipster automatically

Other files are imported using the Import tool

Import tool, step 1

Define
• Header
• Footer
• Title row
• Delimiter

Import tool, step 2

• Define columns
• Modify flags

Importing Agilent files (required fields)

• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)
• Identifier (ProbeName)
• Annotation (ControlType)
• Flag (IsManualFlag)

https://extras.csc.fi/biosciences/chipster-manual/data-formats.html

Quality control

Quality control tools

Quality control tools
• Affymetrix basic: RNA degradation + Affy QC
• Agilent: MA-plot + density plot + boxplot

Visualization: dendrogram

Statistics: NMDS

Affymetrix I

Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS are run on normalized data

Agilent

General QC – dendrogram and NMDS

Scatterplots

Heatmaps (this took an hour to calculate)

QC-tools in Chipster

Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color

Visualization
• Dendrogram
• Heatmap
• Correlogram

Statistics
• NMDS

Normalization

What is normalization?

Normalization is the process of removing systematic variation from the data.

Typically you would normalize your data so that all the chips become comparable.

Methods

Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM (perfect match) probes, fits a model to them, and gives out expression values after quantile normalization and median polishing

Agilent
• Background correction + averaging duplicate spots + normalization

After normalization the expression values are always expressed on log2-scale
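As a rough illustration of the quantile normalization step mentioned above, the sketch below forces every chip to share the same intensity distribution and then moves to the log2 scale. It is a minimal standard-library sketch, not the RMA implementation used by Chipster or Bioconductor: the function name `quantile_normalize`, the list-of-lists data layout and the intensity values are all made up for illustration, ties are not handled, and background correction and median polishing are left out.

```python
import math

def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    chips: list of chips, each a list of positive intensities for the same
    probes in the same order (an assumed layout, not Chipster's data format).
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    rank_means = [sum(s[r] for s in sorted_chips) / len(chips)
                  for r in range(n_probes)]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

# Three hypothetical chips measuring the same four probes.
raw = [
    [100.0, 400.0, 200.0, 800.0],
    [150.0, 300.0, 1200.0, 600.0],
    [50.0, 100.0, 25.0, 200.0],
]
normalized = quantile_normalize(raw)
# Expression values are reported on the log2 scale.
log2_values = [[math.log2(v) for v in chip] for chip in normalized]
```

After this step each chip contains exactly the same set of values; only the assignment of values to probes differs, following each probe's rank on its own chip.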

Affymetrix

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC% content into account
• Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all others output log2-transformed data by default (and thus are already corrected for the same phenomenon)

Custom chiptype
• If you want to use reannotated probes (probes reassigned to the genes where they really belong), select one from this menu.

Agilent I

Background correction
• Background treatment: None, Subtract, Edwards, Normexp
• Background offset: 0 or 50

Normalize chips
• None, median, loess

Normalize genes (not typically used)
• None, scale (to median), quantile

Chiptype
• A must setting!

Agilent II

Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• Usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)

Loess removes curvature from the data (suggested)
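The missing-value problem described above can be shown with a few made-up spot intensities. This sketch does not implement normexp (a model-based correction); it only shows why plain subtraction leaves values that cannot be log2-transformed, and how a positive offset moves values away from zero. The helper `log2_or_missing`, the intensities, and the choice to add the offset after plain subtraction are assumptions made for illustration.

```python
import math

def log2_or_missing(value):
    """Return log2(value), or None (missing) when the value is not positive."""
    return math.log2(value) if value > 0 else None

# Hypothetical foreground and background intensities for four spots.
foreground = [120.0, 95.0, 400.0, 60.0]
background = [100.0, 110.0, 90.0, 60.0]

# Plain subtraction: spots at or below background become zero or negative.
subtracted = [f - b for f, b in zip(foreground, background)]
log_subtract = [log2_or_missing(v) for v in subtracted]

# Adding an offset of 50 before the log keeps these spots usable.
# (A spot more than 50 units below background would still end up missing.)
offset = 50.0
log_offset = [log2_or_missing(v + offset) for v in subtracted]
```

Here two of the four spots become missing after plain subtraction, while all four keep a value once the offset is added.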

Checking normalization

Filtering

Gene filtering

Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)

Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…

Non-specific filtering
• Expression, flags, SD, …

Specific filtering
• Statistical testing

Non-specific filtering

Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal

Typically, non-changing genes are also removed. These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression

Specific filtering

Selecting genes that are associated with some phenotype

Typically involves statistical testing

Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
• Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)

Unspecific filtering in Chipster

Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)

Utilities
1. Calculate descriptive statistics
2. Filter using a column
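The SD and IQR filters listed above can be sketched with the Python standard library. This is an illustration under assumptions, not Chipster's code: it assumes that "percentage of genes to filter out" means dropping the genes with the lowest standard deviation, and the names `filter_by_sd` and `iqr`, the gene names and the expression values are hypothetical.

```python
import statistics

def filter_by_sd(expression, percent_to_remove):
    """Drop the given percentage of genes with the lowest SD.

    expression: dict mapping gene name -> list of log2 values, one per chip.
    """
    sds = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sds, key=sds.get)  # least variable genes first
    n_remove = int(len(ranked) * percent_to_remove / 100)
    keep = set(ranked[n_remove:])
    return {gene: values for gene, values in expression.items() if gene in keep}

def iqr(values):
    """Interquartile range: distance between the 1st and 3rd quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical log2 expression values on four chips.
expression = {
    "geneA": [7.0, 7.1, 7.0, 7.1],   # almost constant
    "geneB": [5.0, 5.2, 4.9, 5.1],   # almost constant
    "geneC": [6.0, 8.0, 6.2, 8.1],   # changing
    "geneD": [9.0, 4.0, 9.5, 4.2],   # changing
}
filtered = filter_by_sd(expression, percent_to_remove=50)
```

With these values the two nearly constant genes are removed and geneC and geneD remain; an IQR-based filter would rank the genes in a similar way.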

Venn diagram

• Select three datasets in Chipster
• Run the Venn diagram tool from the Visualization tool category

[Figure: Venn diagram comparing gene lists filtered by SD, CV and IQR]

Statistics

Some terminology

Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice this is never done.

Parametric tests (assume data normally distributed)
• Typically used for microarray data

Non-parametric tests (assume no normality)

P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate (FDR) is a p-value corrected for multiple tests (more on this later)

Mean and variance, an example for 1 gene

[Figure: two density plots for one gene, density.default(x = x1) and density.default(x = y1), each with N = 100000. The x-axes run from -6 to 6 and from -10 to 10; the y-axis (Density) runs from 0.0 to 0.4 in both.]

Statistical testing

Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.

Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)

t-test

Compares means of two groups
• If the p-value is small (<0.05), the data give evidence of a difference between the groups.
• If the p-value is large (>0.05), no difference between the groups could be shown (this does not prove that the groups are equal).
• p-value is the risk of saying that there is a difference when there actually isn’t.

A test for every gene is run separately -> thousands of tests and p-values

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
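The t statistic on this slide (difference of the two group means divided by its standard error) can be computed for one gene as in the sketch below. The standard error is taken in the unequal-variance (Welch) form, which is an assumption; the slide does not say which variant Chipster uses. The function name and the expression values are hypothetical, and turning t into a p-value needs the t distribution, which is not part of the Python standard library and is left out here.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Two-sample t statistic: difference of means divided by its SE."""
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Welch-style standard error: each group keeps its own variance.
    se = math.sqrt(statistics.variance(group1) / len(group1)
                   + statistics.variance(group2) / len(group2))
    return mean_diff / se

# Hypothetical log2 expression values of one gene, three chips per group.
cancer = [9.0, 9.3, 8.7]
control = [7.0, 7.2, 6.8]

t = t_statistic(cancer, control)
# On the log2 scale the fold change is simply the difference of the means.
log2_fold_change = statistics.mean(cancer) - statistics.mean(control)
```

The same difference of means gives a large t when the replicates agree closely and a small t when they scatter, which is why fold change and the test statistic tell slightly different stories.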

ANOVA

A generalization of t-test.

Compares means of several groups.

Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)

A test for every gene is run separately -> thousands of tests and p-values
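For a single gene, the one-way ANOVA test statistic compares the variation between the group means with the variation inside the groups. The sketch below computes that F statistic with the standard library; the function name and the values are hypothetical, and converting F into a p-value (which needs the F distribution) is left out.

```python
import statistics

def anova_f(groups):
    """One-way ANOVA F statistic for one gene.

    groups: list of groups, each a list of log2 expression values.
    """
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of chips
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the chips around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log2 values of one gene in three groups of three chips.
groups = [[6.0, 7.0, 8.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
f_value = anova_f(groups)
```

A large F says that at least one group mean differs, but, as noted above, not which one.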

Multiple testing correction I

After getting the results for all the genes, p-values are adjusted for the number of tests conducted.

When making several comparisons using the same test, some of the results will be chance findings.

• Example: if the p-value threshold is 0.05, one in every 20 tests of a gene that does not actually change is expected to come out significant by chance alone. If 10000 genes were tested, up to 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (about 500) would be false positives, and only a minority (about 50) would be true positives.

This can be corrected for (to some extent) by using a multiple testing correction.

• Benjamini and Hochberg FDR: If FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.

• Thus, the FDR threshold can be set much higher than a conventional p-value threshold, and the results can still be meaningful and worth investigating.
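The Benjamini and Hochberg correction mentioned above can be written in a few lines. The sketch below follows the usual step-up definition (each p-value is multiplied by the number of tests divided by its rank, then made monotone from the largest p-value downwards); the function name and the example p-values are made up, and this is not Chipster's own code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Gene indices ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that an adjusted
    # value never exceeds the adjusted value of a less significant gene.
    for rank in range(m, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * m / rank)
        adjusted[gene] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
fdr = benjamini_hochberg(raw_p)
```

With these five p-values the adjusted values are 0.005, 0.02, 0.05125, 0.05125 and 0.20: two genes that pass p < 0.05 no longer pass an FDR cut-off of 0.05, but the order of the genes is unchanged.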

Multiple testing correction II

The ranking of the genes does not change after multiple testing correction!

• If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.

• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.

Gene set test (”global test”)

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Are pathways associated with our endpoints of interest?

• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?

Works on the expression values data.

Gene enrichment analysis

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.

Works on the gene list.
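One common way to test whether a gene list is enriched in a functional category is the hypergeometric test, sketched below. The slides do not state which test the Chipster tool uses, so treating enrichment as a hypergeometric tail probability is an assumption; the function name, parameter names and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(n_genes_total, n_in_category, n_selected,
                       n_selected_in_category):
    """Probability of seeing at least this many category genes by chance.

    Hypergeometric upper tail: n_selected genes are drawn without
    replacement from n_genes_total genes, n_in_category of which belong
    to the functional category.
    """
    upper = min(n_in_category, n_selected)
    total = comb(n_genes_total, n_selected)
    tail = sum(comb(n_in_category, k)
               * comb(n_genes_total - n_in_category, n_selected - k)
               for k in range(n_selected_in_category, upper + 1))
    return tail / total

# Hypothetical counts: 20 genes on the chip, 5 of them in the category,
# 4 differentially expressed genes, 3 of which are in the category.
p = enrichment_p_value(20, 5, 4, 3)
```

Since one such test is run per functional category, the resulting p-values are also subject to multiple testing.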

Statistical tests in Chipster

Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling

Visualization
• Volcano plot

Clustering

Clustering methods

Hierarchical clustering

Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps

Classification / class prediction
• K-nearest neighbor (KNN)

Hierarchical clustering

Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
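The two phases can be sketched for a handful of genes as follows: a distance function is chosen first, and average linkage then repeatedly merges the two clusters with the smallest mean pairwise distance. This is a small illustrative implementation with made-up profiles and function names, not the algorithm code used by Chipster; turning correlation into a distance as 1 - r is an assumption.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                     * sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / norm

def average_linkage(profiles, distance):
    """Return the merges as (cluster, cluster, distance), in merge order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [distance(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical expression profiles over three chips.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.1],
    "g3": [3.0, 2.0, 1.0],     # opposite trend to g1
    "g4": [10.0, 20.0, 30.0],  # same trend as g1, much higher level
}
merges_euclid = average_linkage(profiles, euclidean)
merges_pearson = average_linkage(profiles, pearson_distance)
```

With Euclidean distance g4 joins last because of its high expression level; with Pearson distance g4 groups with g1 and g2 because the profiles have the same shape, and g3 joins last.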

Average linkage example

Hierarchical clustering - heatmap

Annotation

Annotation

Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.

Retrieved from multiple biological databases and stored as a single database in Chipster (generated by the Bioconductor project).

Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots)

• These tools don’t work for those chiptypes which don’t have Bioconductor annotation packages

Alternative CDF environments for Affy

CDF is a file that links individual probes to their location in genes (probesets)

Affymetrix default annotation uses old CDF files that map a sizable number of probes to wrong genes

Alternative CDFs fix this problem

In Chipster
• selecting "custom chiptype" in Affymetrix normalization takes altCDFs into use
• Note: if you have normalized using a custom chiptype, certain tools requiring annotation won’t work (GO term enrichment, promoter analysis, annotation)

Dai et al. (2005) Nuc Acids Res, 33(20):e175
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
