Microarray data analysis with Chipster 22.9.2008
Jarno Tuimala
Program – an analysis workflow
• Basic functionality of Chipster
• Data import
• Quality control
• Normalization
  • Describing the experiment
• Filtering and missing value considerations
• Statistical testing
• Clustering and visualization
• Annotation
Introduction to Chipster
Chipster
Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project
Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses
http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf http://chipster.csc.fi
How does it work?
[Diagram of the Chipster architecture. Labels: desktop client (Java Web Start installs and updates the client automatically); internet; SSL; front end; security; analyser; Corona/Murska; CSC; SOAP; international Web Services; Analysis; Visualisation; Data; Tools; Visualization]
Phenodata – describing your experiment
Phenodata file is created during normalization
Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work
If you bring in previously created normalized data and phenodata:
• Choose ”import directly” in the import tool
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”
If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”
• Fill in the group column
Visualizing the data
Data visualization panel
• Maximize and redraw for better viewing
Two types of visualizations
1. Interactive visualizations produced by the client program
  • Select the visualization method from the pulldown menu of the data visualization panel
  • Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc.
  • Select from Analysis tools/ Visualisation
  • View by double clicking on the image file
  • Save by right clicking on the file name and choosing ”Export”
Interactive visualizations by the client
• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image
• Venn diagram
Available actions:
• Change titles, colors etc.
• Zoom in/out
Static images produced by R/Bioconductor
• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering
Automatic tracking of analysis history
Running many analyses simultaneously
You can have max 5 analysis jobs running at the same time
Use Task manager to
• view parameters, status,…
• cancel jobs
Workspace – continue later/elsewhere
Saving your workspace allows you to continue later
• File/ Save workspace
• File/ Load workspace
Currently it is possible to have only one workspace saved at a time
If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot
Importing files
Affymetrix CEL-files are imported to Chipster automatically
Other files are imported using the Import tool
Import tool, step 1
Define
• Header
• Footer
• Title row
• Delimiter
Import tool, step 2
Define columns
Modify flags
Importing Agilent files (required fields)
• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)
• Identifier (ProbeName)
• Annotation (ControlType)
• Flag (IsManualFlag)
https://extras.csc.fi/biosciences/chipster-manual/data-formats.html
Quality control
Quality control tools
Quality control
• Affymetrix basic: RNA degradation + Affy QC
• Agilent: MA-plot + density plot + boxplot
Visualization
• Dendrogram
Statistics
• NMDS
Affymetrix I
Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS on normalized data
Agilent
General QC – dendrogram and NMDS
Scatterplots
Heatmaps (this took an hour to calculate)
QC-tools in Chipster
Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color
Visualization
• Dendrogram
• Heatmap
• Correlogram
Statistics
• NMDS
Normalization
What is normalization?
Normalization is the process of removing systematic variation from the data.
Typically you would normalize your data so that all the chips become comparable.
Methods
Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing
Agilent
• Background correction + averaging duplicate spots + normalization
After normalization the expression values are always expressed on log2-scale
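The quantile normalization step mentioned above can be sketched in a few lines. This is only an illustration of the general idea (make the value distribution identical on every chip) and not the RMA implementation used by Chipster; the function name and the intensity values are hypothetical.

```python
import math

def quantile_normalize(chips):
    """Give every chip the same value distribution.

    chips: list of equal-length lists, one list of log2 values per chip.
    """
    n_probes = len(chips[0])
    # Sort each chip, then average the values found at each rank across chips.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [sum(s[r] for s in sorted_chips) / len(chips) for r in range(n_probes)]
    normalized = []
    for chip in chips:
        # Order of the probes within this chip (ties are broken by position).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical raw intensities for two chips and three probes.
raw = [[100.0, 1600.0, 400.0], [800.0, 3200.0, 200.0]]
log2_chips = [[math.log2(v) for v in chip] for chip in raw]
norm = quantile_normalize(log2_chips)
```

After the call both chips contain exactly the same set of values, and each probe keeps its rank within its own chip.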
Affymetrix
Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC content into account
• Li-Wong is the method implemented in dChip
Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all others output log2-transformed data by default (and are thus already corrected for the same phenomenon)
Custom chiptype
• If you want to use reannotated probes (probes that are assigned to the genes where they really belong), select one from this menu.
Agilent I
Background correction
• Background treatment: None, Subtract, Edwards, Normexp
• Background offset: 0 or 50
Normalize chips
• None, median, loess
Normalize genes (not typically used)
• None, scale (to median), quantile
Chiptype
• A mandatory setting!
Agilent II
Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• The usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)
Loess removes curvature from the data (suggested)
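A small sketch of why plain background subtraction leads to missing values: whenever the background is higher than the foreground, the difference is negative and has no log2 value. The intensities below are invented for illustration, and the model-based normexp correction itself is not implemented here.

```python
import math

def log2_or_missing(value):
    """Return log2(value), or None (missing) when the value is not positive."""
    return math.log2(value) if value > 0 else None

# Hypothetical foreground and background intensities for four spots.
foreground = [120.0, 85.0, 300.0, 60.0]
background = [100.0, 90.0, 110.0, 75.0]

# Background treatment "Subtract": foreground minus background.
subtracted = [f - b for f, b in zip(foreground, background)]
log_values = [log2_or_missing(v) for v in subtracted]
n_missing = sum(v is None for v in log_values)
print(subtracted)  # [20.0, -5.0, 190.0, -15.0]
print(n_missing)   # 2
```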
Checking normalization
Filtering
Gene filtering
Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)
Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…
Non-specific filtering
• Expression, flags, SD, …
Specific filtering
• Statistical testing
Non-specific filtering
Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal
Typically, non-changing genes are also removed
These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression
Specific filtering
Selecting genes that are associated with some phenotype
Typically involves statistical testing
Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• Each tells a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
  • Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)
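Combining the two filters can be sketched as follows. The function name, the cut-offs and the per-gene numbers are hypothetical; on the log2 scale a fold change of 1 corresponds to a two-fold difference.

```python
def select_genes(log2_fold_change, p_value, min_abs_log2_fc=1.0, max_p=0.05):
    """Return the genes that pass both the fold change and the p-value filter."""
    return sorted(
        gene
        for gene in log2_fold_change
        if abs(log2_fold_change[gene]) >= min_abs_log2_fc and p_value[gene] <= max_p
    )

# Hypothetical results for four genes (values invented for illustration).
log2_fc = {"geneA": 2.3, "geneB": 0.2, "geneC": -1.8, "geneD": 1.5}
p_vals = {"geneA": 0.001, "geneB": 0.0005, "geneC": 0.03, "geneD": 0.40}

selected = select_genes(log2_fc, p_vals)
print(selected)  # ['geneA', 'geneC']
```

Here geneB is dropped because its effect is small even though its p-value is low, and geneD is dropped because its large fold change is not statistically supported.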
Non-specific filtering in Chipster
Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)
Utilities
1. Calculate descriptive statistics
2. Filter using a column
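A minimal sketch of two of the filters above, written against a plain dictionary of log2 expression values. The function names and data are hypothetical, and it is an assumption that the SD filter discards the genes with the lowest standard deviation; this is not Chipster's own code.

```python
import statistics

def filter_by_expression(expression, lower, upper, min_chips, keep_inside=True):
    """Keep genes whose values are inside (or outside) [lower, upper] on at least min_chips chips."""
    kept = {}
    for gene, values in expression.items():
        n_inside = sum(lower <= v <= upper for v in values)
        n_match = n_inside if keep_inside else len(values) - n_inside
        if n_match >= min_chips:
            kept[gene] = values
    return kept

def filter_by_sd(expression, fraction_to_remove):
    """Drop the given fraction of genes, starting from the lowest standard deviation."""
    sds = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sds, key=sds.get)  # lowest SD first
    n_remove = int(len(ranked) * fraction_to_remove)
    return {gene: expression[gene] for gene in ranked[n_remove:]}

# Hypothetical log2 expression values for four genes on four chips.
expr = {
    "g1": [7.0, 7.1, 6.9, 7.0],   # expressed, but not changing
    "g2": [5.0, 9.0, 5.5, 9.5],   # changing
    "g3": [3.0, 3.0, 3.1, 3.0],   # low expression
    "g4": [8.0, 10.0, 8.2, 10.4], # changing
}

expressed = filter_by_expression(expr, lower=4.0, upper=12.0, min_chips=4)
changing = filter_by_sd(expr, fraction_to_remove=0.5)
print(sorted(expressed))  # ['g1', 'g2', 'g4']
print(sorted(changing))   # ['g2', 'g4']
```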
Venn diagram
Select three datasets in Chipster
Run the Venn diagram tool from the Visualization tool category
[Figure: Venn diagram with sets labeled SD, CV and IQR]
Statistics
Some terminology
Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice this is never done.
Parametric tests (assume data normally distributed)
• Typically used for microarray data
Non-parametric tests (do not assume normality)
P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate is a p-value corrected for multiple tests (more on this later)
Mean and variance, an example for 1 gene
[Figure: two density plots for one gene, each with Density on the y-axis. Panel 1: density.default(x = x1), N = 100000, bandwidth = 0.08956, x-axis from -6 to 6. Panel 2: density.default(x = y1), N = 100000, bandwidth = 0.09006, x-axis from -10 to 10.]
Statistical testing
Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.
Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)
t-test
Compares means of two groups
• If the p-value is small (<0.05), the data give evidence of a difference between the groups.
• If the p-value is large (>0.05), no difference between the groups can be shown; this does not prove that the groups are equal.
• p-value is the risk of saying that there is a difference when there actually isn’t.
A test for every gene is run separately -> thousands of tests and p-values
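$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the group means and $SE$ is the standard error of their difference.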
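The t statistic for one gene can be computed as in the sketch below. It assumes the unequal-variance (Welch) form of the standard error, SE = sqrt(s1²/n1 + s2²/n2), which may differ from the variant Chipster uses; the expression values are invented. Turning t into a p-value additionally requires the t distribution, which is left out here.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Two-sample t statistic: difference of group means divided by its standard error."""
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Assumed Welch form of the standard error (unequal variances allowed).
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return mean_diff / se

# Hypothetical log2 expression values of one gene, three chips per group.
control = [7.1, 7.3, 6.9]
cancer = [8.9, 9.2, 9.0]

t = t_statistic(control, cancer)
print(round(t, 2))
```

In a real analysis the same calculation is repeated for every gene on the chip.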
ANOVA
A generalization of t-test.
Compares means of several groups.
Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)
A test for every gene is run separately -> thousands of tests and p-values
Multiple testing correction I
After getting the results for all the genes, p-values are adjusted for the number of tests conducted.
When making several comparisons using the same test, some of the results will be chance findings.
• Example: if the p-value threshold is 0.05, about every 20th test is expected to be significant by chance alone, even when there is no real difference. If 10000 genes were tested, about 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority (50) would be true positives.
This can be corrected for (to some extent) by using a multiple testing correction.
• Benjamini and Hochberg FDR: If FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
• Thus, the FDR threshold can be set much higher than the customary p-value threshold, and the results can still be meaningful and worth investigating.
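The Benjamini and Hochberg adjustment and the arithmetic of the examples above can be sketched as follows. The function name and the four p-values are hypothetical; the procedure is the standard step-up method, where each sorted p-value is multiplied by (number of tests / rank) and the result is kept monotone.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(p)

# Arithmetic from the examples: chance findings at p < 0.05 among 10000 tests,
# and expected false positives among 500 genes that pass FDR < 0.05.
expected_chance_findings = 10000 * 0.05   # 500
expected_false_positives = 500 * 0.05     # 25
```

A gene with a smaller raw p-value never gets a larger adjusted value than a gene with a larger raw p-value, so the order of the gene list is not reversed by the correction.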
Multiple testing correction II
The ranking of the genes does not change after multiple testing correction!
• If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.
• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.
Gene set test (”global test”)
A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Are pathways associated with our endpoints of interest?
• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?
Works on the expression values data.
Gene enrichment analysis
A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.
Works on the gene list.
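One common way to test a gene list for enrichment is the hypergeometric test, sketched below. Whether Chipster uses exactly this test is an assumption, and the function name and counts are hypothetical. The p-value is the probability of seeing at least the observed overlap if the gene list were drawn at random from all genes on the chip.

```python
from math import comb

def enrichment_p_value(n_universe, n_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_category belong to the category."""
    total = comb(n_universe, n_selected)
    upper = min(n_category, n_selected)
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10 genes on the chip, 3 in the category,
# 3 differentially expressed, and all 3 of those are in the category.
p_enriched = enrichment_p_value(10, 3, 3, 3)
print(p_enriched)  # 1/120, about 0.0083
```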
Statistical tests in Chipster
Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling
Visualization
• Volcano plot
Clustering
Clustering methods
Hierarchical clustering
Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps
Classification / class prediction
• K-nearest neighbor (KNN)
Hierarchical clustering
Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
Average linkage example
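The two distance measures and average linkage can be sketched as below. The function names and expression profiles are hypothetical, and the Pearson distance is taken here as 1 minus the correlation, which is one common convention and an assumption about the tool. With average linkage, the distance between two clusters is the mean of all pairwise distances between their members.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for profiles with identical shape."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1.0 - cov / norm

def average_linkage(cluster_a, cluster_b, distance):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    return sum(distance(p, q) for p, q in pairs) / len(pairs)

# Two hypothetical genes with the same profile shape at different levels.
gene_low = [1.0, 2.0, 3.0]
gene_high = [11.0, 12.0, 13.0]
d_euclid = euclidean(gene_low, gene_high)          # large: levels differ
d_pearson = pearson_distance(gene_low, gene_high)  # 0: shapes agree

# Distance between a one-gene cluster and a two-gene cluster.
d_clusters = average_linkage([[0.0, 0.0]], [[3.0, 4.0], [6.0, 8.0]], euclidean)
print(d_clusters)  # 7.5
```

The example shows why the choice of distance matters: the two genes are far apart by Euclidean distance but identical by correlation.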
Hierarchical clustering - heatmap
Annotation
Annotation
Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.
Retrieved from multiple biological databases and stored as a single database in Chipster (generated by Bioconductor project).
Required by certain analysis tools (annotation, GO enrichment, promoter analysis, chromosomal plots)
• These tools don’t work for those chiptypes which don’t have Bioconductor annotation packages
Alternative CDF environments for Affy
CDF is a file that links individual probes to their location in genes (probesets)
The default Affymetrix annotation uses old CDF files that map a sizable number of probes to the wrong genes
Alternative CDFs fix this problem
In Chipster
• selecting ”custom chiptype” in Affymetrix normalization takes altCDFs into use
• Note: if you have normalized using a custom chiptype, certain tools requiring annotation won’t work (GO term enrichment, promoter analysis, annotation)
Dai et al. (2005) Nuc Acids Res, 33(20):e175
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp