Microarrays: Common Analysis Approaches
![Page 1: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/1.jpg)
Microarrays: Common Analysis Approaches
![Page 2: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/2.jpg)
• Missing Value Estimation
• Differentially Expressed Genes
• Clustering Algorithms
• Principal Components Analysis
Outline
![Page 3: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/3.jpg)
• Missing data problem, basic concepts and terminology
• Classes of procedures
  - Case deletion
  - Single imputation
    - Filling with zeroes
    - Row averaging
    - SVD imputation
    - KNN imputation
  - Multiple imputation
Missing Data: Outline
![Page 4: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/4.jpg)
• Causes for missing data
  - Low resolution
  - Image corruption
  - Dust/scratched slides
  - Missing measurements
• Why estimate missing values?
  - Many algorithms cannot deal with missing values
  - Distance measure-dependent algorithms (e.g., clustering, similarity searches)
The Missing Data Problem
![Page 5: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/5.jpg)
Statistical overview
• Population of complete data: θ
• Sample of complete data: θs (drawn from the population by sampling)
• Sample of incomplete data: θi (produced from the complete sample by the missing data mechanism)
• Need to estimate θ from the incomplete data and investigate its performance over repetitions of the sampling procedure
Basic concepts and terminology
![Page 6: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/6.jpg)
Y = sample data
f(Y;θ) = distribution of sample data
θ = parameters to be estimated
R = indicators, whether elements of Y are observed or missing
g(R|Y) = missing data mechanism (possibly with other parameters)
Y = (Yobs, Ymis)
Yobs = observed part of Y
Ymis = missing part of Y
Goal:
Propose methods to estimate θ from Yobs and accurately assess its error
Basic concepts
![Page 7: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/7.jpg)
Classes of mechanisms (cf. Rubin, 1976, Biometrika)
• Missing Completely At Random (MCAR)
  - g(R|Y) does not depend on Y
• Missing At Random (MAR)
  - g(R|Y) may depend on Yobs but not on Ymis
• Missing Not At Random (MNAR)
  - g(R|Y) depends on Ymis
Basic concepts (cont.)
![Page 8: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/8.jpg)
Suppose we measure age and income of a collection of individuals…
• MCAR
  - The dog ate the response sheets!
• MAR
  - Probability that the income measurement is missing varies according to the age but not income
• MNAR
  - Probability that an income is recorded varies according to the income level within each age group
Note: we can disprove MCAR by examining the data, but we cannot disprove MAR or MNAR.
Example
![Page 9: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/9.jpg)
• Missing data problem, basic concepts and terminology
• Classes of procedures
  - Case deletion
  - Single imputation
    - Filling with zeroes
    - Row averaging
    - SVD imputation
    - KNN imputation
  - Multiple imputation
Outline
![Page 10: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/10.jpg)
• Remove subjects with missing values on any item needed for analysis

|   | Y1 | Y2 | Y3 |
|---|---|---|---|
| 1 | 1 | 3 | 4 |
| 2 | 5 | ? | 1 |
| 3 | 4 | 4 | ? |
| 4 | 1 | 2 | 3 |

Advantages
• Easy
• Valid analysis under MCAR
• OK if proportion of missing cases is small and they are not overly influential

Disadvantages
• Can be inefficient, may discard a very high proportion of cases (5669 out of 6178 rows discarded in Spellman yeast data)
• May introduce substantial bias if missing data are not MCAR (complete cases may be unrepresentative of the population)
Classes of procedures: Case Deletion
![Page 11: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/11.jpg)
Replace with zeroes
• Fill in all missing values with zeroes

|   | Y1 | Y2 | Y3 |
|---|---|---|---|
| 1 | 1 | 3 | 4 |
| 2 | 5 | 0 | 1 |
| 3 | 4 | 4 | 0 |
| 4 | 1 | 2 | 3 |

Advantages
• Easy

Disadvantages
• Distorts the data disproportionately (changes statistical properties)
• May introduce bias
• Why zero?
Classes of procedures: Single Imputation (I)
![Page 12: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/12.jpg)
Row averaging
• Replace missing values by the row average for that row

|   | Y1 | Y2 | Y3 |
|---|---|---|---|
| 1 | 1 | 3 | 4 |
| 2 | 5 | 2.67 | 1 |
| 3 | 4 | 4 | 3.67 |
| 4 | 1 | 2 | 3 |

Advantages
• Easy
• Keeps same mean

Disadvantages
• Distorts distributions and relationships between variables

[Figure: scatter plot]
Classes of procedures: Single Imputation (II)
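The three simplest procedures so far (case deletion, filling with zeroes, row averaging) can be sketched in a few lines. This is only an illustration on the toy table, assuming missing entries are encoded as `None` and that every row has at least one observed value; the function names are hypothetical. Row averaging is computed here strictly as the mean of the observed entries in the same row.

```python
# Toy data: rows are cases, columns are Y1..Y3, None marks a missing value.
data = [
    [1, 3, 4],
    [5, None, 1],
    [4, 4, None],
    [1, 2, 3],
]

def case_deletion(rows):
    """Keep only the rows that have no missing values."""
    return [row for row in rows if None not in row]

def fill_zeroes(rows):
    """Replace every missing value with zero."""
    return [[0 if v is None else v for v in row] for row in rows]

def row_average(rows):
    """Replace each missing value with the mean of the observed values in its row."""
    filled = []
    for row in rows:
        observed = [v for v in row if v is not None]
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled

complete_cases = case_deletion(data)
zero_filled = fill_zeroes(data)
row_averaged = row_average(data)
```

Case deletion shrinks the table, while the two imputation functions keep its shape and only overwrite the missing cells.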
![Page 13: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/13.jpg)
“Hot deck” imputation
• Replace each missing value by a randomly drawn observed value

|   | Y1 | Y2 | Y3 |
|---|---|---|---|
| 1 | 1 | 3 | 4 |
| 2 | 5 | 1 | 1 |
| 3 | 4 | 4 | 2 |
| 4 | 1 | 2 | 3 |

Advantages
• Easy
• Preserves distributions very well

Disadvantages
• May distort relationships
• Can use, e.g., “similar” rows to draw random values from (to help constrain distortion)
  - Depends on the definition of “similar”
Classes of procedures: Single Imputation (III)
![Page 14: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/14.jpg)
• SVD imputation
  - Fill missing entries with regressed values from a set of characteristic patterns, using coefficients determined by the proximity of the missing row to the patterns
• KNN imputation (more later)
  - Isolate rows whose values are similar to those of the one with missing values (choosing (i) a similarity measure, and (ii) the size of this set)
  - Fill missing values with averages from this set of genes, with weights inversely proportional to distance (so the most similar genes count most)
• Regression imputation
  - Fit regression to observed values, use it to obtain predictions for missing ones
  - Computationally intensive
  - May distort relationships between variables (could use Yimp + random residual)
Classes of procedures: Single Imputation (IV)
![Page 15: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/15.jpg)
Main Idea
• Replace Ymis by M > 1 independent draws
  - $\{Y^{1}_{mis}, \ldots, Y^{M}_{mis}\} \sim P(Y_{mis} \mid Y_{obs})$
• Produce M different versions of complete data
• Analyse each one in same fashion and combine results at the end, with standard error estimates (Rubin, 1987)

• More difficult to implement
• Requires (initially) more computations
• More work involved in interpreting results
Classes of procedures: Multiple Imputation
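The "combine results at the end" step can be sketched with the pooling rules commonly attributed to Rubin (1987): average the M point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The function name and the example numbers below are made up for illustration.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine M per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)          # overall point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from M = 3 imputed data sets.
estimate, total_variance = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
standard_error = total_variance ** 0.5
```

The between-imputation term is what a single imputation cannot provide: it carries the extra uncertainty caused by not knowing the missing values.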
![Page 16: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/16.jpg)
• Troyanskaya et al., Bioinformatics, 2001
The Algorithm
0. Given gene A with a missing value in experiment 1
1. Find K other genes with values present in experiment 1, with expression most similar to A in other experiments
2. Weighted average of values in experiment 1 from the K closest genes is used as an estimate for the missing value in A
KNN Imputation
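The steps above can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes missing values are stored as `None`, uses Euclidean distance averaged over the experiments observed in both genes, and weights neighbours by 1/distance. The function name `knn_impute` and the small constant that avoids division by zero are assumptions.

```python
import math

def knn_impute(matrix, gene, experiment, k):
    """Estimate matrix[gene][experiment] from the k most similar genes."""
    target = matrix[gene]
    candidates = []
    for g, row in enumerate(matrix):
        # A neighbour must have a value present in the experiment of interest.
        if g == gene or row[experiment] is None:
            continue
        # Compare the two genes on the other experiments observed in both.
        shared = [j for j in range(len(row))
                  if j != experiment and target[j] is not None and row[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((target[j] - row[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, row[experiment]))
    if not candidates:
        raise ValueError("no gene has a value in that experiment")
    candidates.sort(key=lambda c: c[0])
    neighbours = candidates[:k]
    # Closer genes get larger weights; 1e-9 guards against a zero distance.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)

# Gene 0 is missing its value in experiment 0.
expression = [
    [None, 1.0, 2.0],
    [10.0, 1.0, 2.0],
    [20.0, 5.0, 6.0],
    [30.0, 9.0, 9.0],
]
estimate = knn_impute(expression, gene=0, experiment=0, k=2)
```

Because gene 1 matches gene 0 exactly on the shared experiments, its value dominates the weighted average.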
![Page 17: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/17.jpg)
• K – the number of nearest neighbours
  - Method appears to be relatively insensitive to K within the range 10-20
• Distance metric to be used for computing gene similarity
  - Troyanskaya: “Euclidean is sufficient”
  - No clear comparison or reason – would expect that the metric to be used depends on the type of experiment
• Not recommended on matrices with fewer than four columns
• Computationally intensive!
  - ~O(m²n) for m rows (genes) and n columns (experiments)
  - “3.23 minutes on a Pentium III 500 MHz for 6153 genes, 14 experiments with 10% of the entries missing”
KNN Imputation: Considerations
![Page 18: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/18.jpg)
KNN Imputation: Expression Profiler
![Page 19: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/19.jpg)
• Missing Value Estimation
• Differentially Expressed Genes
• Clustering Algorithms
• Principal Components Analysis
Outline
![Page 20: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/20.jpg)
Identifying Differentially Expressed Genes
[Slides courtesy of John Quackenbush, TIGR]
![Page 21: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/21.jpg)
Two vs. Multiple conditions
• Two conditions
  - t-test
  - Significance analysis of microarrays (SAM)
  - Volcano Plots
  - ANOVA
• Multiple conditions
  - Clustering
  - K-means
  - PCA
![Page 22: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/22.jpg)
$$n = \frac{4\,(z_{\alpha/2} + z_{\beta})^2}{\left(\delta / (1.4\,\sigma)\right)^2}$$

Where $z_{\alpha/2}$ and $z_{\beta}$ are normal percentile values at false positive rate $\alpha$ (Type I error rate) and false negative rate $\beta$ (Type II error rate); $\delta$ represents the minimum detectable log2 ratio; and $\sigma$ represents the SD of log ratio values.

For $\alpha$ = 0.001 and $\beta$ = 0.05, get $z_{\alpha/2}$ = -3.29 and $z_{\beta}$ = -1.65. Assume $\delta$ = 1.0 (2-fold change) and $\sigma$ = 0.25; this gives n = 12 samples (6 query and 6 control) (Simon et al., Genetic Epidemiology 23: 21-36, 2002)
How Many Replicates??
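The worked example on this slide can be checked numerically. The sketch below takes the normal percentiles from Python's `statistics.NormalDist`; the function name `replicates_needed` is hypothetical, and the formula is the one quoted on the slide.

```python
import math
from statistics import NormalDist

def replicates_needed(alpha, beta, delta, sigma):
    """Total sample size n = 4 (z_{alpha/2} + z_beta)^2 / (delta / (1.4 sigma))^2."""
    z_alpha = NormalDist().inv_cdf(alpha / 2)   # lower-tail percentile, about -3.29 here
    z_beta = NormalDist().inv_cdf(beta)         # about -1.65 here
    return 4 * (z_alpha + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2

n = replicates_needed(alpha=0.001, beta=0.05, delta=1.0, sigma=0.25)
total_samples = math.ceil(n)   # round up to whole samples
```

Rounding up gives the 12 samples (6 query and 6 control) quoted above.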
![Page 23: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/23.jpg)
Some Concepts from Statistics
![Page 24: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/24.jpg)
The probability of an event is the likelihood of its occurring.
It is sometimes computed as a relative frequency (rf), where

rf = (the number of “favorable” outcomes for an event) / (the total number of possible outcomes for that event)

The probability of an event can sometimes be inferred from a “theoretical” probability distribution, such as a normal distribution.
Probability Distributions
![Page 25: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/25.jpg)
σ = standard deviation of the distribution
X = μ (mean of the distribution)
Normal Distribution
![Page 26: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/26.jpg)
Population 1
Mean 1
Population 2
Mean 2
Less than a 5 % chance that the sample with mean s came from Population 1
s is significantly different from Mean 1 at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from Population 2
Sample mean “s”
![Page 27: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/27.jpg)
• Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.
• But expression measurements? Probably not.
• Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.
• Randomization / resampling based tests can be used to get around the violation of the normality assumption.
• Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.
Probability and Expression Data
![Page 28: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/28.jpg)
1. Compute the value of interest (i.e., the test-statistic s) from your data set.
2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion. Re-compute s from each “fake” data set.
[Figure: the original data set yields s; each randomized “fake” data set yields a “fake” s]
Outline of a Randomisation Test
![Page 29: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/29.jpg)
3. Repeat step 2 many times (often several hundred to several thousand times) and record the “fake” s values from step 2
4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values
[Figure: range of randomized s values; the original s value could be significant as it exceeds most of the randomized s values]
Outline of a Randomisation Test (II)
![Page 30: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/30.jpg)
• Rationale
• Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences.
• Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.
• Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample)
Outline of a Randomisation Test (III)
![Page 31: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/31.jpg)
• Let’s imagine there are 10,000 genes on a chip, and
• none of them is differentially expressed.
• Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.
The Problem of Multiple Testing (I)
![Page 32: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/32.jpg)
• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01
• Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of seeing a test statistic at least this extreme, i.e.,
• Even though we conclude that the gene is differentially expressed (because p < 0.05), there is still a small chance that the result arose by chance alone and our conclusion is wrong.
• We might be willing to live with such a low probability of being wrong
• BUT .....
The Problem of Multiple Testing (II)
![Page 33: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/33.jpg)
• We are testing 10,000 genes, not just one!!!
• Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05
• If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
The Problem of Multiple Testing (III)
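The arithmetic behind the 500 false conclusions can be illustrated with a small simulation. It assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
n_genes = 10_000
alpha = 0.05

# One simulated p-value per gene, with no real differential expression anywhere.
p_values = [rng.random() for _ in range(n_genes)]

# Every gene that passes the threshold here is a false positive.
false_positives = sum(p < alpha for p in p_values)
expected = n_genes * alpha  # about 500 on average
```

The simulated count varies from run to run but stays close to the expected 500.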
![Page 34: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/34.jpg)
• There are “tricks” we can use to reduce the severity of this problem.
• They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.
• We’ll go into some of these techniques later.
The Problem of Multiple Testing (IV)
![Page 35: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/35.jpg)
• Don’t get too hung up on p-values.
• Ultimately, what matters is biological relevance.
• P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance.
• Statistical significance is not necessarily the same as biological significance.
The Problem of Multiple Testing (V)
![Page 36: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/36.jpg)
• Assume we will compare two conditions with multiple replicates for each class
• Our goal is to find genes that are significantly different between these classes
• These are the genes that we will use for later data mining
Finding Significant Genes
![Page 37: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/37.jpg)
• Average Fold Change Difference for each gene
  - suffers from being arbitrary and not taking into account systematic variation in the data
Finding Significant Genes (II)
![Page 38: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/38.jpg)
• t-test for each gene
  - Tests whether the means of the query and reference groups are the same
  - Essentially measures signal-to-noise
  - Calculate p-value (permutations or distributions)
  - May suffer from intensity-dependent effects

$$t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between means}}{\text{variability of groups}} = \frac{\langle X_q \rangle - \langle X_c \rangle}{SE(X_q - X_c)}$$

$$t = \frac{\bar{X}_q - \bar{X}_c}{\sqrt{\dfrac{\sigma_q^2}{n_q} + \dfrac{\sigma_c^2}{n_c}}}$$
Finding Significant Genes (III)
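A minimal sketch of the per-gene t-statistic, written as the difference between group means divided by the standard error of that difference. It uses the sample variance (n - 1 denominator) for each group, which is an assumption about the slide's variance term; the example values are made up.

```python
import math
from statistics import mean, variance

def t_statistic(query, control):
    """Signal (difference between means) divided by noise (standard error)."""
    standard_error = math.sqrt(variance(query) / len(query)
                               + variance(control) / len(control))
    return (mean(query) - mean(control)) / standard_error

# Hypothetical log ratios for one gene in three query and three control arrays.
t = t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

A large absolute t means the difference between the group means is large relative to the spread within the groups.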
![Page 39: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/39.jpg)
A significant difference
Probably not
T-Tests
![Page 40: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/40.jpg)
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix, Genes 1-6 (rows) by Exp 1-6 (columns)]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
[Figure: the same matrix with its columns arranged into Group A and Group B]
T-Tests (I)
![Page 41: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/41.jpg)
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic for each gene either from:
A. Theoretical t-distribution
OR
B. Permutation tests.
T-Tests (II)
![Page 42: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/42.jpg)
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.
[Figure: Gene 1 values for Exp 1-6 under the original grouping and under a randomized grouping into Group A and Group B]
T-Tests (III)
![Page 43: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/43.jpg)
Permutation tests - continued
iii) Compute t-statistic for the randomized gene
iv) Repeat steps ii-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 – (x/n)
T-Tests (IV)
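Steps i-vi for a single gene can be sketched as below. The t-statistic uses sample variances, and a zero standard error in a shuffled group is mapped to 0 or infinity; both choices, and the function names, are assumptions made for this illustration.

```python
import math
import random
from statistics import mean, variance

def t_statistic(a, b):
    """Difference between group means divided by its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(group_a, group_b, n, rng):
    """p = 1 - x/n, where x counts shuffles whose |t| the original |t| exceeds."""
    observed = abs(t_statistic(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    x = 0
    for _ in range(n):
        rng.shuffle(pooled)                 # reshuffle values between the groups
        fake_a = pooled[:len(group_a)]      # same group sizes as the original
        fake_b = pooled[len(group_a):]
        if observed > abs(t_statistic(fake_a, fake_b)):
            x += 1
    return 1 - x / n

p = permutation_p_value([1.0, 1.1, 0.9, 1.2], [5.0, 5.2, 4.9, 5.1],
                        n=2000, rng=random.Random(0))
```

With only four values per group there are few distinct regroupings, so the smallest p-value this test can reach is limited by the group sizes rather than by n.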
![Page 44: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/44.jpg)
5. Determine whether a gene’s expression levels are significantly different between the two groups by one of three methods:
A) “Just alpha” (α significance level): If the calculated p-value for a gene is less than or equal to the user-input α (critical p-value), the gene is considered significant.
OR
Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above –> pcritical = α/N.
T-Tests (V)
![Page 45: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/45.jpg)
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes (α/N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (α/[N-1]), and so on.
T-Tests (VI)
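The two corrections can be compared side by side. In this sketch genes are ranked by ascending p-value rather than descending t-value, and testing stops at the first gene that fails its threshold, as in Holm's step-down procedure; both are assumptions, since the slides do not spell them out. Function names and p-values are illustrative.

```python
def bonferroni(p_values, alpha):
    """Standard Bonferroni: every gene is tested at alpha / N."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def adjusted_bonferroni(p_values, alpha):
    """Step-down version: thresholds alpha/N, alpha/(N-1), ... by rank."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # most significant first
    significant = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (n - rank):
            significant[i] = True
        else:
            break   # assumption: stop at the first gene that fails
    return significant

p_values = [0.001, 0.012, 0.02, 0.5]
standard = bonferroni(p_values, alpha=0.05)
adjusted = adjusted_bonferroni(p_values, alpha=0.05)
```

The adjusted version is less strict for lower-ranked genes, so it can call a gene significant that the standard correction rejects.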
![Page 46: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/46.jpg)
• Significance Analysis of Microarrays (SAM)
  - Uses a modified t-test by estimating and adding a small positive constant to the denominator
  - Significant genes are those which exceed the expected values from permutation analysis.
Finding Significant Genes (IV)
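The effect of the small positive constant can be seen in a sketch of the modified statistic. SAM estimates the constant from the data; here it is simply passed in as `s0`, and the pooled standard error in the denominator is an assumption about the exact form. The sign follows the convention that positive values mean group B is higher than group A.

```python
import math
from statistics import mean

def d_statistic(group_a, group_b, s0):
    """Difference in means over (pooled standard error + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    sum_sq = (sum((v - mean_a) ** 2 for v in group_a)
              + sum((v - mean_b) ** 2 for v in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_b - mean_a) / (s + s0)

plain = d_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], s0=0.0)
damped = d_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], s0=1.0)
```

Adding `s0` shrinks the statistic, which matters most for genes whose standard error is very small and would otherwise produce extreme values.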
![Page 47: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/47.jpg)
SAM
• SAM can be used to select significant genes based on differential expression between sets of conditions
• Currently implemented for two-class unpaired design – i.e., we can select genes whose mean expression level is significantly different between two groups of samples (analogous to t-test).
• Stanford University, Rob Tibshirani http://www-stat.stanford.edu/~tibs/SAM/index.html
![Page 48: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/48.jpg)
SAM
• SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.
• It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.
• The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.
![Page 49: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/49.jpg)
SAM Two-class
1. Assign experiments to two groups - in the expression matrix below: Experiments 1, 2 and 5 to group A; Experiments 3, 4 and 6 to group B
[Figure: expression matrix, Genes 1-6 (rows) by Exp 1-6 (columns), shown again with its columns arranged into Group A and Group B]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?
![Page 50: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/50.jpg)
SAM Two-class
Permutation tests
i) For each gene, compute d-value (analogous to t-statistic). This is the observed d-value for that gene.
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene
[Figure: Gene 1 values for Exp 1-6 under the original grouping and under a randomized grouping into Group A and Group B]
![Page 51: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/51.jpg)
SAM Two-class
• Repeat step (ii) many times, so that each gene has many randomized d-values. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
• Plot the observed d-values vs. the expected d-values
![Page 52: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/52.jpg)
SAM Two-class
[Figure: plot of observed vs. expected d-values, annotated as follows]
• Significant positive genes (mean expression of group B > mean expression of group A) in red
• Significant negative genes (mean expression of group A > mean expression of group B) in green
• “Observed d = expected d” line
• The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.
• Tuning parameter “delta” limits can be dynamically changed by using the slider bar or entering a value in the text field.
![Page 53: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/53.jpg)
SAM Two-class
• For each permutation of the data, compute the number of positive and negative significant genes for a given delta. The median number of significant genes from these permutations is the median number of falsely significant genes; dividing it by the number of genes called significant in the original data gives the median False Discovery Rate.
• The rationale: Any gene designated as significant from the randomized data is being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomisations is a good estimate of the number of false discoveries.
![Page 54: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/54.jpg)
Finding Significant Genes (V)
• Effect vs. Significance
• Selections of items that have both a large effect and are highly significant can be identified easily.
Volcano Plots
[Figure: volcano plot; horizontal axis runs from -ve effect to +ve effect, vertical axis runs from high p to low p; regions labelled “Boring stuff” and “High Effect & Significance”]
![Page 55: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/55.jpg)
Volcano Plots
• Using log10 for Y axis
  - p < 0.1 (1 decimal place)
  - p < 0.01 (2 decimal places)
• Using log2 for X axis
![Page 56: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/56.jpg)
Volcano Plots (II)
• Using log2 for X axis
  - Effect has doubled: 2^1 (2 raised to the power of 1), a two-fold change
  - Effect has halved: 2^-1 = 0.5 (2 raised to the power of -1)
• Using log10 for Y axis
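The two axis transformations can be written out directly. This sketch assumes the effect is a ratio of mean expression in the query and control groups; the function name and example values are hypothetical.

```python
import math

def volcano_point(mean_query, mean_control, p_value):
    """Return (x, y) = (log2 fold change, -log10 p-value) for one gene."""
    x = math.log2(mean_query / mean_control)   # +1 when doubled, -1 when halved
    y = -math.log10(p_value)                   # 1 for p = 0.1, 2 for p = 0.01
    return x, y

doubled = volcano_point(200.0, 100.0, 0.01)
halved = volcano_point(50.0, 100.0, 0.1)
```

On these axes a doubling and a halving sit at equal distances either side of zero, which is why the plot is symmetric.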
![Page 57: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/57.jpg)
• Analysis of Variance (ANOVA)
  - Which genes are most significant for separating classes of samples?
  - Calculate p-value (permutations or distributions)
  - Reduces to a t-test for 2 classes
  - May suffer from intensity-dependent effects
Finding Significant Genes (VI)
![Page 58: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/58.jpg)
• Goal is to identify genes (or conditions) which have “similar” patterns of expression
• This is a problem in data mining
• “Clustering Algorithms” are most widely used
• All depend on how one measures distance
Multiple Conditions/Experiments
![Page 59: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/59.jpg)
Pattern analysis
• Supervised Learning
• Unsupervised Learning
  - Hierarchical
    - Agglomerative (single linkage, average linkage, complete linkage)
    - Divisive
  - Non-hierarchical
    - K-means
    - SOMs
![Page 60: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/60.jpg)
![Page 61: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/61.jpg)
• Each gene is represented by a vector where coordinates are its values log(ratio) in each experiment
- x = log(ratio)exp1
- y = log(ratio)exp2
- z = log(ratio)exp3
- etc.
• For example, if we do six experiments, - Gene1 = (-1.2, -0.5, 0, 0.25, 0.75, 1.4) - Gene2 = (0.2, -0.5, 1.2, -0.25, -1.0, 1.5) - Gene3 = (1.2, 0.5, 0, -0.25, -0.75, -1.4) - etc.
Expression Vectors
![Page 62: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/62.jpg)
• These gene expression vectors of log(ratio) values can be used to construct an expression matrix
|   | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 |
|---|---|---|---|---|---|---|
| Gene1 | -1.2 | -0.5 | 0 | 0.25 | 0.75 | 1.4 |
| Gene2 | 0.2 | -0.5 | 1.2 | -0.25 | -1.0 | 1.5 |
| Gene3 | 1.2 | 0.5 | 0 | -0.25 | -0.75 | -1.4 |
• This is often represented as a red/green colored matrix
Expression Matrix
![Page 63: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/63.jpg)
[Figure: colored expression matrix, Genes 1-6 (rows) by Exp 1-6 (columns)]
The Expression Matrix is a representation of data from multiple microarray experiments.
• Each element is a log ratio, usually log2(Cy5/Cy3)
• Red indicates a positive log ratio (Cy5 > Cy3)
• Green indicates a negative log ratio (Cy5 < Cy3)
• Black indicates a log ratio of zero (Cy5 ~= Cy3)
• Gray indicates missing data
Expression Matrix
![Page 64: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/64.jpg)
Expression Vectors as points in “Expression Space”
[Figure: Genes 1-6 from a three-experiment expression matrix plotted as points in a space with axes x = Experiment 1, y = Experiment 2, z = Experiment 3; a group of nearby points is labelled “Similar Expression”]
![Page 65: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/65.jpg)
• Distances are measured “between” expression vectors
• Distance measures define the way we measure distances
• Many different ways to measure distance:
  - Euclidean distance
  - Manhattan distance
  - Pearson correlation
  - Spearman correlation
  - etc.
• Each has different properties and can reveal different features of the data
Distance measures
![Page 66: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/66.jpg)
Euclidean distance
• Measures the 'as-the-crow-flies' distance
• Deriving the Euclidean distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values (Pythagoras' theorem)

$$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
![Page 67: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/67.jpg)
Manhattan distance
• Computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed
• Manhattan distance between two items is the sum of the absolute differences of their corresponding components
$$D = \sum_{i=1}^{n} |x_i - y_i|$$
[Figure: grid-like path between points x and y]
![Page 68: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/68.jpg)
Pearson and Pearson squared
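• Pearson Correlation measures the similarity in shape between two profiles
$$D = 1 - \frac{Z(x) \cdot Z(y)}{n}$$
• Pearson Squared distance measures the similarity in shape between two profiles, but can also capture inverse relationships
$$D = 1 - \left(\frac{Z(x) \cdot Z(y)}{n}\right)^2$$
where Z(x) and Z(y) are the z-score-normalized profiles and n is the number of samples
[Figures: two panels plotting Expression against Samples]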
![Page 69: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/69.jpg)
Spearman Rank Correlation
• Spearman Rank Correlation measures the correlation between two sequences of values.
• The two sequences are ranked separately and the differences in rank are calculated at each position, i.
• Use Spearman Correlation to cluster together genes whose expression profiles have similar shapes or show similar general trends, but whose expression levels may be very different
$$D = 1 - \frac{6 \sum_{i=1}^{n} \left(\mathrm{rank}(x_i) - \mathrm{rank}(y_i)\right)^2}{n(n^2 - 1)}$$
where $x_i$ and $y_i$ are the ith values of sequences X and Y respectively
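The measures on the last few slides can be written down directly with NumPy. The sketch below is illustrative only: the function names are made up for the example, z-scores use the population standard deviation so that Z(x)·Z(y)/n equals Pearson's r, and the Spearman formula assumes the profiles contain no tied values. The two profiles are Gene1 and Gene3 from the expression matrix slide.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))

def zscores(x):
    # Population standard deviation (ddof=0); a constant profile would divide by zero.
    return (x - x.mean()) / x.std()

def pearson_r(x, y):
    return float(np.dot(zscores(x), zscores(y)) / len(x))

def pearson_distance(x, y):
    return 1.0 - pearson_r(x, y)

def pearson_squared_distance(x, y):
    return 1.0 - pearson_r(x, y) ** 2

def spearman_correlation(x, y):
    # Ranks 1..n obtained by sorting twice; valid only when there are no ties.
    rank_x = np.argsort(np.argsort(x)) + 1
    rank_y = np.argsort(np.argsort(y)) + 1
    n = len(x)
    return float(1.0 - 6.0 * np.sum((rank_x - rank_y) ** 2) / (n * (n ** 2 - 1)))

gene1 = np.array([-1.2, -0.5, 0.0, 0.25, 0.75, 1.4])
gene3 = np.array([1.2, 0.5, 0.0, -0.25, -0.75, -1.4])

# Gene3 is the mirror image of Gene1: far apart by Pearson distance,
# but identical by Pearson squared distance, which captures inverse relationships.
print(pearson_distance(gene1, gene3), pearson_squared_distance(gene1, gene3))
```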
![Page 70: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/70.jpg)
• Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”
| | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Gene 6 |
|---|---|---|---|---|---|---|
| Gene 1 | 0 | 1.5 | 1.2 | 0.25 | 0.75 | 1.4 |
| Gene 2 | 1.5 | 0 | 1.3 | 0.55 | 2.0 | 1.5 |
| Gene 3 | 1.2 | 1.3 | 0 | 1.3 | 0.75 | 0.3 |
| Gene 4 | 0.25 | 0.55 | 1.3 | 0 | 0.25 | 0.4 |
| Gene 5 | 0.75 | 2.0 | 0.75 | 0.25 | 0 | 1.2 |
| Gene 6 | 1.4 | 1.5 | 0.3 | 0.4 | 1.2 | 0 |

• The elements of this matrix are the pair-wise distances (the matrix is symmetric around the diagonal).
Distance Matrix
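As a rough sketch of how such a matrix is built, the Python below fills a symmetric matrix from an expression matrix and a chosen distance function. The helper name `distance_matrix` is an assumption for the example, and the three rows are Gene1 to Gene3 from the expression matrix slide (so the numbers differ from the illustrative table above).

```python
import numpy as np

def distance_matrix(expr, metric):
    """Pair-wise distances between the rows (genes) of an expression matrix."""
    n = expr.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it across the diagonal.
            dist[i, j] = dist[j, i] = metric(expr[i], expr[j])
    return dist

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

expr = np.array([
    [-1.2, -0.5, 0.0, 0.25, 0.75, 1.4],    # Gene1
    [0.2, -0.5, 1.2, -0.25, -1.0, 1.5],    # Gene2
    [1.2, 0.5, 0.0, -0.25, -0.75, -1.4],   # Gene3
])

print(np.round(distance_matrix(expr, euclidean), 2))
```

Any of the other measures can be passed in place of `euclidean`, which is why the choice of distance measure comes before the choice of clustering method.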
![Page 71: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/71.jpg)
Hierarchical Clustering
1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same similarity, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.
[Figure: genes G1 to G6 as points, shown before and after being grouped into nested clusters]
![Page 72: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/72.jpg)
Hierarchical Clustering
Start: G1 G2 G3 G4 G5 G6 G7 G8
G1 is most like G8: G1 G8 G2 G3 G4 G5 G6 G7
G4 is most like {G1, G8}: G1 G8 G4 G2 G3 G5 G6 G7
![Page 73: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/73.jpg)
Hierarchical Clustering
Current order: G1 G8 G4 G2 G3 G5 G6 G7
G5 is most like G7: G1 G8 G4 G2 G3 G5 G7 G6
{G5, G7} is most like {G1, G4, G8}: G1 G8 G4 G5 G7 G2 G3 G6
![Page 74: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/74.jpg)
Hierarchical Tree
[Figure: dendrogram with leaf order G1 G8 G4 G5 G7 G2 G3 G6]
![Page 75: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/75.jpg)
Agglomerative Linkage Methods
• Linkage methods are rules that determine which elements (clusters) should be linked.
• Three linkage methods that are commonly used:
- Single Linkage
- Average Linkage
- Complete Linkage
![Page 76: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/76.jpg)
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$$D_{AB} = \min_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all i = 1 to $N_A$ and j = 1 to $N_B$
Single Linkage
![Page 77: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/77.jpg)
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all i = 1 to $N_A$ and j = 1 to $N_B$
Average Linkage
![Page 78: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/78.jpg)
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$$D_{AB} = \max_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all i = 1 to $N_A$ and j = 1 to $N_B$
Complete Linkage
![Page 79: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/79.jpg)
Comparison of Linkage Methods
Single Average Complete
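To see how the three linkage rules change the merge order, here is a small, deliberately naive Python sketch of agglomerative clustering run on the 6 x 6 distance matrix from the Distance Matrix slide. The function `agglomerate` and the tie rule (the first pair found wins) are assumptions made for the example; production code would normally use an existing library routine instead.

```python
import numpy as np

GENES = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]
D = np.array([
    [0.0, 1.5, 1.2, 0.25, 0.75, 1.4],
    [1.5, 0.0, 1.3, 0.55, 2.0, 1.5],
    [1.2, 1.3, 0.0, 1.3, 0.75, 0.3],
    [0.25, 0.55, 1.3, 0.0, 0.25, 0.4],
    [0.75, 2.0, 0.75, 0.25, 0.0, 1.2],
    [1.4, 1.5, 0.3, 0.4, 1.2, 0.0],
])

# Each linkage rule reduces the member-to-member distances to one number.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda values: sum(values) / len(values),
}

def agglomerate(dist, labels, linkage="average"):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:   # ties: first pair found wins
                    best = (d, a, b)
        d, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]], float(d)))
        clusters[a] = clusters[a] + clusters[b]   # fuse the two clusters
        del clusters[b]
    return merges

for method in ("single", "average", "complete"):
    print(method, agglomerate(D, GENES, method))
```

All three rules start with the same merge, because every cluster is still a single gene, and begin to differ as soon as a cluster has more than one member.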
![Page 80: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/80.jpg)
1. Specify number of clusters, e.g., 5
2. Randomly assign genes to clusters
G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13
K-Means/Medians Clustering
![Page 81: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/81.jpg)
K-Means/Medians Clustering
3. Calculate mean/median expression profile of each cluster
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean/median expression profile (calculated in step 3) is the closest to that gene’s expression profile
[Figure: genes G1 to G13 regrouped into clusters]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached
K-Means is most useful when the user has an a priori hypothesis about the number of clusters the genes should group into.
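Steps 1 to 5 can be sketched in Python roughly as follows. The function name `k_means`, the seed handling, the use of Euclidean distance in step 4, and the rule that an emptied cluster keeps its previous centre are all assumptions made for this illustration, not details taken from the slides.

```python
import numpy as np

def k_means(expr, k, max_iter=100, seed=0, use_median=False):
    """Cluster the rows (genes) of expr into k groups; requires k <= number of genes."""
    rng = np.random.default_rng(seed)
    n = expr.shape[0]
    # Step 2: random assignment, arranged so that no cluster starts empty.
    labels = rng.permutation(np.arange(n) % k)
    centre = np.median if use_median else np.mean
    centroids = np.zeros((k, expr.shape[1]))
    for _ in range(max_iter):
        # Step 3: mean (or median) expression profile of each cluster.
        for c in range(k):
            members = expr[labels == c]
            if len(members) > 0:          # an emptied cluster keeps its old centre
                centroids[c] = centre(members, axis=0)
        # Step 4: move every gene to the cluster with the closest profile.
        dists = np.linalg.norm(expr[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no gene changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: 4 genes measured in 2 experiments, 2 clusters requested.
expr = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels, centroids = k_means(expr, k=2)
print(labels)
```

The result depends on the random start, so in practice the procedure is usually run several times with different seeds.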
![Page 82: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/82.jpg)
MOTIVATION: Using different clustering methods often produces different results. How do these clustering results relate to each other?
A clustering comparison method that finds a many-to-many correspondence between two different clustering results:
• comparison of two flat clusterings
• comparison of a flat and a hierarchical clustering.
Clustering Comparison
![Page 83: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/83.jpg)
C1 = {A1, A2, A3, A4}
C2 = {B1, B2, B3, B4}
We are interested in finding:
$$g: C_1 \to C_2$$
where the clusters are mapped as follows:
$$A_1 \leftrightarrow B_1 \cup B_2, \qquad A_2 \leftrightarrow B_3, \qquad A_3 \cup A_4 \leftrightarrow B_4$$
Comparison of flat clusterings
![Page 84: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/84.jpg)
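• Intersection size:
$$I_{ij} = \mathrm{card}(A_i \cap B_j)$$
• Simpson's index:
$$s_{ij} = \frac{\mathrm{card}(A_i \cap B_j)}{\min\{\mathrm{card}(A_i), \mathrm{card}(B_j)\}}$$
• Jaccard index:
$$J_{ij} = \frac{\mathrm{card}(A_i \cap B_j)}{\mathrm{card}(A_i \cup B_j)}$$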
Indices to measure the overlapping
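With clusters stored as Python sets of gene identifiers, the three indices are one-liners. The function names and the gene labels in the example are hypothetical, and both clusters are assumed to be non-empty.

```python
def intersection_size(a, b):
    return len(a & b)

def simpson_index(a, b):
    return len(a & b) / min(len(a), len(b))

def jaccard_index(a, b):
    return len(a & b) / len(a | b)

# Hypothetical clusters A_i and B_j taken from two different clusterings.
a_i = {"g1", "g2", "g3", "g4"}
b_j = {"g3", "g4", "g5"}

print(intersection_size(a_i, b_j), simpson_index(a_i, b_j), jaccard_index(a_i, b_j))
```

Simpson's index reaches 1 whenever the smaller cluster is fully contained in the larger one, whereas the Jaccard index reaches 1 only when the two clusters are identical.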
![Page 85: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/85.jpg)
Selecting a point to cut the dendrogram leads to s disjoint groups.
[Figure: dendrogram with an axis running from 0 to 1]
Comparison of flat and hierarchical clusterings
![Page 86: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/86.jpg)
ARTIFICIAL DATA: Four data sets with four clusters, constructed with the same four seeds and different levels of noise.
• 1000 genes, 10 conditions
• d = 20 initial partitions
Results
![Page 87: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/87.jpg)
![Page 88: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/88.jpg)
![Page 89: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/89.jpg)
Visualisation in Expression Profiler
![Page 90: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/90.jpg)
Missing Value Estimation Differentially Expressed Genes Clustering Algorithms Principal Components Analysis
Outline
![Page 91: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/91.jpg)
PCA (Dimensionality Reduction Methods)
![Page 92: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/92.jpg)
• Dimensionality Problem
• Techniques
• Methods
- Multidimensional Scaling
- Eigenanalysis-based ordination methods: Principal Component Analysis (PCA), Correspondence Analysis (CA)
Outline
![Page 93: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/93.jpg)
Problem? “Curse of dimensionality”
• Convergence of any estimator to the true value of a smooth function on a space of high dimension is very slow
• In other words, many observations are needed to obtain a good “estimate” of gene function
• “Blessing?”: very few things really matter
Solutions
• Statistical techniques (corrections, etc.)
• Reduce dimensionality
- Ignore non-variable genes
- Feature subset selection
- Eliminate coordinates that are less relevant
Dimensionality problem
![Page 94: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/94.jpg)
Idea: place data in a low-dimensional space so that “similar” objects are close to each other.
Multidimensional Scaling
The Algorithm (roughly)
1. Assign points to arbitrary coordinates in p-dimensional space.
2. Compute all-against-all distances, to form a matrix D’.
3. Compare D’ with the input matrix D by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
4. Adjust coordinates of each point in the direction that most reduces stress.
5. Repeat steps 2 through 4 until stress won't get any lower.
However:
• Computationally intensive
• Axes are meaningless, orientation of the MDS map is arbitrary
• Difficult to interpret
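The five steps can be sketched as a simple gradient descent in Python. Several choices here are assumptions made for the illustration rather than anything the slides prescribe: stress is taken to be the raw sum of squared differences between D’ and D, the step size is halved whenever a move fails to lower stress, and the names `pairwise`, `stress` and `mds` are invented for the example.

```python
import numpy as np

def pairwise(coords):
    """All-against-all Euclidean distances (the matrix D')."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(coords, target):
    """Raw stress: squared differences between D' and D, each pair counted once."""
    return float(((pairwise(coords) - target) ** 2).sum() / 2.0)

def mds(target, p=2, max_iter=500, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], p))     # step 1: arbitrary coordinates
    current = stress(coords, target)
    history = [current]
    for _ in range(max_iter):
        d = pairwise(coords)                           # step 2
        diff = coords[:, None, :] - coords[None, :, :]
        safe = np.where(d > 0, d, 1.0)                 # avoid dividing by zero
        grad = 2.0 * (((d - target) / safe)[:, :, None] * diff).sum(axis=1)
        trial = coords - step * grad                   # step 4: move against the gradient
        trial_stress = stress(trial, target)           # step 3
        if trial_stress < current:
            coords, current = trial, trial_stress
            history.append(current)
        else:
            step /= 2.0                                # overshoot: try a smaller move
            if step < 1e-12:
                break                                  # step 5: stress won't get lower
    return coords, history

# Hypothetical input: distances between the four corners of a unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
coords, history = mds(pairwise(square))
print(history[0], history[-1])
```

The recovered layout may be rotated or mirrored relative to the original square, which is the arbitrary orientation mentioned above.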
![Page 95: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/95.jpg)
Eigenanalysis: Background
Basic Concepts
An eigenvalue and eigenvector of a square matrix A are a scalar λ and a nonzero vector x so that
Ax = λx
Q: What is a matrix?
A: A linear transformation.
Q: What are eigenvectors?
A: Directions that the transformation only stretches or shrinks (by the factor λ); those with the largest |λ| are where the transformation “takes place the most”
Exploratory example: EigenExplorer
![Page 96: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/96.jpg)
Eigenanalysis: Background
Finding eigenvalues
Ax = λx
(A – λI)x = 0
A nonzero solution x exists only when det(A – λI) = 0
Interpreting eigenvalues
• The eigenvectors of the covariance matrix define a rigid rotation onto the directions of highest variance; each eigenvalue gives the variance along its eigenvector
• Can pick the N largest eigenvalues, capture a large proportion of the variance, and represent every vector in the original matrix as a linear combination of the corresponding eigenvectors, e.g., $x_i = a_1 v_1 + \dots + a_N v_N$
• Call this collection {aj} the eigengene/eigenarray (depending on which way we compute these)
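A short NumPy check of these definitions, using a small symmetric matrix chosen only for illustration (the matrix and the vector `x` are not from the slides). `np.linalg.eigh` is the NumPy routine for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # hypothetical symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(A)   # columns of eigenvectors are the v_j

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Each pair satisfies Av = lambda v, and det(A - lambda I) vanishes.
    assert np.allclose(A @ v, lam * v)
    assert np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0)

# Any vector can be written as a linear combination of the eigenvectors:
x = np.array([3.0, -1.0])
coeffs = eigenvectors.T @ x          # the coefficients a_j
assert np.allclose(eigenvectors @ coeffs, x)

print(eigenvalues)
print(coeffs)
```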
![Page 97: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/97.jpg)
PCA
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
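One common way to carry this out is an eigen-decomposition of the covariance matrix between experiments. The sketch below assumes rows are genes and columns are experiments; the function name `pca` and the toy data are invented for the example, and real analyses often use an SVD-based routine instead.

```python
import numpy as np

def pca(expr, n_components=2):
    """Project genes (rows) onto the leading principal components of the experiments."""
    centred = expr - expr.mean(axis=0)            # centre each experiment
    cov = np.cov(centred, rowvar=False)           # experiments x experiments
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]   # largest variance first
    components = eigenvectors[:, order]           # weights of each experiment
    scores = centred @ components                 # genes in the new coordinates
    explained = eigenvalues[order] / eigenvalues.sum()
    return scores, components, explained

# Hypothetical data: the second experiment is exactly twice the first,
# so a single component carries all of the variance.
expr = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
scores, components, explained = pca(expr, n_components=1)
print(components.ravel(), explained)
```

Each column of `components` is one weighted combination of experiments, which is the sense in which correlated experiments are merged into a single trend.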
![Page 98: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/98.jpg)
PCA
“Cloud” of data points (e.g., genes) in N-dimensional space, N = # hybridizations
Data points resolved along 3 principal component axes.
In this example:
x-axis could mean a continuum from over- to under-expression
y-axis could mean that “blue” genes are over-expressed in first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.
z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.
Interpretation of components is somewhat subjective.
[Figure: 3-D scatter plot with axes x, y, z]
![Page 99: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/99.jpg)
[Figure: data cloud plotted on axes x, y, z]
Principal Components pick out the directions in the data that capture the greatest variability
![Page 100: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/100.jpg)
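[Figure: data cloud with original axes x, y, z and rotated axes x’, y’, z’]
The “new” axes are linear combinations of the old axes, typically combinations of genes or experiments.
$$x' = a_1 x + b_1 y + c_1 z$$
$$y' = a_2 x + b_2 y + c_2 z$$
$$z' = a_3 x + b_3 y + c_3 z$$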
![Page 101: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/101.jpg)
Projecting the data into a lower dimensional space can help visualize relationships
[Figure: data projected onto the x’ and y’ axes]
![Page 102: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/102.jpg)
![Page 103: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/103.jpg)
PCA in Expression Profiler
![Page 104: Microarrays: Common Analysis Approaches](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814a96550346895db7a375/html5/thumbnails/104.jpg)
Further Reading
• MDS
– http://www.analytictech.com/borgatti/mds.htm
• PCA, SVD
– http://www.statsoftinc.com/textbook/stfacan.html
– http://linneus20.ethz.ch:8080/2_2_1.html
– Alter et al., “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS, 2000
• COA
– Fellenberg et al., “Correspondence analysis applied to microarray data”, PNAS, 2001
• General ordination
– http://www.okstate.edu/artsci/botany/ordinate/
– Legendre P. and Legendre L., Numerical Ecology, 1998