Advanced Methods of Data Analysis
![Page 1: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/1.jpg)
Advanced Methods of Data Analysis
• 9:00 - 10:00 CTWC
• 10:00 - 11:00 CTWC exercise
• 11:00 – 11:30 Break
• 11:30 - 12:00 SPIN
• 12:00 - 13:00 SPIN exercise
Course on Microarray Data Acquisition and Analysis
Weizmann Institute of Science, 16 May 2007
Presented by Tal Shay & Yuval Tabach
Weizmann Institute of Science, Rehovot, Israel
![Page 2: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/2.jpg)
Coupled Two-Way Clustering (CTWC)
Gad Getz, Erel Levine and Eytan Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS 97: 12079-12084.
![Page 3: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/3.jpg)
Talk Aim
Show how to use the CTWC server to analyze microarray data properly.
![Page 4: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/4.jpg)
Motivation
• Microarray experiments generate millions of numbers that contain a great deal of biological information.
• The problem: the data are very complicated and contain a large amount of noise. How can we unravel the biological information that is masked by a mass of irrelevant information?
• CTWC is a simple heuristic clustering procedure that was developed specifically to cope with microarray data.
![Page 5: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/5.jpg)
Talk Outline
• Preprocessing and filtering
• Clustering of Genes and Conditions
• Super-Paramagnetic Clustering (SPC)
• Coupled Two-Way Clustering (CTWC)
• CTWC server
• Exercise
![Page 6: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/6.jpg)
Gene Expression Matrix – CTWC format
DB_NAME Name Sample1 Sample2 Sample3
Acc1 Gene1 E11 E12 E13
Acc2 Gene2 E21 E22 E23
Acc3 Gene3 E31 E32 E33
The DB_NAME is used to link genes to a database
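A minimal sketch of how a file in this layout could be written from R. The tab delimiter and the single header row are assumptions (the slide shows only the column order), and the expression values are made up.

```r
# Toy expression values: 3 genes x 3 samples (made-up numbers).
expr <- matrix(c(7.1, 7.3, 9.8,
                 5.2, 5.0, 5.1,
                 8.4, 6.2, 6.0),
               nrow = 3, byrow = TRUE)

ctwc_table <- data.frame(
  DB_NAME = c("Acc1", "Acc2", "Acc3"),   # accession used to link to a database
  Name    = c("Gene1", "Gene2", "Gene3"),
  Sample1 = expr[, 1],
  Sample2 = expr[, 2],
  Sample3 = expr[, 3],
  stringsAsFactors = FALSE
)

# Assumption: tab-delimited text with one header row.
out_file <- tempfile(fileext = ".txt")
write.table(ctwc_table, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to check the layout.
check <- read.delim(out_file, stringsAsFactors = FALSE)
```

In a real file the first header would be the database name rather than the literal text DB_NAME.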
![Page 7: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/7.jpg)
Visualization of Expression Matrix
• Column = chip (= sample)
• Row = probeset
• Color = expression level
[Figure: heat map of the expression matrix; rows = genes, columns = samples]
![Page 8: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/8.jpg)
Preprocessing
1. Select variable genes
2. Standardize
[Figure: initial expression matrix; rows = genes, columns = samples]
![Page 9: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/9.jpg)
Preprocessing
1. Select variable genes
2. Standardize
[Figure: the 1000 probesets with the highest standard deviation; rows = genes, columns = samples]
![Page 10: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/10.jpg)
Preprocessing
1. Select variable genes
2. Standardize
[Figure: the 1000 probesets with the highest standard deviation, standardized; rows = genes, columns = samples]
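The two preprocessing steps can be sketched in base R as follows. This is an illustration on random toy data. It assumes that "standardize" means giving each probeset (row) mean 0 and standard deviation 1, which the slides do not spell out.

```r
set.seed(1)
# Toy matrix: 50 probesets (rows) x 8 samples (columns), random values.
expr <- matrix(rnorm(50 * 8, mean = 8, sd = 1), nrow = 50,
               dimnames = list(paste0("probe", 1:50), paste0("S", 1:8)))

# 1. Select variable genes: keep the n_keep rows with the highest
#    standard deviation (the slides keep 1000 on real data).
n_keep <- 10
row_sd <- apply(expr, 1, sd)
keep <- order(row_sd, decreasing = TRUE)[seq_len(n_keep)]
filtered <- expr[keep, ]

# 2. Standardize (assumed here: per row, mean 0 and standard deviation 1).
#    scale() works on columns, so transpose, scale, and transpose back.
standardized <- t(scale(t(filtered)))
```

After this step every retained probeset has the same overall scale, so distances between rows reflect the shape of the profile rather than its magnitude.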
![Page 12: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/12.jpg)
What questions can we ask?
Supervised methods: hypothesis testing (use predefined labels)
• Which genes are expressed differently in two known types of samples?
• What is the minimal set of genes needed to distinguish one type of sample from the others?
Unsupervised methods: exploratory analysis (use only the data)
• Which genes behave similarly in the experiments?
• How many different types of samples are there?
![Page 13: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/13.jpg)
Clustering – unsupervised analysis
[Figure: all genes are filtered into low-variation and high-variation genes; clustering the high-variation genes (rows = genes, columns = samples) gives 3 clusters, each containing highly correlated genes]
![Page 14: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/14.jpg)
Unsupervised Analysis
Clustering Methods
• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and might be co-regulated. Use: learn about the biology, infer function.
• Goal B: Divide conditions into groups with similar gene expression profiles. Examples: find sub-types of a disease, group drugs according to their effect.
![Page 15: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/15.jpg)
DEFINITION OF THE CLUSTERING PROBLEM
[Figure: giraffe]
![Page 16: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/16.jpg)
CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; axis: T (resolution)]
How many clusters do we have? The answer depends on the resolution.
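A small illustration of this point, using ordinary hierarchical clustering from base R rather than SPC. The cut height `h` plays the role of the resolution here; the data points are made up.

```r
# Toy 1-D data: pairs at (0, 1), (5, 6) and (20, 21).
x <- c(0, 1, 5, 6, 20, 21)
tree <- hclust(dist(x), method = "average")

# Cut the same dendrogram at three different heights (resolutions).
k_fine   <- length(unique(cutree(tree, h = 2)))    # fine resolution
k_medium <- length(unique(cutree(tree, h = 10)))   # intermediate resolution
k_coarse <- length(unique(cutree(tree, h = 30)))   # coarse resolution

c(fine = k_fine, medium = k_medium, coarse = k_coarse)
```

The same tree gives three, two or one cluster depending on where it is cut, so "how many clusters" has no single answer without a choice of resolution.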
![Page 17: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/17.jpg)
BUT WHAT ABOUT THE OKAPI?
[Figure: giraffe and okapi]
![Page 18: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/18.jpg)
Clustering problem definition
• Input: N data points X_i, i = 1, 2, …, N, in a D-dimensional space.
• Goal: Find “natural” groups (clusters) of points. Points that belong to the same cluster are more similar to each other than to points in other clusters.
![Page 19: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/19.jpg)
Clustering is not well defined
• Similarity: which points should be considered close?
• Clustering method:
– Resolution: specified in advance, or hierarchical results
– Shape of clusters: general or spherical
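To see why the choice of similarity matters, here is a toy comparison of two common measures. The profiles are made up, and the use of Euclidean distance and of 1 minus the Pearson correlation is an illustrative choice, not something the slides prescribe.

```r
# Three toy expression profiles over 5 conditions.
a      <- c(1, 2, 3, 4, 5)
b      <- c(10, 20, 30, 40, 50)   # same shape as a, 10 times larger
c_prof <- c(5, 4, 3, 2, 1)        # similar magnitude to a, opposite trend

profiles <- rbind(a = a, b = b, c = c_prof)

# Euclidean distance: sensitive to magnitude.
d_euclid <- as.matrix(dist(profiles))

# Correlation distance (1 - Pearson r): sensitive to shape only.
d_corr <- 1 - cor(t(profiles))
```

Under Euclidean distance `a` is closer to `c`, while under correlation distance `a` and `b` are identical and `a` and `c` are as far apart as possible. Which points count as "close" therefore depends on the measure.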
![Page 20: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/20.jpg)
Agglomerative Hierarchical Clustering
• Results depend on the distance update method:
– Single linkage: elongated clusters
– Average linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• Does not always find the “natural” clusters
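The dependence on the linkage rule can be tried out with `hclust` from base R. The data below are a made-up example of two elongated, parallel chains of points; the sketch only shows how to compare the two rules and does not claim a particular outcome for average linkage.

```r
# Two elongated, parallel chains of 20 points each (toy data).
chain <- rep(1:2, each = 20)
pts <- cbind(x = rep(0:19, times = 2), y = ifelse(chain == 1, 0, 3))

d <- dist(pts)
single_k2  <- cutree(hclust(d, method = "single"),  k = 2)
average_k2 <- cutree(hclust(d, method = "average"), k = 2)

# Cross-tabulate the true chain against each result to compare them.
single_vs_chain  <- table(chain, single_k2)
average_vs_chain <- table(chain, average_k2)
```

Single linkage follows the chains, because neighbouring points within a chain are closer (distance 1) than the chains are to each other (distance 3). Inspect `average_vs_chain` to see whether average linkage makes the same split.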
![Page 21: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/21.jpg)
Stop … think
• We want to identify the real (“natural”) clusters.
• We should have a reliability parameter that will help us to distinguish between significant and non-significant clusters.
![Page 23: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/23.jpg)
Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
• The idea behind SPC is based on the physical properties of dilute magnets.
• SPC calculates the correlation between magnet orientations at different temperatures (T).
[Figure: small elements (spins) at T = low]
![Page 24: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/24.jpg)
Super-Paramagnetic Clustering (SPC)
[Figure: spins at T = high]
![Page 25: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/25.jpg)
Super-Paramagnetic Clustering (SPC)
[Figure: spins at T = intermediate]
![Page 26: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/26.jpg)
Phases of the Inhomogeneous Potts Ferromagnet
• T = low: ferromagnetic (Ferro)
• T = intermediate: super-paramagnetic (Super-Para)
• T = high: paramagnetic (Para)
![Page 27: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/27.jpg)
Super-Paramagnetic Clustering (SPC)
[Figure: spin configurations compared in two pairs of panels: T = low vs. T = high, and T = low vs. T = intermediate]
![Page 28: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/28.jpg)
Super-Paramagnetic Clustering (SPC)
• The algorithm simulates the magnets' behavior over a range of temperatures and decides which interactions to break.
• The temperature (T) controls the resolution.
Example: N = 4800 points in D = 2
![Page 29: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/29.jpg)
Identify the stable clusters
T=16
![Page 30: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/30.jpg)
Same data - Average Linkage
![Page 31: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/31.jpg)
Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization: it calculates collective correlations.
• Identifies “natural” and stable clusters (T)
• No need to pre-specify number of clusters
• Clusters can be any shape
![Page 32: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/32.jpg)
Inside SPC: dendrogram and stable clusters
[Figure: SPC dendrogram; T axis with marks at 10, 22, 24, 26 and 28]
• Min Cluster Size: 3
• Stable Delta T: 14
• Ignore dropout: 1
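One plausible reading of the first two parameters is a filter on the dendrogram: keep a cluster only if it is large enough and survives over a wide enough range of T. The sketch below encodes that reading. The table layout, the column names and the interpretation of Stable Delta T as a lifetime in temperature steps are all assumptions; this is not the server's documented behavior, and the dropout setting is not modeled.

```r
# Hypothetical summary of a dendrogram: one row per cluster, with its size and
# the temperature steps at which it appears (t_start) and breaks up (t_end).
clusters <- data.frame(
  id      = c("c1", "c2", "c3", "c4"),
  size    = c(40, 2, 15, 8),
  t_start = c(0, 5, 10, 22),
  t_end   = c(30, 28, 26, 24),
  stringsAsFactors = FALSE
)

min_cluster_size <- 3    # values shown on the slide
stable_delta_t   <- 14

# Assumed rule: a cluster is "stable" if it is big enough and long-lived enough.
lifetime <- clusters$t_end - clusters$t_start
stable <- clusters[clusters$size >= min_cluster_size &
                   lifetime >= stable_delta_t, ]
```

In this toy table `c2` is dropped for being too small and `c4` for being too short-lived.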
![Page 33: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/33.jpg)
CTWC server - Setting the SPC parameters
[Screenshot: SPC parameter forms for genes and for samples]
![Page 35: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/35.jpg)
Back to gene expression data
• Two goals: cluster the genes and cluster the conditions
• Two independent clusterings:
– Genes are represented as vectors of expression in all conditions
– Conditions are represented as vectors of expression of all genes
[Figure: heat map of the colon cancer data (normalized genes); x axis: experiments (ticks 10 to 60), y axis: genes (ticks 200 to 2000), color scale from -0.4 to 0.8]
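The two independent clusterings amount to clustering the rows of the matrix and, separately, the rows of its transpose. A toy sketch in base R follows; average-linkage `hclust` is used as a stand-in for SPC, and the block-structured data are made up.

```r
set.seed(2)
# Toy matrix: 6 genes (rows) x 4 conditions (columns).
# Genes 1-3 are high in conditions 1-2; genes 4-6 are high in conditions 3-4.
expr <- rbind(
  matrix(c(5, 5, 0, 0), nrow = 3, ncol = 4, byrow = TRUE),
  matrix(c(0, 0, 5, 5), nrow = 3, ncol = 4, byrow = TRUE)
) + matrix(rnorm(24, sd = 0.1), nrow = 6)
rownames(expr) <- paste0("gene", 1:6)
colnames(expr) <- paste0("cond", 1:4)

# Clustering 1: each gene is a point in a 4-dimensional space
# (one axis per condition).
gene_groups <- cutree(hclust(dist(expr), method = "average"), k = 2)

# Clustering 2: each condition is a point in a 6-dimensional space
# (one axis per gene).
cond_groups <- cutree(hclust(dist(t(expr)), method = "average"), k = 2)
```

On real data such as the colon cancer set, the same two calls would treat each gene as a point in as many dimensions as there are samples, and each sample as a point in as many dimensions as there are genes.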
![Page 36: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/36.jpg)
First clustering - Experiments
1. Identify tissue classes (tumor/normal)
[Figure: heat map of the colon cancer data (normalized genes); x axis: experiments, y axis: genes. Each experiment is a point in D = 2000 dimensions.]
![Page 37: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/37.jpg)
Second Clustering - Genes
2. Find Differentiating And Correlated Genes
[Figure: heat map of the colon cancer data (normalized genes); x axis: experiments (samples), y axis: genes. Each gene is a point in D = 62 dimensions.]
![Page 38: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/38.jpg)
Two-way clustering
[Figure: two-way clustering; S1(G1) = the samples clustered using all genes, G1(S1) = the genes clustered using all samples]
![Page 39: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/39.jpg)
Two way clustering-ordered
[Figure: heat map of the colon cancer data (normalized genes) with rows and columns reordered by the two-way clustering, S1(G1) and G1(S1); x axis: experiments, y axis: genes]
![Page 40: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/40.jpg)
Song A
Song B
![Page 41: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/41.jpg)
Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
• Philosophy: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibits the expression patterns of interest.
• New goal: Use subsets of genes to study subsets of samples (and vice versa).
• This is a non-trivial task: the number of subsets is exponential.
• CTWC is a heuristic for solving this problem.
![Page 42: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/42.jpg)
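Inside CTWC: Iterations

| Depth | Genes: clustering runs | Genes: new clusters | Samples: clustering runs | Samples: new clusters |
|---|---|---|---|---|
| Init | | G1 | | S1 |
| 1 | G1(S1) | G2, G3, … G5 | S1(G1) | S2, S3 |
| 2 | G1(S2); G1(S3) | G6, G7, … G13; G14, … G21 | S1(G2); …; S1(G5) | S4, S5, S6; S10, S11; None |
| 3 | G2(S1) … G2(S3); …; G5(S1) … G5(S3) | G22 … G97 | S2(G1) … S2(G5); S3(G1) … S3(G5) | S12, … S51 |
| 4 | G1(S4); …; G1(S11) | G98, … G105; …; G151, … G160 | S1(G6); …; S1(G21) | S52, … S67 |
| 5 | G2(S4) … G2(S11); …; G5(S4) … G5(S11) | G161 … G216 | S2(G6) … S2(G21); S3(G6) … S3(G21) | S68 … S113 |

Depth 1 is ordinary two-way clustering.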
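The iteration can be sketched in a few lines of R. This is a simplified illustration, not the CTWC implementation. It uses a two-way split by average-linkage `hclust` as a stand-in for SPC, and it visits every unused (gene set, sample set) pair at each depth instead of following the exact schedule of the table. The function names `find_clusters`, `add_new` and `ctwc_sketch` are hypothetical.

```r
# Stand-in for SPC: split the rows of mat into 2 groups and keep the groups
# with at least min_size members. Unlike SPC it always splits, so it also
# reports splits of pure noise.
find_clusters <- function(mat, min_size = 3) {
  if (nrow(mat) < 2 * min_size) return(list())
  groups <- cutree(hclust(dist(mat), method = "average"), k = 2)
  parts <- unname(split(rownames(mat), groups))
  Filter(function(p) length(p) >= min_size, parts)
}

# Register clusters whose membership has not been seen before.
add_new <- function(known, found, prefix) {
  for (members in found) {
    seen <- any(vapply(known, function(k) setequal(k, members), logical(1)))
    if (!seen) known[[paste0(prefix, length(known) + 1)]] <- members
  }
  known
}

# expr must have row names (genes) and column names (samples).
ctwc_sketch <- function(expr, max_depth = 2, min_size = 3) {
  genes   <- list(G1 = rownames(expr))   # G1 = all genes
  samples <- list(S1 = colnames(expr))   # S1 = all samples
  done <- character(0)                   # pairs that were already clustered
  for (depth in seq_len(max_depth)) {
    found_g <- list()
    found_s <- list()
    for (g in names(genes)) for (s in names(samples)) {
      key <- paste(g, s)
      if (key %in% done) next
      done <- c(done, key)
      sub <- expr[genes[[g]], samples[[s]], drop = FALSE]
      found_g <- c(found_g, find_clusters(sub, min_size))     # g(s)
      found_s <- c(found_s, find_clusters(t(sub), min_size))  # s(g)
    }
    genes   <- add_new(genes, found_g, "G")
    samples <- add_new(samples, found_s, "S")
  }
  list(genes = genes, samples = samples)
}

# Toy data: genes g1-g6 are high in samples s1-s4, everything else is noise.
set.seed(3)
expr <- matrix(rnorm(12 * 8, sd = 0.1), nrow = 12,
               dimnames = list(paste0("g", 1:12), paste0("s", 1:8)))
expr[1:6, 1:4] <- expr[1:6, 1:4] + 5
result <- ctwc_sketch(expr, max_depth = 2, min_size = 3)
```

The point of the sketch is the coupling: every new gene cluster becomes a candidate feature set for clustering samples, and every new sample cluster becomes a candidate feature set for clustering genes. With a real stability criterion, as in SPC, most pairs yield no new clusters and the process terminates quickly.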
![Page 43: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/43.jpg)
CTWC server - Setting the coupled two-way clustering parameters
[Screenshot: parameter form, including e-mail notification]
![Page 44: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/44.jpg)
COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES
[Figure: heat map of the colon cancer data with gene clusters G4 and G12 marked; panels S1(G4) and S1(G12), each with sample groups labeled A and B]
![Page 45: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/45.jpg)
COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES
[Figure: CTWC colon cancer - tissues. Heat map with sample groups labeled A and B, and two panels with both axes running from 0 to 60 for S1(G4) and S1(G12); S17 is marked]
![Page 46: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/46.jpg)
What kind of results do you wish to find?
Type A / type B distance matrix
![Page 48: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/48.jpg)
CTWC software
• Web interface
– ctwc.weizmann.ac.il
– ctwc.bioz.unibas.ch
• Standalone
– Write to [email protected]
![Page 49: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/49.jpg)
CTWC standalone
![Page 50: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/50.jpg)
Sample Labels
• Given as a binary file
• For a cluster Gx, label L with values L1 and L2:
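• Purity(C1, L1): how much of C1 is composed of L1?
  $\mathrm{Purity}(C_1, L_1) = \dfrac{\#L_1 \text{ in } C_1}{|C_1|}$
• Efficiency(C1, L1): how much of L1 is contained in C1?
  $\mathrm{Efficiency}(C_1, L_1) = \dfrac{\#L_1 \text{ in } C_1}{|L_1|}$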
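As a worked example, the two scores can be computed directly from a label vector and a list of cluster members. The sample names and labels below are made up.

```r
# Toy example: 10 labeled samples and a cluster C1 of 5 samples.
labels  <- c(s1 = "L1", s2 = "L1", s3 = "L1", s4 = "L1", s5 = "L2",
             s6 = "L2", s7 = "L2", s8 = "L2", s9 = "L1", s10 = "L1")
cluster <- c("s1", "s2", "s3", "s4", "s5")

in_both    <- sum(labels[cluster] == "L1")    # number of L1 samples in C1
purity     <- in_both / length(cluster)       # fraction of C1 that is L1
efficiency <- in_both / sum(labels == "L1")   # fraction of L1 that is in C1
```

Here 4 of the 5 cluster members carry label L1, so the purity is 0.8, and the cluster holds 4 of the 6 L1 samples, so the efficiency is 2/3.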
![Page 51: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/51.jpg)
Biological Work
• Literature search for information on interesting genes.
• Annotation analysis: classify the genes according to their function.
• Find whether there is a common function or biological meaning for clusters of interest.
• Find what sets of experiments/conditions have in common.
• Genomics analysis: search for a common regulatory signal upstream of the genes.
• Design the next experiment: get more data to validate the result.
Remember: most of your work starts here, with understanding the biology behind your results.
![Page 52: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/52.jpg)
Summary
• Clustering methods are used to
– find genes from the same biological process
– group the experiments into similar conditions
• Focusing on subsets of the genes and conditions can unravel structure that is masked when using all genes and conditions
ctwc.weizmann.ac.il
or
![Page 53: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/53.jpg)
Exercise - Course Experiment
| | NT | 48hr | 72hr | 96hr |
|---|---|---|---|---|
| D8 | D8_NT_s_1b, D8_NT_c_1a, D8_NT_c_2 | D8_48h_s_1b, D8_48h_c_1a, D8_48h_c_2 | D8_72h_s_1b, D8_72h_c_1a | D8_96h_s_1b, D8_96h_c_1a, D8_96h_c_2 |
| D11 | D11_NT_s_2, D11_NT_c_1a, D11_NT_c_1b | D11_48h_c_1a, D11_48h_c_1b | D11_72h_s_2, D11_72h_c_1a, D11_72h_c_1b | D11_96h_c_1a, D11_96h_c_1b |
At time 0 a treatment is given.
For D8, treatment suppresses mutp53.
For D11, treatment does not.
![Page 54: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/54.jpg)
The Data
Save and backup the CEL files!
![Page 55: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/55.jpg)
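R Code – From CEL to EXCEL
> library(affy)                                            # Bioconductor package for Affymetrix chips
> A <- ReadAffy()                                          # read all CEL files in the working directory
> rma_data <- rma(A)                                       # RMA expression values (log2 scale)
> write.exprs(rma_data, file = 'rma_expression.txt')       # tab-delimited text, opens in Excel
> mas5_data <- mas5(A)                                     # MAS5 expression values (not log-transformed)
> write.exprs(mas5_data, file = 'mas5_expression')
> mas5_calls <- mas5calls(A)                               # MAS5 detection calls (P/M/A)
> write.exprs(mas5_calls, file = 'mas5_detection')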
![Page 56: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/56.jpg)
The EXCEL
Filter the genes – do not cluster all probesets on the chip!
![Page 57: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/57.jpg)
Edit the EXCEL for CTWC
Title #1: U133_AFFX
Title #2: NAME
Column #2: Probeset info
Make the chip names clear!
![Page 58: Advanced Methods of Data Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062305/56814e41550346895dbbaf3d/html5/thumbnails/58.jpg)
Samples distance matrix
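A samples distance matrix can be computed in base R as sketched below. The Euclidean distance and the toy data are assumptions for illustration; the slide does not say which distance the exercise uses.

```r
set.seed(4)
# Toy matrix: 20 probesets x 4 chips; the two B chips differ from the
# A chips in the first 10 probesets.
expr <- matrix(rnorm(20 * 4, sd = 0.5), nrow = 20,
               dimnames = list(paste0("probe", 1:20),
                               c("A_1", "A_2", "B_1", "B_2")))
expr[1:10, c("B_1", "B_2")] <- expr[1:10, c("B_1", "B_2")] + 5

# dist() compares rows, so transpose to compare samples (columns).
sample_dist <- as.matrix(dist(t(expr)))
round(sample_dist, 1)
```

Replicates of the same condition should show small distances to each other and larger distances to the other conditions, which is a quick check that the chip names and the data agree.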