Advanced Methods of Data Analysis

• 9:00 - 10:00 CTWC

• 10:00 - 11:00 CTWC exercise

• 11:00 - 11:30 Break

• 11:30 - 12:00 SPIN

• 12:00 - 13:00 SPIN exercise

Course on Microarray Data Acquisition and Analysis, Weizmann Institute of Science, 16 May 2007

Presented by Tal Shay & Yuval Tabach, Weizmann Institute of Science, Rehovot, Israel

Coupled Two-Way Clustering (CTWC)

Gad Getz, Erel Levine, and Eytan Domany. Coupled two-way clustering analysis of gene microarray data. PNAS 97: 12079-12084 (2000)

Talk Aim

Show how to use the CTWC server to analyze microarray data properly.

Motivation

• Microarray experiments generate millions of numbers that contain a lot of biological information.

• The problem: the data are very complicated and contain a large amount of noise. How can we unravel the biological information that is masked by a mass of irrelevant information?

• CTWC is a simple heuristic clustering procedure that was developed specifically to cope with microarray data.

Talk Outline

• Preprocessing and filtering

• Clustering of Genes and Conditions

• Super-Paramagnetic Clustering (SPC)

• Coupled Two-Way Clustering (CTWC)

• CTWC server

• Exercise

Gene Expression Matrix – CTWC format

DB_NAME   Name    Sample1   Sample2   Sample3
Acc1      Gene1   E11       E12       E13

The DB_NAME column header is used to link genes to a database.

Visualization of Expression Matrix

• Column = chip (= sample)

• Row = probeset

• Color = expression level

[Figure: heat map of the expression matrix; rows = genes, columns = samples]

Preprocessing

1. Select variable genes

2. Standardize

[Figure: initial expression matrix; rows = genes, columns = samples]

[Figure: after step 1, the 1000 probesets with the highest standard deviation]

[Figure: after step 2, the 1000 probesets with the highest standard deviation, standardized]
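The two preprocessing steps can be sketched in R roughly as follows. This is a minimal illustration on a made-up toy matrix, not the script used in the course. It assumes that "standardize" means centering each probeset to mean 0 and scaling it to standard deviation 1; the names `expr`, `n_keep`, `filtered` and `standardized` are hypothetical.

```r
# Toy expression matrix: 6 probesets (rows) x 4 samples (columns).
# All values are made up for illustration.
expr <- rbind(
  p1 = c(1, 1, 1, 1),
  p2 = c(1, 2, 3, 4),
  p3 = c(5, 5, 6, 6),
  p4 = c(0, 10, 0, 10),
  p5 = c(2, 2, 2, 3),
  p6 = c(8, 4, 8, 4)
)
colnames(expr) <- paste0("Sample", 1:4)

# Step 1: keep the n_keep probesets with the highest standard deviation.
# The slides use 1000; 3 is used here only because the toy matrix is tiny.
n_keep <- 3
row_sd <- apply(expr, 1, sd)
keep <- order(row_sd, decreasing = TRUE)[seq_len(n_keep)]
filtered <- expr[keep, , drop = FALSE]

# Step 2 (assumed meaning of "standardize"): give every probeset (row)
# mean 0 and standard deviation 1. scale() works on columns, hence t().
standardized <- t(scale(t(filtered)))

print(rownames(filtered))
print(round(standardized, 2))
```

Standardizing rows puts all selected probesets on the same scale, so that a distance between two genes reflects the shape of their profiles rather than their absolute expression level.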

What questions can we ask?

• Which genes are expressed differently in two known types of samples?

• What is the minimal set of genes needed to distinguish one type of sample from the others?

Supervised methods: hypothesis testing (use predefined labels)

• Which genes behave similarly in the experiments?

• How many different types of samples are there?

Unsupervised methods: exploratory analysis (use only the data)

Clustering – unsupervised analysis

[Figure: all genes are filtered (low variation genes are dropped, high variation genes are kept) and then clustered; rows = genes, columns = samples. The result shows 3 clusters, each containing highly correlated genes.]

Unsupervised Analysis

• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and might be co-regulated. Use: learn about the biology, infer function.

• Goal B: Divide conditions into groups with similar gene expression profiles. Examples: find sub-types of a disease, or group drugs according to their effect.

Clustering Methods

Giraffe

DEFINITION OF THE CLUSTERING PROBLEM

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)

How many clusters do we have? The answer depends on the resolution.

Giraffe + Okapi

BUT WHAT ABOUT THE OKAPI?

Clustering problem definition

• Input: N data points, X_i, i = 1, 2, …, N, in a D-dimensional space.

• Goal: Find “natural” groups (clusters) of points. Points that belong to the same cluster are “more similar” to each other than to points in other clusters.

Clustering is not well defined

• Similarity: which points should be considered close?

• Clustering method:

  – Resolution: specified in advance, or hierarchical results

  – Shape of clusters: general, spherical

Agglomerative Hierarchical Clustering

• Results depend on the distance update method:

  – Single Linkage: elongated clusters

  – Average Linkage: sphere-like clusters

• Greedy iterative process

• NOT robust against noise

• Does not always find the “natural” clusters
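The dependence on the distance update method can be seen on a small example. The sketch below uses base R's `hclust()` on eight made-up points on a line, with gaps that grow slowly from left to right; the data and the variable names are illustrative assumptions, not taken from the course material.

```r
# Eight points on a line; the gaps between neighbours are
# 1.0, 1.1, 1.2, 1.3, 1.4, 1.6 and 0.1.
x <- c(0, 1.0, 2.1, 3.3, 4.6, 6.0, 7.6, 7.7)
d <- dist(x)  # Euclidean distances between all pairs of points

# Same data, same distances, two different linkage rules,
# each tree cut into two clusters.
single  <- cutree(hclust(d, method = "single"),  k = 2)
average <- cutree(hclust(d, method = "average"), k = 2)

print(table(single))   # cluster sizes under single linkage
print(table(average))  # cluster sizes under average linkage
```

Single linkage chains the first six points into one elongated cluster and leaves the last two as the second cluster, while average linkage splits the same points into two compact groups of four.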

Stop … think

• We want to identify the real (“natural”) clusters.

• We should have a reliability parameter that will help us to distinguish between significant and non-significant clusters.

Super-Paramagnetic Clustering (SPC) M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation

• The idea behind SPC is based on the physical properties of dilute magnets.

• SPC calculates the correlation between magnet (spin) orientations at different temperatures (T).

[Figure: the data points shown as small elements (spins) at T = Low, T = Intermediate and T = High]

Phases of the Inhomogeneous Potts Ferromagnet

• T = Low: Ferro (ferromagnetic phase)

• T = Intermediate: Super-Para (super-paramagnetic phase)

• T = High: Para (paramagnetic phase)

Super-Paramagnetic Clustering (SPC)

• The algorithm simulates the magnets' behavior over a range of temperatures and decides which interactions to break.

• The temperature (T) controls the resolution

Super-Paramagnetic Clustering (SPC)

Example: N=4800 points in D=2

Identify the stable clusters

T=16

Same data - Average Linkage

Advantages of SPC

• Scans all resolutions (T)

• Robust against noise and initialization: it calculates collective correlations.

• Identifies “natural” and stable clusters (clusters that persist over a wide range of T)

• No need to pre-specify number of clusters

• Clusters can be any shape

Inside SPC: dendrogram and stable clusters

[Figure: SPC dendrogram; vertical axis = T]

CTWC server - Setting the SPC parameters

[Screenshot: SPC parameter form with columns for Genes and Samples]

• Min Cluster Size: 3

• Stable Delta T: 14

• Ignore dropout: 1

Back to gene expression data

• 2 goals: cluster genes and conditions

• 2 independent clusterings:

  – Genes are represented as vectors of expression in all conditions

  – Conditions are represented as vectors of expression of all genes
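In code, the two representations differ only by a transpose of the expression matrix. The sketch below is a minimal illustration on a made-up 5 x 3 matrix; the use of Euclidean distance is an assumption, since the slides do not say which distance the server uses.

```r
# Toy standardized expression matrix: 5 genes (rows) x 3 conditions (columns).
# All values are made up for illustration.
expr <- matrix(
  c( 1.0,  0.0, -1.0,
     0.9,  0.1, -1.0,
    -1.0,  0.0,  1.0,
    -0.8, -0.2,  1.0,
     0.0,  1.0, -1.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:5), paste0("Cond", 1:3))
)

# Clustering genes: each gene (row) is a point in D = 3 dimensions,
# one dimension per condition.
gene_dist <- dist(expr)

# Clustering conditions: each condition (column) is a point in D = 5
# dimensions, one dimension per gene. dist() compares rows, hence t().
cond_dist <- dist(t(expr))

print(round(as.matrix(gene_dist), 2))  # 5 x 5 matrix of gene distances
print(round(as.matrix(cond_dist), 2))  # 3 x 3 matrix of condition distances
```

For the colon cancer data used in the following slides, the same idea gives points in D = 62 dimensions when clustering genes and points in D = 2000 dimensions when clustering experiments.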

[Figure: Colon cancer data (normalized genes). Heat map with Experiments on the x-axis, Genes on the y-axis, and a color scale for the expression level.]

First clustering - Experiments

1. Identify tissue classes (tumor/normal)

[Figure: Colon cancer data (normalized genes) heat map; each experiment is a point in D = 2000 dimensions, one per gene.]

Second Clustering - Genes

2. Find differentiating and correlated genes

[Figure: Colon cancer data (normalized genes) heat map; each gene is a point in D = 62 dimensions, one per experiment.]

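Two-way clustering

Notation: S1(G1) is the clustering of all samples (S1) using all genes (G1); G1(S1) is the clustering of all genes (G1) using all samples (S1).

[Figure: colon cancer heat map (x-axis = Experiments, y-axis = Genes) with the sample clustering S1(G1) and the gene clustering G1(S1)]

Two-way clustering - ordered

[Figure: the colon cancer heat map with rows and columns reordered according to G1(S1) and S1(G1)]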

Song A

Song B

Coupled Two-Way Clustering (CTWC) G. Getz, E. Levine and E. Domany (2000) PNAS

• Philosophy: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibits the expression patterns of interest.

• New goal: Use subsets of genes to study subsets of samples (and vice versa).

• A non-trivial task: the number of subsets is exponential.

• CTWC is a heuristic to solve this problem.

Inside CTWC: Iterations

Depth  Genes                                      Samples
       clustering            new clusters         clustering            new clusters
Init                         G1                                         S1
1      G1(S1)                G2, G3, … G5         S1(G1)                S2, S3
2      G1(S2)                G6, G7, … G13        S1(G2)                S4, S5, S6
       G1(S3)                G14, … G21           …                     S10, S11
                                                  S1(G5)                None
3      G2(S1) … G2(S3)       G22 …                S2(G1) … S2(G5)       S12, …
       …                     …                    S3(G1) … S3(G5)       … S51
       G5(S1) … G5(S3)       … G97
4      G1(S4)                G98, .. G105         S1(G6)                S52, ...
       …                     …                    …                     S67
       G1(S11)               G151, .. G160        S1(G21)
5      G2(S4) ... G2(S11)    G161 …               S2(G6) ... S2(G21)    S68 …
       …                     …                    S3(G6) … S3(G21)      … S113
       G5(S4) ... G5(S11)    … G216
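The iteration scheme in the table can be sketched in R as below. This is only an illustration of the bookkeeping, under loud assumptions: SPC is replaced by a stand-in (an average-linkage tree cut into two groups, accepted only if the top merge is much higher than the one below it), and `find_clusters`, `add_new`, `ctwc_sketch`, `stable_ratio` and the toy matrix are all hypothetical. It is not the CTWC server's implementation.

```r
# Stand-in for SPC (an assumption, NOT the real algorithm): cut an
# average-linkage tree into two groups and accept the split only if the
# top merge is clearly higher than the previous one (a crude "stability" test).
find_clusters <- function(mat, min_size = 2, stable_ratio = 0.5) {
  n <- nrow(mat)
  if (n < max(3, 2 * min_size)) return(list())
  hc <- hclust(dist(mat), method = "average")
  top <- hc$height[n - 1]
  below <- hc$height[n - 2]
  if (top <= 0 || below > stable_ratio * top) return(list())
  parts <- split(rownames(mat), cutree(hc, k = 2))
  parts[lengths(parts) >= min_size]
}

# Append clusters that are not already known, naming them G2, G3, ... or S2, S3, ...
add_new <- function(existing, found, prefix) {
  for (members in found) {
    known <- any(vapply(existing, function(e) setequal(e, members), logical(1)))
    if (!known) existing[[paste0(prefix, length(existing) + 1)]] <- members
  }
  existing
}

ctwc_sketch <- function(expr, max_depth = 2, min_size = 2) {
  genes   <- list(G1 = rownames(expr))  # G1 = all genes
  samples <- list(S1 = colnames(expr))  # S1 = all samples
  done <- character(0)                  # (gene set, sample set) pairs already used
  for (depth in seq_len(max_depth)) {
    new_genes <- list()
    new_samples <- list()
    for (g in names(genes)) for (s in names(samples)) {
      pair <- paste(g, s)
      if (pair %in% done) next
      done <- c(done, pair)
      sub <- expr[genes[[g]], samples[[s]], drop = FALSE]
      new_genes   <- c(new_genes,   find_clusters(sub, min_size))     # g(s)
      new_samples <- c(new_samples, find_clusters(t(sub), min_size))  # s(g)
    }
    genes   <- add_new(genes, new_genes, "G")
    samples <- add_new(samples, new_samples, "S")
  }
  list(genes = genes, samples = samples)
}

# Toy data: genes g1-g4 separate samples s1-s3 from s4-s6;
# genes g5-g6 separate samples s1 and s4 from the rest.
expr <- rbind(
  g1 = c(8, 8, 8, 0, 0, 0), g2 = c(8, 8, 8, 0, 0, 0),
  g3 = c(8, 8, 8, 0, 0, 0), g4 = c(8, 8, 8, 0, 0, 0),
  g5 = c(4, 0, 0, 4, 0, 0), g6 = c(4, 0, 0, 4, 0, 0)
)
colnames(expr) <- paste0("s", 1:6)

res <- ctwc_sketch(expr, max_depth = 2)
str(res)
```

On this toy matrix, clustering all samples with all genes yields only the split s1-s3 versus s4-s6. The split of s1 and s4 from the other samples appears at depth 2, when the samples are clustered using the small gene cluster made of g5 and g6, which is the kind of masked structure CTWC is designed to expose.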

CTWC server - Setting the coupled two-way clustering parameters

[Screenshot: parameter form; the visible fields include “Two-way clustering” and “E-mail notification”]

COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES

[Figure: colon cancer expression matrix with the gene clusters G4 and G12 marked, together with the sample clusterings S1(G4) and S1(G12); sample groups are labeled A and B]

CTWC colon cancer - tissues

[Figure: sample-by-sample plots for S1(G4) and S1(G12); the sample cluster S17 is marked]

What kind of results do you wish to find?

Type A / type B distance matrix

CTWC software

• Web interface

  – ctwc.weizmann.ac.il

  – ctwc.bioz.unibas.ch

• Standalone

  – Write to [email protected]

CTWC standalone

Sample Labels

• Given as a binary file

• For a cluster C1 and a label L with values L1 and L2:

• Purity(C1, L1): how much of C1 is composed of L1?

  Purity(C1, L1) = (#L1 in C1) / |C1|

• Efficiency(C1, L1): how much of L1 is contained in C1?

  Efficiency(C1, L1) = (#L1 in C1) / |L1|
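Both scores are simple ratios of counts, as the following R sketch shows. The sample names, the tumor/normal labels and the function names are made up for illustration and are not part of the CTWC software.

```r
# Hypothetical sample labels: 4 tumor samples and 2 normal samples.
labels <- c(s1 = "tumor", s2 = "tumor", s3 = "tumor",
            s4 = "normal", s5 = "normal", s6 = "tumor")

# A hypothetical sample cluster C1 found by some clustering run.
C1 <- c("s1", "s2", "s4")

# Purity(C1, L1): fraction of the cluster's members that carry label value L1.
purity <- function(cluster, labels, value) {
  sum(labels[cluster] == value) / length(cluster)
}

# Efficiency(C1, L1): fraction of all samples labeled L1 that fall inside the cluster.
efficiency <- function(cluster, labels, value) {
  sum(labels[cluster] == value) / sum(labels == value)
}

print(purity(C1, labels, "tumor"))      # 2 of the 3 cluster members are tumor
print(efficiency(C1, labels, "tumor"))  # 2 of the 4 tumor samples are in C1
```

A cluster with high purity but low efficiency is a clean yet partial subgroup of the label, whereas high efficiency with low purity means the cluster captures the label but mixes in other samples.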

Biological Work

• Literature search for information on interesting genes.

• Annotation analysis: classify the genes according to their function.

• Find whether there is a common function or biological meaning for clusters of interest.

• Find what sets of experiments/conditions have in common.

• Genomics analysis: search for a common regulatory signal upstream of the genes.

• Design the next experiment: get more data to validate the result.

Remember: most of your work starts here, with understanding the biology behind your results.

Summary

• Clustering methods are used to

  – find genes from the same biological process

  – group the experiments into similar conditions

• Focusing on subsets of the genes and conditions can unravel structure that is masked when using all genes and conditions

ctwc.weizmann.ac.il

or

[email protected]

Exercise - Course Experiment

       NT             48hr            72hr            96hr
D8     D8_NT_s_1b     D8_48h_s_1b     D8_72h_s_1b     D8_96h_s_1b
       D8_NT_c_1a     D8_48h_c_1a     D8_72h_c_1a     D8_96h_c_1a
       D8_NT_c_2      D8_48h_c_2                      D8_96h_c_2
D11    D11_NT_s_2     D11_48h_c_1a    D11_72h_s_2     D11_96h_c_1a
       D11_NT_c_1a    D11_48h_c_1b    D11_72h_c_1a    D11_96h_c_1b
       D11_NT_c_1b                    D11_72h_c_1b

At time 0 a treatment is given.

For D8, the treatment suppresses mutp53.

For D11, the treatment does not.

The Data

Save and backup the CEL files!

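R Code – From CEL to EXCEL

> library(affy)                      # Bioconductor package for Affymetrix chips
> A = ReadAffy()                     # read all CEL files in the working directory
> # RMA expression values (log2 scale)
> rma_data = rma(A)
> write.exprs(rma_data, file = 'rma_expression.txt')
> # MAS5 expression values (not log-transformed)
> mas5_data = mas5(A)
> write.exprs(mas5_data, file = 'mas5_expression.txt')
> # MAS5 detection calls (present / marginal / absent)
> mas5_calls = mas5calls(A)
> write.exprs(mas5_calls, file = 'mas5_detection.txt')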

The EXCEL

Filter the genes – do not cluster all probesets on the chip!

Edit the EXCEL for CTWC

Title #1: U133_AFFX

Title #2: NAME

Column #2: Probeset info

Make the chip names clear!
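The same layout can also be produced directly from R instead of by hand in Excel. The sketch below is only an illustration: the probeset identifiers, gene names and expression values are placeholders, and writing a tab-delimited text file is an assumption about what the server accepts, so check the server's instructions before uploading.

```r
# Placeholder expression values for two probesets on three chips.
# The column names are taken from the course sample names, so they are clear.
expr <- matrix(
  c(7.1, 7.3, 9.8,
    5.2, 5.0, 5.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("probeset_1", "probeset_2"),
                  c("D8_NT_c_1a", "D8_48h_c_1a", "D8_72h_c_1a"))
)

# Title #1 (U133_AFFX) heads the probeset identifiers,
# Title #2 (NAME) heads the probeset info column.
ctwc_table <- data.frame(
  U133_AFFX = rownames(expr),
  NAME = c("GeneA", "GeneB"),  # hypothetical probeset info
  expr,
  check.names = FALSE
)

# Tab-delimited output is an assumption, see the note above.
out_file <- tempfile(fileext = ".txt")
write.table(ctwc_table, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

file_lines <- readLines(out_file)
print(file_lines)
```

Keeping the chip names in the header row short and descriptive makes the sample clusters in the server output much easier to read.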

Samples distance matrix
