
Gibbs biclustering of microarray data

Yves Moreau & Qizheng Sheng

Katholieke Universiteit Leuven, ESAT-SCD (SISTA)

on leave at Center for Biological Sequence Analysis, Danish Technical University

April 18, 2023 CBS Microarray Course 2

Clustering

Form coherent groups of
  Genes
  Patient samples (e.g., tumors)
  Drug or toxin response
Study these groups to get insight into biological processes
  Diagnostic and prognostic classes
  Genes in the same cluster can have the same function or the same regulation
Clustering algorithms
  Hierarchical clustering
  K-means
  Self-Organizing Maps
  ...

What’s wrong with clustering?

Clustering is a long-solved problem ?!?
Many problems with current clustering algorithms
  PCA does not do any form of grouping
  Hierarchical clustering does not produce distinct groups
    Only a tree; it is then up to the user to pick nodes from the tree
  K-means does not tell you how many clusters are really present in the data
  ...

A wish list for clustering

We expect a lot from a clustering algorithm
  Fast and not memory hungry
    Can run easily on a large microarray data set (10,000 to 100,000 genes, >100 experiments)
  Partitioning of genes into distinct groups, automatically determining the “right” number of groups
  Robust
    If you remove some genes and some experiments, you want to obtain roughly the same groups
  Rejection of outliers (genes that do not clearly belong to any group)
  Probabilistic cluster membership
    One gene can belong to several clusters
  Incorporation of biological knowledge
    Maybe you want some known genes to cluster together
    Meaning of the clusters?
  Heterogeneous microarray data sources

Biclustering microarray data


From genome projects to transcriptome projects

Microarray cost per expression measurement
Budgets and expertise
Publicly available microarray data
  Need for exchange standards & repositories
Big consortia set up big microarray projects
  Genome projects to “transcriptome” projects (= compendia)
Change in microarray projects (as in sequence analysis)
  Analyze public data first to generate a hypothesis
  Design and perform your own microarray experiment

Why biclustering?

Data becomes more heterogeneous
Gene clustering
  Group genes that behave similarly over all conditions
Gene biclustering
  Group genes that behave similarly over a subset of conditions
  “Feature selection”
  More suitable for heterogeneous compendium

Probabilistic graphical models

[Diagram: graphical models linked to four application areas]
Biostatistics
  Bayesian statistics
  Clustering
  Decision support
Genetics
  Linkage analysis
  Phylogeny
Sequence analysis
  Modeling protein families
  Gene prediction
  Regulatory sequence analysis
Expression analysis
  Clustering
  Genetic network inference

Discretizing microarray data

Microarray data is continuous
Discretize by equal frequency
[Figure: distribution of expression values for a given gene, split into Low, Medium, and High; discretized microarray data set (genes by conditions) containing a bicluster]
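Equal-frequency discretization can be sketched as follows. This is a minimal illustration, assuming three levels (0 = low, 1 = medium, 2 = high) and cut points taken per gene at the 1/3 and 2/3 quantiles; the function name `discretize_equal_frequency` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def discretize_equal_frequency(expr, n_bins=3):
    """Map each gene (row) to n_bins levels holding roughly equal numbers of values."""
    expr = np.asarray(expr, dtype=float)
    # Interior quantile cut points per gene, shape (n_bins - 1, n_genes).
    probs = np.arange(1, n_bins) / n_bins
    cuts = np.quantile(expr, probs, axis=1)
    levels = np.zeros(expr.shape, dtype=int)
    for cut in cuts:
        # Every cut point a value exceeds raises its level by one.
        levels += expr > cut[:, None]
    return levels

rng = np.random.default_rng(0)
expr = rng.normal(size=(4, 9))   # toy data: 4 genes x 9 conditions
levels = discretize_equal_frequency(expr)
print(levels)
```

With 9 distinct values per gene, each gene ends up with three conditions at each level, whatever the shape of its expression distribution.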

Bicluster


Likelihood

[Figure: data matrix divided into background and pattern, with a color scale from 0 to 1]

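Likelihood

[Figure: color scale from 0 to 1; values shown in the figure:]
.9   .9   .9   .9   .9
.9   .05  .9   .9   .9
.9   .9   .9   .9   .9
.05  .9   .9   .9   .9
.9   .9   .9   .9   .05

$P(D \mid g, c, \theta)$, with $g$ the genes, $c$ the conditions, and $\theta$ the pattern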

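Likelihood

Get the right genes

[Figure: color scale from 0 to 1; values shown in the figure:]
.9   .05  .05  .05  .9
.05  .9   .9   .05  .05
.05  .05  .05  .05  .05
.05  .05  .9   .9   .05

Compare $P(D \mid g', c, \theta)$ with $P(D \mid g, c, \theta)$, where $g'$ is a different choice of genes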

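Likelihood

Get the right conditions

[Figure: color scale from 0 to 1; values shown in the figure:]
.9   .9   .05  .05  .9
.9   .05  .05  .9   .9
.9   .9   .05  .05  .9
.05  .9   .05  .05  .9
.9   .9   .05  .05  .05

Compare $P(D \mid g, c', \theta)$ with $P(D \mid g, c, \theta)$, where $c'$ is a different choice of conditions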

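Likelihood

Get the right frequency pattern

[Figure: color scale from 0 to 1; values shown in the figure:]
.6   .6   .2   .2   .6
.6   .2   .2   .2   .6
.6   .6   .2   .2   .6
.2   .6   .2   .2   .6
.2   .6   .2   .2   .2

Compare $P(D \mid g, c, \theta')$ with $P(D \mid g, c, \theta)$, where $\theta'$ is a different frequency pattern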

Optimizing the bicluster

Find the right bicluster
  Genes
  Conditions
  Pattern
For a given choice of genes and conditions, the “best” pattern is given by the frequencies found in the extracted pattern
  No more need to optimize over the pattern
Maximum likelihood: find genes and conditions that maximize $P(D \mid g, c)$
Gibbs sampling: find genes and conditions that optimize $P(g, c \mid D)$
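The idea that the pattern follows from the chosen genes and conditions can be sketched as below. The model here is an assumption made for illustration, not necessarily the authors' exact formulation: each selected condition has its own level frequencies inside the bicluster, every other cell follows one background distribution, and a pseudocount of 1 keeps frequencies away from zero. All names and the toy data are hypothetical.

```python
import numpy as np

def pattern_frequencies(levels, genes, conds, n_levels=3, pseudo=1.0):
    """Level frequencies per selected condition, estimated from the block itself."""
    block = levels[np.ix_(genes, conds)]
    # counts[j, k] = number of selected genes at level k in selected condition j
    counts = np.stack([(block == k).sum(axis=0) for k in range(n_levels)], axis=1)
    counts = counts + pseudo          # assumed pseudocount, avoids log(0)
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(levels, genes, conds, background):
    """log P(D | g, c): block cells use the block frequencies, the rest the background."""
    theta = pattern_frequencies(levels, genes, conds, n_levels=len(background))
    block = levels[np.ix_(genes, conds)]
    cond_index = np.arange(len(conds))[None, :]
    ll_pattern = np.log(theta[cond_index, block]).sum()
    ll_background = np.log(background[levels]).sum() - np.log(background[block]).sum()
    return ll_pattern + ll_background

# Toy discretized data: 20 genes x 8 conditions with levels 0..2.
levels = (np.arange(20)[:, None] + np.arange(8)[None, :]) % 3
true_genes, true_conds = np.arange(8), np.arange(4)
levels[np.ix_(true_genes, true_conds)] = np.array([2, 0, 2, 1])  # planted pattern
background = np.full(3, 1 / 3)

ll_true = log_likelihood(levels, true_genes, true_conds, background)
ll_wrong_genes = log_likelihood(levels, np.arange(8, 16), true_conds, background)
print(ll_true > ll_wrong_genes)   # prints True
```

Choosing the wrong genes lowers the likelihood because their cells do not share one level per condition, so the estimated frequencies are flat.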

Gibbs sampling


Markov Chain Monte-Carlo

Markov chain with transition matrix T

$T_{ij} = P(X_{t+1} = j \mid X_t = i)$

      A       C       G       T
A   0.0643  0.8268  0.0659  0.0430
C   0.0598  0.0484  0.8515  0.0403
G   0.1602  0.3407  0.1736  0.3255
T   0.1507  0.1608  0.3654  0.3231

[Figure: state diagram with states X=A, X=C, X=G, X=T]
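A chain with this transition matrix can be simulated in a few lines. The sketch below assumes that rows index the current state and columns the next state, which matches the fact that each row of the table sums to 1; the start state, seed, and chain length are arbitrary choices.

```python
import numpy as np

states = np.array(list("ACGT"))
T = np.array([
    [0.0643, 0.8268, 0.0659, 0.0430],   # from A
    [0.0598, 0.0484, 0.8515, 0.0403],   # from C
    [0.1602, 0.3407, 0.1736, 0.3255],   # from G
    [0.1507, 0.1608, 0.3654, 0.3231],   # from T
])

def simulate(T, n_steps, start=0, seed=0):
    """Draw a path of n_steps states; the next state depends only on the current one."""
    rng = np.random.default_rng(seed)
    x = start
    path = [x]
    for _ in range(n_steps - 1):
        x = rng.choice(len(T), p=T[x])
        path.append(x)
    return np.array(path)

path = simulate(T, 400)
print("".join(states[path]))

# Long-run state frequencies: repeatedly apply T to any starting distribution.
pi = np.full(4, 0.25)
for _ in range(200):
    pi = pi @ T
print(dict(zip(states, np.round(pi, 3))))
```

The second part shows the distribution the chain settles into regardless of where it started, which is the property that Markov Chain Monte-Carlo relies on.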

Markov Chain Monte-Carlo

Markov chains can sample from complex distributions

ACGCGGTGTGCGTTTGACGAACGGTTACGCGACGTTTGGTACGTGCGGTGTACGTGTACGACGGAGTTTGCGGGACGCGTACGCGCGTGACGTACGCGTGAGACGCGTGCGCGCGGACGCACGGGCGTGCGCGCGTCGCGAACGCGTTTGTGTTCGGTGCACCGCGTTTGACGTCGGTTCACGTGACGCGTAGTTCGACGACGTGACACGGACGTACGCGACCGTACTCGCGTTGACACGATACGGCGCGGCGGGCGCGGACGTACGCGTACACGCGGGAACGCGCGTGTTTACGACGTGACGTCGCACGCGTCGGTGTGACGGCGGTCGGTACACGTCGACGTTGCGACGTGCGTGCTGACGGAACGACGACGCGACGCACGGCGTGTTCGCGGTGCGG

[Figure: percentage of A, C, G, T by position]

Gibbs sampling

Markov chain for Gibbs sampling

$P(A, B, C) \rightarrow P(A \mid B, C),\; P(B \mid A, C),\; P(C \mid A, B)$

Start from $(a_0, b_0, c_0)$, then update one variable at a time:
$a_{i+1} \sim P(A \mid B = b_i, C = c_i)$, giving $(a_{i+1}, b_i, c_i)$
$b_{i+1} \sim P(B \mid A = a_{i+1}, C = c_i)$, giving $(a_{i+1}, b_{i+1}, c_i)$
$c_{i+1} \sim P(C \mid A = a_{i+1}, B = b_{i+1})$, giving $(a_{i+1}, b_{i+1}, c_{i+1})$

$\lim_{k \to \infty} P(A_k, B_k, C_k) = P(A, B, C)$
$\lim_{k \to \infty} P(A_k) = P(A); \quad \lim_{k \to \infty} P(B_k) = P(B); \quad \lim_{k \to \infty} P(C_k) = P(C)$

Gibbs sampling

True target distribution (2D normal $N(\mu, \Sigma)$)

            mean μ    covariance Σ
true        0         1       0.5
            0         0.5     1

Gibbs sampling

First 20 Gibbs sampling iterates (conditionals are 1D normals)

Gibbs sampling

Burn-in samples (1000 samples)

            mean μ    covariance Σ
true        0         1       0.5
            0         0.5     1
burn-in     0.3634    1.1243  0.7443
            0.4190    0.7443  1.3724

Gibbs sampling

Samples after Markov chain convergence (samples 1000-2000)

            mean μ    covariance Σ
true        0         1       0.5
            0         0.5     1
burn-in     0.3634    1.1243  0.7443
            0.4190    0.7443  1.3724
converged   0.0187    1.0282  0.5052
            0.0443    0.5052  1.0621
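The 2D normal example can be reproduced in spirit with the sketch below. It assumes zero mean, unit variances, and correlation 0.5, so each conditional is the 1D normal $N(\rho x, 1 - \rho^2)$; the starting point (3, 3) and the seed are arbitrary assumptions, so the printed numbers will not match the values on the slides.

```python
import numpy as np

def gibbs_bivariate_normal(n_samples, rho=0.5, start=(3.0, 3.0), seed=0):
    """Gibbs sampler for a zero-mean 2D normal with unit variances and correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)          # standard deviation of each conditional
    x1, x2 = start
    samples = np.empty((n_samples, 2))
    for k in range(n_samples):
        x1 = rng.normal(rho * x2, sd)     # draw x1 given the current x2
        x2 = rng.normal(rho * x1, sd)     # draw x2 given the new x1
        samples[k] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(2000)
burn_in, kept = samples[:1000], samples[1000:]
print("burn-in mean:", burn_in.mean(axis=0))
print("kept mean:   ", kept.mean(axis=0))
print("kept covariance:\n", np.cov(kept.T))
```

Estimates computed from the samples kept after burn-in should lie close to the true mean and covariance, and they tighten as more samples are drawn.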

Data augmentation Gibbs sampling

Introducing unobserved variables often simplifies the expression of the likelihood
A Gibbs sampler can then be set up

$P(\theta, M \mid D) \rightarrow P(\theta \mid M, D),\; P(M \mid \theta, D)$
$\rightarrow P(\theta_i \mid \theta_{j \neq i}, M, D),\; P(M_i \mid M_{j \neq i}, \theta, D)$

$\theta$: model parameters, $M$: missing data, $D$: data

Samples from the Gibbs sampler can be used to estimate parameters

$\theta_{\mathrm{PME}} = E(\theta \mid D) = \int_{\theta} \int_{M} \theta \, P(\theta, M \mid D) \, dM \, d\theta \approx \frac{1}{N} \sum_{k=1}^{N} \theta_k$

Pros and cons

Gibbs sampling
  Explore the space of configurations of a probabilistic model of the data according to the probability of each configuration
  Based on incrementally perturbing the configuration one variable at a time, preferentially choosing more likely configurations
Pros
  Clear probabilistic interpretation
  Bayesian framework
  “Global optimization”
Cons
  Mathematical details not easy to work out
  Relatively slow

Gibbs biclustering


Gibbs sampling

Current configuration
  $P(g_1 = 1 \mid g_{j \neq 1}, c, D)$?
  $P(g_2 = 1 \mid g_{j \neq 2}, c, D)$?
Next gene configuration
  $P(g_3 = 1 \mid g_{j \neq 3}, c, D)$?

Updated gene configuration

Next complete configuration

Iterate many times

Gibbs biclustering

$P(g, c \mid D) \rightarrow P(g_i \mid g_{j \neq i}, c, D),\; P(c_i \mid c_{j \neq i}, g, D)$
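The alternating updates of gene and condition labels can be sketched as below. This is an illustrative simplification under stated assumptions, not the authors' implementation: the data are already discretized, each selected condition has its own level frequencies with a Dirichlet prior (`alpha`) that is integrated out, all other cells follow a smoothed background distribution, and genes and conditions share one Bernoulli prior. All function names, parameters, and the toy data are hypothetical.

```python
import math
import numpy as np

def sigmoid(x):
    """Numerically safe logistic function."""
    x = max(-50.0, min(50.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def log_marginal(counts, alpha):
    """Log Dirichlet-multinomial probability of one condition's bicluster cells."""
    n_levels = len(counts)
    total = math.lgamma(n_levels * alpha) - math.lgamma(counts.sum() + n_levels * alpha)
    return total + sum(math.lgamma(n + alpha) - math.lgamma(alpha) for n in counts)

def gibbs_bicluster(levels, n_levels=3, n_iter=200, burn_in=100,
                    prior=0.5, alpha=1.0, seed=0):
    """Return membership of genes and conditions averaged over post-burn-in sweeps."""
    rng = np.random.default_rng(seed)
    n_genes, n_conds = levels.shape
    # Background level frequencies, smoothed so that no level has probability 0.
    bg = (np.bincount(levels.ravel(), minlength=n_levels) + 1.0) / (levels.size + n_levels)
    log_bg = np.log(bg)
    log_prior_odds = math.log(prior) - math.log(1.0 - prior)
    g = rng.random(n_genes) < prior          # random initial gene labels
    c = rng.random(n_conds) < prior          # random initial condition labels
    g_sum, c_sum = np.zeros(n_genes), np.zeros(n_conds)
    for it in range(n_iter):
        for i in range(n_genes):             # sample g_i given g_{j != i}, c, D
            g[i] = False
            conds = np.flatnonzero(c)
            others = levels[g][:, conds]     # other member genes x selected conditions
            log_odds = log_prior_odds
            for col, j in enumerate(conds):
                matches = np.sum(others[:, col] == levels[i, j])
                pred = (matches + alpha) / (others.shape[0] + n_levels * alpha)
                log_odds += math.log(pred) - log_bg[levels[i, j]]
            g[i] = rng.random() < sigmoid(log_odds)
        for j in range(n_conds):             # sample c_j given c_{k != j}, g, D
            cells = levels[g, j]
            counts = np.bincount(cells, minlength=n_levels)
            log_odds = log_prior_odds + log_marginal(counts, alpha) - log_bg[cells].sum()
            c[j] = rng.random() < sigmoid(log_odds)
        if it >= burn_in:                    # average configurations after burn-in
            g_sum += g
            c_sum += c
    return g_sum / (n_iter - burn_in), c_sum / (n_iter - burn_in)

# Toy data: random background with a planted bicluster (first 10 genes, first 8 conditions).
rng = np.random.default_rng(1)
levels = rng.integers(0, 3, size=(30, 12))
levels[:10, :8] = rng.integers(0, 3, size=8)   # one level per planted condition
gene_prob, cond_prob = gibbs_bicluster(levels)
print(np.round(gene_prob, 2))
print(np.round(cond_prob, 2))
```

The returned values are the fraction of retained sweeps in which each gene or condition was part of the bicluster, so they can be thresholded to read off a final configuration.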

Simulated data


Remarks

Gibbs biclustering allows noisy patterns
Optimized configuration is obtained by averaging successive iterated configurations
Biclustering is oriented
  Find a subset of samples for which a subset of genes is consistently expressed across those samples
  Find a subset of genes that are consistently expressed across a subset of samples
Searching for multiple patterns
  For gene biclustering, remove the data of the genes from the current bicluster
  Search for a new pattern
  Stop if only the empty pattern is repeatedly found
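The search for multiple patterns can be sketched as a loop around any single-bicluster search. In this illustration `find_one` is a hypothetical callable standing in for one Gibbs biclustering run, `toy_finder` is a deliberately trivial stand-in used only to make the example runnable, and the stopping parameters `max_empty` and `max_found` are assumptions.

```python
import numpy as np

def find_multiple_biclusters(levels, find_one, max_empty=3, max_found=10):
    """Repeatedly search, removing the genes of each bicluster that is found."""
    remaining = np.arange(levels.shape[0])       # genes still available
    found, empty_runs = [], 0
    while len(remaining) > 0 and empty_runs < max_empty and len(found) < max_found:
        genes, conds = find_one(levels[remaining])
        if len(genes) == 0 or len(conds) == 0:
            empty_runs += 1                      # empty pattern: try again, then give up
            continue
        empty_runs = 0
        found.append((remaining[genes], np.asarray(conds)))
        remaining = np.delete(remaining, genes)  # mask the genes of this bicluster
    return found

def toy_finder(block):
    """Stand-in for a Gibbs run: rows identical to the first row, if there are at least 2."""
    empty = np.array([], dtype=int)
    if len(block) == 0:
        return empty, empty
    same = np.flatnonzero((block == block[0]).all(axis=1))
    if len(same) < 2:
        return empty, empty
    return same, np.arange(block.shape[1])

levels = np.array([[0, 1], [0, 1], [2, 2], [2, 2], [1, 0]])
for genes, conds in find_multiple_biclusters(levels, toy_finder):
    print("genes", genes.tolist(), "conditions", conds.tolist())
```

With a stochastic search such as Gibbs sampling, repeating the attempt after an empty result is meaningful because each run may end in a different configuration.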

Multiple biclusters


Leukemia fingerprints


Mixed-Lineage Leukemia

Armstrong et al., Nature Genetics, 2002
Mixed-Lineage Leukemia (MLL) is a subtype of ALL
  Caused by chromosomal rearrangement in MLL gene
  Poorer prognosis than ALL
Microarray analysis shows that MLL is distinct from ALL
FLT3 tyrosine kinase distinguishes most strongly between MLL, ALL, and AML
  Candidate drug target

PCA Features


Biclustering leukemia data

Bicluster patients
  Find patients for which a subset of genes has a consistent expression profile across this group of patients
Discovery set
  21 ALL, 17 MLL, 25 AML
Validation set
  3 ALL, 3 MLL, 3 AML

Discovering ALL

Bicluster 1: 18 out of 21 ALL patients

Discovering MLL

Bicluster 2: 14 out of 17 MLL patients

Discovering AML

Bicluster 3: 19 out of 25 AML patients

Rescoring ALL


Rescoring MLL


Rescoring AML

K.U.Leuven ESAT-SCD-Bioi

Qizheng Sheng