Transcript, posted 20-Dec-2015.

Page 1: Cis/TF discovery for Arabidopsis Aristotelis Tsirigos email: tsirigos@cs.nyu.edu NYU Computer Science.

Cis/TF discovery for Arabidopsis

Aristotelis Tsirigos
email: tsirigos@cs.nyu.edu

NYU Computer Science


Outline

• Input data

• The proposed model

• Results on yeast

• Results on Arabidopsis

• Unsupervised pattern discovery


Input data

Input data

[Diagram: expression matrix, ~23,000 genes × 25 points; for each gene, the 1,500 bp upstream sequence (gctaagc...)]

Normalization

[Diagram: expression matrix, ~23,000 genes × 25 points; for each gene, the 1,500 bp upstream sequence (gctaagc...)]

normalize columns (mean=0)

Filtering

[Diagram: expression matrix, ~23,000 genes × 25 points; for each gene, the 1,500 bp upstream sequence]

normalize columns (mean=0, stdev=1)

filter out low-variance genes

[Diagram: filtered expression matrix, ~5,000 genes × 25 points; each upstream sequence (gctaagc...) is encoded as a motif bitmap (001011…)]
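The preprocessing on the three slides above can be sketched in a few lines. This is only an illustration under stated assumptions: the variance cutoff `min_variance`, the toy matrix, and the function names are hypothetical, and the slides do not say how the low-variance threshold was chosen or how the motif bitmap is built beyond one bit per motif.

```python
from statistics import mean, pstdev, pvariance

def normalize_columns(matrix):
    """Scale every column (time point) to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*matrix)]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in matrix
    ]

def filter_low_variance(genes, matrix, min_variance):
    """Keep (gene, row) pairs whose row variance reaches min_variance (assumed cutoff)."""
    return [(g, row) for g, row in zip(genes, matrix)
            if pvariance(row) >= min_variance]

def motif_bitmap(upstream, motifs):
    """One bit per candidate motif: 1 if it occurs in the upstream sequence."""
    return [1 if m in upstream else 0 for m in motifs]

# Toy data: 3 genes x 4 points (the slides use ~23,000 genes x 25 points).
genes = ["geneA", "geneB", "geneC"]
expression = [[1.0, 4.0, 2.0, 8.0],
              [2.0, 2.1, 2.0, 2.1],
              [3.0, 0.0, 5.0, 1.0]]
normalized = normalize_columns(expression)
kept = filter_low_variance(genes, normalized, min_variance=0.5)
bits = motif_bitmap("gctaagc", ["gctaag", "ttagtc"])
```

Normalizing by column puts all points on a common scale before the per-gene variance is compared against the cutoff.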


The proposed model


Assumption 1

A single TF binds to a single cis element (motif)

Source: U.S. Department of Energy Genomics (http://doegenomestolife.org)

Assumption 2

TFs regulate genes sharing a motif only on a subset of conditions

[Figure: two panels, "TF & regulated genes (group #1)" and "TF & regulated genes (group #2)"; x-axis: condition; y-axis: expression, -0.8 to 1]

Assumption 2 (cont’d)

TFs regulate genes sharing a motif only on a subset of conditions

[Figure: two panels, "Expression pattern #1" and "Expression pattern #2"; x-axis: condition; y-axis: expression, -0.8 to 1]

Assumption 3

The TF expression correlates with the sum of the partially correlating expression patterns

[Figure: "sum of genes"; x-axis: condition; y-axis: expression, -0.8 to 1]
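A toy illustration of Assumption 3 may help. All numbers below are invented for the example: two expression patterns each follow the TF on a different half of the conditions and are flat elsewhere, so each correlates only partially with the TF while their sum follows it on every condition.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

# Hypothetical profiles over 8 conditions.
tf       = [0.9, -0.5, 0.7, -0.8, 0.6, -0.7, 0.8, -0.4]
pattern1 = [0.9, -0.5, 0.7, -0.8, 0.0,  0.0, 0.0,  0.0]  # follows the TF on conditions 0-3
pattern2 = [0.0,  0.0, 0.0,  0.0, 0.6, -0.7, 0.8, -0.4]  # follows the TF on conditions 4-7
total = [a + b for a, b in zip(pattern1, pattern2)]

r1 = pearson(tf, pattern1)
r2 = pearson(tf, pattern2)
r_sum = pearson(tf, total)  # higher than r1 and r2 in this constructed case
```

The example is constructed so that the sum matches the TF exactly; real data would only approximate this.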


Objective

• For each cis element (motif):

– discover groups of co-regulated genes

– compute aggregate motif expression

• For each TF:

– find best correlating motifs
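The last objective, finding the best correlating motifs for a TF, amounts to a ranking. The sketch below assumes that each motif already has an aggregate expression profile and that Pearson correlation is the similarity measure; the slides report a "corr" column but do not name the coefficient, and the motif names and profiles here are placeholders.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_motifs(tf_profile, motif_profiles):
    """Return (motif, correlation) pairs, best correlating motif first."""
    scored = [(pearson(tf_profile, profile), motif)
              for motif, profile in motif_profiles.items()]
    scored.sort(reverse=True)
    return [(motif, corr) for corr, motif in scored]

# Placeholder data: one TF profile and three motif aggregate profiles.
tf_profile = [1.0, 2.0, 3.0, 4.0]
motif_profiles = {
    "motif_a": [2.0, 4.0, 6.0, 8.0],
    "motif_b": [4.0, 3.0, 2.0, 1.0],
    "motif_c": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_motifs(tf_profile, motif_profiles)
```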

The algorithm – step 1

[Diagram: expression matrix, ~5,000 genes × 25 points]

step 1: clustering
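The slides do not say which clustering method is used in step 1. As one possible stand-in, the sketch below uses plain k-means on the expression rows; the choice of k-means, the number of clusters `k`, the iteration count, and the seed are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=20, seed=0):
    """Assign each row to one of k clusters; returns one label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest center for every gene.
        labels = [min(range(k), key=lambda c: squared_distance(row, centers[c]))
                  for row in rows]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: two obvious groups of profiles.
rows = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = kmeans(rows, k=2)
```

The number of clusters is listed later in the slides as a parameter to be estimated, so `k` should not be read as a recommended value.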

The algorithm – step 2

[Diagram: expression matrix, ~5,000 genes × 25 points]

step 1: clustering

step 2: for any motif, compute its gene set

The algorithm – step 3

[Diagram: expression matrix, ~5,000 genes × 25 points]

step 1: clustering

step 2: for any motif, compute its gene set

step 3: compute the distribution of its genes into the clusters

The algorithm – step 4

[Diagram: expression matrix, ~5,000 genes × 25 points]

step 1: clustering

step 2: for any motif, compute its gene set

step 3: compute the distribution of its genes into the clusters

step 4: determine overrepresented clusters using t-test

The algorithm – final step

[Diagram: expression matrix, ~5,000 genes × 25 points; motif aggregate expression, 25 points]

final step: compute motif aggregate expression
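Steps 2 through 4 and the final step can be sketched as follows, given cluster labels from step 1. Two points are assumptions rather than the author's method: the slides name a t-test for step 4 without describing its setup, so the sketch substitutes a binomial z-score with a hypothetical threshold of 2.0, and the aggregate is taken as the sum of the selected genes' profiles, following the "sum" wording of Assumption 3. Gene names and sequences are toy data.

```python
from collections import Counter
from math import sqrt

def motif_gene_set(motif, upstream):
    """Step 2: genes whose upstream sequence contains the motif."""
    return [g for g, seq in upstream.items() if motif in seq]

def cluster_distribution(gene_set, cluster_of):
    """Step 3: number of motif genes falling into each cluster."""
    return Counter(cluster_of[g] for g in gene_set)

def overrepresented_clusters(gene_set, cluster_of, z_threshold=2.0):
    """Step 4 (stand-in for the t-test): binomial z-score per cluster."""
    total = len(cluster_of)
    sizes = Counter(cluster_of.values())
    n = len(gene_set)
    selected = []
    for cluster, observed in cluster_distribution(gene_set, cluster_of).items():
        p = sizes[cluster] / total           # chance a random gene is in this cluster
        sd = sqrt(n * p * (1 - p))
        if sd > 0 and (observed - n * p) / sd >= z_threshold:
            selected.append(cluster)
    return sorted(selected)

def aggregate_expression(gene_set, cluster_of, expression, clusters):
    """Final step: sum the profiles of motif genes in the selected clusters."""
    chosen = [expression[g] for g in gene_set if cluster_of[g] in clusters]
    return [sum(col) for col in zip(*chosen)]

# Toy data: 12 genes, cluster 0 holds g0-g3, cluster 1 holds g4-g11.
upstream = {f"g{i}": "aaaaaaaaaa" for i in range(12)}
for i in (0, 1, 2, 3, 7):
    upstream[f"g{i}"] = "ttgactcgtt"
cluster_of = {f"g{i}": (0 if i < 4 else 1) for i in range(12)}
expression = {f"g{i}": [float(i), 1.0, -1.0] for i in range(12)}

genes = motif_gene_set("gactcg", upstream)
clusters = overrepresented_clusters(genes, cluster_of)
profile = aggregate_expression(genes, cluster_of, expression, clusters)
```

Because correlation is unaffected by scaling, using the mean instead of the sum in the final step would give the same correlation with a TF profile.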


Yeast

Example TF: BAS1

Using cis/TF version 1:

RANK  MOTIF   OCCUR  corr    score
1     gactcg  46     0.6446  66
2     cgagtc  46     0.6446  16
3     gactaa  163    0.6381  66
4     ttagtc  163    0.6381  33
5     tcggct  87     0.6374  33
...
12    gctagt  110    0.6268  33
13    agtcac  137    0.6262  83     p-value=0.079
...
27    gagtca  136    0.6192  100    p-value=0.004

Example TF: BAS1

Using cis/TF version 2:

RANK  MOTIF   OCCUR  signf  corr  score
1     ctgact  122    0.62   0.66  33
2     agtcag  122    0.62   0.66  83
3     ggttta  187    0.62   0.63  50
4     taaacc  187    0.62   0.63  33
5     gagtca  136    0.68   0.63  100    p-value=0.002
6     tgactc  136    0.68   0.63  33
7     atttga  378    0.64   0.63  33
8     tcaaat  378    0.64   0.63  50
9     agtggc  126    0.66   0.61  50
10    gccact  126    0.66   0.61  50

Cluster #1: correlation = 0.02

[Figure: series BAS1 and #1; axis range -5 to 5]

Cluster #2: correlation = -0.05

[Figure: series BAS1 and #2; axis range -5 to 5]

Cluster #0: correlation = 0.18

[Figure: series BAS1 and #0; axis range -5 to 5]

Cluster #4: correlation = -0.35

[Figure: series BAS1 and #4; axis range -5 to 5]

Cluster #3: correlation = 0.63

[Figure: series BAS1 and #3; axis range -5 to 5]

Conclusions

Advantages of version 2:

• gives the ability to focus on the gene cluster that correlates best with a given TF

    – thus, increases overall correlation and motif rank

• offers a measure of motif significance

• can be extended to pairs of TFs/motifs


Arabidopsis

Procedure

• Permute gene cluster assignment

• Compile list of putative motifs

• Compute significance score of known motifs

• Repeat 1000 times

• Compute p-value of the score
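The procedure above is a permutation test, which can be sketched generically. The scoring function is left as a parameter because the slides do not define the significance score; the toy score used here (the largest number of motif genes sharing one cluster) and the fixed seed are assumptions. The p-value is the fraction of permutations that score at least as high as the real assignment.

```python
import random

def permutation_p_value(labels, score_fn, n_permutations=1000, seed=0):
    """Fraction of shuffled cluster assignments scoring >= the observed one."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)            # permute gene cluster assignment
        if score_fn(shuffled) >= observed:
            hits += 1
    return hits / n_permutations

# Toy setup: 20 genes in 5 clusters; genes 0-3 carry the motif of interest.
labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
motif_genes = [0, 1, 2, 3]

def toy_score(cluster_labels):
    """Hypothetical score: size of the largest group of motif genes in one cluster."""
    counts = {}
    for g in motif_genes:
        counts[cluster_labels[g]] = counts.get(cluster_labels[g], 0) + 1
    return max(counts.values())

p_value = permutation_p_value(labels, toy_score)
```

With 1000 permutations the smallest nonzero p-value this estimate can report is 0.001.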

[Figure: histogram; x-axis: ranking score, 1 to 26; y-axis: # of experiments, 0 to 140; p-val = 0.006]

TF discovery?

Need data for training!
(TFs and their associated binding sites)

Parameters to be estimated:

• number of clusters

• motif size & degeneracy


Pattern discovery


TF-driven pattern discovery

• Unsupervised pattern discovery

• Find groups of genes partially correlating with TF

• Apply statistical filter

• Look for over-represented motifs in genes’ upstream regions

• Data for validation?
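One way to read "partially correlating" is that a gene follows the TF on only part of the conditions. The sketch below makes that concrete under a strong assumption that is not in the slides: the subset is a contiguous window of at least `min_len` conditions, and the best window is found by exhaustive search. Short windows inflate correlation by chance, which is one reason a statistical filter is needed afterwards.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def best_window_correlation(tf, gene, min_len=4):
    """Return (r, start, end) for the window [start, end) with the highest correlation."""
    n = len(tf)
    if n != len(gene) or n < min_len:
        raise ValueError("profiles must have equal length >= min_len")
    best = None
    for start in range(n - min_len + 1):
        for end in range(start + min_len, n + 1):
            r = pearson(tf[start:end], gene[start:end])
            if best is None or r > best[0]:
                best = (r, start, end)
    return best

# Hypothetical profiles: the gene tracks the TF on conditions 0-3 only.
tf   = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
gene = [0.2, 1.8, 0.6, 1.6, 0.9, 0.1, 0.8, 0.2]
r, start, end = best_window_correlation(tf, gene)
full_r = pearson(tf, gene)
```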

[Figure: profiles of AT1G73230 (TF), AT1G53290 and AT5G59880; x-axis: 1 to 22; y-axis: -0.4 to 1.2]

Pattern discovery example

[Figure: "TF & regulated genes (group #2)"; x-axis: condition; y-axis: expression, -0.8 to 1]

“Predicting Gene Expression from Sequence”
Beer & Tavazoie, Cell 2004

• Group genes in 49 clusters

• Predict gene cluster using motifs discovered in its upstream region

[Figure: distribution of correlation; x-axis: correlation, -1 to 1; y-axis: frequency, 0.00 to 0.20; series: ALL, PAC, RRPE, PAC&RRPE; annotation: 2,500 genes]


Conclusions

Conclusions

Two options:

• Supervised training:

– uses background knowledge to construct model

– needs more training data

• Unsupervised pattern discovery:

– minimal model bias (no prior knowledge)

– needs more ‘expert’ help to filter results