Dataset collection and processing · 2015/06/03 · Dataset collection and processing ENCODE...


SI Methods

Dataset collection and processing

ENCODE ChIP-seq data were downloaded from the UCSC Genome Browser, comprising 686 profiles for 159 DNA-binding proteins (1). TF recognition motifs were collected from HT-SELEX (2), JASPAR (3) and TRANSFAC 6.0 (4), giving 853 recognition motifs for 505 TFs. We also collected 172 recognition motifs for 133 RNA-binding proteins (RBPs) (5). TCGA datasets of gene expression, copy number alteration (CNA) and DNA methylation were downloaded from the TCGA Data Portal on 07/27/2014. Standardized somatic mutation data were downloaded from Broad Firehose on 10/23/2014. Because we observed that most CpG islands are located within 1,000 nt upstream or downstream of gene transcription start sites (TSSs), we took the methylation array probes within +/-1,000 nt of each TSS and used their average value as the promoter methylation level (SI Appendix, Fig. S12). The GTEx data of 01/17/2014 were downloaded for the analysis of normal human tissues (6).
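As an illustration, the promoter methylation averaging described above can be sketched in Python. The function name, the (position, beta value) probe tuples, the inclusive window boundary and the example numbers are assumptions made for this sketch; they are not taken from the original pipeline.

```python
from statistics import mean

def promoter_methylation(tss, probes, window=1000):
    """Average methylation of array probes within +/- window nt of a TSS.

    probes: iterable of (position, beta_value) on the same chromosome as the TSS.
    Returns None when no probe falls inside the window (assumed convention).
    """
    values = [beta for pos, beta in probes if abs(pos - tss) <= window]
    return mean(values) if values else None

# Hypothetical probes: two inside the window, one outside.
example_probes = [(10_200, 0.25), (10_900, 0.75), (12_500, 0.9)]
level = promoter_methylation(10_000, example_probes)
print(level)
```

Only the two probes within 1,000 nt of the hypothetical TSS contribute to the average.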

Outlier removal for ChIP-seq profiles

A TF may have several ChIP-seq profiles available from different experimental conditions, antibodies or laboratories. We found that some ChIP-seq profiles of a TF can differ markedly from its other profiles, which leads to ambiguous results in regulatory inference. For each TF, we clustered the regulatory potential scores of its profiles (computed from ChIP-seq peaks near gene TSSs) by hierarchical clustering with Pearson correlation as the similarity measure (SI Appendix, Fig. S13). The hierarchical tree was cut at correlation 0.2 to form clusters. We ranked the clusters by the number of profiles they contain and kept only the ChIP-seq profiles in the largest cluster. If no cluster could be formed at the correlation threshold of 0.2, or if several largest clusters had the same size, we excluded the corresponding TF from further analysis. After this step, 544 of the 686 ENCODE ChIP-seq profiles remained for analysis, representing 150 TFs.
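The filtering rule above can be sketched as follows. The linkage method is not stated in the text, so this sketch assumes single linkage, for which cutting the tree at correlation 0.2 is equivalent to taking connected components of the graph that joins profiles with Pearson correlation of at least 0.2. Keeping a TF that has only one profile, the inclusive threshold, and all names and example values are likewise assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def largest_cluster(profiles, cutoff=0.2):
    """Return the profile names to keep for one TF, or None to exclude the TF.

    profiles: dict mapping profile name -> regulatory potential scores per gene.
    """
    names = list(profiles)
    if len(names) == 1:          # assumption: a single profile is kept as is
        return names
    parent = {name: name for name in names}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Single linkage: join any two profiles correlated at or above the cutoff.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    sizes = sorted((len(c) for c in clusters.values()), reverse=True)
    # Exclude the TF if no cluster formed or the largest size is tied.
    if sizes[0] < 2 or (len(sizes) > 1 and sizes[0] == sizes[1]):
        return None
    return max(clusters.values(), key=len)

# Hypothetical scores over four genes for three profiles of one TF.
example = {
    "rep1": [0.9, 0.1, 0.5, 0.7],
    "rep2": [0.8, 0.2, 0.4, 0.9],
    "odd":  [0.1, 0.9, 0.6, 0.2],
}
kept = largest_cluster(example)
print(kept)
```

With another linkage (for example average linkage) the cluster boundaries could differ, so the choice should be checked against the original analysis before reuse.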

Dataset normalization

All gene expression values measured by RNA sequencing platforms from TCGA and GTEx were log2 transformed. Our analysis focuses on the expression difference between tumor and normal tissues. For TCGA gene expression and DNA methylation profiles, very few tumor samples have a paired normal tissue control. Thus, we grouped all the normal tissue samples together and used their average value as the background control in each cancer. For CNA, TCGA provides largely complete tumor-normal paired measurements, so we used the paired gene CNA difference between tumor and normal samples. Because GTEx samples have no normal tissue control as in TCGA, we took the average over all tissue samples as the background control.
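A minimal sketch of the expression normalization described above is given below. The pseudocount of 1 inside the log2 transform is an assumption (the text does not state how zero values were handled), and the function names and example values are hypothetical.

```python
from math import log2
from statistics import mean

def log2_expr(value, pseudocount=1.0):
    """log2 transform; the pseudocount is an assumption to keep zeros finite."""
    return log2(value + pseudocount)

def tumor_vs_normal(tumor, normals):
    """Per-gene log2 difference between one tumor and the mean of all normals.

    tumor:   dict gene -> expression value in the tumor sample.
    normals: list of dicts gene -> expression value, one per normal sample.
    """
    diff = {}
    for gene, value in tumor.items():
        background = mean(log2_expr(sample[gene]) for sample in normals)
        diff[gene] = log2_expr(value) - background
    return diff

# Hypothetical values for one gene in one tumor and two normal samples.
difference = tumor_vs_normal({"GENE_A": 7.0}, [{"GENE_A": 1.0}, {"GENE_A": 3.0}])
print(difference)
```

For GTEx, the same function could be applied with the list of all tissue samples passed in place of the normal samples.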

Regulatory potential scores for TF and RBP binding

For each ChIP-seq profile, we searched for ChIP-seq peaks within 10,000 nt of each gene TSS annotated by RefSeq (7). A regulatory potential score was calculated for each pair of ChIP-seq peak and gene TSS by multiplying the ENCODE ChIP-seq intensity score by an exponential decay score, exp(-A*Distance), where Distance is the distance between the peak and the TSS (8, 9). The coefficient A was set to log(2)/1000, so that the score of a binding peak 1,000 nt away from the gene TSS decays by 50%. The ENCODE ChIP-seq intensity scores were linearly normalized into the range (0, 1]. The exponential decay score also has the range (0, 1]. Thus, the final regulatory potential score lies in (0, 1]. For each gene TSS, if several ChIP-seq peaks of a TF are nearby, we merged their regulatory potential scores by noisy-or: $1 - \prod_i (1 - \mathrm{score}_i)$.
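The score definition above can be written out as a short Python sketch. The function names and the (intensity, distance) tuple layout are assumptions for illustration; the decay coefficient, the 10,000 nt window and the noisy-or merge follow the text.

```python
from math import exp, log

DECAY = log(2) / 1000  # coefficient A: the decay score halves every 1000 nt

def peak_score(intensity, distance):
    """Regulatory potential of one peak: normalized intensity * exp(-A * distance)."""
    return intensity * exp(-DECAY * abs(distance))

def gene_score(peaks, max_distance=10_000):
    """Noisy-or merge, 1 - prod(1 - score_i), over peaks near one gene TSS.

    peaks: iterable of (normalized_intensity in (0, 1], signed distance to TSS in nt).
    """
    miss = 1.0
    for intensity, distance in peaks:
        if abs(distance) <= max_distance:
            miss *= 1.0 - peak_score(intensity, distance)
    return 1.0 - miss

# Two hypothetical full-intensity peaks, each 1000 nt from the TSS.
single = peak_score(1.0, 1000)
merged = gene_score([(1.0, 1000), (1.0, -1000)])
print(single, merged)
```

A peak 1,000 nt away at full intensity scores about 0.5, and two such peaks merge to about 0.75, so additional peaks raise the gene score without letting it exceed 1.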

For TF regulatory motifs, we searched for matches within the union DNaseI region in the UCSC multiple genome alignment of 33 placental mammals and derived a conservation score for each motif site in the human genome, using the CCAT package with default parameters (10). Many TFs from the same TF family have very similar recognition motifs, so their mapped sites overlap heavily. We clustered all TF motifs by mutual motif similarity into 220 clusters, again using the CCAT package with default parameters (10). Overlapping motif sites in the human genome were merged into one binding site of a motif cluster if more than 10% of the TFs in that cluster had motif hits included. The regulatory potential scores for TF recognition motifs were calculated in the same way as for ChIP-seq data, using exponential decay with distance, except that the conservation score from the multiple genome alignment was used instead of the ChIP-seq intensity score.

Most of the collected RBP motifs have lower information content than TF recognition motifs (5), and the CCAT package cannot find significant hits for them with its statistical model (10). Thus, we converted all RBP motifs into consensus sequences and searched for their matches on the same strand of gene 3’UTRs by consensus matching (11). The 172 RBP motifs were clustered into 73 clusters, and overlapping binding sites were merged in the same way as for TF recognition motifs. The regulatory potential scores for RBP recognition motifs are simply defined as the conservation scores of 3’UTR regions in the multiple genome alignments. When we profiled the regulatory activity of RBP motifs, the promoter degree and CpG content were replaced by the 3’UTR degree (total number of RBP motifs in the gene 3’UTR region) and the 3’UTR AU content as gene expression background factors.

Frisch-Waugh-Lovell method of regression

When RABIT screens for TFs driving tumor gene expression patterns, multiple regressions need to be conducted against all 686 TF ChIP-seq profiles in each of 7484 tumor samples. Thus, RABIT uses the time-efficient Frisch-Waugh-Lovell (FWL) method for regression (12). FWL separates the factors whose values do not change across regressions within a tumor, such as CNA and promoter degree, from the ChIP-seq profile, which varies from one regression to the next, and regresses only against the latter to speed up the calculation. In Fig. 1B, the vector R represents the ChIP-seq regulatory potential scores across all target genes. The matrix B is composed of four columns of background factors (gene CNA, promoter methylation, promoter degree and CpG content) that remain constant within the same tumor. We calculated the invariant matrix $Q = I - B(B'B)^{-1}B'$ just once for all regressions in the same tumor. For each ChIP-seq profile, only the coefficient $\hat{\beta}_r = R'QY / R'QR$, which measures the effect of TF regulation on the response vector Y of gene expression differences, is calculated incrementally from the Q matrix.

The time complexity of linear regression is $O(p^2 N)$, where p is the number of covariates (five in our analysis: four background factors plus one TF regulatory potential score) and N is the number of human genes (about 16000 in our analysis). For each tumor, if we run all regressions one by one, the time complexity is $O(k \cdot p^2 N)$, where k is the number of ChIP-seq profiles. With the FWL method, the computation of the matrix $(B'B)^{-1}$ takes $O((p-1)^2 N + (p-1)^3) = O(p^2 N)$. The computation of $R'QR$ and $R'QY$ takes $k \cdot O(N + (p-1)N + (p-1)^2) = O(k \cdot pN)$ for k ChIP-seq profiles. If we assume k > p, the time complexity is reduced from $O(k \cdot p^2 N)$ to $O(k \cdot pN)$.
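The FWL shortcut described in this section can be illustrated with a small pure-Python sketch. It is not the RABIT implementation: it applies the projection Q through an orthonormal basis of the background columns (Gram-Schmidt) instead of forming $(B'B)^{-1}$ explicitly, which gives the same coefficient R'QY / R'QR. The function names, the intercept column in the example and all example values are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns):
    """Gram-Schmidt basis spanning the background columns of B."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:  # skip columns that are linearly dependent
            basis.append([a / norm for a in v])
    return basis

def residualize(vec, basis):
    """Apply Q = I - B(B'B)^-1 B' to vec without forming an N x N matrix."""
    out = list(vec)
    for q in basis:
        c = dot(q, out)
        out = [a - c * b for a, b in zip(out, q)]
    return out

def fwl_coefficients(background, profiles, y):
    """Coefficient of each profile R when regressed together with B on y."""
    basis = orthonormal_basis(background)  # computed once per tumor
    qy = residualize(y, basis)             # computed once per tumor
    betas = {}
    for name, r in profiles.items():
        qr = residualize(r, basis)
        betas[name] = dot(qr, qy) / dot(qr, qr)  # R'QY / R'QR
    return betas

# Hypothetical data for six genes: intercept and CNA as background columns.
cna = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
binding = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
y = [3.0 + 0.5 * c + 2.0 * r for c, r in zip(cna, binding)]
betas = fwl_coefficients([[1.0] * 6, cna], {"TF1": binding}, y)
print(betas)
```

Because the basis and the residualized response are shared by all profiles of a tumor, each additional profile costs only one residualization and two dot products, which mirrors the per-profile cost stated above.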


Correlation between regulator gene expression, somatic mutation and target gene expression for regulatory motif members

In step three of the RABIT framework, we tested the impact of variation in TF gene expression and somatic mutation on target gene expression with a linear regression. However, the regression analysis is more complicated for a regulatory motif because, unlike a ChIP-seq profile, one regulatory motif might represent several distinct TFs or RBPs (Fig. 4A). With the TF regulatory activity score as the response variable, we applied stepwise forward regression to select among the covariates, namely the gene expression and somatic mutation values of all TF (or RBP) members (SI Appendix, Fig. S2B and Table S2). Instead of using an F-test to measure the effect of all covariates on regulatory activity scores across tumors, we used a t-test to assess the significance of each covariate (13). Among all regulators represented by a regulatory motif, the p-values of the covariates selected by forward regression were grouped together and converted to FDRs by the Benjamini-Hochberg procedure (14). This procedure was applied for each cancer type, and a regulator was reported as cancer associated if at least one covariate's regression coefficient was statistically significant (FDR threshold 0.05).
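For reference, the Benjamini-Hochberg conversion of p-values to FDRs mentioned above can be sketched as follows. The function name and the example p-values are hypothetical; the adjustment itself is the standard step-up procedure (14).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDRs) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr

# Hypothetical p-values of four selected covariates of one regulatory motif.
fdrs = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(fdrs)
```

A covariate would then be called significant when its FDR passes the 0.05 threshold used in the text.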

Algorithm Comparison

To compare the performance of RABIT with other methods in finding cancer-associated TFs, we used the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. The gold standard positive set is defined as the TFs annotated as cancer associated in at least two of four cancer gene databases (NCI Cancer Index (15), Bushman (16, 17), COSMIC (18) and CCGD (19)). The gold standard negative set is defined as the remaining TFs. Among the 150 TFs with ChIP-seq profiles analyzed, 96 TFs are classified as gold standard positives and 54 as gold standard negatives. After running each method, we generated a ranking of TFs reflecting their relative relevance to each cancer type. We derived an overall TF ranking by averaging the TF ranks across all TCGA cancer types. We swept through each ranked TF list to generate the ROC and PR curves for each method (SI Appendix, Fig. S8 B and C). The parameters used to run each method are listed below.
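The evaluation just described (average the per-cancer ranks, then sweep the ranked list against the gold standard) can be sketched as below. The area under the ROC curve is computed through its equivalent pairwise form, the fraction of positive-negative pairs in which the positive TF is ranked better, with ties counted as one half. The TF names, ranks and gold standard labels are hypothetical.

```python
def average_rank(rank_lists):
    """Average each TF's rank over per-cancer rank dicts (1 = most relevant)."""
    tfs = rank_lists[0].keys()
    return {tf: sum(r[tf] for r in rank_lists) / len(rank_lists) for tf in tfs}

def roc_auc(avg_rank, positives):
    """Area under the ROC curve obtained by sweeping the ranked TF list."""
    pos = [r for tf, r in avg_rank.items() if tf in positives]
    neg = [r for tf, r in avg_rank.items() if tf not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:        # positive TF ranked ahead of negative TF
                wins += 1.0
            elif p == n:     # tie counts as half
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical ranks of four TFs in two cancer types.
ranks_cancer_1 = {"TF_A": 1, "TF_B": 2, "TF_C": 3, "TF_D": 4}
ranks_cancer_2 = {"TF_A": 2, "TF_B": 1, "TF_C": 4, "TF_D": 3}
overall = average_rank([ranks_cancer_1, ranks_cancer_2])
auc = roc_auc(overall, positives={"TF_A", "TF_B"})
print(overall, auc)
```

In this toy case both gold standard positives have better average ranks than both negatives, so the area under the curve is 1.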

For LASSO, we used the glmnet R package, set the penalty weights of the four background factors (promoter degree, CpG content, CNA and promoter methylation) to zero and left all other parameters at their defaults (20). With zero penalty weights, these background factors are always retained in the linear model as controls. For LAR, we used the lars R package (21). Because lars provides no way to input background covariates, we first calculated the residuals of the tumor gene expression values and the ChIP-seq regulatory potential scores after regressing them on the four background factors. In this way, the impact of the background factors was removed from our data. Then, we ran the lars package on these residual values with default parameters.

Besides regression methods, we also included three methods designed for finding master regulators of gene expression patterns. We ran the MARINA and VIPER algorithms using the VIPER R package with default parameters (22, 23). We also ran the Expression-2-Kinase (X2K) package with default parameters (24). Since X2K only takes a gene list as input, we took the top 10% most up-regulated (or down-regulated) genes in each tumor to calculate the master TFs of gene up-regulation (or down-regulation). We also included two methods for baseline comparison. For each TF, we used a t-test to check whether its ChIP-seq target genes are significantly differentially regulated. The t-test p-values were converted to FDRs by the Benjamini-Hochberg procedure, and an FDR threshold of 0.05 was used to select significant TFs in each tumor. For each TF, we also used its gene expression difference between all tumor samples and all normal controls as a measure of cancer relevance.

For RABIT, LAR, LASSO, X2K and the t-test, a set of TFs was selected as important regulators of gene expression patterns in each tumor. For each TF, the percentage of tumors in which the TF was selected was calculated for each TCGA cancer type (Fig. 2). The overall cancer relevance is defined as the average percentage of tumors with the TF selected across all TCGA cancer types. For MARINA and VIPER, the TFs were ranked by the adjusted p-values from each algorithm, and the overall cancer relevance of each TF was defined from its average rank across all TCGA cancer types. For TF expression, the TFs were ranked by the absolute value of the tumor versus normal expression difference averaged over all TCGA cancers. The ranked list of each algorithm was swept through to generate the ROC and PR curves (SI Appendix, Fig. S8 B and C). The areas under the ROC curves of two algorithms were compared using the DeLong test (25).

Besides comparing different methods, we also tested the performance of RABIT without controlling for the four background factors (promoter degree, CpG content, CNA and promoter methylation). The area under the curve of RABIT is significantly larger than that of the result without background factors (SI Appendix, Fig. S8D, 0.730 > 0.682, DeLong test P-value = 0.023), and the precision of RABIT is consistently higher than that of the result without background factors when the recall rate is higher than 0.5 (SI Appendix, Fig. S8E).

Correlation between TF regulatory activity and genome-wide CRISPR screening

For the cell lines K562 and HL60, gene expression data profiled by ENCODE and genome-wide CRISPR screening data from previous studies are available (26, 27). The CRISPR screening scores were downloaded directly from each study. For each gene screened, a positive score implies that cell growth became faster after CRISPR knockout of that gene, and a negative score implies that cell growth became slower after the knockout.

We applied RABIT to ENCODE gene expression data from 76 cell lines. A regulatory activity score was calculated for each TF to indicate whether the TF target genes are up-regulated or down-regulated. For transcriptional repressors (defined below), the regulatory activity scores were sign-reversed, since the direction of target gene regulation is opposite to the direction of TF activation. The Spearman rank correlations between TF regulatory activity scores and TF CRISPR screening scores were calculated, and the p-values of the correlation tests were computed in R.

In the analysis above, we only included transcriptional activators and repressors. The correlations between TF gene expression values and target regulatory activity scores were computed for all TCGA cancer data, and significant correlations were selected with a correlation t-test FDR threshold of 0.05. Transcriptional activators are defined as TFs with positive correlations in more than 80% of TCGA cancer types, and transcriptional repressors are defined as TFs with negative correlations in more than 80% of cancer types.
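The activator and repressor definitions can be expressed as a small classification rule. In this sketch each cancer type contributes either a significant correlation value or None; using the total number of cancer types as the denominator is an assumption, since the text does not say whether non-significant cancer types are counted. Function names and example values are hypothetical.

```python
def classify_tf(correlations, min_fraction=0.8):
    """Classify a TF from its per-cancer-type correlations.

    correlations: one entry per TCGA cancer type; a float when the correlation
    passed the FDR threshold, or None when it was not significant.
    """
    n = len(correlations)
    positive = sum(1 for c in correlations if c is not None and c > 0)
    negative = sum(1 for c in correlations if c is not None and c < 0)
    if positive / n > min_fraction:
        return "activator"
    if negative / n > min_fraction:
        return "repressor"
    return "unclassified"

def signed_activity(score, tf_class):
    """Reverse the sign of the regulatory activity score for repressors."""
    return -score if tf_class == "repressor" else score

# Hypothetical TF: positive correlation in 9 of 10 cancer types.
label = classify_tf([0.3] * 9 + [None])
print(label, signed_activity(2.5, label))
```

TFs that fall into neither class would be left out of the comparison with the CRISPR screening scores, in line with the first sentence of the paragraph above.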

References

1. Consortium EP, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57-74.

2. Jolma A, et al. (2013) DNA-binding specificities of human transcription factors. Cell 152(1-2):327-339.

3. Mathelier A, et al. (2014) JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic acids research 42(Database issue):D142-147.

4. Matys V, et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic acids research 31(1):374-378.


5. Ray D, et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature 499(7457):172-177.

6. Consortium GT (2013) The Genotype-Tissue Expression (GTEx) project. Nature genetics 45(6):580-585.

7. Pruitt KD, Tatusova T, & Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35(Database issue):D61-65.

8. Tang Q, et al. (2011) A comprehensive view of nuclear receptor cancer cistromes. Cancer research 71(22):6940-6947.

9. Wang S, et al. (2013) Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols 8(12):2502-2515.

10. Jiang P & Singh M (2014) CCAT: Combinatorial Code Analysis Tool for transcriptional regulation. Nucleic acids research 42(5):2833-2847.

11. Kheradpour P, Stark A, Roy S, & Kellis M (2007) Reliable prediction of regulator targets using 12 Drosophila genomes. Genome research 17(12):1919-1931.

12. Frisch R & Waugh FV (1933) Partial Time Regressions as Compared with Individual Trends. Econometrica 1(4):387-401.

13. Freedman D (2009) Statistical models : theory and practice (Cambridge University Press, Cambridge ; New York) pp xiv, 442 p.

14. Benjamini Y & Hochberg Y (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57(1):289-300.

15. NCI (2014) Cancer Gene Index Project.

16. Sadelain M, Papapetrou EP, & Bushman FD (2012) Safe harbours for the integration of new DNA in the human genome. Nature reviews. Cancer 12(1):51-58.

17. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339(6127):1546-1558.

18. Futreal PA, et al. (2004) A census of human cancer genes. Nature reviews. Cancer 4(3):177-183.

19. Abbott KL, et al. (2014) The Candidate Cancer Gene Database: a database of cancer driver genes from forward genetic screens in mice. Nucleic acids research.

20. Friedman J, Hastie T, & Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software 33(1):1-22.

21. Efron B, Hastie T, Johnstone I, & Tibshirani R (2004) Least angle regression. The Annals of statistics 32(2):407-499.

22. Lefebvre C, et al. (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Molecular systems biology 6:377.

23. Alvarez MJ (2013) viper: Master Regulator Analysis including MARINA and VIPER algorithms. R package version 0.99.0.

24. Chen EY, et al. (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1):105-111.

25. DeLong ER, DeLong DM, & Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837-845.

26. Gilbert LA, et al. (2014) Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell 159(3):647-661.

27. Wang T, Wei JJ, Sabatini DM, & Lander ES (2014) Genetic screens in human cells using the CRISPR-Cas9 system. Science 343(6166):80-84.

Supplementary Figures and Tables

[Fig. S1 plot: boxplots in panels A and B; y-axis: Correlation; x-axis: TCGA datasets by cancer type and platform (e.g., LUSC.HiSeq, OV.Agilent, GBM.U133A, UCEC.GA, KIRC.HiSeq.V2).]

Fig. S1. Background confounding factors of gene expression. (A) For each tumor, we computed the Spearman rank correlation between the promoter degree of each gene and the tumor-normal expression difference. For each cancer type, the correlation values across all tumors are shown. The bottom and top of the box are the 25th and 75th percentiles (i.e., they give the interquartile range). Whiskers on the top and bottom represent the maximum and minimum data points within 1.5 times the interquartile range. (B) The correlation values between the promoter CpG content and the gene tumor-normal expression difference are shown by boxplots.

[Fig. S2 diagram. (A) Gene expression (n: number of genes) modeled from background factors (sub-matrix B; pb: number of background factors) and regulators in selection (sub-matrix R; pr: number of regulators). (B) Regulatory activity scores (t-value) modeled from regulator gene expression (E) or somatic mutation (M) (n: number of tumors; p: number of members).]

Fig. S2. Linear model structure. (A) For each tumor, the TF regulatory effect on target gene expression is evaluated by linear regression, with each human gene as a regression unit. The covariate matrix X is composed of two sub-matrices, B and R. Sub-matrix B contains the values of the background factors, which include promoter degree, promoter CpG content, gene CNA and promoter DNA methylation. Sub-matrix R contains the regulatory potential scores that measure TF binding intensity near gene TSSs, and a set of TFs is selected to accurately model the gene expression pattern in each tumor. The response variable Y contains the gene expression differences between the tumor sample and normal controls. In our analysis, the number of regression units n is about 16000, which is the number of human genes with TCGA gene expression measured and ChIP-seq binding peaks near their TSSs. The number of background factors pb, i.e., the dimension of B, is 4. The dimension of R, pr, is 150, which is the number of TFs with ENCODE ChIP-seq profiles. (B) RABIT investigates whether the public ChIP-seq profiles (or regulatory motifs) used can represent the active TF targets in each cancer type. For each cancer type, we regress the response variable of TF regulatory activity scores linearly against the covariates of TF gene expression and somatic mutation, with each tumor sample as a regression unit. A TF (or RBP) regulatory motif might contain several members, and all of them are included as covariates. The response vector contains the TF regulatory activity scores, which are the estimated coefficients normalized by their estimated standard errors (i.e., t-values) for TF regulatory effects on target gene differential expression (example in Table 1A). The number of regression units is the number of tumors in each TCGA cancer type. The dimension of X is 2 for ChIP-seq analysis, representing the TF gene expression and somatic mutation. A regulatory motif might include several members; for example, RBP motif cluster 9 includes RBFOX1, RBFOX2, RBFOX3 and EIF2S1. To determine which members are relevant to the expression patterns of the motif target genes, we run a forward selection among the gene expression and mutation values (2p covariates) of all members, with Mallows' Cp as the model selection metric (example in SI Appendix, Table S2).

[Fig. S3 plot: panels A and B; bar charts over ENCODE MYC ChIP-seq profiles labeled by cell line, treatment and laboratory (e.g., K562+Stanford, HepG2+UT-A, HeLa-S3+Yale); axes: t-value (0 to 12) and Percentage (%) (0 to 60).]


Fig. S3. Selection of the ChIP-seq profile with the largest statistical effect. When one TF has several ChIP-seq profiles available, RABIT uses only the ChIP-seq profile that gives the most significant coefficient (the largest absolute t-value) in the regression analysis of TF regulation on target genes. (A) In this example, all ENCODE MYC ChIP-seq profiles are analyzed together with the TCGA data of breast tumor TCGA-AO-A03P-01A. (B) For each TCGA breast tumor, only the single most relevant ENCODE MYC ChIP-seq profile is selected. We show the percentage of tumors in which each ChIP-seq profile is selected.

[Fig. S4 plot: TFs (e.g., RBBP5, EP300, YY1, FOXA1, CTCF, MYC, E2F1, FOXM1) by TCGA datasets (e.g., LUSC.HiSeq.V2, BRCA.Agilent, GBM.HiSeq.V2); scales: Up regulated (%) and Down regulated (%), 0 to 100; TF clusters labeled 1 to 4.]

Fig. S4. Structure of transcriptional regulation in cancer. To derive a structure of transcriptional regulation in cancer, we clustered TFs by the difference between the percentage of tumors with TF target genes up-regulated and the percentage with target genes down-regulated in diverse cancer types. We split the hierarchical tree into four clusters, with one outlier, ZEB1, that is not assigned to any cluster.

[Fig. S5 scatter plot: Gene expression (x-axis) versus Regulatory activity (y-axis); R = −0.275 (P = 2.27e−10).]

Fig. S5. Correlation of gene expression between RAD21 and its target genes. The Spearman rank correlation between RAD21 gene expression values and TF target regulatory activity scores is shown for TCGA breast tumors.

[Fig. S6 plots: Kaplan-Meier curves for the High and Low groups in panels A to F; x-axis: Survival (days); y-axis: Percentage (%). p-values: (A) 0.003; (B) 0.0229; (C) 3.82e-5; (D) 2.37e-4; (E) 7.65e-8; (F) 5.68e-9.]

Fig. S6. Survival analysis of several predictions. (A) Patients in the TCGA GBM data were ordered by SPI1 expression value. The top half of patients are classified as “High” and the bottom half as “Low”. Overall survival in days is plotted as Kaplan-Meier curves, and the p-value is estimated by a Weibull model with age and gender as background factors. (B) Survival analysis with the Rembrandt data, done in the same way as A. (C) Survival analysis with the Gravendeel data, done in the same way as A. (D) Survival analysis with the TCGA KIRC data, done in the same way as A, except that the clinical stage of KIRC is added as a background factor in the Weibull regression. (E) The Rembrandt cohort is used, and the regulatory activity of RBP motif cluster 9 is analyzed for survival in the same way as in Fig. 4C. (F) The Gravendeel cohort is used, and the regulatory activity of RBP motif cluster 9 is analyzed in the same way as in Fig. 4C.

[Fig. S7 plot: bar chart of Percentage (%) of tumors (0 to 60) with target genes Up or Down regulated for PHF8, RBBP5, TBP, MAFK, SAP30, SPI1, EBF1, ZKSCAN1, BHLHE40, SMC3, TAF1, UBTF, IRF3, TEAD4, GTF2F1, RFX5, GTF3C2, ATF1, MXI1, GTF2B and CDKN3; annotation columns on the left: NCI Cancer Index, Bushman, COSMIC, CCGD.]

Fig. S7. TFs with potential roles in breast cancer. We show all TFs with target genes differentially regulated in more than 10% of breast tumors, averaged over the TCGA and METABRIC datasets. We excluded all TFs related to breast cancer according to NCI Cancer Index annotation and a Google literature search. The remaining TFs represent regulators that may have a global impact on breast tumor gene expression but have not been well studied so far. For each TF, the fraction of tumors with target genes up-regulated is shown in red and the fraction with target genes down-regulated is shown in blue. The NCI Cancer Index value and the cancer gene annotation from all databases are shown on the left.

[Fig. S8 plots: panels A to E. Bar chart of Percentage (%) for Cancer versus Other TFs in Bushman, COSMIC and CCGD, with significance asterisks; ROC curves (False positive rate versus True positive rate) and PR curves (Recall versus Precision). Legends: RABIT, LAR, LASSO, t test, X2K, MARINA, VIPER, TF expression; RABIT, No background.]

Fig. S8. Algorithm performance comparison. (A) We divided TFs according to the cancer gene annotation in each database and compared the average percentage of tumors (across all TCGA cancers) with target genes differentially regulated between the two categories by the Mann-Whitney U test. One asterisk indicates P-value < 0.05, two asterisks indicate P-value < 0.01, and three asterisks indicate P-value < 0.001. (B) Receiver operating characteristic (ROC) curves were plotted for RABIT and seven other methods. LAR and LASSO are two regression-based feature selection methods. X2K, MARINA and VIPER are integrative algorithms developed to find master regulators driving gene expression patterns. T-test and TF expression are two baseline methods that use only the differential expression of target genes (T-test) or of the TF itself (TF expression). (C) Precision-recall (PR) curves were plotted for all methods. (D) We ran the RABIT algorithm without controlling for the four background factors (promoter degree, CpG content, CNA and promoter methylation). The algorithm performance was compared between the initial result and the result without background factor control using the ROC curve. The area under the curve of RABIT is significantly larger than that of the result without background factors (0.730 > 0.682, DeLong test P-value = 0.023). (E) The algorithm performance was plotted with the PR curve.

[Fig. S9 plot: CV Error Improve (%) (0 to 8) and running time (sec, log scale) for RABIT, LAR and LASSO across TCGA datasets (e.g., GBM.HiSeq.V2, BRCA.HiSeq, THCA.HiSeq.V2).]

Fig. S9. RABIT outperforms the state-of-the-art feature selection algorithms. For each tumor, we selected the top ten most important TFs prioritized by each algorithm to train a linear model. The leave-one-out cross-validation (CV) error was used to estimate the error of predicting gene expression values from TF binding. The CV errors were calculated for the base model with only the four background factors of gene expression and for the model with the top ten most important TFs. In each tumor, the improvement between the two CV errors was converted to a relative fraction by dividing by the CV error of the base model. In each cancer type, the relative CV error improvements were averaged across all tumors and taken as the algorithm performance. For each method, the average running time across all tumors in a cancer type is shown in log scale.
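The relative improvement described in this caption reduces to simple arithmetic once the two CV errors per tumor are known. The following is a minimal sketch under that assumption; the function names and the CV error values are hypothetical, and the model fitting that produces the CV errors is not shown.

```python
from statistics import mean


def relative_cv_improvement(cv_base, cv_full):
    """Fraction of the base model's CV error removed by the TF model."""
    return (cv_base - cv_full) / cv_base


def cancer_type_performance(cv_pairs):
    """Average relative improvement, in percent, across tumors.

    cv_pairs: iterable of (cv_base, cv_full) tuples, one per tumor.
    """
    return 100 * mean(relative_cv_improvement(b, f) for b, f in cv_pairs)


# Made-up CV errors for three tumors of one cancer type.
example_pairs = [(2.0, 1.9), (4.0, 3.6), (1.0, 0.97)]
performance = cancer_type_performance(example_pairs)
print(round(performance, 2))  # improvements of 5%, 10% and 3% average to 6.0
```

Dividing by the base model's error makes tumors with different overall error scales comparable before averaging.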

[Figure S10 panels A-D: percentage of samples with TF targets up regulated (%) and down regulated (%) in the GTEx Breast, METABRIC, GTEx Brain, Rembrandt and Gravendeel cohorts, and correlation of each cohort with the TCGA platforms (HiSeq, HiSeq.V2, Agilent, U133A).]

Fig. S10. Cross-dataset comparison of TF regulation. For breast cancer and glioblastoma multiforme (GBM), we applied the RABIT framework to other cohorts as independent controls for the TCGA results. To contrast the regulatory behavior of TFs in tumors, we also applied RABIT to the GTEx gene expression cohorts to obtain TF regulatory activities in normal breast and brain tissues. (A) For breast cancer, we compared the transcriptional regulation landscape from TCGA data with the results computed from the METABRIC and GTEx datasets. There are 1992 breast tumor samples in METABRIC and 66 normal breast samples in GTEx. The percentage of tumors or GTEx samples with TF targets differentially regulated is plotted in the same way as in Fig. 2. (B) For GBM, we compared the results of the TCGA datasets with the results computed from other glioma datasets and the GTEx dataset. There are 381 tumor samples in the Rembrandt cohort, 276 tumor samples in the Gravendeel cohort and 357 normal brain samples in GTEx. The percentage of samples with TF targets differentially regulated is plotted. (C) For each TF, we used the difference between the percentage of up-regulated samples and the percentage of down-regulated samples as an overall measure of regulation in each cancer type. We compared this overall measure between each TCGA dataset and the other datasets by Spearman rank correlation across all TFs. The correlation values are shown for breast cancer. (D) The correlation values between TCGA and other datasets are shown for GBM.
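The overall measure in panels (C) and (D) can be sketched as follows: subtract the down-regulated percentage from the up-regulated percentage for each TF, then compute a Spearman rank correlation between two datasets across TFs. This is an illustration only; the helper names are hypothetical and the percentages are invented, not taken from the figure.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def overall_regulation(up_pct, down_pct):
    """Per-TF difference: % up-regulated minus % down-regulated samples."""
    return {tf: up_pct[tf] - down_pct[tf] for tf in up_pct}


# Invented percentages for three TFs in two datasets.
tcga = overall_regulation({"MYC": 60, "E2F1": 70, "REST": 5},
                          {"MYC": 5, "E2F1": 2, "REST": 40})
other = overall_regulation({"MYC": 40, "E2F1": 65, "REST": 10},
                           {"MYC": 10, "E2F1": 5, "REST": 30})
tfs = sorted(tcga)
rho = spearman([tcga[t] for t in tfs], [other[t] for t in tfs])
print(rho)  # both datasets order the three TFs identically
```

Because only ranks enter the correlation, datasets measured on different platforms can be compared without rescaling the percentages.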

[Figure S11 panels A-C: (A) up regulated (%) and down regulated (%) tumors for each TF motif cluster across TCGA datasets; (B) correlation per TCGA dataset; (C) percentage (%) histogram of correlation values.]

Fig. S11. Transcriptional regulation by TF recognition motifs. (A) The percentage of tumors with motif target genes differentially regulated is shown for all cancer types and regulatory motif clusters in the same way as in Fig. 2. Each TF motif cluster is labeled with the consensus sequence of the centroid motif averaged among all members, followed by the TF name, or by the cluster index if there are multiple members. (B) For each tumor, we computed the Spearman rank correlation between the regulatory activity scores derived from ChIP-seq profiles and those derived from their matched recognition motifs. For each cancer type, all correlation values are shown as boxplots. The bottom and top of the boxes are the 25th and 75th percentiles (i.e., they give the interquartile range). The whiskers at the top and bottom represent the maximum and minimum data points within the range given by 1.5 times the interquartile range. The width of each box is proportional to the square root of the sample size in that group. (C) We computed the direct correlation between ChIP-seq and motif binding data. For each TF, we computed the correlation of regulatory potential scores on target genes between ChIP-seq binding and PWM binding data. The histogram of Spearman rank correlation values is plotted.
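The boxplot conventions described for panel (B) can be written out as a short sketch. It assumes that the 1.5 times interquartile range is measured from the box edges and that quartiles use linear interpolation; neither detail is stated in the caption, and the function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import quantiles


def box_stats(values):
    """Box, whisker and width statistics for one boxplot group.

    Assumes whiskers reach the most extreme points lying within
    1.5 * IQR of the box edges (a common convention, assumed here).
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
        "relative_width": sqrt(len(values)),  # box width scales with sqrt(n)
    }


# Made-up correlation-like values with one extreme point.
stats = box_stats([0.1, 0.2, 0.3, 0.4, 0.5, 2.0])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
```

Scaling the box width by the square root of the group size lets the reader see at a glance which cancer types contribute many tumors and which only a few.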

[Figure S12 panels A and B: (A) count of CpG islands against TSS offset (nt); (B) count (1e4) of CpG dinucleotides against TSS offset (nt).]

Fig. S12. Gene promoter CpG distribution. (A) The annotation of CpG islands was downloaded from the UCSC genome browser. We plotted the accumulated count of annotated CpG islands within 5000 nts around gene TSS. (B) The CpG dinucleotide counts are plotted within 5000 nts around gene TSS.

[Figure S13 dendrogram: ENCODE MYC ChIP-seq profiles labeled by cell line, treatment and laboratory (for example H1−hESC+UT−A, HeLa−S3+Yale, K562+Stanford), with a correlation axis running from 1.0 to 0.2.]

Fig. S13. Outlier removal of ChIP-seq profiles. All ChIP-seq profiles for the same TF are hierarchically clustered with Pearson correlation as the distance measure. The hierarchical tree is cut at a correlation threshold of 0.2 to generate clusters. Only ChIP-seq profiles in the largest cluster are kept for further analysis. All ENCODE MYC ChIP-seq profiles are shown as an example.
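The outlier-removal rule can be sketched in a few lines. The source does not state the linkage method, so this example assumes single linkage, for which cutting the tree at correlation 0.2 is the same as joining any two profiles whose Pearson correlation is at least 0.2. The function names and the toy profiles are hypothetical, and keeping a TF that has only one profile is also an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def largest_cluster(profiles, min_corr=0.2):
    """Names of the profiles to keep for one TF, or None to exclude the TF.

    profiles: dict mapping profile name -> per-gene regulatory potential.
    Single linkage is an assumption; the source only gives the threshold.
    """
    names = list(profiles)
    if len(names) <= 1:
        return names  # nothing to compare (assumed to be kept as-is)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if pearson(profiles[a], profiles[b]) >= min_corr:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    sizes = sorted((len(c) for c in clusters.values()), reverse=True)
    # Exclude the TF if no cluster forms or the largest size is tied.
    if sizes[0] < 2 or (len(sizes) > 1 and sizes[0] == sizes[1]):
        return None
    return max(clusters.values(), key=len)


# Toy profiles: three agree with each other, one is an outlier.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 9],
       "c": [1.1, 2.0, 2.9, 4.2], "d": [4, 1, 3, 0]}
kept = largest_cluster(toy)
print(sorted(kept))  # ['a', 'b', 'c']
```

With average or complete linkage the clusters could differ for borderline profiles, so the linkage choice should be checked against the original analysis before reuse.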

Cancer  Platform  #Expression  #CNA  #Methylation  #Mutation  #Complete
BRCA    HiSeq.V2  1069         1025  1053          979        933
BRCA    HiSeq     782          753   781           758        736
BRCA    Agilent   534          516   533           513        498
KIRC    HiSeq.V2  532          501   531           415        400
KIRC    HiSeq     470          445   469           391        377
THCA    HiSeq.V2  506          486   506           403        388
OV      Agilent   592          592   591           317        317
OV      U133A     586          586   585           316        316
HNSC    HiSeq.V2  498          479   498           304        290
HNSC    HiSeq     263          251   263           261        249
PRAD    HiSeq.V2  419          400   333           258        247
UCEC    GA.V2     370          349   370           241        226
UCEC    GA        333          313   333           239        225
UCEC    HiSeq.V2  159          155   159           0          0
LIHC    HiSeq.V2  212          198   200           193        182
LIHC    HiSeq     17           17    17            15         15
LUSC    HiSeq     225          224   225           179        179
LUSC    HiSeq.V2  491          477   490           179        179
CESC    HiSeq.V2  208          187   207           191        176
LUAD    HiSeq.V2  490          470   487           171        164
LUAD    HiSeq     126          122   124           108        104
KIRP    HiSeq.V2  226          218   226           161        156
STAD    HiSeq     249          227   249           170        153
COAD    Agilent   155          115   153           142        109
COAD    HiSeq.V2  274          264   274           0          0
READ    Agilent   69           48    68            64         46
READ    HiSeq.V2  92           89    91            0          0
BLCA    HiSeq.V2  267          234   252           129        107
BLCA    HiSeq     56           46    56            51         41
GBM     Agilent   596          521   105           286        92
GBM     U133A     548          474   74            250        63
GBM     HiSeq.V2  169          162   60            149        45
KICH    HiSeq.V2  66           66    0             66         0

Table S1. Number of tumor samples in TCGA cancer types. For each TCGA dataset, we listed the number of tumor samples with gene expression measured on each platform (#Expression). We also counted the number of tumors with copy number alteration data (#CNA), DNA methylation data (#Methylation) and somatic mutation data (#Mutation). Finally, we counted the number of tumor samples with all types of data available (#Complete).

Cohort               Regulator  Estimate  Std. Error  t-value  p-value
TCGA.Agilent         RBFOX3     0.719     0.098       7.36     8.64e-11
TCGA.Agilent         RBFOX2     0.388     0.124       3.13     2.35e-03
TCGA.U133A *         RBFOX1     0.469     0.111       4.23     8.05e-05
TCGA.U133A *         RBFOX2     0.521     0.222       2.34     2.25e-02
TCGA.HiSeq           RBFOX3     0.363     0.062       5.82     6.79e-07
Rembrandt.U133+2.0   RBFOX2     4.55      0.378       12.03    2.00e-28
Rembrandt.U133+2.0   RBFOX3     2.67      0.584       4.57     6.51e-06
Rembrandt.U133+2.0   RBFOX1     0.776     0.215       3.60     3.56e-04
Rembrandt.U133+2.0   EIF2S1     -0.926    0.406       -2.28    2.33e-02
Gravendeel.U133+2.0  RBFOX2     1.44      0.120       12.03    5.04e-27
Gravendeel.U133+2.0  RBFOX1     0.539     0.087       6.20     2.05e-09

* The RBFOX3 value is missing from the TCGA level 3 data of U133A.

Table S2. Determinant RBP members of motif clusters. For each motif cluster with several members, RABIT uses stepwise forward regression to select the relevant members that drive regulatory activity score variation across all tumors in the same cancer type. RBP motif cluster no. 9 is used as an example here (Fig. 4A), and the regression statistics are shown for all RBP members selected in the different glioma cohorts.
