Dataset collection and processing · 2015/06/03 · Dataset collection and processing ENCODE...
SI Methods
Dataset collection and processing
ENCODE ChIP-seq data were downloaded from the UCSC Genome Browser, comprising 686
profiles for 159 DNA-binding proteins (1). TF recognition motifs were collected from
HT-SELEX (2), JASPAR (3) and TRANSFAC 6.0 (4), giving 853 recognition motifs
for 505 TFs. We also collected 172 recognition motifs for 133 RNA-binding proteins (RBPs)
(5). TCGA datasets of gene expression, copy number alteration (CNA) and DNA methylation
were downloaded from the TCGA Data Portal on 07/27/2014. Standardized somatic mutation
data were downloaded from Broad Firehose on 10/23/2014. Because we observed that most CpG
islands are located within 1000 nt upstream or downstream of gene transcription start sites
(TSS), we averaged the methylation array probes within +/-1000 nt of each TSS and used the
average as the promoter methylation level (SI Appendix, Fig. S12). GTEx data dated 01/17/2014
were downloaded for the analysis of normal human tissues (6).
Outlier removal for ChIP-seq profiles
A TF may have several ChIP-seq profiles available from different experimental
conditions, antibodies or laboratories. We found that some ChIP-seq profiles of a TF can be
very different from its other profiles, which leads to ambiguous results in the regulatory
inference. For each TF, we clustered the regulatory potential scores of its profiles (computed from
ChIP-seq peaks near gene TSS) by hierarchical clustering with a Pearson-correlation-based distance measure
(SI Appendix, Fig. S13). The hierarchical tree was cut at correlation 0.2 to form clusters. We
ranked these clusters by the number of profiles they contain and kept only the ChIP-seq profiles in
the largest cluster. If no cluster could be formed at correlation threshold 0.2, or if several
of the largest clusters had the same size, we excluded the corresponding TF from further analysis.
After this step, 544 of the 686 ENCODE ChIP-seq profiles remained for analysis,
representing 150 TFs.
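The selection rule applied after cutting the tree can be illustrated with a minimal Python sketch. It assumes the cluster assignment of each profile is already available (for example from a hierarchical clustering routine); the function name and the input format are hypothetical. The sketch reads "no cluster can be formed" as every profile ending up alone in its own cluster, and it keeps a TF that has only a single profile. Both choices are assumptions, not details stated in the text.

```python
from collections import Counter

def select_largest_cluster(cluster_of_profile):
    """Return the profile IDs kept for one TF, or [] if the TF is excluded.

    cluster_of_profile maps a profile ID to the cluster label it received
    after the tree was cut (hypothetical input format).
    """
    if len(cluster_of_profile) == 1:
        # Assumption: a TF with a single profile has nothing to compare against.
        return list(cluster_of_profile)
    ranked = Counter(cluster_of_profile.values()).most_common()
    if not ranked:
        return []
    top_label, top_size = ranked[0]
    if top_size < 2:
        # Every profile sits alone: no cluster was formed at the threshold.
        return []
    if len(ranked) > 1 and ranked[1][1] == top_size:
        # Several largest clusters have the same size: the TF is ambiguous.
        return []
    return sorted(p for p, c in cluster_of_profile.items() if c == top_label)
```

With three profiles of which two fall in the same cluster, the two clustered profiles are kept; with two clusters of two profiles each, the TF is dropped.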
Dataset normalization
All gene expression values measured by RNA sequencing platforms from TCGA and GTEx
were log2 transformed. Our analysis focuses on the expression difference between tumor
and normal tissues. For TCGA gene expression and DNA methylation profiles, very
few tumor samples have a paired normal tissue control. Thus, we grouped all
the normal tissue samples together and used their average value as the background control in
each cancer. For CNA, TCGA provides nearly complete tumor-normal paired measurements, so
we used the gene CNA difference between paired tumor and normal samples. Because GTEx
samples have no normal tissue control as in TCGA, we took the average over all tissue samples
as the background control.
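As an illustration of the expression normalization, the following Python sketch log2-transforms expression values and subtracts the per-gene average of the pooled normal samples. The function names are hypothetical, and the pseudocount of 1 is an assumption added here to avoid log2(0); the text does not say how zero values were handled.

```python
import math

def log2_transform(values, pseudocount=1.0):
    """Log2 transform; the pseudocount is an assumption, not from the text."""
    return [math.log2(v + pseudocount) for v in values]

def differential_expression(tumor, normals):
    """Subtract the per-gene mean of pooled normal samples from one tumor.

    tumor: per-gene log2 values of one tumor sample.
    normals: list of per-gene log2 value lists, one per normal sample.
    """
    n_normals = len(normals)
    background = [
        sum(sample[g] for sample in normals) / n_normals
        for g in range(len(tumor))
    ]
    return [t - b for t, b in zip(tumor, background)]
```

For GTEx, the same function could be called with all tissue samples passed as `normals`, matching the all-sample average used as background there.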
Regulatory potential scores for TF and RBP binding
For each ChIP-seq profile, we searched for ChIP-seq peaks within 10000 nt of
gene TSS annotated by RefSeq (7). A regulatory potential score is calculated for each
pair of ChIP-seq peak and gene TSS by multiplying the ENCODE ChIP-seq intensity score
by an exponential decay score exp(-A*Distance) of the distance between them (8, 9). The
coefficient A is set to log(2)/1000, so that the score of a binding peak 1000 nt away from the gene TSS
decays by 50%. The ENCODE ChIP-seq intensity scores are linearly normalized into the range
(0,1]. The exponential decay score also has the range (0,1]. Thus the final regulatory potential
score lies in (0,1]. For each gene TSS, if there are several ChIP-seq peaks of a TF
nearby, we merged their regulatory potential scores by noisy-or: $1 - \prod_i (1 - \mathrm{score}_i)$.
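The score defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `(intensity, distance)` input format are hypothetical, and intensities are assumed to be already normalized into (0,1].

```python
import math

DECAY = math.log(2) / 1000.0   # coefficient A: the score halves every 1000 nt
MAX_DISTANCE = 10000           # search window around the TSS, in nt

def peak_score(intensity, distance):
    """Intensity (already in (0,1]) times the exponential distance decay."""
    if not 0.0 < intensity <= 1.0:
        raise ValueError("intensity must be normalized into (0, 1]")
    return intensity * math.exp(-DECAY * abs(distance))

def noisy_or(scores):
    """Merge several scores: 1 - prod_i (1 - score_i)."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining

def gene_score(peaks):
    """Regulatory potential of one TF on one gene.

    peaks: iterable of (intensity, distance_to_tss) pairs (hypothetical format).
    Peaks outside the search window are ignored.
    """
    return noisy_or(
        peak_score(i, d) for i, d in peaks if abs(d) <= MAX_DISTANCE
    )
```

A peak with intensity 1 located 1000 nt from the TSS scores 0.5, and two such peaks merge to 0.75 under noisy-or, so additional peaks raise the score without letting it exceed 1.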
For TF regulatory motifs, we searched for matches within the union of DNaseI regions in the UCSC
multiple genome alignment of 33 placental mammals and derived a conservation score for
each motif site in the human genome, using the CCAT package with default parameters (10).
Many TFs from the same TF family have very similar recognition motifs, so their mapped
sites overlap heavily. We clustered all TF motifs by their mutual motif
similarity into 220 clusters, using the CCAT package with default parameters (10). We merged
overlapping motif sites in the human genome into one binding site of a motif
cluster if more than 10% of the TFs in that cluster had motif hits included. The regulatory
potential scores for TF recognition motifs are calculated in the same way as for ChIP-seq data,
using exponential decay with distance, except that the conservation score from the multiple genome
alignment is used instead of the ChIP-seq intensity score.
Most of the RBP motifs collected have lower information content than TF
recognition motifs (5), and the CCAT package cannot find significant hits for them with its statistical
model (10). Thus, we converted all RBP motifs into consensus sequences and searched for
their matches on the same strand of gene 3’UTRs by consensus matching (11). The 172 RBP
motifs were clustered into 73 clusters, and overlapping binding sites were merged in the same
way as for TF recognition motifs. The regulatory potential scores for RBP recognition motifs are
simply defined as the conservation scores of 3’UTR regions in the multiple genome alignments.
When we profile the regulatory activity of RBP motifs, the promoter degree and CpG content
are replaced with the 3’UTR degree (total number of RBP motifs in the gene 3’UTR region) and
the 3’UTR AU content as gene expression background factors.
Frisch-Waugh-Lovell method of regression
When RABIT screens TFs driving tumor gene expression patterns, multiple regressions
need to be conducted against all 686 TF ChIP-seq profiles in each of 7484 tumor
samples. Thus, RABIT uses the time-efficient Frisch-Waugh-Lovell (FWL) method for
regression (12). In the regression, FWL separates out the factors whose values do not change within
each tumor, such as CNA and promoter degree, and regresses only against each variable
ChIP-seq profile to speed up the calculation. In Fig. 1B, the vector R represents the ChIP-seq
regulatory potential scores across all target genes. The matrix B is composed of four columns
of background factors (gene CNA, promoter methylation, promoter degree and CpG
content) that stay constant within the same tumor. We calculated the invariant matrix
$Q = I - B(B'B)^{-1}B'$ just once for all regressions in the same tumor. For each ChIP-seq
profile, only the coefficient $\hat{\beta}_r$, which measures the effect of TF regulation, is
calculated incrementally from the Q matrix.
The time complexity of linear regression is $O(p^2 N)$, where p is the number of covariates
(five in our analysis: four background factors plus one for the TF regulatory potential score) and
N is the number of human genes (about 16000 in our analysis). For each tumor, if we run all
regressions one by one, the time complexity is $O(k \cdot p^2 N)$, where k is the number of
ChIP-seq profiles. With the FWL method, the computation of the matrix $(B'B)^{-1}$ takes
$O((p-1)^2 N + (p-1)^3) = O(p^2 N)$. The computation of $R'QR$ and $R'QY$, where Y is the
response vector of gene expression differences, takes
$k \cdot O(N + (p-1)N + (p-1)^2) = O(k \cdot pN)$ for k ChIP-seq profiles. If we assume k > p, the
time complexity is reduced to $O(k \cdot pN)$ from $O(k \cdot p^2 N)$.
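The FWL shortcut can be sketched with NumPy as follows. This is a minimal illustration, not the RABIT implementation: the function name and array layout are hypothetical, and it relies on the standard FWL identity $\hat{\beta}_r = R'QY / R'QR$ with $Q = I - B(B'B)^{-1}B'$. The N x N matrix Q is never formed explicitly; the products $R'QY$ and $R'QR$ are expanded so that each profile costs $O(pN)$, in line with the complexity argument above.

```python
import numpy as np

def fwl_coefficients(B, Y, profiles):
    """Coefficient of each profile when regressed on Y together with B.

    B: (N, p-1) background factors, fixed within one tumor.
    Y: (N,) response vector.
    profiles: (N, k) matrix whose columns are the candidate vectors R.
    Returns a length-k array of beta_r = (R'QY) / (R'QR).
    """
    BtB_inv = np.linalg.inv(B.T @ B)   # computed once per tumor
    BtY = B.T @ Y                      # (p-1,)
    BtR = B.T @ profiles               # (p-1, k)
    # R'QY = R'Y - R'B (B'B)^-1 B'Y
    RQY = profiles.T @ Y - BtR.T @ (BtB_inv @ BtY)
    # R'QR = R'R - R'B (B'B)^-1 B'R, evaluated per profile (diagonal only)
    RQR = (np.einsum("ij,ij->j", profiles, profiles)
           - np.einsum("ij,ij->j", BtR, BtB_inv @ BtR))
    return RQY / RQR

# Small synthetic example; the sizes are arbitrary, not those of the study.
rng = np.random.default_rng(0)
B_demo = rng.normal(size=(200, 4))
R_demo = rng.uniform(size=(200, 3))
Y_demo = rng.normal(size=200)
beta_demo = fwl_coefficients(B_demo, Y_demo, R_demo)
```

Each entry of `beta_demo` should agree with the last coefficient of an ordinary least-squares fit of `Y_demo` on the four background columns plus the corresponding profile column, which is a convenient way to check the shortcut.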
Correlation between regulator gene expression, somatic mutation and target gene
expression for regulatory motif members
In Step three of the RABIT framework, we tested the impact of TF gene expression and somatic
mutation variation on target gene expression with a linear regression. However, the
regression analysis is more complicated for a regulatory motif because, unlike a ChIP-seq
profile, one regulatory motif might represent several distinct TFs or RBPs (Fig. 4A). With the
TF regulatory activity score as the response variable, we applied stepwise forward
regression to select among the covariates of gene expression and somatic mutation values of
all TF (or RBP) members (SI Appendix, Fig. S2B and Table S2). Instead of using an F-test to
measure the effect of all covariates on regulatory activity scores across tumors, we used a t-test
to assess the significance of each covariate (13). Among all regulators represented by a
regulatory motif, the p-values of the covariates selected by forward regression are grouped
together and converted to FDRs by the Benjamini-Hochberg procedure (14). This procedure is
applied to each cancer type, and a regulator is reported as cancer associated if at least one
covariate’s regression coefficient is statistically significant (FDR threshold 0.05).
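For readers unfamiliar with the Benjamini-Hochberg conversion, a minimal pure-Python sketch is given here. The function name is hypothetical, and the sketch implements the standard step-up adjustment (p-value times the number of tests divided by the rank, made monotone from the largest p-value downward); it is not taken from the RABIT code.

```python
def benjamini_hochberg(pvalues):
    """Convert p-values to Benjamini-Hochberg adjusted values (FDRs).

    The result is returned in the same order as the input.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        fdr[i] = running_min
    return fdr
```

For the p-values 0.01, 0.04, 0.03 and 0.5, only the first one passes an FDR threshold of 0.05 after adjustment, although three of the raw p-values are below 0.05.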
Algorithm Comparison
To compare the performance of RABIT with other methods in finding cancer-associated
TFs, we used receiver operating characteristic (ROC) curves and precision-recall
(PR) curves. The gold standard positive set is defined as the TFs annotated as cancer associated in
at least two of four cancer gene databases (NCI Cancer Index (15), Bushman (16, 17),
COSMIC (18) and CCGD (19)). The gold standard negative set is defined as the remaining TFs.
Among the 150 TFs with ChIP-seq profiles analyzed, 96 TFs are classified as gold standard
positives and 54 as gold standard negatives. After running each method, we
generated a ranking of TFs reflecting their relative relevance to each cancer type. We derived an
overall TF ranking by averaging the TF ranks across all TCGA cancer types. We swept through
each ranked TF list to generate the ROC and PR curves for each method (SI Appendix, Fig. S8
B and C). The parameters used to run each method are listed as follows.
For LASSO, we used the glmnet R package, set the penalty weights of the four background
factors (promoter degree, CpG content, CNA and promoter methylation) to zero and kept all
other parameters at their defaults (20). With zero penalty weights, these background
factors are always retained in the linear model as controls. For LAR, we used the lars R
package (21). Because lars provides no way of inputting background covariates, we first
calculated the residuals of the tumor gene expression values and the ChIP-seq regulatory potential
scores after regressing them on the four background factors. In this way, the impact of the background
factors is removed from the data. Then, we ran the lars package on these residual values
with default parameters.
Besides regression methods, we also included three methods designed for finding master
regulators of gene expression patterns. We ran the MARINA and VIPER algorithms using the
VIPER R package with default parameters (22, 23). We also ran the Expression-2-Kinase
(X2K) package with default parameters (24). Since X2K only takes a gene list as input, we
took the top 10% most up-regulated (or down-regulated) genes in each tumor to calculate the
master TFs of gene up-regulation (or down-regulation). We also included two methods for
baseline comparison. For each TF, we used a t-test to check whether its ChIP-seq target genes
are significantly differentially regulated. The t-test p-values were converted to FDRs by the
Benjamini-Hochberg procedure, and an FDR threshold of 0.05 was used to select significant TFs
in each tumor. For each TF, we also used its gene expression difference between all tumor
samples and all normal controls as a measurement of cancer relevance.
For RABIT, LAR, LASSO, X2K and the t-test, a set of TFs was selected as important regulators
of gene expression patterns in each tumor. For each TF, the percentage of tumors in which the TF
was selected was calculated for each TCGA cancer type (Fig. 2). The overall cancer relevance is
defined as the average percentage of tumors with the TF selected across all TCGA cancer types. For
MARINA and VIPER, the TFs were ranked by the adjusted p-values from each algorithm. The
overall cancer relevance of TFs was defined as the average rank of TFs across all TCGA
cancer types. For TF expression, the TFs were ranked by the absolute value of the tumor versus
normal expression difference averaged among all TCGA cancers. The ranked list of each
algorithm was swept through to generate the ROC and PR curves (SI Appendix, Fig. S8 B
and C). The areas under the ROC curves of two algorithms were compared using the DeLong test
(25).
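Sweeping a ranked list against a gold standard can be illustrated with the following Python sketch. The function names and the TF labels in the usage note are hypothetical placeholders, and the sketch assumes a strict ranking without ties. It computes the area under the ROC curve and the precision-recall points only; the DeLong test itself is not implemented here.

```python
def roc_auc(ranked_tfs, positives):
    """Area under the ROC curve for a ranked list (most relevant first)."""
    n_pos = sum(1 for tf in ranked_tfs if tf in positives)
    n_neg = len(ranked_tfs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    true_positives = 0
    area = 0
    for tf in ranked_tfs:
        if tf in positives:
            true_positives += 1
        else:
            # Each negative moves the curve one step right at the current height.
            area += true_positives
    return area / (n_pos * n_neg)

def precision_recall_points(ranked_tfs, positives):
    """(recall, precision) after each position of the ranked list."""
    n_pos = sum(1 for tf in ranked_tfs if tf in positives)
    points = []
    true_positives = 0
    for seen, tf in enumerate(ranked_tfs, start=1):
        if tf in positives:
            true_positives += 1
        points.append((true_positives / n_pos, true_positives / seen))
    return points
```

For the ranked list `["TF1", "TF2", "TF3", "TF4"]` with positives `{"TF1", "TF3"}`, the positive TFs outrank the negative ones in three of the four positive-negative pairs, which gives an AUC of 0.75.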
Besides comparing different methods, we also tested the performance of RABIT without
controlling for the four background factors (promoter degree, CpG content, CNA and promoter
methylation). The area under the ROC curve of RABIT is significantly larger than that of the result without
background factors (SI Appendix, Fig. S8D, 0.730 > 0.682, DeLong test P-value = 0.023),
and the precision of RABIT is consistently higher than that of the result without background factors
when the recall rate is higher than 0.5 (SI Appendix, Fig. S8E).
Correlation between TF regulatory activity and genome-wide CRISPR screening
For the cell lines K562 and HL60, gene expression data profiled by ENCODE and
genome-wide CRISPR screening data are available from previous studies (26, 27). The CRISPR
screening scores were downloaded directly from each study. For each gene screened, a
positive score implies that cell growth became faster after TF CRISPR knockout, and a
negative score implies that cell growth became slower after TF CRISPR knockout.
We applied RABIT to ENCODE gene expression data across 76 cell lines. A regulatory
activity score was calculated for each TF to indicate whether the TF target genes are up
regulated or down regulated. For transcriptional repressors (defined below), the regulatory
activity scores were sign-reversed, since the direction of target gene regulation is opposite to
the direction of TF activation. The Spearman rank correlations between TF regulatory activity
scores and TF CRISPR screening scores were calculated, and the p-values of the correlation test
were computed in R.
In the analysis above, we only included transcriptional activators and repressors. The
correlations between TF gene expression values and target regulatory activity scores were
computed for all TCGA cancer data, and significant correlations were selected with a
correlation t-test FDR threshold of 0.05. Transcriptional activators are defined as TFs with
positive correlations in more than 80% of TCGA cancer types, and transcriptional repressors are
defined as TFs with negative correlations in more than 80% of cancer types.
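The activator and repressor definition can be written as a short Python sketch. The function name and input format are hypothetical. Two points are assumptions made for this illustration: a correlation counts only if its FDR is strictly below the threshold, and the 80% fraction is taken over all cancer types supplied rather than over the significant ones only.

```python
def classify_tf(correlations, fdrs, fdr_threshold=0.05, min_fraction=0.8):
    """Label one TF from its per-cancer-type correlations and FDRs.

    correlations: correlation between TF expression and target regulatory
        activity, one value per cancer type.
    fdrs: FDR of each correlation, in the same order.
    """
    n_types = len(correlations)
    if n_types == 0:
        return "unclassified"
    significant = [r for r, q in zip(correlations, fdrs) if q < fdr_threshold]
    n_positive = sum(1 for r in significant if r > 0)
    n_negative = sum(1 for r in significant if r < 0)
    if n_positive / n_types > min_fraction:
        return "activator"
    if n_negative / n_types > min_fraction:
        return "repressor"
    return "unclassified"
```

A TF with significant positive correlations in nine of ten cancer types is labeled an activator, while one with exactly eight of ten is left unclassified because the definition requires more than 80%.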
References
1. Consortium EP, et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57-74.
2. Jolma A, et al. (2013) DNA-binding specificities of human transcription factors. Cell 152(1-2):327-339.
3. Mathelier A, et al. (2014) JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic acids research 42(Database issue):D142-147.
4. Matys V, et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic acids research 31(1):374-378.
5. Ray D, et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature 499(7457):172-177.
6. Consortium GT (2013) The Genotype-Tissue Expression (GTEx) project. Nature genetics 45(6):580-585.
7. Pruitt KD, Tatusova T, & Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35(Database issue):D61-65.
8. Tang Q, et al. (2011) A comprehensive view of nuclear receptor cancer cistromes. Cancer research 71(22):6940-6947.
9. Wang S, et al. (2013) Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols 8(12):2502-2515.
10. Jiang P & Singh M (2014) CCAT: Combinatorial Code Analysis Tool for transcriptional regulation. Nucleic acids research 42(5):2833-2847.
11. Kheradpour P, Stark A, Roy S, & Kellis M (2007) Reliable prediction of regulator targets using 12 Drosophila genomes. Genome research 17(12):1919-1931.
12. Frisch R & Waugh FV (1933) Partial Time Regressions as Compared with Individual Trends. Econometrica 1(4):387-401.
13. Freedman D (2009) Statistical models : theory and practice (Cambridge University Press, Cambridge ; New York) pp xiv, 442 p.
14. Benjamini Y & Hochberg Y (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57(1):289-300.
15. NCI (2014) Cancer Gene Index Project.
16. Sadelain M, Papapetrou EP, & Bushman FD (2012) Safe harbours for the integration of new DNA in the human genome. Nature reviews. Cancer 12(1):51-58.
17. Vogelstein B, et al. (2013) Cancer genome landscapes. Science 339(6127):1546-1558.
18. Futreal PA, et al. (2004) A census of human cancer genes. Nature reviews. Cancer 4(3):177-183.
19. Abbott KL, et al. (2014) The Candidate Cancer Gene Database: a database of cancer driver genes from forward genetic screens in mice. Nucleic acids research.
20. Friedman J, Hastie T, & Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software 33(1):1-22.
21. Efron B, Hastie T, Johnstone I, & Tibshirani R (2004) Least angle regression. The Annals of statistics 32(2):407-499.
22. Lefebvre C, et al. (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Molecular systems biology 6:377.
23. Alvarez MJ (2013) viper: Master Regulator Analysis including MARINA and VIPER algorithms. R package version 0.99.0.
24. Chen EY, et al. (2012) Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1):105-111.
25. DeLong ER, DeLong DM, & Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837-845.
26. Gilbert LA, et al. (2014) Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell 159(3):647-661.
27. Wang T, Wei JJ, Sabatini DM, & Lander ES (2014) Genetic screens in human cells using the CRISPR-Cas9 system. Science 343(6166):80-84.
Supplementary Figures and Tables
[Figure S1 plot: boxplots in panels A and B. Y-axis: correlation (about −0.1 to 0.2). X-axis: TCGA datasets by cancer type and platform (for example LUSC.HiSeq, BRCA.Agilent, OV.U133A, KIRC.HiSeq.V2).]
Fig. S1. Background confounding factors of gene expression. (A) For each tumor, we computed
the Spearman rank correlation between the promoter degree of each gene and the tumor-normal
expression difference. For each cancer type, the correlation values across all tumors are shown. The
bottom and top of the box are the 25th and 75th percentiles (i.e., they give the interquartile range).
Whiskers on the top and bottom represent the maximum and minimum data points within
1.5 times the interquartile range. (B) The correlation values between the promoter
CpG content and the gene tumor-normal expression difference are shown by boxplots.
[Figure S2 diagram. (A) Gene expression regressed on background factors (sub-matrix B) and regulators in selection (sub-matrix R); n: number of genes, pb: number of background factors, pr: number of regulators. (B) Regulatory activity scores (t-value) regressed on regulator gene expression (E) or somatic mutation (M); n: number of tumors, p: number of members.]
Fig. S2. Linear model structure. (A) For each tumor, the TF regulatory effect on target
gene expression is evaluated by linear regression, where each regression unit is a human gene. The covariate matrix
X is composed of two sub-matrices, B and R. Sub-matrix B contains the values of the background factors,
which include promoter degree, promoter CpG content, gene CNA and promoter DNA methylation.
Sub-matrix R contains the regulatory potential scores that measure TF binding intensity near gene
TSS, and a set of TFs is selected to accurately model the gene expression pattern in each
tumor. The response variable Y contains the gene expression differences between the tumor sample and
normal controls. In our analysis, the number of regression units n is about 16000, which represents the
number of human genes with TCGA gene expression measured and ChIP-seq binding peaks near their
TSS. The number of background factors pb, i.e., the dimension of B, is 4. The dimension of R, pr, is 150,
which represents the number of TFs with ENCODE ChIP-seq profiles. (B) RABIT investigates whether
the public ChIP-seq profiles (or regulatory motifs) used can represent the active TF targets in each
cancer type. For each cancer type, we regress the response variable of TF regulatory activity scores
linearly against the covariates of TF gene expression and somatic mutation, where each regression unit is
a tumor sample. A TF (or RBP) regulatory motif might contain several members, and all of them are
included as covariates. The response vector contains the TF regulatory activity scores, which are the
estimated coefficients normalized by their estimated standard errors (i.e., t-values) for TF regulatory
effects on target gene differential expression (example in Table 1A). The number of regression units is
the number of tumors in each TCGA cancer type. The dimension of X is 2 for ChIP-seq analysis,
representing the TF gene expression and somatic mutation. For regulatory motifs, several
members might be included; for example, RBP motif cluster 9 includes RBFOX1, RBFOX2, RBFOX3 and EIF2S1. To
determine which members are relevant to the expression patterns of motif target genes, we run a forward
selection among the gene expression and mutation values (2p covariates) of all members with Mallows' Cp
as the model selection metric (example in SI Appendix, Table S2).
[Figure S3 plots. (A) t-values (x-axis, 0 to 12) of ENCODE MYC ChIP-seq profiles, labeled by cell line, treatment and laboratory (for example MCF-7+vehicle+UT-A, K562+IFNg30+Stanford, HeLa-S3+Yale). (B) Percentage of tumors (x-axis, 0 to 60%) in which each profile is selected.]
Fig. S3. Selection of ChIP-seq profile with the largest statistical effect. When one TF has several
ChIP-seq profiles available, RABIT only uses the ChIP-seq profile that gives the most significant
coefficient (the largest absolute t-value) in the regression analysis of TF regulation on target genes. (A)
In this example, all ENCODE MYC ChIP-seq profiles are analyzed together with the TCGA data of breast
tumor TCGA-AO-A03P-01A. (B) For each TCGA breast tumor, only the single most relevant ENCODE MYC
ChIP-seq profile is selected. We show the percentage of tumors in which each ChIP-seq profile is selected.
[Figure S4 plot: TFs (RBBP5, EP300, YY1, FOXA1, FOXP2, CEBPB, SRF, TEAD4, CTCF, MAFK, RAD21, MAZ, RFX5, NR3C1, EBF1, HNF4A, EZH2, ZNF217, TAF1, REST, SPI1, STAT2, SUZ12, STAT3, FOS, IKZF1, MAX, RUNX3, ZBTB7A, ZEB1, E2F4, MYC, SAP30, PHF8, MYBL2, E2F1, FOXM1) against TCGA datasets by cancer type and platform. Scale: 100% down regulated to 100% up regulated. Clusters are labeled 1 to 4.]
Fig. S4. Structure of transcriptional regulation in cancer. To derive a structure of transcriptional
regulation in cancer, we clustered TFs by the difference between the percentage of tumors with
TF target genes up regulated and the percentage down regulated in diverse cancer types. We split the
hierarchical tree into four clusters, with ZEB1 as an outlier that is not clustered in any group.
[Figure S5 plot: scatter plot. X-axis: gene expression. Y-axis: regulatory activity. R = −0.275 (P = 2.27e−10).]
Fig. S5. Correlation of gene expression between RAD21 and its target genes. The Spearman rank
correlation between RAD21 gene expression values and TF target regulatory activity scores is shown
for TCGA breast tumors.
[Figure S6 plots: survival curves for the High and Low groups in panels A to F. X-axis: survival (days). Y-axis: percentage (%). P-values: (A) 0.003, (B) 0.0229, (C) 3.82e-5, (D) 2.37e-4, (E) 7.65e-8, (F) 5.68e-9.]
Fig. S6. Survival analysis of several predictions. (A) The patients were ordered by SPI1
expression values using TCGA GBM data. The top half of the patients are classified as “High” and the bottom
half as “Low”. The overall survival days are plotted as Kaplan-Meier curves, and the p-value
is estimated by a Weibull model with age and gender as background factors. (B) Survival analysis
is done with Rembrandt data in the same way as A. (C) Survival analysis is done with Gravendeel data
in the same way as A. (D) Survival analysis is done with TCGA KIRC data in the same way as A, except
that the clinical stage of KIRC is added as one background factor in the Weibull regression. (E) The Rembrandt
cohort is used and the regulatory activity of RBP motif cluster 9 is analyzed for survival in the same
way as Fig. 4C. (F) The Gravendeel cohort is used and the regulatory activity of RBP motif cluster 9 is
analyzed in the same way as Fig. 4C.
[Figure S7 plot: percentage of tumors (x-axis, 0 to 60%) with target genes up or down regulated for the TFs PHF8, RBBP5, TBP, MAFK, SAP30, SPI1, EBF1, ZKSCAN1, BHLHE40, SMC3, TAF1, UBTF, IRF3, TEAD4, GTF2F1, RFX5, GTF3C2, ATF1, MXI1, GTF2B and CDKN3. Columns on the left give the NCI Cancer Index value and the +/− annotation in Bushman, COSMIC and CCGD.]
Fig. S7. TFs with potential roles in breast cancer. We show all TFs with target genes differentially
regulated in more than 10% of breast tumors, averaged among the TCGA and METABRIC datasets. We
excluded all TFs related to breast cancer by NCI cancer index annotation and Google literature search.
The remaining TFs represent regulators that may have a global impact on breast tumor gene expression but
have not been well studied until now. For each TF, the fraction of tumors with target genes up regulated is
shown in red and the fraction down regulated is shown in blue. The NCI cancer index value and cancer
gene annotation from all databases are shown on the left.
[Figure S8 plots. (A) Percentage (%) for Cancer versus Other TFs in the Bushman, COSMIC and CCGD databases, with significance asterisks. (B, D) ROC curves: true positive rate against false positive rate. (C, E) PR curves: precision against recall. Legend for B and C: RABIT, LAR, LASSO, t test, X2K, MARINA, VIPER, TF expression. Legend for D and E: RABIT, No background.]
Fig. S8. Algorithm performance comparison. (A) We divided TFs according to the cancer gene
annotation in each database, and compared the average percentage of tumors (across all TCGA cancers)
with target genes differentially regulated between the two categories by the Mann-Whitney U test. One
asterisk indicates P-value < 0.05, two asterisks indicate P-value < 0.01, and three asterisks indicate
P-value < 0.001. (B) Receiver operating characteristic (ROC) curves were plotted for RABIT and seven
other methods. LAR and LASSO are two regression-based feature selection methods. X2K, MARINA
and VIPER are integrative algorithms developed to find master regulators driving gene expression
patterns. T-test and TF expression are two baseline methods that use only the differential expression
of target genes (T-test) or of the TF itself (TF expression). (C) Precision-recall (PR) curves were plotted for
all methods. (D) We ran the RABIT algorithm without controlling for the four background factors (promoter
degree, CpG content, CNA and promoter methylation). The algorithm performance was compared between
the initial result and the result without background factor control using ROC curves. The area under the curve
of RABIT is significantly larger than that of the result without background factors (0.730 > 0.682, DeLong test
P-value = 0.023). (E) The same comparison was plotted with PR curves.
[Figure S9 plot: CV error improvement (%) (x-axis, 0 to 8) per TCGA dataset for RABIT, LAR and LASSO, with running time (sec) shown on a log scale (0.2 to 200).]
Fig. S9. RABIT outperforms the state-of-the-art feature selection algorithms. For each tumor, we
selected the top ten most important TFs prioritized by each algorithm to train a linear model. The
leave-one-out cross validation (CV) error was used to estimate the error of predicting gene expression values
from TF binding. The CV errors were calculated for the model with only the background factors of gene
expression and for the model with the top ten most important TFs. In each tumor, the improvement between
the two CV errors was converted to a relative fraction by dividing by the CV error of the base model with
only the four background factors. In each cancer type, the relative CV error improvements were averaged
across all tumors and taken as the algorithm performance. For each method, the average running time
across all tumors in a cancer type is shown on a log scale.
[Figure S10 plots. (A, B) Percentage of samples with TF targets up regulated or down regulated (0 to 100%) per TF, for the breast datasets (GTEx Breast, METABRIC, TCGA Agilent, HiSeq, HiSeq.V2) and the brain datasets (GTEx Brain, Rembrandt, Gravendeel, TCGA U133A, Agilent, HiSeq.V2). (C, D) Correlation between the TCGA datasets and the other datasets.]
Fig. S10. Cross-dataset comparison of TF regulation. For breast cancer and glioblastoma multiforme
(GBM), we applied the RABIT framework to other cohorts as independent controls for the TCGA results.
To contrast the regulatory behavior of TFs in tumors, we also applied RABIT to the GTEx gene
expression cohorts to obtain TF regulatory activities in normal breast and brain tissues. (A) For breast
cancer, we compared the transcriptional regulation landscape from TCGA data with the results computed
from the METABRIC and GTEx datasets. There are 1992 breast tumor samples in METABRIC and 66
normal breast samples in GTEx. The percentage of tumors or GTEx samples with TF targets
differentially regulated is plotted in the same way as Fig. 2. (B) For GBM, we compared the results of
the TCGA datasets with the results computed from other glioma datasets and the GTEx dataset. There are
381 tumor samples in the Rembrandt cohort, 276 tumor samples in the Gravendeel cohort
and 357 normal brain samples in GTEx. The percentage of samples with TF targets differentially
regulated is plotted. (C) For each TF, we used the difference between the percentage of up-regulated
samples and the percentage of down-regulated samples as an overall measure of regulation
in each cancer type. We compared this overall measure between each TCGA dataset and the other
datasets by Spearman rank correlation across all TFs. The correlation values are shown for breast
cancer. (D) The correlation values between TCGA and other datasets are shown for GBM.
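The overall measure in panels C and D can be sketched as below: a per-TF difference between the up-regulated and down-regulated percentages, followed by a Spearman rank correlation across TFs. The function names, the placeholder TF labels and all percentages are hypothetical and serve only to make the example runnable.

```python
def overall_regulation(up_pct, down_pct):
    """Per-TF overall measure: percentage up regulated minus percentage down regulated."""
    return {tf: up_pct[tf] - down_pct[tf] for tf in up_pct}


def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical percentages for four placeholder TFs in two datasets.
tcga = overall_regulation({"TF_A": 60, "TF_B": 80, "TF_C": 5, "TF_D": 70},
                          {"TF_A": 5, "TF_B": 0, "TF_C": 50, "TF_D": 10})
other = overall_regulation({"TF_A": 40, "TF_B": 90, "TF_C": 10, "TF_D": 50},
                           {"TF_A": 10, "TF_B": 5, "TF_C": 60, "TF_D": 5})
shared = sorted(set(tcga) & set(other))
rho = spearman([tcga[tf] for tf in shared], [other[tf] for tf in shared])
```

Restricting the comparison to TFs present in both datasets (`shared`) is a design choice of this sketch; the caption does not say how missing TFs are handled.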
[Figure S11 plot. Panel A: Up regulated (%) and Down regulated (%), each 0 to 100, with TCGA datasets (LUSC.HiSeq through GBM.U133A) as rows and TF motif clusters, labeled by consensus sequence and TF name or cluster index, as columns. Panel B: box plots of Correlation per TCGA dataset. Panel C: histogram of Percentage (%) against correlation, -0.1 to 0.5.]
Fig. S11. Transcriptional regulation by TF recognition motifs. (A) The percentage of tumors with
motif target genes differentially regulated is shown for all cancer types and regulatory motif clusters
in the same way as Fig. 2. Each TF motif cluster is labeled with the consensus sequence
of the centroid motif averaged among all members, followed by the TF name, or by the cluster index if
there are multiple members. (B) For each tumor, we computed the Spearman rank correlation between the
regulatory activity scores derived from ChIP-seq profiles and the regulatory activity scores derived from
their matched recognition motifs. For each cancer type, all correlation values are shown by box plots.
The bottom and top of the boxes are the 25th and 75th percentiles (i.e., they give the interquartile
range). Whiskers on the top and bottom represent the maximum and minimum data points within the
range represented by 1.5 times the interquartile range. The width of each box is proportional to the
square root of the sample size in that group. (C) We computed the direct correlation between ChIP-seq
and motif binding data. For each TF, we computed the correlation of regulatory potential scores on target
genes between ChIP-seq binding and PWM binding data. The histogram of Spearman rank correlation
values is plotted.
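The box plot elements described for panel B can be computed as in the sketch below. It assumes the common convention that whiskers reach the most extreme data points lying within 1.5 times the interquartile range beyond the quartiles, and it uses the "inclusive" quantile method of Python's `statistics` module; the plotting software used for the figure may define quartiles slightly differently. The function name and the input values are hypothetical.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and relative box width for one group of values."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Data points inside the whisker range; anything outside is an outlier.
    inside = [v for v in values if q1 - whisker * iqr <= v <= q3 + whisker * iqr]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Box width proportional to the square root of the sample size.
        "relative_width": len(values) ** 0.5,
    }


# Hypothetical values with one extreme point.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this example the value 100 falls outside the whisker range, so the upper whisker ends at 9.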
[Figure S12 plot. Panel A: Count against TSS offset (nt), -4000 to 4000. Panel B: Count (1e4) against TSS offset (nt), -4000 to 4000.]
Fig. S12. Gene promoter CpG distribution. (A) The annotation of CpG islands was downloaded
from the UCSC genome browser. We plotted the accumulated count of annotated CpG islands within
5000 nts of gene TSSs. (B) The CpG dinucleotide counts are plotted within 5000 nts of gene TSSs.
[Figure S13 plot. Dendrogram of ENCODE MYC ChIP-seq profiles; axis: Correlation, from 1.0 to 0.2. Leaf labels, in plotted order: H1-hESC+UT-A; HeLa-S3+UT-A; MCF10A-Er-Src+4OHTAM_1uM_4hr+Harvard; MCF10A-Er-Src+EtOH_0.01pct+Harvard; MCF-7+vehicle+UT-A; MCF-7+serum_stimulated_media+UT-A; MCF-7+estrogen+UT-A; MCF-7+serum_starved_media+UT-A; HeLa-S3+Yale; K562+Yale; K562+IFNa6h+Yale; K562+IFNa30+Yale; NB4+Stanford; K562+UT-A; K562+IFNg6h+Yale; K562+IFNg30+Stanford; K562+Stanford; GM12878+UT-A; H1-hESC+Stanford; HUVEC+UT-A; HepG2+UT-A.]
Fig. S13. Outlier removal of ChIP-seq profiles. All ChIP-seq profiles for the same TF are hierarchically
clustered with Pearson correlation as the distance measure. The hierarchical tree is cut at correlation
threshold 0.2 to generate clusters. Only ChIP-seq profiles in the largest cluster are kept for further
analysis. All ENCODE MYC ChIP-seq profiles are shown as an example.
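The outlier removal procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: average linkage is assumed because the linkage method is not given, the distance is taken as one minus the Pearson correlation, a TF with a single profile is kept, and the profile names and scores are hypothetical.

```python
def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def largest_cluster(profiles, min_corr=0.2):
    """Return the profile names in the largest cluster, or None to exclude the TF.

    profiles maps a profile name to its regulatory potential scores over genes.
    Clusters are merged (average linkage, an assumption) while the closest pair
    is within distance 1 - min_corr, which is equivalent to cutting the tree
    at correlation min_corr.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def avg_dist(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        d, i, j = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > 1.0 - min_corr:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    sizes = sorted((len(c) for c in clusters), reverse=True)
    # Exclude the TF when the largest cluster is tied, which also covers
    # the case where no profiles could be joined at all.
    if len(sizes) > 1 and sizes[0] == sizes[1]:
        return None
    return max(clusters, key=len)


# Hypothetical scores for four profiles of one TF; p4 is the outlier.
kept = largest_cluster({
    "p1": [1, 2, 3, 4, 5],
    "p2": [1.1, 2.0, 2.9, 4.2, 5.1],
    "p3": [2, 4, 6, 8, 11],
    "p4": [5, 1, 4, 2, 3],
})
```

Here the first three profiles are mutually well correlated and merge into one cluster, while the fourth stays apart and is dropped.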
Cancer  Platform  #Expression  #CNA  #Methylation  #Mutation  #Complete
BRCA    HiSeq.V2         1069  1025          1053        979        933
BRCA    HiSeq             782   753           781        758        736
BRCA    Agilent           534   516           533        513        498
KIRC    HiSeq.V2          532   501           531        415        400
KIRC    HiSeq             470   445           469        391        377
THCA    HiSeq.V2          506   486           506        403        388
OV      Agilent           592   592           591        317        317
OV      U133A             586   586           585        316        316
HNSC    HiSeq.V2          498   479           498        304        290
HNSC    HiSeq             263   251           263        261        249
PRAD    HiSeq.V2          419   400           333        258        247
UCEC    GA.V2             370   349           370        241        226
UCEC    GA                333   313           333        239        225
UCEC    HiSeq.V2          159   155           159          0          0
LIHC    HiSeq.V2          212   198           200        193        182
LIHC    HiSeq              17    17            17         15         15
LUSC    HiSeq             225   224           225        179        179
LUSC    HiSeq.V2          491   477           490        179        179
CESC    HiSeq.V2          208   187           207        191        176
LUAD    HiSeq.V2          490   470           487        171        164
LUAD    HiSeq             126   122           124        108        104
KIRP    HiSeq.V2          226   218           226        161        156
STAD    HiSeq             249   227           249        170        153
COAD    Agilent           155   115           153        142        109
COAD    HiSeq.V2          274   264           274          0          0
READ    Agilent            69    48            68         64         46
READ    HiSeq.V2           92    89            91          0          0
BLCA    HiSeq.V2          267   234           252        129        107
BLCA    HiSeq              56    46            56         51         41
GBM     Agilent           596   521           105        286         92
GBM     U133A             548   474            74        250         63
GBM     HiSeq.V2          169   162            60        149         45
KICH    HiSeq.V2           66    66             0         66          0
Table S1. Number of tumor samples in TCGA cancer types. For each TCGA dataset, we listed
the number of tumor samples with gene expression measured on each platform (#Expression).
We also counted the number of tumors with copy number alteration data (#CNA), DNA methylation data
(#Methylation) and somatic mutation data (#Mutation). Finally, we counted the number of tumor
samples with all types of data available (#Complete).
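The #Complete column corresponds to the intersection of the per-data-type sample sets. A minimal sketch, with hypothetical sample identifiers and a hypothetical function name:

```python
def count_samples(samples_by_type):
    """Count samples per data type and those present in every data type."""
    counts = {name: len(set(ids)) for name, ids in samples_by_type.items()}
    complete = set.intersection(*(set(ids) for ids in samples_by_type.values()))
    counts["complete"] = len(complete)
    return counts


# Hypothetical sample identifiers for one dataset.
counts = count_samples({
    "expression": ["s1", "s2", "s3", "s4"],
    "cna": ["s1", "s2", "s3"],
    "methylation": ["s1", "s2", "s4"],
    "mutation": ["s1", "s2"],
})
```

A data type with no samples, as in the rows with a zero count, gives a complete count of zero.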
Cohort               Regulator  Estimate  Std. Error  t-value   p-value
TCGA.Agilent         RBFOX3        0.719       0.098     7.36  8.64e-11
TCGA.Agilent         RBFOX2        0.388       0.124     3.13  2.35e-03
TCGA.U133A*          RBFOX1        0.469       0.111     4.23  8.05e-05
TCGA.U133A*          RBFOX2        0.521       0.222     2.34  2.25e-02
TCGA.HiSeq           RBFOX3        0.363       0.062     5.82  6.79e-07
Rembrandt.U133+2.0   RBFOX2         4.55       0.378    12.03  2.00e-28
Rembrandt.U133+2.0   RBFOX3         2.67       0.584     4.57  6.51e-06
Rembrandt.U133+2.0   RBFOX1        0.776       0.215     3.60  3.56e-04
Rembrandt.U133+2.0   EIF2S1       -0.926       0.406    -2.28  2.33e-02
Gravendeel.U133+2.0  RBFOX2         1.44       0.120    12.03  5.04e-27
Gravendeel.U133+2.0  RBFOX1        0.539       0.087     6.20  2.05e-09
* The RBFOX3 value is missing from the TCGA level 3 data of U133A.
Table S2. Determinant RBP members of motif clusters. For each motif cluster with several members,
RABIT uses stepwise forward regression to select the relevant members that drive regulatory activity
score variation across all tumors in the same cancer type. RBP motif cluster no. 9 is used as an
example here (Fig. 4A), and the regression statistics are shown for all RBP members selected for the
different glioma cohorts.
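Stepwise forward regression of this kind can be sketched as below. The sketch makes several assumptions that the caption does not state: each candidate member is represented by one value per tumor, a member enters when its partial F statistic exceeds a fixed threshold (`f_enter`, hypothetical), and selection stops when no remaining member passes. The member names and data are invented for the example.

```python
def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta = least_squares(X, y)
    return sum((t - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, t in zip(X, y))


def forward_select(candidates, y, f_enter=4.0):
    """Greedy forward selection; f_enter is an assumed entry threshold."""
    n = len(y)
    design = [[1.0] for _ in range(n)]  # start from the intercept-only model
    current = rss(design, y)
    selected = []
    while len(selected) < len(candidates):
        best = None
        for name, values in candidates.items():
            if name in selected:
                continue
            trial = [row + [v] for row, v in zip(design, values)]
            df = n - len(trial[0])
            if df <= 0:
                continue
            r = rss(trial, y)
            # Partial F statistic for adding this one member.
            f = (current - r) / (r / df) if r > 0 else float("inf")
            if best is None or f > best[0]:
                best = (f, name, trial, r)
        if best is None or best[0] < f_enter:
            break
        _, name, design, current = best
        selected.append(name)
    return selected


# Hypothetical data: the activity score depends on rbp_a only.
rbp_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rbp_b = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
activity = [3 * a + e for a, e in zip(rbp_a, noise)]
chosen = forward_select({"rbp_a": rbp_a, "rbp_b": rbp_b}, activity)
```

A p-value cutoff on the partial F test would be the more usual entry rule; the fixed F threshold is used here only to keep the sketch free of external dependencies.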