DeepPASTA: Deep neural network based polyadenylation site analysis (Supplementary Materials)
Ashraful Arefeen, Xinshu Xiao, and Tao Jiang
1 Supplementary figures
[Figure S1 diagram, four input/output panels.
(a) PolyA site prediction model of DeepPASTA. Input: genomic sequence and RNA secondary structures 1-3. Output: likelihood value for the middle position of the input sequence to be a polyA site.
(b) Tissue-specific polyA site prediction model of DeepPASTA. Input: genomic sequence and RNA secondary structures 1-3. Output: likelihood value for the middle position of the input sequence to be a polyA site, for each of tissues 1 to 9.
(c) Tissue-specific relatively dominant polyA site prediction model of DeepPASTA. Input: genomic sequences 1 and 2 with RNA secondary structures 1 and 2. Output: probability for the 1st polyA site to be more dominant and probability for the 2nd polyA site to be more dominant.
(d) Tissue-specific absolutely dominant polyA site prediction model of DeepPASTA. Input: genomic sequence and RNA secondary structure. Output: likelihood value for the input polyA site to be the absolutely dominant polyA site of the gene.]
Figure S1: The input and output of the four models of DeepPASTA. a) The polyA site prediction model of DeepPASTA takes a genomic sequence of 200 nts and three energy efficient RNA secondary structures predicted by RNAshapes [Steffen et al., 2006] from the sequence as the input and predicts whether the input sequence contains a polyA site at the middle or not. b) Similar to the previous model, the tissue-specific polyA site prediction model of DeepPASTA takes a sequence and three corresponding secondary structures generated by RNAshapes as the input and predicts whether the input sequence contains a polyA site at the middle or not for the nine tissues studied in [Derti et al., 2012]. c) The tissue-specific relatively dominant polyA site prediction model of DeepPASTA takes a pair of sequences and corresponding secondary structures containing polyA sites of some gene at the middle as the input and predicts which polyA site is relatively dominant. d) Unlike the relatively dominant model, the absolutely dominant model of DeepPASTA takes a sequence and corresponding secondary structure containing a polyA site of some gene at the middle as the input and predicts whether the polyA site is an absolutely dominant polyA site of the gene.
[Figure S2 diagram. Inputs: genomic sequence and RNA secondary structures 1-3, each passed through a one-hot encoder into a sequence sub-model or a secondary structure sub-model. Labelled components: filters, convolution layers, max pooling layer, average pooling layer, LSTMs, flattening layers, fully connected layers, addition layer, concatenation layer, multiple fully connected layers, and a single neuron output layer. Dotted outlines mark M3 and M4.]
Figure S2: Architectures of the polyA site prediction model of DeepPASTA, M3 and M4. The polyA site prediction model has four sub-models: a sequence sub-model and three secondary structure sub-models. Each sub-model consists of a convolution layer, a max pooling layer, a recurrent layer (i.e., a bi-directional LSTM), a flattening layer, and a fully connected layer. In contrast, M3 (the model outlined by the red dotted line) consists of a sequence sub-model. M4 (the model outlined by the yellow dotted line) is similar to M3, but its sequence sub-model does not contain a recurrent layer.
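The one-hot encoders that feed each sub-model in Figure S2 can be illustrated with a minimal sketch. This is not the authors' code: the function name `one_hot`, the alphabet order `ACGT`, and the all-zero row for characters outside the alphabet (such as `N`) are assumptions made for illustration.

```python
# Assumed alphabet order; the supplement does not specify one.
ALPHABET = "ACGT"

def one_hot(seq, alphabet=ALPHABET):
    """Return one row per character, each a one-hot vector over `alphabet`.

    Characters outside the alphabet (for example N) map to an all-zero row,
    which is an assumption of this sketch. Lower-case bases are upper-cased.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in seq.upper():
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows

# A 200-nt input becomes a 200 x 4 matrix.
encoded = one_hot("ACGT" * 50)
```

The same function could encode a secondary structure string by passing a different `alphabet`; the structure alphabet used by DeepPASTA is not stated in this section.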
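[Figure S3 diagram: the input is passed to the deep learning model in Figure S2 to produce an output; a loss function compares the output with the ground truth; the parameters are updated based on the loss value.]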
Figure S3: The training phase of the polyA site prediction model of DeepPASTA. In each iteration of the training phase, the model predicts a likelihood value for the given input. This prediction is compared with the ground truth using a loss function. The loss value is then used to tune the parameters of the deep learning model.
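The predict, compare, update cycle of Figure S3 can be sketched on a toy model. The sketch below replaces the deep network with a one-feature logistic model and assumes binary cross-entropy with plain stochastic gradient descent; this section does not name the loss function or the optimizer, so both are illustrative choices, as are the names `train` and `bce`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y, eps=1e-12):
    """Binary cross-entropy between a predicted likelihood p and a label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def train(examples, lr=0.5, epochs=200):
    """Fit a one-feature logistic model; return (w, b, mean loss per epoch)."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in examples:
            p = sigmoid(w * x + b)   # the model predicts a likelihood value
            total += bce(p, y)       # the prediction is compared with the ground truth
            grad = p - y             # d(loss)/dz for sigmoid plus cross-entropy
            w -= lr * grad * x       # the loss gradient tunes the parameters
            b -= lr * grad
        history.append(total / len(examples))
    return w, b, history

# Toy data: negative feature values are labelled 0, positive ones 1.
toy = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, history = train(toy)
```

On this toy data the mean loss falls across epochs, which is the behaviour the training phase relies on.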
[Figure S4 diagram. Inputs: genomic sequence and RNA secondary structures 1-3, each passed through a one-hot encoder. The sequence sub-model and the three secondary structure sub-models (similar to those of Figure S2) feed fully connected and concatenation layers, followed by multiple fully connected layers and an output layer with nine neurons (one neuron per tissue).]
Figure S4: Architecture of the tissue-specific polyA site prediction model of DeepPASTA. Similar to the polyA site prediction model of DeepPASTA, this model has a sequence sub-model and three secondary structure sub-models. Each of these sub-models consists of a convolution layer, a max pooling layer, a recurrent layer, a flattening layer, and a fully connected layer. This model is a multi-label classification model that has nine neurons in the output layer for predicting polyA sites in the nine tissues studied in [Derti et al., 2012].
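A multi-label output layer scores each tissue independently, so the nine likelihoods need not sum to one. The sketch below illustrates this under stated assumptions: a sigmoid on each output neuron, a 0.5 decision threshold, mean binary cross-entropy as the loss, and an arbitrary neuron order. The tissue labels are taken from the titles of Figures S10-S18; the function names are hypothetical.

```python
import math

# Tissue labels as they appear in the titles of Figures S10-S18.
# The order of the output neurons is an assumption.
TISSUES = ["brain", "kidney", "liver", "MAQC_Brain1", "MAQC_brain2",
           "MAQC_UHR1", "MAQC_UHR2", "muscle", "testis"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_tissues(logits, threshold=0.5):
    """Turn nine raw output values into per-tissue likelihoods and calls."""
    if len(logits) != len(TISSUES):
        raise ValueError("expected one value per tissue")
    probs = {t: sigmoid(z) for t, z in zip(TISSUES, logits)}
    calls = {t: p >= threshold for t, p in probs.items()}
    return probs, calls

def multilabel_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over the nine tissues (assumed loss)."""
    total = 0.0
    for t in TISSUES:
        p = min(max(probs[t], eps), 1.0 - eps)
        y = labels[t]
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(TISSUES)

logits = [3.0, 3.0, -3.0, 3.0, -3.0, -3.0, -3.0, 3.0, -3.0]
probs, calls = predict_tissues(logits)
```

Because each neuron is thresholded separately, the same input can be called a polyA site in several tissues at once, which a single softmax output could not express.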
[Figure S5 diagram. Two sub-units, one for genomic sequence 1 with RNA secondary structure 1 and one for genomic sequence 2 with RNA secondary structure 2. In each sub-unit the inputs pass through one-hot encoders, a sequence sub-model and a secondary structure sub-model (similar to those of Figure S2), a concatenation layer, multiple fully connected layers, and a one neuron layer. The two sub-units feed an output layer with two neurons.]
Figure S5: Architecture of the model of DeepPASTA for predicting relative dominance in a particular tissue. The model takes as input two sequences of 200 nts, each containing a polyA site of some gene at the middle, and the corresponding secondary structures generated by RNAshapes. Each sequence and its secondary structure are processed by a sub-unit, which consists of a sequence sub-model and a secondary structure sub-model. The output layer compares the outputs from the two sub-units to predict the relatively dominant polyA site.
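The comparison step can be sketched as two scores turned into two probabilities. The following is a simplification under loud assumptions: each sub-unit is reduced to a weighted sum, both sub-units share the same weights, and the two-neuron output layer is modelled as a plain softmax over the two scores. The supplement states none of these details, and the names `sub_unit_score` and `relative_dominance` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sub_unit_score(features, weights):
    """Stand-in for one sub-unit: collapses its features to a single value."""
    return sum(f * w for f, w in zip(features, weights))

def relative_dominance(features_1, features_2, weights):
    """Return (P(site 1 more dominant), P(site 2 more dominant))."""
    s1 = sub_unit_score(features_1, weights)
    s2 = sub_unit_score(features_2, weights)
    p1, p2 = softmax([s1, s2])
    return p1, p2

p1, p2 = relative_dominance([1.0, 0.5], [0.2, 0.1], weights=[1.0, 2.0])
```

With shared weights the sketch is symmetric: swapping the two inputs swaps the two probabilities.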
[Figure S6 diagram. Inputs: genomic sequence and secondary structure, each passed through a one-hot encoder into a sequence sub-model and a secondary structure sub-model (similar to those of Figure S2), followed by a concatenation layer, multiple fully connected layers, and a single neuron output layer.]
Figure S6: Architecture of the model of DeepPASTA for predicting absolutely dominant polyA sites of each gene in a particular tissue. The model has a sequence sub-model and a secondary structure sub-model. The output layer predicts whether the input polyA site is an absolutely dominant polyA site or not.
[Figure S7 plots. (a) AUC = 0.9838, AUPRC = 0.9700. (b) AUC = 0.9580, AUPRC = 0.9620. (c) AUC = 0.9639, AUPRC = 0.9669.]
Figure S7: The impact of negative examples on the performance of DeepPASTA. In order to test the performance of DeepPASTA in predicting polyA sites on different negative examples, three datasets are considered: a dataset with shifted negative examples, where positive examples are shifted left and right by 50 bases; a dataset with random negative examples that do not contain the hexamer signal; and a dataset with random negative examples containing the hexamer signal. The positive examples of these datasets are the same. The numbers of examples in these three datasets are 286218, 190812 and 190744, respectively. Plots in a show the AUC and AUPRC performance of DeepPASTA on the dataset with the shifted negative examples. Plots in b show the AUC and AUPRC performance on the dataset with the random negative examples that do not contain the hexamer signal. Plots in c show the AUC and AUPRC performance on the dataset with the random negative examples containing the hexamer signal.
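Shifted negative examples can be sketched as windows re-centred 50 bases to either side of a known polyA site. The sketch assumes a 200-nt window with the site at 0-based offset 100, and skips windows that would run past the end of the available sequence; the function names and the toy genome are hypothetical. The reported sizes are consistent with two shifted negatives per positive (286218 = 3 x 95406 and 190812 = 2 x 95406), although the supplement does not state the number of positives explicitly.

```python
def cut_window(genome, center, window=200):
    """Return `window` bases with `center` at 0-based offset window // 2.

    Returns None when the window would fall outside `genome`.
    """
    start = center - window // 2
    if start < 0 or start + window > len(genome):
        return None
    return genome[start:start + window]

def shifted_examples(genome, site, shift=50, window=200):
    """Return the positive window and its left- and right-shifted negatives."""
    positive = cut_window(genome, site, window)
    candidates = (cut_window(genome, site - shift, window),
                  cut_window(genome, site + shift, window))
    negatives = [w for w in candidates if w is not None]
    return positive, negatives

# Toy genome of 400 bases with a hypothetical polyA site at index 200.
genome = "ACGT" * 100
positive, negatives = shifted_examples(genome, site=200)
```

Each negative still overlaps the positive window by 150 bases, so the model has to learn where the site sits rather than merely whether polyA-related signals are present.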
TTGTACCACTGCTAATAAATGACCAGTTTACCTGAAACCCTTTGTGATCAGTTCTTTAATGATACCTAAATGAAAGCTAATTAAAACAATAGGTTTCTCCCAAGGGTCTGGAGTAAATATATTTTGGGTGCAAATGAAATGGCAAAAATCTAGTATCTTAAATTGTATAAGGGGACATTATATAAAAACTGAAAATATAG
Chr 16; gene name: COX4I1; polyA site: 85840671
ACTTAGAGGCAATGGAAAAATAAAAGTTGACTGTACTAAAAATGTATACTTGTTGCCAGGAAGGTGACCTCAAAAATTAAAAGTATAATTATTCggccgggcatggtggctcacacctgtaattccagcactttgggaggccaaggcaggcggatcacgaggtcaggagttcaaaaccagcctgtccaatatagtgaaac
Chr 17; gene name: ADORA2B; polyA site: 15879114
[Diagram: RNAshapes secondary structure drawings of the two sequences above, with the polyA site marked.]
Shape and energy for the sequence beginning TTGTACCACT: _[_[]_]_[_[[_[[[]_]_]_]_]_]_ energy: -12.30
Shape and energy for the sequence beginning ACTTAGA: _[_[]_[[_[]_]_]_]_ energy: -14.50
Figure S8: The RNA secondary structures of genes COX4I1 and ADORA2B helped DeepPASTA in predicting polyA sites. The figure shows the secondary structures generated by RNAshapes for the 100-nt upstream sequences of some polyA sites of the genes. Both polyA sites have AATAAA as the polyadenylation signal (PAS), but the locations of the signal in each input are far away from the polyA sites (the PASs and the polyA sites are colored red in the sequences). It is well known that the PAS often occurs 10-30 nts upstream of a polyA site ([Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005]). Hence, one might conjecture that a PAS has to be near a polyA site in order for it to be functional. The folding of the RNA secondary structures reduces the distance between the PAS and polyA site in each gene. Similar phenomena are also described in [Bar-Shira et al., 1991].
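The linear distance between a PAS and the polyA site can be computed directly from the upstream sequence, which makes the 10-30 nt expectation easy to check. In this sketch the polyA site is assumed to lie immediately after the last base of the upstream sequence, and the distance is measured from the 3' end of the signal; both conventions, the function name `pas_distances`, and the toy sequences are assumptions.

```python
def pas_distances(upstream, signal="AATAAA"):
    """Return the linear distance (nts) from each `signal` hit to the polyA site.

    The polyA site is assumed to follow the last base of `upstream`, and the
    distance is counted from the 3' end of the signal.
    """
    upstream = upstream.upper()
    distances = []
    start = upstream.find(signal)
    while start != -1:
        distances.append(len(upstream) - (start + len(signal)))
        start = upstream.find(signal, start + 1)  # allow overlapping hits
    return distances

def in_typical_range(distance, low=10, high=30):
    """True when a PAS sits in the commonly reported 10-30 nt range."""
    return low <= distance <= high

near = pas_distances("GGAATAAA" + "C" * 20)   # signal close to the site
far = pas_distances("AATAAA" + "C" * 80)      # signal far from the site
```

This measure ignores folding: a signal that is far away in the linear sequence, as in the `far` example, may still be brought close to the polyA site by the secondary structure, which is the point the figure makes.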
[Figure S9 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S9: Hexamer signals extracted from the true positive polyA sites predicted by DeepPASTA on dataset 1. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction. Most of these signals are annotated in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005]. In addition, DeepPASTA used some novel hexamer signals: UAAAAU, GAAUAAA, UAAAUA, AAUUAA, and UUAAAA. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From the four barplots, it is seen that DeepPASTA used fewer signals from the fourth region (150-200 nts) in polyA site prediction. Similar to previous studies, DeepPASTA identified the U-rich signals as auxiliary upstream elements (AUEs) in the first region (1-49 nts), U/GU-rich signals as downstream elements in the third region (101-149 nts) and G-rich signals as auxiliary downstream elements (ADEs) in the fourth region (150-200 nts).
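The selection of the top three 6-mers per input and their tallying by region can be sketched as follows. The sketch assumes that a 6-mer's strength is the sum of per-position saliency values over its window, that overlapping windows may all be selected, and that a 6-mer is assigned to a region by its 1-based start position; none of these rules is stated in the caption, and the function names are hypothetical. The saliency values themselves are taken as given.

```python
from collections import Counter

def top_kmers(seq, saliency, k=6, top=3):
    """Return the `top` highest-scoring k-mers as (start, kmer) pairs.

    A k-mer's strength is the sum of saliency over its window (assumed rule).
    """
    scored = [(sum(saliency[i:i + k]), i) for i in range(len(seq) - k + 1)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # strongest first, ties by position
    return [(i, seq[i:i + k]) for _, i in scored[:top]]

def region_of(start):
    """Map a 0-based start to regions 1-4 (1-49, 50-100, 101-149, 150-200 nts)."""
    position = start + 1  # the figure uses 1-based coordinates
    for region, upper in enumerate((49, 100, 149), start=1):
        if position <= upper:
            return region
    return 4

def count_signals(examples):
    """Tally selected k-mers overall and per region over (seq, saliency) pairs."""
    overall = Counter()
    per_region = {region: Counter() for region in (1, 2, 3, 4)}
    for seq, saliency in examples:
        for start, kmer in top_kmers(seq, saliency):
            overall[kmer] += 1
            per_region[region_of(start)][kmer] += 1
    return overall, per_region

# Toy input: 200 nts with high saliency over one AAUAAA at 0-based 80-85.
seq = "C" * 80 + "AAUAAA" + "C" * 114
saliency = [0.0] * 200
for i in range(80, 86):
    saliency[i] = 1.0
overall, per_region = count_signals([(seq, saliency)])
```

Calling `overall.most_common(20)` on counts accumulated over many inputs would give the kind of ranking shown in the left barplot.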
[Figure S10 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the brain tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S10: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the brain tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the brain tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the brain tissue. The most frequently used signals in that region are AAUAAA, AAAAAA, AAAAAG, AAAUAA, and CAAUAA. These signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S11 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the kidney tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S11: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the kidney tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the kidney tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the kidney tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and CAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S12 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the liver tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S12: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the liver tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the liver tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the liver tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and AUUAAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S13 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_Brain1 tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S13: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC Brain1 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC Brain1 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC Brain1 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAA, AUUAAA, and AUAAAG. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S14 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_brain2 tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S14: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC Brain2 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC Brain2 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC Brain2 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAG, CAAUAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S15 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_UHR1 tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S15: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC UHR1 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC UHR1 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC UHR1 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAA, AUUAAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S16 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_UHR2 tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S16: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC UHR2 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC UHR2 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC UHR2 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUUAAA, AUAAAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S17 barplots. Left: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the muscle tissue. Right: one barplot for each region of the input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts). Bottom schematic: the input sequence (200 nts) divided into these four regions, with the polyA site marked.]
Figure S17: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the muscle tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the muscle tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the muscle tissue. The most frequently used signals in that region are AAUAAA, AUUAAA, AAAUAA, AUAAAA, and AUAAAG. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S18: barplots titled "Frequency of hexamer signals used by DeepPASTA in polyA site prediction for the testis tissue". The input sequence (200 nts), with the polyA site marked, is divided into four regions: 1-49 nts, 50-100 nts, 101-149 nts, and 150-200 nts.]
Figure S18: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the testis tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the testis tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the testis tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and AUUAAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].
[Figure S19: two barplots, Dataset 4 (left) and Dataset 5 (right). Legend: number of examples in training data, number of examples in validation data.]
Figure S19: Number of examples in the training and validation data used in the experiments on predicting tissue-specific relatively dominant polyA sites. As shown in the left plot, the number of training examples in dataset 4 ranges from 59.4% to 64.3% of the total number of examples (used in training, validation and testing) across all tissues, and the number of validation examples ranges from 15.8% to 19.7%. As shown in the right plot (dataset 5), the numbers of training and validation examples range from 60.5% to 61.7% and from 22.3% to 22.8%, respectively.
[Figure S20: barplot over the tissues Brain, Kidney, Liver, MAQC_Brain1, MAQC_Brain2, MAQC_UHR1, MAQC_UHR2, Muscle, and Testis. Legend: number of examples in training data, number of examples in validation data.]
Figure S20: Number of examples in the training and validation data used in the experiments on predicting tissue-specific absolutely dominant polyA sites of each gene. As the plot shows, the number of examples in the training data ranges from 54.1% to 55.9% and the number of examples in the validation data ranges from 22.6% to 23.4%.
2 Supplementary tables
Table S1: Numbers of positive and negative examples in datasets 1 and 2 used in the tissue-specific polyA site prediction experiment.

| Tissue | Data name | Number of positive examples | Number of negative examples |
|---|---|---|---|
| Brain | Dataset 1 | 18226 | 458735 |
| Brain | Dataset 2 | 18226 | 172585 |
| Kidney | Dataset 1 | 18557 | 458404 |
| Kidney | Dataset 2 | 18557 | 172254 |
| Liver | Dataset 1 | 16231 | 460730 |
| Liver | Dataset 2 | 16231 | 174580 |
| MAQC Brain1 | Dataset 1 | 18286 | 458675 |
| MAQC Brain1 | Dataset 2 | 18286 | 172525 |
| MAQC Brain2 | Dataset 1 | 17194 | 459767 |
| MAQC Brain2 | Dataset 2 | 17194 | 173617 |
| MAQC UHR1 | Dataset 1 | 20807 | 456154 |
| MAQC UHR1 | Dataset 2 | 20807 | 170004 |
| MAQC UHR2 | Dataset 1 | 22166 | 454795 |
| MAQC UHR2 | Dataset 2 | 22166 | 168645 |
| Muscle | Dataset 1 | 20706 | 456255 |
| Muscle | Dataset 2 | 20706 | 170105 |
| Testis | Dataset 1 | 21270 | 455691 |
| Testis | Dataset 2 | 21270 | 169541 |
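
The class imbalance in Table S1 can be made explicit with a short calculation. The sketch below copies the counts from the table and derives, for each tissue and dataset, the total number of examples and the number of negatives per positive. The names `TABLE_S1` and `summarize` are illustrative only.

```python
# (positive, negative) example counts copied from Table S1.
TABLE_S1 = {
    "Brain":       {"Dataset 1": (18226, 458735), "Dataset 2": (18226, 172585)},
    "Kidney":      {"Dataset 1": (18557, 458404), "Dataset 2": (18557, 172254)},
    "Liver":       {"Dataset 1": (16231, 460730), "Dataset 2": (16231, 174580)},
    "MAQC Brain1": {"Dataset 1": (18286, 458675), "Dataset 2": (18286, 172525)},
    "MAQC Brain2": {"Dataset 1": (17194, 459767), "Dataset 2": (17194, 173617)},
    "MAQC UHR1":   {"Dataset 1": (20807, 456154), "Dataset 2": (20807, 170004)},
    "MAQC UHR2":   {"Dataset 1": (22166, 454795), "Dataset 2": (22166, 168645)},
    "Muscle":      {"Dataset 1": (20706, 456255), "Dataset 2": (20706, 170105)},
    "Testis":      {"Dataset 1": (21270, 455691), "Dataset 2": (21270, 169541)},
}


def summarize(table):
    """Return (tissue, dataset, total, negatives_per_positive) rows."""
    rows = []
    for tissue, datasets in table.items():
        for name, (positives, negatives) in datasets.items():
            rows.append((tissue, name, positives + negatives,
                         negatives / positives))
    return rows


for tissue, name, total, ratio in summarize(TABLE_S1):
    print(f"{tissue:12s} {name}: total={total}, "
          f"negatives per positive={ratio:.1f}")
```

Running this shows that the total is the same for every tissue within a dataset (476961 examples in dataset 1 and 190811 in dataset 2), so only the positive/negative labelling differs between tissues.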
Table S2: Performance comparison between DeepPASTA and Conv-Net [Leung et al., 2018] in relatively dominant polyA site prediction on dataset 5 using AUC and AUPRC.

| Tissue | # of test examples | DeepPASTA AUC | DeepPASTA AUPRC | Conv-Net AUC | Conv-Net AUPRC |
|---|---|---|---|---|---|
| Brain | 10567 | 0.908 | 0.905 | 0.895* | 0.865 |
| Breast | 10905 | 0.900 | 0.899 | 0.886* | 0.868 |
| ES cell | 8351 | 0.910 | 0.912 | 0.911* | 0.878 |
| Ovary | 10146 | 0.903 | 0.900 | 0.895* | 0.864 |
| SK muscle | 8816 | 0.906 | 0.905 | 0.893* | 0.861 |
| Testis | 10456 | 0.893 | 0.890 | 0.856* | 0.839 |
| BCells1 | 8674 | 0.905 | 0.906 | 0.896* | 0.878 |
| BCells2 | 8118 | 0.901 | 0.896 | 0.893* | 0.847 |

* The AUC performance of Conv-Net is taken from [Leung et al., 2018].
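
For readers unfamiliar with the two metrics in Table S2, the following sketch computes them from binary labels and predicted scores using only the standard library. It is an illustration under stated assumptions, not the evaluation code used in the paper: AUC is computed through the rank-sum (Mann-Whitney) identity, and AUPRC is approximated by average precision, which is one of several common ways to estimate the area under the precision-recall curve. Both functions assume that at least one positive and one negative example are present.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank for the run
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auprc(labels, scores):
    """Average precision: mean precision at the rank of each positive.

    Simplification: tied scores are ordered arbitrarily rather than
    handled jointly.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

For example, with labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, three of the four positive-negative pairs are ordered correctly, so `auc` returns 0.75, and `auprc` returns (1/1 + 2/3) / 2, which is about 0.833.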
Table S3: Training time of the four DeepPASTA models in our experiments.

| Model | Data used | Tissue | Running time (hrs) |
|---|---|---|---|
| PolyA site prediction | | | 12.667 |
| Tissue-specific polyA site prediction | | | 9.767 |
| Tissue-specific relative dominance | Dataset 4 | Brain | 1.20 |
| | Dataset 4 | Kidney | 1.15 |
| | Dataset 4 | Liver | 1.93 |
| | Dataset 4 | MAQC Brain1 | 2.17 |
| | Dataset 4 | MAQC Brain2 | 1.75 |
| | Dataset 4 | MAQC UHR1 | 1.87 |
| | Dataset 4 | MAQC UHR2 | 2.62 |
| | Dataset 4 | Muscle | 1.30 |
| | Dataset 4 | Testis | 1.65 |
| | Dataset 5 | Brain | 0.20 |
| | Dataset 5 | Breast | 0.18 |
| | Dataset 5 | ES cell | 0.17 |
| | Dataset 5 | Ovary | 0.17 |
| | Dataset 5 | SK muscle | 0.22 |
| | Dataset 5 | Testis | 0.18 |
| | Dataset 5 | BCells1 | 0.15 |
| | Dataset 5 | BCells2 | 0.17 |
| Tissue-specific absolute dominance | Dataset 6 | Brain | 0.55 |
| | Dataset 6 | Kidney | 0.93 |
| | Dataset 6 | Liver | 0.87 |
| | Dataset 6 | MAQC Brain1 | 0.68 |
| | Dataset 6 | MAQC Brain2 | 0.52 |
| | Dataset 6 | MAQC UHR1 | 0.90 |
| | Dataset 6 | MAQC UHR2 | 0.72 |
| | Dataset 6 | Muscle | 0.90 |
| | Dataset 6 | Testis | 0.97 |
3 Commands for running the tools compared in the Experimental Results
PolyAR
java polyar -i <INPUT FILE> -o <OUTPUT FILE>
Dragon PolyA Spotter
./polyASignalTesting Data/ <INPUT FILE> polyA spotter output
DeepPolyA
python deepPolyA.py
DeeReCT-PolyA
python test.py <INPUT FILE> models --out <OUTPUT FILE>
Conv-Net
python convnet.py -i <INPUT FILE> -o <OUTPUT FILE>
References
[Bar-Shira et al., 1991] Bar-Shira, A. et al. (1991) An RNA secondary structure juxtaposes two remote genetic signals for human T-cell leukemia virus type I RNA 3′-end processing. Journal of Virology, 65 (10), 5165-5173.
[Derti et al., 2012] Derti, A. et al. (2012) A quantitative atlas of polyadenylation in five mammals. Genome Research, 22 (6), 1173-1183.
[Hu et al., 2005] Hu, J. et al. (2005) Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA, 11, 1485-1493.
[Leung et al., 2018] Leung, M. K. K. et al. (2018) Inference of the human polyadenylation code. Bioinformatics, 1-10.
[Steffen et al., 2006] Steffen, P. et al. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22 (4), 500-503.
[Tian et al., 2005] Tian, B. et al. (2005) A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Research, 33 (1), 201-212.