
DeepPASTA: Deep neural network based polyadenylation site analysis (Supplementary Materials)

Ashraful Arefeen, Xinshu Xiao, and Tao Jiang

1 Supplementary figures

[Figure S1 panels, input and output of each model:
(a) PolyA site prediction model of DeepPASTA. Input: genomic sequence and RNA secondary structures 1-3. Output: likelihood value for the middle position of the input sequence to be a polyA site.
(b) Tissue-specific polyA site prediction model of DeepPASTA. Input: genomic sequence. Output: likelihood values for the middle position of the input sequence to be a polyA site for tissue 1 through tissue 9.
(c) Tissue-specific relatively dominant polyA site prediction model of DeepPASTA. Input: genomic sequences 1 and 2. Output: probabilities for the 1st and the 2nd polyA site to be more dominant.
(d) Tissue-specific absolutely dominant polyA site prediction model of DeepPASTA. Input: genomic sequence and RNA secondary structure. Output: likelihood value for the input polyA site to be the absolutely dominant polyA site of the gene.]

Figure S1: The input and output of the four models of DeepPASTA. a) The polyA site prediction model of DeepPASTA takes a genomic sequence of 200 nts and three energy-efficient RNA secondary structures predicted by RNAshapes [Steffen et al., 2006] from the sequence as the input and predicts whether the input sequence contains a polyA site at the middle or not. b) Similar to the previous model, the tissue-specific polyA site prediction model of DeepPASTA takes a sequence and three corresponding secondary structures generated by RNAshapes as the input and predicts whether the input sequence contains a polyA site at the middle or not for the nine tissues studied in [Derti et al., 2012]. c) The tissue-specific relatively dominant polyA site prediction model of DeepPASTA takes a pair of sequences and corresponding secondary structures containing polyA sites of some gene at the middle as the input and predicts which polyA site is relatively dominant. d) Unlike the relatively dominant model, the absolutely dominant model of DeepPASTA takes a sequence and corresponding secondary structure containing a polyA site of some gene at the middle as the input and predicts whether the polyA site is an absolutely dominant polyA site of the gene.
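As a rough illustration of how the inputs described in Figure S1 could be presented to a network, the following sketch one-hot encodes a nucleotide sequence and a secondary structure string. The two alphabets, their ordering, the dot-bracket structure notation, and the function name `one_hot` are assumptions made for illustration; this section does not state the encoding DeepPASTA actually uses.

```python
# Illustrative one-hot encoder; the alphabets below are assumptions, not taken from the paper.
SEQ_ALPHABET = "ACGT"      # assumed nucleotide order
STRUCT_ALPHABET = "()."    # assumed dot-bracket symbols for a secondary structure


def one_hot(symbols, alphabet):
    """Return one row per symbol with a single 1 at the symbol's alphabet index.

    Symbols outside the alphabet (for example N) become an all-zero row.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in symbols.upper():  # lowercase (soft-masked) bases are treated like uppercase
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


encoded_seq = one_hot("ACgtN", SEQ_ALPHABET)
encoded_struct = one_hot("((.))", STRUCT_ALPHABET)
```

A 200-nt input would yield 200 rows, one per position, so the middle position of the window stays aligned between the sequence and structure encodings.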

[Figure S2 diagram. Inputs: genomic sequence and RNA secondary structures 1-3, each passed through a one-hot encoder. Blocks labelled in the diagram: sequence sub-model, three secondary structure sub-models, convolution layers (filters), max pooling layer, average pooling layer, LSTM units, addition layer, flattening layers, fully connected layer, concatenation layer, multiple fully connected layers, single neuron output layer. Dotted outlines mark M3 and M4.]

Figure S2: Architectures of the polyA site prediction model of DeepPASTA, M3 and M4. The polyA site prediction model has four sub-models: a sequence sub-model and three secondary structure sub-models. Each sub-model consists of a convolution layer, a maxpooling layer, a recurrent layer (i.e., a bi-directional LSTM), a flattening layer, and a fully connected layer. On the other hand, M3 (model represented by the red dotted line) consists of a sequence sub-model. M4 (model represented by the yellow dotted line) is similar to M3, but its sequence sub-model does not contain a recurrent layer.
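To make the layer order in Figure S2 concrete, the sketch below traces the output shape after each layer of one sub-model. Every size used here (32 filters, kernel width 8, pool size 4, 16 LSTM units per direction, 64 dense units, unpadded convolution) is a hypothetical value chosen for illustration; the supplement does not list the hyperparameters.

```python
def conv1d_len(length, kernel, stride=1):
    """Output length of an unpadded ('valid') 1-D convolution."""
    return (length - kernel) // stride + 1


def pool1d_len(length, pool):
    """Output length of non-overlapping 1-D pooling (stride equal to the pool size)."""
    return (length - pool) // pool + 1


def sub_model_shapes(seq_len=200, channels=4, filters=32, kernel=8,
                     pool=4, lstm_units=16, dense_units=64):
    """List (layer name, output shape) pairs for one sub-model. All sizes are hypothetical."""
    n = conv1d_len(seq_len, kernel)
    m = pool1d_len(n, pool)
    return [
        ("one-hot input", (seq_len, channels)),
        ("convolution", (n, filters)),
        ("max pooling", (m, filters)),
        ("bi-directional LSTM", (m, 2 * lstm_units)),  # forward and backward states concatenated
        ("flattening", (m * 2 * lstm_units,)),
        ("fully connected", (dense_units,)),
    ]


shapes = sub_model_shapes()
# Four sub-models (one sequence, three secondary structure) feed the concatenation layer.
concatenated_width = 4 * shapes[-1][1][0]
```

Under these assumed sizes, dropping the recurrent layer (as in M4) would change only the width of the flattened vector, from `m * 2 * lstm_units` to `m * filters`.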

[Figure S3 diagram: the input is passed to the deep learning model in Figure S2; the output and the ground truth enter the loss function; the parameters are updated based on the loss value.]

Figure S3: The training phase of the polyA site prediction model of DeepPASTA. In each iteration of the training phase, the model predicts a likelihood value for the given input. This prediction is compared with the ground truth using a loss function. The loss value is then used to tune the parameters of the deep learning model.
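The predict, compare, update cycle of Figure S3 can be sketched with a one-parameter logistic model standing in for the deep network. The binary cross-entropy loss, the plain gradient-descent update, the learning rate, and the toy data are all assumptions for illustration; the supplement does not name the loss function or the optimizer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy between a predicted likelihood p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def train(examples, epochs=200, lr=0.5):
    """Toy stand-in for the training phase: predict, compute loss, update parameters."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in examples:
            p = sigmoid(w * x + b)   # model predicts a likelihood value
            total += bce(p, y)       # prediction is compared with the ground truth
            grad = p - y             # gradient of the loss w.r.t. the pre-activation
            w -= lr * grad * x       # parameters are tuned using the loss gradient
            b -= lr * grad
        history.append(total / len(examples))
    return w, b, history


data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # hypothetical (feature, label) pairs
w, b, history = train(data)
```

The mean loss recorded in `history` falls over the epochs, which is the behaviour the training loop in the figure is meant to produce.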

[Figure S4 diagram. Inputs: genomic sequence and RNA secondary structures 1-3, each passed through a one-hot encoder. Blocks: sequence sub-model (similar to the sequence sub-model of Figure S2), three secondary structure sub-models (similar to the secondary structure sub-model of Figure S2), concatenation layers, output layer with nine neurons (one neuron per tissue).]

Figure S4: Architecture of the tissue-specific polyA site prediction model of DeepPASTA. Similar to the polyA site prediction model of DeepPASTA, this model has a sequence sub-model and three secondary structure sub-models. Each of these sub-models consists of a convolution layer, a maxpooling layer, a recurrent layer, a flattening layer, and a fully connected layer. This model is a multi-label classification model that has nine neurons in the output layer for predicting polyA sites in the nine tissues studied in [Derti et al., 2012].
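In a multi-label output layer, each of the nine neurons is read independently, so several tissues can be called positive for the same input. The sketch below assumes a sigmoid activation per neuron and a 0.5 decision threshold; both are illustrative assumptions, as are the function names and the example logits. The tissue names follow the labels used elsewhere in this supplement.

```python
import math

# Tissue names as they appear in the figures and tables of this supplement.
TISSUES = ["Brain", "Kidney", "Liver", "MAQC_Brain1", "MAQC_Brain2",
           "MAQC_UHR1", "MAQC_UHR2", "Muscle", "Testis"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_tissues(logits):
    """Apply an independent sigmoid to each of the nine outputs (assumed activation)."""
    if len(logits) != len(TISSUES):
        raise ValueError("expected one logit per tissue")
    return {tissue: sigmoid(z) for tissue, z in zip(TISSUES, logits)}


def called_tissues(probabilities, threshold=0.5):
    """Tissues whose likelihood reaches the (assumed) decision threshold."""
    return [tissue for tissue, p in probabilities.items() if p >= threshold]


probs = predict_tissues([2.0, -2.0, 0.5, -1.0, -1.0, 3.0, -3.0, -0.5, 1.0])
active = called_tissues(probs)
```

Unlike a softmax, the nine likelihoods are not constrained to sum to one, which is what distinguishes multi-label from multi-class classification.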

[Figure S5 diagram. Two sub-units, one per input polyA site. In each sub-unit, a genomic sequence and an RNA secondary structure pass through one-hot encoders into a sequence sub-model and a secondary structure sub-model (similar to the corresponding sub-models of Figure S2), followed by a concatenation layer and a one neuron layer. The two sub-units feed an output layer with two neurons.]

Figure S5: Architecture of the model of DeepPASTA for predicting relative dominance in a particular tissue. The model takes two sequences of 200 nts and corresponding secondary structures generated by RNAshapes containing polyA sites of some gene at the middle as the input. Each sequence and its secondary structure are processed by a sub-unit, which consists of a sequence sub-model and a secondary structure sub-model. The output layer compares the outputs from the two sub-units to predict the relatively dominant polyA site.
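One simple way to picture the comparison made by the two-neuron output layer is a softmax over the two sub-unit scores. This is a sketch under the assumption that the output layer behaves like a softmax applied directly to the two scores; the actual layer has learned weights that are not described in this section, and the function names are hypothetical.

```python
import math


def softmax2(s1, s2):
    """Numerically stable softmax over two scores; returns (p1, p2) with p1 + p2 == 1."""
    m = max(s1, s2)
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    total = e1 + e2
    return e1 / total, e2 / total


def more_dominant(score_site1, score_site2):
    """Return the index (1 or 2) of the site predicted to be relatively dominant."""
    p1, p2 = softmax2(score_site1, score_site2)
    return (1 if p1 >= p2 else 2), (p1, p2)


winner, (p_first, p_second) = more_dominant(2.0, 0.5)
```

Because the two probabilities are complementary, the model only ever ranks the pair against each other; it says nothing about how either site compares with the remaining polyA sites of the gene.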

[Figure S6 diagram. Inputs: genomic sequence and secondary structure, each passed through a one-hot encoder. Blocks: sequence sub-model (similar to the sequence sub-model of Figure S2), secondary structure sub-model (similar to the secondary structure sub-model of Figure S2), concatenation layer, single neuron output layer.]

Figure S6: Architecture of the model of DeepPASTA for predicting absolutely dominant polyA sites of each gene in a particular tissue. The model has a sequence sub-model and a secondary structure sub-model. The output layer predicts whether the input polyA site is an absolutely dominant polyA site or not.

[Figure S7 plots: (a) AUPRC = 0.9700, AUC = 0.9838; (b) AUPRC = 0.9620, AUC = 0.9580; (c) AUPRC = 0.9669, AUC = 0.9639.]

Figure S7: The impact of negative examples on the performance of DeepPASTA. To test the performance of DeepPASTA in predicting polyA sites on different negative examples, three datasets are considered: a dataset with shifted negative examples, where positive examples are shifted left and right by 50 bases; a dataset with random negative examples that do not contain the hexamer signal; and a dataset with random negative examples containing the hexamer signal. The positive examples of these datasets are the same. The numbers of examples in these three datasets are 286218, 190812 and 190744, respectively. Plots in a show the AUC and AUPRC performance of DeepPASTA on the dataset with the shifted negative examples. Plots in b show the AUC and AUPRC performance on the dataset with the random negative examples that do not contain the hexamer signal. Plots in c show the AUC and AUPRC performance on the dataset with the random negative examples containing the hexamer signal.
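The shifted negative examples described above can be pictured as 200-nt windows centred 50 bases to either side of a known polyA site. The sketch below is illustrative only: the indexing convention for the "middle" position, the choice of AATAAA as the hexamer to test for, and the function names are assumptions, and the toy genome string is not real data.

```python
def window(genome, center, length=200):
    """Return the window of `length` nts centred on `center`, or None if it runs off the ends."""
    start = center - length // 2
    end = start + length
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def shifted_negatives(genome, site, shift=50, length=200):
    """Windows centred `shift` bases left and right of a positive site."""
    candidates = (window(genome, site - shift, length), window(genome, site + shift, length))
    return [w for w in candidates if w is not None]


def contains_hexamer(seq, hexamer="AATAAA"):
    """True if the (assumed) hexamer signal occurs anywhere in the sequence."""
    return hexamer in seq.upper()


toy_genome = "ACGT" * 100  # hypothetical 400-nt sequence
positive = window(toy_genome, 200)
negatives = shifted_negatives(toy_genome, 200)
```

Each positive site yields at most two shifted negatives, and `contains_hexamer` is the kind of filter that would separate the two random-negative datasets.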


TTGTACCACTGCTAATAAATGACCAGTTTACCTGAAACCCTTTGTGATCAGTTCTTTAATGATACCTAAATGAAAGCTAATTAAAACAATAGGTTTCTCCCAAGGGTCTGGAGTAAATATATTTTGGGTGCAAATGAAATGGCAAAAATCTAGTATCTTAAATTGTATAAGGGGACATTATATAAAAACTGAAAATATAG

Chr 16; Gene name: COX4I1; PolyA site: 85840671

ACTTAGAGGCAATGGAAAAATAAAAGTTGACTGTACTAAAAATGTATACTTGTTGCCAGGAAGGTGACCTCAAAAATTAAAAGTATAATTATTCggccgggcatggtggctcacacctgtaattccagcactttgggaggccaaggcaggcggatcacgaggtcaggagttcaaaaccagcctgtccaatatagtgaaac

Chr 17; Gene name: ADORA2B; PolyA site: 15879114

[Figure S8 drawings: RNAshapes secondary structures of the two upstream sequences, with the polyA site marked. Structure drawn for the sequence beginning TTGTACCACT: shape _[_[]_]_[_[[_[[[]_]_]_]_]_]_, energy: -12.30. Structure drawn for the sequence beginning ACTTAGA: shape _[_[]_[[_[]_]_]_]_, energy: -14.50.]

Figure S8: The RNA secondary structures of genes COX4I1 and ADORA2B helped DeepPASTA in predicting polyA sites. The figure shows the secondary structures generated by RNAshapes for the 100-nt upstream sequences of some polyA sites of the genes. Both polyA sites have AATAAA as the polyadenylation signal (PAS), but the locations of the signal in each input are far away from the polyA sites (the PASs and the polyA sites are colored red in the sequences). It is well known that the PAS often occurs 10-30 nts upstream of a polyA site ([Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005]). Hence, one might conjecture that a PAS has to be near a polyA site in order for it to be functional. The folding of the RNA secondary structures reduces the distance between the PAS and polyA site in each gene. Similar phenomena are also described in [Bar-Shira et al., 1991].
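The 10-30 nt spacing mentioned in the caption can be checked with a plain string search. In this sketch the distance is measured from the 3' end of the hexamer to the polyA site index, which is an assumed convention, and the toy sequence is hypothetical rather than taken from either gene.

```python
def pas_distances(seq, site_index, pas="AATAAA"):
    """Distances (nt) from the 3' end of each upstream PAS occurrence to the polyA site."""
    seq = seq.upper()
    distances = []
    start = seq.find(pas)
    while start != -1 and start + len(pas) <= site_index:
        distances.append(site_index - (start + len(pas)))
        start = seq.find(pas, start + 1)
    return distances


def has_canonical_spacing(seq, site_index, lo=10, hi=30):
    """True if some PAS lies within the 10-30 nt window upstream of the site."""
    return any(lo <= d <= hi for d in pas_distances(seq, site_index))


# Hypothetical 100-nt upstream sequence with a single AATAAA near its 5' end.
toy_upstream = "C" * 10 + "AATAAA" + "C" * 84
near = pas_distances(toy_upstream, 36)   # site placed 20 nt after the PAS
far = pas_distances(toy_upstream, 100)   # site placed at the end of the sequence
```

A sequence for which `has_canonical_spacing` is false but the site is still used is the situation the figure illustrates, where folding rather than linear distance brings the PAS close to the polyA site.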

[Figure S9 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts), with the polyA site between the second and third regions.]

Figure S9: Hexamer signals extracted from the true positive polyA sites predicted by DeepPASTA on dataset 1. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction. Most of these signals are annotated in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005]. In addition, DeepPASTA used some novel hexamer signals: UAAAAU, GAAUAAA, UAAAUA, AAUUAA, and UUAAAA. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From the four barplots, it is seen that DeepPASTA used fewer signals from the fourth region (150-200 nts) in polyA site prediction. Similar to previous studies, DeepPASTA identified the U-rich signals as auxiliary upstream elements (AUEs) in the first region (1-49 nts), U/GU-rich signals as downstream elements in the third region (101-149 nts) and G-rich signals as auxiliary downstream elements (ADEs) in the fourth region (150-200 nts).
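The counting procedure in the caption (take the three strongest 6-mers per input, then tally them overall and by region) can be sketched as follows. Several details are assumptions made for illustration: the "strength" of a 6-mer is taken to be the sum of its per-position saliency values, overlapping 6-mers are allowed among the top three, and a 6-mer is assigned to the region containing its first nucleotide. The saliency values here are made up.

```python
from collections import Counter


def top_kmers(seq, saliency, k=6, top=3):
    """Return the `top` (start, k-mer) pairs with the largest summed saliency."""
    scored = []
    for i in range(len(seq) - k + 1):
        scored.append((sum(saliency[i:i + k]), i, seq[i:i + k]))
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first, leftmost breaks ties
    return [(i, kmer) for _, i, kmer in scored[:top]]


def region_of(start, bounds=(49, 100, 149, 200)):
    """Map a 0-based start index to region 1-4 using the 1-based region ends in the figure."""
    pos = start + 1
    for region, upper in enumerate(bounds, start=1):
        if pos <= upper:
            return region
    return len(bounds)


def count_signals(examples):
    """Tally top k-mers overall and per region over (sequence, saliency) pairs."""
    overall = Counter()
    by_region = {r: Counter() for r in range(1, 5)}
    for seq, saliency in examples:
        for start, kmer in top_kmers(seq, saliency):
            overall[kmer] += 1
            by_region[region_of(start)][kmer] += 1
    return overall, by_region


# Hypothetical input: saliency is concentrated on an AAUAAA at positions 61-66 (1-based).
toy_seq = "C" * 60 + "AAUAAA" + "C" * 134
toy_saliency = [0.0] * 60 + [1.0] * 6 + [0.0] * 134
overall, by_region = count_signals([(toy_seq, toy_saliency)])
```

The 20 most frequent entries of `overall` and of each `by_region` counter would correspond to the bars in the left and right panels, respectively.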

[Figure S10 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the brain tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S10: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the brain tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the brain tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the brain tissue. The most frequently used signals in that region are AAUAAA, AAAAAA, AAAAAG, AAAUAA, and CAAUAA. These signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S11 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the kidney tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S11: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the kidney tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the kidney tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the kidney tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and CAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S12 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the liver tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S12: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the liver tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the liver tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the liver tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and AUUAAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S13 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_Brain1 tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S13: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC Brain1 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC Brain1 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC Brain1 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAA, AUUAAA, and AUAAAG. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S14 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_Brain2 tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S14: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC Brain2 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC Brain2 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC Brain2 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAG, CAAUAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S15 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_UHR1 tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S15: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC UHR1 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC UHR1 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC UHR1 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUAAAA, AUUAAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S16 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the MAQC_UHR2 tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S16: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the MAQC UHR2 tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the MAQC UHR2 tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the MAQC UHR2 tissue. The most frequently used signals in that region are AAUAAA, AAAUAA, AUUAAA, AUAAAA, and UAAUAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S17 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the muscle tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S17: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the muscle tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the muscle tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the muscle tissue. The most frequently used signals in that region are AAUAAA, AUUAAA, AAAUAA, AUAAAA, and AUAAAG. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S18 barplots. Axis: frequency of hexamer signals used by DeepPASTA in polyA site prediction for the testis tissue. Panels: overall (left) and the four regions of the 200-nt input sequence (1-49 nts, 50-100 nts, 101-149 nts, 150-200 nts).]

Figure S18: Hexamer signals extracted from the true positive polyA sites predicted by the tissue-specific model of DeepPASTA on dataset 1 for the testis tissue. In order to identify the most frequently used signals, we consider the top three high strength 6-mers in each input sequence based on saliency maps. The barplot on the left shows the overall 20 most frequently used hexamer signals in polyA site prediction for the testis tissue. The four barplots on the right show the most frequently used hexamer signals in four equally divided regions (as illustrated at the bottom of the figure) of the input sequence. From these four barplots, it is seen that DeepPASTA used the most hexamer signals from the second region (50-100 nts) in polyA site prediction for the testis tissue. The most frequently used signals in that region are AAUAAA, AUAAAA, AAAUAA, AUAAAG, and AUUAAA. Again, these signals are known as the polyadenylation signals (PASs) in the literature [Hu et al., 2005], [Derti et al., 2012] and [Tian et al., 2005].

[Figure S19 barplots: number of examples in training data and number of examples in validation data, for Dataset 4 (left; axis 0 to 200000) and Dataset 5 (right; axis 0 to 45000).]

Figure S19: Number of examples in the training and validation data used in the experiments on predicting tissue-specific relatively dominant polyA sites. As shown in the left plot, the number of training examples in dataset 4 ranges from 59.4% to 64.3% of the total number of examples (used in training, validation and testing) across all tissues, and the number of validation examples ranges from 15.8% to 19.7%. As shown in the right plot, the numbers of training and validation examples in dataset 5 range from 60.5% to 61.7% and from 22.3% to 22.8%, respectively.

[Figure S20 barplot: number of examples in training data and number of examples in validation data (axis 0 to 60000) for Brain, Kidney, Liver, MAQC_Brain1, MAQC_Brain2, MAQC_UHR1, MAQC_UHR2, Muscle, and Testis.]

Figure S20: Number of examples in the training and validation data used in the experiments on predicting tissue-specific absolutely dominant polyA sites of each gene. As the plot shows, the number of examples in the training data ranges from 54.1% to 55.9% and the number of examples in the validation data ranges from 22.6% to 23.4%.


2 Supplementary tables

Table S1: Numbers of positive and negative examples in datasets 1 and 2 used in the tissue-specific polyA site prediction experiment.

Tissue         Data name    Number of positive examples    Number of negative examples
Brain          Dataset 1    18226                          458735
Brain          Dataset 2    18226                          172585
Kidney         Dataset 1    18557                          458404
Kidney         Dataset 2    18557                          172254
Liver          Dataset 1    16231                          460730
Liver          Dataset 2    16231                          174580
MAQC Brain1    Dataset 1    18286                          458675
MAQC Brain1    Dataset 2    18286                          172525
MAQC Brain2    Dataset 1    17194                          459767
MAQC Brain2    Dataset 2    17194                          173617
MAQC UHR1      Dataset 1    20807                          456154
MAQC UHR1      Dataset 2    20807                          170004
MAQC UHR2      Dataset 1    22166                          454795
MAQC UHR2      Dataset 2    22166                          168645
Muscle         Dataset 1    20706                          456255
Muscle         Dataset 2    20706                          170105
Testis         Dataset 1    21270                          455691
Testis         Dataset 2    21270                          169541
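A short arithmetic check on Table S1: within each dataset, the positive and negative counts add up to the same total for every tissue (476961 for dataset 1 and 190811 for dataset 2). The sketch below recomputes those totals and the fraction of positive examples per tissue from the table values; the dictionary layout and function names are illustrative choices.

```python
# (positive, negative) counts copied from Table S1.
TABLE_S1 = {
    "Brain":       {"Dataset 1": (18226, 458735), "Dataset 2": (18226, 172585)},
    "Kidney":      {"Dataset 1": (18557, 458404), "Dataset 2": (18557, 172254)},
    "Liver":       {"Dataset 1": (16231, 460730), "Dataset 2": (16231, 174580)},
    "MAQC Brain1": {"Dataset 1": (18286, 458675), "Dataset 2": (18286, 172525)},
    "MAQC Brain2": {"Dataset 1": (17194, 459767), "Dataset 2": (17194, 173617)},
    "MAQC UHR1":   {"Dataset 1": (20807, 456154), "Dataset 2": (20807, 170004)},
    "MAQC UHR2":   {"Dataset 1": (22166, 454795), "Dataset 2": (22166, 168645)},
    "Muscle":      {"Dataset 1": (20706, 456255), "Dataset 2": (20706, 170105)},
    "Testis":      {"Dataset 1": (21270, 455691), "Dataset 2": (21270, 169541)},
}


def totals(table, dataset):
    """Total number of examples (positive + negative) per tissue for one dataset."""
    return {tissue: sum(rows[dataset]) for tissue, rows in table.items()}


def positive_fraction(table, dataset):
    """Share of positive examples per tissue for one dataset."""
    return {tissue: rows[dataset][0] / sum(rows[dataset]) for tissue, rows in table.items()}


totals_d1 = totals(TABLE_S1, "Dataset 1")
totals_d2 = totals(TABLE_S1, "Dataset 2")
fractions_d1 = positive_fraction(TABLE_S1, "Dataset 1")
```

In dataset 1 the positives make up under 5% of the examples for every tissue, so the classes are strongly imbalanced; this is one reason AUPRC is informative alongside AUC.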

Table S2: Performance comparison between DeepPASTA and Conv-Net [Leung et al., 2018] in relatively dominant polyA site prediction on dataset 5 using AUC and AUPRC.

Tissue       # of test examples    DeepPASTA AUC    DeepPASTA AUPRC    Conv-Net AUC    Conv-Net AUPRC
Brain        10567                 0.908            0.905              0.895*          0.865
Breast       10905                 0.900            0.899              0.886*          0.868
ES cell      8351                  0.910            0.912              0.911*          0.878
Ovary        10146                 0.903            0.900              0.895*          0.864
SK muscle    8816                  0.906            0.905              0.893*          0.861
Testis       10456                 0.893            0.890              0.856*          0.839
BCells1      8674                  0.905            0.906              0.896*          0.878
BCells2      8118                  0.901            0.896              0.893*          0.847

* The AUC performance of Conv-Net is taken from [Leung et al., 2018].
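For readers unfamiliar with the two metrics in Table S2, the sketch below computes them from a list of scores and 0/1 labels. AUC is computed as the fraction of positive-negative pairs ranked correctly, and AUPRC is approximated by average precision, which is one common estimator; the supplement does not say which estimator was used, so this is an assumption, and the example scores are made up.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


toy_scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical model outputs
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_ap = average_precision(toy_scores, toy_labels)
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a toy check but would be replaced by a rank-based computation on test sets of the sizes listed in the table.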


Table S3: Training time of the four DeepPASTA models in our experiments.

Model                                    Data used    Tissue         Running time (hrs)
PolyA site prediction                    -            -              12.667
Tissue-specific polyA site prediction    -            -              9.767
Tissue-specific relative dominance       Dataset 4    Brain          1.20
                                         Dataset 4    Kidney         1.15
                                         Dataset 4    Liver          1.93
                                         Dataset 4    MAQC Brain1    2.17
                                         Dataset 4    MAQC Brain2    1.75
                                         Dataset 4    MAQC UHR1      1.87
                                         Dataset 4    MAQC UHR2      2.62
                                         Dataset 4    Muscle         1.30
                                         Dataset 4    Testis         1.65
                                         Dataset 5    Brain          0.20
                                         Dataset 5    Breast         0.18
                                         Dataset 5    ES cell        0.17
                                         Dataset 5    Ovary          0.17
                                         Dataset 5    SK muscle      0.22
                                         Dataset 5    Testis         0.18
                                         Dataset 5    BCells1        0.15
                                         Dataset 5    BCells2        0.17
Tissue-specific absolute dominance       Dataset 6    Brain          0.55
                                         Dataset 6    Kidney         0.93
                                         Dataset 6    Liver          0.87
                                         Dataset 6    MAQC Brain1    0.68
                                         Dataset 6    MAQC Brain2    0.52
                                         Dataset 6    MAQC UHR1      0.90
                                         Dataset 6    MAQC UHR2      0.72
                                         Dataset 6    Muscle         0.90
                                         Dataset 6    Testis         0.97


3 Commands for running the tools compared in the Experimental Results

PolyAR

java polyar -i <INPUT FILE> -o <OUTPUT FILE>

Dragon PolyA Spotter

./polyASignalTesting Data/ <INPUT FILE> polyA spotter output

DeepPolyA

python deepPolyA.py

DeeReCT-PolyA

python test.py <INPUT FILE> models --out <OUTPUT FILE>

Conv-Net

python convnet.py -i <INPUT FILE> -o <OUTPUT FILE>

References

[Bar-Shira et al., 1991] Bar-Shira, A. et al. (1991) An RNA secondary structure juxtaposes two remote genetic signals for human T-cell leukemia virus type I RNA 3′-end processing. Journal of Virology, 65 (10), 5165-5173.

[Derti et al., 2012] Derti, A. et al. (2012) A quantitative atlas of polyadenylation in five mammals. Genome Research, 22 (6), 1173-1183.

[Hu et al., 2005] Hu, J. et al. (2005) Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA, 11, 1485-1493.

[Leung et al., 2018] Leung, M. K. K. et al. (2018) Inference of the human polyadenylation code. Bioinformatics, 1-10.

[Steffen et al., 2006] Steffen, P. et al. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22 (4), 500-503.

[Tian et al., 2005] Tian, B. et al. (2005) A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Research, 33 (1), 201-212.
