Supplemental Data. Lee et al. (2008). Arabidopsis nuclear ......2008/06/06 · Supplemental Figure...
Supplemental Data. Lee et al. (2008). Arabidopsis nuclear-encoded plastid transit peptides contain multiple sequence subgroups with distinctive chloroplast-targeting sequence motifs.
Supplemental Figure S1. A dendrogram constructed by hierarchical clustering of 208 chloroplast transit peptides in Arabidopsis thaliana. The first 80 residues of each sequence were defined as the N-terminal transit peptide. Multiple sequence alignment was performed using CLUSTALX 1.83 with default parameter settings (pairwise alignment gap opening: 9; pairwise alignment gap extension: 0.10; protein weight matrix: BLOSUM series) (Thompson et al., 1997). The distance matrix for the sequence alignment was estimated with the Protdist program of PHYLIP 3.66 with default parameter settings (Felsenstein, 1989). The dendrogram for hierarchical clustering was constructed by the UPGMA method, as implemented in the Neighbor program of PHYLIP 3.66, and was drawn with PhyloDraw 0.82 (Choi et al., 2000). The seven representative transit peptides of the subgroups are indicated by arrows.
Supplemental Figure S2. Fractionation analysis of selected alanine substitution mutants on a Percoll gradient. (A), (B) Protoplasts transformed with GFP-fused T2A and T4A of DnaJ-J8 (A) or GFP-fused T2A, T4A, and T6A of PORA (B) were gently lysed and fractionated on a Percoll gradient. Intact chloroplasts were obtained and analyzed by Western blotting using an anti-GFP antibody. The total protein extracts were included as control. Pre, precursor; Pro, processed mature form.
Supplemental Figure S3. In vitro translation of alanine substitution mutants. (A), (B) To confirm that the upper bands of GFP-fused T2A, T3A and T4A of DnaJ-J8 (A) or GFP-fused T2A, T4A and T6A of PORA (B) are precursor forms, these mutant constructs together with their wild-type counterparts were translated in vitro and their migration in SDS gels was compared. The alanine substitution mutants migrated slightly faster than their corresponding wild-types, indicating that the upper band observed in protoplasts corresponds to precursor forms.
Supplemental Figure S4. The performance of the motif discovery algorithm was degraded by removal of sequences downstream of predicted cleavage sites. (A), (B) The predicted sequence motifs (blue) of RbcS, BCCP, Cab, DnaJ-J8, and PORA are shown with the sequence motifs characterized experimentally in this study (red). The predicted motifs were obtained with 208 experimentally confirmed plastid transit peptides, from which sequences downstream of the predicted cleavage sites had been removed.
Supplemental Figure S5. GFP reporter constructs with the N-terminal transit peptide region upstream of the cleavage sites were not efficiently imported into chloroplasts. (A) The amino acid sequences of RbcS, BCCP, Cab, DnaJ-J8, and PORA. The regions used to construct Cab-cs:GFP, BCCP-cs:GFP, DnaJ-J8-cs:GFP, and PORA-cs:GFP are underlined. cs, cleavage site predicted by ChloroP, except for the Cab transit peptide. (B) In vivo targeting of GFP fusion proteins. Protoplasts were transformed with the constructs indicated and localization was examined 8 h after transformation. To simplify labeling, GFP is omitted from the construct names. Bar, 20 μm. (C) Western blot analysis of reporter construct import. Protein extracts were prepared 8 h after transformation and analyzed by immunoblotting using an anti-GFP antibody. The targeting efficiencies of Cab-cs:GFP, BCCP-cs:GFP, DnaJ-J8-cs:GFP, and PORA-cs:GFP were compared to those of Cab-nt:GFP, BCCP-nt:GFP, DnaJ-J8-nt:GFP, and PORA-nt:GFP, respectively. Pre, precursor form; Pro, processed mature form. (D) Confirmation of import into chloroplasts. Protoplasts were transformed with the indicated constructs together with an RFP construct. Chloroplasts were purified from lysed protoplasts on a Percoll gradient and chloroplast extracts (CH) were analyzed by Western blotting using an anti-GFP antibody. The total extracts (T) were included. As a control for cytosolic proteins, co-transformed RFP was detected with an anti-RFP antibody. RbcL was used as a loading control.
Supplemental Figure S6. Plastid transit peptide motif prediction scheme. The model has two main steps for motif discovery: forward motif selection and backward motif elimination. In forward motif selection, the motif segment that minimizes the misclassification error bound is added stepwise to the motif set. The backward motif elimination step then deletes motif segments stepwise from the motif set chosen by forward motif selection. At each iteration, the motif segment whose removal produces the smallest value of the error bound is eliminated, and elimination stops when this smallest value is larger than the previous smallest value. The output is the set of predicted sequence motifs of the query sequence.
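The two-stage search described in this legend can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the names `forward_selection`, `backward_elimination`, `toy_error_bound`, and `TRUE_MOTIFS` are hypothetical, motif segments are represented by integers, the forward stage is assumed to stop at a user-supplied `max_size` (the stopping rule of the forward stage is not stated here), and the toy error bound stands in for the misclassification error bound of the model.

```python
from typing import Callable, FrozenSet, List

ErrorBound = Callable[[FrozenSet[int]], float]


def forward_selection(candidates: List[int], error_bound: ErrorBound,
                      max_size: int) -> List[int]:
    """Add, one at a time, the segment that minimizes the error bound."""
    selected: List[int] = []
    remaining = list(candidates)
    # Assumption: stop after max_size segments have been added.
    while remaining and len(selected) < max_size:
        best = min(remaining,
                   key=lambda s: error_bound(frozenset(selected + [s])))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(selected: List[int],
                         error_bound: ErrorBound) -> List[int]:
    """Remove, one at a time, the segment whose removal gives the smallest bound."""
    current = list(selected)
    previous = error_bound(frozenset(current))
    while len(current) > 1:
        scored = [(error_bound(frozenset(current) - {s}), s) for s in current]
        value, drop = min(scored, key=lambda t: t[0])
        # Stop when the smallest value is larger than the previous smallest value.
        if value > previous:
            break
        current.remove(drop)
        previous = value
    return current


# Hypothetical toy bound: missing a "true" segment costs 1, an extra segment 0.1.
TRUE_MOTIFS = frozenset({2, 5})


def toy_error_bound(segments: FrozenSet[int]) -> float:
    return len(TRUE_MOTIFS - segments) + 0.1 * len(segments - TRUE_MOTIFS)


chosen = forward_selection([1, 2, 3, 4, 5, 6], toy_error_bound, max_size=4)
motifs = backward_elimination(chosen, toy_error_bound)
```

With the toy bound, the forward stage deliberately over-selects and the backward stage prunes the surplus segments, which is the behavior the two-step design is meant to provide.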
Supplemental Figure S7. Overview of plastid transit peptide prediction. Prediction of plastid transit peptides has three main steps: (1) clustering plastid transit peptides, (2) extracting features from sequences, and (3) learning an SVM classifier. The first step partitions all plastid transit peptides of the training data into several groups sharing common sequence motifs. The second step converts each sequence of the training data into the corresponding feature vector based on the representative sequences from the first step. The last step learns an SVM classifier whose input vectors are those obtained in the second step. The learned SVM classifier takes an input sequence and predicts whether it is a plastid transit peptide.
Supplemental Figure S8. Gaussian distribution assessment of alignment score significance. Histograms of the alignment scores of 10,000 randomly permuted RbcS transit peptide sequences with BCCP transit peptides (A) or BCCP sequence motifs (B). The red curve shows the fit to a Gaussian distribution.
Supplemental Table S1. The nucleotide sequences of primers used to construct various alanine substitution mutants.
CaMV 35S-T TTTCAgAAAgAATgCTAACC
nosT-B GAACGATCGGGGAAATTC
Cab-5 CAACTCAATATggCTACCACC
Cab-3 AGGCATCCAGTGAGCAGCCA
CabT1A-T
TAAAGAAGAACAATGGCGGCGGCCGCGGCTGCGGCCGCTGCCGCAGCCGCCGTGTACCCTTCG
CabT1A-B
CGAAGGGTACACGGCGGCTGCGGCAGCGGCCGCAGCCGCGGCCGCCGCCATTGTTCTTCTTTA
CabT2A-T AGCTGTGGCATAGCCGCCGCGGCCGCTGCGGCTGCCGCTGCTTCCAAGTCTAAATTCGTATC
CabT2A-B GATACGAATTTAGACTTGGAAGCAGCGGCAGCCGCAGCGGCCGCGGCGGCTATGCCACAGCT
CabT3A-T CCTTCGCTTCTCTCTTCTGCCGCGGCTGCAGCCGCAGCCGCCGCAGCTCCACTCCCAAACGCCGGG
CabT3A-B CCCGGCGTTTGGGAGTGGAGCTGCGGCGGCTGCGGCTGCAGCCGCGGCAGAAGAGAGAAGCGAAGG
CabT4A-T TTCGTATCCGCCGGAGTTGCAGCCGCAGCCGCCGCGGCTGCTGCTGCTATCAGAATGGCTGCTCAC
CabT4A-B
GTGAGCAGCCATTCTGATAGCAGCAGCAGCCGCGGCGGCTGCGGCTGCAACTCCGGCGGATACGAA
CabT5A-T
GCCGGGAATGTTGGTCGTGCCGCAGCGGCTGCTGCCGCGGCGGCTGCCGAGCCACGACCAGCTTAC
CabT5A-B
GTAAGCTGGTCGTGGCTCGGCAGCCGCCGCGGCAGCAGCCGCTGCGGCACGACCAACATTCCCGGC
CabT6A-T
GCTCACTGGATGCCTGGCGCGGCAGCAGCAGCTGCCGCTGCCGCTGCTGCTCCTGGTGACTTTGGG
CabT6A-B GCTCACTGGATGCCTGGCGCGGCAGCAGCAGCTGCCGCTGCCGCTGCTGCTCCTGGTGACTTTGGG
Cab6-T
GCCGCAGCGGCTGCAGCCGCGGCTGCAGCTTCTGCTCCTGGTGACTTTGG
Cab6-B
AGCTGCAGCCGCGGCTGCAGCCGCTGCGGCAGGCATCCAGTGAGCAGCCA
Cab1-1
ATGGCGGCCGCCGCGGCTGCGAGCTGTGGCATAGCCGCCG
Cab3-1T
TCTCTTCTTCCGCGGCTGCAGCGGTATCCGCCGGAGTTCCACTC
Cab3-1B CGGATACCGCTGCAGCCGCGGAAGAAGAGAGAAGCGAAGG
Cab3-2T AATTCGCAGCCGCCGCAGCTCCACTCCCAAACGCCGGGAAT
Cab3-2B GGGAGTGGAGCTGCGGCGGCTGCGAATTTAGACTTGGAAGAAGA
Cab4-1T GGAGTTGCAGCCGCAGCCGCCGGGAATGTTGGTCGTATCAGA
Cab4-1B ACATTCCCGGCGGCTGCGGCTGCAACTCCGGCGGATACGAATTT
Cab4-2T ACGCCGCGGCAGCCGCAGCTATCAGAATGGCTGCTCACTGG
Cab4-2B CATTCTGATAGCTGCGGCTGCCGCGGCGTTTGGGAGTGGAACTCC
BCCP-T (Xba1) TCTAGAAAAGAGTAAAGAAGAACAATGGCGTCTTCGTCGTTC
BCCP-B (Xho1) CTCGAGCCCATCAACTTTGGCAGC
BC-T2A-T CGTTCTCAGTCACATCTCCAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCAAACCTCCTCGCACTTCCC
BC-T2A-B GGGAAGTGCGAGGAGGTTTGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTGGAGATGTGACTGAGAACG
BC-T3A-T CTTCCGTCTATGCAGTCACTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCGCTCTCGCAGAGTTTCTTT
BC-T3A-B AAAGAAACTCTGCGAGAGCGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGTGACTGCATAGACGGAAG
BC-T4A-T CGCACTTCCCAATCCAAAACGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTGCTAAGCCCAAGCTTCGCTT
BC-T4A-B AAGCGAAGCTTGGGCTTAGCAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGTTTTGGATTGGGAAGTGCG
BC-T5A-T GAGTTTCTTTCCGTCTCTCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCCTAGTCGCAGTAGCTACCC
BC-T5A-B GGGTAGCTACTGCGACTAGGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGAGAGACGGAAAGAAACTC
BC-T6A-T AGCTTCGCTTTCTCTCCAAGGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTGCACAATCTAACAAGGTTAG
BC-T6A-B CTAACCTTGTTAGATTGTGCAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCCTTGGAGAGAAAGCGAAGCT
BC-T7A-T GTAGCTACCCTGTGGTGAAAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCATCAAATGCTGCCAAAGT
BC-T7A-B ACTTTGGCAGCATTTGATGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTTTCACCACAGGGTAGCTAC
BC-T3A+QTSSH-T
TCACTCAAACCTCCTCGCACGCTGCCGCAGCCGCGCGCTCTCGCAGAGTTTCTTT
BC-T3A+QTSSH-B
AAAGAAACTCTGCGAGAGCGCGCGGCTGCGGCAGCGTGCGAGGAGGTTTGAGTGA
BC-T3A+FPIQN-T
CTTCCGTCTATGCAGTCACTGCTGCCGCAGCCGCGTTCCCAATCCAAAACCGCTC
BC-T3A+FPIQN-B
GAGCGGTTTTGGATTGGGAACGCGGCTGCGGCAGCAGTGACTGCATAGACGGAAG
BC-T5A+AKPKL-T
TCTCTGCTAAGCCCAAGCTTGCTGCCGCAGCCGCGCCTAGTCGCAGTAGCTACCC
BC-T5A+AKPKL-B
GGGTAGCTACTGCGACTAGGCGCGGCTGCGGCAGCAAGCTTGGGCTTAGCAGAGA
BC-T5A+RFLSK-T
GAGTTTCTTTCCGTCTCTCTGCTGCCGCAGCCGCGCGCTTTCTCTCCAAGCCTAG
BC-T5A+RFLSK-B
CTAGGCTTGGAGAGAAAGCGCGCGGCTGCGGCAGCAGAGAGACGGAAAGAAACTC
BC-T4A+RSRRV-T
AAAACCGCTCTCGCAGAGTTGCTGCCGCAGCCGCGGCTAAGCCCAAGCTTCGCTT
BC-T4A+RSRRV-B
AAGCGAAGCTTGGGCTTAGCCGCGGCTGCGGCAGCAACTCTGCGAGAGCGGTTTT
BC-T4A+SFRLS-T
CGCACTTCCCAATCCAAAACGCTGCCGCAGCCGCGTCTTTCCGTCTCTCTGCTAA
BC-T4A+SFRLS-B
TTAGCAGAGAGACGGAAAGACGCGGCTGCGGCAGCGTTTTGGATTGGGAAGTGCG
BC-T6A+PSRSS-T
CCAAGCCTAGTCGCAGTAGCGCTGCCGCAGCCGCGGCACAATCTAACAAGGTTAG
BC-T6A+PSRSS-B
CTAACCTTGTTAGATTGTGCCGCGGCTGCGGCAGCGCTACTGCGACTAGGCTTGG
BC-T6A+YPVVK-T
AGCTTCGCTTTCTCTCCAAGGCTGCCGCAGCCGCGTACCCTGTGGTGAAAGCACA
BC-T6A+YPVVK-B
TGTGCTTTCACCACAGGGTACGCGGCTGCGGCAGCCTTGGAGAGAAAGCGAAGCT
DnaJ-J8-T (XbaI)
TCTAGAAAAGAGTAAAGAAGAACAATGACAATTGCTTTAACG
DnaJ-J8-T (XhoI)
CTCGAGTTTAGCGAGTTGTCTGAA
J8T2A-T CTTTAACGATCGGAGGAAACGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCTTCATCATCTTCGTCGTT
J8T2A-B AACGACGAAGATGATGAAGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGTTTCCTCCGATCGTTAAAG
J8T3A-T GTCTACCAGGATCGTCGTTTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTAACAGCAGAAGAAAGAACAC
J8T3A-B GTGTTCTTTCTTCTGCTGTTAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCAAACGACGATCCTGGTAGAC
J8T4A-T CTTCGTCGTTTCGATTAAAAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTAACAGATCAAAAGTCGTTTG
J8T4A-B CAAACGACTTTTGATCTGTTAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTTTTAATCGAAACGACGAAG
J8T5A-T GAAAGAACACGAAGATGCTCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCTTCTGTAATGGATCCGTA
J8T5A-B TACGGATCCATTACAGAAGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGAGCATCTTCGTGTTCTTTC
J8T6A-T AAGTCGTTTGTTCTTCTTCAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTAAGATCCGACCCGATTCATC
J8T6A-B GATGAATCGGGTCGGATCTTAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTGAAGAAGAACAAACGACTT
J8T2A+GFSGL-T
GAAACGGGTTTTCGGGTCTAGCAGCCGCGGCTGCATCTTCATCATCTTCGTCGTT
J8T2A+GFSGL-B
AACGACGAAGATGATGAAGATGCAGCCGCGGCTGCTAGACCCGAAAACCCGTTTC
J8T2A+PGSSF-T
CTTTAACGATCGGAGGAAACGCAGCCGCGGCTGCACCAGGATCGTCGTTTTCTTC
J8T2A+PGSSF-B
GAAGAAAACGACGATCCTGGTGCAGCCGCGGCTGCGTTTCCTCCGATCGTTAAAG
J8T3A+SSSSS-T CGTTTTCTTCATCATCTTCGGCAGCCGCGGCTGCAAACAGCAGAAGAAAGAACAC
J8T3A+SSSSS-B GTGTTCTTTCTTCTGCTGTTTGCAGCCGCGGCTGCCGAAGATGATGAAGAAAACG
J8T3A+SFRLK-T
GTCTACCAGGATCGTCGTTTGCAGCCGCGGCTGCATCGTTTCGATTAAAAAACAG
J8T3A+SFRLK-B
CTGTTTTTTAATCGAAACGATGCAGCCGCGGCTGCAAACGACGATCCTGGTAGAC
J8T4A+NSRRK-T
TAAAAAACAGCAGAAGAAAGGCAGCCGCGGCTGCAAACAGATCAAAAGTCGTTTG
J8T4A+NSRRK-B
CAAACGACTTTTGATCTGTTTGCAGCCGCGGCTGCCTTTCTTCTGCTGTTTTTTA
J8T4A+NTKML-T
CTTCGTCGTTTCGATTAAAAGCAGCCGCGGCTGCAAACACGAAGATGCTCAACAG
J8T4A+NTKML-B
CTGTTGAGCATCTTCGTGTTTGCAGCCGCGGCTGCTTTTAATCGAAACGACGAAG
J8T5A+NRSKV-T
TGCTCAACAGATCAAAAGTCGCAGCCGCGGCTGCATCTTCTGTAATGGATCCGTA
J8T5A+NRSKV-B
TACGGATCCATTACAGAAGATGCAGCCGCGGCTGCGACTTTTGATCTGTTGAGCA
J8T5A+VCSSS-T
TGCTCAACAGATCAAAAGTCGCAGCCGCGGCTGCATCTTCTGTAATGGATCCGTA
J8T5A+VCSSS-B
TACGGATCCATTACAGAAGATGCAGCCGCGGCTGCGACTTTTGATCTGTTGAGCA
PORA-T (XbaI) TCTAGAAAAGAGTAAAGAAGAACAATGGCCCTTCAAGCTGCTTC
PORA-B (XhoI) CTCGAGTGTGACTGATGGAGTTGAAG
PORA T2A-T AAGCTGCTTCTTTGGTCTCCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTTAAATGCTTCAGCATCATC
PORA T2A-B GATGATGCTGAAGCATTTAAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGGAGACCAAAGAAGCAGCTT
PORA T3A-T CTGTCCGCAAAGATGGAAAAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTGAGTCTAGTCTGTTCGGTGT
PORA T3A-B ACACCGAACAGACTAGACTCAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTTTTCCATCTTTGCGGACAG
PORA T4A-T CAGCATCATCATCATTCAAAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTGAGCAAAGCAAAGCTGACTT
PORA T4A-B AAGTCAGCTTTGCTTTGCTCAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTTTGAATGATGATGATGCTG
PORA T5A-T TGTTCGGTGTTTCACTTTCGGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCATTGAGATGCAAGAGGGA
PORA T5A-B TCCCTCTTGCATCTCAATGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCCGAAAGTGAAACACCGAACA
PORA T6A-T AAGCTGACTTTGTCTCTTCCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTAGGAATAATAAAGCGATTAT
PORA T6A-B ATAATCGCTTTATTATTCCTAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGGAAGAGACAAAGTCAGCTT
PT2A+SAFSV-T TCTCCTCTGCTTTCTCTGTCGCAGCCGCGGCTGCATTAAATGCTTCAGCATCATC
PT2A+SAFSV-B GATGATGCTGAAGCATTTAATGCAGCCGCGGCTGCGACAGAGAAAGCAGAGGAGA
PT2A+RKDGK-T
AAGCTGCTTCTTTGGTCTCCGCAGCCGCGGCTGCACGCAAAGATGGAAAATTAAA
PT2A+RKDGK-B
TTTAATTTTCCATCTTTGCGTGCAGCCGCGGCTGCGGAGACCAAAGAAGCAGCTT
PT3A+LNASA-T
GAAAATTAAATGCTTCAGCAGCAGCCGCGGCTGCAGAGTCTAGTCTGTTCGGTGT
PT3A+LNASA-B
ACACCGAACAGACTAGACTCTGCAGCCGCGGCTGCTGCTGAAGCATTTAATTTTC
PT3A+SSSFK-T CTGTCCGCAAAGATGGAAAAGCAGCCGCGGCTGCATCATCATCATTCAAAGAGTC
PT3A+SSSFK-B GACTCTTTGAATGATGATGATGCAGCCGCGGCTGCTTTTCCATCTTTGCGGACAG
PT4A+ESSLF-T TCAAAGAGTCTAGTCTGTTCGCAGCCGCGGCTGCAGAGCAAAGCAAAGCTGACTT
PT4A+ESSLF-B AAGTCAGCTTTGCTTTGCTCTGCAGCCGCGGCTGCGAACAGACTAGACTCTTTGA
PT4A+GVSLS-T
CAGCATCATCATCATTCAAAGCAGCCGCGGCTGCAGGTGTTTCACTTTCGGAGCA
PT4A+GVSLS-B
TGCTCCGAAAGTGAAACACCTGCAGCCGCGGCTGCTTTGAATGATGATGATGCTG
PT5A+EQSKA-T
TTTCGGAGCAAAGCAAAGCTGCAGCCGCGGCTGCATCATTGAGATGCAAGAGGGA
PT5A+EQSKA-B
TCCCTCTTGCATCTCAATGATGCAGCCGCGGCTGCAGCTTTGCTTTGCTCCGAAA
PT5A+DFVSS-T TGTTCGGTGTTTCACTTTCGGCAGCCGCGGCTGCAGACTTTGTCTCTTCCTCATT
PT5A+DFVSS-B
AATGAGGAAGAGACAAAGTCTGCAGCCGCGGCTGCCGAAAGTGAAACACCGAACA
PT6A+SLRCK-T
CTTCCTCATTGAGATGCAAGGCAGCCGCGGCTGCAAGGAATAATAAAGCGATTAT
PT6A+SLRCK-B
ATAATCGCTTTATTATTCCTTGCAGCCGCGGCTGCCTTGCATCTCAATGAGGAAG
PT6A+REQSL-T
AAGCTGACTTTGTCTCTTCCGCAGCCGCGGCTGCAAGGGAACAGAGCTTGAGGAA
PT6A+REQSL-B
TTCCTCAAGCTCTGTTCCCTTGCAGCCGCGGCTGCGGAAGAGACAAAGTCAGCTT
GLU2(Xba1)-T CTAG TCTAGAATGGCTCTACAGTCTCCCGG
GLU2(Xho1)-B CCG CTCGAG GGCTCGGTCAGAATTAAGGA
GLU2 T1-hy-5 CTAG TCTAGA ATGGCT GCT CAGTCT GCTGCT GCTACC GCT GCTTCATCTTCCGTTTCC
GLU2 T2A-T TCTCCCGGAGCTACCGGAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCT TCCGCGAAATTAAGCTCT
GLU2 T2A-B AGAGCTTAATTTCGCGGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTCCGGTAGCTCCGGGAGA
GLU2 T3A-T GTTTCCCGGCTTCTCTCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTTCTCTGTTGACTTCGTC
GLU2 T3A-B GACGAAGTCAACAGAGAAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGAGAGAAGCCGGGAAAC
GLU2 T4A-T AGCTCTACTAAGACTATCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTATTTCTAAAGGAACCAAA
GLU2 T4A-B TTTGGTTCCTTTAGAAATAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGATAGTCTTAGTAGAGCT
GLU2 T5A-T TTCGTCAGATCCTACTGTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCTCTCCGGCTTTCGCGGC
GLU2 T5A-B GCCGCGAAAGCCGGAGAGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCACAGTAGGATCTGACGAA
GLU2 T6A-T ACCAAACGGCGTAACGAAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCTCAAGTCCTCGCTGAGG
GLU2 T6A-B CCTCAGCGAGGACTTGAGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTTCGTTACGCCGTTTGGT
GLU2 T3A+SAKLS-T
TCTTCCGCGAAATTAAGCGCAGCCGCGGCTGCATTCTCTGTTGACTTCGTC
GLU2 T3A+SAKLS-B
GACGAAGTCAACAGAGAATGCAGCCGCGGCTGCGCTTAATTTCGCGGAAGA
GLU2 T3A+STKTI-T
GTTTCCCGGCTTCTCTCTGCAGCCGCGGCTGCATCTACTAAGACTATCTTC
GLU2 T3A+STKTI-B
GAAGATAGTCTTAGTAGATGCAGCCGCGGCTGCAGAGAGAAGCCGGGAAAC
GLU2 T4A+FSVDF-T
ATCTTCTCTGTTGACTTCGCAGCCGCGGCTGCAATTTCTAAAGGAACCAAA
GLU2 T4A+FSVDF-B
TTTGGTTCCTTTAGAAATTGCAGCCGCGGCTGCGAAGTCAACAGAGAAGAT
GLU2 T4A+VRSYC-T
AGCTCTACTAAGACTATCGCAGCCGCGGCTGCAGTCAGATCCTACTGTATT
GLU2 T4A+VRSYC-B
AATACAGTAGGATCTGACTGCAGCCGCGGCTGCGATAGTCTTAGTAGAGCT
GLU2 T5A+ISKGT-T
TGTATTTCTAAAGGAACCGCAGCCGCGGCTGCACTCTCCGGCTTTCGCGGC
GLU2 T5A+ISKGT-B
GCCGCGAAAGCCGGAGAGTGCAGCCGCGGCTGCGGTTCCTTTAGAAATACA
GLU2 T5A+KRRNE-T
TTCGTCAGATCCTACTGTGCAGCCGCGGCTGCAAAACGGCGTAACGAACTC
GLU2 T5A+KRRNE-B
GAGTTCGTTACGCCGTTTTGCAGCCGCGGCTGCACAGTAGGATCTGACGAA
TOCC (xba1)-T CTAG TCTAGAATGGAGATACGGAGCTTG
TOCC (xho1)-B CCG CTCGAGACTGTGAGGAGTCCGGAG
TOCC T1-hy-5 CTAGTCTAGAATGGAGGCGCGGAGCGCGGCAGCATCTGCGAACCCTAATTTATCTTCC
TOCC T2A-T AGCTTGATTGTTTCTATGGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCGCCCTGTATCTCCTCTC
TOCC T2A-B GAGAGGAGATACAGGGCGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCCATAGAAACAATCAAGCT
TOCC T3A-T TCTTCCTTTGAGCTCTCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTGTTCCGTTCCGATCGACT
TOCC T3A-B AGTCGATCGGAACGGAACAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCAGAGAGCTCAAAGGAAGA
TOCC T4A-T CCTCTCACTCGCTCACTAGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTCGCTCCATTTCTAGGGTT
TOCC T4A-B AACCCTAGAAATGGAGCGAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCTAGTGAGCGAGTGAGAGG
TOCC T5A-T TCGACTAAACTAGTTCCCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCCACCCCGAATAGTGAA
TOCC T5A-B TTCACTATTCGGGGTGGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGGGAACTAGTTTAGTCGA
TOCC T6A-T AGGGTTTCGGCGTCGATCGCAGCCGCGGCTGCAGCCGCGGCTGCAGCTTCCGTTAAACCTGTTTAC
TOCC T6A-B GTAAACAGGTTTAACGGAAGCTGCAGCCGCGGCTGCAGCCGCGGCTGCGATCGACGCCGAAACCCT
TOCC T3A+RPVSP-T
TCTCGCCCTGTATCTCCTGCAGCCGCGGCTGCAGTTCCGTTCCGATCGACT
TOCC T3A+RPVSP-B
AGTCGATCGGAACGGAACTGCAGCCGCGGCTGCAGGAGATACAGGGCGAGA
TOCC T3A+LTRSL-T
TCTTCCTTTGAGCTCTCTGCAGCCGCGGCTGCACTCACTCGCTCACTAGTT
TOCC T3A+LTRSL-B
AACTAGTGAGCGAGTGAGTGCAGCCGCGGCTGCAGAGAGCTCAAAGGAAGA
TOCC T4A+VPFRS-T
CTAGTTCCGTTCCGATCGGCAGCCGCGGCTGCACGCTCCATTTCTAGGGTT
TOCC T4A+VPFRS-B
AACCCTAGAAATGGAGCGTGCAGCCGCGGCTGCCGATCGGAACGGAACTAG
TOCC T4A+TKLVP-T
CCTCTCACTCGCTCACTAGCAGCCGCGGCTGCAACTAAACTAGTTCCCCGC
TOCC T4A+TKLVP-B
GCGGGGAACTAGTTTAGTTGCAGCCGCGGCTGCTAGTGAGCGAGTGAGAGG
TOCC T5A+RSISR-T
CCCCGCTCCATTTCTAGGGCAGCCGCGGCTGCATCCACCCCGAATAGTGAA
TOCC T5A+RSISR-B
TTCACTATTCGGGGTGGATGCAGCCGCGGCTGCCCTAGAAATGGAGCGGGG
TOCC T5A+VSASI-T
TCGACTAAACTAGTTCCCGCAGCCGCGGCTGCAGTTTCGGCGTCGATCTCC
TOCC T5A+VSASI-B
GGAGATCGACGCCGAAACTGCAGCCGCGGCTGCGGGAACTAGTTTAGTCGA
Primers for in vitro translation in wheat germ extracts
T7p CabM
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CTCTAGAATG gcttaccttgacggttct
T7p J8M
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CTCTAGAATG tgttcttcttcatcttct
T7p BCCPM
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CTCTAGAATG gcacaatctaacaaggtt
T7p PORAM
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CTCTAGAATG tgcaagagggaacagagc
T7P GLU2M
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CATG gcgatccttaattctgac
T7P TOCCM
GGC CAG TGA ATT GTA ATA CGA CTC ACT ATA GGG CGA ATT GGA GCT CCA CCG CGG TGG CGG CCG CATG gcgtcgatctccaccccg
nosT 3’ ctatgacatgattacgaatt
Supplemental Methods
Assessment of alignment significance
The statistical significance of the observed scores of global alignments, based on either
the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) or our proposed algorithm with dual gap
penalties, was assessed by calculating a p-value, where the scores of the alignments to 10,000
randomly permuted sequences were approximated by a Gaussian distribution. A Gaussian
distribution with mean $\mu$ and variance $\sigma^2$ has the form

$$P(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}.$$

Maximum likelihood estimates of the two parameters were obtained from the scores of the
alignments of 10,000 randomly permuted sequences. The p-value of the observed alignment
score $x_{obs}$ was calculated from the estimated Gaussian distribution by

$$\text{p-value} = 1 - P(X \le x_{obs}).$$

The Gaussian approximation of the score distribution of the global alignments is justified by its adequate fit to
the score histograms of the alignments to 10,000 randomly permuted sequences (Fig. S8).
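The calculation described above can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, `identity_score` is a toy stand-in for a real alignment scoring function, and the example sequences are placeholders. The upper tail of the fitted Gaussian is evaluated with the complementary error function from the standard library.

```python
import math
import random
from typing import Callable, List, Tuple


def gaussian_mle(scores: List[float]) -> Tuple[float, float]:
    """Maximum likelihood estimates of the mean and variance."""
    n = len(scores)
    mu = sum(scores) / n
    var = sum((s - mu) ** 2 for s in scores) / n  # ML estimate divides by n
    return mu, var


def upper_tail_p_value(x_obs: float, mu: float, var: float) -> float:
    """1 - P(X <= x_obs) for X ~ N(mu, var)."""
    if var <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * math.erfc((x_obs - mu) / math.sqrt(2.0 * var))


def permuted_scores(query: str, target: str,
                    score_fn: Callable[[str, str], float],
                    n_perm: int, seed: int = 0) -> List[float]:
    """Scores of randomly permuted copies of `query` against `target`."""
    rng = random.Random(seed)
    residues = list(query)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues), target))
    return scores


def identity_score(a: str, b: str) -> float:
    """Toy stand-in for an alignment score: count of identical positions."""
    return float(sum(x == y for x, y in zip(a, b)))


# Placeholder sequences, for illustration only.
null_scores = permuted_scores("MASSMLSSAT", "MASSVFSLKP", identity_score, 1000)
mu_hat, var_hat = gaussian_mle(null_scores)
p_value = upper_tail_p_value(identity_score("MASSMLSSAT", "MASSVFSLKP"),
                             mu_hat, var_hat)
```

In practice `score_fn` would be replaced by the global alignment score of interest, and `n_perm` by the number of permutations used to build the null distribution.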
Detailed description of prediction of plastid transit peptide sequence motifs
The upper bound of the misclassification error plays a central role in our model. According to
Bayesian decision theory, the optimal decision rule minimizing the probability of
misclassification error is given by

$$\text{Decide } C_1 \text{ if } P(C_1 \mid \mathbf{x}) > P(C_2 \mid \mathbf{x}); \text{ otherwise decide } C_2,$$

where $\mathbf{x}$ is the feature vector, $C_1$ is the class label for chloroplast transit peptides, and
$C_2$ is the class label for other proteins. If the class-conditional densities are normal, the
probability of error is given by

$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \le \sqrt{P(C_1) P(C_2)}\, \exp\left(-k(1/2)\right),$$

where $k(1/2)$ is called the Bhattacharyya bound of the error and is given by

$$k(1/2) = \frac{1}{8} (\mu_2 - \mu_1)^{T} \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1| |\Sigma_2|}},$$

where $\mu_1$ is the mean vector for $C_1$, $\mu_2$ is the mean vector for $C_2$, $\Sigma_1$ is the
covariance matrix for $C_1$, and $\Sigma_2$ is the covariance matrix for $C_2$.
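As a worked illustration of the bound, the following Python sketch evaluates it for the special case of a one-dimensional feature, where the mean vectors reduce to scalars and the covariance matrices reduce to variances. The restriction to one dimension and the function names are assumptions made here for brevity; the model itself uses multivariate feature vectors.

```python
import math


def bhattacharyya_k(mu1: float, var1: float, mu2: float, var2: float) -> float:
    """k(1/2) for two univariate normal class-conditional densities."""
    avg_var = 0.5 * (var1 + var2)
    distance_term = (mu2 - mu1) ** 2 / (8.0 * avg_var)
    spread_term = 0.5 * math.log(avg_var / math.sqrt(var1 * var2))
    return distance_term + spread_term


def error_upper_bound(prior1: float, prior2: float,
                      mu1: float, var1: float,
                      mu2: float, var2: float) -> float:
    """Upper bound on P(error): sqrt(P(C1) P(C2)) * exp(-k(1/2))."""
    return math.sqrt(prior1 * prior2) * math.exp(
        -bhattacharyya_k(mu1, var1, mu2, var2))


# Identical classes cannot be separated: the bound equals 0.5 for equal priors.
overlapping = error_upper_bound(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)
# Well-separated classes give a much smaller bound.
separated = error_upper_bound(0.5, 0.5, 0.0, 1.0, 4.0, 1.0)
```

The bound shrinks as the class means move apart or as the class variances differ, which is why a feature set that minimizes it is a reasonable proxy for one that separates the two classes.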
To evaluate the significance of the overlap between the experimentally determined sequence motifs
and the predicted ones, we calculated p-values under the null hypothesis that the overlaps
occurred by chance, which was modeled using a hypergeometric distribution.
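One way to compute such a p-value is sketched below in Python. The parameterization is an assumption made for illustration, since it is not spelled out here: the sequence is treated as `total` residue positions, of which `n_experimental` lie in experimentally determined motifs and `n_predicted` lie in predicted motifs, and the p-value is the probability of observing at least `n_overlap` shared positions by chance.

```python
from math import comb


def hypergeom_upper_tail(total: int, n_experimental: int,
                         n_predicted: int, n_overlap: int) -> float:
    """P(X >= n_overlap) when n_predicted positions are drawn without
    replacement from `total` positions, n_experimental of which are 'hits'."""
    upper = min(n_experimental, n_predicted)
    denominator = comb(total, n_predicted)
    numerator = sum(
        comb(n_experimental, j) * comb(total - n_experimental, n_predicted - j)
        for j in range(n_overlap, upper + 1)
    )
    return numerator / denominator


# Hypothetical numbers: 80 positions, 20 experimental, 20 predicted, 15 shared.
p_overlap = hypergeom_upper_tail(80, 20, 20, 15)
```

A small value indicates that the predicted motifs coincide with the experimentally determined ones more often than random placement would explain.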
Detailed description of plastid transit peptide prediction
An empirical kernel map enables the SVM classifier to learn a classification function from
protein sequences. From the clustering module, we have a set of representative plastid transit
peptide sequences $D = \{b_1, \ldots, b_M\}$ with the predicted motifs $b'_1, \ldots, b'_M$, where $M$ is equal
to the number of groups. Let $a$ be an input sequence or a sequence in the training data. The
empirical kernel map finds a feature vector representation, mapping the sequence $a$ into a
vector space by $a \mapsto \phi(a)$. Then, the feature vector for the sequence $a$ has the following form

$$\phi(a) = \left( d_{NW}(b_1, a),\, d_{SW}(b_1, a),\, d_{MA}(b'_1, a),\, \ldots,\, d_{NW}(b_M, a),\, d_{SW}(b_M, a),\, d_{MA}(b'_M, a) \right)^{T},$$

where $d_{NW}(b_i, a)$ and $d_{SW}(b_i, a)$ are the scores of global alignments (Needleman-Wunsch
algorithm) and local alignments (Smith-Waterman algorithm) between the sequence $a$ and
the chosen plastid transit peptide sequence $b_i$ representing the $i$th group, and $d_{MA}(b'_i, a)$ is
the score of the proposed global alignment with dual gap penalties with the predicted
sequence motifs $b'_i$. We used LIBSVM 2.84 for the SVM classifier (Chang and Lin, 2001).
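The construction of the feature vector can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the study. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption chosen for brevity, the function names are hypothetical, and the example strings are placeholders. The dual-gap-penalty alignment is not reproduced here, so the map takes the three scoring functions as arguments and the usage example passes the plain global score as a stand-in for the third one.

```python
from typing import Callable, List, Sequence

Score = Callable[[str, str], float]


def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> float:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> float:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_kernel_map(a: str, representatives: Sequence[str],
                         motifs: Sequence[str],
                         d_nw: Score, d_sw: Score, d_ma: Score) -> List[float]:
    """Feature vector of length 3M: three scores per representative group."""
    features: List[float] = []
    for b_i, b_prime_i in zip(representatives, motifs):
        features.extend([d_nw(b_i, a), d_sw(b_i, a), d_ma(b_prime_i, a)])
    return features


# Placeholder strings; nw_score is used as a stand-in for the dual-gap score.
phi = empirical_kernel_map("ACGT", ["ACGT", "AGT"], ["CG", "GT"],
                           nw_score, sw_score, nw_score)
```

Each input sequence is thus represented by its similarity to every group representative, and the resulting fixed-length vectors can be passed to any SVM implementation.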
Performance evaluation of predicting chloroplast transit peptides
To measure the performance of the prediction system, we calculated sensitivity, specificity,
the Matthews correlation coefficient (MCC), and the overall accuracy using the following
equations:

$$\text{Sensitivity}(i) = \frac{tp(i)}{tp(i) + fn(i)},$$

$$\text{Specificity}(i) = \frac{tp(i)}{tp(i) + fp(i)},$$

$$\text{MCC}(i) = \frac{tp(i) \times tn(i) - fp(i) \times fn(i)}{\sqrt{(tp(i) + fn(i))(tp(i) + fp(i))(tn(i) + fp(i))(tn(i) + fn(i))}},$$

$$\text{Overall accuracy} = \frac{\sum_{i=1}^{2} tp(i)}{N},$$

where N is the total number of sequences, tp(i) (true positive) is the number of correctly
predicted sequences of class i, tn(i) (true negative) is the number of correctly predicted
sequences not of class i, fp(i) (false positive) is the number of over-predicted sequences of
class i, and fn(i) (false negative) is the number of under-predicted sequences of class i. When
a classifier gives a perfect prediction result, MCC equals one. In the opposite case of a
completely random assignment, MCC is zero.
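The four measures can be computed from the confusion counts as in the following Python sketch. The function names and the example counts are hypothetical, specificity follows the definition given in the text above (tp divided by tp plus fp), and returning zero for MCC when the denominator vanishes is a convention adopted here, not something stated in the text.

```python
import math
from typing import Sequence, Tuple


def class_metrics(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity (as defined in the text), and MCC for one class."""
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


def overall_accuracy(tp_per_class: Sequence[int], n_total: int) -> float:
    """Sum of the true positives of all classes divided by N."""
    return sum(tp_per_class) / n_total


# Hypothetical two-class example with N = 1000 sequences.
sens, spec, mcc = class_metrics(tp=90, tn=870, fp=30, fn=10)
accuracy = overall_accuracy([90, 870], 1000)
```

In the two-class setting, the true positives of the second class are the true negatives of the first, so the overall accuracy is the fraction of all sequences that were assigned to the correct class.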
Performance comparisons of predicting chloroplast transit peptides
To compare our prediction method with four other predictors (ChloroP, PCLR, TargetP,
and Predotar), we used (1) the same training set, (2) the same test set, and (3) N-terminal
sequences of equal size for our method and the comparison predictor. More detailed
information on preparing training and test sets is available in a previous publication by Schein
et al. (2001). For ChloroP and PCLR, our method was trained using the same training set (75
cTP + 75 non-cTP) and tested with the same test set (113 cTP + 725 non-cTP), and the size of the
N-terminal sequences was set to 100. For TargetP, our method was trained using the same training set (141
cTP + 799 non-cTP) and tested with the same test set (204 cTP + 741 non-cTP), and the size of the
N-terminal sequences was set to 100. The test set was generated from the 208 experimentally
confirmed plastid proteins and 778 non-plastid proteins (see Methods section), and redundant
sequences that overlapped the training set were removed. For Predotar, our method was trained using the same
training set (588 cTP + 5400 non-cTP) and tested with the same test set (154 cTP + 683 non-cTP),
and the size of the N-terminal sequences was set to 60. The test set was generated in a manner
similar to that used for TargetP. When we trained our method with the training sets of TargetP and
Predotar, we balanced the ratio of positive to negative sequences in the training data so that the number
of negative sequences was twice the number of positive ones.
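The balancing step can be sketched in Python as follows. Random subsampling of the negative sequences is an assumption made here, since the text does not say how the retained negatives were chosen, and the function name and placeholder identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def balance_training_set(positives: Sequence[str], negatives: Sequence[str],
                         ratio: int = 2, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Keep all positives and at most `ratio` times as many negatives."""
    rng = random.Random(seed)
    n_keep = min(len(negatives), ratio * len(positives))
    # Assumption: negatives are subsampled uniformly at random.
    return list(positives), rng.sample(list(negatives), n_keep)


# Placeholder identifiers sized like the TargetP training set in the text.
ctp = [f"cTP_{i}" for i in range(141)]
non_ctp = [f"non_cTP_{i}" for i in range(799)]
balanced_pos, balanced_neg = balance_training_set(ctp, non_ctp)
```

For a training set of this size, the sketch keeps all 141 positive sequences and 282 of the negative sequences.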
Supplemental References
Chang, C., and Lin, C. (2001). LIBSVM: a library for support vector machines. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Choi, J.H., Jung, H.Y., Kim, H.S., and Cho, H.G. (2000). PhyloDraw: a phylogenetic tree
drawing system. Bioinformatics 16, 1056-1058.
Felsenstein, J. (1989). PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5,
164-166.
Needleman, S.B., and Wunsch, C.D. (1970). A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. (1997). The
CLUSTAL X windows interface: flexible strategies for multiple sequence alignment aided by
quality analysis tools. Nucleic Acids Res. 25, 4876-4882.