Automatic ssu-rRNA novelty ranking pipeline
Ssu-RNA sequences
One ranking score for each sequence for phylogenetic novelty
Dongying Wu
03/2015
A tree of query sequences with SILVA references (reference sequences with defined phyla)
Cut the tree at the phylum level into OTUs. Query sequences left as singletons, with no reference sequence in their OTU, are from novel phyla.
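The singleton test can be sketched as follows. This is a minimal illustration, not the pipeline's own code: it assumes the OTU membership has already been produced by cutting the tree, and the names `otu_of` and `novel_phylum_candidates` are hypothetical.

```python
from collections import defaultdict

def novel_phylum_candidates(otu_of, query_ids):
    """Return the query IDs whose OTU contains no reference sequence.

    otu_of: dict mapping every sequence ID (query or reference) to an OTU label.
    query_ids: iterable of the IDs that are query sequences.
    """
    members = defaultdict(set)
    for seq_id, otu in otu_of.items():
        members[otu].add(seq_id)
    queries = set(query_ids)
    novel = set()
    for seqs in members.values():
        if seqs <= queries:  # the OTU is made of query sequences only
            novel |= seqs
    return sorted(novel)

# Toy example: query_1 shares OTU 1 with references, query_2 is alone in OTU 3.
otu_of = {"ref_A1": 1, "ref_A2": 1, "query_1": 1, "ref_B1": 2, "query_2": 3}
print(novel_phylum_candidates(otu_of, ["query_1", "query_2"]))
```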
1 Archaea;Ancient Archaeal Group(AAG)
14 Archaea;Crenarchaeota
44 Archaea;Euryarchaeota
1 Archaea;Korarchaeota
2 Archaea;Marine Hydrothermal Vent Group 1(MHVG-1)
1 Archaea;Nanoarchaeota
15 Archaea;Thaumarchaeota
70 Bacteria;Acidobacteria
60 Bacteria;Actinobacteria
3 Bacteria;Aquificae
44 Bacteria;Armatimonadetes
23 Bacteria;BD1-5
11 Bacteria;BHI80-139
103 Bacteria;Bacteroidetes
7 Bacteria;CK-1C4-19
13 Bacteria;Caldiserica
311 Bacteria;Candidate division OD1
4 Bacteria;Chlamydiae
17 Bacteria;Chlorobi
125 Bacteria;Chloroflexi
1 Bacteria;Chrysiogenetes
128 Bacteria;Cyanobacteria
18 Bacteria;Deferribacteres
8 Bacteria;Deinococcus-Thermus
3 Bacteria;Dictyoglomi
22 Bacteria;Elusimicrobia
27 Bacteria;Fibrobacteres
217 Bacteria;Firmicutes
24 Bacteria;Fusobacteria
4 Bacteria;GAL08
5 Bacteria;GOUTA4
9 Bacteria;Gemmatimonadetes
6 Bacteria;Hyd24-12
10 Bacteria;JL-ETNP-Z39
9 Bacteria;Kazan-3B-28
2 Bacteria;LD1-PA38
22 Bacteria;Lentisphaerae
3 Bacteria;MVP-21
29 Bacteria;NPL-UPA2
22 Bacteria;Nitrospirae
3 Bacteria;OC31
116 Bacteria;Planctomycetes
256 Bacteria;Proteobacteria
8 Bacteria;RF3
1 Bacteria;RsaHF231
1 Bacteria;S2R-29
1 Bacteria;SBYG-2791
3 Bacteria;SM2F11
45 Bacteria;Spirochaetae
13 Bacteria;Synergistetes
55 Bacteria;TA06
15 Bacteria;TM6
13 Bacteria;Tenericutes
3 Bacteria;Thermodesulfobacteria
18 Bacteria;Thermotogae
24 Bacteria;Verrucomicrobia
6 Bacteria;WCHB1-60
6 Bacteria;aquifer1
3 Bacteria;aquifer2
Number of representatives from each phylum (1950 bacteria/78 archaea)
38 Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta
29 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta
41 Eukaryota;Excavata;Discoba;Discicristata;Euglenozoa;Euglenida
7 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Acanthocephala
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Annelida
66 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Arthropoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Brachiopoda
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Bryozoa
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Chaetognatha
8 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Ctenophora
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cycliophora
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Echinodermata
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Entoprocta
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gastrotricha
5 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gnathostomulida
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Hemichordata
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Kinorhyncha
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Loricifera
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Orthonectida
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Rhombozoa
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mollusca
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Myzostomida
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematomorpha
10 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nemertea
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Onychophora
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Platyhelminthes
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Priapulida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Rotifera
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Tardigrada
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Xenoturbellida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Porifera
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Chytridiomycota
55 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota
12 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota
25 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Microsporidia
7 Eukaryota;Picozoa
95 Eukaryota;SAR;Alveolata;Apicomplexa
2 Eukaryota;SAR;Alveolata;Protalveolata;Chromerida
1 Eukaryota;SAR;Stramenopiles;Diatomea;Coscinodiscophytina;Fragilariales;Ctenophora
3 Eukaryota;SAR;Stramenopiles;Phaeophyceae
4 Eukaryota;SAR;Stramenopiles;Xanthophyceae
Number of representatives from each phylum (564 Eukaryota)
The selection process maximizes phylogenetic diversity. The core representative sequences only include those with PHYLUM assignments. Thus we have phylogenetic gaps in the representative data set.
We have to include close relatives of the query sequences from SILVA for tree building:
1. Fill phylogenetic gaps in the core references
2. Guide short query sequences into the right positions in the tree
Query ssu-rRNA
Top 10 hits from SILVA by BLAT
Align query sequences and hits with SINA
Build ML tree with FastTree together with pre-aligned core reference sequences
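The hit-selection step can be illustrated with a short sketch. It assumes the BLAT hits were written in tabular blast8 format (blat's `-out=blast8` option) and ranks hits by bit score; the slides do not say which output format or ranking criterion the pipeline script uses, and the function name `top_hits` is hypothetical.

```python
from collections import defaultdict

def top_hits(blast8_lines, n=10):
    """Map each query ID to its n best-scoring distinct subject IDs.

    Assumes 12 tab-separated blast8 columns: query, subject, ..., bit score last.
    """
    best = defaultdict(dict)  # query -> {subject: best bit score seen}
    for line in blast8_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject, bits = cols[0], cols[1], float(cols[11])
        if bits > best[query].get(subject, float("-inf")):
            best[query][subject] = bits
    return {
        query: [s for s, _ in sorted(hits.items(), key=lambda kv: -kv[1])[:n]]
        for query, hits in best.items()
    }
```

The union of the selected subject IDs over all queries would then be pulled from the SILVA FASTA file and aligned together with the queries.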
Jessica Jarett’s 28 test ssu-rRNA sequences + 169 top hits + core references
Eukaryota
Archaea
Bacteria
How to identify the cutoff line at the phylum level
Cut the tree using different TreeOTU cutoffs and compare the resulting OTUs with the phylum-level OTU standard defined by SILVA (query sequences are ignored during the comparison)
[Plot: AMI compared to SILVA phylum level definition (Bacteria/Archaea), y-axis 0 to 0.7, against TreeOTU cutoff, x-axis 0 to 1.2]
0.42 is the TreeOTU cutoff for phylum
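The cutoff scan can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: it takes AMI to be adjusted mutual information with arithmetic-mean normalization, expects the OTU labels at each cutoff to have been computed elsewhere (for reference sequences only, since queries are ignored), and uses hypothetical names throughout.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * math.log(n * nij / (cu[a] * cv[b]))
               for (a, b), nij in cuv.items())

def expected_mutual_information(u, v):
    """Expected MI of two random clusterings with the same cluster sizes."""
    n = len(u)
    lf = lambda k: math.lgamma(k + 1)  # log(k!)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                log_p = (lf(a) + lf(b) + lf(n - a) + lf(n - b) - lf(n)
                         - lf(nij) - lf(a - nij) - lf(b - nij)
                         - lf(n - a - b + nij))
                emi += (nij / n) * math.log(n * nij / (a * b)) * math.exp(log_p)
    return emi

def ami(u, v):
    """Adjusted mutual information between two labelings of the same items."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = 0.5 * (entropy(u) + entropy(v)) - emi
    return 1.0 if abs(denom) < 1e-12 else (mi - emi) / denom

def best_cutoff(otus_by_cutoff, phylum_labels):
    """Return (cutoff with the highest AMI, {cutoff: AMI})."""
    scores = {c: ami(labels, phylum_labels)
              for c, labels in otus_by_cutoff.items()}
    return max(scores, key=scores.get), scores

# Toy data: six references from three phyla, clustered at three cutoffs.
phyla = ["A", "A", "B", "B", "C", "C"]
otus_by_cutoff = {
    0.2: [1, 2, 3, 4, 5, 6],  # too fine: every reference is its own OTU
    0.4: [1, 1, 2, 2, 3, 3],  # matches the phylum definition
    0.8: [1, 1, 1, 1, 1, 1],  # too coarse: one OTU
}
cutoff, scores = best_cutoff(otus_by_cutoff, phyla)
print(cutoff)
```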
Query in an OTU with reference sequences?
Yes No
10-CSP1477__MDM2__DC4__Prim__02__M8__S11_B02_014
19-CSP1477__MDM2__DC4__SYBR__02__M20__S7_C03_027
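$$
\text{Novel score} =
\begin{cases}
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{1 - \text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} \ge \text{Cutoff}_{\text{phylum}} \\[1.5ex]
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{\text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} < \text{Cutoff}_{\text{phylum}}
\end{cases}
$$
Score scale: 1 (root), 0 (phylum line), -1 (tip)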
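A minimal sketch of the score, written directly from the two cases above. It reads Cutoffquery as the TreeOTU cutoff at which the query first shares an OTU with a reference sequence, which is an interpretation of the slide rather than something it states; the function name is hypothetical.

```python
def novelty_score(cutoff_query, cutoff_phylum):
    """Rescale cutoff_query so that 1 = root, 0 = phylum line, -1 = tip."""
    if not 0.0 < cutoff_phylum < 1.0:
        raise ValueError("cutoff_phylum must lie strictly between 0 and 1")
    diff = cutoff_query - cutoff_phylum
    if cutoff_query >= cutoff_phylum:
        # Above the phylum line: scale by the distance from the line to the root.
        return diff / (1.0 - cutoff_phylum)
    # Below the phylum line: scale by the distance from the line to the tip.
    return diff / cutoff_phylum

# With the phylum cutoff of 0.42 found for Bacteria/Archaea:
for q in (1.0, 0.42, 0.21, 0.0):
    print(q, round(novelty_score(q, 0.42), 3))
```

Normalizing each side of the phylum line separately keeps the score within [-1, 1] whatever the phylum cutoff is, so scores from the Bacteria/Archaea and Eukaryota trees stay on the same scale.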
The pipeline has been completed. Here is one command line example:
~/dwu_scripts/single_cell/prep_seq_with_close_relatives_from_silva.pl -db ../../SSURef_NR99_115_tax_silva_trunc.fasta -blat ~/bin/blat -input star16S18Sseq.txt -output star16S18Sseq.hit
nohup ~/dwu_scripts/single_cell/run_sina.pl -i star16S18Sseq.hit -o star16S18Sseq.ali -db ../../SSURef_NR99_115_SILVA_20_07_13_opt.arb -sina ~/bin/SINA/sina-1.2.11/sina &
cat star16S18Sseq.ali ../silva_phyla_rep.sina.fasta | ~/dwu_scripts/single_cell/trim_all_nt_gap.pl > star16S18Sseq.trim
nohup ~/bin/FastTree -nt star16S18Sseq.trim > star16S18Sseq.tre &
~/dwu_scripts/single_cell/phylum_novelty_ranking.pl -tree star16S18Sseq.tre -reftaxa ../silva_phyla.info -output star16S18Sseq.ranking
Example of the output
##Bacteria && Archaea: standard reference TreerOTU cutoff: 0.490
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
CanI4_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.14918
CanF1_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.1617
CanI1_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.20465
CanI3_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.20612
CanI2_ -0.612 Bacteria;Proteobacteria FM165230.1.1430 0.12171
AerM5_Uncultured_Bacterium_KF135895 -0.633 Bacteria;Proteobacteria JX223534.1.1452 0.21316
AerM4_Uncultured_Bacterium_KF135901 -0.633 Bacteria;Proteobacteria JX223534.1.1452 0.12817
CanJ2_Uncultured_Bacterium_JF232531 -0.735 Bacteria;Proteobacteria DQ264409.1.1505 0.1243
….
##Eukaryota: standard reference TreerOTU cutoff: 0.370
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
AC1_Uncultured_archaeon_clone_DQ088777 0.984 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta Eukaryota_AF525614.1.1325 1.415856__4_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.206882__2_Saccharomyces_cerevisiae_YJM993_CP006467 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.359836__2_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.205566__3_Uncultured_Fungus_KC337083 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_GU324000.1.1794 0.334724__1_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.358672__1_Saccharomyces_cerevisiae_YJM993_CP006467 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.358826__1_Uncultured_Eukaryote_EU326631 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.20425
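A ranking file like the one above could be read back with a small parser. This is a sketch under assumptions: it expects one record per line, and it treats the five whitespace-separated columns as query name, novelty score, taxonomy, a reference ID, and a distance. The slides do not label the last two columns, so those field names are guesses, and `RankingRow` and `parse_ranking` are hypothetical.

```python
from typing import NamedTuple

class RankingRow(NamedTuple):
    query: str
    score: float
    taxonomy: str
    reference: str   # assumed: ID of a SILVA reference sequence
    distance: float  # assumed: a tree distance to that reference

def parse_ranking(lines):
    """Parse ranking records and return them sorted from most to least novel."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and the '##' header lines
        fields = line.split()
        if len(fields) < 5:
            continue  # skip truncated records
        # Taxonomy strings may contain spaces, so rejoin the middle fields.
        rows.append(RankingRow(fields[0], float(fields[1]),
                               " ".join(fields[2:-2]), fields[-2],
                               float(fields[-1])))
    return sorted(rows, key=lambda r: r.score, reverse=True)

example = [
    "##Bacteria && Archaea: standard reference TreerOTU cutoff: 0.490",
    "CanJ2_Uncultured_Bacterium_JF232531 -0.735 Bacteria;Proteobacteria DQ264409.1.1505 0.1243",
    "CanI4_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.14918",
]
for row in parse_ranking(example):
    print(row.query, row.score)
```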
Novelty ranking
Star Diamonds sample from Ramunas Stepanauskas, Mariana Erasmus
SILVA is required only in the alignment step (sina). The database for close-relative selection, the taxonomic level information, and the core reference sequences for novelty ranking can all be modified.
proteobacteria
Database Problem
Database and reference updating:
1. Add new phyla to the core reference sequences
2. Merge or split current phylum definitions
Only one rule must be observed: the classification must be exclusive. Each sequence in the database or core reference set must be in one taxonomic group only, and the taxonomic groups should be treated equally.
E.g., if we break Proteobacteria into 4 groups, the definition of Proteobacteria needs to be removed. Proteobacteria sequences that cannot be assigned to the 4 sub-groups can stay in the database or core reference dataset. They can play a role in tree structure and novelty ranking, but play no role in taxonomic assignment or in identifying the novelty baseline TreeOTU cutoff.
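The exclusivity rule is easy to check mechanically before a database update. A minimal sketch, assuming the taxonomy is available as (sequence ID, group) pairs; the function name and the sub-group name in the example are hypothetical.

```python
from collections import defaultdict

def exclusivity_violations(assignments):
    """Return {sequence_id: sorted groups} for sequences in more than one group.

    assignments: iterable of (sequence_id, group) pairs. A group of None means
    the sequence is kept without a taxonomic assignment and is not counted.
    """
    groups = defaultdict(set)
    for seq_id, group in assignments:
        if group is not None:
            groups[seq_id].add(group)
    return {s: sorted(g) for s, g in groups.items() if len(g) > 1}

# seq2 still carries the old Proteobacteria label next to a new sub-group.
assignments = [
    ("seq1", "Bacteria;Firmicutes"),
    ("seq2", "Bacteria;Proteobacteria"),
    ("seq2", "Bacteria;Alphaproteobacteria"),
    ("seq3", None),  # unassignable sequence kept for tree structure only
]
print(exclusivity_violations(assignments))
```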
High novelty ranking sequence problem:
M00954:45:000000000-A8ECK:1:1102:18894:5578 0.981 Archaea;Euryarchaeota Archaea_AF328210.1.10132.25351
Batch job result consistency problem
In an ideal world, the pipeline would take one sequence at a time, and each sequence would have one novelty ranking score. But the alignment and tree-building steps are slow, so we have to combine queries to build a reasonable number of alignments and trees. How different bundling affects the novelty ranking score needs to be addressed.
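One simple way to look at the bundling effect would be to run the same queries in several different bundles and compare their scores. The sketch below only does the comparison. It assumes each run's scores are already loaded as a dict, and it is a suggestion for measuring the problem, not part of the pipeline.

```python
def score_spread(runs):
    """Return {query_id: max score - min score} for queries seen in 2+ runs.

    runs: list of {query_id: novelty_score} dicts, one per bundling.
    """
    seen = {}
    for run in runs:
        for query, score in run.items():
            seen.setdefault(query, []).append(score)
    return {q: max(s) - min(s) for q, s in seen.items() if len(s) > 1}

# Toy example: q1 was scored in two different bundles, q2 and q3 in one each.
runs = [
    {"q1": -0.51, "q2": 0.98},
    {"q1": -0.53, "q3": 0.10},
]
print(score_spread(runs))
```

A spread close to 0 for every query would indicate that bundling has little effect on the score.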