Automatic ssu-rRNA novelty ranking pipeline

ssu-rRNA sequences

One ranking score for each sequence for phylogenetic novelty

Dongying Wu

03/2015

A tree of query sequences with SILVA references (reference sequences with defined phyla)

Cut the tree at the phylum level into OTUs. Query sequences that remain singletons are from novel phyla.

1 Archaea;Ancient Archaeal Group(AAG)
14 Archaea;Crenarchaeota
44 Archaea;Euryarchaeota
1 Archaea;Korarchaeota
2 Archaea;Marine Hydrothermal Vent Group 1(MHVG-1)
1 Archaea;Nanoarchaeota
15 Archaea;Thaumarchaeota

70 Bacteria;Acidobacteria
60 Bacteria;Actinobacteria
3 Bacteria;Aquificae
44 Bacteria;Armatimonadetes
23 Bacteria;BD1-5
11 Bacteria;BHI80-139
103 Bacteria;Bacteroidetes
7 Bacteria;CK-1C4-19
13 Bacteria;Caldiserica
311 Bacteria;Candidate division OD1
4 Bacteria;Chlamydiae
17 Bacteria;Chlorobi
125 Bacteria;Chloroflexi
1 Bacteria;Chrysiogenetes
128 Bacteria;Cyanobacteria
18 Bacteria;Deferribacteres
8 Bacteria;Deinococcus-Thermus
3 Bacteria;Dictyoglomi
22 Bacteria;Elusimicrobia
27 Bacteria;Fibrobacteres
217 Bacteria;Firmicutes
24 Bacteria;Fusobacteria
4 Bacteria;GAL08
5 Bacteria;GOUTA4
9 Bacteria;Gemmatimonadetes
6 Bacteria;Hyd24-12
10 Bacteria;JL-ETNP-Z39
9 Bacteria;Kazan-3B-28
2 Bacteria;LD1-PA38
22 Bacteria;Lentisphaerae
3 Bacteria;MVP-21
29 Bacteria;NPL-UPA2
22 Bacteria;Nitrospirae
3 Bacteria;OC31
116 Bacteria;Planctomycetes
256 Bacteria;Proteobacteria
8 Bacteria;RF3
1 Bacteria;RsaHF231
1 Bacteria;S2R-29
1 Bacteria;SBYG-2791
3 Bacteria;SM2F11
45 Bacteria;Spirochaetae
13 Bacteria;Synergistetes
55 Bacteria;TA06
15 Bacteria;TM6
13 Bacteria;Tenericutes
3 Bacteria;Thermodesulfobacteria
18 Bacteria;Thermotogae
24 Bacteria;Verrucomicrobia
6 Bacteria;WCHB1-60
6 Bacteria;aquifer1
3 Bacteria;aquifer2

Number of representatives from each phylum (1950 bacteria/78 archaea)

38 Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta
29 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta
41 Eukaryota;Excavata;Discoba;Discicristata;Euglenozoa;Euglenida
7 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Acanthocephala
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Annelida
66 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Arthropoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Brachiopoda
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Bryozoa
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Chaetognatha
8 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Ctenophora
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cycliophora
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Echinodermata
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Entoprocta
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gastrotricha
5 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Gnathostomulida
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Hemichordata
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Kinorhyncha
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Loricifera
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Orthonectida
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mesozoa;Rhombozoa
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Mollusca
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Myzostomida
17 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematoda
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nematomorpha
10 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Nemertea
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Onychophora
27 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Platyhelminthes
2 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Priapulida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Rotifera
4 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Tardigrada
1 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Xenoturbellida
3 Eukaryota;Opisthokonta;Holozoa;Metazoa;Porifera
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Chytridiomycota
55 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota
12 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota
11 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Kickxellomycotina;Glomeromycota
25 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Microsporidia
7 Eukaryota;Picozoa
95 Eukaryota;SAR;Alveolata;Apicomplexa
2 Eukaryota;SAR;Alveolata;Protalveolata;Chromerida
1 Eukaryota;SAR;Stramenopiles;Diatomea;Coscinodiscophytina;Fragilariales;Ctenophora
3 Eukaryota;SAR;Stramenopiles;Phaeophyceae
4 Eukaryota;SAR;Stramenopiles;Xanthophyceae

Number of representatives from each phylum (564 Eukaryota)

The selection process maximizes phylogenetic diversity. The core representative sequences only include those with PHYLUM assignments. Thus we have phylogenetic gaps in the representative data set.

For tree building, we have to include close relatives of the query sequences from SILVA, in order to:

1. Fill phylogenetic gaps in the core references
2. Guide short query sequences into the right positions in the tree

Query ssu-rRNA

Top 10 hits from SILVA by BLAT

Align query sequences and hits with SINA

Build ML tree with FastTree together with pre-aligned core reference sequences
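The close-relative selection step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the logic of the pipeline's own script: it assumes BLAT's default tab-separated PSL output, and it ranks hits by the number of matching bases (the first PSL column), which is an assumed criterion. The function name `top_hits` is hypothetical.

```python
from collections import defaultdict


def top_hits(psl_lines, n=10):
    """Return {query_name: [up to n distinct target names]} from PSL lines.

    Assumption: hits are ranked by PSL column 1 (matches), best first.
    """
    hits = defaultdict(list)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # PSL records have 21 columns; header lines do not start with a number.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, query, target = int(fields[0]), fields[9], fields[13]
        hits[query].append((matches, target))

    best = {}
    for query, scored in hits.items():
        seen, ranked = set(), []
        for matches, target in sorted(scored, key=lambda h: -h[0]):
            if target not in seen:  # keep each reference sequence once
                seen.add(target)
                ranked.append(target)
            if len(ranked) == n:
                break
        best[query] = ranked
    return best
```

Called as `top_hits(open("hits.psl"), n=10)` (file name hypothetical), it yields the reference identifiers to pull from SILVA before alignment.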

Jessica Jarett's test set: 28 ssu-rRNA sequences + 169 top hits + core references

Tree figure, clades labeled: Eukaryota, Archaea, Bacteria

Tree rooting to separate Eukaryota and Archaea/Bacteria (automatic)

Tree rooting for Archaea/Bacteria (automatic)

How to identify the cutoff line at the phylum level

Cut the tree using different TreeOTU cutoffs, and compare the resulting OTUs with the phylum-level OTU standard defined by SILVA (query sequences are ignored during the comparison)

Plot: AMI compared to SILVA phylum level definition (Bacteria/Archaea) on the y-axis (0 to 0.7) against TreeOTU cutoff on the x-axis (0 to 1.2)

0.42 is the TreeOTU cutoff for phylum
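The cutoff scan can be sketched as picking the cutoff whose OTUs agree best with the SILVA phylum labels of the reference sequences. The sketch below makes several assumptions: the OTU membership at each cutoff is already available as a list of cluster labels (the tree cutting itself is not shown), and plain normalized mutual information is swapped in for AMI (adjusted mutual information) to keep the example dependency-free. AMI adds a correction for chance agreement; scikit-learn's `adjusted_mutual_info_score` could be passed in as `score` instead. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (ca[i] * cb[j]))
               for (i, j), nij in cab.items())


def nmi(a, b):
    """Normalized mutual information (stand-in for AMI, no chance correction)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / math.sqrt(ha * hb)


def best_cutoff(clusterings, reference, score=nmi):
    """clusterings: {cutoff: OTU labels of the reference sequences only}."""
    return max(clusterings, key=lambda c: score(clusterings[c], reference))


# Toy data: six reference sequences from three phyla.
phyla = ["A", "A", "B", "B", "C", "C"]
otus_by_cutoff = {
    0.2: [0, 1, 2, 3, 4, 5],  # cut too close to the tips: all singletons
    0.4: [0, 0, 1, 1, 2, 2],  # matches the phylum definition
    0.8: [0, 0, 0, 0, 0, 0],  # cut too close to the root: one OTU
}
print(best_cutoff(otus_by_cutoff, phyla))  # 0.4
```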

Query in an OTU with reference sequences?

Yes No

10-CSP1477__MDM2__DC4__Prim__02__M8__S11_B02_014

19-CSP1477__MDM2__DC4__SYBR__02__M20__S7_C03_027

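$$
\text{Novel score} =
\begin{cases}
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{1 - \text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} \ge \text{Cutoff}_{\text{phylum}} \\[2ex]
\dfrac{\text{Cutoff}_{\text{query}} - \text{Cutoff}_{\text{phylum}}}{\text{Cutoff}_{\text{phylum}}} & \text{if } \text{Cutoff}_{\text{query}} < \text{Cutoff}_{\text{phylum}}
\end{cases}
$$

Scale: 1 (root), 0 (phylum line), -1 (tip)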
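The two-branch score can be written as a short Python function. This is a minimal sketch; the function and argument names are hypothetical, and the 0.16 input in the last line is an illustrative value, not one read from the slides.

```python
def novelty_score(cutoff_query, cutoff_phylum):
    """Map a query's TreeOTU cutoff onto the scale 1 (root), 0 (phylum line), -1 (tip)."""
    if not 0.0 < cutoff_phylum < 1.0:
        raise ValueError("cutoff_phylum must lie strictly between 0 and 1")
    if cutoff_query >= cutoff_phylum:
        # At or above the phylum line: rescale [cutoff_phylum, 1] onto [0, 1].
        return (cutoff_query - cutoff_phylum) / (1.0 - cutoff_phylum)
    # Below the phylum line: rescale [0, cutoff_phylum] onto [-1, 0].
    return (cutoff_query - cutoff_phylum) / cutoff_phylum


print(novelty_score(1.0, 0.42))             # 1.0, at the root
print(novelty_score(0.42, 0.42))            # 0.0, on the phylum line
print(novelty_score(0.0, 0.42))             # -1.0, at the tip
print(round(novelty_score(0.16, 0.42), 3))  # -0.619
```

Dividing by a different denominator on each side of the phylum line keeps the score within [-1, 1] regardless of where the phylum cutoff falls.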

Chlorobi

Ranking value: -0.619

Aquificae

Deinococcus-Thermus ?

Archaea ?

The pipeline has been completed. Here is one command line example:

~/dwu_scripts/single_cell/prep_seq_with_close_relatives_from_silva.pl -db ../../SSURef_NR99_115_tax_silva_trunc.fasta -blat ~/bin/blat -input star16S18Sseq.txt -output star16S18Sseq.hit

nohup ~/dwu_scripts/single_cell/run_sina.pl -i star16S18Sseq.hit -o star16S18Sseq.ali -db ../../SSURef_NR99_115_SILVA_20_07_13_opt.arb -sina ~/bin/SINA/sina-1.2.11/sina &

cat star16S18Sseq.ali ../silva_phyla_rep.sina.fasta | ~/dwu_scripts/single_cell/trim_all_nt_gap.pl > star16S18Sseq.trim

nohup ~/bin/FastTree -nt star16S18Sseq.trim > star16S18Sseq.tre &

~/dwu_scripts/single_cell/phylum_novelty_ranking.pl -tree star16S18Sseq.tre -reftaxa ../silva_phyla.info -output star16S18Sseq.ranking

Example of the output

##Bacteria && Archaea: standard reference TreerOTU cutoff: 0.490
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
CanI4_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.14918
CanF1_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.1617
CanI1_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.20465
CanI3_Uncultured_Thiothrix_sp___JX435593 -0.510 Bacteria;Proteobacteria FM174326.1.1418 0.20612
CanI2_ -0.612 Bacteria;Proteobacteria FM165230.1.1430 0.12171
AerM5_Uncultured_Bacterium_KF135895 -0.633 Bacteria;Proteobacteria JX223534.1.1452 0.21316
AerM4_Uncultured_Bacterium_KF135901 -0.633 Bacteria;Proteobacteria JX223534.1.1452 0.12817
CanJ2_Uncultured_Bacterium_JF232531 -0.735 Bacteria;Proteobacteria DQ264409.1.1505 0.1243
….

##Eukaryota: standard reference TreerOTU cutoff: 0.370
##values: 1->root, 0->same_as_refrence_standard, -1->identical_seqeunces_in_the_references
AC1_Uncultured_archaeon_clone_DQ088777 0.984 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta Eukaryota_AF525614.1.1325 1.41585
6__4_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.20688
2__2_Saccharomyces_cerevisiae_YJM993_CP006467 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.35983
6__2_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.20556
6__3_Uncultured_Fungus_KC337083 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_GU324000.1.1794 0.33472
4__1_Saccharomyces_cerevisiae_YJM789_JQ277730 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.35867
2__1_Saccharomyces_cerevisiae_YJM993_CP006467 -0.703 Eukaryota;Opisthokonta;Holozoa;Metazoa;Animalia;Cnidaria Eukaryota_ABRM01021940.10464.12058 0.35882
6__1_Uncultured_Eukaryote_EU326631 -0.703 Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota Eukaryota_BAEL01000039.331389.332863 0.20425
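A ranking file of this shape could be read back with a small Python helper. The sketch assumes one record per line with whitespace-separated columns, and it assigns hypothetical field names: the slides do not label the columns, so `reference` and `last_value` in particular are guesses at what the fourth and fifth columns hold. Because lineage names may contain spaces, the taxonomy is taken as everything between the second column and the last two.

```python
def parse_ranking(lines):
    """Parse ranking output lines into dicts; '##' header lines are skipped."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue
        tokens = line.split()
        if len(tokens) < 5:
            continue  # e.g. a truncation marker
        try:
            score, last_value = float(tokens[1]), float(tokens[-1])
        except ValueError:
            continue  # not a data row
        records.append({
            "query": tokens[0],
            "score": score,                        # novelty ranking value
            "taxonomy": " ".join(tokens[2:-2]),    # lineage, may contain spaces
            "reference": tokens[-2],               # assumed: a reference sequence ID
            "last_value": last_value,              # meaning not stated in the slides
        })
    return records


example = [
    "##Bacteria && Archaea: standard reference TreerOTU cutoff: 0.490",
    "CanI2_ -0.612 Bacteria;Proteobacteria FM165230.1.1430 0.12171",
    "AerM5_Uncultured_Bacterium_KF135895 -0.633 Bacteria;Proteobacteria JX223534.1.1452 0.21316",
]
for record in sorted(parse_ranking(example), key=lambda r: -r["score"]):
    print(record["query"], record["score"])
```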

Novelty ranking

Star Diamonds sample from Ramunas Stepanauskas, Mariana Erasmus

SILVA is required only in the alignment step (SINA). We can modify the database used for close-relative selection, the taxonomic level information, and the core reference sequences used for novelty ranking.

proteobacteria

Database Problem

Firmicutes

Database and reference updating:

1. Add new phyla to the core reference sequences

2. Merge or split current phylum definitions

Only one rule must be observed: the classification must be exclusive. Each sequence in the database or core reference set must belong to exactly one taxonomic group, and all taxonomic groups are treated equally.

E.g., if we break Proteobacteria into 4 groups, the definition of Proteobacteria needs to be removed. Proteobacteria sequences that cannot be assigned to any of the 4 sub-groups can stay in the database or core reference dataset. They contribute to tree structure and novelty ranking, but play no role in taxonomic assignment or in identifying the novelty baseline TreeOTU cutoff.
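The exclusivity rule lends itself to an automatic check before a modified reference set is used. The Python sketch below is an assumption-laden illustration: it supposes the taxonomy is available as (sequence ID, lineage) pairs with semicolon-separated lineages, with `None` for sequences kept without an assignment. The function name is hypothetical, and the actual format of the pipeline's taxonomy file is not shown in the slides.

```python
from collections import defaultdict


def exclusivity_violations(assignments):
    """Return (sequences in more than one group, nested group pairs)."""
    groups_per_seq = defaultdict(set)
    for seq_id, lineage in assignments:
        if lineage is None:  # unassigned sequences take no part in the check
            continue
        groups_per_seq[seq_id].add(lineage)

    multi = sorted(s for s, g in groups_per_seq.items() if len(g) > 1)

    # A group nested inside another (e.g. a sub-group of Proteobacteria while
    # Proteobacteria itself is still defined) means groups are not exclusive.
    lineages = sorted({l for g in groups_per_seq.values() for l in g})
    nested = [(parent, child) for parent in lineages for child in lineages
              if child.startswith(parent + ";")]
    return multi, nested


pairs = [
    ("s1", "Bacteria;Proteobacteria"),
    ("s2", "Bacteria;Proteobacteria;Alphaproteobacteria"),
    ("s3", "Bacteria;Firmicutes"),
    ("s3", "Bacteria;Chlorobi"),
    ("s4", None),
]
print(exclusivity_violations(pairs))
```

In the toy input, `s3` is reported because it sits in two groups, and the Proteobacteria pair is reported because the parent definition was not removed after adding a sub-group.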

High novelty ranking sequence problem:

M00954:45:000000000-A8ECK:1:1102:18894:5578 0.981 Archaea;Euryarchaeota Archaea_AF328210.1.1013 2.25351

Batch job result consistency problem

In an ideal world, the pipeline would take one sequence at a time, and each sequence would have one novelty ranking score. But the alignment and tree-building steps are slow, so we have to combine queries to build a reasonable number of alignments and trees. How different bundlings affect the novelty ranking score needs to be addressed.
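One way to start quantifying the bundling effect is to score the same queries in several differently composed batches and look at the per-sequence spread. The Python sketch below assumes each run has already been reduced to a mapping from query name to novelty score; the function name and the toy numbers are hypothetical, not measured results.

```python
def score_spread(runs):
    """Return {query: max score - min score} for queries seen in at least two runs."""
    collected = {}
    for run in runs:
        for name, score in run.items():
            collected.setdefault(name, []).append(score)
    return {name: max(scores) - min(scores)
            for name, scores in collected.items() if len(scores) > 1}


# Toy example with two bundlings that share one query.
run_a = {"query1": -0.5, "query2": 0.25}
run_b = {"query1": -0.75, "query3": 0.0}
print(score_spread([run_a, run_b]))  # {'query1': 0.25}
```

A spread near zero for all shared queries would suggest the ranking is insensitive to how queries are bundled.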

Team members at the current stage:

UC Davis: Dongying Wu, Guillaume Jospin, Jonathan A. Eisen

JGI: Jessica Jarett, Tanja Woyke

SCGC Bigelow: Ramunas Stepanauskas