16S rRNA SEQUENCING DATA ANALYSIS …ayazagan.com/dataask/wp-content/uploads/2017/03/Report1...16S...


Hospital Microbiome Project

QIIME Analysis

Asli Yazağan ayazagan.com

Contents

16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME
  Report Overview
  How to Obtain Microbiome Data
  How to Setup QIIME
  Essential files for QIIME
    Sequence File (.fna)
    Quality File (.qual)
    Mapping File
  Basic Statistics on Sequence Data
  OTU Picking
  Basic Statistics on OTU Table
  OTU Heatmap
  Data Analysis
    Summarize Communities by Taxonomic Composition
    Investigating Alpha Diversity
    Identifying Differentially Abundant OTUs
    Normalizing OTU Table
    Beta-diversity and PCoA
    Jackknifed Beta Diversity Analysis
    Make Bootstrapped Tree
    Comparing Categories
  Conclusion
REFERENCES


Tables and Figures

Figure 1. FASTQ File Format
Figure 2. Mothur output for sequence summary
Figure 3. Summary for biom file
Figure 4. rep_set_tax_assignments.txt
Figure 5. Heatmap for HMP data
Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.
Figure 7. Area plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.
Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.
Figure 9. Microbial composition of the microbial taxa in 14 collected samples
Figure 10. Rarefaction plot for date_s
Figure 11. Rarefaction plot for sample_type_s
Figure 12. Diff_otus.txt for Computer Mouse and Countertop
Figure 13. MA plot for differential abundance of Computer Mouse and Countertop
Figure 14. Dispersion Estimate Plot for Differential Abundance of Computer Mouse and Countertop
Figure 15. MA plot for Computer Mouse Samples
Figure 16. Dispersion Estimate Plot for Computer Mouse Samples
Figure 17. PCoA plot for the bacterial community collected in the hospital room. Communities were characterized by samples collected in February and April. Bray-Curtis is used as the distance metric.
Figure 18. PCoA plot for the bacterial community collected in the hospital room.
Figure 19. 3D PCoA Plots for HMP samples
Figure 20. Distance Boxplot for Surface type
Figure 21. Distance Comparison among surface types
Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of bacterial communities based on 16S rRNA genes.


16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME

Report Overview

Rapid progress in DNA sequencing techniques has changed metagenomics research and data analysis over the past few years. Sequencing of the 16S rRNA gene has become a relatively easy way to study microbial composition and diversity (Fierer et al., 2007). High-throughput bioinformatics analyses increasingly rely on pipeline frameworks to process sequence data and metadata. Popular bioinformatics pipelines in the literature include QIIME, mothur and UPARSE. In this study, QIIME (Quantitative Insights Into Microbial Ecology) (Caporaso et al., 2010), an open-source bioinformatics pipeline, is used to perform microbiome analysis from raw DNA sequencing data. QIIME is designed to create quality graphics and statistics from raw sequencing data generated on Illumina or other platforms. A typical QIIME analysis workflow consists of demultiplexing, quality filtering, clustering (OTU detection), chimera removal, taxonomic assignment, phylogenetic reconstruction, diversity analyses and visualizations.

This document is organized as an introductory tutorial on how to analyze 16S sequencing data using current methods. Microbiome analysis starts from a few basic questions about the data. This tutorial covers the following questions:

1. Proportionally, what microbes are found in each sample community?

2. How many species are in each sample?

3. Are there species significantly more abundant in one set of samples than in another?

4. How much does diversity change between samples?

5. Do different sample groupings significantly differ in their microbial composition?


This document is structured around these questions: each section is primarily concerned with how to answer one particular question about the microbiome data.

How to Obtain Microbiome Data

The Sequence Read Archive (SRA) is a bioinformatics database that provides a public repository for DNA sequencing data obtained with next-generation sequencing (NGS) technologies. Raw sequence data and metadata can be searched and downloaded for downstream analysis.

Biotechnology companies such as 454, Ion Torrent, Illumina, SOLiD, Helicos and Complete Genomics provide a line of products and services for sequencing, genotyping and gene expression. Illumina is one of the successful companies whose technology has reduced the cost of sequencing a human genome to a reasonable price. Since Illumina will eventually be used for sequencing in this project, the SRA database was searched for 16S rRNA data produced on Illumina systems, and the Hospital Microbiome Project data were obtained from it. Every experiment in the SRA database has an accession code and metadata such as the study abstract, the experiment attributes and the owner of the data. The raw sequence data for an experiment can be downloaded in FASTA or FASTQ format using its accession code.

The Hospital Microbiome Project (HMP) (Shogan et al., 2013) aims to collect microbial samples from surfaces, air, staff and patients in the University of Chicago's new hospital pavilion. Sampling involves 10 patient rooms, 2 nursing stations, staff, water and air, both daily and weekly over a year, in order to better understand the factors that influence bacterial population development in health care environments.


As a preliminary exploration, a small data set from the HMP was analyzed. The data used were collected from seven different points (countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet and cold tap water faucet) in the same room (S10) at two time points (27/02/2013 and 17/04/2013).

How to Setup QIIME

QIIME is a software package of Python wrapper scripts that can be downloaded and used on Linux systems. It can also be run in a VirtualBox virtual machine on the Windows operating system. QIIME version 1.9.0 in VirtualBox on Windows was used for this report.

Essential files for QIIME

QIIME works with the FASTQ file format. A FASTQ file uses four lines per sequence. A typical sequence record in FASTQ format is laid out as follows:

Line 1 begins with a '@' character and is followed by a sequence identifier and

an optional description.

Line 2 is the raw sequence letters.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifier as Line 1.

Line 4 encodes the quality values for the sequence in Line 2, and must contain the same

number of symbols as letters in the sequence.
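As an illustration of the four-line layout described above, the following minimal Python sketch reads FASTQ records and checks that the quality line is as long as the sequence line. The function name read_fastq and the example read are hypothetical and are not part of QIIME; the sketch assumes that every record occupies exactly four lines.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("Line 1 must start with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("Line 3 must start with '+': %r" % plus)
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ for %s" % header)
        # Drop the leading '@' so only the identifier and description remain.
        yield header[1:], sequence, quality


# Made-up example record, for illustration only.
example = "@read1 example description\nACGT\n+\nIIII\n"
records = list(read_fastq(io.StringIO(example)))
print(records)
```

A real file can be passed in the same way by opening it in text mode instead of using io.StringIO.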

A FASTQ file holds both the sequence data and its quality data. QIIME provides the convert_fastaqual_fastq.py script to convert a FASTQ file into a .qual file with the quality scores and an .fna file with the sequence data.


convert_fastaqual_fastq.py -f seqs.fastq -c fastq_to_fastaqual

Figure 1. FASTQ File Format

Sequence File (.fna)

The sequence file holds the raw sequence data for each sequence. A typical sequence file in .fna format is laid out as follows:

Line 1 begins with a '>' character and is followed by an Accession Run Code.

Line 2 is the raw sequence letters.

Quality File (.qual)

The quality file holds the quality scores for each sequence. A typical quality file in .qual format is laid out as follows:

Line 1 begins with a '>' character and is followed by an Accession Run Code.

Line 2 is the quality scores.
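The relationship between a FASTQ record and the .fna and .qual layouts above can be sketched in Python as follows. This is not the implementation of convert_fastaqual_fastq.py: the helper name fastq_record_to_fna_qual is hypothetical, and the quality characters are assumed to use the Phred+33 encoding.

```python
def fastq_record_to_fna_qual(identifier, sequence, quality, offset=33):
    """Return the .fna text and the .qual text for one FASTQ record.

    Assumes Phred+offset ASCII encoding; offset=33 is an assumption.
    """
    # Each quality character encodes one integer score.
    scores = [ord(char) - offset for char in quality]
    fna_text = ">%s\n%s\n" % (identifier, sequence)
    qual_text = ">%s\n%s\n" % (identifier, " ".join(str(s) for s in scores))
    return fna_text, qual_text


# Made-up example record, for illustration only.
fna_text, qual_text = fastq_record_to_fna_qual("read1", "ACGT", "II5!")
print(fna_text, end="")
print(qual_text, end="")
```

Both outputs share the same header line, which is what lets downstream tools pair each sequence with its quality scores.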


Mapping File

QIIME requires a metadata mapping file for most analyses. The mapping file is created by the user and contains all of the information, categorical or numeric, about the samples that is needed to perform the data analysis. It can be prepared in Excel or in a text editor and must be tab-delimited. The mapping file is important because it links each sample identifier to its metadata. In a typical mapping file, each line describes one sample: it starts with the “SampleID”, followed by the “BarcodeSequence” used for the sample and the “LinkerPrimerSequence” used to amplify it, and ends with a description column. The first column must be “SampleID”. A sample ID may contain alphanumeric characters and periods but no underscores, and it should match the sample names used in the sequence headers of the FASTA files. Additional columns can hold any metadata that relates to the samples, as well as any further sample-specific information that may be useful to have at hand when considering outliers. The last column must be “Description”. In some circumstances, users may need a mapping file that does not contain barcodes and/or primers. In that case, the “BarcodeSequence” and “LinkerPrimerSequence” fields can be left empty.

To check whether the mapping file is in the right format, QIIME provides the validate_mapping_file.py script. This script tests for many problems in the mapping file and writes a “_corrected.txt” version of the mapping file to the output folder. If the BarcodeSequence and LinkerPrimerSequence fields are empty, barcode and primer testing needs to be disabled with the -p and -b parameters.

validate_mapping_file.py -m <mapping_filepath> -o <outputpath> -p -b
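The formatting rules for the mapping file can be illustrated with a small Python check. This is only a sketch of a few of the rules stated in this section, not a replacement for validate_mapping_file.py. The function name check_mapping and the example rows are hypothetical, and accepting a leading '#' on the SampleID header is an assumption.

```python
import csv
import io
import re

# Sample IDs may contain alphanumeric characters and periods only.
SAMPLE_ID_PATTERN = re.compile(r"^[A-Za-z0-9.]+$")


def check_mapping(text):
    """Return a list of problems found in tab-delimited mapping file text."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        return ["mapping file is empty"]
    problems = []
    header = rows[0]
    if header[0].lstrip("#") != "SampleID":
        problems.append("first column must be SampleID")
    if header[-1] != "Description":
        problems.append("last column must be Description")
    for row in rows[1:]:
        if len(row) != len(header):
            problems.append("wrong number of fields for %s" % row[0])
        if not SAMPLE_ID_PATTERN.match(row[0]):
            problems.append("invalid SampleID: %s" % row[0])
    return problems


# Made-up mapping files: barcode and primer fields are left empty.
good = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tSurface\tDescription\n"
    "S10.mouse.feb\t\t\tmouse\texample\n"
)
bad = "SampleID\tDescription\nS10_mouse\texample\n"
good_problems = check_mapping(good)
bad_problems = check_mapping(bad)
print(good_problems)
print(bad_problems)
```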


Basic Statistics on Sequence Data

The count_seqs.py -i <sequence_file.fna> script in QIIME counts the sequences and calculates the mean and standard deviation of the sequence lengths. Our file contained 220028 sequences in total, with a mean sequence length of 151 and a standard deviation of 0.
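The same three statistics can be computed with a few lines of Python, which may help to make clear what is being reported. The function name sequence_length_stats is hypothetical, the two example sequences are made up, and the use of the population standard deviation is an assumption about how the value is defined.

```python
import io
import statistics


def sequence_length_stats(handle):
    """Return (count, mean length, standard deviation) for FASTA-formatted input."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    # Population standard deviation (an assumption, see the text above).
    return len(lengths), statistics.mean(lengths), statistics.pstdev(lengths)


# Made-up example input, for illustration only.
example = ">a\nACGT\n>b\nACGTAC\n"
count, mean_length, std_length = sequence_length_stats(io.StringIO(example))
print(count, mean_length, std_length)
```

A standard deviation of 0, as in our data, simply means that every read has the same length.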

Mothur gives more detailed statistics such as the minimum, maximum, median and quartiles. Running the summary.seqs(fasta=<sequence_file.fna>) command displays the following screen and creates a summary output file.

Figure 2. Mothur output for sequence summary

OTU Picking

OTU picking is also called "clustering", because sequences that share at least a given threshold of identity are clustered together into an OTU. There are three different methods for OTU picking:

De novo clustering

Closed-reference

Open-reference


Which method to choose depends on what is known a priori about the microbial community. If the community under study is well characterized, the 16S databases contain many representatives and the closed-reference OTU picking strategy is suitable. The de novo method is suitable for discovering new species. The open-reference method combines the closed-reference and de novo methods and is the method recommended by the QIIME developers. It first clusters sequences against a database of 16S reference sequences called “greengenes”, and then applies de novo clustering to the sequences that are not similar to the reference sequences.
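The two-stage logic of open-reference picking can be sketched in Python as below. This is a toy illustration only and is far simpler than what QIIME, UCLUST or USEARCH actually do. The functions identity and open_reference_pick are hypothetical, identity is a naive position-by-position comparison that assumes equal-length sequences, and the example sequences and the 0.9 threshold are made up.

```python
def identity(a, b):
    """Fraction of matching positions; a toy measure for equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def open_reference_pick(reads, references, threshold=0.97):
    """Assign reads to reference OTUs first, then cluster the failures de novo."""
    otus = {}
    failures = []
    # Closed-reference step: keep the best reference hit if it passes the threshold.
    for read_id, seq in reads.items():
        best_ref = max(references, key=lambda ref_id: identity(seq, references[ref_id]))
        if identity(seq, references[best_ref]) >= threshold:
            otus.setdefault(best_ref, []).append(read_id)
        else:
            failures.append(read_id)
    # De novo step: greedy clustering of the reads that did not hit a reference.
    centroids = {}
    for read_id in failures:
        seq = reads[read_id]
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:
            # No existing cluster is similar enough, so this read seeds a new OTU.
            otu_id = "denovo%d" % len(centroids)
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return otus


# Made-up reference and reads, for illustration only.
references = {"ref1": "AAAAAAAAAA"}
reads = {
    "r1": "AAAAAAAAAA",
    "r2": "AAAAAAAAAC",
    "r3": "CCCCCCCCCC",
    "r4": "CCCCCCCCCG",
}
otus = open_reference_pick(reads, references, threshold=0.9)
print(otus)
```

Reads r1 and r2 are close enough to the reference to join its OTU, while r3 and r4 fail the reference step and form a new de novo cluster.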

Table 1. Which OTU picking strategy for which study?

OTU picking strategy | Script | In which study?
Closed reference | pick_closed_reference_otus.py | Human, mouse, gut, skin, oral microbiome
De novo | pick_de_novo_otus.py | Environmental (soil, water, etc.) and other poorly characterized microbiomes
Open reference | pick_open_reference_otus.py | Any microbiome study. The QIIME developers suggest this method.

The following table compares the advantages and disadvantages of the OTU picking strategies.

Table 2. Advantages and Disadvantages of OTU Picking Strategies

OTU picking strategy | Advantages | Disadvantages
Closed reference | Fast and parallelizable. Suitable for big datasets. Since it uses reference databases, it creates good-quality taxonomies and trees. | Not possible to find new species.
De novo | Clusters all sequences. | Cannot be parallelized, so it is slow for big datasets.
Open reference | Clusters all sequences. Part of the work is parallelized. Faster than de novo. | The part of the work that is not parallelizable is slow. It might take a very long time when many sequences belong to new species that are not in the reference databases.

The open-reference OTU picking strategy was used for our HMP data analysis, through QIIME's pick_open_reference_otus.py script. This script runs many substeps in a single call: it (1) picks OTUs, (2) generates a representative sequence for each OTU, (3) assigns known taxonomy to those OTUs, (4) creates a phylogenetic tree, and (5) creates an OTU table.

pick_open_reference_otus.py -i <sequence_file.fna> -r <97_otus.fasta> -o <outputpath> -s 0.1 -m <clustering algorithm> -p <parameter_file>

97_otus.fasta is the reference OTU file from Greengenes, the database of reference 16S sequences that is used to assign taxonomy. The 97_otus.fasta file is created by clustering all the sequences in the Greengenes database into clusters at 97% identity. A representative sequence is chosen from each of those clusters and used to create the 97_tree and 97_taxonomy. Sequences in our data are compared with the representative sequences in 97_otus.fasta, and the taxonomy of the most similar sequence is assigned to our sequence.

The default clustering algorithm for the pick_open_reference_otus.py script is UCLUST. Because USEARCH is widely used for OTU picking, USEARCH was used as the clustering algorithm for our data.

The parameter file was created by the user and contains the line “pick_otus:enable_rev_strand_match True”. This line is needed if most or all of the sequences fail to hit the reference during the prefiltering or closed-reference OTU picking steps, which can happen when the sequences are in the reverse orientation with respect to the reference database. The line addresses this problem, but it doubles the amount of memory used in the workflow.

An index.html file is created as a navigation page with an informative table of the output files. The important outputs of the script are the following four files:

rep_set.tre: the phylogenetic tree describing the relationship of all of our sequences.

rep_set.fna: the list of representative sequences, one for each OTU.

otu_table_mc2_w_tax.biom: the final OTU results, including taxonomic assignments and per-sample abundances, stored in a biom file. “mc2” refers to “minimum size 2”, which means that each OTU requires at least 2 sequences. This is the file most often used for deeper analysis.

final_otu_map_mc2.txt: the listing of which reads were clustered into which OTU.
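The effect of the “minimum size 2” filter can be illustrated with a short Python sketch. The function names parse_otu_map and filter_otu_map are hypothetical, the example reads are made up, and the assumed layout of the OTU map (one OTU per line, with the OTU identifier followed by tab-separated read identifiers) should be checked against the actual file.

```python
def parse_otu_map(text):
    """Parse OTU map text: one OTU per line, OTU id then tab-separated read ids."""
    otu_map = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields and fields[0]:
            otu_map[fields[0]] = fields[1:]
    return otu_map


def filter_otu_map(otu_map, min_size=2):
    """Keep only the OTUs that contain at least min_size reads."""
    return {otu_id: reads for otu_id, reads in otu_map.items() if len(reads) >= min_size}


# Made-up OTU map, for illustration only.
otu_map_text = "otu1\tread1\tread2\tread3\notu2\tread4\ndenovo0\tread5\tread6\n"
kept = filter_otu_map(parse_otu_map(otu_map_text), min_size=2)
print(kept)
```

In this example otu2 is dropped because it is a singleton, which is the kind of OTU the mc2 filter removes.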

Basic Statistics on OTU Table

The biom summarize-table -i <biom_file> -o <outputpath> command is available with QIIME and creates a summary of the OTU table.

Figure 3 shows the summary of the biom file. 9605 OTUs were picked. If the sequences in the representative sequence file rep_set.fna are counted, the same number should be displayed.

The assign_taxonomy.py -i <rep_set.fna> -o <taxonomyResults_outputpath> script is used to assign taxonomy to each OTU representative sequence. It creates the rep_set_tax_assignments.txt file, which contains an entry for each representative sequence, listing the taxonomy to the greatest depth allowed by the confidence threshold (80% by default; it can be changed with the -c option), and a column of confidence values for the deepest level of taxonomy shown.

Figure 3. Summary for biom file

Figure 4. rep_set_tax_assignments.txt

OTU Heatmap

The make_otu_heatmap.py -i <biom file> -o <heatmap.pdf> script creates a PDF file with a visualization of the OTU table. Each row corresponds to an OTU and each column corresponds to a sample. The higher the relative abundance of an OTU in a sample, the more intense the color at the corresponding position in the heatmap.


Figure 5. Heatmap for HMP data

Data Analysis

Summarize Communities by Taxonomic Composition

By looking at the relative abundances of taxa per sample in the OTU table, we can see which microbes are found in each sample community.

Question: Proportionally, what microbes are found in each sample community?

Scripts: summarize_taxa.py and plot_taxa_summary.py

Output: Plots showing relative abundance data per sample

The summarize_taxa.py -i <biom file> -o <taxaSummary_outputpath> script generates text files with relative abundance data per sample, which give a basic overview of the members of the community at each taxonomic rank. The taxonomic ranks to summarize can be specified with the -L parameter of the script (1 for kingdom, 2 for phylum, 3 for class, 4 for order, 5 for family, 6 for genus, 7 for species). The output text files can be passed to the plot_taxa_summary.py script to create plots with the following command:

plot_taxa_summary.py -i <taxaSummary_outputpath/otu_table_w_tax.txt> -l <taxonomic rank> -c pie,bar,area -o <taxsCharts_outputpath>
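The idea behind summarizing by taxonomic rank can be sketched in Python as follows: OTU counts are summed per taxon at the chosen level and then divided by the total count of each sample. This is an illustration under stated assumptions, not the code of summarize_taxa.py. The function name summarize_taxa, the dictionary layout and the example counts are hypothetical, and the taxonomy strings are assumed to be semicolon-separated in Greengenes style.

```python
from collections import defaultdict


def summarize_taxa(otu_counts, taxonomy, level):
    """Collapse OTU counts to a taxonomic level and convert to relative abundance.

    otu_counts: {otu_id: {sample_id: count}}
    taxonomy:   {otu_id: "k__Bacteria; p__Firmicutes; ..."}
    level:      1 for kingdom, 2 for phylum, and so on.
    """
    totals = defaultdict(float)
    collapsed = defaultdict(lambda: defaultdict(float))
    for otu_id, per_sample in otu_counts.items():
        ranks = [rank.strip() for rank in taxonomy[otu_id].split(";")]
        taxon = ";".join(ranks[:level])
        for sample_id, count in per_sample.items():
            collapsed[taxon][sample_id] += count
            totals[sample_id] += count
    # Divide each collapsed count by the total count of its sample.
    return {
        taxon: {s: c / totals[s] for s, c in per_sample.items() if totals[s] > 0}
        for taxon, per_sample in collapsed.items()
    }


# Made-up counts and taxonomy, for illustration only.
otu_counts = {
    "otu1": {"S1": 6, "S2": 1},
    "otu2": {"S1": 2, "S2": 1},
    "otu3": {"S1": 2, "S2": 2},
}
taxonomy = {
    "otu1": "k__Bacteria; p__Firmicutes",
    "otu2": "k__Bacteria; p__Firmicutes",
    "otu3": "k__Bacteria; p__Actinobacteria",
}
summary = summarize_taxa(otu_counts, taxonomy, level=2)
print(summary)
```

The relative abundances within each sample sum to 1, which is what makes the pie, bar and area plots comparable across samples with different read counts.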

The following pie plot shows the total relative abundance for all data.

Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.

The following area and bar plots show the relative abundance of taxa for each sample.


Figure 7. Area plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.

Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 samples collected from 7 different points at a four-month interval in a hospital room.

The following table shows the microbial composition of each sample at the two time points at phylum level. From the plots, the taxa on the computer mouse, countertop and tap water faucet handles appear to change more between the two time points. On the other hand, those samples show similar taxa proportions within the same time point. This might be because those locations were used by the same person at a given time point, and the person using them had changed by the second time point, which modified the microbial abundance of taxa in the samples from the second time point.

[Figure panels, one per sample, for February and April: Corridor Floor, Computer Mouse, Countertop, Station Phone, Chair Armrest, Cold Tap Water Faucet Handle, Hot Tap Water Faucet Handle]

Figure 9. Microbial composition of the microbial taxa in 14 collected samples

Investigating Alpha Diversity

Alpha diversity describes the diversity of species within a single sample or environment.

Question: How many species are in each sample?

Script: alpha_rarefaction.py -i <biom file> -o <alphaDiversity_outputpath> -p <parameters.txt> -m <mapping file>

Output: Rarefaction plots.

This script performs several steps: it (1) generates rarefied OTU tables; (2) computes alpha diversity metrics for each rarefied OTU table; (3) collates the alpha diversity results; and (4) generates alpha rarefaction plots. Measured alpha diversity increases with sequencing depth, so rarefaction plots are useful for comparing alpha diversity between two or more samples that may have unequal sequencing depth. The plot shows the alpha diversity value against the number of sequences included. To build rarefaction curves, each community is randomly subsampled without replacement at different intervals, and the average number of OTUs at each interval is plotted against the size of the subsample.
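The subsampling procedure described above can be sketched in Python for a single sample. This is a simplified illustration rather than the code of alpha_rarefaction.py. The function names rarefy and rarefaction_curve, the example counts, the depths and the number of iterations are all hypothetical choices.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement; return the observed OTU count."""
    # Expand the counts into one entry per read, labelled by its OTU.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    return len(set(rng.sample(pool, depth)))


def rarefaction_curve(counts, depths, iterations=10, seed=0):
    """Return (depth, average observed OTUs) pairs for one sample."""
    rng = random.Random(seed)
    return [
        (depth, sum(rarefy(counts, depth, rng) for _ in range(iterations)) / iterations)
        for depth in depths
    ]


# Made-up OTU counts for one sample, for illustration only.
counts = {"otu1": 50, "otu2": 30, "otu3": 15, "otu4": 5}
curve = rarefaction_curve(counts, depths=[1, 10, 50, 100])
print(curve)
```

Plotting the second value of each pair against the first gives one rarefaction curve. A curve that flattens out suggests that deeper sequencing would add few new OTUs.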

The alpha diversity metrics are listed in a text file that serves as the parameter file. observed_species, shannon and chao1 are commonly used alpha diversity metrics. observed_species is the number of OTUs identified per sample. Shannon diversity is a measure of entropy, and chao1 is a measure that predicts OTU richness at high sequencing depth. The command echo 'alpha_diversity:metrics observed_species,shannon,chao1' > parameters.txt creates the parameters.txt file.
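To make the three metrics concrete, the sketch below computes them in Python from a list of per-OTU counts for one sample. It is an illustration, not QIIME's implementation. The function names are hypothetical, the example counts are made up, the base-2 logarithm for Shannon entropy is an assumption, and chao1 uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs.

```python
import math


def observed_species(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon entropy of the OTU proportions, in bits (base 2 is an assumption)."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate."""
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed_species(counts) + singletons * (singletons - 1) / (2.0 * (doubletons + 1))


# Made-up counts, for illustration only.
even_sample = [4, 4, 0, 0]
rare_sample = [10, 5, 1, 1, 2]
print(observed_species(even_sample), shannon(even_sample), chao1(rare_sample))
```

Two equally abundant OTUs give a Shannon value of 1 bit, and the singletons in the second sample push the chao1 estimate above its observed count of 5.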

After running the script on our data, an HTML page with rarefaction plots was created.

Figure 10. Rarefaction plot for date_s


Figure 11. Rarefaction plot for sample_type_s

Identifying Differentially Abundant OTUs

Question: Are there species significantly more abundant in one set of samples than in another? Which microbes are significantly different between two sample groupings? Do specific groups of samples differ in their microbial composition?

Script: differential_abundance.py -i <biom file> -o <output.txt> -m <mapping file> -a DESeq2_nbinom -c <mapping category> -x <subcategory 1> -y <subcategory 2> -d

Output: A text file listing the differentially abundant OTUs with their statistics, and an MA plot.

OTU differential abundance testing identifies OTUs that differ between two mapping file sample categories, which are passed to the script with -x and -y. The -a option selects the differential abundance method. DESeq2_nbinom and metagenomeSeq_fitZIG are the differential abundance algorithms available in QIIME (Paulson, Stine, Bravo, & Pop, 2013).

The -d option creates an MA plot, which shows the relationship between intensity and difference for two data sets. The x-axis represents the average quantitated value across the two data sets, and the y-axis shows the difference between them. The option also creates a dispersion estimate plot that visualizes the fitted dispersion versus mean relationship.
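The two axes of an MA plot can be illustrated with a short Python sketch. It uses the classic definition (A is the mean of the log2 values, M is the log2 ratio) with a pseudocount of 1 to avoid taking the log of zero; the values DESeq2 actually plots are normalized differently, so treat the function `ma_values` and its inputs as hypothetical.

```python
import math

def ma_values(group_x, group_y, pseudocount=1.0):
    """Per-OTU (A, M) pairs: A = mean log2 abundance, M = log2 ratio of y to x."""
    points = []
    for x, y in zip(group_x, group_y):
        log_x = math.log2(x + pseudocount)
        log_y = math.log2(y + pseudocount)
        points.append(((log_x + log_y) / 2, log_y - log_x))
    return points

# Hypothetical mean counts per OTU in the -x group and the -y group.
mouse = [3, 120, 40]
countertop = [15, 118, 5]
for a, m in ma_values(mouse, countertop):
    print(round(a, 2), round(m, 2))
```

OTUs with M near zero have similar abundance in both groups; points far above or below zero are candidates for differential abundance.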

In order to see whether any OTUs are significantly more abundant in the countertop samples than in the computer mouse samples, "countertop" was passed as the -y option and "computer mouse" as the -x option. According to the output text file, members of Actinobacteria are significantly more abundant in the countertop samples.

Figure 12. Diff_otus.txt for Computer Mouse and Countertop

Figure 13. MA plot for differential abundance of Computer Mouse and Countertop

Figure 14. Dispersion Estimate Plot for differential abundance of Computer Mouse and Countertop

The pie charts of taxon abundance for the computer mouse samples taken in February and April looked visibly different. As an experiment, the differential abundance script was run on those samples; Figures 15 and 16 show the resulting MA plot and dispersion estimate plot.

Figure 15. MA plot for Computer Mouse Samples.

Figure 16. Dispersion Estimate Plot for Computer Mouse Samples

Normalizing OTU Table

When analyzing microbial data, uneven sequencing depth can lead to biased results. Having a different number of sequences for each sample causes inaccurate results in beta diversity analyses.

Question: How can bias resulting from uneven sequencing depth be prevented?

Script: normalize_table.py -i <biom file> -a CSS -o <normalized biom file>

Output: BIOM table with normalized counts. This table is used as the input BIOM file for the beta diversity script.

The -a option determines the normalization algorithm to apply to the input BIOM table. The default algorithm is CSS. CSS stands for "cumulative sum scaling", a normalization that is an adaptive extension of the quantile normalization approach and is better suited to marker gene survey data: raw counts are divided by the cumulative sum of counts up to a percentile determined using a data-driven approach (Paulson, Stine, Bravo, & Pop, 2013). DESeq2 is another normalization option. DESeq2 outputs negative values for low-abundance OTUs as a result of its log transformation and discards low-depth samples (e.g. fewer than 1000 sequences/sample). The negative values present a problem for the Bray-Curtis and UniFrac metrics, which are commonly used to calculate ecological distance. There is not a good solution yet, but CSS is the currently recommended normalization algorithm.
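The idea behind CSS can be sketched for a single sample as follows. This is a simplified illustration, not the metagenomeSeq implementation: the percentile is fixed at 0.5 here, whereas the real method chooses it from the data, and the scaling constant of 1000 and the function name `css_normalize` are assumptions.

```python
import math

def css_normalize(sample_counts, percentile=0.5, scale=1000):
    """Scale one sample's counts by the sum of counts at or below a percentile."""
    nonzero = sorted(c for c in sample_counts if c > 0)
    if not nonzero:
        return [0.0 for _ in sample_counts]
    # Nearest-rank percentile of the nonzero counts (fixed here by assumption).
    rank = max(1, math.ceil(percentile * len(nonzero)))
    threshold = nonzero[rank - 1]
    # Cumulative sum of the counts that do not exceed the threshold.
    scaling_sum = sum(c for c in nonzero if c <= threshold)
    return [c / scaling_sum * scale for c in sample_counts]

sample = [1, 2, 3, 100, 0]  # hypothetical counts; one OTU dominates
print([round(v, 1) for v in css_normalize(sample)])
```

Because the scaling factor ignores counts above the threshold, a single dominant OTU does not shrink the normalized values of all the other OTUs, which is what happens when dividing by the total count.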

Beta-diversity and PCoA

In microbiome research it is important to analyze how different each sample is from all the others. It is equally important to know whether any grouping of samples is more similar in composition than average. Beta diversity is a measure of diversity that describes how much the species composition differs between samples.

Question: How much does diversity change between samples?

Script: beta_diversity.py, principal_coordinates.py, make_2d_plots.py

Output: Distance matrix and Principal Coordinates plots

Mathematical and phylogenetic metrics can be used to measure the difference between two samples. Two metrics commonly used in microbiome studies are bray_curtis and unweighted_unifrac.

beta_diversity.py -i <normalized biom file> -m <distance metric> -o <beta_div_output_path> -t <rep_set.tre>

The output of the command is a distance matrix that defines the distance between every pair of samples. I used the Bray-Curtis metric to calculate distance. This matrix can be visualized in a Principal Coordinates Analysis (PCoA) plot.
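For readers who want to see what such a distance matrix contains, the following Python sketch computes Bray-Curtis dissimilarity, the sum of absolute count differences divided by the sum of all counts, for every pair of samples. The sample names and counts are hypothetical, and the helper `distance_matrix` is an illustration rather than QIIME's own code.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two OTU count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

def distance_matrix(samples):
    """Return sorted sample names and the square matrix of pairwise distances."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix

samples = {  # hypothetical OTU counts per sample
    "countertop": [6, 4, 0],
    "mouse": [4, 6, 0],
    "floor": [0, 0, 10],
}
names, matrix = distance_matrix(samples)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

A value of 0 means two samples have identical counts, and a value of 1 means they share no OTUs.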

principal_coordinates.py -i <beta_div_output_path>/<metric_normalized_otu_table.txt> -o <beta_div_coords.txt>

make_2d_plots.py -i <beta_div_coords.txt> -m <mapping file>
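The core of principal coordinates analysis can be sketched without external libraries: square the distances, double-center the resulting matrix, and take its leading eigenvectors scaled by the square roots of their eigenvalues. The sketch below recovers only the first axis, using power iteration, and assumes that the eigenvalue of largest magnitude is positive (which can fail for strongly non-Euclidean distances). It is an illustration of the technique, not what principal_coordinates.py executes.

```python
import math

def first_principal_coordinate(dist, iterations=200):
    """Coordinates of each sample on the first PCoA axis."""
    n = len(dist)
    # Gower double-centering of the matrix of -0.5 * squared distances.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    # Power iteration for the leading eigenvector of the centered matrix.
    vec = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break  # all samples coincide; every coordinate is zero
        vec = [x / norm for x in product]
        eigenvalue = norm
    return [x * math.sqrt(eigenvalue) for x in vec]

# Three hypothetical samples lying on a line at positions 0, 1 and 3.
dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print([round(c, 3) for c in first_principal_coordinate(dist)])
```

For distances that come from points on a line, the first axis reproduces the original spacing up to a shift and a sign flip, which is a convenient sanity check.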

The resulting PCoA plots are shown in the following figures. Figure 17 shows how microbial community similarity changes between the two sample collection dates; the overall community appears to have largely changed between the two time points. Figure 18 shows the microbial community similarity among sample types. The computer mouse, countertop, station phone, and chair armrest samples cluster together, meaning that they have similar microbial communities. However, these computer mouse and countertop samples were collected in February, whereas the station phone and chair armrest samples were collected in April. The pie charts for these samples are also very similar. In contrast, the pie charts show very different compositions for the computer mouse and countertop samples at the two time points. This can also be seen in the PCoA plots: for example, the two purple circles lie far apart on the PC1-PC2 and PC1-PC3 plots in Figure 18.

Figure 17. PCoA plot for the bacterial community collected in the hospital room. Communities are characterized by collection date (legend: April, February). Bray-Curtis is used as the distance metric.

Figure 18. PCoA plot for the bacterial community collected in the hospital room. Communities are characterized by the type of sample collected. Bray-Curtis is used as the distance metric.

Jackknifed Beta Diversity Analysis

Question: How can samples be compared to each other?

Script: jackknifed_beta_diversity.py -i <biom file> -t <rep_set.tre> -m <mapping file> -o <Jackknife_Output folder> -e <rarefaction_depth>

Output: 3D PCoA plots with Emperor

This script does the following steps:

i. Compute a beta diversity distance matrix from the full data set

ii. Perform multiple rarefactions at a single depth (-e option is to change the

rarefaction depth)

iii. Compute distance matrices for all the rarefied OTU tables

iv. Build UPGMA trees for all the rarefactions

v. Compare all the trees to get consensus and support values for branching

vi. Perform principal coordinates analysis on all the rarefied distance matrices

vii. Generate plots of the principal coordinates
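Steps ii and iii above, repeated rarefaction at a single depth followed by a distance calculation on each rarefied table, can be sketched as follows for one pair of samples. Bray-Curtis is used here only because it needs no phylogenetic tree; it stands in for the UniFrac metrics, and the function names, depth, and counts are hypothetical.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample."""
    reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
    rarefied = [0] * len(counts)
    for otu in rng.sample(reads, depth):
        rarefied[otu] += 1
    return rarefied

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity (stand-in for a UniFrac metric)."""
    denominator = sum(a + b for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator if denominator else 0.0

def jackknifed_distances(sample_a, sample_b, depth, replicates=10, seed=0):
    """One distance per rarefaction replicate for a pair of samples."""
    rng = random.Random(seed)
    return [
        bray_curtis(rarefy(sample_a, depth, rng), rarefy(sample_b, depth, rng))
        for _ in range(replicates)
    ]

# Hypothetical OTU counts for two samples, rarefied to 20 reads each.
print([round(d, 2) for d in jackknifed_distances([50, 30, 20], [10, 10, 80], depth=20)])
```

The spread of the replicate distances indicates how sensitive a comparison is to the particular reads that happened to be sampled, which is what the jackknife support values summarize.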


Emperor is an interactive next-generation tool for the analysis, visualization, and interpretation of high-throughput microbial ecology datasets (Vázquez-Baeza, Pirrung, Gonzalez, & Knight, 2013). After running the script, three sub-folders are created for each distance metric, along with 3D PCoA plots. The unweighted_unifrac/emperor_pcoa_plot folder contains an HTML file with the 3D PCoA plots shown in Figure 19. Each point represents one of the samples, and distances between samples were calculated using unweighted UniFrac. Samples that lie close to each other have communities with very similar overall phylogenetic trees.

Figure 19. 3D PCoA Plots for HMP samples

The jackknife analysis created a large collection of distance matrices on which statistics can be computed.

Question: How can the distance matrices be analyzed?

Script: dissimilarity_mtx_stats.py -i <Jackknife_Output folder/unweighted_unifrac/rare_dm> -o <stat_output_folder>

Output: Three files, means.txt, medians.txt, and stdevs.txt, containing the mean, median, and standard deviation of the distance between each pair of samples.
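Conceptually, this step summarizes each cell of the distance matrix across all rarefaction replicates. The Python sketch below shows the idea with the standard library; the function `matrix_stats` is hypothetical, and the use of the population standard deviation is an assumption rather than a documented detail of the script.

```python
import statistics

def matrix_stats(matrices):
    """Element-wise mean, median and standard deviation across distance matrices."""
    n = len(matrices[0])

    def summarize(func):
        return [[func([m[i][j] for m in matrices]) for j in range(n)] for i in range(n)]

    return {
        "means": summarize(statistics.mean),
        "medians": summarize(statistics.median),
        "stdevs": summarize(statistics.pstdev),  # population form, by assumption
    }

# Three hypothetical 2x2 distance matrices from three rarefaction replicates.
replicates = [
    [[0.0, 0.2], [0.2, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
    [[0.0, 0.6], [0.6, 0.0]],
]
stats = matrix_stats(replicates)
print(round(stats["means"][0][1], 2), round(stats["stdevs"][0][1], 3))
```

The matrix of means can then be used wherever a single distance matrix is expected, for example as input to box plots of within-group and between-group distances.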


Question: Are the samples in an individual category closer to each other than they are to samples outside the category?

Script: make_distance_boxplots.py -m <mapping file> -o <BoxPlot_Output_Folder> -d stat_output_folder/means.txt -f <category> --save_raw_data

Output: Box plot as a PDF file

In Figure 20, the first and second boxplots represent all within-category distances and all between-category distances, respectively.
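The grouping behind those first two boxplots can be illustrated with a short Python sketch that splits every pairwise distance into a within-category list or a between-category list. The sample names, category labels, and distances below are hypothetical.

```python
from itertools import combinations

def split_distances(names, dist, groups):
    """Separate pairwise distances into within-group and between-group lists."""
    within, between = [], []
    for i, j in combinations(range(len(names)), 2):
        same_group = groups[names[i]] == groups[names[j]]
        (within if same_group else between).append(dist[i][j])
    return within, between

names = ["s1", "s2", "s3"]
groups = {"s1": "countertop", "s2": "countertop", "s3": "floor"}  # hypothetical
dist = [[0.0, 0.1, 0.8],
        [0.1, 0.0, 0.7],
        [0.8, 0.7, 0.0]]
within, between = split_distances(names, dist, groups)
print(within, between)
```

If the category structures the communities, the within-group distances should tend to be smaller than the between-group distances.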

Figure 20. Distance Boxplot for Surface type

Question: How can samples grouped by the different states of a mapping file field be compared?

Script: make_distance_comparison_plots.py -m <mapping file> -d <unweighted_unifrac_otu_table.txt> -f <category from mapping file> -c <comparison_groups> -o <output_folder> -a <label_type> -t <plot_type>

Output: Distance comparison plot

Figure 21 shows boxplots that allow comparison among surface types. Countertop, Corridor Floor, and Station Phone were taken as comparison groups and were compared with the other surface types.

Figure 21. Distance Comparison among surface types

Make Bootstrapped Tree

Question: How can a bootstrapped tree be made?

Script: make_bootstrapped_tree.py -m <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/master_tree.tre> -s <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/jackknife_support.txt> -o <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/Tree.pdf>

Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of

bacterial communities based on 16S rRNA genes.

Comparing Categories

In the HMP data, seven different points in a room were sampled: countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet, and cold tap water faucet. Graphs reveal visually how the microbial composition of one sample differs from that of other samples, but statistical support is needed.

To generate statistical support for hypotheses, the adonis and ANOSIM (analysis of similarities) statistical tests can be used. Adonis (a form of PERMANOVA) is a nonparametric statistical method that takes a beta diversity distance matrix, a mapping file, and a category in the mapping file from which to determine the sample grouping. It computes an R2 value (effect size), which shows the percentage of variation explained by the supplied mapping file category, as well as a p-value to determine the statistical significance. ANOSIM is a method that tests whether two or more groups of samples are significantly different. ANOSIM only works with a categorical variable, which is used to do the grouping.
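As an illustration of how ANOSIM works, the sketch below ranks all pairwise distances, compares the mean rank of between-group pairs with that of within-group pairs to obtain the R statistic, and estimates a p-value by shuffling the group labels. It is a minimal version written for this report, not the implementation QIIME calls; the function names, the number of permutations, and the example data are hypothetical, and it assumes every labelling contains at least one within-group and one between-group pair.

```python
import random
from itertools import combinations

def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in order[start:end + 1]:
            ranks[k] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def anosim_r(dist, labels):
    """ANOSIM R: (mean between rank - mean within rank) / (n(n-1)/4)."""
    pairs = list(combinations(range(len(labels)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    n = len(labels)
    return (sum(between) / len(between) - sum(within) / len(within)) / (n * (n - 1) / 4)

def anosim(dist, labels, permutations=999, seed=0):
    """Return the observed R and a permutation p-value."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical distances: two tight groups that are far from each other.
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
print(anosim(dist, ["Feb", "Feb", "Apr", "Apr"], permutations=99))
```

An R close to 1 means that samples within a group are more similar to each other than to samples from other groups, while an R near 0 means the grouping explains little. With only a handful of samples the permutation p-value cannot become very small, which is worth remembering when interpreting results from small data sets.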

Question: Are the differences between samples grouped by a parameter in the mapping file (e.g. sample type) statistically significant?

Script 1: compare_categories.py --method adonis -i <metric_normalized_otu_table.txt> -m <mapping file> -c <comparingCategory> -o <adonis_out_folder>

Script 2: compare_categories.py --method anosim -i <metric_normalized_otu_table.txt> -m <mapping file> -c <comparingCategory> -o <anosim_out_folder>

Output: p-value and R2 value. The p-value indicates the statistical significance of the grouping of samples by the parameter. The R2 value indicates the percentage of variation in distances that is explained by the grouping.
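The R2 value can be understood as one minus the ratio of the within-group sum of squares to the total sum of squares, both computed directly from the distance matrix. The Python sketch below follows the usual PERMANOVA partitioning for a single categorical factor; it is an illustration under that assumption, with a hypothetical function name and data, and it omits the permutation test that yields the p-value.

```python
from itertools import combinations

def permanova_r2(dist, labels):
    """Fraction of distance-based variation explained by the grouping."""
    n = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for group in set(labels):
        members = [k for k in range(n) if labels[k] == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    return 1 - ss_within / ss_total

# Hypothetical distances: two tight groups that are far from each other.
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
print(round(permanova_r2(dist, ["Feb", "Feb", "Apr", "Apr"]), 3))
```

An R2 near 1 means the grouping accounts for almost all of the variation in distances; values near 0 mean the groups overlap heavily.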

The adonis and ANOSIM statistical tests were applied to the "sample_type_s" and "date_s" categories in the HMP data. Samples grouped by date_s or by sample_type_s did not differ significantly in microbial composition (p = 0.2, p = 0.58).

Conclusion

As a preliminary exploration, a small data set from the HMP was analyzed. The data were collected from seven different points (countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet, and cold tap water faucet) in the same room (S10) at two different time points (27/02/2013 and 17/04/2014). For each sample, the number and kinds of microbes found, the change in diversity between samples, and the comparison of microbial composition among sample groupings were investigated using the QIIME pipeline. Moreover, significant changes in abundance among samples were investigated. Visualization and statistical tools were used to draw conclusions.

REFERENCES

Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., …

Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data.

Nature Methods, 7(5), 335–6. https://doi.org/10.1038/nmeth.f.303

Fierer, N., Breitbart, M., Nulton, J., Salamon, P., Lozupone, C., Jones, R., … Jackson, R. B.

(2007). Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of

bacteria, archaea, fungi, and viruses in soil. Applied and Environmental Microbiology,

73(21), 7059–7066. https://doi.org/10.1128/AEM.00358-07


Paulson, J. N., Stine, O. C., Bravo, H. C., & Pop, M. (2013). Differential abundance analysis for

microbial marker-gene surveys. Nature Methods, 10(12), 1200–2.

https://doi.org/10.1038/nmeth.2658

Shogan, B. D., Smith, D. P., Packman, A. I., Kelley, S. T., Landon, E. M., Bhangar, S., …

Gilbert, J. (2013). The Hospital Microbiome Project: Meeting report for the 2nd Hospital

Microbiome Project, Chicago, USA, January 15(th), 2013. Standards in Genomic Sciences,

8(3), 571–9. https://doi.org/10.4056/sigs.4187859


Vázquez-Baeza, Y., Pirrung, M., Gonzalez, A., & Knight, R. (2013). EMPeror: a tool for

visualizing high-throughput microbial community data. GigaScience, 2(1), 16.

https://doi.org/10.1186/2047-217X-2-16