UNIVERSITY OF LISBON
FACULTY OF SCIENCES
DEPARTMENT OF ANIMAL BIOLOGY
METAGENOMIC ANALYSIS OF
MARIANA TRENCH SEDIMENT
SAMPLES
Vera Maria Leal Carvalho
Dissertation for the
MASTER'S IN BIOINFORMATICS AND COMPUTATIONAL
BIOLOGY
SPECIALISATION IN BIOINFORMATICS
2013
Dissertation supervised by Professor Francisco Couto (DI-FCUL) and by
postdoctoral fellow Martin Asser Hansen (MME-KU)
Thesis conducted at the Molecular Microbial Ecology Group of the
Department of Biology of the Faculty of Science of the University of
Copenhagen
MSc IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
2013
Acknowledgements
First of all, I would like to thank my external supervisor at the Molecular Microbial
Ecology Group of the University of Copenhagen, Doctor Martin Asser Hansen, who,
undoubtedly, was the person who contributed the most to the success of this project.
He taught me some things but, most importantly, he pushed me to learn by myself
everything else. He was a source of inspiration and support, and the hours of patient
discussion were invaluable. Moreover, he introduced me to the group, and made me
feel welcome since day one. I feel extremely happy that I got the opportunity to have
him as a supervisor.
Secondly, I would like to thank my supervisor at the Faculty of Science of the University
of Lisbon, Professor Francisco Couto, for agreeing to supervise me and for regularly
checking that my work was progressing. I’d also like to thank him for allowing me to do a
short collaboration within the project EPIWORK, which laid the groundwork for my research
career.
I should thank Professor Søren Sørensen for having me in his group, and Assistant
Professor Waleed Abu Al-Soud for integrating me in this project. In addition, I want to
thank Associate Professor Lars H. Hansen for the times when I needed some direction,
and he immediately engaged in a session of brainstorming, as well as Lea Skov
Hansen who never failed to help me when I asked. Also, I’d like to state my sincere
gratitude to the Associate Professor Emeritus Annelise Kjøller for reviewing the
manuscript.
But since work is not just about working, I want to thank the people at MME who made
my working hours and coffee breaks enjoyable. Working in such a great
atmosphere was a wonderful experience! I especially want to thank Tue, Lea, Analia,
Stefan, Michael, Peter, Lars B. and Witold.
I also want to thank my roommates Chris and Emil for all the fun and hyggeligt times.
I shall also thank all of my friends and family in Portugal who have always been
present, supporting me, giving me love and care, and reminding me every day that
distance meant simply a plane ride…nothing else.
However, there were some people who had a direct effect on the outcome of this
thesis, and therefore I feel their names should be clearly stated. First of all, Gil, for
keeping me company every day; for sharing the tiny details of our days; for discussing
silly and more serious thoughts. Ricardo Purificação, for bugging me the whole year to
write the Introduction. Carolina, Mogli, Anas and João, for maintaining my insanity with
our group conversation that is about… (What is it about?) And finally, Natalia Cięciwa
and Pascaline Serra, for keeping track of me all these months, for the love they shared
with me, for their friendship, for not giving up.
Most of all, I want to thank Inês and Luísa, for they made me feel at home in the cold
kingdoms of the north: they took care of me, they gave me colo, they told me I would
be alright. They became the people I want to be around when I’m sad, or when
I’m happy! They became special to me… Somehow, these two girls overcame the
barrier that I had in me when I arrived in Copenhagen, and found their way to my heart.
So… if to write a thesis you need a certain state of mind, a certain focus, a certain
equilibrium, they were certainly the people who made sure during this year that I had
them. And as Luísa said at the airport, when we were running for security, there are
certain people in life that you don’t need to know for long to know that they are
going to stay in your life for a long time.
Finally, I would like to thank the NSF, the Moore Foundation and the European
BIOTRIANGLE project for supporting the Course on Marine Bioinformatics “Marine
Omics” at the University of Delaware, which I was fortunate to be selected to attend, and
which was definitely a turning point for the writing of this dissertation.
This work is dedicated to my brother Filipe, to my Avó Fernanda and to my parents.
But mainly to my parents, since without them, I wouldn’t have had the opportunity to go
to Denmark in the first place. And they keep arguing that they are doing what they are
supposed to, but they are not – they are doing more, and they should know that I am
extremely grateful, that I am not forgetting what they are giving me, and that I love
them.
So long, and thanks for all the fish!
Summary
The Challenger Deep in the Mariana Trench is one of the most extreme environments on
the face of the Earth. The combination of a low temperature of 2.5 ºC and a pressure of
almost 112 MPa, due to the 11 km water column, makes it unique and, for that reason, an
object of human curiosity. However, traditional laboratory methods for culturing microbes
made it very difficult to obtain a complete picture of the community inhabiting this
environment, since not all microorganisms can be cultured and the extreme conditions
are not simple to recreate.
In 1998, Jo Handelsman coined the term “Metagenomics” when attempting to study the
microflora as a whole, naming it the metagenome, rather than studying individual
organisms. Since then, Metagenomics has evolved to encompass the identification of
the genomic sequences of a community, as well as their functional and evolutionary
analysis.
A metagenomic analysis typically includes several steps, beginning with sampling,
followed by filtration (although this is optional, depending on the aim of the study) and
sequencing, and ending with sequence analysis and publication of the generated
data. This work dealt exclusively with the last two steps.
The aim of the work was not only to raise more questions, as is usual in metagenomic
analyses, but also to investigate which community inhabits this environment and to
explore some of its metabolic potential. However, with the publication of a study
describing the same samples, a further aim arose: to explore the results in order to
corroborate the finding that oxygen is consumed throughout the sediment.
The analysis followed the approach normally used in similar work. Of 8 initial samples,
corresponding to 5 cm intervals from the surface down to 40 cm of sediment depth, 7
were sequenced. Initially, the sequences were automatically pre-processed so that only
relevant and reliable information passed to the next stage. Specifically, the adaptors
used in sequencing were removed, as were sequences that were too short and
low-quality bases. For this, the tool collection “Biopieces” was used, which allows
commands to be organised into a pipeline in a simple and intuitive way.
Next, the sequences were assembled using the program IDBA-UD, in order to generate
longer sequences that could be annotated with a higher percentage identity and,
consequently, with greater confidence. Again, this step is optional, since assembling
sequences loses information about abundance. Before annotation, the sequences were
classified as coding or non-coding, and the former were then mapped against known
sequences in databases. Annotation was both taxonomic and functional. All steps
following assembly were carried out by the MG-RAST server; however, both assembled
sequences (“contigs”) and unassembled sequences (“reads”) were submitted, so as to
have information about abundance as well as solid information about particular
features of interest.
The results generated by MG-RAST clearly show that Betaproteobacteria dominate the
surface sample (0-5 cm), whereas in the remaining samples Gammaproteobacteria is the
most abundant class. Interestingly, while Gammaproteobacteria in samples 1 (0-5 cm)
and 2 (5-10 cm) are dominated by a single genus, from sample 3 (10-15 cm) to sample 8
(35-40 cm) the number of abundant genera increases. In terms of alpha diversity,
sample 1 has the highest value (430.83 species), compared with the others, which range
between 184.10 and 252.14 species. Beta diversity was calculated between all samples
using the “vegan” package for the statistical programming language R. It was
speculated that there might be a correlation between beta diversity and depth, but this
was not observed. To assess whether there was any relationship between sediment
depth and community composition, principal coordinates analysis (PCoA) was used.
These results did not confirm the hypothesis; however, when compared with samples
from other, quite different projects, the Mariana Trench samples clustered consistently,
showing that the community composition is characteristic of this environment. In
addition, a rarefaction curve was generated, which is used to check whether the
sequencing effort was sufficient to represent the whole community; given that the
curves of the 7 samples are approaching the asymptote, it could be concluded that the
results are reasonable.
In functional terms, the analysis focused on energy metabolism. Most of the reads
assigned to this metabolism mapped to oxidative phosphorylation, which is the last step
of aerobic respiration. Analysis of the contigs that mapped to the same pathway showed
more than 91% identity to sequences in the chosen database, which indicates that the
results are credible. Methane and nitrogen metabolism were also investigated and,
although less abundant, some enzymes involved in methanogenesis and in the nitrogen
cycle were identified in the contigs.
Finally, a general map of all the enzymes identified in the samples was generated using
the program iPath, which is based on the KEGG metabolic maps. It should be noted,
however, that this mapping can be erroneous, as became apparent when photosynthesis
was shown as present, which is highly improbable at a depth of 11 km. On investigation,
this turned out to be due to an ATPase that is present in both photosynthesis and
oxidative phosphorylation.
The results allow the conclusion that the oxygen consumption measured in the study
carried out by collaborators is indeed due to aerobic metabolism, even in the deepest
sediment layers. That study also predicted that the pronounced mineralisation
processes in this environment would be mediated by the microbial community, which is
consistent with the presence of enzymes involved in the nitrogen cycle. The dominance
of Gammaproteobacteria is shared by sediments in the Pacific Ocean at 4000 m depth,
as well as by sediments in the Pacific Arctic Ocean, which is likewise at low
temperatures. Curiously, the microflora of deep hydrothermal vents, at more than
310 ºC, is also dominated by Gammaproteobacteria.
This study showed that it is possible to investigate in detail the composition of the
bacterial community of extreme environments. However, this work could have been more
robust had there been replicates of the sampling units and more contextual data to
allow comparisons with other studies. In future, it would also be interesting to take
samples at various depths of the Challenger Deep in order to study how community
composition varies with depth.
Since this field is still quite young, the collection of available tools, although vast, is
still subject to improvement. The results presented here may therefore prove inaccurate
10 years from now. It is also likely that an alternative choice would not produce exactly
the same results. The product of this work is thus the result of the choice of tools and
their parameters, with all their inherent advantages and drawbacks.
Abstract
The emergence of Metagenomics allowed the study of the microbial community in the
deepest point on Earth: the Challenger Deep in the Mariana Trench. Its extreme
conditions, a water depth of almost 11 km, a temperature of 2.5 degrees Celsius and a
pressure of around 112 MPa, made it very difficult to perform a comprehensive study of
its microecology, given the previous dependency on culturing methods. This
metagenomic analysis included taxonomic identification and exploration of some of the
functional potential of the genomic sequences of the community, generated by the
Illumina next-generation sequencing technique, therefore bypassing the need for cloning.
Here we show that Proteobacteria clearly dominate this environment but that there is no
obvious correlation between the sediment depth and the community composition.
Moreover, the abundance of enzymes involved in oxidative phosphorylation in all
samples suggests aerobic activity within the sediment. This supports the finding that
there is oxygen consumption along the depth of the sediment. An extensive description
of all the data generated would have been prohibitive; however, as soon as the data
become available, they will be accessible to the public to search for features of interest.
Keywords: metagenomics, Mariana Trench, Challenger Deep, extreme environments,
Illumina, community structure, energy metabolism
Contents
Acknowledgements
Summary
Abstract
List of Figures
List of Tables
1. Introduction
1.1. Background
1.2. Metagenomic Analysis
1.3. Objective
1.4. Structure of the thesis
2. Methods
2.1. Sample collection, preparation and sequencing
2.2. Preliminary Analysis
2.3. Biopieces
2.4. Assembly
2.5. MG-RAST
3. Results
3.1. Taxonomic Hits Distribution
3.2. Functional Category Hits Distribution
4. Discussion
5. Conclusion
References
Appendix
List of Figures
Figure 1 - Challenger Deep location (11º 22.1'N 142º 25.8' E)
Figure 2 - Total number of metagenomics articles published since 1998
Figure 3 - Cleaning script
Figure 4 - order_pairs script
Figure 5 - Genome assembly strategies: Hamiltonian and Eulerian cycles[31].
Figure 6 - Taxonomic distribution of the reads at the domain level
Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)
Figure 8 - β-diversity barchart
Figure 9 - Rarefaction curve of annotated species richness
Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)
Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated Sludge; Green - Gut Microbiota
Figure 12 - Number of features in the reads of sample 7 annotated by the different databases
Figure 13 - Number of features in the contigs of sample 7 annotated by the different databases
Figure 14 - Oxidative Phosphorylation, pathway ko00190.
Figure 15 - Photosynthesis, pathway ko00195.
Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples.
Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8.
Figure 18 - Metabolic map of the seven samples
Figure 19 - Oxygen micro-profiles at 6,018 m water depth (a); and at Challenger Deep (b) [1].
Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level
Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1
Figure 22 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 2
Figure 23 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 3
Figure 24 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 5
Figure 25 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 6
Figure 26 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 7
Figure 27 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 8
Figure 28 - Beta diversity related to spatial distance
Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity
Figure 30 - Heatmap of the reads against the M5NR database at 90% identity
List of Tables
Table 1 - Percentage of reads removed with the cleaning
Table 2 - Analysis of the assembly with minimum contig size 200 bp.
Table 3 - Analysis of the assembly with minimum contig size 500 bp.
Table 4 - ID's of the contigs submitted to MG-RAST
Table 5 - α-diversity
Table 6 - Pairwise β-diversity
1. Introduction
1.1. Background
The Challenger Deep in the Mariana Trench is one of the most extreme environments
on Earth, with a depth of almost 11 km, a temperature of 2.5 degrees Celsius[1] and a
pressure of around 111.79 MPa, calculated assuming a mean sea water density of
1036 kg/m³[2] and a gravitational acceleration of 9.81 m/s²[3]. It is located at roughly
11º 22.1' N 142º 25.8' E [1] (Figure 1).
Figure 1 - Challenger Deep location (11º 22.1'N 142º 25.8' E)
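The pressure quoted above follows from the hydrostatic relation P = ρ·g·h. The short Python sketch below reproduces the figure from the stated density and gravity. It assumes a round water depth of 11,000 m (the text says only "almost 11 km") and ignores atmospheric pressure at the surface; the function name is illustrative only.

```python
def hydrostatic_pressure_mpa(depth_m, density_kg_m3=1036.0, gravity_m_s2=9.81):
    """Pressure of a water column, P = rho * g * h, returned in MPa."""
    pascals = density_kg_m3 * gravity_m_s2 * depth_m
    return pascals / 1e6  # 1 MPa = 1e6 Pa


if __name__ == "__main__":
    # Assumed depth of 11,000 m; prints 111.79 MPa
    print(f"{hydrostatic_pressure_mpa(11_000):.2f} MPa")
```

With these inputs the result matches the 111.79 MPa given in the text, which suggests that a depth of 11,000 m was used in the original calculation.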
It has been the subject of human curiosity for many years[4]; so far, however, there has
been no detailed study of its microecology. With the emergence of Metagenomics, it is
finally possible to unravel which organisms live in the deepest point on Earth, and what
they are doing.
It was in 1998 that the term "Metagenomics" was first used by Jo Handelsman [5] in an
effort to study the microflora as a unit, the metagenome, instead of addressing each
type of organism individually.
Previously, it was thought necessary to study the morphology, physiology and
pathogenic characters of a microorganism in order to classify it[6], but since Woese
pioneered the use of 16S rRNA sequences for classification in 1977[7], sequence
comparison has been widely used and accepted as valid for this purpose.
With the development of the sequencing technology, one can now take a sample
directly from the environment, extract its DNA, sequence it, and infer the microbial
composition of the sample, therefore overcoming the bottleneck of growing pure
cultures in the laboratory. This method enables the discovery of new forms of life that
are not cultivable, and the assessment of the genetic richness and diversity, as well as
the metabolic potential, of a community of organisms as a whole[8].
Metagenomic analysis can accordingly be defined as “the identification, and functional
and evolutionary analysis of the genomic sequences of a community of organisms”.[9]
Moreover, the paradigm that most of the microbial world was already known has given
way to the acknowledgement that there is still a lot to learn and explore[10]. Discovering
new forms of life in extreme environments can provide insights into a variety of topics,
such as the biogeochemical activities that occur in the ocean[11], and the impact that
human activity may have on them[12].
1.2. Metagenomic Analysis
To analyse a metagenome, several steps are typically involved, from the experimental
design to sharing the data[13]. Firstly, one has to obtain the samples. Ideally, true
replicates should be taken as well. Afterwards, one may filter the samples, to target a
(more-or-less) specific group of organisms [14].
The following step is sequencing. There are several technologies to sequence DNA,
each with its own advantages and weaknesses. The Mariana Trench sediment
samples were sequenced using Illumina’s paired-end assay. Its advantage is that it is
cheap and generates a large number of reads per run; however, the reads are very short
(50-250 bp), which can pose a problem for assembly and comparison, since it becomes
more difficult to assign a read unequivocally to a template[15].
Illumina’s technology consists of attaching random DNA fragments to a surface,
amplifying them to form clusters of the same sequence, and then using these clusters as
templates for repeated cycles of polymerase-directed single-base extension. Single-base
extension is ensured by using 3′-modified nucleotides labelled with a removable
fluorophore. After the identity of the incorporated nucleotide has been determined by
laser-induced excitation of the fluorophores, these are removed, together with the side
arm that prevents the incorporation of more than one nucleotide per cycle. The images
of the fluorescent signal are used to determine the sequence (each nucleotide is
attached to a fluorophore of a different colour) and its quality, defined as the likelihood
of each call being correct[16].
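Base-call quality of this kind is conventionally reported on the Phred scale, Q = -10·log10(p), where p is the estimated probability that the call is wrong. The text does not name the scale, so treating the scores as Phred values is an assumption here, and the helper names are hypothetical. The sketch below converts between the two representations.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        # Q10 -> 1 error in 10 calls, Q20 -> 1 in 100, Q30 -> 1 in 1000
        print(q, phred_to_error_prob(q))
```

On this scale, each increase of 10 in the score corresponds to a tenfold drop in the expected error rate.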
The paired-end option means that a fragment is sequenced from both ends, producing a
forward and a reverse read for each fragment, which is helpful for the assembly[17].
Assembly is the next step in the Metagenomic analysis pipeline, although it is
sometimes skipped. Its usefulness is debatable[18], given that the accuracy of the
assemblers is difficult to assess, since there is currently no microbial community with
known reference sequences to compare to[13].
The main problem with assembly is that it distorts abundance information, since
abundant fragments will be considered as belonging to the most abundant species,
when in reality they may be present in rare species[18]. Moreover, some fragments
may be incorrectly discarded as mistakes or repeats, or joined up in the wrong places
or orientations[19]. Nonetheless, if these setbacks are taken into account when doing
the analysis, then assembly can be advantageous, as it produces longer sequences
that can be annotated more easily and with less ambiguity.
Gene prediction and annotation usually follow. The former classifies the sequences as
coding or non-coding, and the latter tries to find homology between the coding
sequences and known sequences stored in databases. Once again, these methods
have their own flaws, mainly because they are based on models and therefore fail to
predict exceptions that can occur in the biological world.
Typically, the final step is to share the sequence data on public databases together
with the metadata. Contextual data is necessary to compare with other datasets,
essentially making the sequences useful for the database and the scientific community.
By complying with standard languages for metadata, such as MIMS, the data becomes
more accessible, as complex searches will retrieve more information[20].
The drawbacks surrounding metagenomic analysis are not at all surprising if one
considers that it is still a very young field. A quick search on Web of
Knowledge[21] for the total number of articles featuring the term “metagenome” or
“metagenomics” gives a very clear picture of how novel this field is, and how much
data has been produced (Figure 2).
As the field has grown in popularity, a multitude of tools has been developed, making
the choice of which one to use far from trivial. There is still no evident consensus
on which is the best tool for each step (not even for sequencing), so the errors in the
data are most likely directly related to the flaws in each method, which means that a
different set of methods will yield a different set of errors.
Figure 2 - Total number of metagenomics articles published since 1998
Given this explosion of data, an obvious question concerns its applicability. One
example is bioremediation[22]. The process of biodegradation encompasses several
metabolic pathways which, when considered on a community basis instead of an
individual basis, lead to a global understanding of what is essential and what is
superfluous, easing the design of such a system. Moreover, industry is
always in search of novel enzymes and processes[23].
Even so, metagenomics tends to be regarded as exploratory research, raising more
questions instead of addressing them. Accordingly, the aim of this project was not only
to answer some simple questions, but also to raise some more, and hopefully to
encourage further studies in this environment.
1.3. Objective
This project dealt solely with the analysis of the raw data output by sequencing. The
goal was to assess the taxonomical distribution of the community along the depth of
the sediment and to explore its metabolic potential, using the most adequate tools.
However, with the publication of an article [1] that included these sediment samples,
the focus turned to assessing whether the data generated by this analysis would
corroborate the published data, namely by confirming the O2 consumption throughout
the sediment depth.
[Figure 2: bar chart. x-axis: Year; y-axis: Total number of articles; series: Number of articles with the keyword metagenome* or metagenomic*. Bar labels as extracted: 1; 3; 4; 7; 19; 52; 110; 225; 383; 637; 1,046; 1,689; 6,538; 9,381; 13,106; 13,853.]
1.4. Structure of the thesis
This report is organised in five chapters. The Introduction presents background
information on the sampling site, as well as on the technology and pipeline typically
employed in this kind of study. The second chapter covers the methodology, with an
explanation of each method and its output. Chapter three presents selected results, and
chapter four discusses them critically. The last chapter contains the conclusion, with
final remarks and some future directions for similar studies.
2. Methods
Seven out of eight sequenced samples from different depths were analysed. Each
sample corresponded to a 5 cm depth interval, from 0-5 cm (sample 1) to 35-40 cm
(sample 8). The data were cleaned using the Biopieces[24] collection of tools, the reads
were assembled using IDBA-UD[25] [26], and both the generated contigs and the clean
reads were submitted to MG-RAST[27].
2.1. Sample collection, preparation and sequencing
The upstream work was done by collaborators and consisted of the following:
DNA was extracted from 5 g of sediment collected at different depths from the
Challenger Deep (Mariana Trench) at 10,900 m, using the PowerMax soil DNA isolation kit
(MoBio Laboratories, CA, USA). Eight DNA samples, each corresponding to a different
depth, were sent to BGI-Shenzhen (China) for library preparation and sequencing.
Since one of the samples (sample 4: 15-20 cm) did not contain enough DNA for library
preparation (as reported in the BGI Sample Test Report), 14 FASTQ files were
received back, 2 for each of the seven samples: one with the forward and another
with the reverse reads.
2.2. Preliminary Analysis
The initial number of reads on each sample ranged from around 41 million to almost 84
million (Table 1).
Table 1 - Percentage of reads removed by the cleaning
Sample   Raw reads (forward + reverse)   Clean reads   Reads removed (%)
1        58,814,066                      43,569,096    25.921
2        45,533,260                      18,717,708    58.892
3        47,163,190                      34,419,612    27.020
5        83,968,942                      36,784,382    56.193
6        61,751,498                      43,891,904    28.922
7        46,894,236                      33,786,396    27.952
8        41,030,848                      28,508,242    30.520
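The last column of Table 1 can be recomputed from the two read counts. The following is a minimal sketch in Python, not part of the original pipeline; the function name `percent_removed` and the dictionary `READ_COUNTS` are introduced here only for illustration, and the counts are copied from three rows of the table.

```python
# Read counts copied from Table 1: sample -> (raw reads, clean reads).
READ_COUNTS = {
    1: (58_814_066, 43_569_096),
    2: (45_533_260, 18_717_708),
    8: (41_030_848, 28_508_242),
}


def percent_removed(raw, clean):
    """Percentage of raw reads discarded by the cleaning step."""
    return 100.0 * (raw - clean) / raw


for sample, (raw, clean) in READ_COUNTS.items():
    print(f"sample {sample}: {percent_removed(raw, clean):.3f}% removed")
```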
2.3. Biopieces
Low-quality residues were removed from the ends of the reads, as were the adaptors
used in the sequencing. Reads shorter than 30 bp were also excluded, in
addition to reads with a local mean score under 15, to overcome errors propagated
from cycle to cycle[28]. The cleaning removed from about 26% to almost 59% of the reads in
the samples (Table 1). The Biopieces script used is shown in Figure 3.
The tool trim_seq removes residues from the ends of sequences whose quality score,
in the FASTQ file, is below the specified minimum (in this case 25). The flag “-l”
makes sure that residues are removed until a stretch of at least 3 good-quality
residues is found, so that trimming does not stop prematurely at an isolated
good-quality residue near the end. This step is necessary to overcome the effect of
phasing and pre-phasing. These are caused by incomplete removal of the 3' terminators and
fluorophores, sequences missing an incorporation cycle, or by the incorporation of
nucleotides without effective 3' terminators[28]. This means that each cycle’s signal is
affected by the signal of the previous and subsequent cycles, hindering the detection of
the right base.
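The trimming rule described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not the Biopieces implementation: the function name `trim_ends` and its parameters are hypothetical, and the scores are assumed to be already decoded to integers.

```python
def trim_ends(seq, scores, min_qual=25, min_run=3):
    """Trim both ends until a run of `min_run` residues with score >= `min_qual`.

    Hypothetical helper: mirrors the described behaviour of trim_seq -m 25 -l 3,
    but is not taken from the Biopieces source.
    """
    def first_good_run(values):
        # Index where the first run of min_run good scores starts, or None.
        run = 0
        for i, q in enumerate(values):
            run = run + 1 if q >= min_qual else 0
            if run == min_run:
                return i - min_run + 1
        return None

    left = first_good_run(scores)
    if left is None:            # no acceptable stretch: the whole read is trimmed
        return "", []
    right = len(scores) - first_good_run(scores[::-1])
    return seq[left:right], scores[left:right]


# The isolated good residues at positions 1 and 9 do not stop the trimming.
print(trim_ends("ACGTACGTAC", [10, 30, 12, 30, 30, 30, 30, 30, 11, 30]))
```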
read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
merge_pair_seq |
grab -e 'SEQ_LEN_LEFT >= 30' |
grab -e 'SEQ_LEN_RIGHT >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
split_pair_seq |
write_fastq -x
Figure 3 – Cleaning script
The find_adaptor tool searches the reads for the given adaptors (forward:
ACACGACGCTCTTCCGATCT and reverse: AGATCGGAAGAGCACACGTC), or for
partial adaptors at least 6 residues long (flags “-l” for the forward and “-L” for
the reverse adaptor). By default, a percentage of the adaptor length is allowed for
mismatches, insertions, and deletions (10%, 5% and 5%, respectively).
Once the adaptors are found, clip_adaptor removes them, based on the keys output by
find_adaptor: ADAPTOR_POS_RIGHT, ADAPTOR_POS_LEFT, and ADAPTOR_LEN_LEFT.
The merge_pair_seq tool merges paired sequences, as long as they are interleaved.
Sequence names must be either in Illumina 1.3/1.5 format, with a trailing “/1” or “/2”,
or in Illumina 1.8 format, containing “1:” or “2:”. The names of the two sequences in a
pair should also match.
Grab is an improved version of Unix’s “grep”. It selects records that match a pattern, a
regular expression, or a numerical evaluation. In this case, reads with a length of at
least 30 bp were selected, by examining the keys SEQ_LEN_LEFT and
SEQ_LEN_RIGHT, output by merge_pair_seq.
Afterwards, mean_scores -l was used to calculate the local mean score: instead of
calculating the mean as the sum of all the scores divided by the length of the read, it
calculates the mean within a sliding window and returns the smallest of these values.
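The difference between the global mean and the local mean score can be illustrated with a short Python sketch. The window size of 5 is an assumption made for this example (the window used by mean_scores is not stated here), and the function name `local_mean_min` is hypothetical.

```python
def local_mean_min(scores, window=5):
    """Smallest mean score over all sliding windows of length `window`.

    The window size is an illustrative assumption, not the Biopieces default.
    """
    if len(scores) <= window:
        return sum(scores) / len(scores)
    return min(
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    )


# A read with a short low-quality stretch in the middle.
scores = [40] * 10 + [5] * 5 + [40] * 10
print(sum(scores) / len(scores))   # global mean hides the bad stretch
print(local_mean_min(scores))      # local mean exposes it
```

With a cutoff of 15, as in Figure 3, such a read would pass a filter on the global mean but fail the filter on the local mean.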
Finally, split_pair_seq was used to split the sequences merged with merge_pair_seq.
To speed up the process, this script was run with GNU parallel[29] with the -L 8 option,
which takes two records at a time (each record has 4 lines), to avoid breaking up the
pairs. GNU parallel allows Biopieces to be executed in parallel using multiple CPU cores
and multiple servers[24].
The merge_pair_seq and split_pair_seq tools were created within this project, to
overcome the speed and memory problems caused by the use of order_pairs. The latter
interleaves the sequences, as long as their names follow the Illumina 1.5 or 1.8 scheme,
and adds a key stating whether the read is “paired” or “orphan”. It should be used after
the trimming and grabbing steps; subsequently, only the paired reads should be
grabbed.
Example of a script using order_pairs (Figure 4):
read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
grab -e 'SEQ_LEN >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
order_pairs |
grab -p 'pair' -k ORDER |
write_fastq -x
Figure 4 – order_pairs script
2.4. Assembly
The decision to assemble the short reads into longer contigs was based on the
premise that “The longer the sequence information, the better is the ability to obtain
accurate information.” Annotation becomes easier, since longer sequences yield more
information to compare with the databases, and the same holds for the classification of
DNA fragments. Assembly also raises the confidence in accuracy, compensating for
the lower quality of single reads by having multiple reads cover the same segment
of information, provided that the coverage is high enough[13]. The IDBA-UD algorithm
is based on de Bruijn graphs, adapted for metagenomic sequencing data with
uneven sequencing depths[26].
In a de Bruijn graph, every (k-1)-mer is assigned to a node, and a node has a directed
edge to another one if there is some k-mer whose prefix is the former and whose suffix
is the latter. Each edge in the graph therefore represents one k-mer.
The idea is to find an Eulerian cycle[30], which visits each edge exactly once and thus
spells the shortest superstring that contains each k-mer exactly once (Figure 5).
Because each edge is visited only once, the time to run the algorithm is roughly
proportional to the number of edges[31]. This contrasts with finding a Hamiltonian
cycle[32], in which each node is visited only once; that is an NP-complete problem[33]
(no algorithm is known that solves it in time polynomial in the size of the input).
Applied to genome assembly, all the k-mers are the ones present in the reads
generated by sequencing[31], so ideally, the Eulerian cycle would generate the
genome. In practice this method cannot be applied directly, since there are some
assumptions that do not hold. Firstly, we cannot be sure that all the k-mers present in
the genome were generated; secondly, k-mers are not error-free; thirdly, each k-mer is
very likely to appear more than once in the genome; and lastly, we should not assume
that the genome is a single circular chromosome.
To deal with the first problem, instead of trying to assemble the reads, the algorithm
breaks them into smaller k-mers which are more likely to be representative of the whole
genome. To handle errors, the assembler chooses the path which is supported by
higher coverage. Regarding repeats, if a k-mer appears more than once in the
genome, it shall be represented by several edges connecting the same two nodes.
Finally, rather than searching for an Eulerian cycle, if the algorithm is modified to
search for an Eulerian path[34], then it is not required to end in the same node where it
began[31].
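The strategy outlined above (break reads into k-mers, use (k-1)-mers as nodes, then walk every edge once) can be sketched as follows. This is a toy illustration, not the IDBA-UD implementation: all function names are hypothetical, the input is assumed to be error-free, and an Eulerian path is assumed to exist.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(kmer_list):
    """De Bruijn multigraph: one edge prefix -> suffix per k-mer occurrence."""
    graph = defaultdict(list)
    for km in kmer_list:
        graph[km[:-1]].append(km[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = Counter(t for targets in graph.values() for t in targets)
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in nodes if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())              # dead end: emit the node
    return path[::-1]


def spell(path):
    """Superstring spelled by a walk through (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads_kmers = kmers("ATGGCGTGCA", 3)
print(spell(eulerian_path(build_graph(reads_kmers))))
```

For this input the node TG has two outgoing edges, so more than one Eulerian path exists; the reconstruction contains exactly the same k-mers as the original sequence but need not be identical to it, which is the repeat problem mentioned above.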
Figure 5 – Genome assembly strategies: Hamiltonian and Eulerian cycles[31].
The main problem with metagenomic data is that species with different abundances will
be represented by reads with uneven depth, and this cannot be disregarded as, e.g.,
an amplification bias. IDBA-UD solves this problem by adopting variable thresholds on
the multiplicity of the k-mers, making them dependent on the sequencing depth of the
neighboring contigs. The idea is that contigs with much lower sequencing depths than
their neighbors are more likely to be erroneous[26]. Moreover, IDBA-UD uses paired-end
information, namely the distance between the pairs, to resolve issues such as missing
k-mers and repeats.
The assembler IDBA-UD was first used with the default minimum contig size setting
(200 bp), which yielded N50 values from 3,545 to 9,240 bp. N50 is one of the common
assembly statistics[35]: it is the length of the shortest contig in the smallest set of
the largest contigs whose combined length represents no less than 50% of the assembly.
A higher minimum contig size of 500 bp was then chosen, which improved
the N50 values, so these contigs were uploaded to the MG-RAST server[27]. The
complete analysis of both assemblies (using the Biopiece analyze_assembly) is shown
in Table 2 and Table 3, including N50, contig length (maximum, minimum, mean and
total) and the number of contigs.
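As an illustration of the N50 definition, the following Python sketch computes it from a list of contig lengths. The function name `n50` and the toy lengths are invented for this example; they are not output of analyze_assembly.

```python
def n50(lengths):
    """Length of the contig at which the running total of the sorted
    (longest first) contig lengths reaches at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty assembly")


# Toy assembly: many very short contigs pull the N50 down.
toy = [10] * 20 + [100, 50]
print(n50(toy))                              # all contigs
print(n50([l for l in toy if l >= 50]))      # after a minimum-size filter
```

The second call shows why raising the minimum contig size improves the N50: the short contigs no longer contribute to the total length, so half of the assembly is reached within the longest contigs.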
Table 2 – Analysis of the assembly with minimum contig size 200 bp.
Sample              1             2             3             5             6             7             8
N50 (bp)            3,545         4,705         4,397         6,430         8,726         7,136         9,240
Max length (bp)     614,662       215,848       466,951       305,081       551,041       305,025       452,197
Min length (bp)     200           200           200           200           200           200           200
Mean length (bp)    1,439         1,956         1,651         2,206         2,068         1,822         2,234
Total length (bp)   106,683,337   42,414,452    79,943,730    68,959,229    61,784,508    70,790,963    64,146,341
Number of contigs   74,124        21,681        48,418        31,250        29,868        38,839        28,704
Table 3 – Analysis of the assembly with minimum contig size 500 bp.
Sample              1             2             3             5             6             7             8
N50 (bp)            14,106        6,199         14,122        8,662         17,856        16,340        16,698
Max length (bp)     614,662       215,848       548,284       305,081       551,037       337,423       551,034
Min length (bp)     504           502           503           501           505           518           503
Mean length (bp)    3,261         3,180         3,384         3,883         4,277         4,235         4,492
Total length (bp)   76,906,252    36,624,750    60,016,272    59,604,646    51,333,136    55,864,260    54,104,132
Number of contigs   23,581        11,514        17,732        15,349        12,000        13,190        12,044
2.5. MG-RAST
MG-RAST[27] uses several bioinformatics tools in its pipeline. Firstly, it filters
sequences based on length, number of ambiguous bases and quality values. All the
contigs uploaded from the 7 samples passed this preprocessing stage.
Then, “technical replicates”, identified as sequences whose first 50 base pairs are
identical, are removed in a step called Dereplication. Between 0.7% (surface sample) and
2.3% (sample 7) of the contigs were removed in this step, but no reads were removed. This
can be explained by the use of the same reads for different contigs.
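The dereplication criterion (identical first 50 base pairs) can be sketched as below. This is a simplified illustration and not the MG-RAST code; the function name `dereplicate` and the record layout (name, sequence) are assumptions made for the example.

```python
def dereplicate(records, prefix_len=50):
    """Keep the first sequence seen for each distinct `prefix_len`-bp prefix.

    `records` is an iterable of (name, sequence) pairs; hypothetical layout.
    """
    seen = set()
    kept = []
    for name, seq in records:
        key = seq[:prefix_len]
        if key in seen:
            continue        # same first 50 bp as an earlier sequence
        seen.add(key)
        kept.append((name, seq))
    return kept


records = [
    ("contig_1", "A" * 50 + "CCCC"),
    ("contig_2", "A" * 50 + "GGGGGG"),   # same prefix as contig_1: removed
    ("contig_3", "T" * 50 + "CCCC"),
]
print([name for name, _ in dereplicate(records)])
```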
After that, FragGeneScan[36] is used to predict coding regions. This tool is an ab initio
gene-calling algorithm that uses a hidden Markov model for coding and non-coding
regions, and that was developed specifically for metagenomes. It incorporates codon usage
bias, sequencing error models and start/stop codon patterns. A gene is reported if it is
longer than 60 bp, begins with either a start codon or an internal codon of a gene, and
ends with either a stop codon or an internal codon. This way, both complete and partial
genes are predicted. From 29,239 (sample 2) to 63,877 (sample 1) coding sequences were
predicted within the contigs, and from 16,387,405 (sample 2) to 40,199,546 (sample 6)
within the reads.
The sequences output by FragGeneScan are then clustered at 90% identity with
qiime-uclust. QIIME[37] is a software package developed specifically for high-throughput
amplicon sequencing data, although it also supports metagenomic data. It incorporates
many third-party tools, such as UCLUST[38]. This algorithm clusters sequences based
on their similarity, according to a threshold set by the user (or in this case by
MG-RAST). Each cluster is represented by one sequence (the centroid): all the sequences
in a cluster should have a similarity to its centroid above the threshold, and centroids
should have a similarity below the threshold to each other.
The algorithm starts with no centroids; each sequence is compared to
the list of centroids and is either assigned to a cluster or selected as a new centroid.
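The greedy centroid-based procedure just described can be sketched in Python. This is a deliberately simplified illustration, not UCLUST itself: the `identity` function below is a hypothetical ungapped, position-by-position comparison, whereas the real tool computes identity from alignments and uses heuristics to avoid comparing against every centroid.

```python
def identity(a, b):
    """Toy identity: matching positions over the longer length (no gaps).

    Stand-in for an alignment-based identity; an assumption of this sketch.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first centroid within the threshold,
    or make it a new centroid. Returns {centroid: [members]}."""
    clusters = {}
    for s in seqs:
        for centroid in clusters:
            if identity(s, centroid) >= threshold:
                clusters[centroid].append(s)
                break
        else:
            clusters[s] = [s]      # no centroid is close enough
    return clusters


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
for centroid, members in greedy_cluster(seqs).items():
    print(centroid, members)
```

Because the assignment is greedy, the result depends on the order in which the sequences are presented.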
The centroids and the singletons (unclustered sequences) are then searched using
BLAT[39] against the M5NR protein database. M5NR is a non-redundant protein
database which incorporates data from GO[40], KEGG[41][42], NCBI[43][44],
SEED[45][46], UniProt[47], VBI[48] and eggNOG[49], and has almost 16,000,000
sequences. BLAT builds an index of the database and then scans linearly through the
query sequence, whereas BLAST builds an index of the query sequence and then
scans linearly through the database. BLAT is therefore faster, since it only has to scan
through a relatively short query sequence rather than through a database of gigabases
of sequence. BLAT, however, loses to BLAST in terms of sensitivity, since it needs an
exact or nearly exact match to find a hit, making it suitable mostly for closely related
species. The alignment identified between 25,261 (sample 2) and 50,816 (sample 1)
protein features in the contigs, and from 4,859,593 (sample 2) to 10,890,942 (sample
6) in the reads. The latter proved to be correlated at 98% with the number of
dereplicated reads, using Pearson’s coefficient:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where $x_i$ and $y_i$ are the number of dereplicated reads and the number of protein
features in sample $i$, and $\bar{x}$ and $\bar{y}$ are their respective averages.
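Pearson's coefficient is straightforward to compute by hand. The sketch below uses the standard sample formula (covariance divided by the product of the standard deviations); the function name `pearson` is hypothetical and the input values are made up for illustration, since the per-sample numbers of dereplicated reads are not listed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: y grows almost linearly with x, so r is close to 1.
print(pearson([10, 20, 30, 40], [11, 19, 32, 41]))
```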
The results of the search against the M5NR database were retrieved for each of the
samples, at 90% identity, and mapped onto the metabolic pathway maps based on
KEGG data, using KEGG Mapper[41] [42] and iPath[50] [51].
Besides being the input for the Dereplication step, the filtered sequences are
pre-screened to identify ribosomal sequences at 70% identity, and these are then clustered
using UCLUST at 97% identity. The clusters are then searched for similarity against the
M5RNA database (Greengenes[52], SILVA[53] and RDP[54]), using BLAT[39]. This
alignment identified between 36 (sample 2) and 72 (sample 1) rRNA features in the
contigs, whilst in the reads the number ranged from 19,014 (sample 2) to 38,639
(sample 1).
MG-RAST also automatically calculated the alpha diversity of each sample, to
summarize the distribution of species-level annotations in that sample, using the
following equation:
$$ \alpha = 10^{H'}, \qquad H' = -\sum_{i=1}^{m} p_i \log_{10} p_i $$
where $p_i$ is the ratio of the number of annotations for species $i$ to the total number
of annotations and $m$ is the total number of different species annotations, using all the
annotation source databases incorporated by MG-RAST[27].
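As a hedged illustration, the sketch below computes an alpha diversity from a list of per-species annotation counts, assuming it is taken as the antilog of the Shannon index (which is why the result is expressed in "species"). The function name `alpha_diversity` and the counts are invented for the example; this is not MG-RAST code.

```python
import math


def alpha_diversity(counts):
    """Antilog of the Shannon index of a list of per-species counts.

    Assumes alpha = 10 ** H', with H' = -sum(p * log10(p)).
    """
    total = sum(counts)
    shannon = -sum(
        (c / total) * math.log10(c / total) for c in counts if c > 0
    )
    return 10 ** shannon


print(alpha_diversity([25, 25, 25, 25]))   # four equally abundant species
print(alpha_diversity([97, 1, 1, 1]))      # one dominant species
```

With evenly distributed annotations the value equals the number of species; the more uneven the distribution, the lower the value.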
Based on the abundances of each species in each sample (using the reads), the R
package vegan[55] was used to calculate the beta diversity, as suggested in the
manual[56]. It was calculated pairwise between samples, using the
Sørensen index of dissimilarity:
$$ \beta_{sor} = \frac{b + c}{2a + b + c} $$
where $a$ is the number of species shared by the two samples, and $b$ and $c$ are the
numbers of species unique to each sample; as well as with the widely known Whittaker's
species turnover:
$$ \beta_{w} = \frac{\gamma}{\bar{\alpha}} - 1 $$
where $\gamma$ is the total number of species in the collection of samples (gamma diversity),
and $\bar{\alpha}$ is the average richness per sample. Subtraction of one guarantees that β=0
means that there are no excess species or no heterogeneity between samples.
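Both beta diversity measures can be illustrated on presence/absence data. The sketch below is written in Python rather than R and does not call vegan; it assumes the Sørensen dissimilarity (b + c) / (2a + b + c) and Whittaker's turnover γ/ᾱ − 1, with hypothetical function names and toy species sets.

```python
def sorensen_dissimilarity(sample_1, sample_2):
    """(b + c) / (2a + b + c) for two sets of species."""
    a = len(sample_1 & sample_2)    # shared species
    b = len(sample_1 - sample_2)    # unique to the first sample
    c = len(sample_2 - sample_1)    # unique to the second sample
    return (b + c) / (2 * a + b + c)


def whittaker_beta(samples):
    """gamma / mean(alpha) - 1 over a collection of species sets."""
    gamma = len(set().union(*samples))
    mean_alpha = sum(len(s) for s in samples) / len(samples)
    return gamma / mean_alpha - 1


s1 = {"Pseudomonas", "Marinobacter", "Pseudoalteromonas"}
s2 = {"Marinobacter", "Pseudoalteromonas", "Shewanella"}
print(sorensen_dissimilarity(s1, s2))
print(whittaker_beta([s1, s2]))
```

For two samples the two measures coincide, and both are zero when the samples contain exactly the same species.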
Rarefaction curves were also automatically generated. The idea behind them is to
repeatedly re-sample the pool of reads at random, plotting the average number of
species represented by 1, 2, …, N reads[57].
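The re-sampling idea behind a rarefaction curve can be sketched as follows. This is an illustrative Monte Carlo version with hypothetical names and a toy community, not the procedure implemented in MG-RAST; each read is represented only by the species label it was annotated with.

```python
import random


def rarefaction_curve(labels, step, n_rep=20, seed=0):
    """Average number of distinct species in random subsamples of the reads.

    `labels` holds one species label per read; subsamples are drawn
    without replacement at sizes step, 2*step, ..., up to len(labels).
    """
    rng = random.Random(seed)
    curve = []
    for n in range(step, len(labels) + 1, step):
        richness = [len(set(rng.sample(labels, n))) for _ in range(n_rep)]
        curve.append((n, sum(richness) / n_rep))
    return curve


# Toy community: four species with very uneven abundances.
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
for n_reads, mean_species in rarefaction_curve(labels, step=10):
    print(n_reads, mean_species)
```

A curve that is still rising at the largest subsample size suggests that additional reads would reveal additional species.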
Krona[58] was used to view the percentage of reads with predicted proteins and
ribosomal RNA genes annotated based on all the databases.
3. Results
The reads and contigs submitted to MG-RAST were automatically assigned
unique IDs, as indicated in Table 4.
Table 4 - IDs of the reads and contigs submitted to MG-RAST
Sample   Reads       Contigs
1        4525786.3   4518922.3
2        4525785.3   4518923.3
3        4525784.3   4518924.3
5        4525781.3   4518925.3
6        4525782.3   4518926.3
7        4525783.3   4518927.3
8        4525787.3   4518928.3
To compare the abundances among the samples, the results were extracted from the
reads, whereas to assess presence or absence of a defined feature, the contigs’
results were retrieved.
3.1. Taxonomic Hits Distribution
Extracting the best hit classification from the reads compared to M5NR using a
maximum e-value of 1e-5, a minimum identity of 90%, and a minimum alignment length
of 15 aa, it is clear that Bacteria, and more specifically Proteobacteria, largely dominate
in all the 7 samples (Figure 6 and Figure 20).
In terms of class, Betaproteobacteria seems to comprise 78% of Proteobacteria in
Sample 1, unlike the other samples, where Gammaproteobacteria seems to be the
dominant class (Figure 7). Sample 3 shows a larger representation of
Alphaproteobacteria compared to the other samples.
Most Gammaproteobacteria in sample 1 are Pseudoalteromonas and in sample 2
Pseudomonas, whereas from sample 3 to sample 8 other genera, namely
Marinobacter, become just as dominant (see Figure 21 to Figure 27).
Figure 6 – Taxonomic distribution of the reads at the domain level
Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)
In terms of α-diversity, calculated using the reads against all the annotation databases
used by MG-RAST, sample 1 shows the highest: 430.83 species. The other samples
have diversities between 184.10 species (sample 6) and 252.14 species (sample 7).
The values of α-diversity for all the samples are shown on Table 5.
Table 5 - α-diversity
Sample   α-diversity (species)
1        430.83
2        213.47
3        232.97
5        210.42
6        184.10
7        252.14
8        240.39
The β-diversity value, using Whittaker's species turnover, was 1.181461, and the
pairwise comparisons are shown in Table 6 and Figure 8.
Table 6 – Pairwise β-diversity
Sample   1          2          3          5          6          7
2        0.422489
3        0.353043   0.319049
5        0.382264   0.283298   0.292187
6        0.364884   0.30632    0.292165   0.288654
7        0.360278   0.333708   0.314927   0.307126   0.287154
8        0.365677   0.324216   0.309876   0.306393   0.292684   0.294254
Figure 8 - β-diversity barchart
A correlation analysis of the distance between samples and their β-diversity shows no
relation between them (Figure 28).
The rarefaction curves of annotated species richness for all the samples rise quickly
at first and then become flatter, but without leveling off towards an asymptote
(Figure 9). This means that, had there been more reads, more species would probably
have been found. Even so, these results allow a reasonable estimate of the community
structure.
Figure 9 – Rarefaction curve of annotated species richness
The Principal Coordinates Analysis (PCoA) for the reads of the 7 samples, with annotation
against the M5RNA database, using the Bray-Curtis measure (chosen for showing a
robust relationship with ecological distance[59]), an e-value of 1e-5 and a minimum
identity of 97%, does not show a clear trend; neither does the analysis using the M5NR
database, with a minimum identity of 90% (Figure 10). See Figure 29 and Figure 30 for the
heatmaps with the same thresholds and with values normalized to the size of the samples.
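For reference, the Bray-Curtis measure between two abundance profiles can be computed as in the sketch below. This uses the standard definition (sum of absolute differences divided by the sum of all abundances); the function name `bray_curtis` and the abundance vectors are made up for illustration and are not taken from the samples.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(u, v))
    denominator = sum(x + y for x, y in zip(u, v))
    if denominator == 0:
        return 0.0              # two empty profiles are treated as identical
    return numerator / denominator


# Made-up abundances of three taxa in two samples.
print(bray_curtis([6, 7, 4], [10, 0, 6]))
```

The value ranges from 0 (identical profiles) to 1 (no taxa in common).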
Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)
Nevertheless, when comparing with metagenomes from 1) the gut microbiota of 91
pregnant women of varying prepregnancy BMIs and gestational diabetes status and
their infants (http://metagenomics.anl.gov/linkin.cgi?project=265), and 2) activated
sludge from 2 full-scale tannery wastewater treatment plants
(http://metagenomics.anl.gov/linkin.cgi?project=922), the Mariana Trench samples
clearly group together in a very distinct group. As the first of these environments is
expected to be very different, and the second quite different, from the deep-sea Mariana
Trench samples, this is a good indicator of the reliability of the latter. See, for example,
Figure 11, for a comparison against the M5NR database, at 90% minimum identity and
an e-value of 1e-5.
Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated Sludge; Green – Gut Microbiota
3.2. Functional Category Hits Distribution
Comparing the features annotated from the reads with those annotated from the contigs,
it is noticeable that the latter provide a much more reliable source for
annotation, as seen from the range of e-values, which was expected. See, for example,
sample 7 in Figure 12 and Figure 13. More features were predicted from the reads,
but there were also many more reads than contigs.
Figure 12 - Number of features in the reads of sample 7 annotated by the different databases
Figure 13 – Number of features in the contigs of sample 7 annotated by the different databases
Moreover, taking again sample 7 as an example, only 50.7% of the predicted protein
features in the reads could be annotated with similarity to a protein of known function,
whereas 84.9% of the predicted protein features of the contigs were annotated.
Of all the databases that were used to compare the protein sequences generated
from the contigs, SEED Subsystems[45] had the highest number of annotations (Figure
12 and Figure 13). It is worth noting, however, that each database holds a different type
of annotation data, hence the different number of hits. Since the tools used to analyse the
pathways (KEGG Mapper and iPath) use the KEGG database, the focus was put on
the functional hierarchy given by KEGG Orthology (KO)[41][42].
Comparing the reads to KO, using a maximum e-value of 1e-5, a minimum identity of
90%, and a minimum alignment length of 15, on average 53% (±0.03) of the reads with
predicted protein functions were annotated as belonging to the Metabolism category.
Of those, 14% (±0.05) belong to Energy metabolism.
In sample 1, roughly 100% of the reads assigned to Energy metabolism
correspond to oxidative phosphorylation; in the rest of the samples, this value lies
around 77% (±0.07).
In fact, the F-type H+-transporting ATPase subunit beta (K02112), involved in both
oxidative phosphorylation (Figure 14) and photosynthesis (Figure 15), is the second
most abundant hit in sample 1 (out of 54 hits), with an average identity of 91.06% and
an average e-value exponent of -6.14 (that is, an e-value of roughly 1e-6).
Figure 14 - Oxidative Phosphorylation, pathway ko00190.
In sample 2, K02112 appears in 11th place (out of 239 hits) with an abundance of 9187,
together with F-type H+-transporting ATPase subunit alpha (K02111) in 10th place with
an abundance of 9307.
In sample 3, K02112 has an abundance of 9513 and K02111 of 9758, ranking 8th
and 6th, respectively, when sorted by abundance. For sample 5 the values are 13405
for K02112 and 12764 for K02111 (10th and 12th). Sample 6 has even higher
abundances for K02112 and K02111: 16632 and 16260 (8th and 9th most abundant). In
samples 7 and 8 they appear in 5th and 6th place, out of 108 and 115 hits, with
abundances of 11492 and 11257, and 10691 and 10294. In all samples from the
second to the seventh, these subunits have an average identity above 91.5%.
Figure 15 – Photosynthesis, pathway ko00195.
Using the contigs, with the same settings, only K02112 was found, and only in samples
2 and 8. However, the average alignment length of the hits was 356.55 and 332.22,
respectively, whereas for the reads it was 27.67 and 27.57. Nevertheless, other hits
also classified as belonging to Oxidative Phosphorylation were found, like NADH-
quinone oxidoreductase subunit (K13380 and K13378), NADH-quinone oxidoreductase
subunits (K00338 and K00340), F-type H+-transporting ATPase subunit c (K02110), V-
type H+-transporting ATPase subunits (K02118 and K02122), cytochrome c oxidase
assembly protein subunit 17 (K02260), nucleosome-remodeling factor 38 kDa subunit
(K11726), cytochrome o ubiquinol oxidase subunit III (K02299), cytochrome o ubiquinol
oxidase operon protein cyoD (K02300) and NAD(P)H-quinone oxidoreductase subunit
5 (K05577).
To address, with some degree of confidence, whether alternative energy metabolism
processes occur in any of the samples, the contigs results were further explored.
Indeed, all samples contained contigs involved in Methane Metabolism (Figure 16).
Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples.
In addition, contigs from samples 2, 5 and 8, matched hits from nitrogen metabolism
(Figure 17). In all the three samples, nitric oxide reductase subunit B (K04561)
(EC:1.7.2.5) was present, which is involved in denitrification (nitrate → nitrogen).
Sample 2 also had a nitrogenase iron protein NifH (K02588) (EC:1.18.6.1), a
nitrogenase molybdenum-cofactor synthesis protein NifE (K02587) and a nitrogen
fixation protein NifX (K02596).
Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8.
Finally, the map generated with iPATH (Figure 18) gives a general overview of the
pathways present, when combining all samples. It is worth noting that photosynthesis
appears mapped; however, this is most likely a misleading mapping, since the enzyme
identified is an F-type H+-transporting ATPase, which is involved in photosynthesis but
also in oxidative phosphorylation, as mentioned earlier.
Figure 18 - Metabolic map of the seven samples
4. Discussion
Marine sediments, and in particular hadal trenches, receive substantial deposition of
microbes and organic matter from the upper water layer[1], and provide a matrix of
complex nutrients and solid surfaces for microbial growth[60]. However, the low
temperature and the extreme hydrostatic pressure demand a certain degree of
adaptation from the organisms inhabiting such an environment. Even so, there seems
to be a fairly high diversity along the sediment depth, as seen in Table 5 and Figure 9.
Proteobacteria are the largest and most metabolically diverse group of Bacteria. They
are all Gram-negative, and they divide into 5 classes: alpha, beta, gamma, delta and
epsilon[61]. The dominance of Gammaproteobacteria is in accordance with a study
from the Pacific Arctic Ocean, where the temperatures are also very low[62], and
to some extent with a study of sediments at 4000 m depth in the Pacific Ocean, where not
only Gammaproteobacteria but also Alphaproteobacteria dominate the community[63].
Intriguingly, the outer layer of an actively venting black-smoker chimney from a
hydrothermal vent field on the Juan de Fuca Ridge[64] is also dominated by
Gammaproteobacteria, even though its temperature lies above 310 °C.
The PCoA graphs group together samples that exhibit similar abundance profiles, in
terms of taxonomy or function. However, when comparing the seven
samples, there is no obvious trend in the community along the depth of the sediment
(Figure 10). Nevertheless, the fact that this project’s samples group together, and very
distinctly from other projects’ samples, is a good indicator that this environment has its
own community structure.
The poor correlation between β-diversity and distance between samples also supports
the PCoA results (Table 6 and Figure 28). This means that the difference in microbial
community composition (as defined in [65]) is most likely due to factors other than
depth. It is possible that, under such high pressure, some centimeters of sediment do
not really make a difference in the community structure. Alternatively, there might have
been some mixing of the communities during the sampling process.
It should be noted however, that the fact that the community as a whole does not show
a shift alongside the depth of the sediment, does not exclude the hypothesis that some
taxa correlate with it.
Regarding the decision to assemble, the better range of e-values of the features
annotated from the contigs with the different databases, as well as the higher percentage
of predicted protein features that were annotated, should provide some degree of
confidence in the assembly.
The high number of hits of the oxidative phosphorylation pathway supported the
predictions from [1], that there is intensified O2 consumption within the sediment, unlike
in the sediment of the reference site (≈6000m of water depth), where the microbial
activity has reduced rates. This was supported by measurements of the O2
concentration throughout the depth of the sediment. Attenuation in the O2
concentration reflects higher rates of its consumption[1] (Figure 19), which is consistent
with the presence of genes involved in aerobic respiration in all the samples.
Figure 19 – Oxygen micro-profiles at 6,018 m water depth (a); and at Challenger Deep (b) [1].
Even though oxidative phosphorylation dominates the energy metabolism processes,
methane and nitrogen metabolism still play a part in the community’s energetic
potential.
Normally, methanogenesis is associated with anoxic environments; still, it is known that
even in oxic environments, anoxic microenvironments can form, where
methanogenesis takes place[61].
Once more, the predictions that there is intensified mineralization mediated by the
prokaryotic community at Challenger Deep[1] are supported by the contigs with
homology to features involved in nitrogen metabolism.
Finally, the misleading mapping of the ATPase (Figure 18) should be taken as an
example that care and critical judgement are fundamental when using automated tools.
5. Conclusion
This study was a first description of both the community structure and its functional
potential in the Mariana Trench, an environment made unique by its extreme conditions.
The amount of data generated made it impractical to describe the results in full. The
energy metabolism was selected for this thesis, since it was interesting to compare with
the results from [1]. The finding that there are enzymes involved in the oxidative
phosphorylation pathway in all 7 samples supported the published measurements of
oxygen consumption throughout the sediment.
A taxonomic and/or functional gradient along the depth of the sediment was expected,
but none seems to occur. Further investigation on this matter
would help to establish whether any taxa are characteristic of particular depths.
The data used in the study will soon be publicly available on MG-RAST, and therefore
accessible for additional investigation. However, in the future, it would be sensible to
sample with true replicates, and to take a broader range of environmental
measurements, to allow the data to be more comparable to other studies. It would also
be interesting to take samples from sediments from other depths along the Challenger
Deep, to assess whether the community uniqueness is due to the extreme depth or to the
overall conditions at that site.
To conclude, it is probable that in 10 years' time, with the development of new tools or
with the improvement of the existing ones, all of these results will be proved inaccurate.
However, the aim of this thesis was neither to develop new tools, nor to compare the
existing ones, but to use them wisely and understand their purpose for this analysis.
Hence, the argument of this project is that with this set of tools, this is the product.
References
[1] R. N. Glud, F. Wenzhöfer, M. Middelboe, K. Oguri, R. Turnewitsch, D. E. Canfield, and H. Kitazato, “High rates of microbial carbon turnover in sediments in the deepest oceanic trench on earth,” Nature Geoscience, vol. 6, no. 4, pp. 284–288, Mar. 2013.
[2] R. Pawlowicz, “Key physical variables in the ocean: temperature, salinity, and density,” Nature Education Knowledge, vol. 4, no. 4, p. 13, 2013.
[3] “The international system of units.” Bureau International des Poids et Mesures, 2006.
[4] R. A. Lutz and P. G. Falkowski, “Ocean science. A dive to Challenger Deep.,” Science (New York, N.Y.), vol. 336, no. 6079, pp. 301–2, Apr. 2012.
[5] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman, “Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products,” Chemistry & Biology, vol. 5, no. 10, pp. R245–R249, Oct. 1998.
[6] Society of American Bacteriologists., Bergey’s manual of determinative bacteriology, 1st ed. Baltimore, Williams & Wilkins Co., 1923.
[7] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: The primary kingdoms,” Proceedings of the National Academy of Sciences, vol. 74, no. 11, pp. 5088–5090, Nov. 1977.
[8] P. Hugenholtz and G. W. Tyson, “Microbiology: metagenomics.,” Nature, vol. 455, no. 7212, pp. 481–3, Sep. 2008.
[9] E. M. Glass and F. Meyer, “Analysis of metagenomics data,” in Bioinformatics for High Throughput Sequencing, N. Rodríguez-Ezpeleta, M. Hackenberg, and A. M. Aransay, Eds. New York, NY: Springer New York, 2012, pp. 219–229.
[10] J. Handelsman, “Metagenomics: application of genomics to uncultured microorganisms.,” Microbiology and molecular biology reviews : MMBR, vol. 68, no. 4, pp. 669–85, Dec. 2004.
[11] X. Hao and T. Chen, “OTU analysis using metagenomic shotgun sequencing data,” PLoS ONE, vol. 7, no. 11, p. e49785, Nov. 2012.
[12] V. Iverson, R. M. Morris, C. D. Frazar, C. T. Berthiaume, R. L. Morales, and E. V. Armbrust, “Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota.,” Science (New York, N.Y.), vol. 335, no. 6068, pp. 587–90, Feb. 2012.
[13] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics - a guide from sampling to data analysis.,” Microbial informatics and experimentation, vol. 2, no. 1, p. 3, Jan. 2012.
[14] J. C. Wooley, A. Godzik, and I. Friedberg, “A primer on metagenomics.,” PLoS computational biology, vol. 6, no. 2, p. e1000667, Feb. 2010.
[15] N. Whiteford, N. Haslam, G. Weber, A. Prügel-Bennett, J. W. Essex, P. L. Roach, M. Bradley, and C. Neylon, “An analysis of the feasibility of short read sequencing.,” Nucleic acids research, vol. 33, no. 19, p. e171, Jan. 2005.
[16] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J. Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. J. Rasolonjatovo, M. T. Reed, R. Rigatti, C. Rodighiero, M. T. Ross, A. Sabot, S. V Sankar, A. Scally, G. P. Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H. Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. D. Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F. Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown, A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. J. 
Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin, and A. J. Smith, “Accurate whole human genome sequencing using reversible terminator chemistry.,” Nature, vol. 456, no. 7218, pp. 53–9, Nov. 2008.
[17] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies.,” PloS one, vol. 6, no. 3, p. e17915, Jan. 2011.
[18] H. Teeling and F. O. Glöckner, “Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective.,” Briefings in bioinformatics, Sep. 2012.
[19] M. Baker, “De novo genome assembly: what every biologist should know,” Nature Methods, vol. 9, no. 4, pp. 333–337, Mar. 2012.
[20] P. Yilmaz, R. Kottmann, D. Field, R. Knight, J. R. Cole, L. Amaral-Zettler, J. A. Gilbert, I. Karsch-Mizrachi, A. Johnston, G. Cochrane, R. Vaughan, C. Hunter, J. Park, N. Morrison, P. Rocca-Serra, P. Sterk, M. Arumugam, M. Bailey, L. Baumgartner, B. W. Birren, M. J. Blaser, V. Bonazzi, T. Booth, P. Bork, F. D. Bushman, P. L. Buttigieg, P. S. G. Chain, E. Charlson, E. K. Costello, H. Huot-Creasy, P. Dawyndt, T. DeSantis, N. Fierer, J. A. Fuhrman, R. E. Gallery, D. Gevers, R. A. Gibbs, I. San Gil, A. Gonzalez, J. I. Gordon, R. Guralnick, W. Hankeln, S. Highlander, P. Hugenholtz, J. Jansson, A. L. Kau, S. T. Kelley, J. Kennedy, D. Knights, O. Koren, J. Kuczynski, N. Kyrpides, R. Larsen, C. L. Lauber, T. Legg, R. E. Ley, C. A. Lozupone, W. Ludwig, D. Lyons, E. Maguire, B. A. Methé, F. Meyer, B. Muegge, S. Nakielny, K. E. Nelson, D. Nemergut, J. D. Neufeld, L. K. Newbold, A. E. Oliver, N. R. Pace, G. Palanisamy, J. Peplies, J. Petrosino, L. Proctor, E. Pruesse, C. Quast, J. Raes, S. Ratnasingham, J. Ravel, D. A. Relman, S. Assunta-Sansone, P. D. Schloss, L. Schriml, R. Sinha, M. I. Smith, E. Sodergren, A. Spo, J. Stombaugh, J. M. Tiedje, D. V Ward, G. M. Weinstock, D. Wendel, O. White, A. Whiteley, A. Wilke, J. R. Wortman, T. Yatsunenko, and F. O. Glöckner, “Minimum
information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.,” Nature biotechnology, vol. 29, no. 5, pp. 415–20, May 2011.
[21] “Web of Knowledge.” [Online]. Available: www.webofknowledge.com.
[22] J. L. Fox, “Natural-born eaters.,” Nature biotechnology, vol. 29, no. 2, pp. 103–6, Feb. 2011.
[23] P. Lorenz and J. Eck, “Metagenomics and industrial applications.,” Nature reviews. Microbiology, vol. 3, no. 6, pp. 510–6, Jun. 2005.
[24] “Biopieces.” [Online]. Available: www.biopieces.org.
[25] Y. Peng, H. Leung, S. Yiu, and F. Chin, “IDBA – a practical iterative de Bruijn graph de novo assembler,” in 14th RECOMB 2010, 2010, pp. 426–440.
[26] Y. Peng, H. C. M. Leung, S. M. Yiu, and F. Y. L. Chin, “IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.,” Bioinformatics (Oxford, England), vol. 28, no. 11, pp. 1420–8, Jun. 2012.
[27] F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards, “The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.,” BMC bioinformatics, vol. 9, no. 1, p. 386, Jan. 2008.
[28] M. Kircher, U. Stenzel, and J. Kelso, “Improved base calling for the Illumina Genome Analyzer using machine learning strategies.,” Genome biology, vol. 10, no. 8, p. R83, Jan. 2009.
[29] O. Tange, “GNU Parallel: the command-line power tool,” ;login: The USENIX Magazine, pp. 42–47, 2011.
[30] E. W. Weisstein, “Eulerian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.
[31] P. E. C. Compeau, P. A. Pevzner, and G. Tesler, “How to apply de Bruijn graphs to genome assembly.,” Nature biotechnology, vol. 29, no. 11, pp. 987–91, Nov. 2011.
[32] E. W. Weisstein, “Hamiltonian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.
[33] E. W. Weisstein, “NP-Complete Problem -- from Wolfram MathWorld.” Wolfram Research, Inc.
[34] E. W. Weisstein, “Eulerian Path -- from Wolfram MathWorld.” Wolfram Research, Inc.
[35] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data.,” Genomics, vol. 95, no. 6, pp. 315–27, Jun. 2010.
[36] M. Rho, H. Tang, and Y. Ye, “FragGeneScan: predicting genes in short and error-prone reads.,” Nucleic acids research, vol. 38, no. 20, p. e191, Nov. 2010.
[37] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T.
Yatsunenko, J. Zaneveld, and R. Knight, “QIIME allows analysis of high-throughput community sequencing data.,” Nature methods, vol. 7, no. 5, pp. 335–6, May 2010.
[38] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST.,” Bioinformatics (Oxford, England), vol. 26, no. 19, pp. 2460–1, Oct. 2010.
[39] W. J. Kent, “BLAT--the BLAST-like alignment tool.,” Genome research, vol. 12, no. 4, pp. 656–64, Apr. 2002.
[40] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.,” Nature genetics, vol. 25, no. 1, pp. 25–9, May 2000.
[41] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes.,” Nucleic acids research, vol. 28, no. 1, pp. 27–30, Jan. 2000.
[42] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets.,” Nucleic acids research, vol. 40, no. Database issue, pp. D109–14, Jan. 2012.
[43] E. W. Sayers, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko, and J. Ye, “Database resources of the National Center for Biotechnology Information.,” Nucleic acids research, vol. 37, no. Database issue, pp. D5–15, Jan. 2009.
[44] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, “GenBank.,” Nucleic acids research, vol. 37, no. Database issue, pp. D26–31, Jan. 2009.
[45] R. Overbeek, T. Begley, R. M. Butler, J. V Choudhuri, H.-Y. Chuang, M. Cohoon, V. de Crécy-Lagard, N. Diaz, T. Disz, R. Edwards, M. Fonstein, E. D. Frank, S. Gerdes, E. M. Glass, A. Goesmann, A. Hanson, D. Iwata-Reuyl, R. Jensen, N. Jamshidi, L. Krause, M. Kubal, N. Larsen, B. Linke, A. C. McHardy, F. Meyer, H. Neuweger, G. Olsen, R. Olson, A. Osterman, V. Portnoy, G. D. Pusch, D. A. Rodionov, C. Rückert, J. Steiner, R. Stevens, I. Thiele, O. Vassieva, Y. Ye, O. Zagnitko, and V. Vonstein, “The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.,” Nucleic acids research, vol. 33, no. 17, pp. 5691–702, Jan. 2005.
[46] R. K. Aziz, D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S. Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke, and O. Zagnitko, “The RAST Server: rapid annotations using subsystems technology.,” BMC genomics, vol. 9, p. 75, Jan. 2008.
[47] The UniProt Consortium, “Reorganizing the protein space at the Universal Protein Resource (UniProt).,” Nucleic acids research, vol. 40, no. Database issue, pp. D71–5, Jan. 2012.
[48] J. J. Gillespie, A. R. Wattam, S. A. Cammer, J. L. Gabbard, M. P. Shukla, O. Dalay, T. Driscoll, D. Hix, S. P. Mane, C. Mao, E. K. Nordberg, M. Scott, J. R. Schulman, E. E. Snyder, D. E. Sullivan, C. Wang, A. Warren, K. P. Williams, T. Xue, H. S. Yoo, C. Zhang, Y. Zhang, R. Will, R. W. Kenyon, and B. W. Sobral, “PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species.,” Infection and immunity, vol. 79, no. 11, pp. 4286–98, Nov. 2011.
[49] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, L. J. Jensen, C. von Mering, and P. Bork, “eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.,” Nucleic acids research, vol. 40, no. Database issue, pp. D284–9, Jan. 2012.
[50] I. Letunic, T. Yamada, M. Kanehisa, and P. Bork, “iPath: interactive exploration of biochemical pathways and networks.,” Trends in biochemical sciences, vol. 33, no. 3, pp. 101–3, Mar. 2008.
[51] T. Yamada, I. Letunic, S. Okuda, M. Kanehisa, and P. Bork, “iPath2.0: interactive pathway explorer.,” Nucleic acids research, vol. 39, no. Web Server issue, pp. W412–5, Jul. 2011.
[52] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, “Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.,” Applied and environmental microbiology, vol. 72, no. 7, pp. 5069–72, Jul. 2006.
[53] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner, “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.,” Nucleic acids research, vol. 41, no. Database issue, pp. D590–6, Jan. 2013.
[54] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, “The Ribosomal Database Project: improved alignments and new tools for rRNA analysis.,” Nucleic acids research, vol. 37, no. Database issue, pp. D141–5, Jan. 2009.
[55] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O’Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, “vegan: Community Ecology Package. R package version 2.0-7.” 2013.
[56] J. Oksanen, “Vegan: ecological diversity.”
[57] N. J. Gotelli and R. K. Colwell, “Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness,” Ecology Letters, vol. 4, no. 4, pp. 379–391, Jul. 2001.
[58] B. D. Ondov, N. H. Bergman, and A. M. Phillippy, “Interactive metagenomic visualization in a web browser.,” BMC bioinformatics, vol. 12, p. 385, Jan. 2011.
[59] D. P. Faith, P. R. Minchin, and L. Belbin, “Compositional dissimilarity as a robust measure of ecological distance,” Vegetatio, vol. 69, no. 1–3, pp. 57–68, Apr. 1987.
[60] Y. Wang, H.-F. Sheng, Y. He, J.-Y. Wu, Y.-X. Jiang, N. F.-Y. Tam, and H.-W. Zhou, “Comparison of the levels of bacterial diversity in freshwater, intertidal wetland, and marine sediments by using millions of illumina tags.,” Applied and environmental microbiology, vol. 78, no. 23, pp. 8264–71, Dec. 2012.
[61] M. T. Madigan, J. M. Martinko, P. V. Dunlap, and D. P. Clark, Brock Biology of Microorganisms, 12th ed. Pearson, 2009.
[62] H. Li, Y. Yu, W. Luo, Y. Zeng, and B. Chen, “Bacterial diversity in surface sediments from the Pacific Arctic Ocean.,” Extremophiles : life under extreme conditions, vol. 13, no. 2, pp. 233–46, Mar. 2009.
[63] K. T. Konstantinidis, J. Braff, D. M. Karl, and E. F. DeLong, “Comparative metagenomic analysis of a microbial community residing at a depth of 4,000 meters at station ALOHA in the North Pacific subtropical gyre.,” Applied and environmental microbiology, vol. 75, no. 16, pp. 5345–55, Aug. 2009.
[64] W. Xie, F. Wang, L. Guo, Z. Chen, S. M. Sievert, J. Meng, G. Huang, Y. Li, Q. Yan, S. Wu, X. Wang, S. Chen, G. He, X. Xiao, and A. Xu, “Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries.,” The ISME journal, vol. 5, no. 3, pp. 414–26, Mar. 2011.
[65] J. Wang, Y. Wu, H. Jiang, C. Li, H. Dong, Q. Wu, J. Soininen, and J. Shen, “High beta diversity of bacteria in the shallow terrestrial subsurface,” Environmental Microbiology, vol. 10, no. 10, pp. 2537–2549, Oct. 2008.
Appendix
Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level
Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1
Figure 22 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 2
Figure 23 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 3
Figure 24 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 5
Figure 25 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 6
Figure 26 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 7
Figure 27 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 8
Figure 28 - Beta diversity related to spatial distance
[Scatter plot “Correlation between Spatial Distance and Dissimilarity”: x-axis Distance (cm), 0 to 40; y-axis Dissimilarity, 0 to 0.45; linear fit y = 0.0013x + 0.3036, R² = 0.0999]
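The linear fit reported for Figure 28 (R² = 0.0999, meaning distance accounts for roughly 10% of the variance in dissimilarity) relates pairwise dissimilarity between samples to the distance separating them. The sketch below illustrates one way such a fit could be computed. It is not the procedure used in this thesis. The Bray-Curtis measure, the use of the depth difference between samples as the distance, the function names, and all input values are assumptions made for illustration only.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (assumed measure)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def linear_fit(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r2


def distance_decay(depths_cm, abundances):
    """Regress pairwise dissimilarity on pairwise depth difference."""
    xs, ys = [], []
    for i, j in combinations(range(len(depths_cm)), 2):
        xs.append(abs(depths_cm[i] - depths_cm[j]))
        ys.append(bray_curtis(abundances[i], abundances[j]))
    return linear_fit(xs, ys)


# Hypothetical depths (cm) and taxon counts, NOT data from this study.
depths_cm = [0, 5, 10, 20]
abundances = [[10, 0, 5], [8, 2, 5], [9, 1, 5], [2, 8, 5]]

slope, intercept, r2 = distance_decay(depths_cm, abundances)
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")
```

Because every pair of samples contributes one point, the points are not independent of each other, so a permutation-based test (for example a Mantel test) would be a more cautious way to judge significance than the R² value alone.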
Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity
Figure 30 - Heatmap of the reads against the M5NR database at 90% identity