
UNIVERSIDADE DE LISBOA

FACULDADE DE CIÊNCIAS

DEPARTAMENTO DE BIOLOGIA ANIMAL

METAGENOMIC ANALYSIS OF

MARIANA TRENCH SEDIMENT

SAMPLES

Vera Maria Leal Carvalho

Dissertation for the

MSc IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

SPECIALISATION IN BIOINFORMATICS

2013


Dissertation supervised by Professor Francisco Couto (DI-FCUL) and post-doctoral fellow Martin Asser Hansen (MME-KU)



Thesis conducted at the Molecular Microbial Ecology Group of the

Department of Biology of the Faculty of Science of the University of

Copenhagen

MSc IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY


Acknowledgements

First of all, I would like to thank my external supervisor at the Molecular Microbial

Ecology Group of the University of Copenhagen, Doctor Martin Asser Hansen, who,

undoubtedly, was the person who contributed the most to the success of this project.

He taught me some things but, most importantly, he pushed me to learn by myself

everything else. He was a source of inspiration and support, and the hours of patient

discussion were invaluable. Moreover, he introduced me to the group, and made me

feel welcome since day one. I feel extremely happy that I got the opportunity to have

him as a supervisor.

Secondly, I would like to thank my supervisor at the Faculty of Science of the University

of Lisbon, Professor Francisco Couto, for agreeing to supervise me and checking

regularly if my work was progressing. I’d also like to thank him for allowing me to do a

short collaboration within the project EPIWORK, which set the ground for my research

career.

I should thank Professor Søren Sørensen for having me in his group, and Assistant

Professor Waleed Abu Al-Soud for integrating me in this project. In addition, I want to

thank Associate Professor Lars H. Hansen for the times when I needed some direction,

and he immediately engaged in a session of brainstorming, as well as Lea Skov

Hansen who never failed to help me when I asked. Also, I’d like to state my sincere

gratitude to the Associate Professor Emeritus Annelise Kjøller for reviewing the

manuscript.

But since work is not just about working, I want to thank the people at MME who made my working hours and coffee breaks enjoyable. Working in such a great atmosphere was a wonderful experience! I especially want to thank Tue, Lea, Analia, Stefan, Michael, Peter, Lars B. and Witold.

I also want to thank my roommates Chris and Emil for all the fun and hyggeligt times.

I shall also thank all of my friends and family in Portugal who have always been

present, supporting me, giving me love and care, and reminding me every day that

distance meant simply a plane ride…nothing else.

However, some people had a direct effect on the outcome of this thesis, and therefore I feel their names should be clearly stated. First of all, Gil, for keeping me company every day; for sharing the tiny details of our days; for discussing silly and more serious thoughts. Ricardo Purificação, for bugging me the whole year to write the Introduction. Carolina, Mogli, Anas and João, for maintaining my insanity with our group conversation that is about… (What is it about?) And finally, Natalia Cięciwa and Pascaline Serra, for keeping track of me all these months, for the love they shared with me, for their friendship, for not giving up.

Most of all, I want to thank Inês and Luísa. For they made me feel at home in the cold

kingdoms of the north: they took care of me, they gave me colo, they told me I would

be alright. They became those people that I want to be around when I’m sad; or when

I’m happy! They became special to me…. Somehow, these two girls overcame the

barrier that I had in me when I arrived in Copenhagen, and found their way to my heart.

So…if to write a thesis you need a certain state of mind, a certain focus, a certain

equilibrium, they were certainly the people that made sure during this year that I had

them. And as Luísa said in the airport, when we were running for the security, there are

certain people in life that you don’t need to know for too long, to know that they are

going to stay in your life for a long time.

Finally, I would like to thank the NSF, the Moore Foundation and the European BIOTRIANGLE project for supporting the Course on Marine Bioinformatics "Marine Omics" at the University of Delaware, which I was fortunate to be selected to attend, and which was definitely a turning point for the writing of this dissertation.

This work is dedicated to my brother Filipe, to my Avó Fernanda and to my parents.

But mainly to my parents, since without them, I wouldn’t have had the opportunity to go

to Denmark in the first place. And they keep arguing that they are doing what they are

supposed to, but they are not – they are doing more, and they should know that I am

extremely grateful, that I am not forgetting what they are giving me, and that I love

them.


So long, and thanks for all the fish!


Resumo

The Challenger Deep in the Mariana Trench is one of the most extreme environments on the face of the Earth. The combination of a low temperature of 2.5 °C and a pressure of almost 112 MPa, due to the 11 km water column, makes it unique and, for that reason, an object of human curiosity. However, traditional laboratory methods of microbial culture made it very difficult to obtain a complete picture of the community inhabiting this environment, since not all microorganisms can be cultured and the extreme conditions are not simple to recreate.

In 1998, Jo Handelsman coined the term “Metagenomics” in an attempt to study the microflora as a whole, which she named the metagenome, instead of studying individual organisms. Metagenomics has since evolved to encompass the identification of the genomic sequences of a community, as well as their functional and evolutionary analysis.

A metagenomic analysis typically comprises several steps, starting with sampling, followed by filtration (which is optional, depending on the aim of the study) and sequencing, and ending with the analysis of the sequences and the publication of the data generated. This work dealt exclusively with the last two steps.

The aim of the work was not only to raise more questions, as is usual in metagenomic analyses, but also to investigate which community inhabits this environment and to explore some of its metabolic potential. However, with the publication of a study describing the same samples, an additional aim arose: to explore the results in order to corroborate the finding that oxygen is consumed along the sediment.

The analysis followed the approach normally used in similar studies. Of the 8 initial samples, corresponding to 5 cm intervals from the surface down to a sediment depth of 40 cm, 7 were sequenced. Initially, the sequences were automatically pre-processed so that only relevant and reliable information passed to the next stage. Specifically, the adaptors used in sequencing were removed, as were sequences that were too short and low-quality bases. This was done with the “Biopieces” collection of tools, which allows commands to be organised into a pipeline in a simple and intuitive way.

Next, the sequences were assembled with the program IDBA-UD in order to generate longer sequences that can be annotated with a higher percentage of identity and, consequently, with greater confidence. Once again, this step is optional, since assembling sequences loses abundance information. Before annotation, the sequences were classified as coding or non-coding, and the former were then mapped against known sequences in databases. Annotation was both taxonomic and functional. All the steps after assembly were carried out by the MG-RAST server; however, both assembled sequences (“contigs”) and unassembled sequences (“reads”) were submitted, so as to have information on abundance as well as solid information on particular features of interest.

The results generated by MG-RAST clearly show that Betaproteobacteria dominate the surface sample (0-5 cm), whereas in the remaining samples Gammaproteobacteria is the most abundant class. Interestingly, while Gammaproteobacteria in samples 1 (0-5 cm) and 2 (5-10 cm) are dominated by a single genus, the number of abundant genera increases from sample 3 (10-15 cm) to sample 8 (35-40 cm). In terms of alpha-diversity, sample 1 has the highest value (430.83 species), compared with the others, which range from 184.10 to 252.14 species. Beta-diversity was calculated between all samples using the “vegan” package of the R statistical programming language. It was speculated that it might correlate with depth, but this was not the case. To determine whether there was any relationship between sediment depth and community composition, principal coordinates analysis (PCoA) was used. These results did not confirm the hypothesis; however, when compared with samples from other, very different projects, the Mariana Trench samples grouped together consistently, showing that the community composition is characteristic of this environment. In addition, a rarefaction curve was generated, which is used to check whether the sequencing effort was sufficient to represent the whole community; since the curves of the 7 samples are approaching the asymptote, it could be concluded that the results are reasonable.

In functional terms, the analysis focused on energy metabolism. Most of the reads assigned to this metabolism mapped to oxidative phosphorylation, which is the last step of aerobic respiration. The contigs that mapped to the same pathway showed more than 91% identity to sequences in the chosen database, which indicates that the results are credible.

Methane and nitrogen metabolism were also investigated and, although less abundant, some enzymes involved in methanogenesis and in the nitrogen cycle were identified in the contigs.

Finally, a general map of all the enzymes identified in the samples was generated using the program iPath, which is based on the KEGG metabolic maps. It should be noted, however, that this mapping can be erroneous, as became apparent when photosynthesis was shown as present, which is highly unlikely at a depth of 11 km. On investigation, this turned out to be due to an ATPase that is present in both photosynthesis and oxidative phosphorylation.

The results allow the conclusion that the oxygen consumption measured in the study carried out by collaborators is indeed due to aerobic metabolism, even in the deepest sediment layers. That study also predicted that the pronounced mineralisation processes in this environment are mediated by the microbial community, which is consistent with the presence of enzymes involved in the nitrogen cycle. The dominance of Gammaproteobacteria is shared by Pacific Ocean sediments at a depth of 4000 m, as well as by sediments in the Pacific Arctic Ocean, which are likewise at low temperatures. Curiously, the microflora of deep-sea hydrothermal vents, at more than 310 °C, is also dominated by Gammaproteobacteria.

This study showed that it is possible to investigate in detail the composition of the bacterial community of extreme environments. However, the work could have been more robust had there been replicates of the sampling units and more contextual data to allow comparisons with other studies. In the future, it would also be interesting to take samples at various depths of the Challenger Deep in order to study how community composition varies with depth.

Since this field is still quite young, the collection of available tools, although vast, is still subject to improvement. The results presented here may therefore prove inaccurate 10 years from now. It is also likely that an alternative choice would not produce exactly the same results. The product of this work is thus the result of the choice of tools and their parameters, with all the advantages and drawbacks inherent to them.


Abstract

The emergence of metagenomics has allowed the study of the microbial community at the deepest point on Earth: the Challenger Deep in the Mariana Trench. Its extreme conditions (a water depth of almost 11 km, a temperature of 2.5 degrees Celsius and a pressure of around 112 MPa) made a comprehensive study of its microecology very difficult while such studies depended on culturing methods. This metagenomic analysis included taxonomic identification and an exploration of some of the functional potential of the genomic sequences of the community, which were generated by Illumina next-generation sequencing, thereby bypassing the need for cloning. Here we show that Proteobacteria clearly dominate this environment, but that there is no obvious correlation between sediment depth and community composition. Moreover, the abundance of enzymes involved in oxidative phosphorylation in all samples suggests aerobic activity within the sediment. This supports the finding that oxygen is consumed along the depth of the sediment. An extensive description of all the data generated would have been prohibitive; however, as soon as the data become available, the public will be able to search them for features of interest.

Keywords: metagenomics, Mariana Trench, Challenger Deep, extreme environments,

Illumina, community structure, energy metabolism



Contents

Acknowledgements
Resumo
Abstract
List of Figures
List of Tables
1. Introduction
1.1. Background
1.2. Metagenomic Analysis
1.3. Objective
1.4. Structure of the thesis
2. Methods
2.1. Sample collection, preparation and sequencing
2.2. Preliminary Analysis
2.3. Biopieces
2.4. Assembly
2.5. MG-RAST
3. Results
3.1. Taxonomic Hits Distribution
3.2. Functional Category Hits Distribution
4. Discussion
5. Conclusion
References
Appendix


List of Figures

Figure 1 - Challenger Deep location (11º 22.1' N, 142º 25.8' E)
Figure 2 - Total number of metagenomics articles published since 1998
Figure 3 - Cleaning script
Figure 4 - order_pairs script
Figure 5 - Genome assembly strategies: Hamiltonian and Eulerian cycles [31]
Figure 6 - Taxonomic distribution of the reads at the domain level
Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)
Figure 8 - β-diversity barchart
Figure 9 - Rarefaction curve of annotated species richness
Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)
Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated Sludge; Green - Gut Microbiota
Figure 12 - Number of features in the reads of sample 7 annotated by the different databases
Figure 13 - Number of features in the contigs of sample 7 annotated by the different databases
Figure 14 - Oxidative Phosphorylation, pathway ko00190
Figure 15 - Photosynthesis, pathway ko00195
Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples
Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8
Figure 18 - Metabolic map of the seven samples
Figure 19 - Oxygen micro-profiles at 6,018 m water depth (a); and at Challenger Deep (b) [1]
Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level
Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1
Figure 22 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 2
Figure 23 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 3
Figure 24 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 5
Figure 25 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 6
Figure 26 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 7
Figure 27 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 8
Figure 28 - Beta diversity related to spatial distance
Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity
Figure 30 - Heatmap of the reads against the M5NR database at 90% identity


List of Tables

Table 1 - Percentage of reads removed with the cleaning
Table 2 - Analysis of the assembly with minimum contig size 200 bp
Table 3 - Analysis of the assembly with minimum contig size 500 bp
Table 4 - IDs of the contigs submitted to MG-RAST
Table 5 - α-diversity
Table 6 - Pairwise β-diversity


1. Introduction

1.1. Background

The Challenger Deep in the Mariana Trench is one of the most extreme environments on Earth, with a depth of almost 11 km, a temperature of 2.5 degrees Celsius[1] and a pressure of around 111.79 MPa, calculated assuming a mean sea-water density of 1036 kg/m³[2] and a gravitational acceleration of 9.81 m/s²[3]. It is located at roughly 11º 22.1' N, 142º 25.8' E[1] (Figure 1).

Figure 1 - Challenger Deep location (11º 22.1'N 142º 25.8' E)
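The pressure quoted in this section is consistent with the hydrostatic relation P = ρgh. The short sketch below reproduces the figure; the density and gravity values are those given in the text, while the depth of exactly 11,000 m is an assumption (the text only says "almost 11 km"), and atmospheric pressure is ignored.

```python
# Hydrostatic pressure of a water column: P = rho * g * h.
RHO_SEAWATER = 1036.0  # kg/m^3, mean sea-water density quoted in the text
G = 9.81               # m/s^2, gravitational acceleration quoted in the text
DEPTH_M = 11_000.0     # m, ASSUMPTION: "almost 11 km" taken as exactly 11 km


def hydrostatic_pressure_mpa(rho, g, depth_m):
    """Return the pressure of the water column in MPa (atmospheric pressure excluded)."""
    return rho * g * depth_m / 1e6  # Pa -> MPa


pressure_mpa = hydrostatic_pressure_mpa(RHO_SEAWATER, G, DEPTH_M)
print(f"{pressure_mpa:.2f} MPa")
```

With these inputs the result rounds to the 111.79 MPa stated above, which suggests that a depth of 11 km was used in the original calculation.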

It has been a subject of human curiosity for many years[4]; however, until now there had been no detailed study of its microecology. With the emergence of metagenomics, it finally became possible to unravel which organisms live at the deepest point on Earth, and what they are doing.

The term "Metagenomics" was first used in 1998 by Jo Handelsman[5] in an effort to study the microflora as a unit, the metagenome, instead of addressing each type of organism individually.

Previously, it was thought necessary to study the morphology, physiology and pathogenic characters of a microorganism in order to classify it[6], but since Woese pioneered the use of 16S sequences for classification in 1977[7], sequence comparison has been widely used and accepted as valid for this purpose.

With the development of sequencing technology, one can now take a sample directly from the environment, extract its DNA, sequence it, and infer the microbial composition of the sample, thereby overcoming the bottleneck of growing pure cultures in the laboratory. This method enables the discovery of new forms of life that are not cultivable, and the assessment of the genetic richness and diversity, as well as the metabolic potential, of a community of organisms as a whole[8].

Metagenomic analysis can accordingly be defined as “the identification, and functional

and evolutionary analysis of the genomic sequences of a community of organisms”.[9]

Moreover, the paradigm that most of the microbial world was already known gave way to the acknowledgement that there is still a lot to learn and explore[10]. Discovering new forms of life in extreme environments can provide insights into a variety of topics, such as the biogeochemical activities that occur in the ocean[11] and the impact that human activity may have on them[12].

1.2. Metagenomic Analysis

To analyse a metagenome, several steps are typically involved, from the experimental

design to sharing the data[13]. Firstly, one has to obtain the samples. Ideally, true

replicates should be taken as well. Afterwards, one may filter the samples, to target a

(more-or-less) specific group of organisms [14].

The following step is sequencing. There are several technologies for sequencing DNA, each with its own advantages and weaknesses. The Mariana Trench sediment samples were sequenced using Illumina's paired-end assay. Its advantage is that it is cheap and generates a large number of reads per run; however, the reads are very short (50-250 bp), which can pose a problem for assembly and comparison, since it becomes more difficult to assign a read unequivocally to a template[15].

Illumina's technology consists of attaching random DNA fragments to a surface, amplifying them to form clusters of the same sequence, and then using them as templates for repeated cycles of polymerase-directed single-base extension. Single-base extension is ensured by using 3′-modified nucleotides labelled with a removable fluorophore. After the identity of the incorporated nucleotide has been determined by laser-induced excitation of the fluorophores, the fluorophores and the side arm (which prevents the incorporation of more than one nucleotide per cycle) are removed. The images of the fluorescent signal are used to determine the sequence (each nucleotide is attached to a fluorophore of a different colour) and its quality, defined as the likelihood of each call being correct[16].
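To make the notion of base-call quality concrete, the sketch below converts between a quality score and the probability that a call is wrong. It assumes the Phred convention, Q = -10 log10(P), which is the scale commonly used for Illumina quality scores; the function names are illustrative and are not taken from any tool used in this work.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# A few illustrative scores; higher scores mean more reliable calls.
for q in (15, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {1 - p:.2%}")
```

Under this convention, each increase of 10 in the score corresponds to a tenfold drop in the error probability.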

The paired-end option means that each fragment is sequenced from both of its ends, one read from each strand, which is helpful for the assembly[17].
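The relationship between the two reads of a pair can be sketched as follows. This is a simplified illustration under stated assumptions: the fragment sequence and the read length are hypothetical, and sequencing errors and adaptors are ignored. The forward read starts at one end of the fragment, while the reverse read starts at the other end and is reported as the reverse complement of the original strand.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Simulate an error-free read pair taken from the two ends of a fragment."""
    forward = fragment[:read_length]
    reverse = reverse_complement(fragment)[:read_length]
    return forward, reverse


fragment = "ATGGCGTACCTGAAGTTCGA"  # hypothetical 20 bp fragment
fwd, rev = paired_end_reads(fragment, read_length=6)
print(fwd, rev)
```

Because the two reads are known to come from opposite ends of the same fragment, an assembler can use their expected distance and orientation as an additional constraint.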

Assembly is the next step in the Metagenomic analysis pipeline, although it is

sometimes skipped. Its usefulness is debatable[18], given that the accuracy of the

assemblers is difficult to assess, since there is currently no microbial community with

known reference sequences to compare to[13].

The main problem with assembly is that it distorts abundance information, since abundant fragments will be considered as belonging to the most abundant species, when in reality they may be present in rare species[18]. Moreover, some fragments may be incorrectly discarded as mistakes or repeats, or joined up in the wrong places or orientations[19]. Nonetheless, if these setbacks are taken into account in the analysis, assembly can be advantageous, as it produces longer sequences that are more easily and unambiguously annotated.

Gene prediction and annotation usually follow. The first classifies the sequences as

coding or non-coding, and the second tries to find homology between the coding

sequences and known sequences stored in databases. Once again, these methods

have their own flaws, mainly because they are based on models, hence failing to

predict exceptions that can occur in the biological world.
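As a rough illustration of what classifying a sequence as coding or non-coding involves, the sketch below scans the three forward reading frames for open reading frames (an ATG start codon followed in frame by a stop codon). This is a deliberately naive stand-in, not the method used by the tools in this work: real gene callers rely on statistical models, also scan the reverse strand, and handle partial genes. The function name and the `min_codons` threshold are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs in the forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons as well.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


example = "CCATGAAATTTTAGGG"  # hypothetical sequence with one short ORF
print(find_orfs(example))
```

A model-based predictor would additionally score codon usage and sequence context, which is why, as noted above, such methods can miss genes that do not fit the model.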

Typically, the final step is to share the sequence data on public databases together

with the metadata. Contextual data is necessary to compare with other datasets,

essentially making the sequences useful for the database and the scientific community.

By complying with standard languages for metadata, such as MIMS, the data becomes

more accessible, as complex searches will retrieve more information[20].

The drawbacks surrounding metagenomic analysis are not at all surprising if one considers that it is still a very young field. A quick search on Web of Knowledge[21] for the total number of articles featuring the term “metagenome” or “metagenomics” gives a very clear picture of how novel this field is, and of how much data has been produced (Figure 2).

As the popularity of the field has grown, a multitude of tools has been developed, making the choice of which one to use far from trivial. There is still no evident consensus on which is the best tool for each step (not even for sequencing), so the errors in the data are most likely directly related to the flaws in each method, which means that a different set of methods will yield a different set of errors.

Figure 2 - Total number of metagenomics articles published since 1998

Given this explosion of data, an obvious question concerns its applicability. One example is bioremediation[22]. The process of biodegradation encompasses several metabolic pathways that, when considered on a community basis instead of an individual basis, lead to a global understanding of what is essential and what is superfluous, easing the design of such a system. Moreover, industry is always in search of novel enzymes and processes[23].

Even so, metagenomics tends to be regarded as exploratory research, raising more

questions instead of addressing them. Accordingly, the aim of this project was not only

to answer some simple questions, but also to raise some more, and hopefully to

encourage further studies in this environment.

1.3. Objective

This project dealt solely with the analysis of the raw data output by sequencing. The goal was to assess the taxonomic distribution of the community along the depth of the sediment and to explore its metabolic potential, using the most adequate tools. However, with the publication of the article[1] that included these sediment samples, the focus shifted to assessing whether the data generated by this analysis would corroborate the published data, namely to confirm the O2 consumption throughout the sediment depth.

[Figure 2, bar chart. Series: number of articles with the keyword metagenome* or metagenomic*. x-axis: Year; y-axis: Total number of articles (0 to 14,000). Bar labels, in order: 1, 3, 4, 7, 19, 52, 110, 225, 383, 637, 1,046, 1,689, 6,538, 9,381, 13,106, 13,853.]


1.4. Structure of the thesis

This report is organized in five chapters. The Introduction presents background information on the sampling site and on the technology and pipeline typically employed in this kind of study. The second chapter covers the methodology, with an explanation of each method and its output. Chapter three presents selected results, and chapter four gives a critical discussion of them. The last chapter contains the conclusion, with final remarks and some future directions for similar studies.

2. Methods

Seven out of eight sequenced samples from different depths were analysed. Each sample corresponded to a 5 cm depth interval, from 0-5 cm (sample 1) to 35-40 cm (sample 8). The data were cleaned using the Biopieces collection of tools[24], the reads were assembled using IDBA-UD[25] [26], and both the generated contigs and the clean reads were submitted to MG-RAST[27].

2.1. Sample collection, preparation and sequencing

The upstream methodology was carried out by collaborators and consisted of the following: DNA was extracted from 5 g of sediment collected at different depths from the Challenger Deep, Mariana Trench, at 10,900 m, using the PowerMax soil DNA isolation kit (MoBio Laboratories, CA USA). Eight DNA samples, each corresponding to a different depth, were sent to BGI-Shenzhen (China) for library preparation and sequencing. Since one of the samples (sample 4: 15-20 cm) did not contain enough DNA for library preparation (as reported in the Sample Test Report of BGI), 14 fastq files were received back, 2 for each of the seven samples: one with the forward and another with the reverse reads.

2.2. Preliminary Analysis

The initial number of reads on each sample ranged from around 41 million to almost 84

million (Table 1).

Table 1 - Percentage of reads removed by the cleaning

Sample   Raw reads (forward + reverse)   Clean reads   Reads removed (%)
1        58,814,066                      43,569,096    25.921
2        45,533,260                      18,717,708    58.892
3        47,163,190                      34,419,612    27.020
5        83,968,942                      36,784,382    56.193
6        61,751,498                      43,891,904    28.922
7        46,894,236                      33,786,396    27.952
8        41,030,848                      28,508,242    30.520
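The last column of Table 1 follows directly from the two read counts. The short Python check below recomputes it; the counts are copied from Table 1, while the function and variable names are illustrative and not part of the original analysis.

```python
# Raw and clean read counts copied from Table 1: sample -> (raw, clean).
TABLE_1 = {
    1: (58_814_066, 43_569_096),
    2: (45_533_260, 18_717_708),
    3: (47_163_190, 34_419_612),
    5: (83_968_942, 36_784_382),
    6: (61_751_498, 43_891_904),
    7: (46_894_236, 33_786_396),
    8: (41_030_848, 28_508_242),
}


def percent_removed(raw: int, clean: int) -> float:
    """Percentage of raw reads discarded by the cleaning step."""
    return 100.0 * (raw - clean) / raw


for sample, (raw, clean) in TABLE_1.items():
    print(f"sample {sample}: {percent_removed(raw, clean):.3f}% removed")
```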

2.3. Biopieces

Low-quality residues were removed from the ends of the reads, as were the adaptors used in the sequencing. Reads shorter than 30 bp were also excluded, as were reads with a local mean score under 15, to overcome errors propagated from cycle to cycle[28]. The cleaning removed from about 26% to almost 59% of the reads in the samples (Table 1). The Biopieces script used is shown in Figure 3.

The tool trim_seq removes residues from the ends of sequences whose quality, in the scores of the FASTQ file, is below the specified minimum quality (in this case 25). The flag “-l” makes sure that residues are removed until a stretch of at least 3 residues with good quality is found, to avoid a premature termination due to a single good-quality residue at the end. This step is necessary to overcome the effect of phasing and pre-phasing. These are caused by incomplete removal of the 3' terminators and fluorophores, by sequences missing an incorporation cycle, or by the incorporation of nucleotides without effective 3' terminators[28]. As a result, each cycle’s signal is affected by the signal of the previous and subsequent cycles, hindering the detection of the right base.
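The end-trimming rule described for trim_seq can be illustrated with a small Python sketch. It is written from the description above only; the real Biopieces implementation may differ in details, and the function name and example read are hypothetical.

```python
def trim_ends(seq, quals, min_qual=25, stretch=3):
    """Trim both ends until a run of `stretch` residues with quality >= min_qual.

    Assumption: a read without any such run is trimmed away completely.
    """
    n = len(quals)
    good = [q >= min_qual for q in quals]
    # Leftmost position where a full stretch of good residues begins.
    start = next(
        (i for i in range(n - stretch + 1) if all(good[i:i + stretch])), None
    )
    if start is None:
        return "", []
    # Rightmost position (exclusive) where a full stretch of good residues ends.
    end = next(i for i in range(n, stretch - 1, -1) if all(good[i - stretch:i]))
    return seq[start:end], quals[start:end]


# Hypothetical read: isolated good residues near the ends do not stop the trimming.
print(trim_ends("ACGTACGTAC", [10, 30, 12, 30, 30, 30, 28, 9, 31, 8]))
```

The isolated residues with quality 30 and 31 near the two ends are discarded, which is the premature termination that the stretch length is meant to avoid.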

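read_fastq -i - |                        # read FASTQ entries from standard input
trim_seq -m 25 -l 3 |                    # trim ends below quality 25 until 3 good residues in a row
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |   # locate (partial) adaptors
clip_adaptor |                           # remove the adaptors that were found
merge_pair_seq |                         # merge interleaved pairs into single records
grab -e 'SEQ_LEN_LEFT >= 30' |           # keep pairs whose left read has at least 30 bp
grab -e 'SEQ_LEN_RIGHT >= 30' |          # keep pairs whose right read has at least 30 bp
mean_scores -l |                         # compute the local (sliding-window) mean score
grab -e 'SCORES_MEAN_LOCAL >= 15' |      # keep records with a local mean score of at least 15
split_pair_seq |                         # split the merged records back into pairs
write_fastq -x                           # write the cleaned reads as FASTQ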

Figure 3 – Cleaning script

The tool find_adaptor searches the reads for the given adaptors (forward: ACACGACGCTCTTCCGATCT and reverse: AGATCGGAAGAGCACACGTC), or for partial adaptors at least 6 residues long (flags “-l” for the forward and “-L” for the reverse adaptor). By default, a percentage of the adaptor length is allowed for mismatches, insertions, and deletions (10%, 5% and 5%, respectively).

Once the adaptors are found, clip_adaptor removes them, based on the keys output by find_adaptor: ADAPTOR_POS_RIGHT, ADAPTOR_POS_LEFT, and ADAPTOR_LEN_LEFT.

The tool merge_pair_seq merges paired sequences, as long as they are interleaved. Sequence names must be either in Illumina 1.3/1.5 format, with a trailing “/1” or “/2”, or in Illumina 1.8 format, containing “1:” or “2:”. The sequence names should also match.

Grab is an improved version of Unix’s “grep”. It selects records that match a pattern, a regular expression, or a numerical evaluation. In this case, reads with a length of at least 30 bp were selected by examining the keys SEQ_LEN_LEFT and SEQ_LEN_RIGHT, output by merge_pair_seq.

Afterwards, mean_scores -l was used to calculate the local mean score: instead of calculating the mean as the sum of all the scores divided by the length of the string, it calculates means over a sliding window and returns the smallest value.
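The difference between the global and the local mean score can be shown with a short Python sketch. The window size of 5 is an assumption made for illustration (the window used by mean_scores is not stated here), and the quality values are hypothetical.

```python
def mean_score(quals):
    """Global mean: sum of all scores divided by the read length."""
    return sum(quals) / len(quals)


def local_mean_score(quals, window=5):
    """Smallest mean over all sliding windows of `window` residues.

    Assumption: reads shorter than the window fall back to the global mean.
    """
    if len(quals) <= window:
        return mean_score(quals)
    return min(
        sum(quals[i:i + window]) / window
        for i in range(len(quals) - window + 1)
    )


# Hypothetical read with a short low-quality patch in the middle.
quals = [40] * 10 + [5] * 5 + [40] * 10
print(mean_score(quals))        # the global mean hides the bad patch
print(local_mean_score(quals))  # the local mean exposes it
```

With a threshold of 15, this read would pass a filter on the global mean but fail the filter on the local mean, which is why the local variant was used for cleaning.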

Finally, split_pair_seq was used to split the sequences merged with merge_pair_seq.

To speed up the process, this script was run with GNU parallel[29] with the -L 8 option, which takes two records at a time (each record has 4 lines), to avoid breaking the pairs. GNU parallel allows Biopieces to be executed in parallel using multiple CPU cores and multiple servers[24].
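Why blocks of 8 lines keep the pairs intact can be seen in a small Python sketch. It only mimics the line grouping and is not how GNU parallel is implemented; the read names and sequences are hypothetical.

```python
from itertools import islice


def line_blocks(lines, size=8):
    """Yield consecutive blocks of `size` lines (8 lines = one interleaved pair)."""
    it = iter(lines)
    while block := list(islice(it, size)):
        yield block


# Two hypothetical interleaved read pairs, 4 FASTQ lines per read.
fastq = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]

for block in line_blocks(fastq):
    # Both mates of a pair always end up in the same block.
    print(block[0], block[4])
```

Any block size that is not a multiple of 8 would eventually separate a forward read from its reverse mate.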

The merge_pair_seq and split_pair_seq tools were created within this project, to overcome speed and memory problems caused by the use of order_pairs. The latter interleaves the sequences, as long as their names follow the Illumina 1.5 or 1.8 scheme, and adds a key stating whether the read is “paired” or “orphan”. It should be used after the trimming and grabbing steps; subsequently, only the paired reads should be grabbed.

Example of a script using order_pairs (Figure 4):

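read_fastq -i - |                        # read FASTQ entries from standard input
trim_seq -m 25 -l 3 |                    # trim ends below quality 25 until 3 good residues in a row
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |   # locate (partial) adaptors
clip_adaptor |                           # remove the adaptors that were found
grab -e 'SEQ_LEN >= 30' |                # keep reads of at least 30 bp
mean_scores -l |                         # compute the local (sliding-window) mean score
grab -e 'SCORES_MEAN_LOCAL >= 15' |      # keep reads with a local mean score of at least 15
order_pairs |                            # interleave the reads and flag each one in the ORDER key
grab -p 'pair' -k ORDER |                # keep only records whose ORDER key matches 'pair'
write_fastq -x                           # write the cleaned reads as FASTQ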

Figure 4 – order_pairs script

2.4. Assembly

The decision to assemble the short reads into longer contigs was based on the premise that “The longer the sequence information, the better is the ability to obtain accurate information.” Annotation becomes easier, since longer sequences yield more information to compare with the databases, and the same applies to the classification of DNA fragments. Assembly also raises the confidence in accuracy, compensating for the lower quality of single reads by having multiple reads cover the same segment, provided that the coverage is high enough[13]. The IDBA-UD algorithm is based on de Bruijn graphs and is adapted to metagenomic sequencing data with uneven sequencing depths[26].

In a de Bruijn graph every possible (k-1)-mer is assigned to a node, and a node has a directed edge to another node if there is some k-mer whose prefix is the former and whose suffix is the latter. The edges of the graph therefore represent all possible k-mers. The idea is to find an Eulerian cycle[30], that is, a cycle that traverses every edge exactly once, which spells the shortest superstring containing each k-mer exactly once (Figure 5).

By visiting each edge only once, the running time of the algorithm is roughly proportional to the number of edges[31]. In contrast, finding a Hamiltonian cycle[32], in which each node is visited exactly once, is an NP-complete problem[33] (no algorithm is known that solves it in time polynomial in the size of the input).

Applied to genome assembly, all the k-mers are the ones present in the reads

generated by sequencing[31], so ideally, the Eulerian cycle would generate the

genome. In practice this method cannot be applied directly, since there are some

assumptions that do not hold. Firstly, we cannot be sure that all the k-mers present in

the genome were generated; secondly, k-mers are not error-free; thirdly, each k-mer is

very likely to appear more than once in the genome; and lastly, we should not assume

that the genome is a single circular chromosome.

To deal with the first problem, instead of trying to assemble the reads, the algorithm

breaks them into smaller k-mers which are more likely to be representative of the whole

genome. To handle errors, the assembler chooses the path which is supported by

higher coverage. Regarding repeats, if a k-mer appears more than once in the

genome, it shall be represented by several edges connecting the same two nodes.

Finally, rather than searching for an Eulerian cycle, if the algorithm is modified to

search for an Eulerian path[34], then it is not required to end in the same node where it

began[31].
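The strategy described above (break the reads into k-mers, build the graph, follow an Eulerian path) can be sketched in Python on a toy example. This is only an illustration of the principle, not the IDBA-UD algorithm: it assumes error-free reads and collapses repeated k-mers into a single edge, and the reads and function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix."""
    kmers = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: traverse every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # A path (rather than a cycle) starts where out-degree exceeds in-degree by one.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1), next(iter(graph))
    )
    edges = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last base of each following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping, error-free reads from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]
print(spell(eulerian_path(de_bruijn(reads, k=4))))
```

With k = 4 every 3-mer of the toy sequence is unique, so the path is unambiguous. With k = 3 the nodes TG and GC are each visited twice and two different Eulerian paths exist, which illustrates why repeats make assembly difficult.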

Figure 5 – Genome assembly strategies: Hamiltonian and Eulerian cycles[31].

The main problem with metagenomic data is that species with different abundances will be represented by reads with uneven depth, and this cannot be disregarded as, e.g., an amplification bias. IDBA-UD solves this problem by adopting variable thresholds on the multiplicity of the k-mers, making them dependent on the sequencing depth of the neighboring contigs. The idea is that contigs with much lower sequencing depths than their neighbors are more likely to be erroneous[26]. Moreover, IDBA-UD uses paired-end information, namely the distance between the pairs, to resolve issues such as missing k-mers and repeats.

The assembler IDBA-UD was first used with the default minimum contig size setting (200 bp), which yielded N50 values from 3,545 to 9,240 bp. N50 is the length of the shortest contig in the smallest set of largest contigs whose combined length represents at least 50% of the assembly; it is one of the common assembly statistics[35]. A higher minimum contig size of 500 bp was then chosen, which improved the N50 values, so these contigs were uploaded to the MG-RAST server[27]. The complete analysis of both assemblies (using the Biopiece analyze_assembly) is shown in Table 2 and Table 3, including N50, contig length (maximum, minimum, mean and total) and the number of contigs.
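The N50 statistic can be computed with a few lines of Python. This is a minimal sketch of the definition, not the code of analyze_assembly, and the contig lengths are hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the largest contigs reach half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty assembly")


# Hypothetical contig lengths (bp): the total is 300, and 80 + 70 reaches 150.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Raising the minimum contig size removes many short contigs, which reduces the total length only slightly but shifts the halfway point towards longer contigs; this is why the 500 bp assemblies show higher N50 values.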

Table 2 – Analysis of the assembly with minimum contig size 200 bp.

Sample              1            2            3            5            6            7            8
N50 (bp)            3,545        4,705        4,397        6,430        8,726        7,136        9,240
Max length (bp)     614,662      215,848      466,951      305,081      551,041      305,025      452,197
Min length (bp)     200          200          200          200          200          200          200
Mean length (bp)    1,439        1,956        1,651        2,206        2,068        1,822        2,234
Total length (bp)   106,683,337  42,414,452   79,943,730   68,959,229   61,784,508   70,790,963   64,146,341
Number of contigs   74,124       21,681       48,418       31,250       29,868       38,839       28,704

Table 3 – Analysis of the assembly with minimum contig size 500 bp.

Sample              1            2            3            5            6            7            8
N50 (bp)            14,106       6,199        14,122       8,662        17,856       16,340       16,698
Max length (bp)     614,662      215,848      548,284      305,081      551,037      337,423      551,034
Min length (bp)     504          502          503          501          505          518          503
Mean length (bp)    3,261        3,180        3,384        3,883        4,277        4,235        4,492
Total length (bp)   76,906,252   36,624,750   60,016,272   59,604,646   51,333,136   55,864,260   54,104,132
Number of contigs   23,581       11,514       17,732       15,349       12,000       13,190       12,044

2.5. MG-RAST

MG-RAST[27] uses several bioinformatics tools in its pipeline. First, it filters sequences based on length, number of ambiguous bases and quality values. All the contigs uploaded from the 7 samples passed this preprocessing stage.

Then “technical replicates”, identified as sequences with identical first 50 base pairs, are removed in a step called Dereplication. Between 0.7% (surface sample) and 2.3% (sample 7) of the contigs were removed in this step, but no reads were removed. This can be explained by the use of the same reads for different contigs.
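The dereplication criterion (identical first 50 base pairs) can be sketched in Python as follows. This is not MG-RAST's code: which member of a group of replicates is kept is an assumption (here the first one seen), and the sequences are hypothetical.

```python
def dereplicate(seqs, prefix_len=50):
    """Keep one sequence per distinct prefix of `prefix_len` bases.

    Assumption: the first sequence seen with a given prefix is the one kept.
    """
    seen = set()
    kept = []
    for seq in seqs:
        prefix = seq[:prefix_len]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(seq)
    return kept


# Hypothetical sequences: the first two share their first 50 bases.
shared = "ACGT" * 13  # 52 bases
seqs = [shared + "AAAA", shared + "CCCC", "T" * 60]
print(len(dereplicate(seqs)))
```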

After that, FragGeneScan[36] is used to predict coding regions. This tool is an ab initio gene-calling algorithm that uses a hidden Markov model for coding and non-coding regions, and was developed specifically for metagenomes. It includes codon usage bias, sequencing error models and start/stop codon patterns. A gene is reported if it is longer than 60 bp, begins with either a start codon or an internal codon of a gene, and ends with either a stop codon or an internal codon. This way, both complete and partial genes are predicted. From 29,239 (sample 2) to 63,877 (sample 1) coding sequences were predicted within the contigs, and from 16,387,405 (sample 2) to 40,199,546 (sample 6) within the reads.

The sequences output by FragGeneScan are then clustered at 90% identity with qiime-uclust. QIIME[37] is a software package developed specifically for high-throughput amplicon sequencing data, although it also supports metagenomic data. It incorporates many third-party tools, such as UCLUST[38]. This algorithm clusters sequences based on their similarity, according to a threshold set by the user (or, in this case, by MG-RAST). Each cluster is represented by one sequence, the centroid: all the sequences in a cluster should have a similarity to its centroid above the threshold, and centroids should have a similarity below the threshold to each other. The algorithm starts with no centroids; each sequence is compared to the list of centroids and is either assigned to a cluster or selected as a new centroid.
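The greedy centroid procedure can be sketched in Python. The sketch is not UCLUST: the position-wise identity function is a toy stand-in for a real alignment identity, a sequence is assigned to the first centroid that reaches the threshold, and all sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence.

    Toy stand-in for the alignment identity used by real clustering tools.
    """
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first matching centroid, else open a new cluster."""
    clusters = {}  # centroid -> members (dicts keep insertion order)
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no centroid is similar enough
    return clusters


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCGG"]
for centroid, members in greedy_cluster(seqs).items():
    print(centroid, members)
```

The last sequence is only 80% identical to the third one, so it becomes a centroid of its own (a singleton). Note that the result of a greedy procedure like this depends on the order in which the sequences are processed.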

The centroids and the singletons (unclustered sequences) are then searched using BLAT[39] against the M5NR protein database. M5NR is a non-redundant protein database which incorporates data from GO[40], KEGG[36][37], NCBI[38][39], SEED[40][41], UniProt[47], VBI[48] and eggNOG[49], and has almost 16,000,000 sequences. BLAT builds an index of the database and then scans linearly through the query sequence, unlike BLAST, which builds an index of the query sequence and then scans linearly through the database. This makes BLAT faster, since it does not have to scan through a database of gigabases of sequence but only through a relatively short query sequence. BLAT, however, loses to BLAST in terms of sensitivity, since it needs an exact or nearly exact match to find a hit, making it suitable mostly for closely related species. The alignment identified between 25,261 (sample 2) and 50,816 (sample 1) protein features in the contigs, and from 4,859,593 (sample 2) to 10,890,942 (sample 6) in the reads. The number of protein features in the reads proved to be strongly correlated with the number of dereplicated reads (r = 0.98), using Pearson’s coefficient:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $x_i$ and $y_i$ are the number of dereplicated reads and the number of protein features in sample $i$, and $\bar{x}$ and $\bar{y}$ are the averages of the number of dereplicated reads and the number of protein features, respectively.
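Pearson's coefficient is straightforward to compute. The Python sketch below uses the standard textbook definition; the input values are hypothetical and are not the counts of this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (sqrt(var_x) * sqrt(var_y))


# Hypothetical values: y is exactly proportional to x, so r is 1 (up to rounding).
reads = [1.0, 2.0, 3.0, 4.0]
features = [2.0, 4.0, 6.0, 8.0]
print(pearson(reads, features))
```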

The results of the search against the M5NR database were retrieved for each of the samples, at 90% identity, and mapped onto the metabolic pathway maps based on KEGG data, using KEGG Mapper[41] [42] and iPath[50] [51].

Besides being the input for the Dereplication step, the filtered sequences are pre-screened to identify ribosomal sequences at 70% identity, and then clustered using UCLUST at 97% identity. The clusters are then searched for similarity against the M5RNA database (Greengenes[52], SILVA[53] and RDP[54]), using BLAT[39]. This alignment identified between 36 (sample 2) and 72 (sample 1) rRNA features in the contigs, whilst in the reads the number ranged from 19,014 (sample 2) to 38,639 (sample 1).

MG-RAST also automatically calculated the alpha diversity of each sample, to summarize the distribution of species-level annotations in that sample, using the following equation:

$$ \alpha = 10^{-\sum_{i=1}^{m} p_i \log_{10} p_i} $$

where $p_i$ is the ratio of the number of annotations for species $i$ to the total number of annotations, and $m$ is the total number of different species annotations, using all the annotation source databases incorporated by MG-RAST[27].
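A minimal Python sketch of this kind of alpha diversity estimate is given below. It assumes that the reported value is the antilog (base 10) of the Shannon index computed from the ratios p described above, which yields a number in units of species; the annotation counts are hypothetical.

```python
from math import log10


def alpha_diversity(counts):
    """Antilog of the Shannon index of the species-level annotation counts.

    Assumption: base-10 logarithm and antilog; species with zero count are skipped.
    """
    total = sum(counts)
    shannon = -sum((c / total) * log10(c / total) for c in counts if c > 0)
    return 10 ** shannon


# Hypothetical annotation counts per species.
print(alpha_diversity([25, 25, 25, 25]))  # evenly spread over 4 species
print(alpha_diversity([97, 1, 1, 1]))     # dominated by a single species
```

When all species are equally abundant the value equals the number of species, and it decreases towards 1 as a single species comes to dominate.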

Based on the abundances of each species in each sample (using the reads), the R package vegan[55] was used to calculate the beta diversity, as suggested in the manual[56]. It was calculated pairwise between samples, using the Sørensen index of dissimilarity:

$$ \beta = \frac{b + c}{2a + b + c} $$

where $a$ is the number of species shared by the two samples, and $b$ and $c$ are the numbers of species unique to each sample; as well as the widely known Whittaker's species turnover:

$$ \beta = \frac{\gamma}{\bar{\alpha}} - 1 $$

where $\gamma$ is the total number of species in the collection of samples (gamma diversity), and $\bar{\alpha}$ is the average richness per sample. Subtracting one guarantees that $\beta = 0$ when there are no excess species, that is, no heterogeneity between samples.
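Both beta diversity measures can be computed from species presence sets, as in the Python sketch below. The analysis itself used the R package vegan; this sketch only illustrates the definitions, and the species sets are hypothetical.

```python
def sorensen_dissimilarity(sample_1, sample_2):
    """(b + c) / (2a + b + c) for two sets of species."""
    a = len(sample_1 & sample_2)  # species shared by both samples
    b = len(sample_1 - sample_2)  # species unique to the first sample
    c = len(sample_2 - sample_1)  # species unique to the second sample
    return (b + c) / (2 * a + b + c)


def whittaker_beta(samples):
    """gamma / mean(alpha) - 1 for a collection of species sets."""
    gamma = len(set().union(*samples))
    mean_alpha = sum(len(s) for s in samples) / len(samples)
    return gamma / mean_alpha - 1


# Hypothetical species sets.
s1 = {"A", "B", "C"}
s2 = {"B", "C", "D"}
print(sorensen_dissimilarity(s1, s2))
print(whittaker_beta([s1, s2]))
```

For exactly two samples the two measures coincide, as the example shows; Whittaker's turnover additionally summarizes a whole collection of samples in a single number.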

Rarefaction curves were also automatically generated. The idea is to repeatedly re-sample the pool of reads at random, plotting the average number of species represented by 1, 2, …, N reads[57].
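The resampling idea behind a rarefaction curve can be sketched in Python. This is not MG-RAST's implementation: the number of repetitions, the sampling without replacement and the fixed random seed are assumptions, and the read annotations are hypothetical.

```python
import random


def rarefaction_curve(read_species, depths, repeats=100, seed=0):
    """Average number of distinct species found in random subsamples of n reads."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    curve = []
    for n in depths:
        counts = [len(set(rng.sample(read_species, n))) for _ in range(repeats)]
        curve.append(sum(counts) / repeats)
    return curve


# Hypothetical pool of 100 reads, each annotated with one of four species.
reads = ["sp1"] * 50 + ["sp2"] * 30 + ["sp3"] * 15 + ["sp4"] * 5
for n, richness in zip([1, 10, 50, 100], rarefaction_curve(reads, [1, 10, 50, 100])):
    print(n, richness)
```

A curve that is still rising at the full sampling depth indicates that further sequencing would probably reveal more species.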

Krona[58] was used to view the percentage of reads with predicted proteins and

ribosomal RNA genes annotated based on all the databases.

3. Results

The reads and contigs submitted to MG-RAST were automatically assigned unique IDs, as indicated in Table 4.

Table 4 - IDs of the reads and contigs submitted to MG-RAST

Sample   Reads       Contigs
1        4525786.3   4518922.3
2        4525785.3   4518923.3
3        4525784.3   4518924.3
5        4525781.3   4518925.3
6        4525782.3   4518926.3
7        4525783.3   4518927.3
8        4525787.3   4518928.3

To compare the abundances among the samples, the results were extracted from the

reads, whereas to assess presence or absence of a defined feature, the contigs’

results were retrieved.

3.1. Taxonomic Hits Distribution

Extracting the best hit classification from the reads compared to M5NR using a

maximum e-value of 1e-5, a minimum identity of 90%, and a minimum alignment length

of 15 aa, it is clear that Bacteria, and more specifically Proteobacteria, largely dominate

in all the 7 samples (Figure 6 and Figure 20).

In terms of class, Betaproteobacteria seems to comprise 78% of Proteobacteria in

Sample 1, unlike the other samples, where Gammaproteobacteria seems to be the

dominant class (Figure 7). Sample 3 shows a larger representation of

Alphaproteobacteria compared to the other samples.

Most Gammaproteobacteria in sample 1 belong to Pseudoalteromonas and in sample 2 to Pseudomonas, whereas from sample 3 to sample 8 other genera, namely Marinobacter, become just as dominant (see Figure 21 to Figure 27).

Figure 6 – Taxonomic distribution of the reads at the domain level

Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)

In terms of α-diversity, calculated using the reads against all the annotation databases

used by MG-RAST, sample 1 shows the highest: 430.83 species. The other samples

have diversities between 184.10 species (sample 6) and 252.14 species (sample 7).

The values of α-diversity for all the samples are shown on Table 5.

Table 5 - α-diversity

Sample   α-diversity (species)
1        430.83
2        213.47
3        232.97
5        210.42
6        184.10
7        252.14
8        240.39

The β-diversity value, using Whittaker's species turnover, was 1.181461, and the pairwise comparisons are shown in Table 6 and Figure 8.

Table 6 – Pairwise β-diversity

Sample   1          2          3          5          6          7
2        0.422489
3        0.353043   0.319049
5        0.382264   0.283298   0.292187
6        0.364884   0.30632    0.292165   0.288654
7        0.360278   0.333708   0.314927   0.307126   0.287154
8        0.365677   0.324216   0.309876   0.306393   0.292684   0.294254

Figure 8 - β-diversity barchart

A correlation analysis of the distance between samples and their β-diversity shows no relation between them (Figure 28).

The rarefaction curves of annotated species richness for all the samples rise quickly at first and then flatten, but without leveling off towards an asymptote (Figure 9). This means that more reads would probably have revealed more species. Even so, these results allow a reasonable estimate of the community structure.

Figure 9 – Rarefaction curve of annotated species richness

The Principal Coordinate Analysis (PCoA) of the reads of the 7 samples, annotated against the M5RNA database using the Bray-Curtis measure (chosen for showing a robust relationship with ecological distance[59]), a maximum e-value of 1e-5 and a minimum identity of 97%, does not show a clear trend; neither does the analysis against the M5NR database with a minimum identity of 90% (Figure 10). See Figure 29 and Figure 30 for the heatmaps with the same thresholds and values normalized to the size of the samples.
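The Bray-Curtis measure used for these ordinations compares two abundance profiles. A minimal Python sketch of the standard definition follows; it is not MG-RAST's code, and the abundance values are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance profiles of equal length."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))


# Hypothetical abundances of three taxa in two samples.
print(bray_curtis([10, 0, 5], [5, 5, 5]))
```

The value is 0 for identical profiles and 1 for profiles that share no taxa; a PCoA then places the samples so that their distances approximate these pairwise dissimilarities.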

Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)

Nevertheless, when the samples are compared with 1) metagenomes from the gut microbiota of 91 pregnant women of varying prepregnancy BMIs and gestational diabetes status and their infants (http://metagenomics.anl.gov/linkin.cgi?project=265), and 2) metagenomes from activated sludge from 2 full-scale tannery wastewater treatment plants (http://metagenomics.anl.gov/linkin.cgi?project=922), the Mariana Trench samples clearly group together in a very distinct group. As these two environments are expected to be very different and quite different, respectively, from the deep-sea Mariana Trench samples, this is a good indicator of the reliability of the latter. See, for example, Figure 11 for a comparison against the M5NR database, at 90% minimum identity and a maximum e-value of 1e-5.

Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated Sludge; Green – Gut Microbiota

3.2. Functional Category Hits Distribution

Comparing the features annotated from the reads with those annotated from the contigs, it is noticeable that the contigs provide a much more reliable source for annotation, as seen from the range of e-values, which was expected. See, for example, sample 7 in Figure 12 and Figure 13. More features were predicted from the reads, but there were also many more reads than contigs.

Figure 12 - Number of features in the reads of sample 7 annotated by the different databases

Figure 13 – Number of features in the contigs of sample 7 annotated by the different databases

Moreover, taking again sample 7 as an example, only 50.7% of the predicted protein

features in the reads could be annotated with similarity to a protein of known function,

whereas 84.9% of the predicted protein features of the contigs were annotated.

Of all the databases used to compare the protein sequences generated from the contigs, SEED Subsystems[45] had the highest number of annotations (Figure 12 and Figure 13). It is worth noting, however, that each database holds a different type of annotation data, hence the different number of hits. Since the tools used to analyse the pathways (KEGG Mapper and iPath) use the KEGG database, the focus was put on the functional hierarchy given by KEGG Orthology (KO)[41][42].

Comparing the reads to KO, using a maximum e-value of 1e-5, a minimum identity of 90%, and a minimum alignment length of 15, on average 53% (±0.03) of the reads with predicted protein functions were annotated as belonging to the Metabolism category. Of those, 14% (±0.05) belong to Energy metabolism.

Roughly 100% of the Energy metabolism reads in sample 1 correspond to oxidative phosphorylation; in the rest of the samples, this value lies around 77% (±0.07).

In fact, the F-type H+-transporting ATPase subunit beta (K02112), involved in both oxidative phosphorylation (Figure 14) and photosynthesis (Figure 15), is the second most abundant hit in sample 1 (out of 54 hits), with an average identity of 91.06% and an average e-value exponent of -6.14.

Figure 14 - Oxidative Phosphorylation, pathway ko00190.

In sample 2, K02112 appears in 11th place (out of 239 hits) with an abundance of 9187

together with F-type H+-transporting ATPase subunit alpha (K02111) in 10th place with

an abundance of 9307.

In sample 3, K02112 has an abundance of 9513 and K02111 of 9758, appearing in 8th

and 6th, respectively, when sorting for abundance. For sample 5 the values are 13405

for K02112 and 12764 for K02111 (10th and 12th). Sample 6 has even higher

abundances for K02112 and K02111: 16632 and 16260 (8th and 9th most abundant). In

samples 7 and 8 they appear in 5th and 6th place, out of 108 and 115 hits, with

abundances of 11492 and 11257, and 10691 and 10294. In all samples from the

second to the seventh, these subunits have an average identity above 91.5%.

Figure 15 – Photosynthesis, pathway ko00195.

Using the contigs, with the same settings, only K02112 was found, and only in samples

2 and 8. However, the average alignment length of the hits was 356.55 and 332.22,

respectively, whereas for the reads it was 27.67 and 27.57. Nevertheless, other hits

also classified as belonging to Oxidative Phosphorylation were found, like NADH-

quinone oxidoreductase subunit (K13380 and K13378), NADH-quinone oxidoreductase

subunits (K00338 and K00340), F-type H+-transporting ATPase subunit c (K02110), V-

type H+-transporting ATPase subunits (K02118 and K02122), cytochrome c oxidase

assembly protein subunit 17 (K02260), nucleosome-remodeling factor 38 kDa subunit

(K11726), cytochrome o ubiquinol oxidase subunit III (K02299), cytochrome o ubiquinol

oxidase operon protein cyoD (K02300) and NAD(P)H-quinone oxidoreductase subunit

5 (K05577).

To address, with some degree of confidence, whether alternative energy metabolism

processes occur in any of the samples, the contigs results were further explored.

Indeed, all samples contained contigs involved in Methane Metabolism (Figure 16).

Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples.

In addition, contigs from samples 2, 5 and 8, matched hits from nitrogen metabolism

(Figure 17). In all the three samples, nitric oxide reductase subunit B (K04561)

(EC:1.7.2.5) was present, which is involved in denitrification (nitrate → nitrogen).

Sample 2 also had a nitrogenase iron protein NifH (K02588) (EC:1.18.6.1), a

nitrogenase molybdenum-cofactor synthesis protein NifE (K02587) and a nitrogen

fixation protein NifX (K02596).

Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8.

Finally, the map generated with iPATH (Figure 18) gives a general overview of the

pathways present, when combining all samples. It is worth noting that photosynthesis

appears mapped; however, this is most likely a misleading mapping, since the enzyme

identified is an F-type H+-transporting ATPase, which is involved in photosynthesis but

also in oxidative phosphorylation, as mentioned earlier.

Figure 18 - Metabolic map of the seven samples

4. Discussion

Marine sediments, and in particular hadal trenches, receive substantial deposition of

microbes and organic matter from the upper water layer[1], and provide a matrix of

complex nutrients and solid surfaces for microbial growth[60]. However, the low

temperature and the extreme hydrostatic pressure demand a certain degree of

adaptation from the organisms inhabiting such an environment. Even so, there seems

to be a fairly high diversity along the sediment depth, as seen in Table 5 and Figure 9.

Proteobacteria is the largest and most metabolically diverse group of Bacteria. They are all Gram-negative, and they are divided into 5 classes: alpha, beta, gamma, delta and epsilon[61]. The dominance of Gammaproteobacteria is in accordance with a study from the Pacific Arctic Ocean, where the temperatures are also very low[62], and to some extent with a study of sediments at 4000 m depth in the Pacific Ocean, where not only Gammaproteobacteria but also Alphaproteobacteria dominate the community[63]. Intriguingly, the outer layer of an actively venting black-smoker chimney from a hydrothermal vent field on the Juan de Fuca Ridge[64] is also dominated by Gammaproteobacteria, even though its temperature lies above 310 °C.

The PCoA graphs group together samples that exhibit similar abundance profiles, in terms of taxonomy or function. However, when comparing the seven samples, there is no obvious trend in the community along the depth of the sediment (Figure 10). Nevertheless, the fact that this project’s samples group together, and very distinctly from other projects’ samples, is a good indicator that this environment has its own community structure.

The poor correlation between β-diversity and distance between samples also supports

the PCoA results (Table 6 and Figure 28). This means that the difference in microbial

community composition (as defined in [65]) is most likely due to factors other than

depth. It is possible that, under such high pressure, some centimeters of sediment do

not really make a difference in the community structure. Alternatively, there might have

been some mixing of the communities during the sampling process.

It should be noted however, that the fact that the community as a whole does not show

a shift alongside the depth of the sediment, does not exclude the hypothesis that some

taxa correlate with it.

Regarding the decision to assemble, the range of e-values of the features annotated with the different databases, as well as the percentage of predicted protein features that were annotated, should provide some degree of confidence in the assembly.

The high number of hits of the oxidative phosphorylation pathway supported the

predictions from [1], that there is intensified O2 consumption within the sediment, unlike

in the sediment of the reference site (≈6000m of water depth), where the microbial

activity has reduced rates. This was supported by measurements of the O2

concentration throughout the depth of the sediment. Attenuation in the O2

concentration reflects higher rates of its consumption[1] (Figure 19), which is consistent

with the presence of genes involved in aerobic respiration in all the samples.

Figure 19 – Oxygen micro-profiles at 6,018 m water depth (a); and at Challenger Deep (b) [1].

Even though oxidative phosphorylation dominates the energy metabolism processes,

methane and nitrogen metabolism still play a part in the community’s energetic

potential.

Normally, methanogenesis is associated with anoxic environments; still, it is known that

even in oxic environments, anoxic microenvironments can form, where

methanogenesis takes place[61].

Once more, the predictions that there is intensified mineralization mediated by the

prokaryotic community at Challenger Deep[1] are supported by the contigs with

homology to features involved in nitrogen metabolism.

Finally, the misleading mapping of the ATPase (Figure 18), should be taken as an

example that care and criticism are fundamental when using automated tools.

5. Conclusion

This study was a first description of both the community structure and its functional potential in the Mariana Trench, an environment unique for its extreme conditions. The amount of data generated made it impractical to describe them in full. The energy metabolism was selected for this thesis, since it was interesting to compare it with the results from [1]. The finding that there are enzymes involved in the oxidative phosphorylation pathway in all 7 samples supported the published measurements of oxygen consumption throughout the sediment.

A taxonomic and/or functional gradient along the depth of the sediment was expected, but none seems to be present. Further investigation would help to establish whether any taxa are characteristic of a given depth.

The data used in the study will soon be publicly available on MG-RAST, therefore

accessible for additional investigation. However, in the future, it would be sensible to

sample with true replicates, and take a broader number of environmental

measurements, to allow the data to be more comparable to other studies. It would also

be interesting to take samples from sediments from other depths along the Challenger

Deep, to assess if the community uniqueness is due to the extreme depth or to the

overall conditions on that site.

To conclude, it is probable that in 10 years' time, with the development of new tools or the improvement of the existing ones, all of these results will be proved inaccurate. However, the aim of this thesis was neither to develop new tools nor to compare the existing ones, but to use them wisely and understand their purpose in this analysis. Hence, the argument of this project is that, with this set of tools, this is the product.


References

[1] R. N. Glud, F. Wenzhöfer, M. Middelboe, K. Oguri, R. Turnewitsch, D. E. Canfield, and H. Kitazato, “High rates of microbial carbon turnover in sediments in the deepest oceanic trench on earth,” Nature Geoscience, vol. 6, no. 4, pp. 284–288, Mar. 2013.

[2] R. Pawlowicz, “Key physical variables in the ocean: temperature, salinity, and density,” Nature Education Knowledge, vol. 4, no. 4, p. 13, 2013.

[3] “The international system of units.” Bureau International des Poids et Mesures, 2006.

[4] R. A. Lutz and P. G. Falkowski, “Ocean science. A dive to Challenger Deep.,” Science (New York, N.Y.), vol. 336, no. 6079, pp. 301–2, Apr. 2012.

[5] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman, “Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products,” Chemistry & Biology, vol. 5, no. 10, pp. R245–R249, Oct. 1998.

[6] Society of American Bacteriologists., Bergey’s manual of determinative bacteriology, 1st ed. Baltimore, Williams & Wilkins Co., 1923.

[7] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: The primary kingdoms,” Proceedings of the National Academy of Sciences, vol. 74, no. 11, pp. 5088–5090, Nov. 1977.

[8] P. Hugenholtz and G. W. Tyson, “Microbiology: metagenomics.,” Nature, vol. 455, no. 7212, pp. 481–3, Sep. 2008.

[9] E. M. Glass and F. Meyer, “Analysis of metagenomics data,” in Bioinformatics for High Throughput Sequencing, N. Rodríguez-Ezpeleta, M. Hackenberg, and A. M. Aransay, Eds. New York, NY: Springer New York, 2012, pp. 219–229.

[10] J. Handelsman, “Metagenomics: application of genomics to uncultured microorganisms.,” Microbiology and molecular biology reviews : MMBR, vol. 68, no. 4, pp. 669–85, Dec. 2004.

[11] X. Hao and T. Chen, “OTU analysis using metagenomic shotgun sequencing data,” PLoS ONE, vol. 7, no. 11, p. e49785, Nov. 2012.

[12] V. Iverson, R. M. Morris, C. D. Frazar, C. T. Berthiaume, R. L. Morales, and E. V. Armbrust, “Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota.,” Science (New York, N.Y.), vol. 335, no. 6068, pp. 587–90, Feb. 2012.

[13] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics - a guide from sampling to data analysis.,” Microbial informatics and experimentation, vol. 2, no. 1, p. 3, Jan. 2012.

[14] J. C. Wooley, A. Godzik, and I. Friedberg, “A primer on metagenomics.,” PLoS computational biology, vol. 6, no. 2, p. e1000667, Feb. 2010.

[15] N. Whiteford, N. Haslam, G. Weber, A. Prügel-Bennett, J. W. Essex, P. L. Roach, M. Bradley, and C. Neylon, “An analysis of the feasibility of short read sequencing.,” Nucleic acids research, vol. 33, no. 19, p. e171, Jan. 2005.


[16] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J. Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. J. Rasolonjatovo, M. T. Reed, R. Rigatti, C. Rodighiero, M. T. Ross, A. Sabot, S. V Sankar, A. Scally, G. P. Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H. Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. D. Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F. Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown, A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. J. 
Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin, and A. J. Smith, “Accurate whole human genome sequencing using reversible terminator chemistry.,” Nature, vol. 456, no. 7218, pp. 53–9, Nov. 2008.

[17] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies.,” PloS one, vol. 6, no. 3, p. e17915, Jan. 2011.

[18] H. Teeling and F. O. Glöckner, “Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective.,” Briefings in bioinformatics, Sep. 2012.

[19] M. Baker, “De novo genome assembly: what every biologist should know,” Nature Methods, vol. 9, no. 4, pp. 333–337, Mar. 2012.

[20] P. Yilmaz, R. Kottmann, D. Field, R. Knight, J. R. Cole, L. Amaral-Zettler, J. A. Gilbert, I. Karsch-Mizrachi, A. Johnston, G. Cochrane, R. Vaughan, C. Hunter, J. Park, N. Morrison, P. Rocca-Serra, P. Sterk, M. Arumugam, M. Bailey, L. Baumgartner, B. W. Birren, M. J. Blaser, V. Bonazzi, T. Booth, P. Bork, F. D. Bushman, P. L. Buttigieg, P. S. G. Chain, E. Charlson, E. K. Costello, H. Huot-Creasy, P. Dawyndt, T. DeSantis, N. Fierer, J. A. Fuhrman, R. E. Gallery, D. Gevers, R. A. Gibbs, I. San Gil, A. Gonzalez, J. I. Gordon, R. Guralnick, W. Hankeln, S. Highlander, P. Hugenholtz, J. Jansson, A. L. Kau, S. T. Kelley, J. Kennedy, D. Knights, O. Koren, J. Kuczynski, N. Kyrpides, R. Larsen, C. L. Lauber, T. Legg, R. E. Ley, C. A. Lozupone, W. Ludwig, D. Lyons, E. Maguire, B. A. Methé, F. Meyer, B. Muegge, S. Nakielny, K. E. Nelson, D. Nemergut, J. D. Neufeld, L. K. Newbold, A. E. Oliver, N. R. Pace, G. Palanisamy, J. Peplies, J. Petrosino, L. Proctor, E. Pruesse, C. Quast, J. Raes, S. Ratnasingham, J. Ravel, D. A. Relman, S. Assunta-Sansone, P. D. Schloss, L. Schriml, R. Sinha, M. I. Smith, E. Sodergren, A. Spo, J. Stombaugh, J. M. Tiedje, D. V Ward, G. M. Weinstock, D. Wendel, O. White, A. Whiteley, A. Wilke, J. R. Wortman, T. Yatsunenko, and F. O. Glöckner, “Minimum


information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.,” Nature biotechnology, vol. 29, no. 5, pp. 415–20, May 2011.

[21] “Web of Knowledge.” [Online]. Available: www.webofknowledge.com.

[22] J. L. Fox, “Natural-born eaters.,” Nature biotechnology, vol. 29, no. 2, pp. 103–6, Feb. 2011.

[23] P. Lorenz and J. Eck, “Metagenomics and industrial applications.,” Nature reviews. Microbiology, vol. 3, no. 6, pp. 510–6, Jun. 2005.

[24] “Biopieces.” [Online]. Available: www.biopieces.org.

[25] Y. Peng, H. Leung, S. Yiu, and F. Chin, “IDBA – a practical iterative de Bruijn graph de novo assembler,” in 14th RECOMB 2010, 2010, pp. 426–440.

[26] Y. Peng, H. C. M. Leung, S. M. Yiu, and F. Y. L. Chin, “IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.,” Bioinformatics (Oxford, England), vol. 28, no. 11, pp. 1420–8, Jun. 2012.

[27] F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards, “The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.,” BMC bioinformatics, vol. 9, no. 1, p. 386, Jan. 2008.

[28] M. Kircher, U. Stenzel, and J. Kelso, “Improved base calling for the Illumina Genome Analyzer using machine learning strategies.,” Genome biology, vol. 10, no. 8, p. R83, Jan. 2009.

[29] O. Tange, “GNU Parallel: the command-line power tool,” ;login: The USENIX Magazine, pp. 42–47, 2011.

[30] E. W. Weisstein, “Eulerian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.

[31] P. E. C. Compeau, P. A. Pevzner, and G. Tesler, “How to apply de Bruijn graphs to genome assembly.,” Nature biotechnology, vol. 29, no. 11, pp. 987–91, Nov. 2011.

[32] E. W. Weisstein, “Hamiltonian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.

[33] E. W. Weisstein, “NP-Complete Problem -- from Wolfram MathWorld.” Wolfram Research, Inc.

[34] E. W. Weisstein, “Eulerian Path -- from Wolfram MathWorld.” Wolfram Research, Inc.

[35] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data.,” Genomics, vol. 95, no. 6, pp. 315–27, Jun. 2010.

[36] M. Rho, H. Tang, and Y. Ye, “FragGeneScan: predicting genes in short and error-prone reads.,” Nucleic acids research, vol. 38, no. 20, p. e191, Nov. 2010.

[37] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T.


Yatsunenko, J. Zaneveld, and R. Knight, “QIIME allows analysis of high-throughput community sequencing data.,” Nature methods, vol. 7, no. 5, pp. 335–6, May 2010.

[38] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST.,” Bioinformatics (Oxford, England), vol. 26, no. 19, pp. 2460–1, Oct. 2010.

[39] W. J. Kent, “BLAT--the BLAST-like alignment tool.,” Genome research, vol. 12, no. 4, pp. 656–64, Apr. 2002.

[40] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.,” Nature genetics, vol. 25, no. 1, pp. 25–9, May 2000.

[41] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes.,” Nucleic acids research, vol. 28, no. 1, pp. 27–30, Jan. 2000.

[42] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets.,” Nucleic acids research, vol. 40, no. Database issue, pp. D109–14, Jan. 2012.

[43] E. W. Sayers, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko, and J. Ye, “Database resources of the National Center for Biotechnology Information.,” Nucleic acids research, vol. 37, no. Database issue, pp. D5–15, Jan. 2009.

[44] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, “GenBank.,” Nucleic acids research, vol. 37, no. Database issue, pp. D26–31, Jan. 2009.

[45] R. Overbeek, T. Begley, R. M. Butler, J. V Choudhuri, H.-Y. Chuang, M. Cohoon, V. de Crécy-Lagard, N. Diaz, T. Disz, R. Edwards, M. Fonstein, E. D. Frank, S. Gerdes, E. M. Glass, A. Goesmann, A. Hanson, D. Iwata-Reuyl, R. Jensen, N. Jamshidi, L. Krause, M. Kubal, N. Larsen, B. Linke, A. C. McHardy, F. Meyer, H. Neuweger, G. Olsen, R. Olson, A. Osterman, V. Portnoy, G. D. Pusch, D. A. Rodionov, C. Rückert, J. Steiner, R. Stevens, I. Thiele, O. Vassieva, Y. Ye, O. Zagnitko, and V. Vonstein, “The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.,” Nucleic acids research, vol. 33, no. 17, pp. 5691–702, Jan. 2005.

[46] R. K. Aziz, D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S. Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke, and O. Zagnitko, “The RAST Server: rapid annotations using subsystems technology.,” BMC genomics, vol. 9, p. 75, Jan. 2008.

[47] The UniProt Consortium, “Reorganizing the protein space at the Universal Protein Resource (UniProt).,” Nucleic acids research, vol. 40, no. Database issue, pp. D71–5, Jan. 2012.

[48] J. J. Gillespie, A. R. Wattam, S. A. Cammer, J. L. Gabbard, M. P. Shukla, O. Dalay, T. Driscoll, D. Hix, S. P. Mane, C. Mao, E. K. Nordberg, M. Scott, J. R. Schulman, E. E. Snyder, D. E. Sullivan, C. Wang, A. Warren, K. P. Williams, T. Xue, H. S. Yoo, C. Zhang, Y. Zhang, R. Will, R. W. Kenyon, and B. W. Sobral, “PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species.,” Infection and immunity, vol. 79, no. 11, pp. 4286–98, Nov. 2011.


[49] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, L. J. Jensen, C. von Mering, and P. Bork, “eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.,” Nucleic acids research, vol. 40, no. Database issue, pp. D284–9, Jan. 2012.

[50] I. Letunic, T. Yamada, M. Kanehisa, and P. Bork, “iPath: interactive exploration of biochemical pathways and networks.,” Trends in biochemical sciences, vol. 33, no. 3, pp. 101–3, Mar. 2008.

[51] T. Yamada, I. Letunic, S. Okuda, M. Kanehisa, and P. Bork, “iPath2.0: interactive pathway explorer.,” Nucleic acids research, vol. 39, no. Web Server issue, pp. W412–5, Jul. 2011.

[52] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, “Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.,” Applied and environmental microbiology, vol. 72, no. 7, pp. 5069–72, Jul. 2006.

[53] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner, “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.,” Nucleic acids research, vol. 41, no. Database issue, pp. D590–6, Jan. 2013.

[54] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, “The Ribosomal Database Project: improved alignments and new tools for rRNA analysis.,” Nucleic acids research, vol. 37, no. Database issue, pp. D141–5, Jan. 2009.

[55] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O’Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, “vegan: Community Ecology Package. R package version 2.0-7.” 2013.

[56] J. Oksanen, “Vegan: ecological diversity.”

[57] N. J. Gotelli and R. K. Colwell, “Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness,” Ecology Letters, vol. 4, no. 4, pp. 379–391, Jul. 2001.

[58] B. D. Ondov, N. H. Bergman, and A. M. Phillippy, “Interactive metagenomic visualization in a web browser.,” BMC bioinformatics, vol. 12, p. 385, Jan. 2011.

[59] D. P. Faith, P. R. Minchin, and L. Belbin, “Compositional dissimilarity as a robust measure of ecological distance,” Vegetatio, vol. 69, no. 1–3, pp. 57–68, Apr. 1987.

[60] Y. Wang, H.-F. Sheng, Y. He, J.-Y. Wu, Y.-X. Jiang, N. F.-Y. Tam, and H.-W. Zhou, “Comparison of the levels of bacterial diversity in freshwater, intertidal wetland, and marine sediments by using millions of illumina tags.,” Applied and environmental microbiology, vol. 78, no. 23, pp. 8264–71, Dec. 2012.

[61] M. T. Madigan, J. M. Martinko, P. V. Dunlap, and D. P. Clark, Brock Biology of Microorganisms, 12th ed. Pearson, 2009.

[62] H. Li, Y. Yu, W. Luo, Y. Zeng, and B. Chen, “Bacterial diversity in surface sediments from the Pacific Arctic Ocean.,” Extremophiles : life under extreme conditions, vol. 13, no. 2, pp. 233–46, Mar. 2009.


[63] K. T. Konstantinidis, J. Braff, D. M. Karl, and E. F. DeLong, “Comparative metagenomic analysis of a microbial community residing at a depth of 4,000 meters at station ALOHA in the North Pacific subtropical gyre.,” Applied and environmental microbiology, vol. 75, no. 16, pp. 5345–55, Aug. 2009.

[64] W. Xie, F. Wang, L. Guo, Z. Chen, S. M. Sievert, J. Meng, G. Huang, Y. Li, Q. Yan, S. Wu, X. Wang, S. Chen, G. He, X. Xiao, and A. Xu, “Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries.,” The ISME journal, vol. 5, no. 3, pp. 414–26, Mar. 2011.

[65] J. Wang, Y. Wu, H. Jiang, C. Li, H. Dong, Q. Wu, J. Soininen, and J. Shen, “High beta diversity of bacteria in the shallow terrestrial subsurface,” Environmental Microbiology, vol. 10, no. 10, pp. 2537–2549, Oct. 2008.


Appendix

Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level


Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1




Figure 28 - Beta diversity related to spatial distance

[Scatter plot: Correlation between Spatial Distance and Dissimilarity. x-axis: Distance (cm), 0 to 40; y-axis: Dissimilarity, 0 to 0.45. Linear fit: y = 0.0013x + 0.3036, R² = 0.0999]
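The linear fit reported for Figure 28 (y = 0.0013x + 0.3036, R² = 0.0999) relates the distance between pairs of samples to their dissimilarity; an R² of about 0.10 means that distance accounts for roughly 10% of the variance in dissimilarity. The following is a minimal sketch of how such a fit could be computed, not the procedure used in this thesis. It assumes Bray-Curtis dissimilarity, and the function names, the sample depths and the abundance table are all hypothetical.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (assumed metric)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def linear_fit(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r2


def distance_decay(depths_cm, abundances):
    """Pairwise |depth difference| versus pairwise dissimilarity, plus the fit."""
    pairs = list(combinations(range(len(depths_cm)), 2))
    xs = [abs(depths_cm[i] - depths_cm[j]) for i, j in pairs]
    ys = [bray_curtis(abundances[i], abundances[j]) for i, j in pairs]
    return xs, ys, linear_fit(xs, ys)


if __name__ == "__main__":
    # Hypothetical depths (cm) and taxon counts, for illustration only.
    depths = [0, 5, 10, 20, 40]
    counts = [
        [120, 30, 10],
        [110, 35, 15],
        [100, 40, 20],
        [90, 45, 25],
        [95, 30, 35],
    ]
    _, _, (slope, intercept, r2) = distance_decay(depths, counts)
    print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")
```

Because every sample contributes to several pairs, the points in such a plot are not independent, so the R² is best read as descriptive; a permutation-based approach such as the Mantel test (available in the vegan package [55]) would be one way to assess significance.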


Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity


Figure 30 - Heatmap of the reads against the M5NR database at 90% identity