IonGAP - Uni of Westminster 23-10-2015

Adrian Baez-Ortega, Transmissible Cancer Group

University of Cambridge

University of Westminster, 23/10/2015

iongap.hpc.iter.es

Hello!

My name is Adrian Baez-Ortega, and I am a bioinformatician and a PhD student at the Transmissible Cancer Group of the University of Cambridge.

However, I’m going to brief you about a Spanish project I’ve been involved in for the last two years, called IonGAP: an integrated Genome Analysis Platform for Ion Torrent sequence data.

bacterial genomics

This tool, which is currently online at iongap.hpc.iter.es, is intended as an accessible way to do bacterial genomics research.

This is a rapidly growing field, since infectious diseases are the main cause of death in most developing countries. Research on pathogens often requires bacterial genome sequencing, as a way of finding new treatments for preventing or fighting infectious diseases.

However, genome sequencing is still a luxury in many countries. Accessible sequencing technologies, such as Ion Torrent, are gaining popularity as a means of making “low-cost” bacterial genomics research possible.

Illustration: James Provost

PGM photo: Thermo Fisher Inc.

ion torrent sequencing

Let’s start from the beginning: Ion Torrent sequencing.

Ion Torrent is a genome sequencing technology commercialised by Thermo Fisher Scientific.

This machine is an Ion Torrent sequencer (a device used to read the sequence of DNA molecules) called Ion Personal Genome Machine, or PGM.

Unlike the leading sequencing technology, Illumina, Ion Torrent is not based on optical detection of nucleotides, but on the detection of changes in pH. This design brings an advantage in terms of sequencing speed (and cost, in some cases), which is why Ion Torrent is becoming a widely adopted choice for the sequencing of small genomes.

origins

Photo: NASA

Tenerife, Spain

The IonGAP project was conceived two years ago (2013), on the Spanish island of Tenerife, which I know is quite popular in the UK...

origins

University Institute of Tropical Diseases and Public Health

Photo: Univ. de La Laguna

This is the University Institute of Tropical Diseases and Public Health of the Canary Islands, part of the University of La Laguna.

The researchers here had an Ion PGM for bacterial genome sequencing, but they were having trouble assembling and analysing the data coming from it.

origins

Photo: José Luis Roda García

School of Computer Engineering

Universidad de La Laguna

So they just crossed the road to speak with professors at the School of Computer Engineering of the University of La Laguna, where I studied, and together they outlined a project aimed at solving their problem.

I carried out that project, and that’s why I’m a bioinformatician today.

objectives

Optimal set of applications and settings

Publicly accessible web platform

Ion PGM data

ACGTACGT

PGM photo: Thermo Fisher Inc.

The aims of the project were clear. In the first place, we wanted to take the raw output data from the sequencer, and design a processing pipeline from a set of optimal tools, optimally configured for dealing with Ion Torrent data.

And secondly, we decided to take a leap and make this platform available to the community in the form of a publicly accessible web platform.

And that’s how IonGAP began.

Genome assembly

Comparative genomics

Bacterial characterisation

platform functionality

The platform has 3 main functions.

First, bacterial whole-genome assembly from Ion Torrent sequence reads.

Second, comparative genomics, for comparing our assembled genome against a reference genome, in the form of both large-scale comparison (whole-genome alignment) and base-by-base comparison (variant calling).

And last, bacterial characterisation, for determining the specific species, strain, “family” within the strain, and even particular genomic traits of our bacteria, like the presence of plasmids or antibiotic resistance genes.

platform functionality

These 3 distinct tasks translate into 3 different modules in IonGAP, which are clearly displayed in the user interface.

Moreover, users have the possibility of disabling any of these modules, and running only the modules they are interested in.

applications

Microbiological research

Clinical pathology

Food safety

But what are the practical applications of this kind of analysis toolkit? These are three of the many fields that may benefit from integrated bacterial genome analysis.

The first is microbiological research, of course.

But there’s also clinical pathology, which demands fast routines for bacterial pathogenicity analysis.

And food safety controls, which screen food products for pathogens before they are delivered to consumers, and are therefore subject to time constraints as well.

existing alternatives

https://orione.crs4.it

Which platforms are already available for doing this kind of analysis? Surprisingly, we found only one comparable alternative to what we wanted to offer.

It’s a platform called Orione, which offers an astounding variety of genome assembly and analysis tools that can be combined into any workflow.

However, not all the routines we wanted to implement were included here, and we also wanted to do something really straightforward for the user.

So we opted to offer the user fewer options than Orione does, in order to increase the user-friendliness of our platform.

genome assembly module

Now, let’s get into the details. I’ll start with our most fundamental module, the one devoted to genome assembly.

Genome assembly (the reconstruction of a genome from the sequence reads) is performed by a program called a genome assembler.

The problem is, there are nearly 200 genome assemblers out there. And according to their authors, every one of them is the best.

So we needed to find out which assembler was really the best for our data.

genome assembly

I mentioned before that the leading sequencing technology is Illumina, which is quite different from Ion Torrent. In fact, the sequence reads that you obtain after Illumina sequencing are remarkably different from what you would get from Ion Torrent.

This means that most of the assemblers are not intended for working with our Ion Torrent reads, but with Illumina reads. And that was a major problem.

genome assembly

MIRA

Celera Assembler

SGA

ABySS

Ray

Velvet

SparseAssembler

Minia

SOAPdenovo

ALLPATHS-LG

Arapan

SPAdes

Edena

MaSuRCA

Euler

Forge

Geneious

So I started the project by looking for a selection of the most highly-regarded assemblers, and then discarding the ones which were specifically designed for Illumina only.

genome assembly

MIRA

Celera Assembler

SGA

ABySS

Ray

Velvet

SparseAssembler

Minia

That left me with these eight candidate assemblers, which can be grouped into two categories: overlap-layout-consensus assemblers, and De Bruijn graph assemblers.

genome assembly

MIRA

Celera Assembler

SGA

ABySS

Ray

Velvet

SparseAssembler

Minia

Overlap-layout-consensus

De Bruijn graph

Overlap-layout-consensus is an older family of assemblers that follow a more intuitive approach: they try to reconstruct the original DNA sequence by looking at the overlaps between sequence reads. They are well suited to long reads (such as Ion Torrent reads), but they are remarkably inefficient.

The second, more modern family, De Bruijn graph assemblers, uses a less intuitive approach based on the construction of a complex kind of graph called a De Bruijn graph. Their main advantage is their efficiency, but they are normally appropriate for short reads only, and thus they can have trouble when facing repetitive genomic regions.
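To make the difference between the two families a bit more concrete, here is a minimal Python sketch. It is a toy illustration only, not how any of the assemblers listed here is actually implemented: the function names, the three example reads and the choice of k = 4 are all assumptions made for the example.

```python
from collections import defaultdict


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Overlap-based idea: length of the longest suffix of `left`
    that is also a prefix of `right` (0 if shorter than `min_len`)."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """De Bruijn idea: break every read into k-mers and link the
    (k-1)-mer prefix of each k-mer to its (k-1)-mer suffix."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Three made-up reads, purely for illustration.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(longest_overlap(reads[0], reads[1]))
print(sorted(de_bruijn_graph(reads, k=4).items()))
```

The overlap function has to compare reads pairwise, which is where the cost of the first family comes from. The graph is built in a single pass over the reads, but a node such as "ACG" ends up with more than one outgoing edge, which is the kind of ambiguity that repeats create.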

genome assembly

Streptococcus agalactiae

I tested and optimised each of these eight assemblers using our best benchmark dataset, consisting of about 700,000 sequence reads from the sequencing of a Streptococcus agalactiae isolate.

This pathogen, which is usually only dangerous for pregnant women and new-born babies, was isolated in a hospital in Madrid, and sequenced in Tenerife.

I then compared the different settings for each assembler, and the assemblers against each other, according to the fragmentation of the assembly.

genome assembly

On one hand, I looked at the number of contigs, which are the contiguous DNA sequences produced by the assembler.

You can appreciate the stark contrast between MIRA and the rest of the assemblers. Here, MIRA was using its default settings, while the optimal parameter values are indicated for the others.

(Celera Assembler wasn’t even able to finish the assembly with the available resources.)

genome assembly

And on the other hand, we looked at the N50 length, which reflects the size distribution of the contigs: basically, the bigger, the better.

This figure is an exact complement of the previous one, showing the same ranking of assemblers.

This is telling us that MIRA assembles the reads into fewer, but much longer, contigs, and therefore it’s the most effective assembler for Ion Torrent sequence reads.
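For anyone who has not met the N50 before, it can be computed in a few lines of Python. This is a minimal sketch using the usual definition (the length of the contig at which the running total, going from longest to shortest, reaches half of the assembly size). The contig lengths in the example are invented and are not our benchmark results.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L add up to
    at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Two made-up assemblies of the same total size (1000 bases).
fragmented = [100] * 10
contiguous = [600, 300, 100]
print(len(fragmented), n50(fragmented))
print(len(contiguous), n50(contiguous))
```

Both toy assemblies cover the same number of bases, but the second one has fewer contigs and a larger N50, which is the pattern the two figures show.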

genome assembly

MIRA by Bastien Chevreux

Automatic editing

Data preprocessing

Fast read comparison

Smith-Waterman alignment

Contig assembly

Project finished

http://sourceforge.net/p/mira-assembler

MIRA is an all-terrain assembler developed by Bastien Chevreux, which is capable of assembling nearly any kind of data, including hybrid assemblies.

This is a basic diagram of the assembly workflow. After data preprocessing and cleaning, a fast all-against-all read comparison is performed, initiating a cycle of automatic sequence editing, contig assembly, Smith-Waterman alignment and comparison again, that is normally performed in five iterations.
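To give a feel for the Smith-Waterman step in that cycle, here is a minimal Python sketch of local alignment scoring. It is the textbook algorithm with a linear gap penalty, not MIRA’s own implementation, and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between `a` and `b`,
    keeping only two rows of the scoring matrix in memory."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment can restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Two made-up sequences sharing the segment "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The score only tells you how well the best shared segment matches. A real assembler also needs the alignment itself, which requires keeping the full matrix and tracing back through it.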

Parameters

genome assembly

number of assembly iterations

uniform read distribution

separation of long repeats in different contigs

maximum number of times a contig can be rebuilt during an iteration

minimum number of reads per contig

minimum size of a contig for being considered as "large"

minimum read length

minimum repeat length

minimum overlap length

minimum overlap score

At this point, I identified a number of parameters with potential relevance for the user, and played with them until I realised that all of them were already at their optimum values.

However, I decided to move these same parameters to the interface of our assembly module, so that the users themselves can play with them.

read preprocessing

Default MIRA

MIRA

PRINSEQ

ERNE-filter

Trimmomatic

Assembly fragmentation?

At this stage, we wondered if the quality of the assembly could be improved by preprocessing the sequence reads.

MIRA has a number of options for read trimming and quality control, which I considered. But also, we decided to try out these three external trimming tools, and then compared the resulting assemblies with that coming from the default settings.

To measure the overall quality of the assemblies, I initially relied on assembly fragmentation.

What I found was that the default settings always outperformed all the alternatives.

Mauve Assembly Metrics

MIRA user manual

For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave the data alone!

read preprocessing

But I was still sceptical.

So I resorted to this fantastic software, Mauve Assembly Metrics, which compares different assemblies according to a range of criteria.

This is just one example showing the density of missing and extra segments in the assembly, with respect to the reference genome. Here, the black line at the bottom represents the default assembly, and it likewise outperformed the others in every comparison.

So, after all, we decided to follow the author’s advice, and just leave the data alone.

genome assembly module

FastQC

MIRA

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Well, after some months… The assembly stage was solved.

We decided to add a purely informative read quality assessment with FastQC, and with this, we had our first and most important module.

We have subsequently improved it to allow several input file formats, input file submission by means of an FTP or Dropbox URL, optional output formats, and also the possibility of modifying the assembly parameters shown before.

genome assembly module

And as I said, any of our modules can be disabled.

But, in the event that you don’t need to use the genome assembly module, you’ll need to provide your own set of assembled contigs, as a required input for the other modules.

The possibility of using the user’s own contigs also makes the rest of the platform accessible to users of other sequencing technologies.

comparative genomics module

The next stop is the comparative genomics module.

comparative genomics module

Whole-genome alignment

Alignment plotting

Variant calling

SNP annotation

This module is composed of two different routines. In the beginning, all it did was whole-genome alignment of the assembled contigs against a reference genome provided by the user, complemented by different graphical representations of said alignment.

A bit later, we decided to increase its functionality by including two very relevant processes: variant calling directly from the sequence reads, and subsequent functional SNP annotation.

whole-genome alignment

http://darlinglab.org/mauve

For the whole-genome alignment step, we relied on Mauve, which is highly regarded in the comparative genomics field.

Besides being a graphical-user-interface application, Mauve can be used to obtain information like the percentage of missing and extra segments in the assembly, their location, and even a file containing their sequences.

alignment plotting

Mauve & genoPlotR

http://genoplotr.r-forge.r-project.org

The Mauve alignment is the seed for our next tool, genoPlotR.

Among other things, it is useful for plotting the alignment generated by Mauve.

alignment plotting

Circos & Circoletto

http://tools.bat.infspire.org/circoletto

http://circos.ca

We then took a more popular tool, Circos, together with its companion Circoletto, to make Circos diagrams of the whole-genome alignment, where the reference is usually depicted on the left, and the contigs on the right.

alignment plotting

MUMmer

http://mummer.sourceforge.net

And finally, MUMmer allows us to produce more classical figures, such as transversal alignment diagrams and genome coverage diagrams, among others.

variant calling and annotation

TRAMS: Tool for Rapid Annotation of Microbial SNPs

http://dx.doi.org/10.6084/m9.figshare.782261

http://cortexassembler.sourceforge.net

Regarding the other part of the module, we chose Cortex for calling variants (if the user provides the sequence reads for the assembly module).

Cortex has proven itself to be really fast and sensitive. After using it, we realised that most of the single-nucleotide variants (SNPs) that we had previously identified, by using Mauve to compare the assembled contigs to the reference genome, were artefacts.

We know that because one of the best things about Cortex is that it gives you not only the variants, but also their estimated likelihood values, so that you can set a quality threshold and keep only the really trustworthy variants.
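The thresholding idea can be sketched in a few lines of Python. Note that the record layout below is hypothetical: the `Variant` class, its `confidence` field and the threshold of 10.0 are assumptions made for the example, and do not reproduce Cortex’s actual output format or the cut-off used in IonGAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    position: int      # coordinate on the reference genome
    ref: str           # reference allele
    alt: str           # alternative allele
    confidence: float  # hypothetical score: higher means better supported


def keep_confident(variants: list[Variant], min_confidence: float) -> list[Variant]:
    """Keep only the calls whose score reaches the chosen threshold."""
    return [v for v in variants if v.confidence >= min_confidence]


# Made-up calls, purely for illustration.
calls = [
    Variant(1200, "A", "G", 48.5),
    Variant(3411, "C", "T", 2.1),
    Variant(9876, "G", "A", 17.0),
]
for variant in keep_confident(calls, min_confidence=10.0):
    print(variant.position, variant.ref, variant.alt)
```

Where exactly to put the threshold is a judgment call: a stricter value removes more artefacts, at the risk of discarding some real variants.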

variant calling and annotation

The SNPs found by Cortex are then fed to TRAMS, a tool for annotation of microbial SNPs. This is one of the most valuable routines in IonGAP, since it can characterise a bacterial isolate against the reference strain (this is useful for population genetics studies), while telling us the effect of the found variants on the bacterium’s genome.

comparative genomics module

Circos

Circoletto

genoPlotR

Mauve

MUMmer

Cortex

TRAMS

With this selection of tools in hand, we built the comparative genomics module, which only needs the contigs coming from the genome assembly module, and a reference genome in FASTA format.

Nevertheless, it is also capable of retrieving the reference genome from the NCBI if the user just provides its accession or GI number.

And, of course, it can be disabled if not needed.
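As an illustration of how a reference genome can be fetched from the NCBI given just an accession, here is a minimal Python sketch that builds a request URL for the NCBI E-utilities EFetch service. This is one possible way of doing it, and I am not claiming it is the exact mechanism used inside IonGAP. The function name and the accession string are placeholders for the example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def reference_fasta_url(accession: str) -> str:
    """Build an EFetch URL asking for a nucleotide record in FASTA format."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession provided by the user
        "rettype": "fasta",  # return the sequence as FASTA
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Illustrative accession string; substitute your own reference.
print(reference_fasta_url("NC_004368.1"))
```

The sketch only builds the URL. Downloading it (for instance with `urllib.request.urlopen`) and checking that the response really is a FASTA record are left out to keep the example short.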

bacterial classification & annotation module

And the last component of the platform is the bacterial classification and annotation module, which can also be divided into two main tasks.

bacterial classification & annotation module

Taxonomic classification

Multilocus sequence typing

Genome annotation

Identification of plasmids & virulence factors

First, we wanted to characterise our bacterial genomes beyond the strain level.

For that, we start by doing a taxonomic classification for identifying the bacterial species, followed by multilocus sequence typing for determining the sequence type of the genome.

And, on the other hand, we wanted to characterise the genome in terms of its genetic traits and pathogenic potential. We can do that by annotating the assembled genome, and also by searching for elements such as plasmids and virulence factors.

taxonomic classification & mlst

16SMicrobial

https://github.com/tseemann/mlst

ftp://ftp.ncbi.nlm.nih.gov/blast/db

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

For the classification routine, we targeted the 16S ribosomal RNA gene, which can identify bacteria at the species level.

So I simply obtained the BLAST 16SMicrobial database, for searching in the assembled contigs.

Then, we supply the user with all the matches reaching at least 97% sequence identity (since this is the minimum level of identity for determining bacterial species).

This is useful for verifying that the bacterium you’ve sequenced is exactly what you expected, and for assessing contamination by other bacterial species.
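The 97% filter is easy to picture in code. Here is a minimal Python sketch, assuming the BLAST search was run with the tabular output format (`-outfmt 6`), where the first three columns are the query ID, the subject ID and the percentage of identity. The function name and the example hits are invented for illustration, and this is not the actual IonGAP code.

```python
def hits_above_identity(tabular_lines: list[str],
                        min_identity: float = 97.0) -> list[tuple[str, str, float]]:
    """Return (query, subject, identity) for every tabular BLAST hit
    whose percentage of identity reaches the threshold."""
    hits = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if identity >= min_identity:
            hits.append((query, subject, identity))
    return hits


# Made-up tabular hits (only the first four columns are shown).
example = [
    "contig_12\tsubject_A\t99.80\t1501",
    "contig_12\tsubject_B\t96.10\t1498",
    "contig_40\tsubject_C\t97.00\t1500",
]
for hit in hits_above_identity(example):
    print(hit)
```

If hits from two clearly different species both pass the threshold, that is a hint that the sample may be contaminated.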

taxonomic classification & mlst

Torsten Seemann’s mlst

As for the multilocus sequence typing, we relied on Torsten Seemann’s fantastic mlst script, which searches for the alleles that determine a sequence type in a given bacterial species, in a quick and easy way.

The sequence type is a sort of specific “family within the strain” of the isolated bacterium. Knowing the sequence type of our bacteria allows us to group different clinical isolates together, thus helping to carry out epidemiology studies.

And I’ve also “doped” this tool, so that the user receives not only the sequence type information, but also the actual sequences of the alleles found.
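Conceptually, once the allele at each locus has been identified, assigning the sequence type is just a table lookup. Here is a minimal Python sketch of that last step. It is not the mlst script itself: the locus names, the allele numbers and the profile table are all hypothetical, and a real scheme uses its own set of housekeeping genes and a much larger table.

```python
from typing import Optional

# Hypothetical typing scheme with three loci (real schemes have more).
SCHEME_LOCI = ("locusA", "locusB", "locusC")

# Hypothetical profile table: allele numbers per locus -> sequence type.
PROFILES = {
    (1, 1, 2): 1,
    (1, 2, 1): 10,
    (3, 2, 1): 17,
}


def sequence_type(allele_numbers: dict[str, int]) -> Optional[int]:
    """Look up the sequence type for a complete allele profile.
    Returns None when the combination is not in the table."""
    profile = tuple(allele_numbers[locus] for locus in SCHEME_LOCI)
    return PROFILES.get(profile)


isolate = {"locusA": 1, "locusB": 2, "locusC": 1}
print(sequence_type(isolate))
```

Two isolates that share the same sequence type can then be grouped together, which is what makes the number useful for epidemiology.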

Prokka

genome annotation

http://www.vicbioinformatics.com/software.prokka.shtml

For the bacterial genome annotation, we chose the Australian tool Prokka, which can annotate a whole genome in minutes, thanks to a series of embedded gene databases.

identification of plasmids & virulence factors

http://mvirdb.llnl.gov

ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids

Plasmid DB

And finally, the identification of plasmids in the genome relies on the NCBI plasmid database, which I used to build a BLAST database for searching across the genome.

identification of plasmids & virulence factors

MvirDB

For the identification of virulence factors, we use MvirDB, a super-database that gathers a number of existing catalogues of virulence proteins, antibiotic resistance genes, and pathogenicity islands.

These three categories are also searched over the genome by BLAST, and the results are output in three different files.

The output from the detection of both plasmids and virulence factors is, by default, filtered to include only the most relevant matches, but this can be disabled by the user, should they prefer manual filtering.

bacterial classification & annotation module

BLAST

NCBI 16SMicrobial

T. Seemann’s mlst

BLAST

NCBI Plasmid DB

MvirDB

Prokka

All these tools were fitted together into the bacterial classification and annotation module.

As you can see, apart from the contigs coming from the assembly module, it only requires specifying the species name if you want to run the MLST routine.

Also note the filtering options for the plasmid and virulence factor identification routines, which are enabled by default.

computational infrastructure

http://teidehpc.iter.es

Now, let’s talk a bit about hardware.

Our service is hosted and running here, on the Teide-HPC supercomputer of the Institute of Technology and Renewable Energies of Tenerife (ITER), where I worked until very recently.

This is the second most powerful computer in Spain at the moment, and it’s within the world’s top 300.

Thanks to this infrastructure, we can expand our computational resources in the future to cope with a hopefully increasing user demand.

And thanks to all these efforts, IonGAP evolved from being a simple University engineering degree project to become a published...

worldwide use

iongap.hpc.iter.es/iongap/about

as of 21/10/2015

…and demanded tool.

This is the overall number of analysis jobs we have carried out, together with the anonymised locations of our users.

As you can see, we have already reached every single continent, including developing countries like South Africa, India and the Philippines, which depend on the higher accessibility of Ion Torrent sequencing.

And IonGAP has also been useful enough to be cited by South African researchers in this publication about the sequencing of a bacterial genome from… “horse biological material”.

the iongap group

To finish, I’d like to introduce the members of our group.

From the Canary Islands’ Health Service, and the University Institute of Tropical Diseases and Public Health, are Carlos Flores and Fabián Lorenzo (co-first author of the paper), who first felt the need for this kind of integrated platform.

They are not only users, but key promoters of IonGAP and a source of ideas for the future.

the iongap group

From the University of La Laguna, are José Luis Roda and Marcos Colebrook (at the School of Computer Engineering) and Mariano Hernández (at the Faculty of Biology).

They are also main drivers of this project, and keep on bouncing ideas for improvement.

the iongap group

And last, but not least…

From the Institute of Technology and Renewable Energies, are Carlos González, who helps to maintain and improve the system, and… myself, the overwhelmed computer geek who actually had to implement all of this.

thanks for your attention

iongap.hpc.iter.es

For any doubts, questions, complaints or generous donations, please write to:

iongap.admatgmail.com

(Generous donations are strongly preferred.)
