RNA-seq: Mapping and quality control - part 3

Post on 11-May-2015

Third part in the 'RNA-seq for DE analysis' series. See http://www.bits.vib.be for more details.

Mapping to assign reads to genes

Joachim Jacob

20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Goal

Assign reads to genes.

The result of the mapping will be used to construct a summary of the counts: the count table.

GeneA: 12

GeneB: 5
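As a minimal sketch of what a count table is, assume every mapped read has already been assigned to a gene (or to no gene at all). The `assignments` list is made up so that it reproduces the toy numbers above; `count_table` is a hypothetical helper, not part of any mapping tool.

```python
from collections import Counter

# Hypothetical read-to-gene assignments produced by a mapping step.
# None marks a read that could not be assigned to any gene.
assignments = ["GeneA"] * 12 + ["GeneB"] * 5 + [None] * 3


def count_table(assigned_genes):
    """Count how many reads were assigned to each gene, skipping unassigned reads."""
    return Counter(gene for gene in assigned_genes if gene is not None)


table = count_table(assignments)
for gene, n_reads in sorted(table.items()):
    print(f"{gene}: {n_reads}")
```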

2 scenarios

Reference genome sequence available

NO reference genome sequence available

● De novo assembly of the reads (Trinity) (transcriptome construction)

● Map the reads to the assembly (RSEM mapper)

● Extract count table

(Note: no removal of polyA is required. Computationally expensive!)

Reference genome sequence available

Preprocessed reads are mapped to the reference sequence:

1. The reference is a single haplotype, while the sample contains a mixture of alleles: this leads to mismatches.

2. Reads contain sequencing errors

3. Reads derived from mRNA, genome is DNA.

35 for the 2 alleles together

If we compare samples within the same specimen, this effect is similar for all samples.

mRNA reads: some reads span introns

● Reads are derived from mRNA

Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.

[Figure: gene model with exons and introns labelled. One isoform!]

http://www.ensembl.org
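To illustrate how gapped alignments expose intron-exon junctions, here is a small sketch that reads the CIGAR string of an alignment in SAM format, where the `N` operation marks a region skipped on the reference (an intron for RNA-seq reads). The function name `splice_junctions` and the example values are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive, implied by N operations.

    pos is the 1-based leftmost mapping position of the read (SAM POS field).
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions


# A 100-base read: 40 bases in one exon, a 500-base intron, 60 bases in the next exon.
print(splice_junctions(1000, "40M500N60M"))
```

Collecting these intervals over all reads gives the set of junctions supported by the data.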

mRNA reads: multiple isoforms exist

● Isoforms are transcribed at different levels, contributing differently to the number of reads.

http://www.ensembl.org

Algorithm: gapped read mapping

● Exon-first approach: TopHat (popular)

TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, Vol. 25 no. 9 2009, pages 1105–1111. doi:10.1093/bioinformatics/btp120

Junction database constructed to try to map unmapped reads.

Principle of gapped read mapping

● STAR: fast and suited for longer reads

STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al. Bioinformatics

Checklist for mapping to reference genome

1. A reference genome sequence (fasta), to be indexed by the alignment software.

2. A genome annotation file (GFF3 or GTF) describing the currently known annotations (optional, but highly recommended).

3. The cleaned (preprocessed) reads (fastq)

Getting your reference genome sequence

● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here

● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and index the sequence following the instructions of your alignment software.

● If still no luck, try a specialized species website, e.g.

Indexing a genome

● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is indexed once, which takes a lot of time, so that each read can then be looked up quickly.

● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
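The idea of paying the indexing cost once and mapping quickly afterwards can be sketched with a toy k-mer lookup table. This is a deliberately simplified stand-in: real aligners use more compact structures (for example suffix arrays or Burrows-Wheeler-based indexes) and tolerate mismatches and gaps. The names `build_index` and `map_read` and the toy genome are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(genome, k):
    """Precompute a table from every k-mer to the 0-based positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k):
    """Use the first k bases as a seed, then verify the full read at each candidate position."""
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits


genome = "ACGTACGTTAGCCGATAGCTAGGCTA"
index = build_index(genome, k=4)  # slow step, done once per genome
print(map_read("TAGCCGAT", genome, index, k=4))  # fast lookup for each read
```

The seed `TAGC` occurs twice in the toy genome, but only one candidate survives verification of the full read, which is why the lookup stays cheap.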

Using genome annotation information

● Annotation info is stored in text files formatted as GTF or GFF3 files.

● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file that can be used in later steps. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.

● We will use a GTF file from a respected genome database to assist the mapping of reads.

http://cufflinks.cbcb.umd.edu/

Using genome annotation information

GTF example

TIP: You can look at an example in Galaxy!
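As a rough sketch of what a GTF record contains, the snippet below splits one line into the nine tab-separated columns of the format (sequence name, source, feature, start, end, score, strand, frame, attributes) and parses the attribute column. The example record is made up, and the parser is simplified: it assumes no semicolons occur inside quoted attribute values.

```python
def parse_gtf_line(line):
    """Split one GTF record into its 9 tab-separated columns and parse the attributes."""
    (seqname, source, feature, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split("\t")
    attributes = {}
    for item in attrs.strip().split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attributes,
    }


# Made-up example record (not taken from a real annotation).
example = ('chr1\texample_db\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GeneA"; transcript_id "GeneA.1";')
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```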

Mapping in Galaxy

Basic settings

Mapping QC

● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.

● Set the mapping parameters fairly liberally: map as many reads as possible, and let the mapper assign mapping qualities. Ideally, every read maps exactly once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').

● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
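A sketch of how uniquely mapped reads can be told apart from multi reads in SAM output, assuming the mapper writes the optional `NH:i` tag (the number of reported alignments for the read). Whether and how your mapper sets this tag is an assumption to check in its documentation; the alignment lines below are made up.

```python
def is_unique(sam_line):
    """True if the alignment line carries NH:i:1, i.e. the read was reported at one location."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag: we cannot tell, so treat the read as not unique


# Made-up alignment lines: read1 maps once, read2 maps to two locations.
sam_lines = [
    "read1\t0\tchr1\t100\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1",
    "read2\t0\tchr1\t300\t3\t8M\t*\t0\t0\tTTGACCAG\tIIIIIIII\tNH:i:2",
    "read2\t256\tchr2\t700\t3\t8M\t*\t0\t0\tTTGACCAG\tIIIIIIII\tNH:i:2",
]
unique = [line.split("\t")[0] for line in sam_lines if is_unique(line)]
print(unique)
```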

Let's visualize

Check whether this visualization matches:

- paired end

- splice junctions

- strandedness

- ...
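Besides looking at the viewer, pairing and strand information can be read from the FLAG field of each SAM record, which is a sum of bit values defined by the SAM format. The sketch below decodes a subset of those bits; the helper name `decode_flag` is made up for illustration.

```python
# A subset of the bit values defined for the SAM FLAG field.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
}


def decode_flag(flag):
    """Return the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
print(decode_flag(147))
```

Flags 99 and 147 are a typical combination for the two mates of a properly paired fragment: one mate on the forward strand, the other on the reverse strand.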

Practical tips

Add the GTF to the viz

These are the reads, in 2 colours because of the sense and antisense strand. (Obviously, this library was not stranded!)

Position on the reference genome sequence

Some reads span an intron

Mapping QC - RSeQC

After checking the mapping visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Mapping QC - RSeQC

Duplication rate observed in the RNA-seq data.

http://rseqc.sourceforge.net/
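One simple way to think about a duplication rate is sketched below: reads that share chromosome, start position and strand with an earlier read are counted as duplicates. This is one possible definition under stated assumptions, not necessarily how RSeQC computes its plot; the input tuples are made up.

```python
from collections import Counter


def duplication_rate(alignments):
    """Fraction of reads that share (chromosome, start, strand) with an earlier read."""
    counts = Counter(alignments)
    total = sum(counts.values())
    duplicates = total - len(counts)  # every read beyond the first at a position
    return duplicates / total if total else 0.0


# Made-up alignments: three reads stack on the same position, two are on their own.
alignments = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
rate = duplication_rate(alignments)
print(f"duplication rate: {rate:.2f}")
```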

Mapping QC - RSeQC

Read quality of aligned reads

http://rseqc.sourceforge.net/

Mapping QC - RSeQC

Sequence depth saturation

http://rseqc.sourceforge.net/

Early flattening points to saturation

Q1 → Q4: from low count genes to high count genes
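The idea behind a saturation curve can be sketched by subsampling the reads at increasing fractions and checking how a summary changes. The sketch below uses a simpler measure than the RSeQC plot, namely the number of genes seen at least once, and entirely made-up data; `saturation_curve` is a hypothetical helper.

```python
import random


def saturation_curve(read_genes, fractions, seed=0):
    """For each fraction, take that share of the shuffled reads and count genes seen at least once."""
    rng = random.Random(seed)
    shuffled = list(read_genes)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n_reads = int(len(shuffled) * frac)
        curve.append((frac, len(set(shuffled[:n_reads]))))
    return curve


# Made-up data: one highly expressed gene, ten moderately and fifty lowly expressed ones.
read_genes = (
    ["high"] * 500
    + [f"mid{i}" for i in range(10) for _ in range(20)]
    + [f"low{i}" for i in range(50)]
)
curve = saturation_curve(read_genes, [0.1, 0.25, 0.5, 0.75, 1.0])
for frac, n_genes in curve:
    print(f"{frac:.0%} of reads: {n_genes} genes detected")
```

If the curve stops rising well before all reads are used, extra sequencing depth adds little; in this toy data the low count genes keep appearing until the very end, so it does not flatten early.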

Mapping QC - RSeQC

After checking visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Deviating!

Mapping QC - BamQC

Another useful tool is BamQC of the Qualimap Suite. Be aware, however, that it is not specific to RNA-seq: it is also useful for DNA-seq!

http://qualimap.bioinfo.cipf.es/

Mapping QC: BamQC

Fraction of genome sequence not covered

http://qualimap.bioinfo.cipf.es/
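A sketch of how the uncovered fraction of a sequence could be computed from the intervals covered by aligned reads. The function name, the interval convention (0-based, end-exclusive, single sequence) and the numbers are assumptions for illustration, not how BamQC is implemented.

```python
def uncovered_fraction(genome_length, intervals):
    """Fraction of positions not covered by any interval.

    intervals are (start, end) pairs, 0-based, end-exclusive, on a single sequence.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part that was already counted
        if end > start:
            covered += end - start
            current_end = end
    return 1 - covered / genome_length


# Made-up example: two overlapping alignments and one separate alignment.
fraction = uncovered_fraction(1000, [(0, 100), (50, 150), (400, 500)])
print(f"fraction not covered: {fraction:.2f}")
```

For RNA-seq a large uncovered fraction is expected, since only transcribed regions yield reads.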

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Keywords

haplotype

Gapped mapping

GTF

duplication

isoforms

strandedness

coverage

Write in your own words what the terms mean

Break