RNA-seq: Mapping and quality control - part 3

Mapping to assign reads to genes. Joachim Jacob, 20 and 27 January 2014. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Third part in the 'RNA-seq for DE analysis'. See http://www.bits.vib.be for more details.

Transcript of RNA-seq: Mapping and quality control - part 3

Page 1: RNA-seq: Mapping and quality control - part 3

Mapping to assign reads to genes

Joachim Jacob, 20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Page 2: RNA-seq: Mapping and quality control - part 3

Goal

Assign reads to genes.

The result of the mapping will be used to construct a summary of the counts: the count table.

GeneA: 12

GeneB: 5
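A minimal sketch of how such a count table can be tallied, assuming every mapped read has already been assigned a gene name. The list `assigned_genes` is made-up illustration data, not the output of a real mapper.

```python
from collections import Counter

# Hypothetical gene assignments, one entry per mapped read.
assigned_genes = ["GeneA"] * 12 + ["GeneB"] * 5

# The count table is the number of reads assigned to each gene.
count_table = Counter(assigned_genes)

for gene, count in sorted(count_table.items()):
    print(f"{gene}: {count}")
```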

Page 3: RNA-seq: Mapping and quality control - part 3

2 scenarios

Reference genome sequence available

NO reference genome sequence available

● De novo assembly of the reads (Trinity) (transcriptome construction)

● Map the reads to the assembly (RSEM mapper)

● Extract count table

(Note: no removal of polyA is required. Computationally expensive!)

Page 4: RNA-seq: Mapping and quality control - part 3

Reference genome sequence available

Preprocessed reads are mapped to the reference sequence:

1. The reference is a single haplotype, whereas the sample contains a mixture of alleles; this leads to mismatches.

2. Reads contain sequencing errors.

3. Reads are derived from mRNA, whereas the genome is DNA.

35 → for 2 alleles together

If we compare samples within the same specimen, this effect is similar for all samples.

Page 5: RNA-seq: Mapping and quality control - part 3

mRNA reads: some reads span introns

● Reads are derived from mRNA.

Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.

[Figure labels: mRNA, exon, intron. One isoform!]

http://www.ensembl.org
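In SAM/BAM output, the gap of a read that spans an intron appears as an `N` operation in the CIGAR string. The sketch below, with a made-up position and CIGAR string, shows how the skipped reference interval can be recovered from one alignment. The function name `intron_junctions` is hypothetical.

```python
import re

def intron_junctions(pos, cigar):
    """Return the reference intervals (1-based, inclusive) skipped by 'N' operations."""
    junctions = []
    ref_pos = pos  # leftmost mapping position, as in the SAM POS field
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # The skipped region is the intron spanned by this read.
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # only these operations consume the reference
            ref_pos += length
    return junctions

# A read with 30 aligned bases, a 500-base gap, then 20 aligned bases.
print(intron_junctions(100, "30M500N20M"))
```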

Page 6: RNA-seq: Mapping and quality control - part 3

mRNA reads: multiple isoforms exist

● Isoforms are transcribed at different levels, contributing differently to the number of reads.

http://www.ensembl.org

Page 7: RNA-seq: Mapping and quality control - part 3

Algorithm: gapped read mapping

● Exon-first approach: TopHat (popular)

TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, Vol. 25 no. 9 2009, pages 1105–1111. doi:10.1093/bioinformatics/btp120

Junction database constructed to try to map unmapped reads.
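The exon-first idea can be illustrated with a toy example. This is not TopHat's implementation: it uses exact string matching only, and the junction candidates come from a supplied exon list (an assumption made for illustration). The sequences, the coordinates and the function name `exon_first_map` are all made up.

```python
def exon_first_map(reads, genome, exons):
    """Toy exon-first mapping. Exons are 0-based, half-open (start, end) pairs."""
    # Step 1: try to map every read contiguously (ungapped) to the genome.
    mapped, unmapped = {}, []
    for read in reads:
        hit = genome.find(read)
        if hit >= 0:
            mapped[read] = ("ungapped", hit)
        else:
            unmapped.append(read)
    # Step 2: build a junction database by joining consecutive exons.
    junctions = {}
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        junctions[(e1, s2)] = genome[s1:e1] + genome[s2:e2]
    # Step 3: try the still unmapped reads against the junction sequences.
    for read in unmapped:
        for junction, sequence in junctions.items():
            if read in sequence:
                mapped[read] = ("spliced", junction)
                break
    return mapped

genome = "ACGTACGT" + "GTTTTTAG" + "TTGCAATG"  # exon, intron, exon
exons = [(0, 8), (16, 24)]
result = exon_first_map(["GTACGT", "ACGTTTGC"], genome, exons)
print(result)
```

The first read lies inside one exon and maps directly; the second only maps once the two exons are joined.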

Page 8: RNA-seq: Mapping and quality control - part 3

Principle of gapped read mapping

● STAR: fast and suited for longer reads

STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al., Bioinformatics

Page 9: RNA-seq: Mapping and quality control - part 3

Checklist for mapping to reference genome

1. A reference genome sequence (fasta), to be indexed by the alignment software.

2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (optional, but highly recommended)

3. The cleaned (preprocessed) reads (fastq)

Page 10: RNA-seq: Mapping and quality control - part 3

Getting your reference genome sequence

● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here

● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and follow the instructions of your indexing software.

● If still no luck, try a specialized species website, e.g.

Page 11: RNA-seq: Mapping and quality control - part 3

Indexing a genome

● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is indexed once, which takes a lot of time but makes every subsequent mapping fast.

● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
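Outside Galaxy, indexing and mapping are two separate commands. The sketch below only builds and prints the command lines, assuming STAR is the aligner. All file and directory names are hypothetical, and `--sjdbOverhang 99` assumes reads of 100 bases.

```python
import shlex

genome_dir = "star_index"   # hypothetical output directory for the index
genome_fasta = "genome.fa"  # hypothetical reference sequence
annotation = "genes.gtf"    # hypothetical annotation file
reads = "reads.fastq"       # hypothetical preprocessed reads

# Step 1 (slow, done once): index the reference genome.
index_cmd = ["STAR", "--runMode", "genomeGenerate",
             "--genomeDir", genome_dir,
             "--genomeFastaFiles", genome_fasta,
             "--sjdbGTFfile", annotation,
             "--sjdbOverhang", "99",
             "--runThreadN", "4"]

# Step 2 (fast, once per sample): map the reads against the index.
map_cmd = ["STAR", "--genomeDir", genome_dir,
           "--readFilesIn", reads,
           "--outSAMtype", "BAM", "SortedByCoordinate",
           "--outFileNamePrefix", "sample1_",
           "--runThreadN", "4"]

for cmd in (index_cmd, map_cmd):
    print(shlex.join(cmd))
```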

Page 12: RNA-seq: Mapping and quality control - part 3

Using genome annotation information

● Annotation info is stored in text files formatted as GTF or GFF3 files.

● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file for further use. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.

● We will use a GTF file from a respected genome database to assist the mapping of reads.

http://cufflinks.cbcb.umd.edu/

Page 13: RNA-seq: Mapping and quality control - part 3

Using genome annotation information

Page 14: RNA-seq: Mapping and quality control - part 3

GTF example

TIP: You can look at an example in Galaxy!
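For reference, a GTF line has nine tab-separated columns, the last one holding `key "value";` attributes. The sketch below parses one made-up line. The record is illustrative and not taken from a real annotation, and the parser assumes attribute values contain no semicolons.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Split one GTF line into a dict; the attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record

# Made-up example line.
example = ('chr1\texample_source\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GeneA"; transcript_id "GeneA.1";')
rec = parse_gtf_line(example)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"]["gene_id"])
```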

Page 15: RNA-seq: Mapping and quality control - part 3

Mapping in Galaxy

Basic settings


Page 16: RNA-seq: Mapping and quality control - part 3

Mapping in Galaxy


Page 17: RNA-seq: Mapping and quality control - part 3

Mapping in Galaxy

Page 18: RNA-seq: Mapping and quality control - part 3

Mapping QC

● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.

● Set the mapping parameters fairly liberally: map as many reads as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').

● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy or with a stand-alone viewer such as GenomeView or IGV.
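Discarding multi reads can be sketched on plain SAM text lines. This assumes the mapper writes the optional `NH:i:` tag (number of reported alignments for the read), as TopHat and STAR do. The two records are made up, and treating a missing `NH` tag as "not unique" is a choice made for this sketch.

```python
def is_uniquely_mapped(sam_line):
    """True if the alignment line is mapped and its NH tag equals 1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # FLAG bit 0x4: the read is unmapped
        return False
    for tag in fields[11:]:  # optional tags follow the 11 mandatory fields
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag: cannot tell, so do not count as unique

seq, qual = "A" * 50, "I" * 50
unique_read = f"read1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t{seq}\t{qual}\tNH:i:1"
multi_read = f"read2\t0\tchr1\t900\t3\t50M\t*\t0\t0\t{seq}\t{qual}\tNH:i:3"

kept = [line for line in (unique_read, multi_read) if is_uniquely_mapped(line)]
print(len(kept))
```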

Page 19: RNA-seq: Mapping and quality control - part 3

Mapping QC

The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy or with a stand-alone viewer such as GenomeView or IGV.

Let's visualize

Check whether this visualization matches:

- paired end

- splice junctions

- strandedness

- ...

Page 20: RNA-seq: Mapping and quality control - part 3

Practical tips

Add the GTF to the viz

These are the reads, in 2 colours because of the sense and antisense strand (obviously this library was not stranded!).

Position on the reference genome sequence

Some reads span an intron
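The strand a viewer colours by is stored in bit 0x10 of the SAM FLAG field. A minimal sketch with made-up FLAG values for single-end reads overlapping one gene: a roughly even split between the two strands is what an unstranded library would be expected to give.

```python
from collections import Counter

REVERSE_STRAND = 0x10  # SAM FLAG bit: read aligned to the reverse strand

def strand_counts(flags):
    """Count reads per strand from their SAM FLAG values."""
    return Counter("-" if flag & REVERSE_STRAND else "+" for flag in flags)

# Hypothetical FLAG values of eight single-end reads.
flags = [0, 16, 0, 16, 16, 0, 0, 16]
counts = strand_counts(flags)
forward_fraction = counts["+"] / len(flags)
print(counts["+"], counts["-"], forward_fraction)
```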

Page 21: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

After checking the mapping visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Page 22: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

Duplication rate observed in the RNA-seq data.

http://rseqc.sourceforge.net/
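One way to think about a position-based duplication rate is sketched below. This is a simplification for illustration, not RSeQC's code. Alignments are reduced to made-up (chromosome, start, strand) keys, and reads sharing a key count as duplicates.

```python
from collections import Counter

def duplication_rate(alignments):
    """Fraction of reads that repeat an already seen alignment position."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total

# Hypothetical alignments: three reads pile up at the same position.
alignments = [("chr1", 100, "+")] * 3 + [("chr1", 250, "+"), ("chr2", 40, "-")]
rate = duplication_rate(alignments)
print(round(rate, 2))
```

In RNA-seq, some duplication is expected for highly expressed genes, so a high rate is not by itself proof of a technical artefact.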

Page 23: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

Read quality of aligned reads

http://rseqc.sourceforge.net/

Page 24: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

Sequence depth saturation

http://rseqc.sourceforge.net/

Early flattening points to saturation

Q1 → Q4: from low count genes to high count genes
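The idea behind a saturation plot can be sketched by resampling. The example below is a simplified stand-in for the RSeQC analysis: instead of tracking expression estimates per quartile, it counts how many genes are detected at growing fractions of the reads. The gene assignments are made up, and `saturation_curve` is a hypothetical helper.

```python
import random

def saturation_curve(assigned_genes, fractions, seed=0):
    """Number of genes detected when only a fraction of the reads is used."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    shuffled = list(assigned_genes)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n = int(len(shuffled) * fraction)
        # Prefixes of one shuffle are nested, so the curve never decreases.
        curve.append((fraction, len(set(shuffled[:n]))))
    return curve

# Hypothetical assignments: from a high count gene down to a low count gene.
reads = ["GeneA"] * 60 + ["GeneB"] * 30 + ["GeneC"] * 9 + ["GeneD"] * 1
curve = saturation_curve(reads, [0.1, 0.25, 0.5, 1.0])
for fraction, detected in curve:
    print(fraction, detected)
```

A curve that flattens early suggests that sequencing deeper would add little; low count genes are typically the last to be detected.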

Page 25: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

Sequence depth saturation

http://rseqc.sourceforge.net/

Page 26: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

After checking visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Page 27: RNA-seq: Mapping and quality control - part 3

Mapping QC - RSeQC

After checking visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Deviating!

Page 28: RNA-seq: Mapping and quality control - part 3

Mapping QC - BamQC

Another useful tool is BamQC from the Qualimap suite. Be aware, however, that it is also meant for DNA-seq!

http://qualimap.bioinfo.cipf.es/

Page 29: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 30: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 31: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 32: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Fraction of genome sequence not covered
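What this metric measures can be sketched as follows, assuming alignments are given as made-up 0-based, half-open (start, end) intervals on a single toy reference. This is an illustration, not Qualimap's implementation.

```python
def fraction_not_covered(genome_length, alignments):
    """Fraction of reference positions that no alignment overlaps."""
    covered = [False] * genome_length
    for start, end in alignments:
        # Clip each interval to the reference before marking positions.
        for pos in range(max(start, 0), min(end, genome_length)):
            covered[pos] = True
    return covered.count(False) / genome_length

# Hypothetical alignments on a toy reference of 100 bases.
alignments = [(0, 10), (5, 20), (50, 60)]
uncovered = fraction_not_covered(100, alignments)
print(uncovered)
```

For RNA-seq a large uncovered fraction is expected, since only transcribed regions receive reads.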

Page 33: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 34: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 35: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 36: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 37: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 38: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 39: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 40: RNA-seq: Mapping and quality control - part 3

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 41: RNA-seq: Mapping and quality control - part 3

Keywords

haplotype

Gapped mapping

GTF

duplication

isoforms

strandedness

coverage

Write in your own words what the terms mean

Page 43: RNA-seq: Mapping and quality control - part 3

Break