Biomedical Informatics Shared Resource Workshop

RNA-seq analysis

2015-03-12

Paolo Guarnieri, M.D.

Topics

• Experimental design and library selection

• Sequence handling and processing

• Quality assessment

– Library level

– Read sequence level

– Sample level

• Identification of genes differentially expressed

• Additional analyses and tools

Comparing paradigms

Microarray

Experimental Design
• Platform
• Chips
• Samples

RNA preparation
• Library preparation for hybridization

Data analysis
• Image analysis
• Probe intensities
• Normalization

RNA-seq

Experimental Design
• Lanes
• Reads
• Samples

RNA preparation
• Library prep for sequencing

Data analysis
• Alignment
• Read counts
• Normalization

Experimental design and library selection

• Sample preparation:

Choose the proper kit for your experiment

Poly-A mRNA isolation vs. rRNA depletion (Ribo-Zero)

• Library preparation:

Single end vs. paired end

Please refer to the manufacturer for details

Paired end vs. single end

adapted from: Zhernakova et al., PLoS Genet. 2013 Jun; 9(6)

Library preparation steps:

- adapters ligated

- amplified

- size selected

THE SEQUENCE

Sequencing process

Images: Werner Van Belle

Flow cell: composed of multiple lanes

Lanes: contain multiple imaging positions

Sequencing by synthesis process


Positions are imaged 4 times

Each imaging position contains multiple sequence clusters

Incorporation of a new nucleotide

Repeat n times (n = read length)

Base calling

Sequencing complete -> per-cycle base calling
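The per-cycle imaging and base-calling idea can be illustrated with a toy base caller. This is a deliberately simplified sketch, not how the sequencer's software works: it assumes each cycle yields one intensity per channel (A, C, G, T) for a cluster and simply picks the brightest channel. The channel order and the intensity values are made up.

```python
# Toy per-cycle base caller: one intensity per channel (A, C, G, T) per cycle.
# The channel order and the intensity values are illustrative assumptions.
CHANNELS = "ACGT"

def call_bases(intensities):
    """Return the read obtained by picking the brightest channel in each cycle."""
    read = []
    for cycle in intensities:
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)

# Four cycles for one cluster, so the read length n is 4.
cluster = [
    (0.1, 0.2, 0.9, 0.1),  # brightest channel: G
    (0.1, 0.1, 0.8, 0.2),  # brightest channel: G
    (0.2, 0.1, 0.7, 0.1),  # brightest channel: G
    (0.1, 0.1, 0.2, 0.9),  # brightest channel: T
]
print(call_bases(cluster))
```

Real base callers also correct for cross-talk between channels and phasing, and they attach a quality score to every call.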

Sequence file:

<.fastq>


The sequence

The raw data is the <.fastq> file

• A collection of multiple reads

• Each read in the file has these main features:

– 4 lines

– starts with @

Often <.fastq> files are compressed into <.fastq.gz>

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC

+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Line 1 <- Read ID + description (starts with @)

Line 2 <- Sequence

Line 3 <- "+", optionally followed by the same text as line 1

Line 4 <- Quality scores, one character per base
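The four-line record structure can be read with a few lines of Python. This is a minimal sketch that assumes strict four-line records (no wrapped sequences) and Phred+33 quality encoding; the function names are illustrative and not part of any toolkit.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line; the repeated ID is optional
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]

example = io.StringIO(
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC\n"
)
records = list(read_fastq(example))
read_id, sequence, quality = records[0]
print(read_id.split()[0], len(sequence), phred_scores(quality)[:3])
```

For a compressed file, the same function can be fed a handle opened with `gzip.open(path, "rt")` from the standard library.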

Sequence handling

When dealing with sequencing files, be aware of:

• Data transfer:

– files start from around 1.5 Gb per sample (more if paired end)

– compress as .gz to reduce transfer time

• Group files:

– match read files (R1, R2) to their samples

• Storage:

– raw files are required for publication and must be kept safe
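Grouping read files by sample can be scripted. The sketch below assumes a hypothetical naming convention, `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`; facilities name files differently, so the pattern would need adapting.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>R[12])\.(fastq|fq)(\.gz)?$")

def group_by_sample(filenames):
    """Map each sample name to a dict such as {'R1': file, 'R2': file}."""
    samples = defaultdict(dict)
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        samples[match["sample"]][match["mate"]] = name
    return dict(samples)

files = ["S1_R1.fastq.gz", "S1_R2.fastq.gz", "S2_R1.fastq.gz"]
groups = group_by_sample(files)
# Samples with only R1 are either single end or missing a mate file.
incomplete = [s for s, mates in groups.items() if "R2" not in mates]
print(groups["S1"], incomplete)
```

Flagging samples without an R2 file early helps catch incomplete transfers before alignment starts.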

PRE-PROCESSING

Pre-processing

• Remove adaptors

– fastx_clipper

• De-multiplex (if required)

– fastx_barcode_splitter, fastq-multx (handles both mates)

These steps are usually performed by the sequencing facility

Multiplexing removes variability

– Lane specific: multiplex in different lanes or flow cells (barcoding)

[Figure: good vs. bad multiplexing layouts]

De-multiplexing

• Grouped barcodes file (.fil) looks like this:

<id1> <sequence1> <group1>

<id1> <sequence1> <group1>

<id2> <sequence2> <group2>...

https://code.google.com/p/ea-utils/wiki/FastqMultx

>fastq-multx -B barcode-definition.fil \

PE_read1.fq -o r1.%.fq \

PE_read2.fq -o r2.%.fq

>_
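The logic behind de-multiplexing can be sketched in Python. This toy version only does an exact match of the barcode at the start of each read; it is not a reimplementation of fastq-multx, which handles more than this. The barcode sequences and sample ids are made up.

```python
# Toy demultiplexer: exact match of the barcode at the start of each read.
# Barcode sequences and sample ids are made up for illustration.
BARCODES = {"ACGT": "id1", "TGCA": "id2"}

def demultiplex(reads, barcodes):
    """Split reads into per-sample lists, trimming the barcode off each read."""
    bins = {sample: [] for sample in barcodes.values()}
    bins["unmatched"] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unmatched"].append(read)
    return bins

reads = ["ACGTGGGTGATG", "TGCACCGCTGCC", "ACGTGATGGCGT", "NNNNTCAAATCC"]
bins = demultiplex(reads, BARCODES)
print({sample: len(group) for sample, group in bins.items()})
```

A large "unmatched" bin is usually the first sign of a wrong barcode file or a poor-quality index read.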

QUALITY ASSESSMENT

Library level QC

Before sequencing: measure RNA integrity with the Bioanalyzer (usually performed by the facility)

Sequence level QC

• FastQC: quality score for each nucleotide position

• Additional quality-based trimming

[Figure: per-position quality plots, from very good to normal to bad]
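The per-position quality summary and a simple quality-based trim can be sketched as follows. This assumes Phred+33 encoding and a plain 3' trimming rule with a threshold of 20; neither is taken from FastQC or from any specific trimming tool.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across reads (offset 33 assumed)."""
    length = max(len(q) for q in quality_strings)
    totals, counts = [0] * length, [0] * length
    for quality in quality_strings:
        for pos, ch in enumerate(quality):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

def trim_3prime(sequence, quality, min_score=20, offset=33):
    """Drop trailing bases whose Phred score is below min_score (a simple rule)."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]

# Three made-up quality strings for 4 bp reads.
quals = ["IIII", "II55", "I9+#"]
means = mean_quality_per_position(quals)
trimmed = trim_3prime("ACGT", "I9+#")
print([round(m, 1) for m in means])
print(trimmed)
```

Quality typically drops toward the 3' end of the read, which is why trimming from that end is the common first step.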

Sample/Dataset level

• Evaluate the distribution of the reads as a function of expression

Coverage

After sequencing: compare your actual library size with the expected one

[Figure: total reads, mapped reads, and sum of counts per sample]

Alignment/Mapping – what

Reference genome / transcriptome

...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...

Reads to be mapped

TGGGCCGGCA

CCGGCAATTC

ATTCGATATC

GATATCGCGC

GCATATATTT

CATGCTTAGC

ATATTTCGGC

GCATATATTT

TCGCGCATAT

Alignment/Mapping – what

Reference genome / transcriptome

...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...
    TGGGCCGGCA            GCATATATTT     CATGCTTAGC
        CCGGCAATTC            ATATTTCGGC
              ATTCGATATC  GCATATATTT
                      TCGCGCATAT
                  GATATCGCGC

Reads mapped
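The mapping shown on these slides can be reproduced with a naive exact-match search. This is only a sketch to make the idea concrete: real aligners use genome indices and tolerate mismatches and splicing. The reference and reads are the ones from the slide; the function names are illustrative.

```python
REFERENCE = "GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC"
READS = [
    "TGGGCCGGCA", "CCGGCAATTC", "ATTCGATATC", "GATATCGCGC", "GCATATATTT",
    "CATGCTTAGC", "ATATTTCGGC", "GCATATATTT", "TCGCGCATAT",
]

def map_exact(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    positions, start = [], reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

def coverage(reads, reference):
    """Count how many uniquely placed reads cover each reference base."""
    depth = [0] * len(reference)
    for read in reads:
        hits = map_exact(read, reference)
        if len(hits) == 1:  # keep unique placements only
            for pos in range(hits[0], hits[0] + len(read)):
                depth[pos] += 1
    return depth

for read in READS:
    print(read, map_exact(read, REFERENCE))
depth = coverage(READS, REFERENCE)
print(depth)
```

The per-base depth list is the coverage: regions hit by several overlapping reads get higher values, and bases with no read get zero.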

Alignment/Mapping – Coverage

Alignment/Mapping – how

• Alignment programs:

– Bowtie

– BWA

– GSNAP

– OLEGO

– STAR

– TopHat

– […]

• Parameters are important:

– Number of mismatches

– Unique alignments
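The effect of these two parameters can be shown with a brute-force matcher. This is a sketch under simple assumptions (substitutions only, no gaps, no index) and not how any of the listed aligners work; the reference and reads are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_with_mismatches(read, reference, max_mismatches=1):
    """Return all 0-based positions where the read aligns within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(start)
    return hits

def unique_hit(read, reference, max_mismatches=1):
    """Return the position if the read maps to exactly one place, else None."""
    hits = map_with_mismatches(read, reference, max_mismatches)
    return hits[0] if len(hits) == 1 else None

reference = "ACGTACGTTTGACC"
print(map_with_mismatches("ACGT", reference, 0))  # a read that fits in two places
print(unique_hit("ACGT", reference, 0))           # multi-mapping read is discarded
print(unique_hit("TTGTCC", reference, 1))         # one mismatch tolerated
print(unique_hit("TTGTCC", reference, 0))         # no exact placement
```

Allowing more mismatches rescues reads with sequencing errors or variants, but it also increases the number of reads that fit in more than one place.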

STAR aligner

>STAR --genomeDir /star_indices/mm10 \

--runThreadN 8 \

--outFilterMultimapNmax 1 \

--outSAMtype BAM SortedByCoordinate \

--sjdbGTFfile mm10.igenome_ucsc.gtf \

--readFilesCommand zcat \

--outSAMstrandField intronMotif \

--readFilesIn S1_R1.fq.gz S1_R2.fq.gz \

--outFileNamePrefix somePrefix.

>_

Aligned sequences

Aligned sequences are stored in:

– SAM file: Sequence Alignment/Map file

– BAM file: BGZF-compressed binary version of the SAM

– BAI file: index of the BAM

Main features:

– store all the alignment information

– are compact

– can be processed line by line

– can be indexed for fast position access

SAM file

TAB-delimited text format consisting of a header section and an alignment section

SAM/BAM and related specifications:

http://samtools.github.io/hts-specs/

@HD VN:1.5 SO:coordinate

@SQ SN:ref LN:45

r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *

r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA *

r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *

r003 2064 ref 29 17 6H5M * 0 0 TAGGC *

r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT *

Alignment section columns (in order): QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL

Header section: lines starting with @
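The eleven mandatory columns can be parsed directly. The sketch below reads the first alignment line of the example (with real TAB separators), tests two FLAG bits, and derives the alignment end from the CIGAR string. The class and function names are illustrative; in practice a library such as pysam or samtools itself would be used.

```python
import re
from typing import NamedTuple

class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str

def parse_sam_line(line):
    """Parse the 11 mandatory SAM fields; optional tags are ignored here."""
    f = line.rstrip("\n").split("\t")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])

def reference_span(cigar):
    """Reference bases consumed by the alignment (CIGAR ops M, D, N, =, X)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")

line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
aln = parse_sam_line(line)
is_paired = bool(aln.flag & 0x1)     # flag bit 0x1: read is paired
is_reverse = bool(aln.flag & 0x10)   # flag bit 0x10: read on the reverse strand
end = aln.pos + reference_span(aln.cigar) - 1
print(aln.qname, aln.pos, end, is_paired, is_reverse)
```

The `N` operation in a CIGAR such as `6M14N5M` marks a skipped region, which in RNA-seq data usually corresponds to an intron.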

SAM/BAM handling

• Several tools are available to manipulate alignment files:

samtools, Picard, sambamba, etc.

• Functions:

– view

– merge

– sort (some apps require sorted BAMs)

– subset (e.g. filter chr5)

– pipe (i.e. stream line by line) into other programs

– etc.

Library complexity

After alignment we can use the Picard tool

EstimateLibraryComplexity

Adapted from Levin et al. 2010 Nature

Assigning aligned reads to genes

Gene name Count of reads

0610005C13Rik 0

0610007N19Rik 28

0610007P14Rik 1157

0610008F07Rik 0

0610009B14Rik 4

0610009B22Rik 544

0610009D07Rik 708

0610009L18Rik 4

0610009O20Rik 169

0610010B08Rik 0

0610010F05Rik 418

0610010K14Rik 248

0610011F06Rik 147

htseq-count

>samtools view S1.sorted.bam | \

htseq-count --stranded=no - \

mm10.igenome_ucsc.gtf \

> S1.sorted.bam.counts

>_
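The counting step itself amounts to an interval-overlap test. The sketch below is a simplified model loosely based on the idea of counting only unambiguous reads; it is not htseq-count's implementation, it ignores strand and exon structure, and the gene coordinates and alignments are hypothetical.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 300),
}

def count_reads(alignments, genes):
    """Count reads per gene; reads overlapping zero or several genes are set aside."""
    counts = Counter({gene: 0 for gene in genes})
    for chrom, start, end in alignments:
        hits = [gene for gene, (g_chrom, g_start, g_end) in genes.items()
                if g_chrom == chrom and start <= g_end and end >= g_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts

# Alignments as (chromosome, start, end); coordinates are made up.
alignments = [("chr1", 120, 155), ("chr1", 300, 335), ("chr1", 470, 505),
              ("chr2", 150, 185), ("chr3", 10, 45)]
counts = count_reads(alignments, GENES)
for name, value in counts.items():
    print(name, value, sep="\t")
```

The output has the same two-column shape as the gene/count table above, plus summary rows for reads that could not be assigned.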

Normalization

• RPKM: reads per kilobase of exon per million reads mapped (Mortazavi et al. 2008)

• FPKM: fragments per kilobase of exon per million fragments mapped (Trapnell et al. 2010)

SE: FPKM = RPKM; PE: FPKM ≠ RPKM

• TPM: transcripts per million, the proportion of each transcript in your pool of RNA (Bo Li et al. 2009)

• TMM: trimmed mean of M-values (Robinson et al. 2010)
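RPKM and TPM follow directly from their definitions. The sketch below uses made-up counts and exon lengths, and it assumes that the total number of mapped reads equals the sum of the counts shown, which is a simplification.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of exon per million mapped reads, per gene."""
    total_reads = sum(counts.values())  # assumption: all mapped reads are counted
    return {gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_reads)
            for gene in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalised rates rescaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate * 1e6 / total_rate for gene, rate in rates.items()}

# Made-up counts and exon lengths (bp) for three genes.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values)
```

geneA and geneB get the same value because geneB has three times the reads over three times the length. TPM values always sum to one million within a sample, which is not true of RPKM.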

Cufflinks

>cuffnorm -o yourOutputDir mm10.gtf \

Sample1.bam \

Sample2.bam \

Sample3.bam

>_

Identification of genes differentially expressed

The specific nature of count data requires adequate statistical test methods

• Underlying count data are not normally distributed (over-dispersion): no t-test

• Negative binomial based methods:

– edgeR (R/Bioconductor)

– DESeq2 (R/Bioconductor)

– cuffdiff (stand alone)
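Over-dispersion can be checked on a single gene by comparing the variance across replicates with the mean. The sketch below uses made-up replicate counts and the negative binomial mean/dispersion relation, variance = mean + dispersion × mean². The method-of-moments estimate is only illustrative and is not the estimator used by edgeR or DESeq2.

```python
from statistics import mean, variance

def nb_variance(mu, dispersion):
    """Negative binomial variance in mean/dispersion form: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

# Made-up counts for one gene across five biological replicates.
replicates = [480, 610, 395, 720, 545]
mu = mean(replicates)
observed_var = variance(replicates)  # sample variance

# Rough method-of-moments dispersion estimate (illustration only).
dispersion = max((observed_var - mu) / mu ** 2, 0.0)

print(f"mean={mu:.1f} variance={observed_var:.1f}")
print(f"Poisson would predict variance={mu:.1f}")
print(f"estimated dispersion={dispersion:.3f}")
```

A Poisson model would force the variance to equal the mean; the extra dispersion term is what lets the negative binomial absorb biological variability between replicates.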

DESeq2

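library("DESeq2")

# countData: genes x samples matrix of raw counts
# colData: sample table with a "condition" column
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)

dds <- DESeq(dds)    # estimate size factors and dispersions, fit the model

res <- results(dds)  # extract the results table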

## log2 fold change (MAP): condition treated vs untreated

## Wald test p-value: condition treated vs untreated

## DataFrame with 6 rows and 6 columns

## baseMean log2FoldChange lfcSE stat pvalue padj

## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>

## FBgn0039155 453 -3.71 0.160 -23.2 4.01e-119 3.09e-115

## FBgn0029167 2165 -2.08 0.104 -20.1 6.68e-90 2.57e-86

## FBgn0035085 367 -2.23 0.137 -16.3 1.89e-59 4.85e-56

ADDITIONAL ANALYSIS

Splicing analysis

Image: wikipedia

Fusion proteins

Image: wikipedia

Splicing analysis

• Splicing analysis usually requires specific alignments and is divided into:

– annotation dependent

– annotation free (discovery)

– Examples:

• Olego/Quantas (CUMC: Zhang lab)

• MISO

• Scripture

• DEXSeq

• SGSeq

GUI

• Galaxy: https://usegalaxy.org/

• SAMMate: http://sammate.sourceforge.net/

• OneChannelGUI (R/Bioconductor)

• Others…

Sequence handling

Images from Werner Van Belle

Data refers to 36bp read length
