Biomedical Informatics Shared
Resource Workshop
RNA-seq analysis
2015 03 12
Paolo Guarnieri, M.D.
Topics
• Experimental design and library selection
• Sequence handling and processing
• Quality assessment
– Library level
– Read sequence level
– Sample level
• Identification of genes differentially expressed
• Additional analyses and tools
Microarray
Experimental Design
• Platform
• Chips
• Samples
RNA preparation
• Library preparation for hybridization
Data analysis
• Image analysis
• Probe intensities
• Normalization
RNA-seq
Experimental Design
• Lanes
• Reads
• Samples
RNA preparation
• Library prep for sequencing
Data analysis
• Alignment
• Read counts
• Normalization
Comparing paradigms
Experimental design and library selection
• Sample preparation:
– Choose the proper kit for your experiment
– Poly-A mRNA isolation vs. rRNA depletion (Ribo-Zero)
• Library preparation:
– Single end vs. paired end
– Please refer to the manufacturer for details
Paired end vs. single end
adapted from: Zhernakova et al., PLoS Genet. 2013 Jun; 9(6)
Library preparation steps:
- adapters ligated
- amplified
- size selected
THE SEQUENCE
Sequencing process
Images: Werner Van Belle
Flow cell: composed of multiple lanes
Lanes: contain multiple imaging positions
Sequencing by synthesis process
Positions are imaged 4 times
Each imaging position contains multiple sequence clusters
Incorporation of
new nucleotide
Repeat n times
n = read length
Base calling
Per-cycle base calling
Sequencing complete
Sequence file:
<.fastq>
The sequence
The raw data is the <.fastq> file
• A collection of multiple reads
• Each read in the file has these main features:
– 4 lines
– starts with @
Often <.fastq> files are compressed into <.fastq.gz>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36   <- Line 1: read ID + description
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC   <- Line 2: sequence
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36   <- Line 3: "+", optionally followed by the same text as line 1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC   <- Line 4: quality scores (one character per base)
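The four-line layout above can be read with a few lines of code. This is a minimal sketch, assuming an uncompressed, well-formed file with exactly four lines per read; the names `FastqRecord` and `read_fastq` are illustrative and not part of any tool mentioned in the workshop.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # line 1 without the leading '@' (ID plus description)
    sequence: str   # line 2
    quality: str    # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The read shown on the slide, used as an in-memory file.
example = io.StringIO(
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36\n"
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC\n"
)
records = list(read_fastq(example))
print(records[0].read_id.split()[0], len(records[0].sequence))
```

For a compressed file, the same function should work on a handle opened with `gzip.open(path, "rt")`.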
Sequence handling
When dealing with sequencing files be aware of:
• Data transfer:
– files start at around 1.5 Gb per sample (larger for paired end)
– compress as .gz to reduce transfer time
• Group files:
– match the read files (R1, R2) to their samples
• Storage:
– raw files are required for publication and must be kept safe
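Grouping read files by sample can be scripted. The sketch below assumes a naming convention of `<sample>_R1` / `<sample>_R2` followed by `.fq` or `.fastq` (optionally `.gz`); this convention is an assumption, so check how your facility actually names its files. The function name `group_by_sample` is illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumed naming convention: <sample>_R1.fq(.gz) / <sample>_R2.fastq(.gz)
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def group_by_sample(filenames: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map each sample name to its R1 and R2 files; ignore other files."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # not a read file under the assumed convention
        groups[match["sample"]]["R" + match["mate"]] = name
    return dict(groups)


files = ["S1_R1.fq.gz", "S1_R2.fq.gz", "S2_R1.fq", "notes.txt"]
for sample, mates in group_by_sample(files).items():
    status = "paired" if len(mates) == 2 else "single or incomplete"
    print(sample, mates, status)
```

A sample with only one mate listed is either single end or missing a file, which is worth checking before alignment.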
PRE-PROCESSING
Pre-processing
• Remove adaptors
– fastx_clipper
• De-multiplex (if required)
– fastx_barcode_splitter.pl, fastq-multx (handles both mates)
These steps are usually performed by the sequencing facility
Multiplexing removes variability
– Lane specific: multiplex in different lanes or flow cells (barcoding)
[Figure: a good and a bad assignment of samples to lanes]
De-multiplexing
• Grouped barcodes file (.fil) looks like this:
<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...
https://code.google.com/p/ea-utils/wiki/FastqMultx
>fastq-multx -B barcode-definition.fil \
PE_read1.fq -o r1.%.fq \
PE_read2.fq -o r2.%.fq
>_
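What de-multiplexing does can be illustrated with a toy version. This is a simplified sketch, assuming the barcode sits at the start of the read and must match exactly; fastq-multx itself is more flexible than this, so treat the code as an illustration of the idea, not of that tool. The barcodes and sample names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (read ID, sequence, quality)


def demultiplex(reads: Iterable[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Assign each read to a sample by its leading barcode and strip the barcode."""
    bins: Dict[str, List[Read]] = {sample: [] for sample in barcodes.values()}
    bins["unmatched"] = []
    for read_id, sequence, quality in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                n = len(barcode)
                # Remove the barcode from both the sequence and the quality string.
                bins[sample].append((read_id, sequence[n:], quality[n:]))
                break
        else:
            bins["unmatched"].append((read_id, sequence, quality))
    return bins


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
reads = [
    ("r1", "ACGTGGGTGATG", "IIIIIIIIIIII"),
    ("r2", "TGCACCGCTGCC", "IIIIIIIIIIII"),
    ("r3", "NNNNGATGGCGT", "############"),
]
for sample, assigned in demultiplex(reads, barcodes).items():
    print(sample, [read_id for read_id, _, _ in assigned])
```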
QUALITY ASSESSMENT
Library level QC
Before sequencing: measure RNA integrity with
the Bioanalyzer (usually performed by the facility)
Sequence level QC
• FastQC: quality score for each nt position
• Additional quality based trimming
[Figure panels: very good, normal, and bad per-position quality profiles]
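The per-position quality scores that FastQC plots come from the fourth line of each read. The sketch below assumes the Phred+33 encoding (older Illumina pipelines used a +64 offset), and the 3' trimming rule with a threshold of 20 is an illustrative choice, not the algorithm of any specific trimming tool.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality_per_position(qualities: List[str], offset: int = 33) -> List[float]:
    """Average score at each read position across reads."""
    if not qualities:
        return []
    longest = max(len(q) for q in qualities)
    means = []
    for i in range(longest):
        column = [ord(q[i]) - offset for q in qualities if i < len(q)]
        means.append(sum(column) / len(column))
    return means


def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their score is below min_q."""
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_q:
        keep -= 1
    return sequence[:keep], quality[:keep]


print(phred_scores("I9GC"))
print(trim_3prime("ACGTAC", "IIII##"))
```

Under this encoding the character `I` corresponds to a score of 40, that is, an error probability of 1 in 10,000.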
Sample/Dataset level
• Evaluate the distribution of the reads as a
function of expression level
Coverage
After sequencing: compare your actual library
size with the expected one, per sample:
• Total reads
• Mapped reads
• Sum of counts
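The three numbers above are easiest to compare as ratios. A minimal sketch follows; the read numbers are made up for illustration, and `library_summary` is a hypothetical helper, not a function of any tool named here.

```python
from typing import Dict


def library_summary(total_reads: int, mapped_reads: int,
                    sum_of_counts: int, expected_reads: int) -> Dict[str, float]:
    """Express the per-sample totals as fractions that are easy to compare."""
    return {
        "fraction_of_expected": total_reads / expected_reads,  # did the lane deliver?
        "mapping_rate": mapped_reads / total_reads,            # reads that aligned
        "assigned_rate": sum_of_counts / mapped_reads,         # aligned reads counted in genes
    }


# Hypothetical sample: 25 M reads expected, 20 M obtained.
summary = library_summary(
    total_reads=20_000_000,
    mapped_reads=18_000_000,
    sum_of_counts=13_500_000,
    expected_reads=25_000_000,
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A sample whose ratios differ strongly from the rest of the dataset deserves a closer look before the differential expression step.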
Alignment/Mapping – what
Reference genome / transcriptome
...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...
Reads to be mapped
TGGGCCGGCA
CCGGCAATTC
ATTCGATATC
GATATCGCGC
GCATATATTT
CATGCTTAGC
ATATTTCGGC
GCATATATTT
TCGCGCATAT
Alignment/Mapping – what
Reference genome / transcriptome
...GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC...
    TGGGCCGGCA            GCATATATTT     CATGCTTAGC
        CCGGCAATTC            ATATTTCGGC
              ATTCGATATC  GCATATATTT
                      TCGCGCATAT
                  GATATCGCGC
Reads mapped
Alignment/Mapping – Coverage
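The toy example from the mapping slides can be reproduced directly. This sketch uses exact string matching only, which is an assumption made for clarity: real aligners use indexed references and tolerate mismatches, gaps, and splice junctions. The function names are illustrative.

```python
from typing import List

# Reference fragment and reads copied from the slides.
REFERENCE = "GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC"
READS = [
    "TGGGCCGGCA", "CCGGCAATTC", "ATTCGATATC", "GATATCGCGC", "GCATATATTT",
    "CATGCTTAGC", "ATATTTCGGC", "GCATATATTT", "TCGCGCATAT",
]


def map_exact(reference: str, read: str) -> List[int]:
    """Return every 0-based start position where the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def coverage(reference: str, reads: List[str]) -> List[int]:
    """Per-base depth, counting only reads with exactly one hit."""
    depth = [0] * len(reference)
    for read in reads:
        hits = map_exact(reference, read)
        if len(hits) != 1:
            continue  # unmapped or multi-mapped reads are skipped
        for i in range(hits[0], hits[0] + len(read)):
            depth[i] += 1
    return depth


for read in READS:
    print(read, map_exact(REFERENCE, read))
print("".join(str(d) for d in coverage(REFERENCE, READS)))
```

The last line prints one digit per reference base, which is a text version of a coverage track.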
Alignment/Mapping – how
• Alignment programs:
– Bowtie
– BWA
– GSNAP
– OLEGO
– STAR
– TopHat
– […]
• Parameters are important:
– Number of mismatches
– Unique alignments
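Why these two parameters interact can be seen in a small example: allowing more mismatches can turn a uniquely mapped read into a multi-mapped one. The sketch below is a brute-force scan that counts mismatches with the Hamming distance; it is only an illustration and does not reflect how any of the listed aligners search. The reference and reads are made up.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(reference: str, read: str, max_mismatches: int) -> List[int]:
    """All 0-based start positions where the read fits within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(window, read) <= max_mismatches:
            hits.append(start)
    return hits


def classify(reference: str, read: str, max_mismatches: int) -> str:
    """Label a read as unmapped, unique, or multi under the given limit."""
    hits = find_hits(reference, read, max_mismatches)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "multi"


# Hypothetical reference and read, for illustration only.
reference = "ACGTTTACGA"
print(classify(reference, "ACGT", max_mismatches=0))
print(classify(reference, "ACGT", max_mismatches=1))
```

Keeping only the reads classified as unique corresponds in spirit to requesting a single alignment per read from the aligner.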
STAR aligner
>STAR --genomeDir /star_indices/mm10 \
--runThreadN 8 \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFfile mm10.igenome_ucsc.gtf \
--readFilesCommand zcat \
--outSAMstrandField intronMotif \
--readFilesIn S1_R1.fq.gz S1_R2.fq.gz \
--outFileNamePrefix somePrefix.
>_
Aligned sequences
Aligned sequences are stored in:
– SAM file: Sequence Alignment/Map file
– BAM file: BGZF-compressed binary version of the SAM
– BAI file: index of the BAM
Main features:
– stores all the alignment information
– compact
– can be processed line by line
– can be indexed for fast position access
SAM file
TAB-delimited text format consisting of a header section (optional) and an alignment section
SAM/BAM and related specifications:
http://samtools.github.io/hts-specs/
@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC *
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT *
Header section: lines starting with @ (here @HD and @SQ)
Alignment section columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
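The eleven mandatory columns, the FLAG bits, and the CIGAR string can be decoded with the standard library alone. This is a minimal sketch based on the SAM specification linked above; it ignores optional tags and header lines, and the class and function names are illustrative. In practice a library such as pysam or the samtools command line would normally be used instead.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory TAB-separated fields of one alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> List[str]:
    """Names of all bits set in FLAG, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


record = parse_sam_line("r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*")
print(record.qname, decode_flag(record.flag), reference_span(record.cigar))
```

For r001, a start at position 7 and a span of 16 reference bases mean the alignment ends at position 22. The `14N` in r004 is the skipped region typical of a spliced RNA-seq read.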
SAM/BAM handling
• Several tools to fiddle with alignment files:
samtools, picard, sambamba etc.
• Functions:
– view
– merge
– sort (some apps require sorted BAMs)
– subset (e.g. filter chr 5)
– pipe (i.e. stream line by line) into other programs
– etc.
Library complexity
After alignment we can use the Picard tool
EstimateLibraryComplexity
Adapted from Levin et al. 2010 Nature
Assigning aligned reads to genes
Gene name Count of reads
0610005C13Rik 0
0610007N19Rik 28
0610007P14Rik 1157
0610008F07Rik 0
0610009B14Rik 4
0610009B22Rik 544
0610009D07Rik 708
0610009L18Rik 4
0610009O20Rik 169
0610010B08Rik 0
0610010F05Rik 418
0610010K14Rik 248
0610011F06Rik 147
htseq-count
>samtools view S1.sorted.bam | \
htseq-count --stranded=no --order=pos - \
mm10.igenome_ucsc.gtf > S1.sorted.bam.counts
>_
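The idea behind assigning reads to genes can be sketched as an interval overlap test. The version below is deliberately simplified: it treats each gene as one interval and ignores exons, strand, and paired reads, all of which htseq-count handles. The gene names and coordinates are hypothetical; the `__no_feature` and `__ambiguous` labels follow the naming of htseq-count's special counters.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def count_reads(genes: Dict[str, Interval], reads: Iterable[Interval]) -> Counter:
    """Count a read for a gene only if it overlaps exactly one gene."""
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        overlapping = [
            name for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif not overlapping:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical genes and aligned reads, for illustration only.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),   # inside geneA only
    ("chr1", 460, 510),   # overlaps both genes
    ("chr1", 600, 650),   # inside geneB only
    ("chr2", 120, 170),   # no gene on this chromosome
]
for name, n in count_reads(genes, reads).items():
    print(name, n)
```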
Normalization
• RPKM: reads per kilobase of exon per million reads mapped
(Mortazavi et al. 2008)
• FPKM: fragments per kilobase of exon per million fragments
mapped (Trapnell et al. 2010)
SE: FPKM = RPKM; PE: FPKM ≠ RPKM
• TPM: transcripts per million, the proportion of transcripts in your
pool of RNA (Li et al. 2009)
• TMM: trimmed mean of M-values (Robinson et al. 2010)
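RPKM and TPM differ only in the order of the two normalizations, which a small calculation makes visible. This is a minimal sketch with hypothetical counts and gene lengths; it uses the sum of the counts as the number of mapped reads, which is a simplifying assumption.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase of gene per million counted reads."""
    total = sum(counts.values())  # assumption: stands in for mapped reads
    return {g: c / (lengths_bp[g] / 1e3) / (total / 1e6) for g, c in counts.items()}


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Length-normalize first, then scale so the values sum to one million."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    total_rate = sum(rate.values())
    return {g: r / total_rate * 1e6 for g, r in rate.items()}


# Hypothetical counts and gene lengths, for illustration only.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

for gene in counts:
    print(gene, round(rpkm(counts, lengths_bp)[gene]), round(tpm(counts, lengths_bp)[gene]))
```

geneA and geneB get the same value because geneB has three times the reads but is three times as long. TPM values always sum to one million within a sample, which RPKM values do not.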
Cufflinks
>cuffnorm -o yourOutputDir mm10.gtf \
Sample1.bam \
Sample2.bam \
Sample3.bam
>_
Identification of genes differentially
expressed
The specific nature of count data requires adequate
statistical test methods
• Count data are not normally distributed and are
over-dispersed (variance exceeds the mean): no t-test
• Negative binomial based methods:
– edgeR (R/Bioconductor)
– DESeq2 (R/Bioconductor)
– cuffdiff (stand alone)
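Over-dispersion can be checked on the replicate counts of a single gene. The sketch below uses the negative binomial relation variance = mean + dispersion × mean² with a simple method-of-moments estimate; this is an illustration only, since edgeR and DESeq2 estimate dispersion with more elaborate procedures that share information across genes. The counts are hypothetical and assumed to be already corrected for library size.

```python
from statistics import mean, variance
from typing import Sequence


def dispersion_estimate(counts: Sequence[float]) -> float:
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0."""
    mu = mean(counts)
    var = variance(counts)  # sample variance across replicates
    if mu == 0:
        return 0.0
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical counts of one gene in four replicates.
replicates = [90, 110, 150, 50]
print("mean:", mean(replicates))
print("variance:", round(variance(replicates), 1))
print("dispersion:", round(dispersion_estimate(replicates), 3))
```

Under a Poisson model the variance would be close to the mean; a variance far above it is what the negative binomial dispersion term accounts for.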
DESeq2
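library("DESeq2")
# countData: gene-by-sample matrix of raw read counts
# colData: sample table with a "condition" column
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)
dds <- DESeq(dds)    # size factors, dispersions, model fit
res <- results(dds)  # log2 fold changes, p-values, adjusted p-values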
## log2 fold change (MAP): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0039155 453 -3.71 0.160 -23.2 4.01e-119 3.09e-115
## FBgn0029167 2165 -2.08 0.104 -20.1 6.68e-90 2.57e-86
## FBgn0035085 367 -2.23 0.137 -16.3 1.89e-59 4.85e-56
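The padj column holds p-values adjusted for multiple testing; DESeq2 uses the Benjamini-Hochberg procedure by default. The sketch below implements that procedure on a hypothetical list of p-values. It will not reproduce the padj values shown above, because those depend on the total number of genes tested and on DESeq2's filtering of low-count genes.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Adjusted p-values: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(pvalues)])
```

Genes are then called differentially expressed when their adjusted p-value falls below a chosen false discovery rate.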
ADDITIONAL ANALYSIS
Splicing analysis
Image: Wikipedia
Fusion proteins
Image: Wikipedia
Splicing analysis
• Splicing analysis usually requires splice-aware alignments and is divided into:
– annotation dependent
– annotation free (discovery)
– Examples:
• Olego/Quantas (CUMC: Zhang lab)
• MISO
• Scripture
• DEXSeq
• SGSeq
GUI
• Galaxy: https://usegalaxy.org/
• SAMMate: http://sammate.sourceforge.net/
• oneChannelGUI (R/Bioconductor)
• Others…
EOF
Sequence handling
Images from Werner Van Belle
Data refers to 36bp read length