NGS Data Analysis: An Intro to RNA-Seq
March 25th, 2014
GST Colloquim: March 25th, 2014 1 / 1
Workshop Design
Basics of NGS
Sample Prep
RNA-Seq Analysis
Experimental Design
There are many kinds of sequencing experiments available:
Resequencing
Assembly
RNA-Seq
ChIP-Seq
Metagenomics
Common experimental questions:
Measure variation within or between species
Generate a genome sequence
Transcriptome characterization
Identify protein binding sites
Population genetics
Differential expression studies
Basic Process
Design Considerations
What resources do you have already? (reference genome, curated gene models, etc.)
Do you need biological reps? (Depends on the experiment, but the answer is usually yes.)
Do you need technical reps? (Most likely not.)
Do you need controls? (Depends on the experiment.)
Do you need deep sequencing coverage? (Again, depends on the experiment.)
All of these questions should be answered before you start.
Types of reads
Single: Fast runs. Cheapest overall cost.
Paired: More data for each fragment. More data for alignment/assembly. Same inputs as single-end. Best for isoform detection.
Mate-Paired: Longer pairs than paired-end. Allows sequencing over long repeats. Good for detecting structural variations. Requires more input DNA than any other library.
How many reads?
Genomic: Depends on the size of your genome. You want enough reads to cover your genome at depth.
RNA-Seq: Depends on the complexity of the transcriptional profile you’re working on and whether you need to capture rare events. The rule of thumb is that more replicates are more important than more sequences.
Again, this decision depends entirely on the question you are trying to answer and the organism you are working in.
In reality, there is usually more sequencing capacity in a lane than you need for a sample, so the real question is how many samples you can pool into a given lane.
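The two questions above (depth for a genome, samples per lane) reduce to simple arithmetic. The sketch below uses the usual expected-coverage estimate, reads × read length / genome size. The function names and all numbers (a 140 Mb genome, a lane yielding 180 million reads, 30 million reads per sample) are hypothetical values chosen for illustration, not figures from this workshop.

```shell
# Expected depth of coverage: reads * read_length / genome_size.
coverage() {
    # $1 = number of reads, $2 = read length (bp), $3 = genome size (bp)
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

# How many samples fit in one lane for a given per-sample read target.
samples_per_lane() {
    # $1 = reads produced per lane, $2 = reads wanted per sample
    awk -v lane="$1" -v want="$2" 'BEGIN { print int(lane / want) }'
}

# Hypothetical example values (assumptions, not workshop data):
coverage 20000000 100 140000000        # 20 M reads of 100 bp on a 140 Mb genome
samples_per_lane 180000000 30000000    # lane yield vs. per-sample target
```

For RNA-Seq the genome-size formula is only a rough guide, since read depth per gene depends on expression level rather than on genome size.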
Read length:
Again completely dependent on experiment and organism.
Longer is usually better.
But sometimes short is good enough.
Selecting a technology:
Based on Read / Library Type
Illumina: Paired, Single, Mate Pair
Ion Torrent: Paired, Single, Mate Pair
SOLiD: Single, Mate Pair
454: Single, Mate Pair
PacBio: Single
Read Length:
Illumina: 150-250 bp
Ion Torrent: 200-400 bp (100-200 bp for Paired)
454: 500-1000 bp
PacBio: 1000+ bp
SOLiD: 75 bp
Selecting a technology:
Read Number: (Manufacturer’s claims, and machine dependent)
Illumina: 0.3-1000 gigabases
Ion Torrent: 60-80 million reads
454: 1 million reads
PacBio: ?
SOLiD: 90-300 gigabases
Sample Prep (RNA-Seq Specific)
Sample Collection and Storage:
RNA-Later - Stabilization buffer, 1 month storage time at RT. Good for field collection.
Liquid Nitrogen - Fast, cheap, and effective as long as you have constant access.
RNA extraction - Some sequencing centers only want total RNA so that they can verify sample quality before library prep.
Sample Prep (RNA-Seq Specific)
rRNA Depletion:
Poly-A Enrichment: polyA tails of mRNA are used to enrich a sample (most common)
rRNA depletion: rRNA is actively bound and removed (important if a large amount of rRNA is present)
cDNA Library:
Non-Stranded: Total RNA used for cDNA library construction. Strand information is not preserved.
Stranded: Strand information is preserved. Crucial in organisms with overlapping genes.
Library Prep
It is common to have a sequencing center do this step for you, but depending on budget and experience you may want to do this yourself.
Fragment DNA: Sonication or enzyme-based methods followed by size selection
DNA Repair: Blunting + A overhang
Ligate Adaptors
Attachment Site: PCR addition of attachment site to one end
Barcode Attachment: PCR addition of barcode and attachment site to the other end
Clean Up: Remove unligated adapters, etc.
Sequencing
Send your samples off to the sequencing center.
You’ll get raw data back when it’s done.
Quality Control of Raw Data
Need to measure:
Proportion of high quality bases called.
Distribution of called nucleotides.
Number of reads that are high overall quality
Distribution of read qualities at each position
Trimming and Filtering Reads
It is common practice to:
remove reads with overall poor quality
trim the ends of reads to remove low quality sequences
remove low quality nucleotides
There are compelling arguments why you may want to do this later, but in general it’s always safe to do these steps before you align reads.
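As a rough illustration of the trimming and filtering steps listed above, here is a minimal awk sketch that trims low-quality bases from the 3' end of each read and drops reads that become too short. It assumes 4-line FASTQ records and Phred+33 quality encoding; the function name trim_fastq and the thresholds in the usage comment are hypothetical. Dedicated trimming tools handle adapters, paired reads, and sliding windows, which this sketch does not.

```shell
# Trim low-quality 3-prime ends and drop short reads from a FASTQ stream.
# Assumes 4-line records and Phred+33 quality encoding.
trim_fastq() {
    # $1 = minimum Phred score to keep, $2 = minimum read length after trimming
    awk -v minq="$1" -v minlen="$2" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            qual = $0
            n = length(qual)
            # Walk back from the end while the base quality is below minq.
            while (n > 0 && ord[substr(qual, n, 1)] - 33 < minq) n--
            # Keep the read only if enough bases survive the trim.
            if (n >= minlen)
                printf "%s\n%s\n+\n%s\n", header, substr(seq, 1, n), substr(qual, 1, n)
        }
    '
}

# Hypothetical usage (single-end file, thresholds are illustrative):
# gunzip -c reads.fq.gz | trim_fastq 20 36 > reads.trimmed.fq
```

For paired-end data both mates must be filtered together so the two files stay in sync, which is one reason to prefer a purpose-built trimmer in practice.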
What comes next?
[Figure omitted. Source: Eyras, Eduardo; P. Alamancos, Gael; Agirre, Eneritz (2013): Methods to Study Splicing from RNA-Seq. figshare. http://dx.doi.org/10.6084/m9.figshare.679993]
The Actual Workshop:
Learning Objectives
RNA-seq data quality control (FastQC)
Align sequence reads to a reference genome using TopHat
Review samtools and file format conversion
View alignments in IGV
Analyze differential gene expression (in R environment)
Analysis workflow
Tools
bowtie2
tophat2
FastQC
samtools
R and required Bioconductor packages (DESeq)
RStudio
HTSeq 0.6.0
Integrative Genomics Viewer (IGV)
Java
Most of these required tools are already installed in my bin folder:
/lustre/home/qjia2/bin
The data
The data used in this tutorial were acquired from this paper:
Trapnell C, et al: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012, 7(3):562-578.
The data were generated in silico for Drosophila melanogaster and contain 6 paired-end samples corresponding to 3 biological replicates each of 2 conditions.
File name Description
C1_R1_1.fq.gz, C1_R1_2.fq.gz Simulated Condition 1, replicate 1
C1_R2_1.fq.gz, C1_R2_2.fq.gz Simulated Condition 1, replicate 2
C1_R3_1.fq.gz, C1_R3_2.fq.gz Simulated Condition 1, replicate 3
C2_R1_1.fq.gz, C2_R1_2.fq.gz Simulated Condition 2, replicate 1
C2_R2_1.fq.gz, C2_R2_2.fq.gz Simulated Condition 2, replicate 2
C2_R3_1.fq.gz, C2_R3_2.fq.gz Simulated Condition 2, replicate 3
Download the reference genome and gene model annotations
You also need the reference genome and gene model annotations (GTF models), which can be downloaded from Ensembl or Illumina.
wget ftp://ftp.ensembl.org/pub//mnt2/release-75/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa.gz
wget ftp://ftp.ensembl.org/pub//mnt2/release-75/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.75.gtf.gz
gunzip Drosophila_melanogaster.BDGP5.75.*
Indexing your reference genome:
/lustre/home/qjia2/bin/bowtie2-build -f Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa Dme_BDGP5_75
After executing the command, the following BT2 files will be created:
Dme_BDGP5_75.1.bt2
Dme_BDGP5_75.2.bt2
Dme_BDGP5_75.3.bt2
Dme_BDGP5_75.4.bt2
Dme_BDGP5_75.rev.1.bt2
Dme_BDGP5_75.rev.2.bt2
For model species, you can download pre-built Bowtie and Bowtie 2 indexes from the Bowtie website.
Create links to the required data
The required files are stored in the following directory on Newton:
/data/scratch/qjia2/data2012
In your working directory, you can create links to these files so that you don’t need to copy them into your folders.
To create links, type the following commands from your working directory:
ln -s /data/scratch/qjia2/data2012/Dme_BDGP5_75.* .
ln -s /data/scratch/qjia2/data2012/genes.gtf .
ln -s /data/scratch/qjia2/data2012/GSM79448* .
Then, type:
ls
You will see those files.
Assess data quality
In this workshop, we’ll use FastQC to check the quality and integrity of the RNA-seq reads.
“FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.”
Create a directory to store output files:
mkdir fastqc_reports
Run FastQC:
/lustre/home/qjia2/bin/fastqc -f fastq -o fastqc_reports *.fq.gz
Inspect the output:
FastQC generates an HTML report for each input file, which you need to view in your web browser.
FastQC report for a good Illumina dataset
FastQC report for a bad Illumina dataset
Align RNA-seq reads to the genome using TopHat2
Create a job definition file called C1R1.sge:
#$ -N C1R1
#$ -q medium*
#$ -cwd
#$ -pe threads 8
/home/qjia2/bin/tophat2 -G genes.gtf -o C1_R1_thout Dme_BDGP5_75 GSM794483_C1_R1_1.fq.gz GSM794483_C1_R1_2.fq.gz
Submit the job using the qsub command:
qsub C1R1.sge
Use the qstat command to check the status of your jobs:
qstat
Kill your job:
qdel your_job_PID
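Since the same alignment has to be run for all six samples, the job files can be generated in a loop rather than written by hand. The sketch below reuses the SGE directives and the tophat2 command from the C1R1.sge example. The function name make_tophat_job is hypothetical, and the GSM*_ glob is an assumption: only the GSM794483 accession appears in this tutorial, so the glob stands in for the per-sample prefix and must match exactly one file per mate.

```shell
# Print an SGE job definition for one sample (for example C1_R1) to stdout.
make_tophat_job() {
    sample="$1"
    name=$(printf '%s' "$sample" | tr -d '_')    # C1_R1 -> C1R1
    cat <<EOF
#\$ -N ${name}
#\$ -q medium*
#\$ -cwd
#\$ -pe threads 8
/home/qjia2/bin/tophat2 -G genes.gtf -o ${sample}_thout Dme_BDGP5_75 GSM*_${sample}_1.fq.gz GSM*_${sample}_2.fq.gz
EOF
}

# Write one job file per sample; submit each with qsub once you have checked it.
for s in C1_R1 C1_R2 C1_R3 C2_R1 C2_R2 C2_R3; do
    job="$(printf '%s' "$s" | tr -d '_').sge"
    make_tophat_job "$s" > "$job"
    # qsub "$job"
done
```

The qsub call is left commented out so the generated files can be inspected first.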
TopHat2 output
tophat2 produces a number of files, most of which are internal, intermediate files that are generated for use within the pipeline.
The output files you will likely want to look at are:
accepted_hits.bam: This file details the alignments for mapped reads.
align_summary.txt
deletions.bed
insertions.bed
junctions.bed: This file contains all the splice sites detected by TopHat during the alignment.
logs/
prep_reads.info
unmapped.bam
The accepted_hits.bam file is used for our further analysis. This file is not “human-readable”, but we can use Samtools to convert it to the .sam format. Next, we’ll talk about Samtools first and then use IGV to look at our alignments.
samtools
“SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.”
samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19-44428cd
Usage: samtools <command> [options]
Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name
File manipulation
To analyse differential expression, we need to count the reads that align to each gene.
The htseq-count script needs name-sorted .sam files as input, so run the following commands to sort by read name (-n) and create .sam files.
samtools sort -n C1_R1_thout/accepted_hits.bam C1_R1_sn
samtools view -o C1_R1_sn.sam C1_R1_sn.bam
In order to view the alignments in IGV, the .bam files must be sorted by position and indexed.
samtools sort C1_R1_thout/accepted_hits.bam C1_R1_s
samtools index C1_R1_s.bam
View alignments in the IGV
1. Start the IGV software
If you haven’t installed it or have trouble starting it, please click here.
2. Load genome and gene annotation into IGV
Under the Main Menu, click Genomes -> Create .genome File…, and the following window will appear:
View alignments in the IGV - cont.
3. Load mapped reads into IGV
Under the Main Menu, click on File -> Load from File…. Choose C1_R1_s.bam, and wait for IGV to finish loading.
4. Navigate in IGV
For further details, see the IGV user guide.
Count reads in features with htseq-count
HTSeq is a Python package, so it can be used as a library. It also provides a set of stand-alone scripts that we can use from the command line.
The script called htseq-count will be used to count the reads overlapping with known genes. It accepts .sam files and a genome annotation file (GTF format) as inputs.
htseq-count -s no -a 10 C1_R1_sn.sam genes.gtf > C1_R1.count
-s: whether the data is from a strand-specific assay (default: yes)
-a: skip all reads with alignment quality lower than the given minimum value (default: 10)
It outputs a table with counts for each feature.
FBgn0000003 0
FBgn0000008 622
FBgn0000014 91
FBgn0000015 73
FBgn0000017 2700
... ...
After running this command on the other five samples, merge the htseq-count files into one (mergedCounts.txt).
gene_id      C1R1  C1R2  C1R3  C2R1  C2R2  C2R3
FBgn0000003  0     0     0     0     0     0
FBgn0000008  622   618   555   530   606   547
FBgn0000014  91    81    104   87    125   102
FBgn0000015  73    67    53    55    71    73
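The merge step is not spelled out in the slides, so here is one possible way to do it with awk. This is a sketch under stated assumptions: the per-sample files are named like C1_R1.count, each has two whitespace-separated columns (gene id, count), and the summary rows that htseq-count appends are assumed to start with a double underscore (for example __no_feature) and are skipped. The function name merge_counts is hypothetical.

```shell
# Merge per-sample htseq-count files into one tab-separated table.
# Column names are derived from file names: C1_R1.count -> C1R1.
merge_counts() {
    awk '
        BEGIN { OFS = "\t" }
        FNR == 1 {                      # first line of a new input file
            nfiles++
            name = FILENAME
            sub(/.*\//, "", name)       # drop any directory part
            sub(/\.count$/, "", name)   # drop the extension
            gsub(/_/, "", name)         # C1_R1 -> C1R1
            col[nfiles] = name
        }
        /^__/ { next }                  # skip summary rows (assumed prefix)
        {
            if (!($1 in seen)) { seen[$1] = 1; genes[++ngenes] = $1 }
            count[$1, nfiles] = $2
        }
        END {
            line = "gene_id"
            for (f = 1; f <= nfiles; f++) line = line OFS col[f]
            print line
            for (g = 1; g <= ngenes; g++) {
                line = genes[g]
                # A gene missing from a file is reported as 0.
                for (f = 1; f <= nfiles; f++) line = line OFS (count[genes[g], f] + 0)
                print line
            }
        }
    ' "$@"
}

# Usage (file names follow the pattern used in this tutorial):
# merge_counts C1_R1.count C1_R2.count C1_R3.count \
#              C2_R1.count C2_R2.count C2_R3.count > mergedCounts.txt
```

The output is tab-separated, which read.table() with header = TRUE reads without extra arguments.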
Find differentially expressed genes (DESeq)
The commands used here are also described in the DESeq vignette (PDF).
1. Starting R and loading required modules
R
library("DESeq")
2. Set your working directory
# Make sure you are in the For_DESeq directory.
setwd("/Users/mac/Documents/rna_seq/files/dataset/For_DESeq")
# You can use the getwd() command to check your current working directory.
getwd()
3. Read in your count table.
CountTable = read.table("mergedCounts.txt", header = TRUE, row.names = 1)
Your table should look like this:
head(CountTable)
##             C1R1 C1R2 C1R3 C2R1 C2R2 C2R3
## FBgn0000003    0    0    0    0    0    0
## FBgn0000008  622  618  555  530  606  547
## FBgn0000014   91   81  104   87  125  102
## FBgn0000015   73   67   53   55   71   73
## FBgn0000017 2700 2425 2485 2575 2643 2604
## FBgn0000018  328  343  363  304  288  345
Find differentially expressed genes (DESeq)- cont.
4. Add treatment information to the data.
condition = factor(c("C1", "C1", "C1", "C2", "C2", "C2"))
condition
## [1] C1 C1 C1 C2 C2 C2
## Levels: C1 C2
5. Create a new CountDataSet
cds <- newCountDataSet(CountTable, condition)
6. Estimate the size factors from the count data (Normalization)
cds <- estimateSizeFactors(cds)
To see these size factors, do this:
sizeFactors(cds)
##   C1R1   C1R2   C1R3   C2R1   C2R2   C2R3
## 1.0297 1.0295 1.0302 0.9755 0.9762 0.9777
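For reference, the size factors come from the median-of-ratios method described in the DESeq paper and vignette. The notation below is ours, not from the slides: k_ij is the raw count of gene i in sample j, m is the number of samples, and the median is taken over genes whose geometric mean across samples is nonzero.

```latex
s_j = \operatorname{median}_{i}
      \frac{k_{ij}}{\left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}},
\qquad
\tilde{k}_{ij} = \frac{k_{ij}}{s_j}
```

As a worked check, the raw count of FBgn0000008 in C1R1 is 622 and the size factor of C1R1 is 1.0297, so the normalized count is 622 / 1.0297, or about 604.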
Find differentially expressed genes (DESeq)- cont.
Then, we can normalize the counts by the size factors using the following command:
head(counts(cds, normalized = TRUE))
##                C1R1    C1R2    C1R3    C2R1    C2R2    C2R3
## FBgn0000003    0.00    0.00    0.00    0.00    0.00    0.00
## FBgn0000008  604.04  600.27  538.70  543.30  620.78  559.46
## FBgn0000014   88.37   78.68  100.95   89.18  128.05  104.32
## FBgn0000015   70.89   65.08   51.44   56.38   72.73   74.66
## FBgn0000017 2622.02 2355.43 2412.04 2639.61 2707.45 2663.31
## FBgn0000018  318.53  333.16  352.34  311.63  295.02  352.86
7. Calculate dispersion values
cds <- estimateDispersions(cds)
8. Inspect the estimated dispersions
plotDispEsts(cds)
Find differentially expressed genes (DESeq)- cont.
9. Perform the test for differential expression
deg = nbinomTest(cds, "C1", "C2")
10. Plot the log2 fold changes against the mean normalised counts
plotMA(deg)
Find differentially expressed genes (DESeq)- cont.
11. Plot histogram of p values
hist(deg$pval, breaks = 100, col = "skyblue", main = "")
12. Filter for significant genes at a 10% false discovery rate (FDR)
degSig = deg[deg$padj < 0.1, ]
Count the number of significant genes:
addmargins(table(deg$padj < 0.1))
##
## FALSE  TRUE   Sum
## 10012   269 10281
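The padj column used for this filter is, as far as we know, a Benjamini-Hochberg adjusted p-value. To show what that adjustment does, here is a small stand-alone sketch of the Benjamini-Hochberg procedure. It is an illustration only, not DESeq's code: the function name bh_adjust is hypothetical, the input is assumed to be one numeric p-value per line with no NA values, and sort -g (general numeric sort, available in GNU and BSD sort) is assumed.

```shell
# Benjamini-Hochberg adjustment: read p-values (one per line) on stdin,
# print adjusted p-values in the original input order.
bh_adjust() {
    awk '{ print NR "\t" $1 }' |
    sort -t "$(printf '\t')" -k2,2g |
    awk -F '\t' '
        { idx[NR] = $1; p[NR] = $2 }    # idx = original position, p = sorted p-value
        END {
            m = NR
            prev = 1
            # Walk from the largest p-value down, enforcing monotonicity.
            for (i = m; i >= 1; i--) {
                adj = p[i] * m / i
                if (adj > prev) adj = prev
                prev = adj
                out[idx[i]] = adj
            }
            for (k = 1; k <= m; k++) printf "%.4g\n", out[k]
        }'
}

# Example: four p-values, adjusted values printed in the same order.
printf '0.01\n0.04\n0.03\n0.005\n' | bh_adjust
```

Filtering on padj < 0.1 then means that, among the genes called significant, about 10% are expected to be false positives.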
Find differentially expressed genes (DESeq)- cont.
13. Look at the significantly upregulated and downregulated genes
head(degSig[order(degSig$log2FoldChange, decreasing = TRUE), ])
##                id baseMean baseMeanA baseMeanB foldChange log2FoldChange
## 2388  FBgn0025682    12095      6624     17565      2.652          1.407
## 126   FBgn0000370    15468      8495     22440      2.641          1.401
## 13444 FBgn0086904    17531      9887     25174      2.546          1.348
## 15309 FBgn0263749     5510      3113      7908      2.540          1.345
## 2103  FBgn0022893    15478      8754     22203      2.536          1.343
## 2076  FBgn0022268     5276      2989      7562      2.530          1.339
##            pval      padj
## 2388  1.483e-96 7.626e-93
## 126   3.985e-97 4.097e-93
## 13444 7.761e-91 2.660e-87
## 15309 5.640e-72 6.443e-69
## 2103  1.261e-89 3.241e-86
## 2076  1.574e-81 2.312e-78
head(degSig[order(degSig$log2FoldChange, decreasing = FALSE), ])
##                id baseMean baseMeanA baseMeanB foldChange log2FoldChange
## 11475 FBgn0051953    12.78     19.42     6.146     0.3165        -1.6597
## 2685  FBgn0027513   160.39    189.03   131.764     0.6971        -0.5206
## 5844  FBgn0033781   245.78    281.27   210.280     0.7476        -0.4197
## 5682  FBgn0033539   546.01    624.37   467.646     0.7490        -0.4170
## 8947  FBgn0038348   258.89    295.20   222.584     0.7540        -0.4073
## 13333 FBgn0086251   529.43    600.73   458.118     0.7626        -0.3910
##            pval     padj
## 11475 2.557e-03 0.098108
## 2685  8.797e-04 0.035054
## 5844  1.432e-03 0.055784
## 5682  3.157e-05 0.001319
## 8947  1.942e-03 0.074794
## 13333 1.093e-04 0.004514
Find differentially expressed genes (DESeq)- cont.
14. Save our output to a file
write.csv(deg, file = "Result_table.csv")
write.csv(degSig, file = "Result_table_0.1FDR.csv")
You can use a spreadsheet program such as Excel to open .csv files.
References
1. S. Anders, D. J. McCarthy, Y. S. Chen, M. Okoniewski, G. K. Smyth, W. Huber, M. D. Robinson, Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols 8, 1765-1786 (2013). doi:10.1038/nprot.2013.099
2. C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, L. Pachter, Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562-578 (2012). doi:10.1038/nprot.2012.016
3. DESeq vignette: http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf