NGS Data Analysis: An Intro to RNA-Seq

March 25th, 2014

GST Colloquium: March 25th, 2014

Workshop Design

Basics of NGS

Sample Prep

RNA-Seq Analysis


Experimental Design

Many types of sequencing experiments are available:

Resequencing

Assembly

RNA-Seq

ChIP-Seq

Meta-genomics


Common experimental questions:

Measure variation within or between species

Generate a genome sequence

Transcriptome characterization

Identify protein binding sites

Population genetics

Differential expression studies


Basic Process


Design Considerations

What resources do you have already? (reference genome, curated gene models, etc.)

Do you need biological reps? (Depends on the experiment, but the answer is usually yes.)

Do you need technical reps? (Most likely not.)

Do you need controls? (Depends on the experiment.)

Do you need deep sequencing coverage? (Again, depends on the experiment.)

All of these questions should be answered before you start.


Types of reads

Single: Fast runs. Cheapest overall cost.

Paired: More data for each fragment. More data for alignment/assembly. Same inputs as single-end. Best for isoform detection.

Mate-Paired: Longer pairs than paired-end. Allow sequencing over long repeats. Good for detecting structural variations. Require more input DNA than any other library.


How many reads?

Genomic: Depends on the size of your genome. You want enough reads to cover your genome at depth.

RNA-Seq: Depends on the complexity of the transcriptional profile you’re working on and whether you need to capture rare events. The rule of thumb is that more replicates are more important than more sequences.

Again, this decision depends entirely on the question you are trying to answer and the organism you are working in. In practice, a lane usually has more sequencing capacity than a single sample needs, so the real question is how many samples you can pool into a given lane.
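The depth and pooling reasoning above can be sketched as simple arithmetic: average depth is roughly (number of reads × read length) / genome size. The genome size, read length, target depth, and lane capacity below are illustrative assumptions, not values from this workshop.

```python
import math

def reads_for_coverage(genome_size_bp, read_length_bp, target_depth):
    """Reads needed so that reads * read_length / genome_size >= target_depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)

def samples_per_lane(lane_capacity_reads, reads_per_sample):
    """How many samples fit in one lane if each needs reads_per_sample reads."""
    return lane_capacity_reads // reads_per_sample

# Hypothetical numbers for illustration only.
genome_size = 140_000_000      # assumed genome size in bp
needed = reads_for_coverage(genome_size, read_length_bp=100, target_depth=30)
print("reads needed for 30x:", needed)

# Assumed lane capacity and per-sample read target.
print("samples per lane:", samples_per_lane(180_000_000, 20_000_000))
```

For RNA-Seq the same pooling arithmetic applies, but the per-sample read target depends on transcriptome complexity rather than genome size.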


Read length:

Again completely dependent on experiment and organism.

Longer is usually better.

But sometimes short is good enough.


Selecting a technology:

Based on Read / Library Type

Illumina      Paired, Single, Mate Pair
Ion Torrent   Paired, Single, Mate Pair
SOLiD         Single, Mate Pair
454           Single, Mate Pair
PacBio        Single

Read Length:

Illumina      150-250 bp
Ion Torrent   200-400 bp (100-200 bp for Paired)
454           500-1000 bp
PacBio        1000+ bp
SOLiD         75 bp


Selecting a technology:

Read Number: (Manufacturer’s claims, and machine dependent)

Illumina      0.3-1000 gigabases
Ion Torrent   60-80 million reads
454           1 million reads
PacBio        ?
SOLiD         90-300 gigabases


Sample Prep (RNA-Seq Specific)

Sample Collection and Storage:

RNA-Later: Stabilization buffer. 1 month storage time at room temperature (RT). Good for field collection.

Liquid Nitrogen: Fast, cheap, and effective as long as you have constant access.

RNA extraction: Some sequencing centers only want total RNA so that they can verify sample quality before library prep.


Sample Prep (RNA-Seq Specific)

rRNA Depletion:

Poly-A Enrichment: The poly-A tails of mRNA are used to enrich a sample (most common).

rRNA depletion: rRNA is actively bound and removed (important if a large amount of rRNA is present).

cDNA Library:

Non-stranded: Total RNA is used for cDNA library construction. Strand information is not preserved.

Stranded: Strand information is preserved. Crucial in organisms with overlapping genes.


Library Prep

It is common to have a sequencing center do this step for you, but depending on budget and experience you may want to do it yourself.

Fragment DNA: Sonication or enzyme-based methods, followed by size selection

DNA Repair: Blunting + A overhang

Ligate Adaptors

Attachment Site: PCR addition of the attachment site to one end

Barcode Attachment: PCR addition of the barcode and attachment site to the other end

Clean Up: Remove unligated adapters, etc.


Sequencing

Send your samples off to the sequencing center.

You’ll get raw data back when it’s done.


Quality Control of Raw Data

Need to measure:

Proportion of high quality bases called.

Distribution of called nucleotides.

Number of reads with high overall quality

Distribution of read qualities at each position
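To make these measurements concrete, here is a minimal sketch that computes the mean quality at each read position and the proportion of high-quality bases from FASTQ text. It assumes Phred+33 quality encoding and strictly four-line FASTQ records; the Q30 cutoff and the tiny in-memory records are illustrative choices. FastQC, used later in this workshop, reports these metrics and more.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def qc_summary(text, min_quality=30):
    """Return (mean quality per position, fraction of bases >= min_quality)."""
    sums, counts = [], []
    high = total = 0
    for _name, _seq, qual in parse_fastq(text):
        for pos, q in enumerate(phred_scores(qual)):
            if pos == len(sums):          # first read reaching this position
                sums.append(0)
                counts.append(0)
            sums[pos] += q
            counts[pos] += 1
            total += 1
            high += q >= min_quality
    return [s / c for s, c in zip(sums, counts)], high / total

# Two made-up reads, for illustration only.
example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5#\n"
means, high_fraction = qc_summary(example)
print(means, high_fraction)
```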


Trimming and Filtering Reads

It is common practice to:

remove reads with overall poor quality

trim the ends of reads to remove low quality sequences

remove low quality nucleotides

There are compelling arguments for doing this later, but in general it is safe to do these steps before you align reads.
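The trimming and filtering steps can be sketched as follows. This is a simplified illustration, not the method of any particular trimming tool: it assumes Phred+33 encoding, trims only the 3' end, and uses arbitrary thresholds (Q20, minimum length 3) chosen so the toy reads below behave visibly.

```python
def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]

def filter_reads(reads, min_quality=20, min_length=3):
    """Trim each (name, seq, qual) read and drop reads that end up too short."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept

# Made-up reads: 'I' is Q40 and '#' is Q2 under Phred+33.
reads = [("r1", "ACGTAC", "IIII##"), ("r2", "ACGT", "I###")]
print(filter_reads(reads))
```

For paired-end data, a read dropped from one file must also be dropped from its mate's file so that the pairs stay in sync; this sketch does not handle that.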


What comes next?

Figure source: Eyras, Eduardo; P. Alamancos, Gael; Agirre, Eneritz (2013): Methods to Study Splicing from RNA-Seq. figshare. http://dx.doi.org/10.6084/m9.figshare.679993




The Actual Workshop:

Learning Objectives

RNA-seq data quality control (FastQC)

Align sequence reads to a reference genome using TopHat

Review samtools and file format conversion

View alignments in the IGV

Analyze differential gene expression (in R environment)



Analysis workflow



Tools

bowtie2

tophat2

FastQC

samtools

R and required Bioconductor packages (DESeq)

RStudio

HTSeq 0.6.0

Integrative Genomics Viewer (IGV)

Java

Most of these required tools are already installed in my bin folder:

/lustre/home/qjia2/bin



The data

The data used in this tutorial were acquired from this paper:

Trapnell C, et al: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012, 7(3):562-578.

The data set was generated in silico for Drosophila melanogaster and contains 6 paired-end samples corresponding to 3 biological replicates each of 2 conditions. For more details, see the paper above.

File name                        Description
C1_R1_1.fq.gz, C1_R1_2.fq.gz     Simulated Condition 1, replicate 1








Download the reference genome and gene model annotations

You also need the reference genome and gene model annotations (GTF models), which can be downloaded from Ensembl or Illumina:

wget ftp://ftp.ensembl.org/pub//mnt2/release-75/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa.gz

wget ftp://ftp.ensembl.org/pub//mnt2/release-75/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.75.gtf.gz

gunzip Drosophila_melanogaster.BDGP5.75.*

Indexing your reference genome:

/lustre/home/qjia2/bin/bowtie2-build -f Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa Dme_BDGP5_75

After executing the command, the following BT2 files will be created:

Dme_BDGP5_75.1.bt2
Dme_BDGP5_75.2.bt2
Dme_BDGP5_75.3.bt2
Dme_BDGP5_75.4.bt2
Dme_BDGP5_75.rev.1.bt2
Dme_BDGP5_75.rev.2.bt2

For model species, you can download pre-built Bowtie and Bowtie 2 indexes from the Bowtie website.



Create links to the required data

The required files are stored in the following directory on Newton:

/data/scratch/qjia2/data2012

In your working directory, you can create links to these files so that you don’t need to copy them into your own folders.

To create the links, type the following commands from your working directory:

ln -s /data/scratch/qjia2/data2012/Dme_BDGP5_75.* .
ln -s /data/scratch/qjia2/data2012/genes.gtf .
ln -s /data/scratch/qjia2/data2012/GSM79448* .

Then, type:

ls

You will see those files.



Assess data quality

In this workshop, we’ll use FastQC to check the quality and integrity of the RNA-seq reads.

“FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.”

Create a directory to store output files:

mkdir fastqc_reports

Run FastQC:

/lustre/home/qjia2/bin/fastqc -f fastq -o fastqc_reports *.fq.gz

Inspect the output:

FastQC generates an HTML report for each input file, which you need to view in your web browser.

FastQC report for a good Illumina dataset

FastQC report for a bad Illumina dataset



Align RNA-seq reads to the genome using TopHat2

Create a job definition file called C1R1.sge. The -p 8 option tells tophat2 to use the 8 slots requested with -pe:

#$ -N C1R1
#$ -q medium*
#$ -cwd
#$ -pe threads 8
/home/qjia2/bin/tophat2 -p 8 -G genes.gtf -o C1_R1_thout Dme_BDGP5_75 GSM794483_C1_R1_1.fq.gz GSM794483_C1_R1_2.fq.gz

Submit the job using the qsub command:

qsub C1R1.sge

Use the qstat command to check the status of your jobs:

qstat

Kill your job:

qdel your_job_ID
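Since the same job has to be run for each of the six samples, one way to avoid hand-editing six files is to generate the job definitions from a template. The sketch below is an assumption about how you might do that, not part of the original workshop; the function name make_tophat_job is hypothetical. It passes the slot count to tophat2 through -p, its thread-count option. Only the C1_R1 file names come from the workshop data; the names for the other samples must be filled in from your own directory listing.

```python
def make_tophat_job(sample, reads_1, reads_2, threads=8,
                    tophat="/home/qjia2/bin/tophat2",
                    gtf="genes.gtf", index="Dme_BDGP5_75"):
    """Return the text of an SGE job file that runs tophat2 for one sample."""
    lines = [
        f"#$ -N {sample.replace('_', '')}",   # job name, e.g. C1R1
        "#$ -q medium*",
        "#$ -cwd",
        f"#$ -pe threads {threads}",
        f"{tophat} -p {threads} -G {gtf} -o {sample}_thout {index} "
        f"{reads_1} {reads_2}",
    ]
    return "\n".join(lines) + "\n"

script = make_tophat_job("C1_R1",
                         "GSM794483_C1_R1_1.fq.gz",
                         "GSM794483_C1_R1_2.fq.gz")
print(script)

# To write the file to disk:
# with open("C1R1.sge", "w") as handle:
#     handle.write(script)
```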



TopHat2 output

tophat2 produces a number of files, most of which are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are:

accepted_hits.bam: This file details the alignments for mapped reads.
align_summary.txt
deletions.bed
insertions.bed
junctions.bed: This file contains all the splice sites detected by TopHat during the alignment.
logs/
prep_reads.info
unmapped.bam

The accepted_hits.bam file is used for our further analysis. This file is not “human-readable”, but we can use Samtools to convert it to the .sam format. Next, we’ll talk about Samtools first and then use IGV to look at our alignments.



samtools

“SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.”

Running samtools without arguments prints its usage:

samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19-44428cd

Usage: samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name



File manipulation

To analyse differential expression, we need to count the reads that align to each gene. The htseq-count script needs name-sorted .sam files as input, so run the following commands to sort the alignments by read name and convert them to .sam:

samtools sort -n C1_R1_thout/accepted_hits.bam C1_R1_sn
samtools view -o C1_R1_sn.sam C1_R1_sn.bam

In order to view the alignments in IGV, the .bam files must be sorted by position and indexed:

samtools sort C1_R1_thout/accepted_hits.bam C1_R1_s
samtools index C1_R1_s.bam



View alignments in the IGV

1. Start the IGV software

If you haven’t installed it or have trouble starting it, please click here.

2. Load genome and gene annotation into IGV

Under the Main Menu, click Genomes -> Create .genome File…, and the following window will appear:



View alignments in the IGV - cont.

3. Load mapped reads into IGV

Under the Main Menu, click on File -> Load from File…. Choose C1_R1_s.bam, and wait for IGV to finish loading.

4. Navigate in IGV

For further details, see the IGV user guide.



Count reads in features with htseq-count

HTSeq is a Python package, so it can be used as a library. It also provides a set of stand-alone scripts that we can use from the command line.

The script called htseq-count will be used to count the reads overlapping with known genes. It accepts .sam files and a genome annotation file (GTF format) as inputs.

htseq-count -s no -a 10 C1_R1_sn.sam genes.gtf > C1_R1.count

-s: whether the data is from a strand-specific assay (default: yes)
-a: skip all reads with alignment quality lower than the given minimum value (default: 10)

It outputs a table with counts for each feature.

FBgn0000003    0
FBgn0000008    622
FBgn0000014    91
FBgn0000015    73
FBgn0000017    2700
...            ...

After running this command on the other five samples, merge the htseq-count files into one (mergedCounts.txt).

gene_id        C1R1   C1R2   C1R3   C2R1   C2R2   C2R3
FBgn0000003    0      0      0      0      0      0
FBgn0000008    622    618    555    530    606    547
FBgn0000014    91     81     104    87     125    102
FBgn0000015    73     67     53     55     71     73
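The merge step is not shown in the workshop, so here is one possible sketch in Python. It assumes each htseq-count output has one gene ID and one count per line, and that summary rows such as __no_feature start with a double underscore (if your HTSeq version names them differently, adjust the filter). The function names are hypothetical, and the demo uses in-memory text with a few values taken from the tables above.

```python
import io

def read_counts(handle):
    """Parse one htseq-count output into {gene_id: count}, skipping summary rows."""
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2 or fields[0].startswith("__"):
            continue  # blank line or special counter (assumed "__" prefix)
        counts[fields[0]] = int(fields[1])
    return counts

def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into rows with a gene_id header line."""
    samples = list(per_sample)
    genes = sorted(set().union(*(per_sample[s] for s in samples)))
    rows = [["gene_id"] + samples]
    for gene in genes:
        rows.append([gene] + [str(per_sample[s].get(gene, 0)) for s in samples])
    return rows

# In-memory stand-ins for C1_R1.count and a second sample's count file.
c1r1 = io.StringIO("FBgn0000003\t0\nFBgn0000008\t622\nFBgn0000014\t91\n__no_feature\t5\n")
c1r2 = io.StringIO("FBgn0000003\t0\nFBgn0000008\t618\nFBgn0000014\t81\n__no_feature\t7\n")
merged = merge_counts({"C1R1": read_counts(c1r1), "C1R2": read_counts(c1r2)})
for row in merged:
    print("\t".join(row))
```

With real files, replace the StringIO objects with open("C1_R1.count") and so on, and write the rows to mergedCounts.txt.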


Find differentially expressed genes (DESeq)

The commands used here are also described in the DESeq vignette (PDF).

1. Starting R and loading required modules

R
library("DESeq")

2. Set your working directory

# make sure you are under For_DESeq directory.
setwd("/Users/mac/Documents/rna_seq/files/dataset/For_DESeq")

# You can use getwd() command to check your current working directory.
getwd()

3. Read in your count table.

CountTable = read.table("mergedCounts.txt", header = TRUE, row.names = 1)

Your table should look like this:

head(CountTable)

##             C1R1 C1R2 C1R3 C2R1 C2R2 C2R3
## FBgn0000003    0    0    0    0    0    0
## FBgn0000008  622  618  555  530  606  547
## FBgn0000014   91   81  104   87  125  102
## FBgn0000015   73   67   53   55   71   73
## FBgn0000017 2700 2425 2485 2575 2643 2604
## FBgn0000018  328  343  363  304  288  345



Find differentially expressed genes (DESeq)- cont.

4. Add treatment information to the data.

condition = factor(c("C1", "C1", "C1", "C2", "C2", "C2"))

condition

## [1] C1 C1 C1 C2 C2 C2
## Levels: C1 C2

5. Create a new CountDataSet

cds <- newCountDataSet(CountTable, condition)

6. Estimate the size factors from the count data (Normalization)

cds <- estimateSizeFactors(cds)

To see these size factors, do this:

sizeFactors(cds)

##   C1R1   C1R2   C1R3   C2R1   C2R2   C2R3
## 1.0297 1.0295 1.0302 0.9755 0.9762 0.9777
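To show what a size factor represents, here is a sketch of the median-of-ratios idea in plain Python: for each gene, take the geometric mean across samples, divide each sample's count by it, and use the per-sample median of those ratios. This is written from the published description of the method and is an illustration only; it is not the DESeq source code. Run on just the six genes shown by head(CountTable), it will not reproduce the values above, which were computed from the whole table.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of gene rows, each a list of per-sample counts."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # geometric mean would be zero; skip this gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# First rows of the count table shown earlier (illustration only).
head_counts = [
    [0, 0, 0, 0, 0, 0],
    [622, 618, 555, 530, 606, 547],
    [91, 81, 104, 87, 125, 102],
    [73, 67, 53, 55, 71, 73],
    [2700, 2425, 2485, 2575, 2643, 2604],
    [328, 343, 363, 304, 288, 345],
]
print([round(sf, 4) for sf in size_factors(head_counts)])
```

Dividing each column of counts by its size factor gives the normalized counts.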




Then, we can normalize the counts by the size factors using the following command:

head(counts(cds, normalized = TRUE))

##                C1R1    C1R2    C1R3    C2R1    C2R2    C2R3
## FBgn0000003    0.00    0.00    0.00    0.00    0.00    0.00
## FBgn0000008  604.04  600.27  538.70  543.30  620.78  559.46
## FBgn0000014   88.37   78.68  100.95   89.18  128.05  104.32
## FBgn0000015   70.89   65.08   51.44   56.38   72.73   74.66
## FBgn0000017 2622.02 2355.43 2412.04 2639.61 2707.45 2663.31
## FBgn0000018  318.53  333.16  352.34  311.63  295.02  352.86

7. Calculate dispersion values

cds <- estimateDispersions(cds)

8. Inspect the estimated dispersions

plotDispEsts(cds)




9. Perform the test for differential expression

deg = nbinomTest(cds, "C1", "C2")

10. Plot the log2 fold changes against the mean normalised counts

plotMA(deg)




11. Plot histogram of p values

hist(deg$pval, breaks = 100, col = "skyblue", main = "")

12. Filter for significant genes at a 10% false discovery rate (FDR)

degSig = deg[deg$padj < 0.1, ]

Count the number of significant genes:

addmargins(table(deg$padj < 0.1))

##
## FALSE  TRUE   Sum
## 10012   269 10281
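The padj column holds p values adjusted for multiple testing with the Benjamini-Hochberg procedure. A minimal sketch of that adjustment is shown below to make the 10% FDR filter less of a black box; it is an illustration with made-up p values and does not handle missing (NA) p values, which real DESeq output can contain.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p values for illustration.
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
significant = [p < 0.1 for p in padj]
print(padj, significant)
```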




13. Look at the significantly upregulated and downregulated genes

head(degSig[order(degSig$log2FoldChange, decreasing = TRUE), ])

##                id baseMean baseMeanA baseMeanB foldChange log2FoldChange
## 2388  FBgn0025682    12095      6624     17565      2.652          1.407
## 126   FBgn0000370    15468      8495     22440      2.641          1.401
## 13444 FBgn0086904    17531      9887     25174      2.546          1.348
## 15309 FBgn0263749     5510      3113      7908      2.540          1.345
## 2103  FBgn0022893    15478      8754     22203      2.536          1.343
## 2076  FBgn0022268     5276      2989      7562      2.530          1.339
##            pval      padj
## 2388  1.483e-96 7.626e-93
## 126   3.985e-97 4.097e-93
## 13444 7.761e-91 2.660e-87
## 15309 5.640e-72 6.443e-69
## 2103  1.261e-89 3.241e-86
## 2076  1.574e-81 2.312e-78

head(degSig[order(degSig$log2FoldChange, decreasing = FALSE), ])

##                id baseMean baseMeanA baseMeanB foldChange log2FoldChange
## 11475 FBgn0051953    12.78     19.42     6.146     0.3165        -1.6597
## 2685  FBgn0027513   160.39    189.03   131.764     0.6971        -0.5206
## 5844  FBgn0033781   245.78    281.27   210.280     0.7476        -0.4197
## 5682  FBgn0033539   546.01    624.37   467.646     0.7490        -0.4170
## 8947  FBgn0038348   258.89    295.20   222.584     0.7540        -0.4073
## 13333 FBgn0086251   529.43    600.73   458.118     0.7626        -0.3910
##            pval     padj
## 11475 2.557e-03 0.098108
## 2685  8.797e-04 0.035054
## 5844  1.432e-03 0.055784
## 5682  3.157e-05 0.001319
## 8947  1.942e-03 0.074794
## 13333 1.093e-04 0.004514
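The foldChange and log2FoldChange columns follow directly from the two condition means: foldChange is baseMeanB divided by baseMeanA, and log2FoldChange is its base-2 logarithm. The sketch below checks this against the first upregulated gene in the table (FBgn0025682); because the printed means are rounded, the results match the table only to the displayed precision.

```python
import math

def fold_change(base_mean_a, base_mean_b):
    """Ratio of the condition B mean to the condition A mean."""
    return base_mean_b / base_mean_a

def log2_fold_change(base_mean_a, base_mean_b):
    """Base-2 logarithm of the fold change."""
    return math.log2(fold_change(base_mean_a, base_mean_b))

# baseMeanA and baseMeanB for FBgn0025682, as printed in the table.
fc = fold_change(6624, 17565)
lfc = log2_fold_change(6624, 17565)
print(round(fc, 3), round(lfc, 3))
```

A positive log2 fold change means higher expression in C2 than in C1, and a negative value means lower.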




14. Save our output to a file

write.csv(deg, file = "Result_table.csv")

write.csv(degSig, file = "Result_table_0.1FDR.csv")

You can use a spreadsheet program such as Excel to open .csv files.



References

1. S. Anders, D. J. McCarthy, Y. S. Chen, M. Okoniewski, G. K. Smyth, W. Huber, M. D. Robinson, Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols 8, 1765-1786 (2013). doi:10.1038/nprot.2013.099

2. C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, L. Pachter, Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562-578 (2012). doi:10.1038/nprot.2012.016

3. DESeq vignette: http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

