Introduction to RNA-seq & Differential Gene Expression

By: Amit Kumar Singh

Next Generation Sequencing data

Next Generation Sequencing: Possibilities

Massive growth in amount of data generation since 2006

• Traditional Sanger vs. Next Generation Sequencing methods:
– Reduced cost per base
– Reduced sequencing time
– Covers a wide range of applications

Comparative costs: sequencing a human genome

Sequencing is cheaper and faster; analysis of the data is complex and time-consuming.

[Figure: diagram with the labels Genome, Transcriptome, mRNA, others]

What is a Transcriptome?

The transcriptome is the complete set of all RNA molecules in a cell, including mRNA, rRNA, tRNA and other non-coding RNA. In a narrower sense it is the array of mRNA transcripts produced in a particular cell or tissue type. Transcriptomics, also referred to as expression profiling, examines the expression levels of mRNAs in a given cell population.

Genome vs. Transcriptome
• Genome: content is fixed
• Transcriptome: content is time- and cell-specific, and is much more complex than the genome

NGS Advantages
• Next-generation sequencing (NGS) of cDNA (RNA-Seq) is becoming more widely adopted for transcriptome profiling.
• Dropping prices and maturing technology are making NGS the technology of choice.
• RNA-Seq does not depend on genome annotation:
– Transcript reconstruction: non-model organisms
– Transcript verification: model organisms
• RNA-Seq is the method of choice in projects using non-model organisms and for novel transcript discovery and genome annotation.
• Accurate determination of expression levels

Cons
• Current wet-lab RNA-Seq strategies require lengthy library preparation procedures.

Different types of RNA

Transcripts and alternative splicing

An RNA transcript is copied from one strand of the DNA (known as the template strand). The initial transcript (pre-mRNA) undergoes splicing in the nucleus, where the introns are removed. With alternative splicing, different combinations of exons are joined, so a single gene can give rise to several mature mRNAs. The mature mRNA is the strand that carries the code out of the nucleus and into the cytoplasm.

Transcripts sharing same TSS or CDS

What is RNA-seq?

• A sequencing-based method to study the transcriptome
• Uses Next-Generation Sequencing (NGS) technology to measure RNA levels
• Generates and sequences ‘reads’ from cDNA
• Maps the reads to a reference genome
• Quantifies the assembled reads

Experiment design: Replicates

Technical replicates: repeated measurements of a quantity from one source.

E.g.: 5 samples from a single patient suffering from lung cancer

Biological replicates: measurements of a quantity from different sources under the same conditions.

E.g.: 5 samples, one from each of 5 different patients suffering from lung cancer

Use of replicates
– Minimizes the effect of experimental variation or artifacts
– Improves results by averaging out noise
– The more data, the more robust the statistical test and the more reliable the reported significance

Analysis tools
• Read alignment
• Transcript assembly or genome annotation
• Transcript and gene quantification

RNA-Seq analysis pipeline for detecting differential expression

An overview of RNAseq for Differential gene expression

Tuxedo Pipeline for RNAseq analysis

Mapping

Objective: to find the unique location where a short read is identical to the reference.

Reality:
• The reference is never a perfect representation of the actual biological source of the RNA being sequenced.
• Samples carry specific attributes such as SNPs and indels.
• Short reads can align perfectly to multiple locations and can contain sequencing errors.

Real task: to find the location where each short read best matches the reference, allowing for errors and structural variation.

Problem in mapping of reads spanning splice junctions

These reads have to be aligned on the reference genome, where the exons they cover are separated by introns.

Splice junction aligners break junction-spanning reads into pieces and index the junction information.

Mapping: Challenges

Multimaps: reads that map equally well to several locations

Multimap treatment: discard multimaps

Paired-end reads reduce the problem of multi-mapping
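The idea of "best match allowing for errors" and of a multimap can be illustrated with a toy example. This is only a brute-force sketch for intuition, not how Bowtie or TopHat work internally; the function names, the reference string and the reads are all made up for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hits(read, reference, max_mismatches=1):
    """Return (fewest_mismatches, [0-based start positions]) for a read.

    Every window of the reference is compared to the read. Only windows
    with at most `max_mismatches` mismatches count as hits, and only the
    best-scoring hits are kept. More than one position means a multimap.
    Returns (None, []) when the read does not map at all.
    """
    best = None
    positions = []
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d > max_mismatches:
            continue
        if best is None or d < best:
            best, positions = d, [pos]
        elif d == best:
            positions.append(pos)
    return best, positions


# Hypothetical reference containing the same 8-mer twice.
reference = "TTACGGATCCAAACGGATCCTTGCA"

print(best_hits("ACGGATCC", reference))  # maps equally well to two places
print(best_hits("CCTAGCA", reference))   # unique hit with one mismatch
print(best_hits("GGGGGGGG", reference))  # no acceptable hit
```

Under the "discard multimaps" treatment, the first read would be dropped because it has two equally good positions, while the second would be kept.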

TopHat algorithm

• Splice junction mapper
• Initial mapping onto the genome (exons) by Bowtie, an ultrafast short read aligner
• Builds a database of possible splice junctions
• Maps the unmapped reads against the junctions
• Also splits the unmapped reads into smaller fragments to map them onto exons

Input to know: GTF file
• GTF: Gene Transfer Format
• A reference GTF file is a collection of every transcript (genes and their isoforms + non-coding RNA transcripts)
• Available from genome databases: ENSEMBL, UCSC, RefSeq

Sample reference GTF file format

[Figure: sample GTF records with labelled columns: chr, source, start, end, strand, attributes of transcripts]
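To make the column layout concrete, here is a minimal sketch that splits one GTF record into its nine tab-separated fields. The example record, its source name and its gene and transcript IDs are invented for illustration, and the attribute parsing is simplified (it assumes no semicolons inside quoted values).

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dict of its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # Attributes look like: gene_id "X"; transcript_id "Y";
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),       # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Hypothetical record, not taken from a real annotation.
example = ('chr19\texample_source\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "GENE1.1";')
record = parse_gtf_line(example)
print(record["seqname"], record["feature"], record["start"], record["end"])
print(record["attributes"])
```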

Mapping with TopHat: how to use

TopHat is a splice junction aligner. At the back end it uses Bowtie to map short reads onto the genome.

Bowtie uses an extremely economical data structure called the FM index to store the reference genome sequence, which allows it to be searched rapidly.

Indexing of Reference Genome:

E.g.: the reference genome is chr19.fa. Indexing of the reference genome is done with the Bowtie 2 utility bowtie2-build.

bowtie2-build <Ref genome fasta> <prefix>

[user]$ bowtie2-build chr19.fa chr19

Tophat commands

(i) Mapping without using reference annotation

[user]$ tophat chr19 reads1.fastq reads2.fastq

(ii) Mapping using reference annotation

TopHat uses the reference annotation (GTF) for known splice junction locations, for better mapping.

[user]$ tophat -G chr19.gtf chr19 reads1.fastq reads2.fastq

(iii) Mapping only to the reference annotation

[user]$ tophat -G chr19.gtf --no-novel-juncs chr19 reads1.fastq reads2.fastq

Note: The Gene Transfer Format (GTF) is a file format used to hold information about gene structure.

New feature: Mapping on the transcriptome

With this newer TopHat feature you can map your reads directly onto the transcriptome. When TopHat is given a known transcript file (the -G/--GTF option above):
• A transcriptome sequence file is built
• Bowtie then creates the index for these transcriptome sequences
• Reads are then aligned to these known transcripts

First time:

[user]$ tophat -o output_sample1 -G chr19.gtf --transcriptome-index=transcriptome/known chr19 sample1_1.fastq sample1_2.fastq

Once the transcriptome index has been built, there is no need to specify the -G option when running TopHat on other samples.

Next time mapping on the transcriptome:

[user]$ tophat -o output_sample2 --transcriptome-index=transcriptome/known chr19 sample2_1.fastq sample2_2.fastq

Output of Tophat

1. accepted_hits.bam. A list of read alignments in BAM format.

2. junctions.bed. A UCSC BED track of junctions reported by TopHat. The score is the number of alignments spanning the junction.

Alignments are reported in BAM files

BAM is the compressed, binary version of SAM, a flexible and general purpose read alignment format.

Many downstream analysis tools accept SAM and BAM as input.

There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps the most popular among these is SAMtools.

SAM Format

[Figure: a SAM alignment record with its fields labelled in order: read name, flag, chromosome, position, mapping quality, CIGAR, name of mate ("=" means paired-end), insert length, sequence, quality values]

The CIGAR string describes the positions of insertions, deletions and matches in the alignment and, for example, encodes splice junctions.

For more information: http://samtools.sourceforge.net/samtools.shtml
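How a CIGAR string encodes a splice junction can be shown with a short sketch. In SAM, the N operation marks a skipped region of the reference, which spliced aligners use for introns. The function name and the example positions below are illustrative assumptions, and the sketch does not validate malformed CIGAR strings.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def splice_junctions(pos, cigar):
    """Return the introns implied by a spliced alignment.

    `pos` is the 1-based leftmost mapping position from the SAM record.
    Each N operation yields one (intron_start, intron_end) pair,
    1-based and inclusive on the reference.
    """
    junctions = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return junctions


# 50 bases matched, a 200-base intron skipped, 50 more bases matched.
print(splice_junctions(1000, "50M200N50M"))  # [(1050, 1249)]

# Soft-clipped and inserted bases do not move along the reference.
print(splice_junctions(1000, "10S40M100N30M2I20M"))  # [(1040, 1139)]

# An unspliced read has no junctions.
print(splice_junctions(1000, "100M"))  # []
```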

Bed format

[Figure: junctions.bed records with labelled columns: chr, start and end, junction ID, optional fields]

Junctions View on IGV

Analysis with samtools

(i) View the BAM file
[user]$ samtools view accepted_hits.bam

(ii) Convert the BAM file into a non-binary SAM file
[user]$ samtools view accepted_hits.bam > accepted_hits.sam

(iii) Count the number of lines in the SAM file
[user]$ wc -l accepted_hits.sam

(iv) Sort the BAM file
[user]$ samtools sort accepted_hits.bam outprefix
(This is the older samtools syntax; samtools 1.3 and later expects: samtools sort -o outprefix.bam accepted_hits.bam)

(v) Index the BAM file
[user]$ samtools index accepted_hits.bam

(vi) Report statistics of the BAM file
[user]$ samtools flagstat accepted_hits.bam

Cufflinks

Cufflinks is used to generate a transcriptome assembly for each sample. It assembles individual transcripts from RNA-seq reads that have been aligned to the genome.

Normalization
• More reads are mapped to a transcript if it is
– long
– at a higher depth of coverage (high expression)
• Normalize so that features of different lengths and from different conditions can be compared
• Need for normalization: to reduce bias within a sample or between different sample conditions
• FPKM is one such normalization strategy, adopted by Cufflinks.

FPKM

• Cufflinks estimates abundance values in FPKM (fragments per kilobase of transcript per million mapped fragments).
• FPKM is a measure of how many reads have been recorded for each transcript, normalized by transcript length and by the total number of reads.
• Through FPKM values, Cufflinks ensures that expression levels for different genes and transcripts can be compared across runs.

$$\mathrm{FPKM} = \frac{10^{9} \times C}{N \times L}$$

C = the number of reads mapped onto the gene's exons
N = the total number of reads in the experiment
L = the sum of the exon lengths in base pairs
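The FPKM formula above can be worked through with a small sketch. It implements only the simple formula from this slide, not Cufflinks' full statistical abundance estimation, and the read counts and gene lengths are made-up numbers.

```python
def fpkm(c, n, l):
    """FPKM = 10^9 * C / (N * L).

    c: number of reads (fragments) mapped onto the gene's exons
    n: total number of mapped reads (fragments) in the experiment
    l: summed exon length of the gene in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Hypothetical experiment with 20 million mapped fragments.
total_fragments = 20_000_000

# Two genes with the same raw count but different lengths.
short_gene = fpkm(500, total_fragments, 2_000)
long_gene = fpkm(500, total_fragments, 8_000)

print(short_gene)  # 12.5
print(long_gene)   # 3.125
```

The same raw count gives a four times lower FPKM for the gene that is four times longer, which is the length bias that normalization is meant to remove.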

Visualizing data on IGV

[Figure: IGV view of reads from a BAM file mapped onto a gene, with intronic regions, exonic regions and a non-coding exonic region labelled]

Cuffmerge

The assemblies generated by Cufflinks are then merged together using the Cuffmerge utility (part of the Cufflinks package).

This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition.

Command:

[user]$ cuffmerge -s genome/chr19.fa -g chr19.gtf assembly_GTF.list

where assembly_GTF.list is a text file containing the paths of all the Cufflinks assemblies you want to merge.

Cuffdiff: protocol to estimate differential gene expression

Calculates expression levels and tests the statistical significance of observed changes.

• Fisher’s test
• Estimates the log2 fold change: log2(FPKM_B / FPKM_A)
• Cuffdiff reports numerous output files containing the results of its differential analysis of the samples.
• These files contain statistical values such as fold change and P values, gene and transcript features such as common name and location in the genome, and the FPKM values for each feature.

Command:

[user]$ cuffdiff merged_asm/merged.gtf sample1/tophat_out/accepted_hits.bam sample2/tophat_out/accepted_hits.bam
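The log2 fold change that Cuffdiff reports can be illustrated with a minimal sketch. The function name and FPKM values are hypothetical, and the handling of zero FPKM values is an assumption made for this sketch rather than a description of Cuffdiff's behaviour.

```python
import math


def log2_fold_change(fpkm_a, fpkm_b):
    """log2(FPKM_B / FPKM_A) for one gene across conditions A and B.

    Assumption for this sketch: a gene absent in A gives +inf, a gene
    absent in B gives -inf, and a gene absent in both gives NaN.
    """
    if fpkm_a < 0 or fpkm_b < 0:
        raise ValueError("FPKM values cannot be negative")
    if fpkm_a == 0 and fpkm_b == 0:
        return math.nan
    if fpkm_a == 0:
        return math.inf
    if fpkm_b == 0:
        return -math.inf
    return math.log2(fpkm_b / fpkm_a)


# Hypothetical FPKM values for condition A (e.g. healthy) and B (e.g. diseased).
print(log2_fold_change(10.0, 40.0))  # 2.0, four-fold up in B
print(log2_fold_change(20.0, 5.0))   # -2.0, four-fold down in B
print(log2_fold_change(7.5, 7.5))    # 0.0, unchanged
```

A positive value means higher expression in condition B, a negative value means lower expression, and each unit corresponds to a doubling or halving.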

[Figure: gene_exp.diff output table, with the FPKM values and significant genes highlighted]

[Figure: Tuxedo workflow for two conditions. Inputs: healthy tissue and diseased tissue samples, ref genome A (.fa file) and Ref A.gtf (ENSEMBL). Steps: tophat produces Healthy_hits.bam and Diseased_hits.bam; cufflinks produces Transcripts_healthy.gtf and Transcripts_diseased.gtf; cuffmerge produces Healthy_disease.merged.gtf; cuffdiff produces Gene.diff]

Tools used in RNA-seq exercise

[Figure: workflow diagram with boxes labelled Samples, Tophat, Cufflinks, Cuffdiff, HTSeq, DESeq (R package), Expression Analysis, Fold change]

Gene set enrichment analysis by DAVID

Identification of GO terms that are significantly overrepresented in a given gene list.

A hypergeometric statistical test is performed to identify such terms.

Simple example:
• Your statistically significant gene list = 694 genes (each gene associated with GO terms)
• Total genes in the organism = 10,738
• Total genes in the organism with the cell division GO term (biological process) = 634

The hypergeometric test reports (with statistical values for confidence) that 107 of the 694 genes carry the cell division GO term (biological process), which is overrepresented.

You can conclude that cell division is altered between the normal and treatment samples.
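The numbers in this example can be checked with a plain hypergeometric upper-tail calculation. This is a sketch of the test as described on the slide, using only the Python standard library; the function name is made up, and DAVID's own scoring and multiple-testing corrections may differ from it.

```python
from math import comb


def hypergeom_upper_tail(k, total, with_term, drawn):
    """P(X >= k) for a hypergeometric draw without replacement.

    total:     all genes in the organism
    with_term: genes in the organism annotated with the GO term
    drawn:     size of the significant gene list
    k:         genes in the list that carry the GO term
    """
    denominator = comb(total, drawn)
    upper = min(with_term, drawn)
    numerator = sum(
        comb(with_term, i) * comb(total - with_term, drawn - i)
        for i in range(k, upper + 1)
    )
    return numerator / denominator


total_genes = 10_738     # genes in the organism
cell_division = 634      # genes with the cell division GO term
gene_list = 694          # statistically significant genes
observed = 107           # list genes with the cell division GO term

# Number of cell division genes expected in the list by chance alone.
expected = gene_list * cell_division / total_genes
p_value = hypergeom_upper_tail(observed, total_genes, cell_division, gene_list)

print(f"expected by chance: {expected:.1f}")
print(f"observed: {observed}")
print(f"p-value: {p_value:.3g}")
```

About 41 cell division genes would be expected in a random list of 694 genes, so observing 107 is far more than chance would give, which is what "overrepresented" means here.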

Submit your genelist

Annotation summary results

Questions ??

THANK YOU
