Introduction to RNAseq & Differential Gene Expression

Introduction to RNA-seq & Differential Gene Expression By : Amit Kumar Singh


Page 1: Introduction to RNAseq & Differential Gene Expression

Introduction to RNA-seq & Differential Gene Expression

By: Amit Kumar Singh

Page 2: Introduction to RNAseq & Differential Gene Expression

Next Generation Sequencing data

Next Generation Sequencing: Possibilities

Massive growth in the amount of data generated since 2006

• Traditional Sanger vs. Next Generation Sequencing methods:
– Reduced cost per base
– Reduced sequencing time
– Covers a wide range of applications

Comparative costs: sequencing a human genome

Sequencing is cheaper and faster, but analysis of the data is complex and time-consuming.

Page 3: Introduction to RNAseq & Differential Gene Expression

[Figure: the genome, the transcriptome, and its components (mRNA and others)]

What is a Transcriptome?

The complete set of all RNA molecules in a cell. It includes mRNA, rRNA, tRNA and other non-coding RNA.

In a narrower sense, the array of mRNA transcripts produced in a particular cell or tissue type.

The study of transcriptomics, also referred to as expression profiling, examines the expression level of mRNAs in a given cell population.

GENOME vs. TRANSCRIPTOME

Genome: content is fixed

Transcriptome: content is time- and cell-specific and is much more complex than the genome

Page 4: Introduction to RNAseq & Differential Gene Expression

NGS Advantages

Next-generation sequencing (NGS) of cDNA (RNA-Seq) is becoming more widely adopted for transcriptome profiling.

* Dropping prices and maturing technology are making NGS the technology of choice

RNA-Seq does not depend on genome annotation

Transcript reconstruction – non-model organisms. Transcript verification – model organisms

RNA-Seq is the method of choice in projects using non-model organisms and for novel transcript discovery and genome annotation.

Accurate expression level determination

Cons

Current wet-lab RNA-Seq strategies require lengthy library preparation procedures

Page 5: Introduction to RNAseq & Differential Gene Expression

Different types of RNA

Page 6: Introduction to RNAseq & Differential Gene Expression

Transcripts and alternative splicing

An RNA transcript is the copy of the code made from one strand of DNA (known as the template strand).

The initial transcript (pre-mRNA) is spliced in the nucleus: introns are removed, and alternative splicing can join the exons in different combinations.

The resulting mature mRNA is the strand that carries the code out of the nucleus and into the cytoplasm.

Page 7: Introduction to RNAseq & Differential Gene Expression

Transcripts sharing same TSS or CDS

Page 8: Introduction to RNAseq & Differential Gene Expression

What is RNAseq?

• Sequencing-based method to study the transcriptome
• Use of Next-Generation Sequencing (NGS) technology to measure RNA levels
• Generating and sequencing ‘reads’ from cDNA
• Mapping reads to a reference genome
• Quantification of assembled reads

Page 9: Introduction to RNAseq & Differential Gene Expression

Experiment design: Replicates

Technical replicates: repeated measurements of a quantity from one source.

E.g.: 5 samples from a single patient suffering from lung cancer

Biological replicates: measurements of a quantity from different sources under the same conditions.

E.g.: 5 samples, one from each of 5 different patients suffering from lung cancer

Use of replicates
– Minimize experimental variation or artifacts
– Improve results by averaging out
– The more data, the more robust the statistical test and the greater the power to detect statistically significant differences

Page 10: Introduction to RNAseq & Differential Gene Expression

Analysis tools

• Read alignment
• Transcript assembly or genome annotation
• Transcript and gene quantification

Page 11: Introduction to RNAseq & Differential Gene Expression

RNA-Seq analysis pipeline for detecting differential expression

Page 12: Introduction to RNAseq & Differential Gene Expression

An overview of RNAseq for Differential gene expression

Page 13: Introduction to RNAseq & Differential Gene Expression

Tuxedo Pipeline for RNAseq analysis

Page 14: Introduction to RNAseq & Differential Gene Expression

Mapping

Objective: To find the unique location where a short read is identical to the reference

Reality: The reference is never a perfect representation of the actual biological source of the RNA being sequenced
– Sample-specific attributes such as SNPs and indels
– Short reads can align perfectly to multiple locations and can contain sequencing errors

The real task is to find the location where each short read best matches the reference, allowing for errors and structural variation

Page 15: Introduction to RNAseq & Differential Gene Expression

Problem in mapping of reads spanning splice junctions

These reads are aligned on the reference genome.

Page 16: Introduction to RNAseq & Differential Gene Expression

Page 17: Introduction to RNAseq & Differential Gene Expression

Splice junction aligners split junction-spanning reads and index the information

Page 18: Introduction to RNAseq & Differential Gene Expression

Mapping: Challenges

Multimaps: reads that map equally well to several locations

Multimap treatment: discard multimaps

Paired-end reads reduce the problem of multi-mapping

Page 19: Introduction to RNAseq & Differential Gene Expression

TopHat algorithm

• Splice junction mapper
• Initial mapping onto the genome (exons) by Bowtie, an ultrafast short read aligner
• Builds a database of possible splice junctions
• Maps unmapped reads against the junctions
• Also splits the unmapped reads into smaller fragments to map onto exons
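The idea of splitting an unmapped read and placing the pieces on separate exons can be illustrated with a toy sketch. This is not TopHat's actual implementation: the function name, the exact-match search, and the example sequences are assumptions made purely for illustration.

```python
def find_spliced_alignment(reference, read, min_anchor=4):
    """Toy splice-aware search using exact matching only (illustrative).

    Returns a dict describing either a contiguous hit or a spliced hit,
    or None if the read cannot be placed. Coordinates are 0-based and
    the intron interval is half-open: [intron_start, intron_end).
    """
    # Step 1: try to place the whole read contiguously.
    pos = reference.find(read)
    if pos != -1:
        return {"type": "contiguous", "start": pos}

    # Step 2: the read did not map, so split it at every allowed point
    # and look for the two pieces with a gap (putative intron) between.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = reference.find(left)
        if left_start == -1:
            continue
        # The right piece must start at least one base after the left piece ends.
        right_start = reference.find(right, left_start + split + 1)
        if right_start != -1:
            return {
                "type": "spliced",
                "left_start": left_start,
                "intron_start": left_start + split,
                "intron_end": right_start,
            }
    return None


# Hypothetical toy gene: exon 1 + intron + exon 2.
exon1, intron, exon2 = "ATGGCCATTC", "GTAAGTCCCCCCTTTCAG", "TAATGGGCCG"
reference = exon1 + intron + exon2
junction_read = exon1[-6:] + exon2[:6]  # read spanning the splice junction

print(find_spliced_alignment(reference, junction_read))
```

A real aligner tolerates mismatches, considers every candidate location rather than the first hit, and uses an index instead of a linear scan.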

Page 20: Introduction to RNAseq & Differential Gene Expression

Input to know: GTF file

• GTF: Gene Transfer Format
• The reference GTF file is a collection of every transcript (genes and their isoforms + non-coding RNA transcripts)
• Available from genome databases: ENSEMBL, UCSC, RefSeq

Sample reference GTF file format

[Figure: sample GTF records with columns labeled chr, source, start, end, strand, and attributes of transcripts]
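As a rough illustration of the file layout, the sketch below parses one GTF record into its nine tab-separated columns. The example line, gene name and transcript name are hypothetical, and the attribute parsing assumes the common `key "value";` form.

```python
import re

# The nine tab-separated GTF columns, in order.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

# Matches attribute pairs written as: key "value";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse a single GTF record into a dict (illustrative helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF record must have 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    record["attributes"] = dict(ATTRIBUTE_RE.findall(record["attributes"]))
    return record


# Hypothetical exon record, for illustration only.
example = ('chr19\tensembl\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "GENE1.1";')
record = parse_gtf_line(example)
print(record["seqname"], record["start"], record["end"], record["attributes"])
```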

Page 21: Introduction to RNAseq & Differential Gene Expression

Mapping with TopHat: how to use

TopHat is a splice junction aligner. At the back end it uses Bowtie to map short reads onto the genome.

Bowtie uses an extremely economical data structure called the FM index to store the reference genome sequence, which allows it to be searched rapidly.

Indexing of the reference genome:

E.g.: the reference genome is chr19.fa. Indexing of the reference genome is done with the bowtie2 utility bowtie2-build.

bowtie2-build <Ref genome fasta> <prefix>

[user]$ bowtie2-build chr19.fa chr19
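To give a feel for how an FM index supports rapid searching, here is a minimal, uncompressed sketch of the textbook construction and backward search. It is not Bowtie's code: Bowtie's real index is heavily compressed and tolerates mismatches, and all function names here are assumptions.

```python
def build_fm_index(text):
    """Build a naive FM index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)

    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first, occ


def locate(pattern, index):
    """Return sorted 0-based start positions of exact matches of pattern."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("ACGTACGT")
print(locate("ACG", index))
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the genome length.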

Page 22: Introduction to RNAseq & Differential Gene Expression

TopHat commands

(i) Mapping without using a reference annotation

[user]$ tophat chr19 reads1.fastq reads2.fastq

(ii) Mapping using a reference annotation

TopHat uses the reference annotation (GTF) for known splice junction locations, for better mapping.

[user]$ tophat -G chr19.gtf chr19 reads1.fastq reads2.fastq

(iii) Mapping only to the reference annotation

[user]$ tophat -G chr19.gtf --no-novel-juncs chr19 reads1.fastq reads2.fastq

Note: The Gene Transfer Format (GTF) is a file format used to hold information about gene structure.

Page 23: Introduction to RNAseq & Differential Gene Expression

New feature: Mapping on the transcriptome

With this new feature of TopHat you can map your reads directly onto the transcriptome.

When TopHat is given a known transcript file (the -G/--GTF option above), a transcriptome sequence file is built.

Bowtie then creates the index for these transcriptome sequences.

Reads are then aligned to these known transcripts (first time):

[user]$ tophat -o output_sample1 -G chr19.gtf --transcriptome-index=transcriptome/known chr19 sample1_1.fastq sample1_2.fastq

Once the transcriptome index has been built, there is no need to specify the -G option the next time you run TopHat for other samples (subsequent mapping on the transcriptome):

[user]$ tophat -o output_sample2 --transcriptome-index=transcriptome/known chr19 sample2_1.fastq sample2_2.fastq

Page 24: Introduction to RNAseq & Differential Gene Expression

Output of Tophat

1. accepted_hits.bam. A list of read alignments in BAM format.

2. junctions.bed. A UCSC BED track of junctions reported by TopHat. The score is the number of alignments spanning the junction.
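Since the BED score is the number of alignments spanning each junction, it can be used to keep only well-supported junctions. The sketch below reads the first six standard BED fields; the junction names, coordinates, and the threshold of 5 reads are hypothetical.

```python
def parse_junction(line):
    """Parse the first six BED fields of one junction record."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return {
        "chrom": chrom,
        "start": int(start),   # BED starts are 0-based
        "end": int(end),       # BED ends are exclusive
        "name": name,
        "score": int(score),   # number of alignments spanning the junction
        "strand": strand,
    }


def well_supported(lines, min_reads=5):
    """Keep junctions whose score reaches min_reads (threshold is an assumption)."""
    records = [
        parse_junction(line)
        for line in lines
        if line.strip() and not line.startswith("track")  # skip header/blank lines
    ]
    return [rec for rec in records if rec["score"] >= min_reads]


# Hypothetical junction records, for illustration only.
bed_lines = [
    "track name=junctions",
    "chr19\t1000\t2000\tJUNC00000001\t12\t+",
    "chr19\t5000\t7500\tJUNC00000002\t2\t-",
]
for junction in well_supported(bed_lines):
    print(junction["name"], junction["score"])
```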

Page 25: Introduction to RNAseq & Differential Gene Expression

Alignments are reported in BAM files

BAM is the compressed, binary version of SAM, a flexible and general-purpose read alignment format.

Many downstream analysis tools accept SAM and BAM as input.

There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps the most popular among these is SAMtools.

Page 26: Introduction to RNAseq & Differential Gene Expression

SAM Format

[Figure: a sample SAM record with its fields labeled: read name, flag, chromosome, position, mapping quality, CIGAR, name of mate ("=" for a paired-end mate on the same chromosome), insert length, sequence, quality values]

CIGAR string: describes the position of insertions, deletions and matches in the alignment; it also encodes splice junctions, for example.

For more information: http://samtools.sourceforge.net/samtools.shtml
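To show how a CIGAR string encodes splice junctions, the sketch below walks the operations and reports every `N` (skipped region) as an intron. It assumes the 1-based leftmost position of the SAM POS field; the function name and the example alignments are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def introns_from_cigar(pos, cigar):
    """Return (start, end) of each skipped region, 1-based inclusive."""
    introns = []
    ref_pos = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":  # skipped reference region, i.e. an intron in RNA-seq
            introns.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return introns


# A read starting at position 1000: 30 matched bases, a 500 bp intron, 20 more bases.
print(introns_from_cigar(1000, "30M500N20M"))
```

Soft clips (`S`) and insertions (`I`) consume read bases but not reference bases, so they do not shift the reported intron coordinates.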

Page 27: Introduction to RNAseq & Differential Gene Expression

BED format

[Figure: sample BED records with columns labeled chr, start & end, junction ID, and optional fields]

Junctions viewed in IGV

Page 28: Introduction to RNAseq & Differential Gene Expression

Analysis with samtools

(i) View the BAM file
[user]$ samtools view accepted_hits.bam

(ii) Convert the BAM file into a non-binary SAM file
[user]$ samtools view accepted_hits.bam > accepted_hits.sam

(iii) Count the number of lines in the SAM file
[user]$ wc -l accepted_hits.sam

(iv) Sort the BAM file (older samtools syntax; samtools 1.3 and later use: samtools sort -o outprefix.bam accepted_hits.bam)
[user]$ samtools sort accepted_hits.bam outprefix

(v) Index the BAM file
[user]$ samtools index accepted_hits.bam

(vi) Get the statistics of the BAM file
[user]$ samtools flagstat accepted_hits.bam
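The statistics reported by flagstat are derived from the FLAG field of each alignment. As a rough illustration, the sketch below decodes a flag into the bit meanings defined by the SAM specification. The short labels and the example flag values are chosen here for illustration, and the counting helper is far simpler than flagstat itself.

```python
# Bit values from the SAM specification; the labels are shortened here.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def count_mapped(flags):
    """Count alignments whose 'unmapped' bit (0x4) is not set."""
    return sum(1 for flag in flags if not flag & 0x4)


print(decode_flag(99))
print(count_mapped([99, 147, 4]))
```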

Page 29: Introduction to RNAseq & Differential Gene Expression

Cufflinks

Cufflinks is used to generate a transcriptome assembly for each sample. It assembles individual transcripts from RNA-seq reads that have been aligned to the genome.

Page 30: Introduction to RNAseq & Differential Gene Expression

Normalization

• More reads are mapped to a transcript if it is
– long
– at a higher depth of coverage (high expression)
• Normalize so that features of different lengths and from different conditions can be compared
• Need for normalization: to reduce bias within a sample or between different sample conditions
• FPKM is one such normalization strategy, adopted by Cufflinks.

Page 31: Introduction to RNAseq & Differential Gene Expression

FPKM

$$\mathrm{FPKM} = \frac{10^{9} \times C}{N \times L}$$

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs

• Cufflinks estimates abundance values in FPKM (fragments per kilobase of transcript per million mapped fragments)
• Cufflinks ensures that expression levels for different genes and transcripts can be compared across runs by using FPKM values.
• FPKM is a measure of how many reads have been recorded for each transcript, normalized by transcript length and the total number of reads.
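A small worked example of the formula, using the same symbols C, N and L. The function name and the input numbers are hypothetical; Cufflinks itself estimates abundances with a statistical model rather than this direct calculation.

```python
def fpkm(c, n, l):
    """FPKM = 10^9 * C / (N * L).

    c: fragments mapped onto the gene's exons
    n: total mapped fragments in the experiment
    l: summed exon length in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Hypothetical gene: 500 fragments, 20 million fragments in total, 2,500 bp of exons.
print(fpkm(500, 20_000_000, 2_500))

# Doubling the exon length for the same fragment count halves the FPKM.
print(fpkm(500, 20_000_000, 5_000))
```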

Page 32: Introduction to RNAseq & Differential Gene Expression

Visualizing data on IGV

[Figure: IGV view with intronic regions, exonic regions, and a non-coding exonic region marked]

Page 33: Introduction to RNAseq & Differential Gene Expression

Read mapping on a gene in the BAM file

Page 34: Introduction to RNAseq & Differential Gene Expression

Cuffmerge

The assemblies generated by Cufflinks are then merged together using the Cuffmerge utility (a function of the Cufflinks package).

This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition.

Command:
[user]$ cuffmerge -s genome/chr19.fa -g chr19.gtf assembly_GTF.list

where assembly_GTF.list is a text file containing the paths of all the Cufflinks assemblies you want to merge.

Page 35: Introductin to RNAseq & Differential Gene Expression

Cuffdiff: protocol to estimate differential gene expression

Calculates expression levels and tests the statistical significance of observed changes.

Fisher’s test

Estimates log2 fold change: log2(FPKM_B / FPKM_A)

Cuffdiff reports numerous output files containing the results of its differential analysis of the samples.

These files contain statistical values such as fold change and P values, gene and transcript features such as common name and location in the genome, and the FPKM values for each feature.

Command:
[user]$ cuffdiff merged_asm/merged.gtf sample1/tophat_out/accepted_hits.bam sample2/tophat_out/accepted_hits.bam
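The fold change estimate log2(FPKM_B / FPKM_A) can be worked through in a few lines. The function name and FPKM values below are hypothetical, and the zero handling is an assumption of this sketch, not a description of how Cuffdiff treats such cases.

```python
import math


def log2_fold_change(fpkm_a, fpkm_b):
    """log2(FPKM_B / FPKM_A): positive when the gene is higher in condition B."""
    if fpkm_a <= 0 or fpkm_b <= 0:
        # Assumption of this sketch: refuse zero or negative values.
        raise ValueError("both FPKM values must be positive")
    return math.log2(fpkm_b / fpkm_a)


# Hypothetical gene: 10 FPKM in condition A, 40 FPKM in condition B.
print(log2_fold_change(10.0, 40.0))  # four-fold increase
print(log2_fold_change(8.0, 2.0))    # four-fold decrease
```

On the log2 scale, equal-sized increases and decreases are symmetric around zero, which makes up- and down-regulated genes directly comparable.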

Page 36: Introduction to RNAseq & Differential Gene Expression

gene_exp.diff

[Figure: contents of gene_exp.diff, with the FPKM values and a significant gene marked]

Page 37: Introduction to RNAseq & Differential Gene Expression

[Workflow diagram]

Healthy tissue + ref genome A (.fa file) → tophat → Healthy_hits.bam

Diseased tissue + ref genome A (.fa file) → tophat → Diseased_hits.bam

Healthy_hits.bam → cufflinks → Transcripts_healthy.gtf

Diseased_hits.bam → cufflinks → Transcripts_diseased.gtf

Transcripts_healthy.gtf + Transcripts_diseased.gtf + Ref A.gtf → cuffmerge → Healthy_disease.merged.gtf

Healthy_disease.merged.gtf → cuffdiff → Gene.diff

Reference files (ref genome A, Ref A.gtf): ENSEMBL
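The workflow can be sketched as an ordered list of commands. The helper below only builds the argument lists and does not run anything. The directory layout, sample names and FASTQ file names are hypothetical, and the `-o` output directory option for cufflinks is the only option not shown elsewhere in these slides.

```python
def manifest_lines(sample_names):
    """Paths to the per-sample Cufflinks assemblies, one per line of assembly_GTF.list."""
    return [f"{name}/cufflinks_out/transcripts.gtf" for name in sample_names]


def tuxedo_commands(genome_index, genome_fasta, ref_gtf, samples):
    """Build the command lines of the workflow in order.

    samples: dict mapping a sample name to its (reads_1, reads_2) FASTQ files.
    """
    commands = []
    for name, (reads_1, reads_2) in samples.items():
        bam = f"{name}/tophat_out/accepted_hits.bam"
        # 1. Map reads with TopHat, guided by the reference annotation.
        commands.append(["tophat", "-o", f"{name}/tophat_out",
                         "-G", ref_gtf, genome_index, reads_1, reads_2])
        # 2. Assemble transcripts for this sample with Cufflinks.
        commands.append(["cufflinks", "-o", f"{name}/cufflinks_out", bam])
    # 3. Merge all assemblies listed in assembly_GTF.list.
    commands.append(["cuffmerge", "-s", genome_fasta, "-g", ref_gtf,
                     "assembly_GTF.list"])
    # 4. Test for differential expression between the samples.
    commands.append(["cuffdiff", "merged_asm/merged.gtf"]
                    + [f"{name}/tophat_out/accepted_hits.bam" for name in samples])
    return commands


samples = {
    "healthy": ("healthy_1.fastq", "healthy_2.fastq"),
    "diseased": ("diseased_1.fastq", "diseased_2.fastq"),
}
for command in tuxedo_commands("chr19", "genome/chr19.fa", "chr19.gtf", samples):
    print(" ".join(command))
```

The lines from `manifest_lines` would be written to assembly_GTF.list before the cuffmerge step.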

Page 38: Introduction to RNAseq & Differential Gene Expression

Tools used in RNAseq exercise

[Figure: flowchart with the labels Samples, Tophat, Cufflinks, Cuffdiff, HTSeq, DESeq (R package), Fold change, Expression Analysis]

Page 39: Introduction to RNAseq & Differential Gene Expression

Gene set enrichment analysis by DAVID

Identification of GO terms that are significantly over-represented in a given gene list.

A hypergeometric statistical test is performed to identify such terms.

Simple example:
– Your statistically significant gene list = 694 genes (each gene associated with GO terms)
– Total genes in organism = 10,738
– Total genes in organism with the cell division GO term (biological process) = 634

Of the 694 genes, 107 have the cell division GO term (biological process), whereas about 41 would be expected by chance (694 × 634 / 10,738). The hypergeometric test reports, with its statistical values for confidence, that the term is over-represented.

You can conclude that cell division is altered between the normal and treatment samples.
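The numbers in this example can be checked with a direct hypergeometric calculation. This is a sketch of the textbook upper-tail test using only the Python standard library; DAVID's own scoring may differ, and the function name is an assumption.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k_observed, population, successes, draws):
    """P(X >= k_observed) when drawing `draws` genes without replacement
    from `population` genes, of which `successes` carry the GO term."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_observed, upper + 1)
    )
    # Exact integer arithmetic, converted to float only at the end.
    return float(Fraction(numerator, comb(population, draws)))


total_genes = 10_738      # genes in the organism
term_genes = 634          # genes annotated with the cell division GO term
list_size = 694           # statistically significant gene list
observed = 107            # genes in the list with the GO term

expected = list_size * term_genes / total_genes
p_value = hypergeom_upper_tail(observed, total_genes, term_genes, list_size)
print(f"expected by chance: {expected:.1f}, observed: {observed}")
print(f"P(X >= {observed}) = {p_value:.3g}")
```

A very small p-value indicates that seeing 107 or more annotated genes in a list of 694 is very unlikely under random sampling, which is what "over-represented" means here.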

Page 40: Introduction to RNAseq & Differential Gene Expression

Submit your genelist

Page 41: Introduction to RNAseq & Differential Gene Expression

Annotation summary results

Page 42: Introduction to RNAseq & Differential Gene Expression

Page 43: Introduction to RNAseq & Differential Gene Expression

Questions ??

THANK YOU