Introduction to RNA-seq & Differential Gene Expression
By: Amit Kumar Singh
Next Generation Sequencing data
Next Generation Sequencing: Possibilities
Massive growth in amount of data generation since 2006
• Traditional Sanger vs. Next Generation Sequencing methods:
- Reduced cost per base
- Reduced sequencing time
- Covering a wide range of applications
Comparative costs: sequencing a human genome
Sequencing is cheaper and faster; analysis of the data is complex and time consuming.
[Figure: genome, transcriptome, mRNA, others]
What is a transcriptome?
Complete set of all RNA molecules in a cell. It includes mRNA, rRNA, tRNA and other non-coding RNA.
Array of mRNA transcripts produced in a particular cell or tissue type.
The study of transcriptomics, also referred to as expression profiling, examines the expression level of mRNAs in a given cell population.
GENOME vs. TRANSCRIPTOME
Genome: content is fixed
Transcriptome: content is time and cell specific and is much more complex than the genome
NGS Advantages
• Next-generation sequencing (NGS) of cDNA (RNA-Seq) is becoming more widely adopted for transcriptome profiling.
• Dropping prices and maturing technology are making NGS the technology of choice.
• RNA-Seq does not depend on genome annotation.
• Transcript reconstruction: non-model organisms. Transcript verification: model organisms.
• RNA-Seq is the method of choice in projects using non-model organisms and for novel transcript discovery and genome annotation.
• Accurate expression level determination.
Cons
• Current wet-lab RNA-Seq strategies require lengthy library preparation procedures.
Different types of RNA
Transcripts and alternative splicing
An RNA transcript is the code copied from one strand of DNA (known as the template strand). The initial transcript is pre-mRNA; splicing removes its introns, and the resulting mature mRNA carries the code out of the nucleus and into the cytoplasm. In alternative splicing, exons are joined in different combinations, so one gene can give rise to several transcripts.
Transcripts sharing the same TSS or CDS
What is RNA-seq?
• Sequencing-based method to study the transcriptome
• Use of Next-Generation Sequencing (NGS) technology to measure RNA levels
• Generating and sequencing ‘reads’ from cDNA
• Mapping reads to a reference genome
• Quantification of assembled reads
Experiment design: Replicates
Technical replicates: measure a quantity repeatedly from one source.
E.g.: 5 samples from a single patient suffering from lung cancer
Biological replicates: measure a quantity from different sources under the same conditions.
E.g.: 5 samples, one from each of 5 different patients suffering from lung cancer
Use of replicates
- Minimize experimental variation or artifacts
- Improve results by averaging out
- The more data, the more robust the statistical test and the more power it has to detect true differences
Analysis tools
• Read alignment
• Transcript assembly or genome annotation
• Transcript and gene quantification
RNA-Seq analysis pipeline for detecting differential expression
An overview of RNAseq for Differential gene expression
Tuxedo Pipeline for RNAseq analysis
Mapping
Objective: to find the unique location where a short read is identical to the reference.
Reality: the reference is never a perfect representation of the actual biological source of the RNA being sequenced.
• Sample-specific attributes like SNPs and indels
• Short reads can align perfectly to multiple locations and can contain sequencing errors
The real task is to find the location where each short read best matches the reference, allowing for errors and structural variation.
Problem in mapping reads that span splice junctions
These reads are aligned on the reference genome.
Splice junction aligners split junction-spanning reads and index the information.
Mapping: Challenges
Multimaps: reads that map equally well to several locations
Multimap treatment: discard multimaps
Paired-end reads reduce the problem of multi-mapping
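The "discard multimaps" treatment can be illustrated with a minimal Python sketch. It assumes single-end alignments already reduced to (read name, chromosome, position) tuples; the tuple layout, the function name and the example reads are hypothetical, and real aligners apply their own filtering options.

```python
from collections import Counter

def discard_multimaps(alignments):
    """Keep only alignments whose read name occurs exactly once."""
    # Count how many alignment records each read produced.
    hits_per_read = Counter(name for name, _chrom, _pos in alignments)
    return [aln for aln in alignments if hits_per_read[aln[0]] == 1]

# Hypothetical alignments: read2 maps equally well to two locations.
alignments = [
    ("read1", "chr19", 1000),
    ("read2", "chr19", 5000),
    ("read2", "chr19", 9000),
    ("read3", "chr19", 7500),
]
unique = discard_multimaps(alignments)
print(unique)
```

Discarding is the simplest treatment; it loses every read that falls in a repeated region, which is one reason paired-end data helps.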
TopHat algorithm
• Splice junction mapper
• Initial mapping onto the genome (exons) by Bowtie, an ultrafast short read aligner
• Builds a database of possible splice junctions
• Maps unmapped reads against the junctions
• Also splits the unmapped reads into smaller fragments to map onto exons
Input to know: GTF file
• GTF: Gene Transfer Format
• A reference GTF file is a collection of every transcript (genes and their isoforms + non-coding RNA transcripts)
• Available from genome databases: ENSEMBL, UCSC, RefSeq
Sample reference GTF file format
[Figure labels: chr, source, start, end, strand, attributes of transcripts]
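To make the GTF layout concrete, here is a small Python sketch that splits one record into its nine tab-separated columns and reads the attributes column into a dictionary. The example record is invented for illustration, and the attribute parsing is simplified: it assumes no semicolons inside quoted values.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine tab-separated columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF record has exactly 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attr_text = cols
    attributes = {}
    # Attributes look like: key "value"; key "value";
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }

# Hypothetical record, not taken from any real annotation file.
example = ('chr19\texample_source\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "GENE1.1";')
record = parse_gtf_line(example)
print(record["seqname"], record["start"], record["end"],
      record["attributes"]["transcript_id"])
```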
Mapping with TopHat: how to use
TopHat is a splice junction aligner. Behind the scenes it uses Bowtie to map short reads onto the genome.
Bowtie uses an extremely economical data structure called the FM index to store the reference genome sequence, which allows it to be searched rapidly.
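The idea behind an FM index can be shown with a toy Python sketch: build the Burrows-Wheeler transform of a short reference and find exact matches by backward search. This is an illustration under simplifying assumptions, not Bowtie's implementation. It keeps the full suffix array and a full occurrence table in memory, which a real aligner avoids, and the reference string and function names are made up.

```python
def build_fm_index(text):
    """Build suffix array, first-column offsets C, and occurrence table."""
    text += "$"  # unique terminator that sorts before A/C/G/T
    # Naive suffix sorting: fine for a toy string, not for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ

def find(pattern, sa, C, occ):
    """Backward search: sorted start positions of pattern in the text."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))

reference = "ACGTACGTGACG"  # made-up reference sequence
sa, C, occ = build_fm_index(reference)
print(find("ACG", sa, C, occ))
print(find("GTG", sa, C, occ))
```

Each pattern character narrows the range of matching rows in one step, so the search time depends on the read length rather than on the genome length.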
Indexing of the reference genome:
E.g.: the reference genome is chr19.fa. Indexing of the reference genome is done with the bowtie2 utility bowtie2-build.
bowtie2-build <Ref genome fasta> <prefix>
[user]$ bowtie2-build chr19.fa chr19
Tophat commands
(i) Mapping without using reference annotation
[user]$ tophat chr19 reads1.fastq reads2.fastq
(ii) Mapping using reference annotation
It uses the reference annotation (GTF) for known splice junction locations, for better mapping.
[user]$ tophat -G chr19.gtf chr19 reads1.fastq reads2.fastq
(iii) Mapping only to the reference annotation
[user]$ tophat -G chr19.gtf --no-novel-juncs chr19 reads1.fastq reads2.fastq
Note: The Gene Transfer Format (GTF) is a file format used to hold information about gene structure.
New feature: mapping on the transcriptome
You can even map your reads directly onto the transcriptome with this new feature of TopHat. When TopHat is provided with a known transcript file (-G/--GTF option above), a transcriptome sequence file is built. Bowtie then creates the index for these new transcriptome sequences. Reads are then aligned to these known transcripts. (First time)
[user]$ tophat -o output_sample1 -G chr19.gtf --transcriptome-index=transcriptome/known chr19 sample1_1.fastq sample1_2.fastq
Once the transcriptome index is built, there is no need to specify the -G option the next time you run TopHat for other samples; point --transcriptome-index at the same index. (Next time mapping on the transcriptome)
[user]$ tophat -o output_sample2 --transcriptome-index=transcriptome/known chr19 sample2_1.fastq sample2_2.fastq
Output of Tophat
1. accepted_hits.bam. A list of read alignments in BAM format.
2. junctions.bed. A UCSC BED track of junctions reported by TopHat. The score is the number of alignments spanning the junction.
Alignments are reported in BAM files
BAM is the compressed, binary version of SAM, a flexible and general purpose read alignment format.
Many downstream analysis tools accept SAM and BAM as input.
There are also numerous utilities for viewing and manipulating SAM and BAM files. Perhaps the most popular among these is SAMtools.
SAM format
[Figure labels: read name, flag, chromosome, position, mapping quality, CIGAR, "=" (means paired-end), name of mate, insert length, sequence, quality values]
CIGAR string: describes the position of insertions/deletions/matches in the alignment; it also encodes splice junctions, for example.
For more information: http://samtools.sourceforge.net/samtools.shtml
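As an illustration of how a CIGAR string encodes splice junctions, the Python sketch below splits a CIGAR into its operations and reports the reference interval skipped by each N operation. The function names and the example alignment (position 1000, CIGAR 30M500N20M) are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Return the CIGAR as a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that are not fully covered by valid operations.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops

def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end), 1-based inclusive, per N operation."""
    junctions = []
    ref = pos  # 1-based reference coordinate of the next aligned base
    for length, op in parse_cigar(cigar):
        if op == "N":  # skipped region on the reference, i.e. an intron
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions

print(parse_cigar("30M500N20M"))
print(splice_junctions(1000, "30M500N20M"))
```

Here the first 30 bases align at 1000-1029, the read then skips 500 reference bases, and the last 20 bases align from 1530.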
BED format
[Figure labels: chr, start and end, junction ID, optional fields]
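A junctions.bed record can be read with a few lines of Python. This sketch assumes the 12-column layout in which each junction is drawn as two blocks (the read overhangs on either side), so the intron lies between them. The example line and the function name are invented for illustration.

```python
def parse_junction(line):
    """Parse one 12-column BED junction record into a dictionary."""
    f = line.rstrip("\n").split("\t")
    start, end = int(f[1]), int(f[2])  # BED: 0-based start, exclusive end
    # blockSizes may carry a trailing comma, so skip empty pieces.
    block_sizes = [int(x) for x in f[10].split(",") if x]
    return {
        "chrom": f[0],
        "name": f[3],
        "support": int(f[4]),  # score = alignments spanning the junction
        "strand": f[5],
        # Assumption: the first and last blocks are the flanking overhangs.
        "intron_start": start + block_sizes[0],
        "intron_end": end - block_sizes[-1],
    }

# Hypothetical junction record.
example = "chr19\t1000\t1600\tJUNC00000001\t12\t+\t1000\t1600\t255,0,0\t2\t40,60\t0,540"
junction = parse_junction(example)
print(junction)
```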
Junctions View on IGV
Analysis with samtools
(i) View the BAM file
[user]$ samtools view accepted_hits.bam
(ii) Convert the BAM file into a non-binary SAM file
[user]$ samtools view accepted_hits.bam > accepted_hits.sam
(iii) Count the number of lines in the SAM file
[user]$ wc -l accepted_hits.sam
(iv) Sort the BAM file (older samtools syntax; samtools 1.3 and later use: samtools sort -o outprefix.bam accepted_hits.bam)
[user]$ samtools sort accepted_hits.bam outprefix
(v) Index the BAM file
[user]$ samtools index accepted_hits.bam
(vi) Get the statistics of the BAM file
[user]$ samtools flagstat accepted_hits.bam
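The statistics from flagstat are tallies over the bits of the FLAG field of each alignment. A short Python sketch of how one FLAG value decodes into its properties is shown below; the bit values come from the SAM specification, while the property names and the function name are this example's own choices.

```python
# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

print(decode_flag(99))
print(decode_flag(147))
print(decode_flag(4))
```

Flags 99 and 147 are a typical combination for the two mates of a properly paired read.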
Cufflinks
Cufflinks is used to generate a transcriptome assembly for each sample. It assembles individual transcripts from RNA-seq reads that have been aligned to the genome.
Normalization
• More reads map to a transcript if it is
- long
- at a higher depth of coverage (high expression)
• Normalize such that features of different lengths and from different conditions can be compared
• Need for normalization: to reduce bias within a sample or between different sample conditions
• FPKM is one such normalization strategy, adopted by Cufflinks.
C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exon lengths in base pairs
$$\mathrm{FPKM} = \frac{10^{9} \times C}{N \times L}$$
• Cufflinks estimates the abundance values in FPKM (fragments per kilobase of transcript per million mapped fragments).
• Cufflinks ensures that expression levels for different genes and transcripts can be compared across runs by using FPKM values.
• FPKM is a measure of how many reads have been recorded for each transcript, normalized by transcript length and the total number of reads.
FPKM
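A worked example of the FPKM formula, as a minimal Python sketch. The symbols C, N and L are the ones defined on the slide; the function name and the numbers are made up for illustration, and Cufflinks itself estimates abundances with a statistical model rather than this direct calculation.

```python
def fpkm(count, total_reads, length_bp):
    """FPKM = 10^9 * C / (N * L), with C, N and L as defined on the slide."""
    if total_reads <= 0 or length_bp <= 0:
        raise ValueError("total_reads and length_bp must be positive")
    return 1e9 * count / (total_reads * length_bp)

# Hypothetical gene: 500 reads on 2,000 bp of exons, 20 million reads in total.
value = fpkm(count=500, total_reads=20_000_000, length_bp=2_000)
print(value)

# A gene twice as long with twice the reads gets the same FPKM.
same = fpkm(count=1000, total_reads=20_000_000, length_bp=4_000)
print(same)
```

The second call shows the point of the length term: a longer transcript collects more reads at the same expression level, and the normalization removes that effect.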
Visualizing data on IGV
[Figure labels: intronic regions, exonic regions, non-coding exonic region, reads mapped onto a gene in the BAM file]
Cuffmerge
The assemblies generated by Cufflinks are then merged together using the Cuffmerge utility (part of the Cufflinks package).
This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition.
Command:
[user]$ cuffmerge -s genome/chr19.fa -g chr19.gtf assembly_GTF.list
where assembly_GTF.list is a text file containing the paths of all the Cufflinks assemblies you want to merge.
Cuffdiff: protocol to estimate differential gene expression
Calculates expression levels and tests the statistical significance of observed changes.
• Fisher’s test
• Estimates log2 fold change: $\log_2(\mathrm{FPKM}_B / \mathrm{FPKM}_A)$
Cuffdiff reports numerous output files containing the results of its differential analysis of the samples. These files contain statistical values such as fold change and P values, gene and transcript features such as common name and location in the genome, and the FPKM values for each feature.
Command:
[user]$ cuffdiff merged_asm/merged.gtf sample1/tophat_out/accepted_hits.bam sample2/tophat_out/accepted_hits.bam
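The log2 fold change between two conditions can be computed by hand from the two FPKM values. A minimal Python sketch follows, with hypothetical FPKM values; how zero values are handled is this example's choice and not necessarily what Cuffdiff does.

```python
import math

def log2_fold_change(fpkm_a, fpkm_b):
    """log2(FPKM_B / FPKM_A); positive means higher expression in B."""
    if fpkm_a <= 0 or fpkm_b <= 0:
        # The ratio is undefined for zero values; this sketch refuses them.
        raise ValueError("both FPKM values must be positive")
    return math.log2(fpkm_b / fpkm_a)

# Hypothetical FPKM values for one gene in conditions A and B.
up = log2_fold_change(10.0, 40.0)
down = log2_fold_change(8.0, 2.0)
print(up, down)
```

On the log2 scale a fourfold increase and a fourfold decrease are symmetric around zero (+2 and -2), which is why fold changes are usually reported this way.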
gene_exp.diff
[Figure labels: FPKM values, significant gene]
[Figure: workflow]
Healthy tissue, Diseased tissue + ref genome A (.fa file)
tophat: Healthy_hits.bam, Diseased_hits.bam
cufflinks (+ Ref A.gtf): Transcripts_healthy.gtf, Transcripts_diseased.gtf
cuffmerge: Healthy_disease.merged.gtf
cuffdiff: Gene.diff
ENSEMBL
Tools used in the RNA-seq exercise
[Figure labels: Samples, TopHat, Cufflinks, Cuffdiff, Expression Analysis, DESeq (R package), Fold change, HTSeq]
Gene set enrichment analysis by DAVID
Identification of GO terms that are significantly overrepresented in the given gene list.
A hypergeometric statistical test is performed to identify such terms.
Simple example:
• Your statistically significant gene list = 694 genes (each gene associated with GO terms)
• Total genes in the organism = 10,738
• Total genes with the cell division GO term (biological process) in the organism = 634
The hypergeometric test will report (with its statistical values for confidence): out of 694 genes, 107 genes have the cell division GO term (biological process), which is overrepresented. By chance alone, only about 694 × 634 / 10,738 ≈ 41 such genes would be expected.
You can conclude that cell division is altered between the normal and treatment samples.
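The numbers in this example can be checked with a short Python sketch of the hypergeometric test: the probability of seeing at least 107 annotated genes when 694 genes are drawn from 10,738, of which 634 carry the term. The function name is hypothetical, and DAVID's own statistic may differ in detail; this only illustrates the test named on the slide.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement."""
    upper = min(sample, annotated)
    total = comb(population, sample)
    # Exact integer arithmetic; one division at the end.
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total

population = 10_738  # total genes in the organism
annotated = 634      # genes with the cell division GO term
sample = 694         # statistically significant gene list
hits = 107           # genes in the list with the term

expected = sample * annotated / population
p_value = hypergeom_enrichment_pvalue(population, annotated, sample, hits)
print(f"expected by chance: {expected:.1f}")
print(f"p-value: {p_value:.3g}")
```

Observing 107 genes where about 41 are expected gives a very small p-value, which is what "overrepresented" means here.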
Submit your genelist
Annotation summary results
Questions?
THANK YOU