Exercise 1 Review - Cornell Universitycbsu.tc.cornell.edu/lab/doc/RNASeq20140421lecture2.pdf ·...
Transcript of Exercise 1 Review - Cornell Universitycbsu.tc.cornell.edu/lab/doc/RNASeq20140421lecture2.pdf ·...
Exercise 1 Review
tophat -o A -G testgenome.gff3 --no-novel-juncs
testgenome a.fastq.gz
Running Tophat
-o A : write the result files into a directory “A”
-G testgenome.gff3: use the gene annotation file testgenome.gff3
--no-novel-juncs: do not detect novel splicing junctions
testgenome: bowtie2 index reference genome
a.fastq.gz: RNA-seq data file.
Tophat command:
tophat -o A -G testgenome.gff3 --no-novel-juncs testgenome
a.fastq.gz
• Default, no argument: detect novel junctions
• No novel junctions: -G myAnnotation.gtf --no-novel-juncs
• Doing both: -G myAnnotation.gtf
Exercise 1 Review
Exercise 1 Review
tophat -p 8 -o A -G testgenome.gff3 --no-novel-
juncs testgenome a.fastq.gz
Running Tophat using multiple threads:
-p 8: using 8 cores
Running Tophat on BioHPC Lab
Computers
• General workstations
8-core, 16G RAM, 0.5T hard drive
1 job at a time, using “-p 8”
• Medium memory workstations
24-core, 128G RAM, 1T SSD drive, 4T hard drive
4 jobs (6 cores per job) at a time;
• Large memory workstations
64 cores; 512GB RAM; 9.4TB HDD;1TB SSD;
10 jobs (6 cores per job) at a time;
Genome Databases for Tophat
• On /local_data directory:
human, mouse, Drosophila, C. elegans, yeast,
Arabidopsis, maize.
• On /shared_data/genome_db/:
rice, grape, apple, older versions of databases.
How to prepare Tophat genome
database
bowtie2-build rice7.fa rice7
Genome
fasta fileGive the
database a name
TOPHAT command in 2-step approach
tophat -G myAnnotation.gtf --transcriptome-index transcriptome
mygenome
Step 1. Build transcript sequences using the genome fasta and GFF file
tophat -p 8 -o sample1 -G myAnnotation.gtf --transcriptome-
index transcriptome/myAnnotation --no-novel-juncs mygenome
sample1.fastq.gz
Step 2. Using TOPHAT to align read to transcript and genome sequences
Default output from TOPHAT
• Up to 20 hits for the same read.
I. If more than 20 hits are found, randomly report
20;
II. It can be adjusted with --max-multihits
parameter.
• Maximum mismatch: 2
I. can be adjusted with --read-mismatches
Quantification
&
Normalization
Summarizing mapped reads into a gene level count
Different summarization strategies will result in the inclusion or exclusion of different
sets of reads in the table of counts.
Genome available, but
no annotation
Tophat: novel junctions
Cufflinks: build gff files
Cuffmerge: merge gff files
Cuffdiff: count reads using gff
RNA-seq Pipelines
Good genome and gene
annotation
Tophat: no novel junctions
Cuffdiff: count reads using gff
Differentially expressed gene detection
Differentially expressed gene detection
No reference genome
• Using Trinity to assemble a reference
transcriptome;
• Using tools provided with Trinity for transcript
quantification;
• Contact us for detailed instructions for
running Trinity.
Complications in quantification
1. Multi-mapped reads
Cufflinks
– uniformly divide each read to all mapped positions
– multi-mapped read correction (default off, can be enabled
with --multi-read-correct option)
HTSeq
– Count unique and multi-mapped reads separately
Complications in quantification
2. Assign reads to isoforms
Cufflinks
– Use its own model to estimate isoform abundance;
HTSeq
– A set of arbitrary rules specified by mode option,
including (a)skip or (b)counted towards each feature.
* Less problem with gene level read counting
Include in CUFFLINKS software:
CUFFLINKS vs CUFFDIFF
• Cufflinks
– Single library;
– Reference guide transcript assembly;
• Cuffdiff
– Multiple libraries;
– Quantification based on genes defined in GFF/GTF files;
– Identify differentially expressed genes
• Cufflinks
– Input: BAM file, GFF/GTF(optional)
– Output: GTF, isoform and gene FPKM
• Cuffdiff
– Input: BAM files, GFF/GTF(required)
– Output (at either gene or isoform level)
• FPKM, read count, read group tracking.
• Differential expressed genes
• Differential splicing test.
Include in CUFFLINKS software:
CUFFLINKS vs CUFFDIFF
• Cufflinks
– Input: BAM file, GFF/GTF(optional)
– Output: GTF, isoform and gene FPKM
• Cuffdiff
– Input: BAM files, GFF/GTF(required)
– Output (at either gene or isoform level)
• FPKM, read count, read group tracking.
• Differential expressed genes
• Differential splicing test.
Commonly done
at gene level
Raw counts can
be used for
external software
Include in CUFFLINKS software:
CUFFLINKS vs CUFFDIFF
Read-depth are not even across the same gene• Sequencing bias;
• RNA-seq provides relative abundance for the same gene across samples.
Fragment bias correction in cufflinks
• In some scenario, it can help to correct
batch effects due to different platforms,
protocols and reagents, et al.
--frag-bias-correct
Normalization methods
� Total-count normalization
• By total mapped reads
� Upper-quantile normalization
• By ead count of gene at upper-quantile
� Normalization by housekeeping genes
� Trimmed mean (TMM) normalization
Normalization methods
� Total-count normalization (FPKM, RPKM)
• By total mapped reads
� Upper-quantile normalization
• By read count of gene at upper-quantile
� Normalization by housekeeping genes
� Trimmed mean (TMM) normalization
cuffdiff
EdgeR & DESeq
Default
Normalization for RNA-seq data
Robinson & Oshlack 2010 Genome Biology 2010, 11:R25.
Median log-ratio of the housekeeping genes
Estimated TMM normalization factor
Log intensity ratio (M) vs average log intensity (A)
Statistical modeling of gene expression
and test for differentially expressed genes
1. Variance from biological replicates is large, and
cannot be modelled with Poisson distribution.
2. EdgeR and DESeq used related negative binomial
distribution (NB), and is increasing widely used for
detection for differentially expressed genes.
3. Each software has a built-in multiple test (e.g.
Benjamini-Hochberg) to give the FDR rate.
Rapaport F et al.
Genome Biology,
2013 14:R95
Comparison of Methods
TOPHAT -> BAM files
CUFFDIFF -> counts in
genes.read_group_tracking
EdgeR -> TMM normalization
EdgeR -> GLM for DE gene detection
RNA-seq Workflow at Bioinformatics Facility
http://cbsu.tc.cornell.edu/lab/doc/rna_seq_draft_v7.pdf
MDS plot of the samples
• Check reproducibility from replicates, remove outliers;
• Check batch effects;
Commercial software resource at
Cornell
• General RNA-seq data analysis
– Geneious
– DNASTAR
• Pathway analysis
– Ingenuity
• RNA-seq statistics and Pathway
– Genespring
• License informationa: http://www.biotech.cornell.edu/node/137
• BioHPC lab has one Windows computer that can run the commercial software
Exercise
• Using cuffdiff to identify differentially
expressed genes of rice plants grown at two
different conditions A and B. There are two
replicates for each condition.
• Using EdgeR package to make MDS plot of the
4 libraries.