RNA SEQUENCING AND DATA ANALYSIS
Length of mRNA transcripts in the human genome
[Figure: distribution of mRNA transcript lengths, plotted from 0 to 10,000 with an inset covering 0 to 800; axis labels not recoverable]
Insert size ~ 200bp
Overview of RNA sequencing protocol
[Diagram: each insert is sequenced from both ends, giving a forward read and a reverse read]
Read length: 48-76bp
Sequencing parameters
Read Depth
Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome
More reads needed for splicing variant discovery and differential comparison among samples
Current output: 120-180 million raw reads / lane
Multiplex level: 4-12 libraries / lane recommended
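As a rough planning aid, the lane output and multiplex level above can be turned into an expected read count per library. This is a minimal sketch that assumes reads are split evenly across the libraries on a lane; the `mapping_rate` parameter is a hypothetical knob (the slide gives no mapping rate) and defaults to 1.0.

```python
def reads_per_library(raw_reads_per_lane, libraries_per_lane, mapping_rate=1.0):
    """Expected mapped reads per library, assuming an even split across the lane.

    mapping_rate is an assumed fraction of raw reads that map (not from the slide).
    """
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return raw_reads_per_lane / libraries_per_lane * mapping_rate


# Minimum mapped reads for quantitative analysis, as quoted on the slide.
MIN_MAPPED_READS = 10_000_000

for lane_output in (120_000_000, 180_000_000):
    for multiplex in (4, 12):
        per_library = reads_per_library(lane_output, multiplex)
        status = "meets" if per_library >= MIN_MAPPED_READS else "falls below"
        print(f"{lane_output:,} reads over {multiplex} libraries: "
              f"{per_library:,.0f} per library ({status} the minimum)")
```

With a realistic mapping rate below 1.0, the 12-library case at the low end of lane output would drop under the 10 million minimum, which is one reason to multiplex less deeply for splicing or differential analyses.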
All RNA is not the same
Types of RNA:
Messenger RNA
Micro RNA
Long non-coding RNA
Ribosomal RNA
Methods for RNA enrichment prior to library construction
Poly(A)-RNA selection: by hybridization to oligo-dT beads
  Mature mRNA is highly enriched
  Efficient for quantifying gene expression levels
  Limitation: 3’ bias that correlates with RNA degradation
rRNA depletion: by hybridization to bead-bound rRNA probes
  Probes are rRNA sequence-dependent and species-specific
  All non-rRNA is retained, including premature mRNA and long non-coding RNA
Small RNA extraction: specific kits are required to retain small RNA
  Optional fine size selection by gel or column
Different methods capture different types of RNA
Columns: Poly(A)-RNA selection, rRNA depletion, Small RNA extraction
Messenger RNA X X
Micro RNA X X
Long non-coding RNA X
Ribosomal RNA X
Paraffin embedded vs fresh frozen
[Figure: read quality plot, panel labeled Fresh Frozen]
First step: alignment
Or: assembly, then alignment
Alignment versus assembly
Assembly
Trinity, Cufflinks, ABySS
Particularly useful when no reference genome is available, for example in bacterial transcriptomes
Alignment
Bowtie, BWA, MOSAIK
Maximum sensitivity, fewer false positives
RNA sequencing applications
Quantification of transcript expression levels
Detection of splice variation/different isoforms of the same gene
Allele specific expression levels
Strand specific expression levels
Detection of fusion transcripts (such as BCR-ABL in CML)
Detection of sequence variation (limited application)
Validation of DNA sequence variants
RNA-seq expression levels remain linear in ranges where microarrays saturate or are insensitive
Expression is measured as ‘reads per kilobase of exon model per million mapped reads’ (RPKM) or ‘fragments per kilobase of exon per million fragments mapped’ (FPKM), which normalize for gene length and library size
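The normalization can be written as RPKM = reads × 10^9 / (gene length in bp × total mapped reads). The sketch below is a minimal illustration of that formula; the function name and the example counts are made up for demonstration. FPKM uses the same arithmetic with fragments (read pairs) in place of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    For FPKM, pass fragment counts instead of read counts.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: two genes in a library of 10 million mapped reads.
library_size = 10_000_000
short_gene = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=library_size)
long_gene = rpkm(read_count=1_000, gene_length_bp=4_000, total_mapped_reads=library_size)
print(short_gene, long_gene)
```

The longer gene collects twice as many reads but receives the same value as the shorter one, which is the gene-length correction the slide refers to.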
In GBM, the gene EGFR is frequently targeted by intragenic deletions
The EGFRvIII deletion occurs in the same domain as point mutations
Detecting EGFR transcript variants using RNA-seq data
SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/SpliceSeq:Overview
Allele-/Strand-specific RNA-seq
Haplotype-specific gene expression is estimated by computationally integrating RNA-seq with DNA SNP data
Strand-specific RNA-seq requires a dedicated library preparation protocol
  Costs more
  Output is more accurate and is useful for analysis in the absence of a reference genome
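One simple way to integrate RNA-seq with DNA SNP data is to count reads carrying each allele at a SNP known from DNA to be heterozygous, then test whether the split departs from 50/50. The sketch below is an assumption-laden illustration, not a method named in the slides: it uses an exact two-sided binomial test, ignores reference-allele mapping bias, and the SNP identifiers and counts are invented.

```python
from math import comb


def allelic_imbalance_p(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one heterozygous SNP."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no coverage, no evidence either way
    p = expected_ref_fraction
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    tail = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical allele counts (reference reads, alternate reads) per SNP.
snp_counts = {"snp_example_1": (48, 52), "snp_example_2": (90, 10)}
for snp_id, (ref, alt) in snp_counts.items():
    fraction = ref / (ref + alt)
    print(snp_id, f"ref fraction={fraction:.2f}", f"p={allelic_imbalance_p(ref, alt):.3g}")
```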
Identification of fusion transcripts
Popular methods search for
  Read pairs that map to two different genes
    Need to correct for gene homology
  Reads that span the fusion junction
    Split reads in half and align the halves separately
    Make a database of all possible fusion junctions and align full reads
PRADA, MapSplice, TopHat
http://sourceforge.net/projects/prada/
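The junction-database idea listed above can be sketched as follows. This is not PRADA's actual implementation: it assumes that a fusion joins the 3' end of any exon of the 5' partner to the 5' end of any exon of the 3' partner, and the `flank` length, exon identifiers, and sequences are hypothetical.

```python
from itertools import product


def build_junction_db(exons_5p_gene, exons_3p_gene, flank):
    """Build every exon-exon junction sequence between two candidate partner genes.

    exons_5p_gene / exons_3p_gene: dict mapping exon id -> exon sequence.
    flank: number of bases kept on each side of the junction (assumed parameter).
    Returns a dict mapping (exon_a, exon_b) -> (junction sequence, breakpoint index).
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    db = {}
    for (id_a, seq_a), (id_b, seq_b) in product(exons_5p_gene.items(), exons_3p_gene.items()):
        left = seq_a[-flank:]   # last bases of the upstream exon
        right = seq_b[:flank]   # first bases of the downstream exon
        db[(id_a, id_b)] = (left + right, len(left))
    return db


# Toy exon sequences for illustration only.
gene_a_exons = {"A_exon1": "AAAACCCC", "A_exon2": "GGGGTTTT"}
gene_b_exons = {"B_exon1": "ACGTACGT"}
junctions = build_junction_db(gene_a_exons, gene_b_exons, flank=4)
for name, (sequence, breakpoint) in junctions.items():
    print(name, sequence, breakpoint)
```

Full reads are then aligned against these short sequences; a flank shorter than the read length forces any aligned read to cross the breakpoint.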
FGFR3-TACC3 fusion in GBM is the result of a local inversion
Fusion transcripts are often associated with copy number difference and genomic breakpoints
Copy number profile of two FGFR3-TACC3 cases in TCGA
6.4% of GBM cases harbor transcript fusions involving EGFR
All fusions fall within the area of the EGFR amplification
[Diagram: PRADA pipeline]
INPUTS
  .fastq files [END1 & END2] (paired end)
  Config.txt [location of scripts and reference files]
Processing Module (preprocessing)
  Read alignment
  Remap alignments
  Combine two ends
  Quality scores recalibrated
  Output: .bam file
Expression & QC Module [YES | NO | ONLY], using RNA-SeQC
  Output: RPKM & QC metrics
Fusion Module [YES | NO | ONLY]
  Output: fusion candidates
GUESS-ft [YES | NO | ONLY], with -geneA -geneB
  Output: supervised search evidence
Fusion Module
Discordant read pair: each end of the read pair maps uniquely to a distinct protein-coding gene.
Fusion spanning read: a chimeric read that maps to a putative junction, while its mate maps to either Gene A or Gene B.
[Diagram: read pairs drawn across Gene A and Gene B]
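The discordant read pair definition can be expressed as a small predicate. The sketch below works on a simplified, hypothetical record type (`EndAlignment`) rather than real BAM records, and assumes each read end has already been annotated with the protein-coding gene it hits and whether that mapping is unique.

```python
from collections import Counter
from typing import NamedTuple, Optional


class EndAlignment(NamedTuple):
    """Simplified stand-in for one aligned read end (hypothetical structure)."""
    gene: Optional[str]  # protein-coding gene hit, or None if unmapped or intergenic
    unique: bool         # True if the end maps to a single location


def is_discordant(end1, end2):
    """True if both ends map uniquely and land in two different genes."""
    return (
        end1.gene is not None
        and end2.gene is not None
        and end1.unique
        and end2.unique
        and end1.gene != end2.gene
    )


def count_discordant_pairs(pairs):
    """Count discordant pairs per unordered gene pair."""
    counts = Counter()
    for end1, end2 in pairs:
        if is_discordant(end1, end2):
            counts[tuple(sorted((end1.gene, end2.gene)))] += 1
    return counts


# Toy read pairs for illustration only.
pairs = [
    (EndAlignment("FGFR3", True), EndAlignment("TACC3", True)),
    (EndAlignment("TACC3", True), EndAlignment("FGFR3", True)),
    (EndAlignment("FGFR3", True), EndAlignment("FGFR3", True)),   # concordant
    (EndAlignment("FGFR3", False), EndAlignment("TACC3", True)),  # not unique
    (EndAlignment(None, False), EndAlignment("TACC3", True)),     # one end unmapped
]
print(count_discordant_pairs(pairs))
```

Pairs with one unmapped end are not discarded in practice: they are the candidates for the fusion spanning reads described above.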
Structural transcript variants in low grade glioma
RNA-seq data from 272 TCGA low grade gliomas
PRADA detected 1,843 fusion transcripts
Fusion detection accuracy affected by:
[Figure: number of mapped reads per sample versus number of detected fusion transcripts per sample]
Filtering out artifacts
Homology E value larger than 0.01 (column Evalue)
No mismatches in junction spanning reads
Count the number of partner genes for each individual gene
Identify genes with fusions mapping to more than 10 different chromosome arms
970/1,843 fusions filtered
Validation of predicted transcript fusions
509/970 fusions filtered
Define four tiers of fusion transcripts based on evidence
Tier 1: At least 3 discordant read pairs (DSP), two perfect match junction spanning reads (JSR), and both partner genes only fused to one other partner gene in the same sample
Tier 2: At least 2 DSP and 1 JSR, with a DNA breakpoint within a 100kb window
  Use the matching DNA copy number profile
Tier 3: At least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all
Tier 4: The rest
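The tier definitions above translate into a short decision function. This sketch assumes the tiers are checked in order and the first one satisfied wins, which the slide does not state explicitly; the `FusionEvidence` field names are hypothetical and the evidence values would come from upstream fusion calls and the matched DNA copy number profile.

```python
from dataclasses import dataclass


@dataclass
class FusionEvidence:
    """Evidence for one candidate fusion (hypothetical field names)."""
    discordant_pairs: int             # DSP count
    junction_reads: int               # JSR count
    perfect_junction_reads: int       # JSRs without mismatches
    unique_partners: bool             # each gene fused to only one partner in the sample
    dna_breakpoint_within_100kb: bool
    junction_consistent: bool         # predicted junction consistent across reads


def assign_tier(ev):
    """Return tier 1-4, checking tiers in order (assumed precedence)."""
    if ev.discordant_pairs >= 3 and ev.perfect_junction_reads >= 2 and ev.unique_partners:
        return 1
    has_base_support = ev.discordant_pairs >= 2 and ev.junction_reads >= 1
    if has_base_support and ev.dna_breakpoint_within_100kb:
        return 2
    if has_base_support and ev.unique_partners and ev.junction_consistent:
        return 3
    return 4


# Illustrative candidates only.
strong = FusionEvidence(5, 3, 3, True, False, True)
dna_supported = FusionEvidence(2, 1, 0, False, True, False)
weak = FusionEvidence(1, 1, 1, True, False, True)
print(assign_tier(strong), assign_tier(dna_supported), assign_tier(weak))
```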
Validation of RNA fusions using output of BreakDancer
BreakDancer detects DNA rearrangements in low pass sequencing data
Variant detection
Approximately 30% of mutations are covered sufficiently to be detected at a validation rate of ~ 80%.
From the TCGA clear cell renal cell carcinoma project
The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations
RNA sequencing read alignment in PRADA
Transcripts from same gene
Reads are aligned to all possible transcripts
Reads are also aligned to genome
The final, single placement for each read is determined by re-mapping
PRADA alignments: advantages versus disadvantages
Advantages:
  Alignment to the genome allows mapping of unannotated transcripts
  Alignment to the transcriptome allows mapping across exon-exon junctions
Disadvantage:
  More conservative alignment than split-read approaches
PRADA focuses on the analysis of paired-end RNA-sequencing data.
Four modules:
1. Processing
2. Expression and Quality Control
3. Gene fusion
4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts
http://sourceforge.net/projects/prada/
Expression & QC Module
RNA-SeQC provides three types of quality control metrics:
  Read counts
  Coverage
  Correlation
RPKM values at transcript level
  For the longest transcript
RNA-SeQC process (Java)
Implementation Results
Samples processed: >400 KIRC, >170 GBM
Works well in MDACC HPC* system
PRADA-fusion module validation rate ~85 % (53 out of 62)
RNA sequencing in The Cancer Genome Atlas
mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads
miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.
Detecting fusion transcripts in GBM
KIRC fusion results
We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.
We identified 80 bona-fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples
“Recurrent” fusions:
SFPQ-TFE3 (n=5, chr1-chrX)
DHX33-NLRP1 (n=2, chr17)
TRIP12-SLC16A14 (n=2, chr2)
TFG-GPR128 (n=4, chr3)
KIRC fusion validation

| Sample ID | 5’ Gene | 3’ Gene | Discordant Read Pairs | Fusion Span Reads | Fusion Junction(s) | 5’ Gene Chr | 3’ Gene Chr | Validated? |
|---|---|---|---|---|---|---|---|---|
| TCGA-AK-3456-01A-02R-1325-07 | TFE3 | SFPQ | 175 | 129 | 1 | chrX | chr1 | Yes |
| TCGA-AK-3456-01A-02R-1325-07 | SFPQ | TFE3 | 116 | 81 | 1 | chr1 | chrX | Yes |
| TCGA-A3-3313-01A-02R-1325-07 | C6orf106 | LRRC1 | 90 | 40 | 2 | chr6 | chr6 | Yes |
| TCGA-A3-3313-01A-02R-1325-07 | CYP39A1 | LEMD2 | 37 | 9 | 1 | chr6 | chr6 | Yes |
| TCGA-B2-4101-01A-02R-1277-07 | FAM172A | FHIT | 17 | 4 | 1 | chr5 | chr3 | Yes |
| TCGA-AK-3445-01A-02R-1277-07 | KIAA0802 | LRRC41 | 14 | 6 | 1 | chr18 | chr1 | Yes |
| TCGA-B0-5095-01A-01R-1420-07 | GORASP2 | WIPF1 | 14 | 2 | 1 | chr2 | chr2 | Yes |
| TCGA-A3-3313-01A-02R-1325-07 | ZNF193 | MRPS18A | 11 | 3 | 1 | chr6 | chr6 | Yes |
| TCGA-A3-3313-01A-02R-1325-07 | FTSJD2 | GPX6 | 9 | 8 | 1 | chr6 | chr6 | Yes |
| TCGA-B0-4945-01A-01R-1420-07 | KIAA0427 | GRM4 | 8 | 5 | 1 | chr18 | chr6 | No |
| TCGA-B8-4143-01A-01R-1188-07 | SLC36A1 | TTC37 | 5 | 5 | 1 | chr5 | chr5 | No |

PRADA-fusion module validation rate: ~85% (11 out of 13), by RT-PCR and FISH assays
TFE3-SFPQ was validated in three individual samples
KIRC fusion validation: RT-PCR
[Figure panel: FAM172A-FHIT]
Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456. Panels (a) and (b) are shown for SFPQ-TFE3 and for TFE3-SFPQ.
TFG-GPR128 has been reported in other cancers
TCGA has thousands of RNA-seq samples: how can we quickly scan many samples for the presence of this fusion?
Supervised Search Module
GUESS-ft: General User dEfined Supervised Search for fusion transcripts
[Diagram: GUESS-ft workflow from BAM input to summary report]
Mapped to A or B: use high quality mapping reads only, check that read orientation fulfills the fusion schema, allow up to one mismatch
A-B discordant reads: the two read ends map to A and B respectively
Unmapped reads: parse unmapped reads whose other end maps to A or B
Junction DB: map the parsed reads to a database of all possible exon junctions
Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B
Summary report
Time consuming step (flagged in the diagram)
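The junction-spanning step of the supervised search can be illustrated with a naive scan. This is only a sketch under stated assumptions: a real pipeline would use an aligner rather than brute-force comparison, the junction database format (name mapped to sequence and breakpoint index) is hypothetical, and the input reads are assumed to be unmapped ends whose mates already map to gene A or B.

```python
def within_mismatches(a, b, limit):
    """True if equal-length strings a and b differ at no more than `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def find_junction_hits(reads, junction_db, max_mismatches=1):
    """Report reads that match a junction sequence and cross its breakpoint.

    reads: iterable of (read_id, sequence).
    junction_db: dict mapping junction name -> (sequence, breakpoint index).
    Returns a list of (read_id, junction name, offset), first hit per junction.
    """
    hits = []
    for read_id, seq in reads:
        for name, (junction, breakpoint) in junction_db.items():
            for offset in range(len(junction) - len(seq) + 1):
                # Require at least one base on each side of the breakpoint.
                if not (offset < breakpoint < offset + len(seq)):
                    continue
                window = junction[offset:offset + len(seq)]
                if within_mismatches(seq, window, max_mismatches):
                    hits.append((read_id, name, offset))
                    break
    return hits


# Toy junction and reads for illustration only.
junction_db = {"A_exon1|B_exon1": ("CCCCACGT", 4)}
unmapped_reads = [("read1", "CCAC"), ("read2", "GGGG"), ("read3", "CCAG")]
print(find_junction_hits(unmapped_reads, junction_db))
```

The nested scan over every read, junction, and offset hints at why this stage is the slow part of a supervised search when run across many samples.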
Identification of TFG-GPR128 fusion
All available normal samples in cghub
Subset of tumor samples selected based on RPKM expression pattern
Table. Samples across cancer types

| Cancer Type | # of normal samples | # of tumor samples |
|---|---|---|
| Bladder Urothelial Carcinoma [BLCA] | 0 (0%) | 2 (3.6%) |
| Breast invasive carcinoma [BRCA] | 1 (0.94%) | 13 (1.6%) |
| Head and Neck squamous cell carcinoma [HNSC] | 0 (0%) | 6 (2.3%) |
| Kidney renal clear cell carcinoma [KIRC] | 1 (1.5%) | 5 (1.2%) |
| Kidney renal papillary cell carcinoma [KIRP] | 0 (0%) | 1 (5.9%) |
| Liver hepatocellular carcinoma [LIHC] | 0 (0%) | 1 (5.9%) |
| Lung adenocarcinoma [LUAD] | 0 (0%) | 1 (0.79%) |
| Lung squamous cell carcinoma [LUSC] | 0 (0%) | 9 (4%) |
| Prostate adenocarcinoma [PRAD] | 1 (14.3%) | 2 (1.9%) |
| Thyroid carcinoma [THCA] | 0 (0%) | 2 (0.89%) |

* All performed by PRADA fusion module.
Tumors with the fusion have higher GPR128 expression levels
[Figure: RPKM expression pattern seen in KIRC tumors; fusion sample(s) show higher expression of GPR128 (activation)]
TCGA-B0-5703: 1 discordant read pair in the tumor sample and 33 discordant read pairs in the matched normal