Finding the Lost Treasure of NGS Data
Yan Guo, PhD
Modules Overview for DNA-sequence Exome / whole-Genome
• fastq files (FastQC)
• bwa alignment
• Bam files (bamQC)
• GATK refinement: realignment, recalibration, mark-duplication; dbsnp / indel resources
• SNP/INDEL vcf files; best practice filter
• somatic mutation
• structural variant analysis: translocation, inversion, copy number variants
• gene-level analysis: gene coding changes, gene associates
RNAseq
• fastq files (FastQC)
• tophat alignment
• Bam files (SeQC)
• Refinement
• cufflinks annotations
• cuffmerge
• cuffdiff comparisons
• gene quantification, gene list, cluster
• gene identification, novel gene discovery
• gene-fusion analysis
• functional/pathway
What do you expect to find in NGS data?
DNAseq
• SNPs
• Somatic Mutations
• Small Indels
• Large Structural Change
• CNV
RNAseq
• Gene expression difference
• Splicing Variants
• Fusion Genes
What don’t you expect to find in NGS data?
Exome sequencing reads
• Is mapped? No: Unmapped DNA reads
  • Virus/Microbe DNA
  • Contamination
• Is mapped? Yes: Mapped reads
  • Is targeted? Yes: Targeted DNA
  • Is targeted? No: Untargeted DNA
    • Intronic DNA
    • Intergenic DNA
    • Mitochondrial DNA
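The mapped/targeted decision on this slide can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the talk: the function name `classify_read`, the `targets` dictionary, and the choice to test only the read's start position are all assumptions made for the example. The one fixed fact it relies on is that bit 0x4 of the SAM flag marks an unmapped read.

```python
from bisect import bisect_right

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


def classify_read(flag, chrom, pos, targets):
    """Classify one aligned read as 'unmapped', 'targeted' or 'untargeted'.

    flag    -- integer SAM flag of the read
    chrom   -- reference name the read aligned to
    pos     -- 1-based leftmost aligned position
    targets -- hypothetical capture design: {chrom: [(start, end), ...]},
               intervals sorted, non-overlapping, 1-based inclusive
    Simplification: only the start position of the read is tested.
    """
    if flag & UNMAPPED_FLAG:
        return "unmapped"
    intervals = targets.get(chrom, [])
    starts = [start for start, _ in intervals]
    # Index of the last interval whose start is <= pos (or -1 if none).
    i = bisect_right(starts, pos) - 1
    if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
        return "targeted"
    return "untargeted"


# Toy capture design with two target intervals on chr1.
example_targets = {"chr1": [(100, 200), (500, 600)]}
print(classify_read(0, "chr1", 150, example_targets))   # inside a target
print(classify_read(0, "chrM", 1000, example_targets))  # mapped, off target
print(classify_read(4, "*", 0, example_targets))        # unmapped
```

Reads that come back as "untargeted" are the intronic, intergenic and mitochondrial reads discussed next; "unmapped" reads are the candidates for virus, microbe or contamination screening.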
Exome Capture TruSeq sa306744
Why do we care about intron and intergenic regions?
• Some introns can encode specific proteins and can be processed after splicing to form noncoding RNA molecules (Rearick, Prakash et al. 2011).
• The majority of GWAS catalog SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic).
• The ENCODE Project: ENCyclopedia Of DNA Elements
GWAS catalog SNPs

Kit               Target total bases  Missing exon SNPs  Missing intron SNPs  Missing intergenic SNPs
SureSelect (v2)   37627747            387                3946                 3323
TruSeq            62085286            206                3980                 3320
SeqCap EZ (v3.0)  64190747            326                3880                 3317
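To put the missing counts in perspective, they can be subtracted from the GWAS catalog totals quoted earlier (706 exon, 3986 intron, 3323 intergenic). The short Python sketch below does that arithmetic. It assumes that "missing" means the SNP falls outside the kit's target regions, and the names `TOTAL`, `MISSING` and `covered` are made up for this example.

```python
# GWAS catalog SNP totals quoted on the earlier slide.
TOTAL = {"exon": 706, "intron": 3986, "intergenic": 3323}

# SNPs missing from each kit, copied from the table above.
MISSING = {
    "SureSelect (v2)": {"exon": 387, "intron": 3946, "intergenic": 3323},
    "TruSeq": {"exon": 206, "intron": 3980, "intergenic": 3320},
    "SeqCap EZ (v3.0)": {"exon": 326, "intron": 3880, "intergenic": 3317},
}


def covered(kit):
    """Number of GWAS SNPs per region that are NOT missing from the kit."""
    return {region: TOTAL[region] - MISSING[kit][region] for region in TOTAL}


for kit in MISSING:
    counts = covered(kit)
    exon_pct = 100 * counts["exon"] / TOTAL["exon"]
    print(f"{kit}: {counts} ({exon_pct:.1f}% of exon SNPs covered)")
```

Under that reading, even the best kit here covers only a handful of the intronic and intergenic GWAS SNPs by design, which is why off-target reads are worth a second look.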
                                                                           Exonic
Samples          Average depth  Intronic  Splicing(1)  ncRNA(2)  Intergenic  Non-synonymous  Stopgain  Stoploss
Agilent (N=22)   ≥ 2            21741     48           9129      91480       1431            38        6
Agilent (N=22)   ≥ 5            7362      39           5794      44269       1142            29        5
Agilent (N=22)   ≥ 10           4766      37           4393      28673       892             19        4
1000G (N=6)      ≥ 2            4561      19           648       4658        491             10        1
1000G (N=6)      ≥ 5            2784      12           360       2815        337             6         1
1000G (N=6)      ≥ 10           1419      9            194       1624        233             5         1
Illumina (N=6)   ≥ 2            6114      0            985       9659        25              0         0
Illumina (N=6)   ≥ 5            2408      0            501       5344        0               0         0
Illumina (N=6)   ≥ 10           1058      0            327       3498        0               0         0

(1) Variant is within 2 bp of a splicing junction.
(2) Variant overlaps a transcript without coding annotation in the gene definition.
Mitochondria
• Mitochondria play an important role in cellular energy metabolism, free radical generation, and apoptosis (Andrews, Kubacka et al. 1999; Verma and Kumar 2007).
• Mitochondrial DNA (mtDNA) is a maternally inherited 16,569-bp closed-circle genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides.
• Mitochondrial dysfunction is an important cause of many neurological diseases (Fernandez-Vizarra, Bugiani et al. 2007) and drug toxicities (Lemasters, Qian et al. 1999; Wallace and Starkov 2000) and may contribute to carcinogenesis and tumor progression (Modica-Napolitano and Singh 2004; Chen 2012).
Mitochondria Extraction Strategy
Results
EXAMPLE
Virus
• Known oncogenic viruses are estimated to cause 15 to 20 percent of all cancers in humans (Parkin 2006).
• Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.
• Viral genomes have been detected using off-target exome sequencing reads (Barzon, Lavezzo et al. 2011; Li and Delwart 2011; Chevaliez, Rodriguez et al. 2012; Radford, Chapman et al. 2012; Capobianchi, Giombini et al. 2013).
One example using HNSCC
Virus Detection in HNSCC in TCGA
Site           clin_hpv_ish  clin_hpv_p16  ExomeSeq  low_pass  RNAseq  HPV
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Buccal Mucosa  0             0             0         0         0       0
Oropharynx     0             0             1         0         0       1
Oropharynx     0             0             0         0         0       0
Oropharynx     0             0             0         0         0       0
Tonsil         1             1             1         0         1       4
Tonsil         1             1             1         0         1       4
Tonsil         1             1             1         0         1       4
Tonsil         0             0             1         1         1       4
Tonsil         0             0             1         1         1       4
Tonsil         0             1             1         0         1       3
Tonsil         1             0             1         0         1       3
Tonsil         0             0             0         1         1       3
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         1       2
Tonsil         0             0             1         0         0       1
Existing Tools
• PathSeq (Kostic, Ojesina et al. 2011)
• VirusSeq (Chen, Yao et al. 2012)
• ViralFusionSeq (Li, Wan et al. 2013)
SNP and Somatic Mutation Identification using RNAseq Data
• Traditionally, somatic mutations are detected using Sanger sequencing or RT-PCR by comparing paired tumor and normal samples. One obvious limitation of such methods is that we have to limit our search to a certain genomic region of interest.
• Now that next-generation sequencing has matured, we can screen all coding genes or even the whole genome for somatic mutations at a reasonable cost.
Why do we want to detect mutations in RNAseq data?
• You don’t have DNA sequencing data
• Detecting mutations was not the original goal, but why not
• There is much more RNAseq data than DNAseq data
• A mutation in RNA is more relevant than a mutation in DNA
Difficulties
• Not enough depth in non-expressed genes to detect mutations
• Reverse transcription of RNA to cDNA introduces additional errors
• Hard to distinguish mutations from RNA editing
• In summary, somatic mutation detection using RNAseq data produces many more false positives.
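As a rough illustration of how these difficulties might be handled when screening RNAseq variant candidates, the Python sketch below flags calls with low depth and calls that look like A-to-I RNA editing (which is read as A>G, or T>C for transcripts on the opposite strand). It is not the caller described in this talk. The function name `flag_candidate` and the thresholds `min_depth=10` and `min_alt_reads=3` are arbitrary assumptions chosen for the example.

```python
# Substitutions that A-to-I RNA editing would produce in sequencing reads.
EDITING_LIKE = {("A", "G"), ("T", "C")}


def flag_candidate(ref, alt, depth, alt_reads, min_depth=10, min_alt_reads=3):
    """Return the reasons a candidate mutation looks unreliable.

    An empty list means the candidate passed these (hypothetical) checks.
    ref, alt  -- reference and alternate base
    depth     -- total reads covering the position
    alt_reads -- reads supporting the alternate base
    """
    reasons = []
    if depth < min_depth:
        reasons.append("low_depth")
    if alt_reads < min_alt_reads:
        reasons.append("few_alt_reads")
    if (ref.upper(), alt.upper()) in EDITING_LIKE:
        reasons.append("possible_rna_editing")
    return reasons


print(flag_candidate("C", "T", depth=50, alt_reads=20))  # passes
print(flag_candidate("A", "G", depth=50, alt_reads=20))  # editing-like
print(flag_candidate("C", "T", depth=5, alt_reads=1))    # too shallow
```

A flag here is only a prompt for closer inspection, for example against matched DNA data where available, rather than proof that a call is false.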
Somatic Mutation Caller Designed Specifically for RNAseq Data
Other Ways you can mine your data
Summary
• Get your priorities right: never design a study just for secondary analysis targets
• If you have old data, think about what else you can do with it and try to realize its full potential
• At VANGARD, we help you with your basic genomic data analysis needs
• Advanced data analysis can be done through collaboration
Acknowledgement
• Yu Shyr
• Tiger Sheng
• Chung-I Li
• Jiang Li
• Mike Guo
• David Samuels
• Chun Li