Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput...
![Page 1: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/1.jpg)
Introduction To Next Generation Sequencing (NGS) Data Analysis
Jenny Wu
UCI Genomics High Throughput Facility
![Page 2: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/2.jpg)
Outline
• Goals: Practical guide to NGS data processing
• Bioinformatics in NGS data analysis
  – Basics: terminology, data file formats, general workflow
  – Data Analysis Pipeline
    • Sequence QC and preprocessing
    • Obtaining and preparing reference
    • Sequence mapping
    • Downstream analysis workflow and software
• Example: RNA-Seq analysis with Tuxedo protocol
• Summary and future plan
![Page 3: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/3.jpg)
Why Next Generation Sequencing
One can sequence hundreds of millions of short sequences (35bp-120bp) in a single run in a short period of time with low per base cost.
• Illumina/Solexa GA II / HiSeq 2000, 2500
• Life Technologies/Applied Biosystems SOLiD
• Roche/454 FLX, Titanium
Reviews:
Michael Metzker (2010) Nature Reviews Genetics 11:31
Quail et al (2012) BMC Genomics Jul 24;13:341.
![Page 4: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/4.jpg)
Why Bioinformatics
(wall.hms.harvard.edu)
Informatics
![Page 5: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/5.jpg)
Bioinformatics Challenges in NGS Data Analysis
• VERY large text files (tens of millions of lines long)
  – Can’t do ‘business as usual’ with familiar tools
  – Memory usage and execution time become prohibitive
  – Manage, analyze, store, transfer and archive huge files
• Need for powerful computers and expertise
  – Informatics groups must manage compute clusters
  – New algorithms and software are required, and they are often open source and Unix/Linux based
  – Collaboration of IT, bioinformaticians and biologists
![Page 6: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/6.jpg)
Basic NGS Workflow
![Page 7: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/7.jpg)
NGS Data Analysis Overview
Olson et al.
![Page 8: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/8.jpg)
![Page 9: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/9.jpg)
Terminology
• Coverage (depth): The number of read nucleotides that are mapped to a given position in the reference.
• Quality Score: Each called base comes with a quality score that encodes the estimated probability that the base call is wrong.
• Mapping: Aligning reads to a reference to identify where each read originated.
• Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
• Duplicate reads: Reads that are identical.
• Multi-reads: Reads that can be mapped to multiple locations equally well.
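To make the quality score and coverage definitions concrete, here is a minimal Python sketch. It assumes the standard Phred scale, Q = -10 log10(P), and represents each mapped read as a hypothetical (start, length) pair. The function names are illustrative and do not come from any tool named in these slides.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the base-call error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score, Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def coverage(ref_length, alignments):
    """Per-position depth for reads given as (0-based start, read length) pairs.

    This toy version ignores gaps, clipping and strand; it only counts how
    many reads overlap each reference position.
    """
    depth = [0] * ref_length
    for start, length in alignments:
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth


# Three hypothetical reads on a 10 bp reference.
print(coverage(10, [(0, 5), (3, 5), (4, 4)]))  # [1, 1, 1, 2, 3, 2, 2, 2, 0, 0]
```

On this scale a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.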
![Page 10: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/10.jpg)
What does the data look like?
Common NGS Data Formats
![Page 11: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/11.jpg)
FASTA Format (Reference Seq)
![Page 12: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/12.jpg)
FASTQ Format (reads)
![Page 13: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/13.jpg)
FASTQ Format (Illumina Example)
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
Each read record has four lines:
1. Header
2. Read bases
3. Separator (with optional repeated header)
4. Read quality scores
The header includes the flow cell ID, lane, tile, tile coordinates, and barcode.
NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
(Passarelli, 2012)
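The four-line record layout can be read with a few lines of Python. This is a sketch under stated assumptions, not a replacement for an established parser: it assumes exactly four lines per record, the colon-separated Illumina header layout (instrument, run, flow cell, lane, tile, x, y, then read number, filter flag, control bits, barcode), and Phred+33 quality encoding. The record in the example is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, bases, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(bases) != len(quals):
            raise ValueError("bases and qualities differ in length: " + header)
        yield header[1:], bases, quals


def parse_illumina_header(header):
    """Split an Illumina-style header into named fields (assumed layout)."""
    location, _, flags = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read_number, is_filtered, control, barcode = flags.split(":")
    return {
        "flowcell": flowcell, "lane": lane, "tile": tile,
        "x": x, "y": y, "read": read_number, "barcode": barcode,
    }


def decode_qualities(quals, offset=33):
    """Turn a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quals]


# Hypothetical record, not taken from the slide.
example = "@INSTR:1:FLOWCELL:2:1101:100:200 1:N:0:ACGT\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
fields = parse_illumina_header(records[0][0])
print(fields["lane"], fields["tile"], decode_qualities(records[0][2]))
```

For paired-end data the same reader can be run over both files in step, since the headers and reads correspond one to one.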
![Page 14: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/14.jpg)
![Page 15: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/15.jpg)
Data Analysis Pipeline
(Legend: Data, Task, File Format, Software)
1. Raw reads (FASTQ)
2. Read QC and preprocessing: FastQC, FASTX-toolkit, PRINSEQ
3. Analysis-ready reads (FASTQ)
4. Collecting reference sequences and annotation (FASTA, GTF/GFF)
5. Read mapping: Bowtie, BWA, MAQ
6. Mapped reads (SAM/BAM)
   • Visualization (IGV, UCSC GB)
   • Local realignment, base quality recalibration
7. Downstream analysis
   • Whole Genome Sequencing: variant calling, annotation
   • RNA-Seq: transcript assembly, quantification
   • ChIP-Seq: peak calling
   • Methyl-Seq: methylation calling
   • ……
![Page 16: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/16.jpg)
Why QC?
Sequencing runs cost money
• Consequences of not assessing the data
• Sequencing a poor library on multiple runs: throwing money away!
Data analysis costs money and time
• Cost of analyzing data, CPU time $$
• Cost of storing raw sequence data $$$
• Hours of analysis could be wasted $$$$
• Downstream analysis can be incorrect.
![Page 17: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/17.jpg)
How to QC?
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC)
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
$ fastqc s_1_1.fastq
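One of the checks a QC report typically includes is base quality by read position, and a common preprocessing step is trimming low-quality bases from the 3' end. The sketch below illustrates both ideas in plain Python. It is not how FastQC or the FASTX-toolkit are implemented; the function names, the Phred+33 offset and the threshold of 20 are assumptions chosen for the example.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across reads of any length."""
    totals, counts = [], []
    for quals in quality_strings:
        for i, ch in enumerate(quals):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def trim_low_quality_tail(bases, quals, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q."""
    end = len(quals)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


# Hypothetical quality strings: 'I' = Q40, '5' = Q20, '#' = Q2.
print(mean_quality_per_position(["II#", "5I"]))  # [30.0, 40.0, 2.0]
print(trim_low_quality_tail("ACGTN", "IIII#"))   # ('ACGT', 'IIII')
```

A steady drop in mean quality toward the end of the reads is the usual signal that tail trimming is worth doing before mapping.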
![Page 18: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/18.jpg)
![Page 19: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/19.jpg)
The UCSC Genome Browser Homepage
Get genome annotation here!
General information
Specific information—new features, current status, etc.
Get reference sequences here!
![Page 20: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/20.jpg)
Getting reference sequences
![Page 21: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/21.jpg)
Getting Reference Annotation
![Page 22: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/22.jpg)
![Page 23: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/23.jpg)
Sequence Mapping Challenges
• Alignment (mapping) is the first step once read sequences are obtained.
• The task: to align sequencing reads against a known reference.
• Difficulties: high volume of data, size of the reference genome, computation time, read length constraints, and ambiguity caused by repeats and sequencing errors.
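A toy example shows why repeats create ambiguity. The sketch below indexes every k-mer of a made-up reference in a hash table and reports all exact matches of a read. Production aligners such as Bowtie and BWA use a compressed Burrows-Wheeler index and tolerate mismatches, so this is a deliberately simplified stand-in; the names and the reference string are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return all 0-based positions where the read matches the reference exactly."""
    hits = []
    # Use the first k bases as a seed, then verify the full read at each candidate.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTTGA"  # contains the repeat ACGT at positions 0 and 4
index = build_index(reference, 4)

print(map_read("ACGT", reference, index, 4))    # [0, 4]  a multi-read
print(map_read("GTTTGA", reference, index, 4))  # [6]     a unique hit
print(map_read("ACGA", reference, index, 4))    # []      one mismatch, no exact hit
```

The short read falls inside a repeat and maps equally well to two places, while a single sequencing error is enough to lose an exact match, which is why real aligners must allow mismatches and score competing hits.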
![Page 24: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/24.jpg)
Short Read Alignment
Olson et al.
![Page 25: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/25.jpg)
Short Read Alignment Software
![Page 26: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/26.jpg)
Short Reads Mapping Software
![Page 27: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/27.jpg)
How to choose an aligner?
• There are many aligners and they vary a lot in performance (accuracy, memory usage, speed, etc).
• Factors to consider : application, platform, read length, downstream analysis, etc.
• Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie)
• Guaranteed high accuracy will take longer.
![Page 28: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/28.jpg)
![Page 29: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/29.jpg)
NGS Applications and Analysis Strategy

| Name | Nucleic acid population | Brief analysis strategy |
|---|---|---|
| RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to “genes”; variations for detecting splice junctions and quantifying abundance |
| Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRbase), then to the genome; quantify abundance |
| ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs |
| RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or “genes”, identify peaks and motifs |
| Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation |
| SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs |
| Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants |
| de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence |
| Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences |
(Hunicke-Smith et al, 2010)
![Page 30: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/30.jpg)
Application Specific Software
Mapped reads feed application-specific analysis:
• Whole Genome Sequencing, Exome Sequencing. Variant calling (SNPs, InDels): ssahaSNP, Samtools, PyroBayes
• RNA-Seq. Transcriptome analysis (1. transcriptome assembly, 2. abundance quantification, 3. differential expression and regulation): Tophat, STAR, Cufflinks, edgeR
• ChIP-Seq. Protein DNA binding sites (peak identification): MACS, AREM, PeakSeq
• Methyl-Seq. Methylation pattern analysis (methylation calling): Bismark, BS Seeker
• ……
![Page 31: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/31.jpg)
![Page 32: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/32.jpg)
RNA-seq (Tuxedo Protocol)
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
1. Read mapping
2. Transcript assembly and quantification
3. Merge assembled transcripts from multiple samples
4. Differential Expression analysis
File formats: SAM/BAM, GTF/GFF
![Page 33: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/33.jpg)
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
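The four commands above differ only in the sample name, so they can be generated rather than typed. The following Python sketch builds the same command lines from a list of sample names. It only prints the commands and does not run Tophat; the helper name `tophat_command` and the naming scheme `<sample>_1.fq`, `<sample>_2.fq`, `<sample>_thout` are assumptions read off the slide.

```python
import shlex


def tophat_command(sample, threads=8, gtf="genes.gtf", index="genome"):
    """Build the tophat argument list for one paired-end sample."""
    return [
        "tophat", "-p", str(threads), "-G", gtf,
        "-o", f"{sample}_thout",
        index, f"{sample}_1.fq", f"{sample}_2.fq",
    ]


# Two conditions (C1, C2) with two replicates each (R1, R2).
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2)]

for sample in samples:
    print(shlex.join(tophat_command(sample)))
```

Keeping each command as an argument list means the same list could later be handed to a job scheduler or to `subprocess.run` without re-parsing a string.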
![Page 34: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/34.jpg)
2. Transcript assembly and abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
![Page 35: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/35.jpg)
3. Final Transcriptome assembly: Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
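The manifest passed to Cuffmerge is just a text file with one assembled GTF path per line. A small Python sketch for producing its contents is shown below; the helper name `assembly_manifest` is hypothetical and the `<sample>_clout/transcripts.gtf` layout is assumed from the Cufflinks output directories used earlier.

```python
def assembly_manifest(samples):
    """Return the text of assemblies.txt: one transcripts.gtf path per sample."""
    lines = [f"./{sample}_clout/transcripts.gtf" for sample in samples]
    return "\n".join(lines) + "\n"


manifest = assembly_manifest(["C1_R1", "C1_R2", "C2_R1", "C2_R2"])
print(manifest, end="")

# To create the file itself, one could write:
#     with open("assemblies.txt", "w") as fh:
#         fh.write(manifest)
```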
![Page 36: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/36.jpg)
4. Differential Expression: Cuffdiff
Cuffdiff: a program that compares transcript abundance between samples.
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
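Cuffdiff takes one positional argument per condition: replicates of the same condition are joined with commas, and conditions are separated by spaces, in the same order as the labels given to `-L`. The Python sketch below assembles the argument list from a mapping of condition label to replicate names so that the grouping cannot drift out of step with the labels. The helper name and default paths are assumptions taken from the commands on these slides, and the code only prints the command.

```python
import shlex


def cuffdiff_command(conditions, merged_gtf="merged_asm/merged.gtf",
                     genome_fa="genome.fa", out_dir="diff_out", threads=8):
    """Build the cuffdiff argument list.

    conditions maps a condition label to its list of replicate sample names,
    e.g. {"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}.
    """
    labels = ",".join(conditions)
    # One comma-joined BAM group per condition, in label order.
    bam_groups = [
        ",".join(f"./{sample}_thout/accepted_hits.bam" for sample in replicates)
        for replicates in conditions.values()
    ]
    return [
        "cuffdiff", "-o", out_dir, "-b", genome_fa, "-p", str(threads),
        "-L", labels, "-u", merged_gtf, *bam_groups,
    ]


cmd = cuffdiff_command({"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]})
print(shlex.join(cmd))
```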
![Page 37: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/37.jpg)
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv
![Page 38: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/38.jpg)
Visualizing RNA-seq mapping with IGV
http://www.broadinstitute.org/igv/UserGuide
Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)
• Specify range or term in search box
• Click on ruler
• Click and drag
• Use scroll bar
• Use keyboard: Arrow keys, Page up, Page down, Home, End
![Page 39: Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility.](https://reader038.fdocuments.us/reader038/viewer/2022102707/56649ee45503460f94bf26b0/html5/thumbnails/39.jpg)
Summary
• NGS technologies are transforming molecular biology.
• Bioinformatics analysis is a crucial part of NGS applications
  – Data formats, terminology, general workflow
  – Analysis pipeline
  – Software for various NGS applications
• RNA-seq with Tuxedo suite
Thank you!