Next-generation sequencing format and visualization with ngs.plot

Data formats and visualization in next-generation sequencing analysis. Li Shen, Asst. Prof., Neuro core, Sep 2013

Lecture given at the Department of Neuroscience, Icahn School of Medicine at Mount Sinai. ngs.plot has been published in BMC Genomics. Link: http://www.biomedcentral.com/1471-2164/15/284

Transcript of Next-generation sequencing format and visualization with ngs.plot

Page 1: Next-generation sequencing format and visualization with ngs.plot

Data formats and visualization in next-generation sequencing analysis

Li Shen, Asst. Prof., Neuro core, Sep 2013

Page 2: Next-generation sequencing format and visualization with ngs.plot

Introduction to the Shen lab

Lab location: Icahn 10-20 office suite

Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS

http://neuroscience.mssm.edu/shen/index.html

Page 3: Next-generation sequencing format and visualization with ngs.plot

DNA sequencing overview

[Figure: a primer bound to the template sequence (5’ to 3’) is extended by DNA polymerase/ligase, which adds A, C, G, or T to the extending sequence]

1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?

Sanger sequencing
Pyrosequencing
Solexa sequencing
SOLiD sequencing
Ion Torrent sequencing
SMRT sequencing
…and many others

Page 4: Next-generation sequencing format and visualization with ngs.plot

What is “next-generation” sequencing?

-- First-generation sequencers --

Sanger sequencer: 384 samples per single batch

-- Next-generation sequencers (massively parallel) --

Illumina, SOLiD sequencers: billions per single batch, a ~3 million-fold increase in throughput!

Page 5: Next-generation sequencing format and visualization with ngs.plot

What are “short” reads?

http://www.edgebio.com/blog_old/uploads/2011/06/1.png

http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg

[Figure: quality score versus read position]

Limit of read length:
Illumina: 50-250 bp
SOLiD: 35-50 bp
454 pyrosequencing: 700 bp
Sanger: 900 bp

Page 6: Next-generation sequencing format and visualization with ngs.plot

Illumina sequencing terminology

Chip, slide, flow cell…

HiSeq 2500

DNA fragment

Page 7: Next-generation sequencing format and visualization with ngs.plot

Information flow of sequencing data

Image analysis -> fastq -> SAM/BAM -> coverage

Example SAM records:

HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA @@?DDFFDBHFFGJIIGIGGGGGIJGHHIHIIGEGIIIIIJJJIIJIGGGG XA:i:0 MD:Z:51 NM:i:0

Page 8: Next-generation sequencing format and visualization with ngs.plot

FASTQ: raw sequence format

Page 9: Next-generation sequencing format and visualization with ngs.plot

What is FASTQ?

• Text-based format for storing both biological sequences and corresponding quality scores.

• FASTQ = FASTA + QUALITY

• A FASTQ file uses four lines per sequence.

Line 1: @SEQ_ID
Line 2: GATTTGGGGTTCAAAGCAGTATCGATCAAA
Line 3: +SEQ_ID (optional)
Line 4: !''*((((***+))%%%++)(%%%%).1**
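The four-lines-per-sequence layout can be read with a few lines of code. The following is a minimal sketch, not a production parser: the names FastqRecord and read_fastq are made up for illustration, and it assumes each sequence and quality string sits on exactly one line.

```python
from io import StringIO
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes no line wrapping)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The example record from the slide, with the optional ID on line 3 omitted.
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(StringIO(example)))
```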

Page 10: Next-generation sequencing format and visualization with ngs.plot

Illumina sequence identifiers

@SEQ_ID example: @SOLEXA-DELL:6:1:8:1376#0/1

SOLEXA-DELL: instrument name
6: lane
1: tile
8: X-coordinate
1376: Y-coordinate
#0: index number
/1: paired read
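As an illustration, an identifier of this older Illumina style can be split into its parts with a regular expression. This is a sketch under the assumption that the header follows exactly the instrument:lane:tile:x:y#index/pair layout shown on the slide; newer Illumina headers use a different layout and would not match. IlluminaId and parse_illumina_id are hypothetical names.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: int


# Assumed layout: @instrument:lane:tile:x:y#index/pair
ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(header: str) -> IlluminaId:
    """Split a sequence identifier into named fields."""
    match = ID_PATTERN.match(header)
    if match is None:
        raise ValueError(f"unrecognized identifier: {header!r}")
    return IlluminaId(
        instrument=match["instrument"],
        lane=int(match["lane"]),
        tile=int(match["tile"]),
        x=int(match["x"]),
        y=int(match["y"]),
        index=match["index"],
        pair=int(match["pair"]),
    )


parsed = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
```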

Page 11: Next-generation sequencing format and visualization with ngs.plot

Quality score calculation

+SEQ_ID
!''*((((***+))%%%++)(%%%%).1** ?

A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect.

1. Sanger (Phred): Q = -10 log10(p)
2. Solexa: Q = -10 log10(p / (1 - p))

Figures from Wikipedia
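The two commonly cited definitions are the Sanger (Phred) score, Q = -10 log10(p), and the Solexa score, Q = -10 log10(p / (1 - p)). A small sketch of both, plus the inverse of the Phred formula; the function names are chosen here for illustration only.

```python
import math


def phred_quality(p: float) -> float:
    """Sanger/Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Solexa score: based on the odds p / (1 - p) instead of p (0 < p < 1)."""
    return -10 * math.log10(p / (1 - p))


def phred_error_probability(q: float) -> float:
    """Invert the Phred formula: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

The two scores nearly coincide for small p (high quality) and differ mainly for low-quality bases, where the Solexa score can become negative.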

Page 12: Next-generation sequencing format and visualization with ngs.plot

Quality score interpretation

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%

Materials from Wikipedia

Page 13: Next-generation sequencing format and visualization with ngs.plot

Quality score encoding

• Formula: score + offset = ASCII code of the symbol written to the file

• Two variants: offset = 64 (Illumina 1.0 to before 1.8); offset = 33 (Sanger, Illumina 1.8+).

• A quality score is typically in the range [0, 40].

(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

Figures from Wikipedia
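The "score + offset" rule can be sketched as a pair of helper functions. The names decode_quality and encode_quality are illustrative; the offset must be supplied by the caller (33 or 64), since this sketch does not try to guess the encoding of a file.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into integer scores: score = ASCII code - offset."""
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Turn integer scores into symbols: ASCII code = score + offset."""
    return "".join(chr(score + offset) for score in scores)


# '!' and 'I' are the first and last symbols of the offset-33 alphabet,
# '@' and 'h' the first and last symbols of the offset-64 alphabet.
sanger_scores = decode_quality("!I", offset=33)
illumina_scores = decode_quality("@h", offset=64)
```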

Page 14: Next-generation sequencing format and visualization with ngs.plot

What can you do with FASTQ files?

• Quality control: quality score distribution, GC content, k-mer enrichment, etc.

• Preprocessing: adapter removal, low-quality reads filtering, etc.

[Figure: the example read GATTTGGGGTTCAAAGCAGTATCGATCAAA with quality string !''*((((***+))%%%++)(%%%%).1**, annotated with mean quality, GC content, k-mer enrichment, and a possible adapter (miRNA); per-read quality plots]
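A few of the quality-control measures named above can be computed directly from a read and its quality string. This is a minimal sketch with illustrative function names; it assumes offset-33 quality encoding and is not a substitute for a dedicated QC tool.

```python
from collections import Counter


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average quality score of one read (assumes a non-empty string)."""
    return sum(ord(symbol) - offset for symbol in quality) / len(quality)


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer; over-represented k-mers hint at bias."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


read = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
read_gc = gc_content(read)
read_kmers = kmer_counts(read, 3)
```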

Page 15: Next-generation sequencing format and visualization with ngs.plot

SAM/BAM: alignment format

Page 16: Next-generation sequencing format and visualization with ngs.plot

Short read alignment

• Many choices: BWA, Bowtie, MAQ, SOAP, STAR, TopHat, etc.

[Figure: the genomic reference sequence is built into an index; FASTQ files are aligned against the index to produce alignments]

Page 17: Next-generation sequencing format and visualization with ngs.plot

Alignment format

[Figure: aligners Bowtie, ELAND, BWA, Soap, Maq and SHRiMP all converge on one common output format: SAM]

Page 18: Next-generation sequencing format and visualization with ngs.plot

The SAM format

1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality

[Figure: a short read aligned to the reference sequence, showing a mismatch and indels (insertion, deletion)]
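The slide lists a simplified set of fields. In the SAM specification an alignment line has eleven mandatory tab-separated columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. A sketch of a line parser and a CIGAR parser follows; SamRecord, parse_sam_line, parse_cigar and reference_span are names invented for this example.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int  # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated alignment line into its columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """'51M' -> [(51, 'M')]; '*' means the CIGAR is unavailable."""
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


# One of the example records from the information-flow slide, rebuilt with tabs.
line = "\t".join([
    "HISEQ2:197:D08GUACXX:6:1105:9303:81340", "0", "chr10", "3000301", "255",
    "51M", "*", "0", "0",
    "GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA",
    "BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII",
    "XA:i:0", "MD:Z:51", "NM:i:0",
])
record = parse_sam_line(line)
```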

Page 19: Next-generation sequencing format and visualization with ngs.plot

The SAM specification

https://github.com/samtools/hts-specs

An example line:

MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0

N = hundreds of millions
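The second column of the example line (97) is the bitwise FLAG, and the trailing columns are optional TAG:TYPE:VALUE fields. A hedged sketch of decoding both, using the bit meanings from the SAM specification; the function names and the wording of the descriptions are chosen here for illustration.

```python
from typing import List, Tuple, Union

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_tag(tag: str) -> Tuple[str, Union[int, float, str]]:
    """Split TAG:TYPE:VALUE; integers (i) and floats (f) are converted."""
    name, type_code, value = tag.split(":", 2)
    if type_code == "i":
        return name, int(value)
    if type_code == "f":
        return name, float(value)
    return name, value


example_flag = decode_flag(97)  # 97 = 64 + 32 + 1
```

Under this reading, the example alignment is the first read of a pair whose mate aligns to the reverse strand.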

Page 20: Next-generation sequencing format and visualization with ngs.plot

BAM: the binary version of SAM

• SAM files are large: 1M short reads => 200 MB; 100M short reads => 20 GB.

• Compression makes sense.

• BAM: Binary SAM; compressed using the gzip library.

• Two parts: compressed data + index.

• Index: random access (visualization, analysis, etc.)

Page 21: Next-generation sequencing format and visualization with ngs.plot

Layout of binary BAM file

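[Figure: short read alignments (hundreds of millions of them) stored in consecutive gzip blocks along the chromosome]

Query: q = chr:X–Y

Without an index, answering q means scanning the whole file. Time: O(n), n = number of alignments.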

Page 22: Next-generation sequencing format and visualization with ngs.plot

A naïve approach

[Figure: every chromosome position points to the gzip block that holds its alignments]

One index entry per base pair? Wait, human chr1 is about 250 Mb long!

Page 23: Next-generation sequencing format and visualization with ngs.plot

A binning strategy

Assume all alignments are sorted according to genomic coordinates.

E.g.: bin = 16 Kb each, ~10,000 index entries per chromosome

[Figure: chromosome bins pointing into the gzip blocks. Complications: a long alignment, or a spliced RNA alignment, can span more than one bin]
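A toy version of the fixed-size binning idea is sketched below. It is not the BAM index itself: alignments are represented as hypothetical (start, end, file_offset) tuples with 0-based, half-open coordinates, and the index simply remembers, for each 16 Kb bin, the smallest file offset of any alignment overlapping it. An alignment that spans several bins (a long or spliced read) is registered in each of them.

```python
from typing import Dict, Iterable, Optional, Tuple

BIN_SIZE = 16 * 1024  # 16 Kb bins, as on the slide


def bins_overlapping(start: int, end: int, bin_size: int = BIN_SIZE) -> range:
    """Bin numbers touched by the 0-based, half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def build_bin_index(alignments: Iterable[Tuple[int, int, int]],
                    bin_size: int = BIN_SIZE) -> Dict[int, int]:
    """Map each bin to the smallest file offset of an overlapping alignment."""
    index: Dict[int, int] = {}
    for start, end, offset in alignments:
        for b in bins_overlapping(start, end, bin_size):
            index[b] = min(index.get(b, offset), offset)
    return index


def first_offset_for_query(index: Dict[int, int], start: int, end: int,
                           bin_size: int = BIN_SIZE) -> Optional[int]:
    """Where to start reading for a query, or None if no bin has data."""
    offsets = [index[b] for b in bins_overlapping(start, end, bin_size)
               if b in index]
    return min(offsets) if offsets else None


# Three made-up alignments; the second one straddles the bin 0 / bin 1 border.
toy_alignments = [(100, 151, 0), (16000, 16500, 64), (40000, 40051, 128)]
toy_index = build_bin_index(toy_alignments)
```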

Page 24: Next-generation sequencing format and visualization with ngs.plot

Hierarchical binning and linear index

Binning:

Level 0: bin 0, 512 Mb
Level 1: bins 1-8, 64 Mb each
. . .
Level 5: 16 Kb bins

Linear index:

16 Kb tiling windows: file offset of the left-most alignment that overlaps the window
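The SAM specification gives a short routine, reg2bin, that assigns an interval to the smallest bin that fully contains it in this six-level hierarchy. Below is a Python port of that routine as a sketch, together with the 16 Kb window number used by the linear index; coordinates are 0-based and half-open, and linear_window is a name chosen here.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin containing [beg, end), following the SAM spec routine."""
    end -= 1
    if beg >> 14 == end >> 14:  # level 5: 16 Kb bins
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # level 4: 128 Kb bins
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # level 3: 1 Mb bins
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # level 2: 8 Mb bins
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # level 1: 64 Mb bins
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0  # level 0: the single 512 Mb bin


def linear_window(pos: int) -> int:
    """Number of the 16 Kb tiling window that contains a 0-based position."""
    return pos >> 14
```

A short read usually lands in a level-5 bin; an alignment that crosses a 16 Kb boundary is pushed up to a larger bin, which is why the linear index is needed to skip alignments in big bins that end before the query starts.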

Page 25: Next-generation sequencing format and visualization with ngs.plot

A hypothetical example

[Figure: bin 0 spans the whole region and is subdivided into bins 1, 2, 3 and 4; alignments a to h are drawn below their bins, with a query q]

bin 0: f, g, h
bin 1: a
bin 2: b
bin 3: c, d
bin 4: e

Query q:
1. bins(q): [0, 3];
2. Candidate alignments: f->h->c->d->g;
3. LinearIndex(3): start(h) => larger than end(f);
4. Remove f without reading;
5. Read h, c, d;
6. start(d) larger than boundary(q);
7. Stop: without reading g.

Done: saved TWO disk seeks!

Page 26: Next-generation sequencing format and visualization with ngs.plot

WIGGLE: coverage format

Page 27: Next-generation sequencing format and visualization with ngs.plot

From alignment to read depth

• Read depth: the number of times a base-pair is covered by aligned short reads.

• Can be normalized: depth / library size * 1E6 = read depth per million aligned reads.

• Many tools to use: samtools depth, bedtools, and so on.

Example:

[Figure: alignments stacked over the reference, with read depths 1, 2, 3 and 4 marked]
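Read depth and its normalization can be sketched in a few lines. The version below assumes alignments are given as (start, end) pairs in 0-based, half-open coordinates on a single reference of known length; the function names are illustrative, and real data would normally go through a tool such as samtools depth or bedtools.

```python
from typing import Iterable, List, Tuple


def read_depth(alignments: Iterable[Tuple[int, int]],
               region_length: int) -> List[int]:
    """Per-base depth over [0, region_length) using a difference array."""
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1  # coverage begins here
            diff[end] -= 1    # and stops just before `end`
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def depth_per_million(depth: List[int], library_size: int) -> List[float]:
    """Normalize: depth / library size * 1E6."""
    return [d / library_size * 1e6 for d in depth]


# Three overlapping toy alignments on an 8 bp reference.
toy_depth = read_depth([(0, 4), (2, 6), (3, 5)], region_length=8)
```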

Page 28: Next-generation sequencing format and visualization with ngs.plot

Describing depth: the Wiggle format

• Line-oriented text file, two options: variable step and fixed step.

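variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3

[Figure: on chr1, value 1 covers 2 bases starting at position 100, value 2 covers 3 bases starting at 1000, and value 3 covers 4 bases starting at 10000]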
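A variable-step track can be generated from a list of runs. This is a sketch with a hypothetical helper, to_variable_step, that takes (start, span, value) tuples with 1-based start positions and writes a new declaration line whenever the span changes.

```python
from itertools import groupby
from typing import Iterable, Tuple


def to_variable_step(chrom: str,
                     runs: Iterable[Tuple[int, int, float]]) -> str:
    """Format (start, span, value) runs as variableStep wiggle text."""
    lines = []
    # Consecutive runs that share a span can share one declaration line.
    for span, group in groupby(runs, key=lambda run: run[1]):
        lines.append(f"variableStep chrom={chrom} span={span}")
        for start, _, value in group:
            lines.append(f"{start} {value:g}")
    return "\n".join(lines)


# The three runs from the slide: value 1 over 2 bp, 2 over 3 bp, 3 over 4 bp.
slide_track = to_variable_step(
    "chr1", [(100, 2, 1), (1000, 3, 2), (10000, 4, 3)]
)
```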

Page 29: Next-generation sequencing format and visualization with ngs.plot

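Wiggle: fixed step

fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3

[Figure: on chr1, value 1 covers 3 bases starting at position 100, value 2 covers 3 bases starting at 200, and value 3 covers 3 bases starting at 300]

General recipe: tile the reference with windows of width w, write one declaration line, then dump your data below it:

fixedStep chrom=chr? start=??? step=w span=w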
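The "tile with windows of width w, then dump your data" recipe can be sketched as follows. The helper to_fixed_step is hypothetical: it averages a per-base depth list in consecutive windows and writes one value per window under a single fixedStep declaration. It assumes the depth list starts at the given 1-based position, and the last window may be shorter than w.

```python
from typing import List


def to_fixed_step(chrom: str, depth: List[float], window: int,
                  start: int = 1) -> str:
    """Average per-base depth in windows and format as fixedStep wiggle."""
    lines = [
        f"fixedStep chrom={chrom} start={start} step={window} span={window}"
    ]
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]  # the final chunk may be shorter
        lines.append(f"{sum(chunk) / len(chunk):g}")
    return "\n".join(lines)


toy_track = to_fixed_step("chr1", [1, 1, 3, 3, 2, 4], window=2)
```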

Page 30: Next-generation sequencing format and visualization with ngs.plot

If you have very large wiggle files…

• Wiggle files can be huge: average per 10 bp window => 300M elements for the human genome.

• Makes sense to compress and index.

[Figure: wiggle data stored in gzip blocks]

Page 31: Next-generation sequencing format and visualization with ngs.plot

Genome browser

UCSC genome browser
Pros: very comprehensive
Cons: data have to be transmitted via network

vs.

Locally installed genome browser
Pros: locally installed
Cons: less genome annotation

Page 32: Next-generation sequencing format and visualization with ngs.plot

Genome browsers: lots of options

Wiki: 34 in total and that is not all!

Page 33: Next-generation sequencing format and visualization with ngs.plot

DEMO: GENOME BROWSER

Alignment, BAM, Wiggle, Peak calling, BED…

Page 34: Next-generation sequencing format and visualization with ngs.plot

NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA

Page 35: Next-generation sequencing format and visualization with ngs.plot

A genome is a huge collection of functional elements

• TSS: transcriptional start site

• TES: transcriptional end site

• Exon: mRNA components

• CpG island: has roles in gene regulation and evolution

• Enhancer: activate genes

• DNase hypersensitive site: where TFs bind

• And more…

Images from Google image search

Page 36: Next-generation sequencing format and visualization with ngs.plot

Categorizing functional elements

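Categories: TSS, TES, Enhancer, CpG island, Exon

[Figure: genome browser (GB) views of H3K4me3 at TSS1, TSS2, TSS3, TSS4, TSS5, ...; the regions are collected in a table and summarized as an average profile and a heatmap]

Chrom   Start   End
chr1    100     101
chr2    200     201
...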

Page 37: Next-generation sequencing format and visualization with ngs.plot

Step 1: choose a region of interest

Where to download?

Which database to use?

What kind of formats do they use?

0-based coordinates?

1-based coordinates?

Subset regions by function?

Page 38: Next-generation sequencing format and visualization with ngs.plot

ngs.plot collects lots of genome annotations

Variable    Count   Description
Database    3       RefSeq, Ensembl and ENCODE
Genome      9       dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, Tair10, Zv9
Biotype     7       tss, tes, genebody, cgi, dhs, enhancer, exon
Gene type   4       protein_coding, lincRNA, miRNA, pseudogene
Cell line   9       Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf

Total functional elements: 15,944,952

Page 39: Next-generation sequencing format and visualization with ngs.plot

Step 2: plot something at this region

[Figure: H3K27me3 plots annotated with plotting features: SEM bar, smoothing, shade, flanking region, robust estimation]

Gene ranking algo.: total, hc, none, diff, prod, pca, max

Page 40: Next-generation sequencing format and visualization with ngs.plot

ngs.plot: a global visualization tool for NGS data

• Written in R, easy-to-use command line program.

ngs.plot.r -G genome -R tss -C chipseq.bam -O output

Page 41: Next-generation sequencing format and visualization with ngs.plot

Testing biological hypotheses with NGS data

Ian Maze, Allis lab, Rockefeller

[Figure: nucleosomes carrying H3 Var A and H3 Var B, profiled by ChIP-seq]

Bioinformatician: understand questions, transform -> analytics

Time spent: Super, Not bad, Normal…

Page 42: Next-generation sequencing format and visualization with ngs.plot

Visualization the ngs.plot way

[Figure: heatmap panels A, B, H3 and RNA-seq]

Config file (config.txt):

A.bam -1 "A"
B.bam -1 "B"
H3.bam -1 "H3"

ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX

Export gene order list: go.txt

ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY

Page 43: Next-generation sequencing format and visualization with ngs.plot

ngs.plot is also available on Galaxy!

URL: https://ineuron.mssm.edu/galaxy

Page 44: Next-generation sequencing format and visualization with ngs.plot

DEMO: NGS.PLOT

Global visualization made easy…