Next-generation sequencing format and visualization with ngs.plot

Data formats and visualization in next-generation sequencing analysis

Li Shen, Asst. Prof., Neuro core, Sep 2013

Introduction to the Shenlab

Lab location: Icahn 10-20 office suite

Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS

http://neuroscience.mssm.edu/shen/index.html


DNA sequencing overview

[Figure: a primer is extended 5’ to 3’ along the template sequence by DNA polymerase/ligase, adding A, C, G or T to the extending sequence]

1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?

Sanger sequencing, pyrosequencing, Solexa sequencing, SOLiD sequencing, Ion Torrent sequencing, SMRT sequencing… and many others

What is “next-generation” sequencing?

First-generation sequencers:

Sanger sequencer: 384 samples per single batch

Next-generation sequencers (massively parallel):

Illumina, SOLiD sequencers: billions of reads per single batch, a ~3 million-fold increase in throughput!

What are “short” reads?

http://www.edgebio.com/blog_old/uploads/2011/06/1.png

http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg

[Figure: quality score versus read position]

Limit of read length:

Illumina: 50-250bp
SOLiD: 35-50bp
454 pyro: 700bp
Sanger: 900bp


Illumina sequencing terminology

Chip, slide, flow cell…

HiSeq 2500

DNA fragment


Information flow of sequencing data

Image analysis -> fastq -> SAM/BAM -> coverage

Example SAM records:

HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA @@?DDFFDBHFFGJIIGIGGGGGIJGHHIHIIGEGIIIIIJJJIIJIGGGG XA:i:0 MD:Z:51 NM:i:0

FASTQ: Raw sequence format

What is FASTQ?

• Text-based format for storing both biological sequences and corresponding quality scores.

• FASTQ = FASTA + QUALITY

• A FASTQ file uses four lines per sequence.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
+SEQ_ID (optional)
!''*((((***+))%%%++)(%%%%).1**
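The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are made up for illustration and do not come from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Drop the leading "@" so only the identifier is kept.
        yield FastqRecord(header[1:], sequence, quality)


# The example record from the slide, held in memory instead of a file.
example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+SEQ_ID\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(example))
print(records[0].seq_id, len(records[0].sequence))
```

For a real file, the same function could be called on `open("reads.fastq")`, where the file name is a placeholder.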

Illumina sequence identifiers

@SOLEXA-DELL:6:1:8:1376#0/1

Instrument name: SOLEXA-DELL
Lane: 6
Tile: 1
X-coordinate: 8
Y-coordinate: 1376
Index number: #0
Paired read: /1
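As an illustration, the identifier fields can be pulled apart with a regular expression. This sketch assumes the older identifier layout shown on the slide (`instrument:lane:tile:x:y#index/pair`); newer Illumina software writes a different header, which this pattern does not cover. The function name `parse_illumina_id` is hypothetical.

```python
import re

# Pattern for the identifier layout shown on the slide only.
ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(line: str) -> dict:
    """Split an identifier line into named fields; numeric fields become int."""
    match = ILLUMINA_ID.match(line.strip())
    if match is None:
        raise ValueError(f"not an identifier in the expected layout: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields


info = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
print(info)
```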

Quality score calculation

+SEQ_ID
!''*((((***+))%%%++)(%%%%).1**   (what do these characters mean?)

A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect.

1. Sanger: $Q_{\text{sanger}} = -10 \log_{10} p$
2. Solexa: $Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$

Figures from Wikipedia

Quality score interpretation

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%

Materials from Wikipedia
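The table rows follow from the standard Phred definition, Q = -10 log10(p), so p = 10^(-Q/10). The sketch below recomputes them; the Solexa variant, Q = -10 log10(p / (1 - p)), is included for comparison. Function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred (Sanger) quality for error probability p."""
    return -10 * math.log10(p)


def error_prob_to_solexa(p: float) -> float:
    """Solexa quality, which uses the odds p / (1 - p) instead of p."""
    return -10 * math.log10(p / (1 - p))


# Recompute the interpretation table: quality, error probability, accuracy.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, p, f"{(1 - p) * 100:.3f}%")
```

The two scales agree closely for small p and diverge for low-quality bases, where p / (1 - p) is noticeably larger than p.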

Quality score encoding

• Formula: ASCII code = score + offset; the quality character is the ASCII symbol with that code.

• Two variants: offset=64 (Illumina 1.0 up to, but not including, 1.8); offset=33 (Sanger, Illumina 1.8+).

• A quality score is typically: [0, 40]

(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

Figures from Wikipedia
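The encoding rule (character code = score + offset) can be sketched in both directions as follows. The function names are illustrative, and the offset must be chosen by the caller; this sketch does not try to guess it from the data.

```python
def decode_quality(quality: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into integer scores."""
    scores = [ord(ch) - offset for ch in quality]
    if scores and min(scores) < 0:
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def encode_quality(scores, offset: int = 33) -> str:
    """Turn integer scores back into a quality string."""
    return "".join(chr(score + offset) for score in scores)


# The quality line used throughout the slides, decoded with offset 33.
scores = decode_quality("!''*((((***+))%%%++)(%%%%).1**")
print(scores)

# Scores 0..40 reproduce the two symbol rows shown above.
print(encode_quality(range(41), offset=33))
print(encode_quality(range(41), offset=64))
```

Decoding an offset-33 string with offset 64 produces negative scores, which is why the sketch raises an error in that case.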

What can you do with FASTQ files?

• Quality control: quality score distribution, GC content, k-mer enrichment, etc.

• Preprocessing: adapter removal, low-quality reads filtering, etc.

[Figure: the example read GATTTGGGGTTCAAAGCAGTATCGATCAAA with quality string !''*((((***+))%%%++)(%%%%).1**, annotated with mean quality, GC content, k-mer enrichment and a possible adapter (miRNA)]
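Two of the quality-control measures listed above, GC content and mean quality, are simple enough to sketch directly. The threshold of 20 in `passes_filter` is an arbitrary value chosen for this illustration, not a recommendation from the slides, and the offset of 33 is assumed.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average quality score of one read."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(quality: str, min_mean_q: float = 20.0, offset: int = 33) -> bool:
    """Keep a read only if its mean quality reaches the (assumed) threshold."""
    return mean_quality(quality, offset) >= min_mean_q


read_seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
read_qual = "!''*((((***+))%%%++)(%%%%).1**"
print(gc_content(read_seq), mean_quality(read_qual), passes_filter(read_qual))
```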

SAM/BAM: Alignment format

Short read alignment

• Many choices: BWA, Bowtie, Maq, SOAP, STAR, TopHat, etc.

[Figure: FASTQ files + index of the genomic reference sequence -> alignments]

Alignment format

Bowtie, ELAND, BWA, SOAP, Maq, SHRiMP -> SAM

The SAM format

[Figure: a short read aligned to the reference sequence, showing a mismatch and an indel (insertion, deletion)]

1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality

The SAM specification: https://github.com/samtools/hts-specs

An example line:

MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0

Number of lines per file: N = hundreds of millions

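To make the field layout concrete, here is a minimal sketch that splits the example line into the eleven mandatory SAM fields plus optional tags. Real SAM files are tab-delimited; the line is split on any whitespace here only because the slide text uses spaces. `SamRecord` and `parse_sam_line` are illustrative names, not part of samtools or any library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (assumption: fields separated by whitespace)."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        # Optional fields are TAG:TYPE:VALUE; convert the numeric types.
        if value_type == "i":
            tags[tag] = int(value)
        elif value_type == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


EXAMPLE = (
    "MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 "
    "ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT "
    "IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 "
    "AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ "
    "NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0"
)
record = parse_sam_line(EXAMPLE)
is_paired = bool(record.flag & 0x1)    # flag bit 0x1: read is paired
is_reverse = bool(record.flag & 0x10)  # flag bit 0x10: read on reverse strand
print(record.rname, record.pos, record.cigar, is_paired, is_reverse)
```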

BAM: the binary version of SAM

• SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB.

• Makes sense for compression.

• BAM: Binary sAM; compressed using the gzip library.

• Two parts: compressed data + index.

• Index: random access (visualization, analysis, etc.)

Layout of binary BAM file

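[Figure: hundreds of millions of short read alignments stored along each chromosome in consecutive gzip blocks]

Query: q = chr: X–Y

Without an index, answering q means scanning every alignment. Time: O(n), n = #alignments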

A naïve approach

[Figure: each base-pair of the chromosome points to the gzip block that holds its alignments]

One index per base-pair? Wait, the human chr1 is more than 200Mb long!

A binning strategy

Assume all alignments are sorted according to genomic coordinates.

[Figure: chromosome bins pointing into the gzip blocks; a long alignment (e.g. from RNA splicing) spans several bins]

E.g.: bin = 16Kb each, ~10,000 indices per chromosome

Hierarchical binning and linear index

Binning:

Level 0: bin 0, 512Mb
Level 1: bins 1 2 3 4 5 6 7 8, 64Mb each
. . .
Level 5: 16Kb each

Linear Index:

16Kb tiling windows: file offset of the left-most alignment that overlaps the window
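The six-level scheme (512Mb down to 16Kb, each level splitting a bin into eight) matches the binning functions given as C code in the SAM specification. Below is a Python port of those two functions as a sketch; coordinates are assumed to be 0-based and half-open, as in the specification's C version.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based half-open region [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # level 5: 16Kb bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # level 4: 128Kb bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # level 3: 1Mb bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # level 2: 8Mb bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # level 1: 64Mb bins
    return 0                       # level 0: the single 512Mb bin


def reg2bins(beg: int, end: int) -> list:
    """All bins, across levels, that may hold alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, first_bin in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(first_bin + (beg >> shift), first_bin + (end >> shift) + 1))
    return bins


print(reg2bin(0, 16384))    # fits inside the first 16Kb bin
print(reg2bin(0, 16385))    # crosses a 16Kb boundary, so it moves up one level
print(reg2bins(0, 1))       # one candidate bin per level
```

An alignment is stored in the bin returned by `reg2bin`; a query then only needs to inspect the bins returned by `reg2bins`.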

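A hypothetical example

[Figure: bin 0 spans the whole region and is subdivided into bins 1 2 3 4; alignments a, b, c, d, e each fit inside one small bin, while f, g, h cross bin boundaries; q is the query region]

bin 0: f, g, h
bin 1: a
bin 2: b
bin 3: c, d
bin 4: e

Query q:
1. bins(q): [0, 3];
2. Candidate alignments: f->h->c->d->g;
3. LinearIndex(3): start(h) => larger than end(f);
4. Remove f without reading;
5. Read h, c, d;
6. start(d) larger than boundary(q);
7. Stop: without reading g.

Done: saved TWO disk seeks!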
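The query steps above can be replayed in a small simulation. Everything numeric here is an assumption made for illustration: the bin width of 100, the alignment coordinates, and the query region are hypothetical values chosen so that the bin contents match the example, and a list position stands in for a file offset.

```python
from collections import namedtuple

Aln = namedtuple("Aln", "name start end")  # half-open interval [start, end)

BIN_WIDTH = 100  # hypothetical width of leaf bins 1..4; bin 0 spans all of them

# Hypothetical coordinates, sorted by start. List position = "file offset".
FILE = [
    Aln("a", 10, 40), Aln("f", 80, 120), Aln("b", 130, 160), Aln("h", 190, 220),
    Aln("c", 230, 250), Aln("d", 270, 290), Aln("g", 295, 310), Aln("e", 330, 360),
]


def bin_of(aln):
    """Leaf bin (1..4) if the alignment fits inside one, else the parent bin 0."""
    first, last = aln.start // BIN_WIDTH, (aln.end - 1) // BIN_WIDTH
    return first + 1 if first == last else 0


# Binning index: bin number -> offsets of the alignments stored in that bin.
bin_index = {}
for offset, aln in enumerate(FILE):
    bin_index.setdefault(bin_of(aln), []).append(offset)

# Linear index: leaf bin number -> offset of the left-most overlapping alignment.
linear_index = {}
for offset, aln in enumerate(FILE):
    for leaf in range(aln.start // BIN_WIDTH + 1, (aln.end - 1) // BIN_WIDTH + 2):
        linear_index.setdefault(leaf, offset)


def query(q_start, q_end):
    """Return (names read from 'disk', names overlapping the query)."""
    leaves = range(q_start // BIN_WIDTH + 1, (q_end - 1) // BIN_WIDTH + 2)
    candidates = sorted(o for b in (0, *leaves) for o in bin_index.get(b, []))
    min_offset = linear_index.get(q_start // BIN_WIDTH + 1, 0)
    read, hits = [], []
    for offset in candidates:
        if offset < min_offset:
            continue  # removed without reading, like f in the example
        aln = FILE[offset]  # this lookup stands in for a disk read
        read.append(aln.name)
        if aln.start >= q_end:
            break  # sorted input: later candidates (g) cannot overlap
        if aln.end > q_start:
            hits.append(aln.name)
    return read, hits


read, hits = query(225, 260)
print("read:", read, "hits:", hits)
```

With these made-up coordinates, f is skipped through the linear index and g is never reached, which mirrors the two saved disk seeks.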

WIGGLE: Coverage format

From alignment to read depth

• Read depth: the number of times a base-pair is covered by aligned short reads.

• Can be normalized: depth / library size * 1E6 = read depth per million aligned reads.

• Many tools to use: samtools depth, bedtools, and so on.

Example:

[Figure: alignments stacked over the reference, giving read depths of 1, 2, 3 and 4]
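A minimal sketch of the depth calculation and the per-million normalization described above. Alignments are assumed to be 0-based half-open `(start, end)` pairs on a single chromosome; the helper names and the toy coordinates are made up for illustration. Tools such as samtools depth do this on real BAM files.

```python
def read_depth(alignments, region_start, region_end):
    """Per-base depth over [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in alignments:
        # Clip each alignment to the region before counting.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


def normalize_rpm(depth, library_size):
    """depth / library size * 1E6 = read depth per million aligned reads."""
    return [d / library_size * 1e6 for d in depth]


# Four toy alignments that all end at position 4 but start one base apart.
alignments = [(0, 4), (1, 4), (2, 4), (3, 4)]
depth = read_depth(alignments, 0, 4)
print(depth)
print(normalize_rpm(depth, library_size=2_000_000))
```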

Describing depth: the Wiggle format

• Line-oriented text file, two options: variable step and fixed step.

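variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3

[Figure: on chr1, value 1 covers 2 bases starting at 100 (11), value 2 covers 3 bases starting at 1000 (222), value 3 covers 4 bases starting at 10000 (3333)]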

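Wiggle: fixed step

fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3

[Figure: on chr1, value 1 covers 3 bases starting at 100 (111), value 2 covers 3 bases starting at 200 (222), value 3 covers 3 bases starting at 300 (333)]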

[Figure: the reference divided into consecutive windows of width w]

fixedStep chrom=chr? start=??? step=w span=w
Dump your data here
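The "step=w span=w" recipe can be sketched as follows: average the depth in consecutive windows of width w, then write one value per line under a single fixedStep header. The function names are illustrative, and the start coordinate is assumed to be 1-based, which is how Wiggle positions are counted.

```python
def window_means(depth, window):
    """Average depth in consecutive, non-overlapping windows."""
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]  # the last chunk may be shorter
        means.append(sum(chunk) / len(chunk))
    return means


def fixed_step_wiggle(chrom, start, window, values):
    """Format values as one fixedStep block with step = span = window."""
    lines = [f"fixedStep chrom={chrom} start={start} step={window} span={window}"]
    lines.extend(f"{value:g}" for value in values)
    return "\n".join(lines) + "\n"


depth = [1, 1, 3, 3, 5]
text = fixed_step_wiggle("chr1", 1, 2, window_means(depth, 2))
print(text, end="")
```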

If you have very large wiggle files…

• Wiggle files can be huge: average per 10bp window => 300M elements for human genome.

• Makes sense to compress and index.

[Figure: wiggle data stored as gzip blocks]

Genome browser

Web-based browser vs. local browser

Web-based: Pros: very comprehensive. Cons: data have to be transmitted via network.

Local: Pros: locally installed. Cons: less genome annotation.

UCSC genome browser

Genome browsers: lots of options

Wiki: 34 in total and that is not all!

DEMO: GENOME BROWSER

Alignment, BAM, Wiggle, Peak calling, BED…

NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA


A genome is a huge collection of functional elements

• TSS: transcription start site

• TES: transcription end site

• Exon: mRNA components

• CpG island: has roles in gene regulation and evolution

• Enhancer: activates genes

• DNase hypersensitive site: where TFs bind

• And more…

Images from Google image search


Categorizing functional elements

TSS, TES, Enhancer, CpG island, Exon

[Figure: genome browser (GB) view of H3K4me3 at TSS1, TSS2, TSS3, TSS4, TSS5, ..., summarized as an avg. profile and a heatmap]

Chrom  Start  End
chr1   100    101
chr2   200    201
...
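The idea behind the average profile and heatmap can be sketched as follows: cut a window of coverage around each element (here, each TSS), stack the windows as rows of a matrix (the heatmap), and average the columns (the avg. profile). This is an illustration under simplifying assumptions (one chromosome, coverage held in a plain list, strand ignored) and is not ngs.plot's implementation.

```python
def extract_windows(coverage, centers, flank):
    """One row per center: coverage from center - flank to center + flank."""
    rows = []
    for center in centers:
        if center - flank < 0 or center + flank >= len(coverage):
            continue  # skip elements too close to the chromosome ends
        rows.append(coverage[center - flank:center + flank + 1])
    return rows


def average_profile(rows):
    """Column-wise mean of the heatmap matrix."""
    return [sum(column) / len(column) for column in zip(*rows)]


coverage = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # toy per-base coverage
tss_positions = [0, 2, 5]                   # hypothetical TSS coordinates
heatmap_rows = extract_windows(coverage, tss_positions, flank=1)
profile = average_profile(heatmap_rows)
print(heatmap_rows, profile)
```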

Step 1: choose a region of interest

Where to download?

Which database to use?

What kind of formats do they use?

0-based coordinates?

1-based coordinates?

Subset regions by function?
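The coordinate question matters because annotation formats differ: BED uses 0-based, half-open intervals, while formats such as GFF/GTF and SAM use 1-based, closed positions. A small sketch of the conversion, with illustrative function names:

```python
def zero_based_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


def one_based_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


# A BED-style single-base region such as "chr1 100 101" is base 101 when
# counted from 1.
print(zero_based_to_one_based(100, 101))
print(one_based_to_zero_based(101, 101))
```

Only the start changes between the two conventions; the interval length is `end - start` in the first and `end - start + 1` in the second.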

ngs.plot collects lots of genome annotations

Variable    Count  Description
Database    3      RefSeq, Ensembl and ENCODE
Genome      9      dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, Tair10, Zv9
Biotype     7      tss, tes, genebody, cgi, dhs, enhancer, exon
Gene type   4      protein_coding, lincRNA, miRNA, pseudogene
Cell line   9      Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf

Total functional elements: 15,944,952


Step 2: plot something at this region

[Figure: H3K27me3 average profile and heatmap, annotated with: SEM bar, smoothing, shade, flanking region, robust estimation]

Gene ranking algo.: total, hc, none, diff, prod, pca, max
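Three of the plot ingredients named here (ranking genes by total signal, SEM bars, smoothing) can be sketched on a small matrix with one row per region. These are generic textbook versions written for illustration; whether ngs.plot computes them in exactly this way is an assumption, and the function names are made up.

```python
import math


def rank_by_total(rows):
    """Order heatmap rows by their summed signal, strongest first."""
    return sorted(rows, key=sum, reverse=True)


def column_mean_sem(rows):
    """Per-column mean and standard error of the mean (SEM)."""
    n = len(rows)
    result = []
    for column in zip(*rows):
        mean = sum(column) / n
        if n > 1:
            variance = sum((x - mean) ** 2 for x in column) / (n - 1)
            sem = math.sqrt(variance / n)
        else:
            sem = 0.0  # SEM is undefined for a single row; report 0
        result.append((mean, sem))
    return result


def moving_average(values, width):
    """Smooth a profile with a centered moving average (shorter at the edges)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


rows = [[1.0, 1.0], [3.0, 3.0], [2.0, 2.0]]
print(rank_by_total(rows))
print(column_mean_sem(rows))
print(moving_average([0, 3, 0], 3))
```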


ngs.plot: a global visualization tool for NGS data

• Written in R, easy-to-use command line program.

ngs.plot.r -G genome -R tss -C chipseq.bam -O output


Testing biological hypotheses with NGS data

Ian Maze, Allis lab, Rockefeller

[Figure: nucleosomes carrying H3 Var A or H3 Var B, profiled by ChIP-seq]

Bioinformatician: understand questions; transform -> analytics

Time spent: Super, Not bad, Normal…


Visualization the ngs.plot way

[Figure: heatmaps for A, B, H3 and RNA-seq]

Config file:

A.bam -1 "A"
B.bam -1 "B"
H3.bam -1 "H3"

ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX diff

Export gene order list: go.txt

ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY


ngs.plot is also available on Galaxy!

URL: https://ineuron.mssm.edu/galaxy

DEMO: NGS.PLOT

Global visualization made easy…