Bioinfo ngs data format visualization v2

Data formats and visualization in next-generation sequencing analysis

Li Shen, Asst. Prof.

Neuro core

Sep 2014

Introduction to the Shenlab

Lab location: Icahn 10-20 office suite

Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS

http://neuroscience.mssm.edu/shen/index.html

DNA sequencing overview

Primer

Template sequence

DNA polymerase/ligase

5’ 3’

5’3’

1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?

• Sanger sequencing
• Pyrosequencing
• Solexa sequencing
• SOLiD sequencing
• Ion Torrent sequencing
• SMRT sequencing
• …and many others

Extending sequence

What is “next-generation” sequencing?

-- first-generation sequencers: --

Sanger sequencer: 384 samples per single batch

-- next-generation sequencers: --

Illumina, SOLiD sequencer: billions per single batch, ~3 million fold increase in throughput!

Massively Parallel:

What are “short” reads?

http://www.edgebio.com/blog_old/uploads/2011/06/1.png

http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg

Read position

Illumina: 50-250 bp

SOLiD: 35-50 bp

454 pyro: 700 bp

Sanger: 900 bp

Limit of read length

Illumina sequencing terminology

Chip, slide, flow cell…

HiSeq 2500

DNA fragment

Information flow of sequencing data

SAM/BAM

coverage

HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA @@?DDFFDBHFFGJIIGIGGGGGIJGHHIHIIGEGIIIIIJJJIIJIGGGG XA:i:0 MD:Z:51 NM:i:0

Image analysis

FASTQ: Raw sequence format

What is FASTQ?

• Text-based format for storing both biological sequences and corresponding quality scores.

• FASTQ = FASTA + QUALITY
• A FASTQ file uses four lines per sequence.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
+SEQ_ID (optional)
!''*((((***+))%%%++)(%%%%).1**
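The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming an uncompressed FASTQ in which every record occupies exactly four lines (no wrapped sequences); `read_fastq` is a hypothetical helper name, not part of any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# The example record from the slide; the repeated ID on the '+' line is optional.
example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(example))
```

Real files are usually gzip-compressed and far too large to hold in a list, so in practice the generator would be consumed one record at a time.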

Illumina sequence identifiers

@SOLEXA-DELL:6:1:8:1376#0/1

Instrument name: SOLEXA-DELL
Lane: 6
Tile: 1
X-coordinate: 8
Y-coordinate: 1376
Index number: #0
Paired read: /1
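The identifier can be split into its labeled parts with a regular expression. This sketch assumes the older (pre-1.8) Illumina header layout `@instrument:lane:tile:x:y#index/pair`; the tile field is part of that assumption, and `parse_illumina_id` is a hypothetical helper name.

```python
import re
from typing import Dict

# Assumed layout (pre-1.8 Illumina): @instrument:lane:tile:x:y#index/pair
OLD_ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(header: str) -> Dict[str, str]:
    """Split an old-style Illumina sequence identifier into named fields."""
    match = OLD_ILLUMINA_ID.match(header)
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {header!r}")
    return match.groupdict()


fields = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
```

Newer instruments write a different header layout, so a production parser would need to detect which variant it is looking at.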

@SEQ_ID

Quality score calculation

+SEQ_ID
!''*((((***+))%%%++)(%%%%).1** ?

A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect.

Q = -10 × log10(p); for example, p = 0.001 => Q = 30
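The conversion between error probability and quality score can be sketched in Python, assuming the standard Phred definition Q = -10 log10(p). The function names are illustrative only.

```python
import math


def phred_from_prob(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def prob_from_phred(q: float) -> float:
    """Error probability for a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


q30 = phred_from_prob(0.001)   # the slide's example: p = 0.001
p20 = prob_from_phred(20)      # a 1-in-100 chance of a wrong base call
```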

Encoding

Quality score interpretation

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%

Materials from Wikipedia

Quality score encoding

(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

1. A quality score is typically: [0, 40]

http://ascii-table.com/img/ascii-table.gif

2. An ascii table contains 128 symbols, incl. quality score range

3. Formula: score + offset => index

Two variants:
• offset=64 (Illumina 1.0 to before 1.8)
• offset=33 (Sanger, Illumina 1.8+)
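The formula "score + offset => index" can be tried directly with Python's `ord` and `chr`. This is a small sketch; `decode_quality` and `encode_quality` are hypothetical helper names, and the default offset of 33 is a choice made for the example.

```python
from typing import List, Sequence


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into integer scores: ASCII index - offset."""
    scores = [ord(ch) - offset for ch in qual]
    if scores and min(scores) < 0:
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def encode_quality(scores: Sequence[int], offset: int = 33) -> str:
    """Turn integer scores into a quality string: chr(score + offset)."""
    return "".join(chr(score + offset) for score in scores)


scores = decode_quality("!''*((((***+))%%%++)(%%%%).1**")
low_and_high_33 = encode_quality([0, 40], offset=33)   # ends of the (33) range
low_and_high_64 = encode_quality([0, 40], offset=64)   # ends of the (64) range
```

Decoding an offset-33 string with offset 64 produces negative scores, which is one simple way to notice that the wrong variant was assumed.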

Not an efficient use of space

What can you do with FASTQ files?

• Quality control: quality score distribution, GC content, k-mer enrichment, etc.

• Preprocessing: adapter removal, low-quality reads filtering, etc.
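A few of the checks listed above can be sketched per read in plain Python. This is only an illustration under stated assumptions: offset-33 quality encoding, a mean-quality threshold of 20 chosen arbitrarily, and hypothetical function names. Dedicated QC tools report many more statistics.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read (assumes a non-empty quality string)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def passes_filter(qual: str, min_mean: float = 20.0, offset: int = 33) -> bool:
    """Low-quality read filter: keep reads whose mean score reaches min_mean."""
    return mean_quality(qual, offset) >= min_mean


read_seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
read_qual = "!''*((((***+))%%%++)(%%%%).1**"
read_gc = gc_content(read_seq)
keep_read = passes_filter(read_qual)
```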

Figure: the example read (GATTTGGGGTTCAAAGCAGTATCGATCAAA with qualities !''*((((***+))%%%++)(%%%%).1**) annotated with per-read checks: mean quality, GC content, k-mer enrichment, adapter? (miRNA)

SAM/BAM: Alignment format

Short read alignment

• Many choices: BWA, Bowtie, MAQ, SOAP, STAR, TopHat, etc.

FASTQ files Alignments Index

Genomic reference sequence

Alignment format

Bowtie

SHRiMP

The SAM format

1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality

Figure labels: short read, reference sequence, mismatch, indel (insertion, deletion)

The SAM specification: https://github.com/samtools/hts-specs

An example line:

MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0

N = hundreds of millions
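A SAM alignment line can be split into its fields with ordinary string handling. This is a minimal sketch, assuming tab-delimited input with the eleven mandatory columns of the SAM specification; the helper names are hypothetical, and libraries such as pysam are the usual choice for real work. The record is one of the alignments shown earlier in this deck, rebuilt with tabs.

```python
import re
from typing import Dict, List, Tuple

SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into the mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, object] = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(fields[SAM_COLUMNS.index(key)])
    record["tags"] = fields[11:]
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Break a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def is_reverse(flag: int) -> bool:
    """Bit 0x10 of FLAG marks a read aligned to the reverse strand."""
    return bool(flag & 0x10)


line = "\t".join([
    "HISEQ2:197:D08GUACXX:7:2102:2396:174630", "16", "chr10", "3000373",
    "255", "51M", "*", "0", "0",
    "CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT",
    "JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC",
    "XA:i:0", "MD:Z:51", "NM:i:0",
])
record = parse_sam_line(line)
operations = parse_cigar(str(record["cigar"]))
```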

BAM: the binary version of SAM

• SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB.

• Makes sense for compression
• BAM: Binary sAM; compress using gzip library.
• Two parts: compressed data + index
• Index: random access (visualization, analysis, etc.)
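The idea of "compressed data + index" can be illustrated with independent gzip blocks and a table of byte offsets. This is a deliberately simplified sketch, not the real BAM layout: actual BAM files use BGZF blocks and an index organized by genomic position, while the block size, record format, and function names here are assumptions made for the example.

```python
import gzip
import io
from typing import BinaryIO, List, Sequence, Tuple

IndexEntry = Tuple[int, int, int]  # (first record number, byte offset, byte length)


def write_blocks(lines: Sequence[str], block_size: int, out: BinaryIO) -> List[IndexEntry]:
    """Compress lines in independent gzip members and record where each one starts."""
    index: List[IndexEntry] = []
    for start in range(0, len(lines), block_size):
        payload = "".join(lines[start:start + block_size]).encode("ascii")
        member = gzip.compress(payload)
        index.append((start, out.tell(), len(member)))
        out.write(member)
    return index


def read_block(handle: BinaryIO, entry: IndexEntry) -> List[str]:
    """Random access: seek to one block and decompress only that block."""
    _, offset, length = entry
    handle.seek(offset)
    data = gzip.decompress(handle.read(length))
    return data.decode("ascii").splitlines(keepends=True)


records = [f"read{i}\tchr1\t{100 + i}\n" for i in range(10)]
store = io.BytesIO()
index = write_blocks(records, block_size=4, out=store)
third_block = read_block(store, index[2])  # one seek, one small read
```

Because every block is compressed on its own, a viewer can jump to the block it needs without decompressing everything before it.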

Computer storage: primary vs. secondary

Primary Storage

Secondary Storage

• Fast, but
• Expensive

Corsair 16GB (2x8GB) 1600MHz PC3-12800 204-Pin DDR3 SODIMM Laptop Memory - $160 on Amazon

• Slow, but
• Inexpensive

WD My Book 4 TB USB 3.0 Hard Drive with Backup - $150 on Amazon

http://www.dtidata.com/resourcecenter/harddrive.jpg

1. Disk seek (~10ms on mobile and desktop)

2. Disk read

Scattered Sequential

Use secondary storage smartly!

Alignment

BAM indexing:

~1 disk seek (Li, H., 2011)

WIGGLE: Coverage format

From alignment to read depth

• Coverage: summary of alignments at each basepair (analysis and visualization)

• Read depth: the number of times a base-pair is covered by aligned short reads.

• Can be normalized: depth / library size * 1E6 = read depth per million aligned reads.

• Many tools to use: samtools depth, bedtools, and so on.
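Read depth and its normalization can be sketched in a few lines. This example assumes ungapped alignments given as (start, length) pairs with 1-based starts, and the library size used below is made up. In practice the tools named above do this work on real BAM files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def read_depth(alignments: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Count how many alignments cover each position."""
    depth: Counter = Counter()
    for start, length in alignments:
        for pos in range(start, start + length):
            depth[pos] += 1
    return dict(depth)


def normalize(depth: Dict[int, int], library_size: int) -> Dict[int, float]:
    """Read depth per million aligned reads: depth / library size * 1E6."""
    return {pos: d / library_size * 1e6 for pos, d in depth.items()}


# Three overlapping 4-bp alignments, each shifted by one base.
alignments = [(1, 4), (2, 4), (3, 4)]
depth = read_depth(alignments)
rpm = normalize(depth, library_size=2_000_000)  # hypothetical library size
```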

Reference:

Alignments

Example:

Coverage: sparse or continuous

H3K4me3 (histone mark)

Mouse chr3, 15 Kb

Some values A lot of zeros

H3K9me2 (histone mark)

A lot of values everywhere

Read depths => normalization, smoothing

Describing coverage: the Wiggle format

• Line-oriented text file for coverage data
• Two options: variable step and fixed step.

variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3

chr1: 11 (at 100), 222 (at 1000), 3333 (at 10000)

Wiggle: fixed step

fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3

chr1: 111 (at 100), 222 (at 200), 333 (at 300)
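Both Wiggle variants can be expanded into intervals by a small parser. This is a sketch under stated assumptions: positions are treated as 1-based, intervals are reported with inclusive ends, and only the `chrom`, `span`, `start` and `step` options are handled. `parse_wiggle` is a hypothetical name.

```python
from typing import Iterable, Iterator, Optional, Tuple

Interval = Tuple[str, int, int, float]  # (chrom, start, end inclusive, value)


def parse_wiggle(lines: Iterable[str]) -> Iterator[Interval]:
    """Expand variableStep and fixedStep data lines into intervals."""
    mode: Optional[str] = None
    chrom, span, step, next_start = "", 1, 1, 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            opts = dict(part.split("=", 1) for part in parts[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))
            if mode == "fixedStep":
                next_start = int(opts["start"])
                step = int(opts["step"])
            continue
        if mode == "variableStep":
            pos, value = line.split()      # each data line: position value
            start = int(pos)
        elif mode == "fixedStep":
            start, value = next_start, line  # each data line: value only
            next_start += step
        else:
            raise ValueError("data line before any declaration line")
        yield chrom, start, start + span - 1, float(value)


variable = list(parse_wiggle([
    "variableStep chrom=chr1 span=2", "100 1",
    "variableStep chrom=chr1 span=3", "1000 2",
    "variableStep chrom=chr1 span=4", "10000 3",
]))
fixed = list(parse_wiggle([
    "fixedStep chrom=chr1 start=100 step=100 span=3", "1", "2", "3",
]))
```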

If you have very large wiggle files…

• Wiggle files can be huge: average per 10bp window => 300M elements for human genome.

• Makes sense to compress and index.

Gzip blocks

Genome browser

Pros: very comprehensive
Cons: data have to be uploaded or transmitted via network dynamically

Pros: locally installed
Cons: less genome annotation

UCSC genome browser

Genome browsers: lots of options

Wiki: 34 in total and that is not all!

DEMO: GENOME BROWSER

Alignment, BAM, Wiggle, Peak calling, BED…

NGS.PLOT: QUICK MINING AND VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA

The coolest way to visualize your NGS data

Genome: functions & annotations

http://www.bioteach.ubc.ca/wp-content/uploads/2008/04/dna1-198x300.jpg

Molecular level Chromatin level

Robison and Nestler, 2011, Nature Reviews

…-GCCCATTTGGCCATGCCCCCAAAATTCGCGCGTTTAAAA-…

• Long: ~3Gb
• Various contexts
• Heterogeneous

Labels:

Functional level

• Protein coding
• Activation
• Repression
• Support others
• Evolution related
• Etc.

Genome: A huge catalog of functional elements

Promoter

http://www.nature.com/nsmb/journal/v17/n5/images_article/nsmb.1801-F6.jpg

https://wikispaces.psu.edu/download/attachments/42338229/image-2.jpg

Enhancer

Exon CpG island

DNase I hypersensitive site

And many more…

Images from Google image search

Categorizing functional elements

TSS TES Enhancer CpG island Exon

Genome Browser

TSS5...

Chrom Start End
chr1 100 101
chr2 200 201
...

Avg. profile, Heatmap

H3K4me3@TSS

Genome

Genomic annotations are stored in different databases

• Maintained by different groups at different locations
• Heterogeneous data formats

And many more…

The Zebrafish Database

The difficulty of dealing with genomic annotations

Where to download?

Which database to use?

What kind of formats do they use?

0-based coordinates?

1-based coordinates?

Subset regions by XXX?

Q: All transcription start sites for mouse genome?
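The coordinate questions above matter as soon as a TSS list is built by hand. The sketch below assumes gene records in 1-based inclusive coordinates (as in GTF/GFF) and writes TSS positions as 0-based, half-open intervals (as in BED). The gene coordinates and helper names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive (GTF/GFF style)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based inclusive -> 0-based half-open: only the start shifts."""
    return start - 1, end


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open -> 1-based inclusive."""
    return start + 1, end


def tss_bed(gene: Gene) -> Tuple[str, int, int]:
    """TSS as a 1-bp BED interval: gene start on '+', gene end on '-'."""
    pos = gene.start if gene.strand == "+" else gene.end
    bed_start, bed_end = one_based_to_bed(pos, pos)
    return gene.chrom, bed_start, bed_end


# Hypothetical genes, chosen only to illustrate the strand rule.
plus_tss = tss_bed(Gene("chr1", 101, 500, "+"))
minus_tss = tss_bed(Gene("chr2", 50, 201, "-"))
```

Getting the strand rule or the off-by-one wrong shifts every TSS, which is one reason to automate this step rather than repeat it by hand.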

Automated Process

ngs.plot: quick mining & visualization for NGS data

• Easy-to-use command line program.

ngs.plot.r -G genome -R tss -C chipseq.bam -O output

ngs.plot workflow

Three histone modification marks

Continued…

• ChIP-seq in human embryonic stem cells
• Alignment files: h3k4me3.bam, h3k27me3.bam, h3k36me3.bam and input.bam (control)

http://www.nature.com/nsmb/journal/v18/n9/images/nsmb.2123-F6.jpg

Configure and…go!

#Bam File Gene List Title

h3k4me3.bam:input.bam -1 H3K4me3

config.txt

ngs.plot.r -G hg19 -R genebody -C config.txt -GO km -O threeMarks

Genome name Region Configuration Gene rank/clustering(K-means)

Output name
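The configuration text and command line for the three marks can be assembled programmatically. This is a hedged sketch: it assumes one tab-separated line per mark in the form `bam:control`, gene list, title, following the single config line shown above. The three-line layout is an extrapolation from that line. The script only builds strings and does not write files or run ngs.plot.

```python
from typing import List, Sequence, Tuple

# BAM file names and titles taken from the slides; pairing each mark with
# input.bam follows the one config line shown there (an assumption for the rest).
MARKS: List[Tuple[str, str]] = [
    ("h3k4me3.bam", "H3K4me3"),
    ("h3k27me3.bam", "H3K27me3"),
    ("h3k36me3.bam", "H3K36me3"),
]


def build_config(marks: Sequence[Tuple[str, str]],
                 control: str = "input.bam",
                 gene_list: str = "-1") -> str:
    """One line per mark: 'bam:control<TAB>gene list<TAB>title'."""
    lines = [f"{bam}:{control}\t{gene_list}\t{title}" for bam, title in marks]
    return "\n".join(lines) + "\n"


config_text = build_config(MARKS)

# Same options as the command on the slide, kept as a list of arguments.
command = ["ngs.plot.r", "-G", "hg19", "-R", "genebody",
           "-C", "config.txt", "-GO", "km", "-O", "threeMarks"]
command_line = " ".join(command)
```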

H3K27me3 H3K4me3 H3K36me3

Strongly expressed

Suppressed

Bivalent

Nothing

Weakly expressed

“Average” profile

H3K4me3

H3K27me3

H3K36me3

(OPTIONAL) DEMO: NGS.PLOT

Global visualization made easy…
