Data formats and visualization in next-generation sequencing analysis
Li Shen, Asst. Prof.
Neuro core
Sep 2013
Introduction to the Shenlab
Lab location: Icahn 10-20 office suite
Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS
http://neuroscience.mssm.edu/shen/index.html
DNA sequencing overview
Primer
Template sequence
DNA polymerase/ligase
ACGT
5’ 3’
5’3’
1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?
Sanger sequencing
Pyrosequencing
Solexa sequencing
SOLiD sequencing
Ion Torrent sequencing
SMRT sequencing
…and many others
Extending sequence
What is “next-generation” sequencing?
-- first-generation sequencers: --
Sanger sequencer: 384 samples per single batch
-- next-generation sequencers: --
Illumina, SOLiD sequencer: billions per single batch, ~3 million fold increase in throughput!
Massively Parallel:
What are “short” reads?
http://www.edgebio.com/blog_old/uploads/2011/06/1.png
http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg
Figure axes: read position (x) vs. quality score (y)
Illumina:50-250bp
SOLiD:35-50bp
454 pyro:700bp
Sanger:900bp
Limit of read length
Illumina sequencing terminology
Chip, slide, flow cell…
HiSeq 2500
DNA fragment
Information flow of sequencing data
fastq
SAM/BAM
coverage
HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA @@?DDFFDBHFFGJIIGIGGGGGIJGHHIHIIGEGIIIIIJJJIIJIGGGG XA:i:0 MD:Z:51 NM:i:0
Image analysis
FASTQ
Raw sequence format
What is FASTQ?
• Text-based format for storing both biological sequences and corresponding quality scores.
• FASTQ = FASTA + QUALITY
• A FASTQ file uses four lines per sequence.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
+SEQ_ID (optional)
!''*((((***+))%%%++)(%%%%).1**
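The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming an uncompressed file in which every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @id, sequence, +[id], quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The example record from the slide.
EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+SEQ_ID\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

For a real file, pass an open text handle instead of the `io.StringIO` wrapper.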
Illumina sequence identifiers
@SOLEXA-DELL:6:1:8:1376#0/1
SOLEXA-DELL: instrument name
6: lane
1: tile
8: X-coordinate
1376: Y-coordinate
#0: index number
/1: paired read
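The identifier layout above can be split into its parts with a regular expression. This is a sketch that assumes exactly the colon/`#`/`/` layout shown on the slide; other identifier layouts (such as the read names in the SAM records earlier) will not match. `IlluminaId` and `parse_illumina_id` are illustrative names.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: int


# instrument:lane:tile:x:y#index/pair, with an optional leading "@"
_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(identifier: str) -> IlluminaId:
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    g = match.groupdict()
    return IlluminaId(
        g["instrument"], int(g["lane"]), int(g["tile"]),
        int(g["x"]), int(g["y"]), g["index"], int(g["pair"]),
    )


parsed = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
```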
Quality score calculation
@SEQ_ID
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1** ?
A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect.
Two definitions: 1. Sanger; 2. Solexa
Figures from Wikipedia
Quality score interpretation
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10000                           99.99%
50                    1 in 100000                          99.999%
Materials from Wikipedia
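The table follows from the standard Phred (Sanger) definition Q = -10 * log10(p). The Solexa variant uses the odds p / (1 - p) in place of p. The sketch below writes both down so the table rows can be reproduced; the function names are illustrative.

```python
import math


def phred_from_error(p: float) -> float:
    """Sanger/Phred quality: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Inverse of the Phred definition: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def solexa_from_error(p: float) -> float:
    """Solexa quality: Q = -10 * log10(p / (1 - p))."""
    return -10.0 * math.log10(p / (1.0 - p))


# Reproduce the rows of the interpretation table.
for q in (10, 20, 30, 40, 50):
    p = error_from_phred(q)
    print(f"Q={q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```

The two definitions nearly coincide for small p and differ mainly at low quality values.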
Quality score encoding
• Formula: score + offset => look up the ASCII symbol
• Two variants: offset = 64 (Illumina 1.0 to before 1.8); offset = 33 (Sanger, Illumina 1.8+).
• A quality score is typically in the range [0, 40].
(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
Figures from Wikipedia
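The "score + offset" rule can be written directly with `ord` and `chr`. This is a minimal sketch; `decode_quality` and `encode_quality` are illustrative names, and the caller is assumed to know which offset the file uses.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into integer scores: ASCII code minus offset."""
    return [ord(symbol) - offset for symbol in quality]


def encode_quality(scores: Iterable[int], offset: int = 33) -> str:
    """Turn integer scores into a quality string: chr(score + offset)."""
    return "".join(chr(score + offset) for score in scores)


# Quality line of the example record, decoded with the Sanger offset.
scores = decode_quality("!''*((((***+))%%%++)(%%%%).1**", offset=33)

# The two symbol ranges listed above are scores 0..40 under each offset.
sanger_symbols = encode_quality(range(41), offset=33)
illumina_symbols = encode_quality(range(41), offset=64)
```

Decoding with the wrong offset shifts every score by 31, so negative scores under offset 64 are a common hint that a file actually uses offset 33.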
What can you do with FASTQ files?
• Quality control: quality score distribution, GC content, k-mer enrichment, etc.
• Preprocessing: adapter removal, low-quality reads filtering, etc.
Figure: from each read (GATTTGGGGTTCAAAGCAGTATCGATCAAA) and its quality string (!''*((((***+))%%%++)(%%%%).1**), compute mean quality, GC content, and k-mer enrichment, and check for adapters (e.g. miRNA).
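The per-read quality-control metrics named above are simple to compute. This sketch assumes offset-33 quality encoding, and it only counts k-mers; an enrichment test would compare those counts against an expectation, which is not shown. All function names are illustrative.

```python
from collections import Counter


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read (offset 33 assumed)."""
    return sum(ord(symbol) - offset for symbol in quality) / len(quality)


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the read."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


read_seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
read_qual = "!''*((((***+))%%%++)(%%%%).1**"

print("mean quality:", mean_quality(read_qual))
print("GC content:", gc_content(read_seq))
print("most common 3-mers:", kmer_counts(read_seq, 3).most_common(3))
```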
SAM/BAM
Alignment format
Short read alignment
• Many choices: BWA, Bowtie, Maq, SOAP, STAR, TopHat, etc.
Figure: genomic reference sequence -> index; FASTQ files + index -> alignments
Alignment format
Bowtie, ELAND, BWA, SOAP, Maq, SHRiMP -> SAM
The SAM format
Figure: a short read aligned to the reference sequence, with a mismatch and an indel (insertion, deletion)
1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality
The SAM specification: https://github.com/samtools/hts-specs
An example line:
MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0
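The example line can be taken apart field by field. This sketch uses the eleven mandatory column names from the SAM specification (the slide labels seven of them) and splits on whitespace because the tabs were lost on the slide; real SAM files are tab-delimited. Only a subset of flag bits is decoded, and the helper names are illustrative.

```python
import re
from typing import Dict, List, Tuple

MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
FLAG_BITS = [(0x1, "paired"), (0x10, "reverse_strand"),
             (0x20, "mate_reverse_strand"), (0x40, "first_in_pair"),
             (0x80, "second_in_pair")]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into mandatory fields plus optional tags."""
    fields = line.split()  # use line.split("\t") on a real SAM file
    record: Dict[str, object] = {
        name: (int(value) if name in INT_FIELDS else value)
        for name, value in zip(MANDATORY, fields)
    }
    tags: Dict[str, object] = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """'76M' -> [(76, 'M')]: a list of (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def decode_flag(flag: int) -> List[str]:
    """Names of the set bits among the subset listed in FLAG_BITS."""
    return [name for bit, name in FLAG_BITS if flag & bit]


LINE = ("MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 "
        "ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT "
        "IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 "
        "AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ "
        "NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0")
record = parse_sam_line(LINE)
```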
N = hundreds of millions
BAM: the binary version of SAM
• SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB.
• Makes sense for compression
• BAM: Binary sAM; compress using gzip library.
• Two parts: compressed data + index
• Index: random access (visualization, analysis, etc.)
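The "compressed data + index" idea can be illustrated with ordinary gzip: compress records in independent blocks and remember where each block starts, so a reader can decompress one block without touching the others. This is a toy sketch using Python's `gzip` module and text records. It is not the actual BAM block layout, and `write_blocks` and `read_block` are made-up names.

```python
import gzip
import io
from typing import List, Tuple


def write_blocks(records: List[str], block_size: int = 2) -> Tuple[bytes, List[int]]:
    """Compress records in independent gzip members.

    Returns the compressed bytes and the byte offset where each block starts.
    """
    buffer = io.BytesIO()
    offsets: List[int] = []
    for i in range(0, len(records), block_size):
        offsets.append(buffer.tell())
        chunk = "".join(r + "\n" for r in records[i:i + block_size]).encode()
        buffer.write(gzip.compress(chunk))
    return buffer.getvalue(), offsets


def read_block(data: bytes, offsets: List[int], k: int) -> List[str]:
    """Random access: decompress only block k, using the stored offsets."""
    start = offsets[k]
    end = offsets[k + 1] if k + 1 < len(offsets) else len(data)
    return gzip.decompress(data[start:end]).decode().splitlines()


data, offsets = write_blocks(["r1", "r2", "r3", "r4", "r5"], block_size=2)
middle = read_block(data, offsets, 1)
```

Here the offsets list plays the role of the index: it maps a block number to a position in the compressed file.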
Layout of binary BAM file
Short read alignment
Hundreds of millions of alignments
Gzip blocks
Time: O(n), n = #alignments
q = chr: X–Y
Chromosome:
A naïve approach
Chromosome:
......
One index entry per base pair?
Wait, human chr1 is over 200Mb long!
Gzip blocks:
A binning strategy
Chromosome bins:
E.g.: bin = 16Kb each, ~10,000 indices per chromosome
Gzip blocks: ...
Long alignment
RNA splicing
Assume all alignments are sorted according to genomic coordinates:
......
Hierarchical binning and linear index
Binning:
Level 0: bin 0, 512Mb
Level 1: bins 1 2 3 4 5 6 7 8, 64Mb each
. . .
Level 5: 16Kb
Linear Index:
16Kb tiling windows: file offset of the left-most alignment that overlaps the window
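The six-level scheme (512Mb, 64Mb, ..., 16Kb) maps an alignment to the smallest bin that fully contains it. The sketch below is a Python transcription of the bin calculation as I recall it from the SAM specification, assuming 0-based, half-open coordinates; check it against the specification before relying on it. `linear_window` is an illustrative name for the 16Kb window number used by the linear index.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:   # fits in one 16Kb bin (level 5)
        return 4681 + (beg >> 14)
    if beg >> 17 == end >> 17:   # 128Kb bin (level 4)
        return 585 + (beg >> 17)
    if beg >> 20 == end >> 20:   # 1Mb bin (level 3)
        return 73 + (beg >> 20)
    if beg >> 23 == end >> 23:   # 8Mb bin (level 2)
        return 9 + (beg >> 23)
    if beg >> 26 == end >> 26:   # 64Mb bin (level 1)
        return 1 + (beg >> 26)
    return 0                     # 512Mb bin (level 0)


def linear_window(pos: int) -> int:
    """Index of the 16Kb tiling window that contains a 0-based position."""
    return pos >> 14


print(reg2bin(0, 100))       # short alignment inside the first 16Kb bin
print(reg2bin(0, 20000))     # crosses a 16Kb boundary, so it moves up a level
print(linear_window(16384))
```

Long alignments, such as spliced RNA-seq reads, end up in the larger bins near the top of the hierarchy.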
A hypothetical example
Bins: 0 (top level); 1 2 3 4 (second level)
Alignments: a b c d e; f g h
bin 0: f, g, h
bin 1: a
bin 2: b
bin 3: c, d
bin 4: e
Query q:
1. bins(q): [0, 3];
2. Candidate alignments: f->h->c->d->g;
3. LinearIndex(3): start(h) => larger than end(f);
4. Remove f without reading;
5. Read h, c, d;
6. start(d) larger than boundary(q);
7. Stop: without reading g.
Done: saved TWO disk seeks!
WIGGLE
Coverage format
From alignment to read depth
• Read depth: the number of times a base-pair is covered by aligned short reads.
• Can be normalized: depth / library size * 1E6 = read depth per million aligned reads.
• Many tools to use: samtools depth, bedtools, and so on.
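Read depth and its per-million normalization can be sketched in a few lines. The example assumes alignments are given as 0-based, half-open (start, end) intervals on a single reference; in practice the tools listed above do this work. The function names are illustrative.

```python
from typing import List, Sequence, Tuple


def read_depth(alignments: Sequence[Tuple[int, int]], ref_length: int) -> List[int]:
    """Depth at every position, from 0-based half-open (start, end) intervals."""
    # Difference array: +1 where an alignment starts, -1 just past its end.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth: List[int] = []
    running = 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def normalize_depth(depth: Sequence[int], library_size: int) -> List[float]:
    """depth / library size * 1E6 = read depth per million aligned reads."""
    return [d / library_size * 1e6 for d in depth]


# Four staggered alignments give depths 1, 2, 3, 4, then 0.
depth = read_depth([(0, 4), (1, 4), (2, 4), (3, 4)], ref_length=5)
rpm = normalize_depth(depth, library_size=4_000_000)
```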
Example figure: alignments stacked along the reference, with read depths 1 2 3 4
Describing depth: the Wiggle format
• Line-oriented text file, two options: variable step and fixed step.
variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3
Figure: chr1 with value 1 over 2 bases at 100 (11), value 2 over 3 bases at 1000 (222), value 3 over 4 bases at 10000 (3333)
Wiggle: fixed step
fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3
Figure: chr1 with value 1 over 3 bases at 100 (111), value 2 at 200 (222), value 3 at 300 (333)
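Both Wiggle variants can be read by one small parser. This sketch handles only the declaration lines and data lines shown on the slides (plus skipping `track`, `browser`, and comment lines) and yields `(chrom, start, span, value)` tuples with the 1-based start positions used by Wiggle; `parse_wiggle` is an illustrative name.

```python
from typing import Iterable, Iterator, Tuple


def parse_wiggle(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, span, value) for variableStep and fixedStep data."""
    mode = None
    chrom = ""
    span = 1
    position = 0
    step = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            params = dict(item.split("=", 1) for item in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            start_text, value_text = line.split()
            yield chrom, int(start_text), span, float(value_text)
        elif mode == "fixedStep":
            yield chrom, position, span, float(line)
            position += step
        else:
            raise ValueError("data line before any declaration line")


VARIABLE = """variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3
"""
FIXED = """fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3
"""
variable_rows = list(parse_wiggle(VARIABLE.splitlines()))
fixed_rows = list(parse_wiggle(FIXED.splitlines()))
```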
Reference: w w w w …
fixedStep chrom=chr? start=??? step=w span=w
Dump your data here
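The "dump your data here" recipe amounts to averaging depth in windows of width w and writing one value per line under a fixedStep header. This is a sketch with illustrative names (`bin_means`, `to_fixed_step`); it assumes the depth list starts at the 1-based position passed as `start`.

```python
from typing import List, Sequence


def bin_means(depth: Sequence[float], window: int) -> List[float]:
    """Average depth in consecutive windows of the given width."""
    means: List[float] = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


def to_fixed_step(chrom: str, start: int, window: int, values: Sequence[float]) -> str:
    """Format windowed values as a fixedStep block with step = span = window."""
    lines = [f"fixedStep chrom={chrom} start={start} step={window} span={window}"]
    lines.extend(f"{value:g}" for value in values)
    return "\n".join(lines) + "\n"


means = bin_means([1, 1, 3, 3, 5], window=2)
text = to_fixed_step("chr1", 1, 2, means)
print(text)
```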
If you have very large wiggle files…
• Wiggle files can be huge: average per 10bp window => 300M elements for human genome.
• Makes sense to compress and index.
Gzip blocks
Genome browser
Pros: very comprehensive
Cons: data have to be transmitted via network
vs.
Pros: locally installed
Cons: less genome annotation
UCSC genome browser
Genome browsers: lots of options
Wiki: 34 in total and that is not all!
DEMO: GENOME BROWSER
Alignment, BAM, Wiggle, Peak calling, BED…
NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA
A genome is a huge collection of functional elements
• TSS: transcription start site
• TES: transcription end site
• Exon: mRNA components
• CpG island: has roles in gene regulation and evolution
• Enhancer: activates genes
• DNase hypersensitive site: where TFs bind
• And more…
Images from Google image search
Categorizing functional elements
TSS, TES, Enhancer, CpG island, Exon
GB view
TSS1
TSS2
TSS3
TSS4
TSS5...
Chrom   Start   End
chr1    100     101
chr2    200     201
...
Avg. profile
Heatmap
H3K4me3
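The average profile and heatmap both start from the same matrix: one row of coverage per region (e.g. per TSS), all rows the same width. The sketch below shows that idea on a toy depth vector. It is not ngs.plot's implementation; it ignores strand, and the function names are made up for illustration.

```python
from typing import List, Sequence


def extract_windows(depth: Sequence[float], centers: Sequence[int], flank: int) -> List[List[float]]:
    """One row per center: depth from center - flank to center + flank.

    Centers whose window would run off either end are skipped.
    The resulting rows are what a heatmap would display.
    """
    rows: List[List[float]] = []
    for center in centers:
        lo, hi = center - flank, center + flank + 1
        if lo < 0 or hi > len(depth):
            continue
        rows.append(list(depth[lo:hi]))
    return rows


def average_profile(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of the rows: the average profile curve."""
    return [sum(column) / len(rows) for column in zip(*rows)]


toy_depth = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = extract_windows(toy_depth, centers=[0, 2, 5], flank=1)
profile = average_profile(rows)
```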
Step 1: choose a region of interest
Where to download?
Which database to use?
What kind of formats do they use?
0-based coordinates?
1-based coordinates?
Subset regions by function?
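The 0-based versus 1-based question comes up whenever intervals move between formats: BED uses 0-based, half-open intervals, while SAM positions and Wiggle positions are 1-based. The conversion is a one-line shift of the start coordinate, sketched here with illustrative function names.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


# A single-base interval written BED-style as (100, 101) is base 101
# when counted from 1.
single_base = bed_to_one_based(100, 101)
```

The interval length is `end - start` in the 0-based half-open form and `end - start + 1` in the 1-based closed form.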
ngs.plot collects lots of genome annotations
Variable    Count   Description
Database    3       RefSeq, Ensembl and ENCODE
Genome      9       dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, Tair10, Zv9
Biotype     7       tss, tes, genebody, cgi, dhs, enhancer, exon
Gene type   4       protein_coding, lincRNA, miRNA, pseudogene
Cell line   9       Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf
Total functional elements: 15,944,952
H3K27me3
SEM bar, smoothing, shade, flanking region, robust estimation
Gene ranking algo.: total, hc, none, diff, prod, pca, max
Step 2: plot something at this region
ngs.plot: a global visualization tool for NGS data
• Written in R, easy-to-use command line program.
ngs.plot.r -G genome -R tss -C chipseq.bam -O output
Testing biological hypotheses with NGS data
Ian Maze, Allis lab, Rockefeller
Nucleosome: H3 Var A, H3 Var B
2. 3.
ChIP-seq
N
Understand questions
Transform -> analytics
bioinformatician
Time spent: Super, Not bad, Normal…
Visualization the ngs.plot way
A, B, H3, RNA-seq
Config file:
A.bam -1 "A"
B.bam -1 "B"
H3.bam -1 "H3"
ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX diff
Export gene order list: go.txt
ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY
ngs.plot is also available on Galaxy!
URL: https://ineuron.mssm.edu/galaxy
DEMO: NGS.PLOT
Global visualization made easy…