Introduction to Short Read Sequencing Analysis

Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760


Transcript of Introduction to Short Read Sequencing Analysis

Page 1: Introduction to Short Read Sequencing Analysis

Introduction to Short Read Sequencing Analysis

Jim Noonan, GENE 760

Page 2: Introduction to Short Read Sequencing Analysis

Sequence read lengths remain limiting

• For most applications, reads are aligned to a reference genome
• Short reads contain inherently limited information
• De novo assembly of short reads is difficult

Chr1: 249 Mb

249 Mb sequencing read

Current platforms:
• A moderate number (~500,000) of long reads (~10 kb)
• A very large number (>200 M) of short reads (100 bp)

Page 3: Introduction to Short Read Sequencing Analysis

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Need a computationally efficient method to perform accurate alignments of millions of reads

Aligning short reads to much larger reference

Page 4: Introduction to Short Read Sequencing Analysis

Read length requirements vary depending on the feature being studied

Exome: 80-120 bp

Transcriptome: 10,000 bp

Splice junctions (connectivity)

Page 5: Introduction to Short Read Sequencing Analysis

Determining the identity and location of short sequence reads in the genome/exome/transcriptome

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Exome or Genome

Transcriptome

Considerations
• Alignment scoring
• Source of the reads
• Sequencing format (PE or SE)
• Read length
• Error rates

Aligning short reads to much larger reference

Page 6: Introduction to Short Read Sequencing Analysis

Topics

•Mapability

•Error rates and quality scores for short read sequencing

•Common algorithms for short read sequence alignment

•Scoring short read sequence alignments

•Uniform data output formats

•Scoring alignments

Page 7: Introduction to Short Read Sequencing Analysis

Scoring alignments

Correct:

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

Wrong:

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

Adapted from Mark Gerstein

Match (+1):

C
|
C

Mismatch (-1, -2, etc.):

C

T

Gap penalty: P = a + bN
a = cost of opening a gap
b = cost of extending gap by 1
N = length of gap

A-TAC
| |||
ATTAC

A--AC
|  ||
ATTAC

Many short read alignment algorithms allow a fixed number of mismatches
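The scoring scheme on this slide can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, match = +1 and mismatch = -1 are taken from the slide, and the gap costs a = 2 and b = 1 are assumed values, since the slide does not fix them.

```python
def gap_penalty(n, a=2, b=1):
    """Affine gap penalty P = a + b*N for a gap of length N (a, b assumed)."""
    return a + b * n


def score_alignment(top, bottom, match=1, mismatch=-1, a=2, b=1):
    """Score two equal-length aligned strings; '-' marks a gap position.

    Simplification: adjacent gap columns count as one gap run,
    whichever of the two strings carries the '-'.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_len = 0  # length of the gap run currently being scanned
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        if gap_len:  # a gap run just ended: charge it once
            score -= gap_penalty(gap_len, a, b)
            gap_len = 0
        score += match if x == y else mismatch
    if gap_len:  # gap run that reaches the end of the alignment
        score -= gap_penalty(gap_len, a, b)
    return score


reference = "TAGATTACACAGATTAC"
print(score_alignment("TAGATTACACAGATTAC", reference))  # 17 matches
print(score_alignment("TAGATTACTCAGA-TAC", reference))  # 15 matches, 1 mismatch, 1 bp gap
```

With these assumed costs the ungapped perfect match scores higher than the alignment that needs both a mismatch and a gap, which is the kind of comparison an aligner uses to rank candidates.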

Page 8: Introduction to Short Read Sequencing Analysis

Scoring alignments

Correct (polymorphism):

TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC

Wrong:

TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC

Adapted from Mark Gerstein


Page 9: Introduction to Short Read Sequencing Analysis

Quality scores

A quality score (or Q-score) expresses the probability that a basecall is incorrect. Given a basecall, A:

• The estimated probability that A is not correct is P(~A);

• The quality score for A is Q(A) = -10 log10(P(~A))

A quality score of 10 means a probability of 0.1 that A is the wrong basecall.

Quality scores are logarithmic:

Q-score    Error probability
10         0.1
20         0.01
40         0.0001

P(~A) is platform-specific; Q-scores can be compared across platforms.
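The Q-score definition above can be checked numerically with a short Python sketch. The function names are hypothetical; the formulas are the ones on this slide.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Reproduce the Q-score table from the slide.
for q in (10, 20, 40):
    print(q, phred_to_error_prob(q))
```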

Page 10: Introduction to Short Read Sequencing Analysis

Error rates in Illumina sequencing reads

Sequencing by synthesis with reversible dye terminators

1 cycle: add base, scan flow cell, reverse termination, then add next base, etc.

Individual synthesis reactions go out of phase

Page 11: Introduction to Short Read Sequencing Analysis

Error rates in Illumina sequencing reads

• Error rates are mismatch rates relative to reference genome

• Reads may be trimmed to improve alignment quality

• Error rates increase with increasing cycle number

• Contingent on reference genome quality

Page 12: Introduction to Short Read Sequencing Analysis

Illumina quality score encoding in FASTQ format (CASAVA v1.8)

>90% Q30 bases in high quality run
>80% mappable reads
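CASAVA v1.8 FASTQ files store each quality score as a single ASCII character with an offset of 33 (Phred+33). A minimal Python sketch of decoding such a string and computing the fraction of Q30 bases follows; the function names and the example quality string are hypothetical and do not come from a real run.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) to a list of Q-scores."""
    return [ord(ch) - 33 for ch in quality_string]


def fraction_q30(quality_string):
    """Fraction of bases in the read with Q >= 30."""
    scores = decode_phred33(quality_string)
    return sum(q >= 30 for q in scores) / len(scores)


example = "IIII?5+#"  # hypothetical quality string for an 8 bp read
print(decode_phred33(example))  # [40, 40, 40, 40, 30, 20, 10, 2]
print(fraction_q30(example))    # 0.625
```

Averaging fraction_q30 over all reads in a run gives the kind of ">90% Q30" summary quoted above.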

Page 13: Introduction to Short Read Sequencing Analysis

Sources of error in single-molecule sequencing

Illumina: consensus signal

TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC

PacBio: one molecule, one read

TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC

Sequence templates multiple times

Page 14: Introduction to Short Read Sequencing Analysis

Mapability

• The genome contains non-unique sequences (repeats, segmental duplications)
• Short reads derived from repetitive regions are difficult to map

[Figure: a repeat present on both Chr3 and Chr7; panels labeled "Longer reads" and "Paired reads"]

Page 15: Introduction to Short Read Sequencing Analysis

Mapability scores at UCSC


36mers, 2 mismatches

75mers, 2 mismatches

100mers, 2 mismatches
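To see why longer k-mers are easier to place uniquely, here is a toy Python sketch. It is not how the UCSC tracks are computed: it assumes exact matches only (the tracks above allow 2 mismatches), a single strand, and a short made-up sequence. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return sum(counts[kmer] == 1 for kmer in kmers) / len(kmers)


# Toy sequence: the 8 bp repeat GATTACAG occurs twice, flanked by other sequence.
toy = "ACGTTGCC" + "GATTACAG" + "TTCCGGAA" + "GATTACAG" + "CATGCATC"
for k in (4, 8, 12):
    print(k, unique_kmer_fraction(toy, k))
```

A k-mer that lies entirely inside the repeat cannot be placed uniquely, while a k-mer long enough to reach into the flanking sequence can.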

Page 16: Introduction to Short Read Sequencing Analysis

Poorly mappable regions of the genome

36mers, 2 mismatches

75mers, 2 mismatches

100mers, 2 mismatches

Page 17: Introduction to Short Read Sequencing Analysis

Common algorithms for mapping short reads to a reference genome

Program      Website
ELAND (v2)   N/A – integrated into Illumina pipeline
Bowtie       http://bowtie-bio.sourceforge.net/
BWA          http://bio-bwa.sourceforge.net/
Maq          http://maq.sourceforge.net/

Considerations
• Alignment scoring method
• Speed
• Quality aware
• Seeding
• Gapped alignment

Page 18: Introduction to Short Read Sequencing Analysis

Seed-based alignment strategy

Reference

Seed

Critical values are seed length and number of mismatches allowed

In ELAND:
Seed length = 32
Number of mismatches = 2

Single seed alignments

Multiseed alignments (ELAND v2, others)

Seed interval contingent on read length

Page 19: Introduction to Short Read Sequencing Analysis

Implementation in ELAND v2

A read must have at least one seed with no more than 2 mismatches and no gaps

Gapped alignment: extend each alignment to full length of read, allowing gaps up to 10 bp
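The seeding step can be illustrated with a brute-force Python sketch. This is not ELAND's implementation, which relies on indexing rather than scanning the reference; the function names, the short seed length and the toy sequences are assumptions chosen to keep the example readable.

```python
def extract_seeds(read, seed_length, interval):
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    last_start = len(read) - seed_length
    return [(i, read[i:i + seed_length])
            for i in range(0, last_start + 1, interval)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_hits(reference, seed, max_mismatches=2):
    """Reference positions where the seed aligns ungapped within the mismatch limit."""
    n = len(seed)
    return [pos for pos in range(len(reference) - n + 1)
            if hamming(reference[pos:pos + n], seed) <= max_mismatches]


reference = "TAGATTACACAGATTAC"
read = "ATTACTCAGATT"  # hypothetical read with one mismatch to the reference
for offset, seed in extract_seeds(read, seed_length=6, interval=3):
    print(offset, seed, seed_hits(reference, seed, max_mismatches=1))
```

Each seed hit is only a candidate location; the gapped extension described above decides whether the full read aligns there.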

Page 20: Introduction to Short Read Sequencing Analysis

Resolving ambiguous read alignments with multiple seeds

Reference

Seed

Page 21: Introduction to Short Read Sequencing Analysis

Resolving ambiguous read alignments with multiple seeds

Page 22: Introduction to Short Read Sequencing Analysis

Utility of gapped alignments

RNA-seq

Insertion and deletion variants in exome and whole genome sequencing

Page 23: Introduction to Short Read Sequencing Analysis

Mapping paired end reads

Read 1 Read 2

Insert size

Insert size within specified range

Page 24: Introduction to Short Read Sequencing Analysis
Page 25: Introduction to Short Read Sequencing Analysis

ELAND alignment scoring

Base quality values and mismatch positions in a candidate alignment are used to assign a p value

P values reflect the probability that a candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values

Alignment score for a read is computed from p values of all candidate alignments

If there are two candidates for a read with p values 0.9 and 0.3:

• 0.9/(0.9 + 0.3) = 0.75, the chance that the highest scoring alignment is correct

• 1 - 0.75 = 0.25, the chance that the highest scoring alignment is wrong

• Alignment score = -10 log10(0.25) ≈ 6
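The arithmetic in this example can be reproduced with a few lines of Python. The function name is hypothetical, and the sketch only mirrors the calculation shown on the slide; it is not ELAND's actual code.

```python
import math


def alignment_score(p_values):
    """Phred-scaled confidence that the best candidate alignment is correct."""
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1 - p_correct
    if p_wrong == 0:  # single candidate: no competing alignment
        return float("inf")
    return -10 * math.log10(p_wrong)


print(round(alignment_score([0.9, 0.3]), 2))  # 6.02
```

A second candidate with a p value close to the best one pushes the score toward 0, which is how repeats produce low alignment scores.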

Page 26: Introduction to Short Read Sequencing Analysis

BaseSpace

https://basespace.illumina.com/

Page 27: Introduction to Short Read Sequencing Analysis


Spaced-seed indexing of the reference genome

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Need to break up the genome into manageable segments

• Create index of short sequences

• Match seeds against genome index
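The index-then-lookup idea in these bullets can be sketched in Python with a plain hash table. Note the simplification: this indexes contiguous k-mers and finds exact matches only, whereas a spaced-seed index uses templates with ignored positions so that mismatches can be tolerated. The function names and the toy reference are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def lookup_seed(index, seed):
    """Candidate reference positions for an exact seed match."""
    return index.get(seed, [])


index = build_kmer_index("TAGATTACACAGATTAC", 6)
print(lookup_seed(index, "GATTAC"))  # [2, 11]
print(lookup_seed(index, "AAAAAA"))  # []
```

Building the index once makes each seed lookup independent of genome size, at the cost of the memory needed to hold the index.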

Page 28: Introduction to Short Read Sequencing Analysis

Reference genome indexing using Burrows-Wheeler transform

Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)

• Reversible encoding scheme
• Simplifies genome sequence
• Results in “indexed” genome
• Very rapid alignments
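The "reversible encoding" property can be demonstrated on a short string with a naive Python sketch. It sorts all rotations, which is fine for a toy example but is not how aligners build a genome-scale index. The function names are hypothetical, and the terminator character "$" is assumed not to occur in the text.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded)
print(inverse_bwt(encoded))  # GATTACA
```

The transform tends to group identical characters together, and the original sequence can always be rebuilt from it, which is what lets the transformed genome serve as a compact search index.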

Page 29: Introduction to Short Read Sequencing Analysis

Bowtie 2

Pre-built indexed genomes

Bowtie 1 and Bowtie 2 indexes are not compatible

Page 30: Introduction to Short Read Sequencing Analysis

Alignments in Bowtie 2

@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG

Multiseed alignment (ungapped)
Seed length: 16 nt, every 10 nt
# mismatches: 0

Seeds are extended (gaps allowed) to generate alignment

Ref   TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read  TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG

Match = 2

Mismatch = -6

Gap = -11 for the 2 bp gap shown: -5 to open, -3 to extend by 1 bp
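As a worked check of the scores on this slide, assume each aligned column is scored independently and a gap costs the opening penalty plus the extension penalty for every gapped base. The Ref/Read alignment shown has 76 columns: 73 matches, 1 mismatch (A in the reference, T in the read at column 14) and one 2 bp gap in the read. Under those assumptions the total score S is:

```latex
\begin{aligned}
S &= 73 \times (+2) + 1 \times (-6) + \bigl(-5 + 2 \times (-3)\bigr) \\
  &= 146 - 6 - 11 \\
  &= 129
\end{aligned}
```

As far as the Bowtie 2 manual describes, the match bonus applies in local alignment mode and the mismatch penalty can be reduced for low-quality bases, so the score reported for a real read may differ from this simple sum.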

Page 31: Introduction to Short Read Sequencing Analysis

Mapping in highly repetitive regions

ELAND is conservative
• Non-unique alignments are flagged; only one is reported in export.txt
• Post-alignment CASAVA analyses ignore these

Bowtie will report non-unique alignments
• User-specified options determine how these are reported

http://bowtie-bio.sourceforge.net/manual.shtml
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Page 32: Introduction to Short Read Sequencing Analysis

Sequence Alignment/Map (SAM) format

Standard format for reporting short read alignment data
• BAM is the compressed binary version

Header

Alignment info

http://samtools.sourceforge.net/
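To make the header/alignment split concrete, here is a minimal Python sketch that reads SAM text line by line. The 11 mandatory column names follow the SAM specification; the function names and the example alignment line are hypothetical, and in practice a dedicated library or samtools would be used instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def is_header(line):
    """Header lines start with '@'; alignment lines do not."""
    return line.startswith("@")


def parse_sam_line(line):
    """Parse one alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, columns)}
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical alignment line, for illustration only.
example = "read1\t0\tchr1\t100\t42\t8M\t*\t0\t0\tTAGATTAC\tIIIIIIII"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 100 8M
```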

Page 33: Introduction to Short Read Sequencing Analysis

Summary

•Read the material posted for this lecture on the class wiki

•Next week: first Regulomics lecture