CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1

Lecture 4: Sequencing By Hybridization - September 02, 2004

Lecturer: Wing-Kin Sung Scribe: Stanley NG Kwang Loong

4.1 Introduction

The process of extracting DNA sequences from a given DNA sample is commonly known as sequence reconstruction or DNA sequencing. Efficient DNA sequencing of the genomes of individual species and organisms is a critical task for the advancement of biological sciences, medicine, and agriculture. Progress in modern sequencing methods is required to meet the challenge of sequencing Megabase to even Gigabase quantities of DNA genome.

One promising sequencing method is Sequencing by Hybridization (SBH), in which sets of oligonucleotides are hybridized under conditions that allow detection of complementary sequences in the target nucleic acid. The unprecedented sequence search parallelism of the SBH method has allowed development of a high-throughput and low-cost DNA sequencing technique.

Since the advent of the earliest DNA sequencing techniques in the 1970s by Sanger and Maxam-Gilbert, the rate at which genome information is sequenced has increased exponentially over the past three decades [A97]. It is still accelerating. Even so, the draft sequence of the entire human genome of approximately 3 billion base pairs (bp) and the genomes of a considerable number of other organisms of medical, agricultural or scientific importance have yet to be determined.

Unfortunately, there is a huge gap between the size of genomes and the speed of present sequencing methods, which are commonly based on gel separation. Even the sequencing of bacterial genomes, which consist of 1 to 10 million bp (in comparison with 1 to 10 billion bp in vertebrates and plants), represents a great challenge.

4.2 Background on DNA Sequencing Techniques

DNA sequencing techniques can be broadly categorized into Electrophoresis-Based and Oligonucleotides-Probe Array methods, both of which will be briefly described.


4.2.1 Electrophoresis-Based Methods

For electrophoresis-based methods, each base position in the DNA chain is determined individually. The earliest systems were pioneered by both Sanger and Maxam-Gilbert, relying on four-lane, high-resolution polyacrylamide gel electrophoresis (PAGE) to separate the labelled DNA fragments and to read the base sequence in the DNA chain in a staggered ladder-like fashion [GH01]. Technically, Sanger’s method is simpler and more time-efficient than that of Maxam-Gilbert; as such, it is more commonly adopted for DNA sequencing applications.

Gradually, manual techniques have been replaced by automated ones owing to advancements in fluorescent/infrared primers or terminator labelling, and highly sensitive detection systems bundled with computer-assisted sequencing software. Technological improvements provide greater automation (absence of sample loading, electrophoresis, and analysis), reduced operator time (no gel pouring is required), ease of use, speed, accuracy, and reliability over traditional slab gel-based sequencing methods.

There exist two variants of automated systems, namely Gel-Based Systems and Capillary-Based Systems. Gel-Based Systems are considered the first generation of automated DNA sequencers; they rely on PAGE to separate fluorescently labelled DNA fragments, similar to those used in manual sequencing. In recent years, there has been a gradual migration towards capillary-based systems, which use either a single capillary or an array of capillaries filled with polyacrylamide or specially developed polymers for separation of DNA fragments.

In theory, electrophoresis-based DNA sequencing techniques can accurately identify sequences of length about 1,000 bp. Empirically, high-quality sequences can only be obtained for DNA fragments of roughly 500 to 800 bp. In addition, automated systems incur higher cost and lower throughput due to the nature of gel electrophoresis.

4.2.2 Oligonucleotides-Probe Array Methods

For Oligonucleotides-Probe Array methods, the complete DNA sequence is assembled based on experimental determination of oligonucleotide content of the DNA chain. Sequencing by hybridization (SBH) is one such promising technique based on an enabling technology known as DNA microarray. DNA microarray is fabricated by high-speed robots, generally on glass substrates, for which probes with known identity are used to determine complementary binding. As such, DNA microarray is a popular technique for facilitating massively parallel gene expression and gene discovery studies.

In the DNA microarray, every possible sequence of the region of interest is represented by an array of oligonucleotides, usually 25 bases long. The target DNA fragment is labelled and hybridized to the microarray, and then scanning detectors are used to monitor the signal. Hybridization occurs only if the target DNA fragment contains a sequence that exactly matches the immobilized oligonucleotides. Using this approach, an experiment with a single DNA microarray can provide information on the existence of 60,000 to 70,000 probes simultaneously, which is a dramatic increase in throughput.

SBH is based on the spectrum generated from DNA microarrays. This technique has the advantages of low cost and high throughput. Currently, it is difficult to apply SBH to large-scale sequencing projects because of problems associated with controlling the specificity of hybridization. However, theoretically, it has much potential over other sequencing techniques.

For sequencing of short DNA sequences, however, there are some promising results. Strezoska et al. (1991) and Morris and Huang (1999) accurately reconstructed the spectrum for DNA samples of length 100 bp and 125 bp, respectively.

4.3 Introduction to Sequencing By Hybridization

Sequencing By Hybridization (SBH) is a DNA sequencing approach allowing assembly of a contiguous DNA sequence from a collection of overlapping oligonucleotide sequences. SBH avoids experimental determination of which base is at a specific position in the DNA chain. In SBH, a length-k probe is taken to be a substring of the DNA sample if it is positively expressed, i.e. if it hybridizes with the sample. The set of positively expressed probes is known as the spectrum of the DNA sample.

As an illustration, consider a single-stranded DNA sample with sequence 5′-ACGCATC-3′. If length-3 probes are used, only five probes, namely ACG, CGC, GCA, CAT, and ATC, will hybridize with the single-stranded DNA fragment 5′-ACGCATC-3′. The analyzed strand can be derived if the positively expressed probes are arranged with a two-base overlap between consecutive probes and read vertically. This example is illustrated in Figure 4.1.
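The spectrum in this example can be computed directly. The following is a minimal Python sketch; the function name `spectrum` is illustrative and not part of the lecture, and, like the notes, it treats a probe as a substring of the sample rather than modelling Watson-Crick complementarity.

```python
def spectrum(sequence, k):
    """Return the set of all length-k substrings (k-mers) of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# The length-3 spectrum of the sample ACGCATC from the text.
probes = spectrum("ACGCATC", 3)
print(sorted(probes))
```

Because the result is a set, it keeps neither the order nor the multiplicity of the probes, which is exactly the information a reconstruction algorithm has to recover.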

Advantages of SBH are its ability to sequence a DNA sample in a high-throughput and inexpensive manner. However, SBH does suffer drawbacks related to the inherent difficulties in determining the existence of k-mers precisely. Even if the detection problem can be solved, Figure 4.2 illustrates SBH’s inability to uniquely determine a particular sequence. This limitation (i.e. the existence of more than one sequence conforming to the same spectrum) arises because the spectrum records neither the order nor the number of occurrences of the probes. Such limitations are dictated primarily by computational power rather than by biochemical reactions.

4.4 Classical Approach to SBH

Three algorithms for reconstructing a DNA sequence based on the microarray spectrum will be introduced, namely Hamiltonian path, Eulerian path, and Gapped probe. The first two algorithms will be covered in the following subsections, and Gapped probe will be discussed in greater detail in a later section.

Figure 4.1: Illustration of Sequencing By Hybridization

Figure 4.2: Two possible sequences can be generated from the same spectrum

4.4.1 SBH and Hamiltonian Path

Given a graph G = (V, E), a Hamiltonian path is commonly defined as a single path P that visits every vertex once and only once. Therefore, a Hamiltonian path $P = \{v_{i_1}, v_{i_2}, \ldots, v_{i_n}\}$ is a permutation of the vertices in V such that there is always an edge between adjacent vertices, i.e. $(v_{i_k}, v_{i_{k+1}}) \in E$.

SBH can be reduced to the Hamiltonian path problem, i.e. to finding a path that visits each vertex (representing each unique probe) exactly once. Given the input spectrum S, the SBH problem can be solved by finding a Hamiltonian path in G = (V, E), which is defined as follows:

1. Each vertex v represents a length-k probe in S, so |V| = |S|.

2. Two vertices are connected by a directed edge (u, v) if the corresponding length-k probes overlap by (k − 1) bases, i.e. the (k − 1)-suffix of the probe of u equals the (k − 1)-prefix of the probe of v.

Figure 4.3 illustrates the graph G corresponding to a spectrum S = {ACG, ATC, CAT, CGC, GCA} containing 5 length-3 probes. The 5 vertices represent the 5 probes in S. Since ACG and CGC overlap by CG, a directed edge exists from the vertex ACG to the vertex CGC, and similarly for the rest of the overlapping pairs. The Hamiltonian path can be easily deduced to be ACG → (CG)C → (GC)A → (CA)T → (AT)C, which visits every vertex exactly once. Given that a directed edge links two overlapping probes, and probes are represented by vertices in the Hamiltonian path, the recovered sequence is ACGCATC.

Figure 4.3: Unique Hamiltonian path can be constructed

Figure 4.4 illustrates an unsuccessful attempt to recover the target DNA sequence from a spectrum S = {ACC, CCA, CCG, CCT, CGC, CTC, GCC, TCC}. This example does not have a unique Hamiltonian path, since two Hamiltonian paths can be constructed from these 8 vertices, as shown in Figure 4.4(a) and 4.4(b). Thus, no unique target DNA sequence can be found.
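Both examples can be checked with a small exhaustive search. The sketch below is only an illustration under the graph definition given above; the function names are hypothetical, and the backtracking search takes exponential time in the worst case, so it is suitable only for tiny spectra.

```python
def overlap_graph(spectrum):
    """Directed edge u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    probes = sorted(spectrum)
    return {u: [v for v in probes if v != u and u[1:] == v[:-1]] for u in probes}


def hamiltonian_sequences(spectrum):
    """Spell out the sequence of every Hamiltonian path in the overlap graph."""
    graph = overlap_graph(spectrum)
    results = []

    def extend(path, visited):
        if len(path) == len(graph):
            # First probe in full, then the last base of each following probe.
            results.append(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for v in graph[path[-1]]:
            if v not in visited:
                visited.add(v)
                path.append(v)
                extend(path, visited)
                path.pop()
                visited.remove(v)

    for start in graph:
        extend([start], {start})
    return sorted(results)


print(hamiltonian_sequences({"ACG", "ATC", "CAT", "CGC", "GCA"}))
print(hamiltonian_sequences({"ACC", "CCA", "CCG", "CCT", "CGC", "CTC", "GCC", "TCC"}))
```

The first spectrum yields a single sequence, while the second yields the two sequences of Figure 4.4.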

Besides the inability to construct a unique Hamiltonian path, the Hamiltonian path problem is known to be NP-complete. However, there are workarounds to improve SBH and to overcome the issue of exponential-time recovery of the target DNA sequence. One approach is to transform SBH into the Eulerian path problem, which will be introduced next.

4.4.2 SBH and Eulerian Path

The idea arises from the observation that the spectrum S can be transformed to a graph G with every edge (rather than vertex) representing a length-k probe in S. Instead of visiting each vertex once and only once, the problem now becomes visiting each edge exactly once. The resulting problem is commonly known as the Eulerian path problem.

Applying this idea, the reconstruction of a DNA sequence using SBH is equivalent to finding an Eulerian path in the corresponding graph G. Given the input spectrum S, the SBH problem can now be solved by finding an Eulerian path in G = (V, E), which is defined as follows:

1. Each vertex v represents a (k − 1)-prefix or a (k − 1)-suffix of any length-k probe in S.

2. For each length-k probe in S, connect the vertices representing its (k − 1)-prefix and (k − 1)-suffix with a directed edge, such that |E| = |S|.

(a) ACCTCCGCCA

(b) ACCGCCTCCA

Figure 4.4: No unique Hamiltonian path can be constructed

Figure 4.5 illustrates the directed graph G corresponding to the spectrum S = {ACG, ATC, CAT, CGC, GCA}. The Eulerian path of this graph can be easily deduced to be AC → CG → GC → CA → AT → TC. Hence, the target DNA sequence is ACGCATC.
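The construction above can be sketched in Python as follows. This is an illustrative sketch rather than the lecture's own code: it uses Hierholzer's algorithm (a standard linear-time method, not named in the notes) to walk the edges, the function name is hypothetical, and it returns only one sequence even when several exist.

```python
from collections import defaultdict


def eulerian_sequence(spectrum):
    """Reconstruct one sequence from a non-empty set of k-mers via an Eulerian path."""
    k = len(next(iter(spectrum)))
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree of each (k-1)-mer
    for probe in sorted(spectrum):
        prefix, suffix = probe[:-1], probe[1:]
        out_edges[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((u for u, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the vertex
    path.reverse()
    seq = path[0] + "".join(v[-1] for v in path[1:])
    # Accept the result only if it spells out exactly the input probes.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return seq if sorted(kmers) == sorted(spectrum) else None


print(eulerian_sequence({"ACG", "ATC", "CAT", "CGC", "GCA"}))
```

The final check guards against spectra whose graph has no Eulerian path, in which case the sketch returns `None`.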

4.4.3 More on Eulerian Path

Figure 4.5: Unique Eulerian path can be constructed

Unlike the Hamiltonian path problem, which requires that each vertex is visited once and only once, the Eulerian path problem mandates that each edge is visited exactly once. The Eulerian path problem can be solved efficiently in linear time, which is a tremendous improvement over solving the Hamiltonian path problem. The Eulerian path problem obeys two properties, as stated in the following two lemmas:

Lemma 4.1 A graph G has an Eulerian path if and only if the following are true:

1. Exactly two vertices u satisfy |indegree(u) − outdegree(u)| = 1.

2. All other vertices v satisfy indegree(v) = outdegree(v).

3. The graph G is connected.

Proof: See CS1102.

Lemma 4.2 An Eulerian path can be recovered in O(n) time.

Proof: See CS1102.

Thus, the transformation of the SBH problem to the Eulerian path problem makes the problem tractable in O(n) time, where n is the length of the DNA sequence.

Figure 4.6 depicts the Eulerian path problem for the spectrum S = {ACC, CCA, CCG, CCT, CGC, CTC, GCC, TCC}. This is the same example given earlier that has no unique Hamiltonian path. Using Lemma 4.1, it is possible to verify whether graph G has an Eulerian path or not. Only vertices AC and CA satisfy |indegree(u) − outdegree(u)| = 1, while all the other vertices satisfy indegree(v) = outdegree(v). Thus, the connected graph G contains an Eulerian path. Moreover, graph G contains more than one Eulerian path. Hence, no unique DNA sequence can be reconstructed from S.
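The degree check of Lemma 4.1 and the non-uniqueness of the path can both be verified for this spectrum. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the enumeration uses plain backtracking over edges, which is exponential in the worst case and is meant only for small examples.

```python
from collections import Counter, defaultdict


def degree_imbalance(spectrum):
    """Map each (k-1)-mer vertex to outdegree minus indegree."""
    balance = Counter()
    for probe in spectrum:
        balance[probe[:-1]] += 1
        balance[probe[1:]] -= 1
    return dict(balance)


def all_eulerian_sequences(spectrum):
    """Enumerate the sequences spelled by every Eulerian path (backtracking)."""
    probes = sorted(spectrum)
    out_edges = defaultdict(list)
    for i, probe in enumerate(probes):
        out_edges[probe[:-1]].append(i)  # edge i leaves the prefix vertex
    results = set()

    def extend(vertex, used, seq):
        if len(used) == len(probes):
            results.add(seq)
            return
        for i in out_edges.get(vertex, []):
            if i not in used:
                used.add(i)
                extend(probes[i][1:], used, seq + probes[i][-1])
                used.remove(i)

    for start in list(out_edges):
        extend(start, set(), start)
    return sorted(results)


S = {"ACC", "CCA", "CCG", "CCT", "CGC", "CTC", "GCC", "TCC"}
print({v: b for v, b in degree_imbalance(S).items() if b != 0})
print(all_eulerian_sequences(S))
```

Only AC and CA are unbalanced, and the enumeration returns the same two sequences that appeared as Hamiltonian paths in Figure 4.4.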

4.4.4 When can we uniquely reconstruct a sequence?

Here is an observation: when the sequence contains interleaved pairs of repeated (k − 1)-tuples, it is not possible to uniquely reconstruct it from its spectrum S.

As an illustration, consider k = 4 and the spectrum S = {ATTA, TTAC, TACG, ACGT, CGTT, GTTT, TTTC, TTCC, TCCA, CCAA, CAAT, AATC, ATCA, TCAA, CAAC, AACG, ACGA, CGAT, GATC, ATCG, TCGC, CGCC, GCCA, CCAG, CAGC, AGCC}. There are two possible sequences which can be reconstructed from this spectrum, which are as follows:


Figure 4.6: No unique Eulerian path can be constructed

attACGtttCCAatcaACGatcgCCAgcc

attACGatcgCCAatcaACGtttCCAgcc

It can be observed that the ambiguity is caused by two interleaved pairs, i.e. ACGtttCCA and ACGatcgCCA.

To calculate the maximum length of a sequence that can be reconstructed by reducing the SBH problem to the Eulerian path problem, first consider the spectrum S of a length-n sequence X, as shown in Figure 4.7.

Figure 4.7: Maximal length sequence

Next, calculate the expected number of interleaved pairs of repeated (k − 1)-tuples in X. There exist $\binom{n}{4}$ possible choices for the 4 positions. Since two (k − 1)-tuples must be the same as the other two, the probability is $(\frac{1}{4})^{2(k-1)}$. Since the k-th base should be different, the probability is $\frac{3}{4}$. Therefore, the expected number of interleaved pairs of repeated (k − 1)-tuples in X is $\binom{n}{4} (\frac{1}{4})^{2(k-1)} (\frac{3}{4})$.

To ensure uniqueness in classical SBH, it is required that $\binom{n}{4} (\frac{1}{4})^{2(k-1)} (\frac{3}{4}) < 1$. Approximately, $\frac{n^4}{24} < \frac{4^{2k-1}}{3} = \frac{2^{4k}}{12}$. Thus, $n < 2^{0.25} \cdot 2^k$. In other words, the length of an unambiguously reconstructible sequence should be shorter than $2^{0.25} \cdot 2^k$. Dyer et al. [DFS94] and Arratia et al. [AMRW96] gave a tighter bound. In practice, k = 8. So the maximum length of a sequence that can be uniquely reconstructed by SBH will be shorter than 305. This means that classical SBH based on the k-gram probing scheme performs even worse than electrophoresis-based DNA sequencing methods.
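The arithmetic behind the figure of 305 can be reproduced numerically. The following Python sketch simply evaluates the expressions derived above; the function names are illustrative.

```python
from math import comb


def expected_interleaved_pairs(n, k):
    """C(n, 4) * (1/4)^(2(k-1)) * (3/4), the expected count derived above."""
    return comb(n, 4) * (1 / 4) ** (2 * (k - 1)) * (3 / 4)


def approximate_length_bound(k):
    """The approximate bound n < 2^0.25 * 2^k."""
    return 2 ** 0.25 * 2 ** k


print(approximate_length_bound(8))         # a little above 304
print(expected_interleaved_pairs(300, 8))  # below 1
print(expected_interleaved_pairs(320, 8))  # above 1
```

For k = 8 the expected count crosses 1 between n = 300 and n = 320, in line with the approximate bound.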

Are there any solutions to improve classical SBH?

4.5 New Approaches for Solving SBH

The limitations of SBH in large-scale genome sequencing are mainly due to the drawback that classical SBH is usually unable to uniquely reconstruct long DNA sequences. Many new approaches have been proposed to overcome this limitation. Three of them are Positional Sequencing by Hybridization, Sequencing using Interactive Protocols, and Gapped probes. The application of Gapped probes will be detailed in the next section.

4.5.1 Positional Sequencing by Hybridization

Positional Sequencing by Hybridization (PSBH) handles ambiguity by adding extra information about the probes. This approach assumes that additional information on the possible positions of the probes is given. During the Eulerian path reconstruction, each probe is only allowed to appear in those preregistered locations. Limiting the possible positions of probes introduces many constraints to the sequencing problem and prunes many redundant alternative solutions.

Mathematically, the PSBH problem can be translated to the positional Eulerian path problem (PEP), i.e. given a directed graph with a list of allowed positions on each edge, decide if there exists an Eulerian path in which each edge only appears in one of its allowed positions. Hannenhalli et al. [HPLS96] showed that PEP is NP-complete, even if all the lists of allowed positions are intervals of equal length. Furthermore, Ben-Dor et al. [BPSS99] proved that PSBH is NP-complete, even if each probe has at most three allowed positions.

Given that PSBH is NP-complete, sufficiently practical solutions have yet to be found.

4.5.2 Sequencing using Interactive Protocols

Sequencing using Interactive Protocols is another approach proposed to differentiate between possible sequences. In this approach, a set of candidate sequences C (which may result from classical SBH or other experiments) is given as input, and the goal is to identify which member of C is the true sequence S. Skiena and Sundaram [SS95] have demonstrated that subsequence queries are significantly more powerful than substring queries. However, the application of this approach is limited because of technical issues in dealing with the combinatorial explosion of storing all $4^k$ substrings for even modest-sized k.

4.5.3 Gapped Probes

Gapped probes seek to improve the performance of classical SBH by changing the probing scheme to increase the length of sequence that can be reconstructed. Gapped probes are probes represented as a string of symbols from the alphabet A = {A, C, G, T, *}. The new symbol * means “don’t care”, or universal base. It is assumed that * is Watson-Crick complementary to any of the bases A, C, G and T. A gapped probe would then bind to the target string if and only if there is a substring of the target that is Watson-Crick complementary to the probe. Frieze et al. [FPU99] showed that an SBH algorithm based on the (s, r)-gapped probes defined below can uniquely reconstruct sequences of expected length asymptotic to the theoretical information upper bound, substantiating that gapped probes can indeed improve the efficiency of classical SBH.

Definition 4.3 For fixed parameters s and r, the set of (s, r)-gapped probes is given as all probes of the form $X^s(U^{s-1}X)^r$, where X ranges over the 4 standard DNA bases (A, C, G, and T) and U is the universal base. The U in this context is not to be confused with the U base in RNA.

Definition 4.4 For fixed parameters s and r, the probe pattern is given as $1^s(0^{s-1}1)^r$.

Definition 4.5 For fixed parameters s and r, the probe length is given as v = s(r + 1).

As an illustration, the (2,2)-probing scheme has probe pattern 110101, $4^4$ different probes, and probe length v = 2(2 + 1) = 6. The current state-of-the-art (4,4)-probing scheme has probe pattern 11110001000100010001, $4^8$ different probes, and probe length v = 4(4 + 1) = 20.

Gapped probes are best illustrated with Figure 4.8. A DNA sample of sequence ACGCATCGGC hybridizes to a spectrum of gapped probes of pattern 110101, where ‘1’ corresponds to the columns of the spectrum that have any of the 4 standard DNA bases (A, C, G, and T), and ‘0’ corresponds to the columns of the spectrum that have *. This example leads to the problem of how to reconstruct the sequence from a spectrum based on gapped probes. Solutions to this problem are given in the next two sections.
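The probe pattern of Definition 4.4 and the gapped spectrum of the sample can be generated with a few lines of Python. This is a minimal sketch with hypothetical function names; as in the rest of these notes, a probe is compared directly with a window of the sample instead of with its Watson-Crick complement.

```python
def probe_pattern(s, r):
    """Pattern 1^s (0^(s-1) 1)^r of Definition 4.4, as a string of '0' and '1'."""
    return "1" * s + ("0" * (s - 1) + "1") * r


def gapped_spectrum(sequence, pattern):
    """Mask every length-v window of the sequence with the probe pattern."""
    v = len(pattern)
    return {
        "".join(base if bit == "1" else "*"
                for base, bit in zip(sequence[i:i + v], pattern))
        for i in range(len(sequence) - v + 1)
    }


pattern = probe_pattern(2, 2)
print(pattern, len(pattern), 4 ** pattern.count("1"))
print(sorted(gapped_spectrum("ACGCATCGGC", pattern)))
```

The number of distinct probes is 4 raised to the number of '1' positions, which gives $4^4$ for the (2,2) scheme and $4^8$ for the (4,4) scheme.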


Figure 4.8: Illustration of Gapped Probes

4.6 Noiseless Spectrum Standard Reconstruction Algorithm

4.6.1 Algorithm Description

Given a putative sequence t and a probe p of length v, if the (v − 1)-prefix of p matches the (v − 1)-suffix of t (with * matching any base), probe p is called a feasible extension of the sequence t.

The putative sequence is constructed symbol by symbol, starting with a primer (a short initial sequence) of the sequence. Every step attempts to find all feasible extensions. If there is exactly one feasible extension p, the putative sequence is extended by the last symbol of p. If there is more than one feasible extension, breadth-first search (BFS) is used to simultaneously extend all possible paths. When, at some later step, certain paths do not have any feasible extension, those paths are terminated and discarded. BFS continues until either only one path is left (this path is then extended as before), or the search tree grows to some predefined depth H, in which case the algorithm terminates immediately. The algorithm is stated as follows:

1. Let t ← primer (a short initial sequence).

2. Find the set C of all feasible extensions of t.

3. If |C| = 1, then extend t with the last symbol of the feasible extension.

4. If |C| > 1, use BFS until there is only one path left, then return to step 2. If the tree grows to height H, then reconstruction fails.

5. If t terminates with the end primer, then stop with success; otherwise, return to step 2.


Figure 4.9: Spectrum probes used in Noiseless Spectrum Algorithm

The algorithm is illustrated as follows. Assume the sample DNA sequence is ACGCATCGGATA and all available (2, 2)-probes used for the sequence are shown in Figure 4.9. Suppose that the primer (initial putative sequence) is ACGCA. Searching through the spectrum table, exactly one feasible extension, i.e. AC*C*T, can be found. The putative sequence is extended by appending the last symbol T of the feasible extension. Now the putative sequence becomes ACGCAT (Figure 4.10). Repeating step 2 (Figure 4.11), two feasible extensions CG*A*C and CG*A*A are found. There exist two possible paths, ACGCATC and ACGCATA. The probe CG*A*A is known as a fooling probe, as it leads to the incorrect path. BFS is used to simultaneously extend both paths. Both putative sequences ACGCATC and ACGCATA can be extended by GC*T*G (Figure 4.12). However, ACGCATAG has no feasible extension thereafter, so it is deleted. Continuing with the path ACGCATCG and repeating step 2, the final putative sequence is found to be ACGCATCGG (Figure 4.13).
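The walk-through above can be reproduced with a short Python sketch. It rests on several assumptions that are not in the notes: the spectrum is generated from the sample itself, the end primer is taken to be the last five bases GGATA, a path is accepted as soon as it ends with the end primer, and all function names are hypothetical.

```python
def gapped_spectrum(sequence, pattern):
    """All length-v windows of the sequence, masked by the probe pattern."""
    v = len(pattern)
    return {
        "".join(b if bit == "1" else "*" for b, bit in zip(sequence[i:i + v], pattern))
        for i in range(len(sequence) - v + 1)
    }


def matches(probe_prefix, text):
    """True if the masked probe prefix agrees with the text; '*' matches any base."""
    return all(p == "*" or p == c for p, c in zip(probe_prefix, text))


def reconstruct(spectrum, primer, end_primer, max_depth):
    """Extend all live paths level by level (BFS), as in steps 1 to 5 above."""
    v = len(next(iter(spectrum)))
    paths, depth = [primer], 0  # depth = consecutive levels with several live paths
    while True:
        finished = [t for t in paths if t.endswith(end_primer)]
        if finished:
            return finished[0] if len(finished) == 1 else None
        # A probe is a feasible extension if its (v-1)-prefix matches the path's suffix.
        extended = sorted({t + p[-1] for t in paths for p in spectrum
                           if matches(p[:-1], t[-(v - 1):])})
        if not extended:
            return None  # every path died
        depth = depth + 1 if len(extended) > 1 else 0
        if depth > max_depth:
            return None  # search tree exceeded height H
        paths = extended


probes = gapped_spectrum("ACGCATCGGATA", "110101")
print(reconstruct(probes, "ACGCA", "GGATA", max_depth=4))
```

With the fooling probe CG*A*A present, the search branches for two levels before the wrong path ACGCATAG dies, so a depth limit below 2 makes the sketch report failure.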

Figure 4.10: Illustration of Noiseless Spectrum Algorithm (First Iteration)


Figure 4.11: Illustration of Noiseless Spectrum Algorithm (Second Iteration)

Figure 4.12: Illustration of Noiseless Spectrum Algorithm (Third Iteration)

Figure 4.13: Illustration of Noiseless Spectrum Algorithm (Final Iteration)


4.6.2 Analysis of Failure Modes

If there exist too many fooling probes, this algorithm will fail. For example, if there are further fooling probes that extend the wrong path ACGCATAG, eventually the search tree grows to level H and the algorithm terminates. In general, there are two cases where the algorithm will be unsuccessful in reconstructing the sequence, denoted as failure mode 1 and failure mode 2 respectively.

In the case of failure mode 1, as illustrated in Figure 4.14, two extant paths are identical except in their initial symbols. Consider the spectrum of a sequence CCTACGCATCGGATAC...; the set of fooling probes CC*A*A, TA*A*A, CA*A*C, AC*T*G leads to a fooling sequence CCTACACATCA. This sequence differs from the correct sequence at the 6th position. Positions 7 to 11 are all supported by the fooling probes. All symbols after the 11th position of this fooling sequence will be supported by the feasible extensions of the correct sequence. Consequently, this incorrect path extends to the full length of the correct path. The probability that failure mode 1 occurs is denoted as P1(m).

Figure 4.14: Illustration of failure mode 1

In the case of failure mode 2, as illustrated in Figure 4.15, two paths of the search tree represent two distinct portions of the DNA sample. The incorrect path is due to a set of fooling probes extending from another part of the DNA sample. This leads to a fooling sequence that differs from the correct sequence from the 6th position onwards. Starting from position 6, this fooling sequence is supported by the fooling probes and extends to the full length of the correct path. The probability that failure mode 2 occurs is denoted as P2(m).

Figure 4.16 shows the success rate of the Noiseless Spectrum Standard Reconstruction algorithm, where the smooth and bent curves correspond to the theoretical and practical success rates of the algorithm respectively. Success rate ≈ 1 − P1 − P2. It can be observed that, in theory, using (s, r)-probes, the Noiseless Spectrum Standard Reconstruction algorithm is able to reconstruct sequences of length $4^{s+r}$ with high probability. For example, with (4, 4)-probes, it can reconstruct sequences of length greater than 10,000. Although the performance of this algorithm is good, it is not realistic, as it assumes that the spectrum is perfect without noise. This assumption is invalid in practice. The next section introduces the improvement made to handle noisy input data.

Figure 4.15: Illustration of failure mode 2

Figure 4.16: Success rate of the Noiseless Spectrum Standard Reconstruction algorithm

4.7 Noisy Spectrum Standard Reconstruction Algorithm

There are basically two types of errors in a hybridization model, namely:

1. False Negative: Missing probe in the spectrum.

2. False Positive: Presence of probes which do not belong to the sequence’s spectrum. Also known as false-hit error.

Typically, studies considering hybridization error use the following error model:

1. Any existing probe in the spectrum can be suppressed with a fixed probability ε (False Negatives).

2. Any probe at Hamming distance 1 from an existing probe may be added into the spectrum with a fixed probability ε′ (False Positives).

Hybridization noise is expressed in terms of the error rates ε and ε′ for false negatives and false positives respectively. Intuitively, a false positive is not as detrimental as a false negative. The presence of a false negative will certainly cause the Noiseless Spectrum Standard Reconstruction Algorithm to fail, because only existing probes are considered for feasible extension. In contrast, false positives result in more probes than required in the spectrum (and consequently more fooling probes). As long as the rate of false positives remains adequately small, its effect can be neglected. Thus, the discussion places greater emphasis on false negatives.

The Noisy Spectrum Standard Reconstruction Algorithm is an improved algorithm that deals with false negatives by recovering from probes that have been suppressed by hybridization error. The algorithm is based on the following modifications:

1. The spectrum query always returns four scored extensions (one for each DNA base). An extension has a score of 0 if it is feasible (the corresponding probe is present in the spectrum) and a score of 1 otherwise.

2. All paths are extended in a breadth-first manner using additive scoring, i.e. the score of a path is the sum of the scores of all probes in the path.

3. A path is terminated whenever its score exceeds that of the minimum-score path by at least the threshold θ.

4. The tree construction is pursued up to a maximum depth H (a program parameter).
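The modifications listed above can be sketched as a level-by-level search. This is an illustrative simplification under several assumptions, not the algorithm of [LPSW02]: probes are plain contiguous k-mers, the first k bases (`seed`) and the target `length` are known, a path is pruned when its score minus the minimum score at the same level is at least θ, and the depth bound H is not modelled separately (the search simply runs to the target length). The function name `reconstruct` is hypothetical.

```python
BASES = "ACGT"

def reconstruct(spectrum, seed, length, theta=1):
    """Breadth-first reconstruction with additive scoring and relative pruning.

    Returns all surviving (sequence, score) pairs, best score first.
    """
    k = len(seed)
    paths = [(seed, 0)]
    for _ in range(length - k):
        extended = []
        for seq, score in paths:
            suffix = seq[len(seq) - k + 1:]  # last k-1 bases of the path
            for base in BASES:
                # Score 0 if the probe is in the spectrum, 1 otherwise.
                penalty = 0 if suffix + base in spectrum else 1
                extended.append((seq + base, score + penalty))
        best = min(score for _, score in extended)
        # Terminate paths whose score exceeds the minimum by at least theta.
        paths = [(seq, score) for seq, score in extended if score - best < theta]
    return sorted(paths, key=lambda p: (p[1], p[0]))

if __name__ == "__main__":
    target = "ATGCGTCAAG"
    clean = {target[i:i + 4] for i in range(len(target) - 3)}
    # One false negative (GTCA) and one false positive at Hamming distance 1 (GTCT).
    noisy = (clean - {"GTCA"}) | {"GTCT"}
    print(reconstruct(clean, "ATGC", 10, theta=1))
    print(target in [s for s, _ in reconstruct(noisy, "ATGC", 10, theta=1)])
    print(reconstruct(noisy, "ATGC", 10, theta=2)[0])
```

In this toy input, θ = 1 on the noisy spectrum follows the false-positive probe and discards the correct path, while θ = 2 keeps the correct path alive long enough for it to end with the lowest score, at the price of carrying many more candidate paths per level.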

Observe that when θ = 1, the algorithm corresponds to the Noiseless Spectrum Standard Reconstruction Algorithm: after the query returns the four possible extensions, those without support from the spectrum (i.e. with a score of 1) are suppressed. Setting θ > 1 therefore allows recovery from false negatives. Successful reconstruction of the whole DNA sequence still requires the absence of fooling probes, particularly when a spurious path is more likely to occur than closely spaced false negatives (which cause the score of the correct path to exceed the threshold θ). However, using θ > 1 causes the depth of the branching to be always greater than 1. This entails constructing a larger tree, with a corresponding increase in computational cost.

Figures 4.18 to 4.30 illustrate in detail the steps required to execute the algorithm.


4.7.1 Analysis of Failure Modes

Three failure modes of the reconstruction process can be identified. Two of them were described for the Noiseless Spectrum Standard Reconstruction Algorithm in the previous section; both are caused by the presence of a large number of fooling probes. The third failure mode occurs when the correct path, which contains false negatives, is killed by a path that contains fooling probes. Such wrong-path elimination usually happens when there are enough false negatives and at least one spurious path exists whose score is smaller than that of the correct path by at least the threshold θ. The algorithm then prunes the correct path and the reconstruction fails. A detailed analysis of this failure mode is given in [LPSW02]. A graphical summary of the analysis is given in Figure 4.17, where Success rate ≈ 1 − P1 − P2 − P3. P3 is the probability that failure mode 3 occurs.
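The third failure mode can be made concrete with a small calculation on path scores. The sketch below is only an illustration of the pruning condition, assuming the rule "prune when the correct path's cumulative score exceeds a rival's by at least θ"; the function name `pruned_at` and the penalty sequences are hypothetical and are not taken from [LPSW02].

```python
from itertools import accumulate

def pruned_at(correct_penalties, rival_penalties, theta):
    """Return the first depth (1-based) at which the correct path would be
    pruned relative to the rival path, or None if it survives.

    Each penalty is 0 (probe present in the spectrum) or 1 (probe missing).
    """
    correct = accumulate(correct_penalties)
    rival = accumulate(rival_penalties)
    for depth, (c, r) in enumerate(zip(correct, rival), start=1):
        if c - r >= theta:
            return depth
    return None

if __name__ == "__main__":
    # Correct path with two closely spaced false negatives, versus a spurious
    # path supported by fooling probes for its first three extensions.
    print(pruned_at([0, 1, 1, 0], [0, 0, 0, 1], theta=2))
    # A single false negative is tolerated with the same threshold.
    print(pruned_at([0, 1, 0, 0], [0, 0, 0, 1], theta=2))
```

With θ = 2, two adjacent false negatives put the correct path two points behind the spurious one at depth 3, so it is eliminated before the spurious path runs out of fooling probes; a single false negative never opens a gap that large.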

In conclusion, the Noisy Spectrum Standard Reconstruction Algorithm can reconstruct a DNA sequence in the presence of noise; however, its complexity is exponential, which may slow down the sequencing process.

Figure 4.17: Experimental performance for Θ = 2

References

[LPSW02] Leong, H.W., Preparata, F.P., Sung, W.K. and Willy, H., “On the control of hybridization noise in DNA Sequencing-by-Hybridization.”, WABI, 2002.

[FPU99] Frieze, A.M., Preparata, F.P. and Upfal, E., “Optimal Reconstruction of a Sequence From its Probes.”, J. Comp. Biol., vol 6, 361-368, 1999.


Figure 4.18: Noisy Spectrum Standard Reconstruction Algorithm using Reverse Probe

Figure 4.19: Noisy Spectrum Standard Reconstruction Algorithm (Step 1)

[PU00] Preparata, F.P. and Upfal, E., “Sequencing-by-hybridization at the information-theory bound: an optimal algorithm.”, J. Comp. Biol., vol 7, 621-630, 2000.

[BPSS99] Ben-Dor, A., Pe’er, I., Shamir, R. and Sharan, R., “On the Complexity of Positional Sequencing by Hybridization.”, Proc. CPM ’99, 88-100, 1999.

[SS95] Skiena, S.S. and Sundaram, G., “Reconstructing Strings from Substrings.”, J. Comp. Biol., vol 2, 333-353, 1995.

[HPLS96] Hannenhalli, S., Pevzner, P., Lewis, H. and Skiena, S., “Positional sequencing by hybridization.”, Computer Applications in the Biosciences, vol 12, 19-24, 1996.

[DFS94] Dyer, M.E., Frieze, A.M. and Suen, S., “The probability of unique solutions of sequencing by hybridization.”, J. Comp. Biol., vol 1, 105-110, 1994.

Figure 4.20: Noisy Spectrum Standard Reconstruction Algorithm (Step 2)

Figure 4.21: Noisy Spectrum Standard Reconstruction Algorithm (Step 3)

Figure 4.22: Noisy Spectrum Standard Reconstruction Algorithm (Step 4)

Figure 4.23: Noisy Spectrum Standard Reconstruction Algorithm (Step 5)

Figure 4.24: Noisy Spectrum Standard Reconstruction Algorithm (Step 6)

Figure 4.25: Noisy Spectrum Standard Reconstruction Algorithm (Step 7)

Figure 4.26: Noisy Spectrum Standard Reconstruction Algorithm (Step 8)

Figure 4.27: Noisy Spectrum Standard Reconstruction Algorithm (Step 9)

Figure 4.28: Noisy Spectrum Standard Reconstruction Algorithm (Step 10)

Figure 4.29: Noisy Spectrum Standard Reconstruction Algorithm (Step 11)

Figure 4.30: Noisy Spectrum Standard Reconstruction Algorithm (Step 12)

[PP91] Pevzner, P.A., Lysov, Y.P., Khrapko, K.R., Belyavsky, A.V., Florentiev, V.L. and Mirzabekov, A.D., “Improved chips for sequencing by hybridization.”, J. Biomolecul. Struct. & Dynamics, vol 9, 399-410, 1991.

[A97] Alphey, L., “DNA Sequencing - From experimental methods to bioinformatics.”, Springer, 1997.

[GH01] Graham, C.A. and Hill, A.J.M., “DNA Sequencing Protocols, 2nd ed.”, Humana Press, 2001.

[AFV94] Adams, M.D., Fields, C. and Venter, J.C., “Automated DNA Sequencing and Analysis.”, Academic Press, 1994.

[AMR96] Arratia, R., Martin, D., Reinert, G. and Waterman, M.S., “Poisson process approximation for sequence repeats, and sequencing by hybridization.”, J. Comp. Biol., vol 3, 425-484, 1996.