
http://www.elsevier.com/locate/jcss

Journal of Computer and System Sciences 67 (2003) 325–340

Fitting algebraic curves to noisy data

Sanjeev Arora¹ and Subhash Khot*,²

Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08544, USA

Received 10 July 2002; revised 10 December 2002

Abstract

We introduce the following problem, which is motivated by applications in vision and pattern detection: We are given pairs of datapoints $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m) \in [-1,1] \times [-1,1]$, a noise parameter $\delta > 0$, a degree bound $d$, and a threshold $\rho > 0$. We desire an algorithm that enlists every degree $d$ polynomial $h$ such that

$|h(x_i) - y_i| \le \delta$ for at least $\rho$ fraction of the indices $i$.  (1)

If $\delta = 0$, this is just the list decoding problem that has been popular in complexity theory and for which Sudan gave a $\mathrm{poly}(m, d)$ time algorithm. However, for $\delta > 0$, the problem as stated becomes ill-posed and one needs a careful reformulation (see the Introduction). We prove a few basic results about this (reformulated) problem. We show that the problem has no polynomial-time algorithm (our counterexample works for $\rho = 0.5$). This is shown by exhibiting an instance of the problem where the number of solutions is as large as $\exp(d^{0.5-\epsilon})$ and every pair of solutions is far from each other in $\ell_\infty$ norm. On the algorithmic side, we give a rigorous analysis of a brute-force algorithm that runs in exponential time. Also, in surprising contrast to our lower bound, we give a polynomial-time algorithm for learning the polynomials assuming the data is generated using a mixture model in which the mixing weights are "nondegenerate."

© 2003 Elsevier Inc. All rights reserved.

Keywords: Curve Fitting; Noisy polynomial reconstruction; List decoding; Learning theory; Vision

1. Introduction

In this paper, we study the following problem motivated by applications in vision (see Section 5) and also interesting from the viewpoint of learning theory, approximation theory, and the list decoding problem in complexity theory.


* Corresponding author.

E-mail addresses: [email protected] (S. Arora), [email protected] (S. Khot).

¹ Supported by David and Lucile Packard Fellowship, NSF Grant CCR-0098180, and NSF ITR Grant CCR-0205594. Work partially done while visiting the Computer Science Department at UC Berkeley.

² Same funding as Arora.

0022-0000/03/$ - see front matter © 2003 Elsevier Inc. All rights reserved.

doi:10.1016/S0022-0000(03)00012-6


Definition 1 (Noisy polynomial reconstruction problem). We are given pairs of datapoints $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, a noise parameter $\delta > 0$, a degree bound $d$, and a threshold $\rho > 0$. We have to find every degree $d$ polynomial $h$ satisfying

$|h(x_i) - y_i| \le \delta$ for at least $\rho$ fraction of the indices $i$.  (2)

Such a polynomial will be called a $(\rho, \delta)$-fitting polynomial.

In the noise-free case (i.e. $\delta = 0$), this is exactly the list decoding problem for which Sudan gave [15] an algorithm that runs in time $\mathrm{poly}(m, d)$ and works even for subconstant values of $\rho$. However, in the case we are interested in, i.e. when $\delta > 0$, the problem as stated becomes ill-posed and we need to reformulate it carefully.

An immediate objection to Definition 1 is that there could be uncountably many $(\rho, \delta)$-fitting polynomials. Indeed, if there exists a polynomial that is $(\rho, \gamma)$-fitting for some $\gamma < \delta$, then perturbing its coefficients gives uncountably many polynomials that are $(\rho, \delta)$-fitting. To avoid this technicality, we instead demand a collection $C$ of polynomials such that every $(\rho, \delta)$-fitting polynomial $h$ is "close" to some polynomial $f \in C$. The notion of closeness we use in this paper is closeness in $\ell_\infty$ norm.³ Since the data itself is corrupted by noise up to $\delta$, it is reasonable to allow a further margin for approximation. We demand that every $(\rho, \delta)$-fitting polynomial $h$ is $3\delta$-close in $\ell_\infty$ norm to some polynomial in the desired collection $C$.

Another issue of concern is whether one should have some bound on the values $x_i, y_i$ of the datapoints. We will assume that $x_i, y_i \in [-1, 1]$. The bound on the $x$-coordinates can be assumed w.l.o.g. by scaling. The bound on the $y$-coordinates is necessary to make the problem well defined (see Section 2 for a discussion of this issue). Thus, all polynomials in the paper will be required to take values in $[-1, 1]$ over the interval $[-1, 1]$, and we call such polynomials "well-behaved" polynomials.

With this reformulation of the problem, we are ready to describe the main results of the paper.
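To make the $3\delta$-closeness criterion concrete, here is a minimal numerical sketch (our own illustration, not from the paper; the function name is ours) that estimates the $\ell_\infty$ distance between two polynomials given as Chebyshev coefficient vectors by evaluating them on a dense grid.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def linf_distance(c_f, c_g, grid=10001):
    """Estimate ||f - g||_inf over [-1, 1] for two Chebyshev series.

    c_f, c_g are coefficient vectors in the standard Chebyshev basis;
    a dense grid stands in for the supremum, which is adequate for
    low-degree, well-behaved polynomials.
    """
    xs = np.linspace(-1.0, 1.0, grid)
    return float(np.max(np.abs(chebval(xs, c_f) - chebval(xs, c_g))))

# A candidate h would be accepted as "close to" g exactly when
# linf_distance(c_h, c_g) <= 3 * delta.
```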

For most of the paper we assume that the data is produced using a mixture model. Our algorithms work only in this restricted model. Moreover, the most interesting result in the paper is a negative result (see Theorem 6) that holds in the mixture model and therefore holds for the general problem as well (i.e., the problem posed in Definition 1 along with its reformulation).

Definition 2 (Mixture model and the mixture learning problem). In the mixture model for generating datapoints, there is a set of $k$ well-behaved polynomials of degree $d$ and a noise parameter $\delta$. The data is produced by picking a sequence of random points in $[-1, 1]$ and, for each point, randomly picking the value of one of the $k$ polynomials at that point and adding (in an adversarial manner) a noise of up to $\delta$ to this value.

In the mixture learning problem the algorithm is given a random sample from the above mixture, and told $k$, $d$, and $\delta$. It has to reconstruct the polynomials in the mixture within $\ell_\infty$ error at most $3\delta$.
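As a reference point for the rest of the paper, the following small sketch (ours; uniform noise is only a stand-in for the adversarial noise allowed by the model, and the polynomials are assumed given as Chebyshev coefficient vectors) implements the data-generating process of Definition 2.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def sample_mixture(polys, m, delta, weights=None, seed=0):
    """Draw m datapoints from a mixture of well-behaved polynomials.

    polys: list of Chebyshev coefficient vectors. weights default to
    uniform, matching Definition 2. The model allows adversarial noise
    of magnitude up to delta; uniform noise is used here only as a
    convenient stand-in.
    """
    rng = np.random.default_rng(seed)
    xs = rng.uniform(-1.0, 1.0, size=m)
    which = rng.choice(len(polys), size=m, p=weights)
    ys = np.array([chebval(x, polys[k]) for x, k in zip(xs, which)])
    ys += rng.uniform(-delta, delta, size=m)
    return xs, ys
```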


³ For two polynomials $f$ and $g$, their $\ell_\infty$ distance is defined to be $\|f - g\|_\infty = \sup_{x \in [-1,1]} |f(x) - g(x)|$. Their $\ell_2$ distance is defined as $\|f - g\|_2 = \left( \int_{-1}^{1} \frac{|f(x) - g(x)|^2}{\sqrt{1 - x^2}} \, dx \right)^{1/2}$. This definition of the $\ell_2$ norm is standard in Chebyshev approximation, as explained later.


We will also be interested in the special case of the mixture learning problem when $k = 1$. We state this as the Basic problem.

Definition 3 (Basic problem). There is an unknown well-behaved degree $d$ polynomial $g(x)$. Given $m$ random points $a_1, a_2, \ldots, a_m \in [-1, 1]$ and $m$ values $b_1, b_2, \ldots, b_m$ where $b_i \in [g(a_i) - \delta, g(a_i) + \delta]$, how large must $m$ be before we can efficiently reconstruct $g(x)$ within $\ell_\infty$ norm $3\delta$?

The problems we consider have a natural place in the classical theory of approximation and curve fitting [12]. The method of least-square errors is designed for fitting polynomials to noisy data, but it does not suffice even for the Basic problem, because it assumes $\ell_2$ noise whereas we assume the noise is adversarial and only that it is bounded in $\ell_\infty$ norm. In Section 3 we show a procedure for which $O\left(\frac{d^2}{\delta} \log\frac{d}{\delta}\right)$ datapoints suffice. This may be seen as an analogue of the least-square fit for adversarial $\ell_\infty$ noise.

We do not know of any efficient classical algorithm, or even an approach, for the general problem in Definition 1 and its reformulation. The problem seems to require exponential search (especially if $\rho < 1/2$, when a majority of the datapoints may be "outliers") because of the need to identify a sufficiently large subset of datapoints which are all described by the same polynomial, in the sense of (2). In fact, a basic tool of machine vision called the Hough transform, invented in 1962 to detect particle tracks in bubble chamber images, takes precisely such a brute-force approach [6,10,11]. However, it is not clear a priori whether an exponential search is inherent, especially in view of Sudan's list decoding algorithm [15], which solves the problem in time polynomial in $d, 1/\rho$ in the noise-free case. In fact, extending Sudan's algorithm was the original motivation of our paper.

Remark. Actually, the reconstruction problem in which the data comes from a mixture model was solved even earlier for the noiseless case by Ar et al. [2]. We emphasize that in Sudan's and Ar et al.'s papers, the parameter $\rho$ represents the noise, whereas for us the parameter $\delta$ represents the noise. We consider Sudan's and Ar et al.'s results as the noise-free case, i.e. $\delta = 0$.

In this paper we prove some positive and negative results that shed light on the status of the noisy polynomial reconstruction problem. Our positive results hold only for the mixture learning problem. We first show the following positive result for the Basic problem.

Theorem 4. There is a polynomial-time algorithm that solves the Basic problem for degree $d$ polynomials and noise parameter $\delta$ by sampling a data set of size $m = O\left(\frac{d^2}{\delta} \log\frac{d}{\delta}\right)$. The algorithm reconstructs the unknown polynomial within $\ell_\infty$ error $3\delta$ with high probability over the choice of the sample.

We use this result to give an algorithm for the mixture learning problem.

Theorem 5. There is an algorithm that learns a mixture of $k = O(1)$ polynomials with degree $d$ and noise parameter $\delta$ in time $\exp\left(O\left(k d^2 \log(d/\delta)/\delta\right)\right)$. The algorithm reconstructs all the $k$ polynomials in the mixture up to an $\ell_\infty$ error $3\delta$ with high probability over the choice of the sample. The size of the sample used is $O\left(\frac{k d^2}{\delta} \log\frac{d}{\delta}\right)$.

This theorem uses a brute-force approach, and it can be viewed as a rigorous upper bound on the running time of the Hough transform. To the best of our knowledge, the Hough transform had not been precisely analysed before, though recently, independently of our work, Dasgupta et al. [4] analyzed an efficient version of the Hough transform for $d = 1$ (the running time and the sample size needed in their algorithm are polynomial in $1/\delta$).

The running time of the algorithm in Theorem 5 can be improved to $\exp\left(O\left(\frac{kd}{\delta} \log\frac{d}{\delta}\right)\right)$ in the special case when the polynomials output by the procedure are only required to describe the datapoints whose $x$-coordinates lie within $[-1 + \eta, 1 - \eta]$ for some small but fixed $\eta > 0$. This condition seems to arise from boundary effects, since the algorithm has no information about the curves outside the interval $[-1, 1]$. We call this a weak learning algorithm and it appears as Theorem 11.

The most interesting result in the paper is a negative result. We show in Section 7 that even if the data is produced by a mixture of polynomials, there is no efficient analogue of Sudan's algorithm. Specifically, we show that

Theorem 6. For every $\epsilon > 0$, there exist two explicitly given well-behaved degree $d$ polynomials $g_1, g_2$ with the following property. Given a sample (of arbitrarily large size) from a mixture of $g_1$ and $g_2$, there are $\exp(d^{0.5-\epsilon})$ essentially different polynomials of degree $d$ which fit 50% of the sample points (in the sense of (2)). These polynomials are also well-behaved, and the $\ell_\infty$ distance between every pair of these polynomials is close to 2. This result holds for a noise parameter $\delta > \frac{1}{d^\epsilon}$.

Thus if our goal is to output every polynomial that fits the data, we cannot hope to do it in time less than $\exp(d^{0.5-\epsilon})$. This "bad" example works as long as $\delta > \frac{1}{d^{0.5-\epsilon}}$. We also consider the case when $\delta$ is much smaller, say $\delta = 1/d^c$ for an arbitrary constant $c$ independent of $d$. In Section 8 we extend our bad example to this case; the data is again produced by a mixture of two polynomials, and $\exp(d^\epsilon)$ many polynomials fit 50% of the datapoints.

Finally, we note a couple of reasons why our counterexamples in Sections 7 and 8 need not be cause for undue pessimism. First, our bad examples leave open the possibility of an efficient algorithm that outputs at least one (as opposed to all) polynomial that describes the data. This may be useful in practice. Second, our counterexample is very degenerate: in practice, the polynomials may be nondegenerate (it is not yet clear what this means) and efficient reconstruction may still be possible. To illustrate this, we give one polynomial-time algorithm in the paper, a weak analogue of the Ar et al. result [2]: We consider a generalization of the mixture model where the polynomials $(g_i)_{i=1}^k$ in the mixture have weights $(w_i)_{i=1}^k$ associated with them, and one picks the polynomial $g_i$ with probability $w_i$ for evaluation at any point. Our algorithm reconstructs a mixture of any $k = O(1)$ polynomials when the mixing weights satisfy a nondegeneracy condition. If $k = 2$ the condition specializes to saying that the mixing weights are somewhat different, say 51% and 49%. For general $k$, if $\{w_1, w_2, \ldots, w_k\}$ are the (unknown) mixing weights, then for every pair of distinct subsets $S_1, S_2$ of $\{1, \ldots, k\}$, we require that $\sum_{i \in S_1} w_i$ and $\sum_{i \in S_2} w_i$ differ by at least $1/d^c$ for some known $c$. For instance, this condition holds with high probability for a random set of mixing weights.


1.1. The techniques used

Algorithms for the noise-free case, such as those of Berlekamp–Welch [16] ($\delta = 0$, $\rho > 1/2$) and Sudan's list decoding [15] ($\delta = 0$ and $\rho$ can be less than 1/2), use notions such as "roots of the polynomial," "divisibility of one polynomial by another," etc. that do not hold up well in the noisy case. For instance, the algorithms use the fact that a nonzero degree $d$ polynomial has at most $d$ roots. An analogue of this fact in the noisy case would be "a nonzero degree $d$ polynomial is far from zero most of the time." Unfortunately, this is not true: $x^d$, though it has only one root, is very close to 0 when $|x| < 1 - \eta$ for any fixed $\eta$, and nevertheless takes the value 1 at $x = 1$. Luckily, representing polynomials in the Chebyshev basis removes a lot of these difficulties, as we will see in Section 3. We will also need basic ideas of approximation theory. The algorithms will use the Bernstein–Markov upper bound on derivatives of polynomials (Theorem 9), and the lower bounds/counterexamples will use results concerning conditions under which "smooth" functions can be well approximated by degree $d$ polynomials.

2. Basic definitions

For a polynomial $p(x) : [-1,1] \to \mathbb{R}$, let $\|p(x)\|_\infty = \sup_{x \in [-1,1]} \{|p(x)|\}$. A polynomial $p(x)$ is called "well-behaved" if $\|p(x)\|_\infty \le 1$.

Definition 7. A polynomial $q(x)$ is a $\gamma$-approximation to $p$ if $\|p - q\|_\infty \le \gamma$.

We will assume that the data is generated by a mixture model as in Definition 2. The polynomials $g_1, g_2, \ldots, g_k$ in the mixture are assumed to be "well-behaved" degree $d$ polynomials. This assumption is necessary to make the mixture learning problem well defined. One cannot expect an algorithm to solve the problem for a fixed value of $\delta$ without any restriction on the $\ell_\infty$ norm of the polynomials. An alternative would be: if the $\ell_\infty$ norm is $a$, then assume the noise parameter to be $\delta a$.

Note that the mixture model defined in Definition 2 assumes equal mixing weights for the polynomials in the mixture. In general, we could allow weights $w_1, w_2, \ldots, w_k$ where $\sum_i w_i = 1$ (i.e., when picking which polynomial to use at a particular point, the $i$th polynomial $g_i$ is picked with probability $w_i$). All our algorithms work with general mixing weights, and our lower bounds/counterexamples use equal mixing weights.

As mentioned in Definition 2, the goal will be to recover a set of polynomials containing, for each $g_i$, a polynomial $f_i$ that is a $3\delta$-approximation to $g_i$.

3. The basic problem

Before considering how to reconstruct the mixture of polynomials, let us consider the Basic problem defined in Definition 3, which seeks a noisy analogue of the standard fact that a degree $d$ polynomial can be reconstructed from its values at any $d + 1$ points.


We have not yet found a satisfactory analysis of this problem in the literature. Note that we are not restricting attention to any specific reconstruction algorithm (say, Lagrange interpolation), so this is not a standard problem of sensitivity analysis.

The following approach suggests itself for reconstructing the polynomial $g(x)$ in the Basic problem: Given the datapoints $(a_j, b_j)_{j=1}^m$, denote the unknown coefficients of $g(x)$ by $c_0, c_1, \ldots, c_d$ and solve the following linear program:

$b_j - \delta \le \sum_{i=0}^{d} c_i a_j^i \le b_j + \delta, \quad j = 1, 2, \ldots, m.$  (3)

Note that the LP is feasible since the coefficients of $g$ satisfy it. (Remark: Note also that if we didn't know $\delta$ explicitly, we could just make $\delta$ a variable and find the smallest $\delta$ for which the LP is feasible.)

Unfortunately, feasible solutions to this LP need not give $3\delta$-approximations to $g$ even if $m$ is large. The LP only approximates the unknown polynomial on the given points and not everywhere in the interval $[-1, 1]$. We need to somehow incorporate the constraint that the unknown polynomial must take values in $[-1, 1]$. We do not yet know how to do this directly, but we can do it indirectly once we move to the Chebyshev representation of the polynomial. A brief introduction to the Chebyshev representation is as follows (see [1, Chapter 2]): For each degree $n$, the Chebyshev polynomial $T_n(x)$ is defined as $T_n(x) = \cos(n \cos^{-1}(x))$, where $T_0 = 1/\sqrt{2}$. The Chebyshev polynomials are orthonormal functions with respect to the following inner product:

$\langle f, g \rangle = \frac{2}{\pi} \int_{-1}^{1} \frac{f(x) g(x)}{\sqrt{1 - x^2}} \, dx.$  (4)

(The integral exists when $f, g$ are polynomials.) Thus every degree $d$ polynomial has a unique representation as $\sum_{i=0}^{d} c_i T_i(x)$. The $\ell_2$ norm of a function is $\|f\|_2 = \sqrt{\langle f, f \rangle}$; the $\ell_2$ norm of the polynomial $\sum_{i=0}^{d} c_i T_i(x)$ turns out to be $\left(\sum_{i=0}^{d} c_i^2\right)^{1/2}$. (Thus in particular, the constant polynomial $f(x) = a$ has $\ell_2$ norm $\sqrt{2}|a|$.)

Intuitively speaking, the advantage of the Chebyshev representation is that a polynomial has a "small" coefficient vector iff $\|f\|_2$ is small. Note that the unknown polynomial takes values in $[-1, 1]$ and hence its $\ell_2$ norm is at most $\sqrt{2}$. In particular, each of its Chebyshev coefficients is at most $\sqrt{2}$.

Let $I$ be a set of $d^6$ equally spaced points that cover the interval $[-1, 1]$. We construct a new LP as follows: The variables are the Chebyshev coefficients $c_0, c_1, \ldots, c_d$ of the unknown polynomial. We have constraints as in (3), and we add extra constraints that bound each of the Chebyshev coefficients $c_i$ and enforce the condition that the polynomial takes values in $[-1, 1]$ at all points in the set $I$. Thus the LP is

$b_j - \delta \le \sum_{i=0}^{d} c_i T_i(a_j) \le b_j + \delta, \quad j = 1, 2, \ldots, m,$

$|c_i| \le \sqrt{2}, \quad i = 0, 1, \ldots, d,$

$\left| \sum_{i=0}^{d} c_i T_i(x) \right| \le 1 \quad \forall x \in I.$
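This LP is directly expressible as a feasibility problem. The sketch below is ours: it uses numpy's standard Chebyshev basis with $T_0 = 1$ rather than the paper's normalized $T_0 = 1/\sqrt{2}$ (the coefficient box $\sqrt{2}$ remains a safe bound either way), and a much smaller grid than the $d^6$ points used in the analysis, purely to keep it cheap.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander
from scipy.optimize import linprog

def fit_lp(a, b, d, delta, grid_size=None):
    """Return Chebyshev coefficients of any polynomial satisfying the LP
    of Section 3, or None if the LP is infeasible.

    Constraints: |h(a_j) - b_j| <= delta on the data, |c_i| <= sqrt(2),
    and |h(x)| <= 1 on an equally spaced grid I covering [-1, 1].
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    grid_size = grid_size or 20 * (d + 1) ** 2   # far cheaper than d**6
    I = np.linspace(-1.0, 1.0, grid_size)
    Va, Vg = chebvander(a, d), chebvander(I, d)  # rows: T_0(x) .. T_d(x)
    A_ub = np.vstack([Va, -Va, Vg, -Vg])
    b_ub = np.concatenate([b + delta, delta - b,
                           np.ones(grid_size), np.ones(grid_size)])
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-np.sqrt(2), np.sqrt(2))] * (d + 1))
    return res.x if res.success else None
```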

The following theorem is sufficient to prove Theorem 4.


Theorem 8. Let $m = O\left(\frac{d^2}{\delta} \log\frac{d}{\delta}\right)$ and let $h$ be the polynomial obtained from taking any feasible solution to the above LP. With probability at least 1/2 (over the choice of the random sample), $h$ is a $3\delta$-approximation to the unknown polynomial.

We need the Bernstein–Markov inequality (see [7]), which we state below.

Theorem 9 (Bernstein–Markov inequality). For every degree $d$ polynomial $f : [-1,1] \to \mathbb{R}$,

$\|f'\|_\infty \le \|f\|_\infty \, d^2,$

$|f'(x)| \le \frac{d}{\sqrt{1 - x^2}} \|f\|_\infty \quad \forall x \in (-1, 1).$

Now we prove Theorem 8.

Proof. First we claim that $\|h\|_\infty \le 1 + d^3/|I| \le 2$. Since $h(x) = \sum_{i=0}^{d} c_i T_i(x)$, we have

$|h'(x)| \le \sum_{i=0}^{d} |c_i| \, |T_i'(x)| \le \sum_{i=0}^{d} \sqrt{2}\, i^2 \le d^3,$

where we used the fact that $|T_i'(x)| \le i^2$ for $x \in [-1, 1]$ (this follows from the Bernstein–Markov inequality since $\|T_i(x)\|_\infty = 1$). By construction, $h$ takes values in $[-1, 1]$ at all points in $I$, and the distance between successive points of $I$ is $2/|I|$. Now the claim follows from the fact that the derivative $h'$ by definition gives the rate of change of $h$.

We showed that $\|h'\|_\infty \le d^3$; however, using the Bernstein–Markov inequality again and the bound $\|h\|_\infty \le 2$, we get a stronger bound of $\|h'\|_\infty \le 2d^2$.

Now we are ready to finish the proof of the theorem. Let $\epsilon$ denote the largest distance between two successive points out of $a_1, a_2, \ldots, a_m$. That is, every interval of size $\epsilon$ contains at least one of the points. (The points form an $\epsilon$-net.) A simple calculation shows that with high probability $\epsilon = O(\log m / m) = O(\delta / d^2)$.

Now $h$ and $g$ are functions satisfying $\|g'\|_\infty, \|h'\|_\infty \le 2d^2$, and hence $\|(h - g)'\|_\infty \le 4d^2$. But $h, g$ differ by at most $2\delta$ on the points of the $\epsilon$-net, so $\|h - g\|_\infty \le 2\delta + 4\epsilon d^2$, which is at most $3\delta$ when the constants are appropriately chosen. □

We state the following corollary of the proof of Theorem 8.

Corollary 10. In the setting of Theorem 8, if the unknown polynomial $g$ satisfies $\|g'\|_\infty = C$, then $m = O\left(\frac{C}{\delta} \log\frac{C}{\delta}\right)$ random samples suffice.

Proof. Since $\|g'\|_\infty = C$, we can add the following constraints to the LP:

$\left| \sum_{i=0}^{d} c_i T_i'(x) \right| \le C \quad \forall x \in I.$


The bound $\|h'\|_\infty \le 2d^2$ still holds. We then apply the Bernstein–Markov inequality to $h'$ and get the bound $\|h''\|_\infty \le 2d^4$. This bound and the fact that $|h'(x)| \le C$ for all $x \in I$ together imply the stronger bound $\|h'\|_\infty \le C + 2d^4/|I| \le 2C$. A random sample of $m = O\left(\frac{C}{\delta} \log\frac{C}{\delta}\right)$ points forms a $\frac{\delta}{5C}$-net, and $h$ differs by at most $2\delta$ from the unknown polynomial $g$ on these sample points. Hence $|h(x) - g(x)| \le 3\delta$ everywhere in $[-1, 1]$. □
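In the LP sketch above, the extra derivative constraints of Corollary 10 amount to one more matrix of inequalities. The helper below is ours (a hypothetical addition, built from numpy's Chebyshev derivative routine), sketching how that matrix could be assembled.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebder, chebval

def deriv_matrix(I, d):
    """Matrix D with D[j, i] = T_i'(I[j]). Appending D @ c <= C and
    -D @ c <= C to the LP of Section 3 enforces |h'(x)| <= C on the
    grid I, as Corollary 10 requires."""
    cols = []
    for i in range(d + 1):
        e = np.zeros(d + 1)
        e[i] = 1.0                       # coefficient vector of T_i
        cols.append(chebval(I, chebder(e)))
    return np.column_stack(cols)
```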

Other classes of polynomials: We note that the Bernstein–Markov inequality is tight in general, but many interesting classes of polynomials are known for which the derivative has a tighter upper bound. For these polynomials, Corollary 10 gives a better upper bound on $m$. Erdős [8] shows that if $g$ has all its zeroes in $\mathbb{R} \setminus (-1, 1)$ then $\|g'\|_\infty = O(d)$. Erdélyi [7] has shown that the same bound holds if the polynomial has at most $O(1)$ zeroes in the unit disk in the complex plane. If the polynomial has the form $\sum_i a_i x^i$ where $|a_i| \le 1$, then $\|g'\|_\infty \le O(d^{3/2})$.

We also note that we can generalize our algorithm to learn curves of the type $y^c = g(x)$, where $c$ is some constant. (For instance, a circle in the plane is described by $y^2 = a^2 - x^2$ for some $a$.) For each datapoint $(a_i, b_i)$ we just change it to $(a_i, b_i^c)$ and then run our algorithm. This gives, for instance, a rigorous analysis of the Hough transform for such curves.

4. Learning mixtures of polynomials

The algorithms to learn the mixture of polynomials are easily derived from the results of the previous section. We prove Theorem 5 in this section.

In order to reconstruct all $k$ polynomials in the mixture, we take a random sample of $M \ge O\left(\frac{k d^2 \log(d/\delta)}{\delta}\right)$ points in $[-1, 1]$. With high probability, there are $m \ge O\left(\frac{d^2 \log(d/\delta)}{\delta}\right)$ samples from each polynomial. Now, exhaustively enumerating all $\binom{M}{m}$ subsets of $m$ sample points and applying the reconstruction procedure of the previous section to each subset, we are guaranteed to find $3\delta$-approximations to all polynomials in the mixture.

Note that this is basically an implementation of the Hough transform, except it is backed by our formal analysis of the reconstruction procedure of the previous section.
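A literal rendering of this enumeration follows (ours, reusing the fit_lp sketch from Section 3; it is utterly impractical beyond toy sizes, which is exactly the point the lower bound of Theorem 6 makes rigorous).

```python
from itertools import combinations
import numpy as np

def learn_mixture_bruteforce(xs, ys, d, delta, m):
    """Hough-transform-style search: run the Section 3 LP on every
    m-subset of the sample and keep each feasible fit. The cost is
    binomial(M, m) LP solves, i.e. exponential in the sample size."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    fits = []
    for S in combinations(range(len(xs)), m):
        S = list(S)
        h = fit_lp(xs[S], ys[S], d, delta)
        if h is not None:
            fits.append(h)
    return fits
```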

Remark 1. If the mixing weights of the polynomials are not equal but are $w_1, w_2, \ldots, w_k$, then the number of samples needed is $M = O\left(\frac{k d^2 \log(d/\delta)}{w_{\min} \delta}\right)$, where $w_{\min} = \min_i w_i$. Taking this many samples ensures that with high probability there are at least $O\left(\frac{d^2 \log(d/\delta)}{\delta}\right)$ random samples from each polynomial.

Remark 2. There will be "false positives" in general: the algorithm finds polynomials that fit the data even though they were not in the mixture. This is unavoidable, as we show in Section 7.

4.1. Better algorithm for weak learning

Now we consider the weakening of the mixture learning problem in Definition 2 in which the output polynomials have to fit the unknown polynomials only in the interval $[-1 + \eta, 1 - \eta]$ for some small constant $\eta > 0$. Let us call this weak learning.


Theorem 11. Weak learning is possible in $\exp\left(O\left(\frac{kd}{\delta} \log\frac{d}{\delta}\right)\right)$ time when the data is produced using a mixture of $k$ polynomials.

Proof. We proceed in a similar manner as in the proof of Theorem 8. However, when restricted to the interval $[-1 + \eta, 1 - \eta]$, the Bernstein–Markov inequality gives a better bound on the derivatives, and we get $|g'(x)|, |h'(x)| \le \frac{2d}{\sqrt{\eta}}$ when $x \in [-1 + \eta, 1 - \eta]$. So it suffices to take $m = O\left(\frac{d}{\delta \sqrt{\eta}} \log\frac{d}{\delta}\right)$ and $M = O\left(\frac{kd}{\delta \sqrt{\eta}} \log\frac{d}{\delta}\right)$ in the proof of Theorem 5. □

5. Applications to machine vision and pattern detection

Possible applications to machine vision and feature detection motivated our study. Much work in vision is influenced by the so-called Marr paradigm [13], also called inverse optics. The idea is to roughly mimic the human vision system, which appears to understand scenes by first detecting low-level features (shape, texture) and then progressing to high-level recognition. Note that everyday scenes often have simple descriptions; for instance, artists have no problem depicting a scene with a few pencil strokes. But such simple descriptions seem hard for computers to come by. Clearly, improving machine vision will entail improvements in algorithms at all levels.

The problems studied in this paper are related to low-level vision: how to detect shapes. Current approaches to this problem start by detecting edges, defined roughly as regions of local discontinuity (in light intensity, color, etc.). Hueckel [9], Marr and Hildreth [14], and Canny [3] are some landmarks in the design of edge detection algorithms.

After detecting edges, one may try to piece together such local data into a more global (and hopefully more succinct) description. One promising idea is to identify low degree algebraic curves in the data. These curves (boundaries of roads, doors, houses, cars) often occur in scenes.⁴ Of course, a description in terms of polynomial curves would necessarily need to be approximate, to allow for noise, and also piecewise (e.g., the curve of the road may be occluded by $k$ foreground objects, and so appear as a union of $k + 1$ segments). Note that finding such curves may be useful not just for understanding the scene, but also for drawing a quick sketch of the pixel data, which may be useful for compression, organizing a graphics library for quick searching, etc. Such global curve fitting in the presence of noise seems difficult because of the exponential nature of the Hough transform, and the research in this paper stemmed from trying to understand whether exponential search is inherent. Unfortunately, this necessarily involves considering the degree $d$ to be "large," whereas even the case $d = 3$ or 4 is very interesting in practice. Also, in this paper we are interested in finding every polynomial that fits the data. This is the right requirement from the list decoding perspective, but not strictly motivated by vision applications, where typically one satisfactory polynomial suffices.

However, we do point out that our brute-force algorithms have a property desirable for vision applications, namely, they continue to work even when part of the curve is hidden due to occlusions. Specifically, data might be missing for one of the polynomials in some interval $[u, v]$.


⁴ Of course, the true curve may not be a polynomial or even algebraic, but one could consider its Chebyshev approximation.


(Thus far we had assumed that we are given values of the polynomials at random points in $[-1, 1]$, and hence every reasonably large subinterval contains at least one of the points.)

We remark that it is not possible in general to reconstruct the polynomial in the interval for which data is unavailable; see Section 7.1. Thus we only try to reconstruct the polynomial in intervals where data is available.

In the setting of Theorem 8, this corresponds to the situation where the $a_i$'s lie in some sequence of nonoverlapping intervals $[l_1, j_1], [l_2, j_2], \ldots, [l_s, j_s]$ (think of $s$ as a constant, and the length of each interval as at least $1/\sqrt{d}$ or so, so that a random sample of points is an $\epsilon$-net for them). The goal is to find a polynomial which describes $g$ only in these subintervals. Note that we have to assume that $g$ is well-behaved in the entire interval $[-1, 1]$.

The proof of the theorem clearly generalizes to this setting, and it shows that $h$ and $g$ agree within $3\delta$ in the intervals for which information is available.

6. Mixture learning when mixing weights are nondegenerate

In this section, we first describe a polynomial-time algorithm to reconstruct a mixture of two well-behaved polynomials $(g_1, g_2)$ where the mixing weights are somewhat different, i.e. bounded away from 1/2.

Theorem 12. There exists a polynomial-time algorithm to learn a mixture of two polynomials when their mixing weights are bounded away from 1/2.

Proof. Assume that the mixing weights of the two polynomials $g_1(x)$ and $g_2(x)$ are 60% and 40%; the general case follows similarly.

The algorithm is very simple given our analysis in Section 3. For any point $x \in [-1, 1]$, consider a small interval $\left[x - \frac{\delta}{d^3}, x + \frac{\delta}{d^3}\right]$. For a sample of $d^4 \log(1/\delta)/\delta$ points, with high probability, $O(\log d)$ of the samples lie in this interval. We are given that 60% of these sample points give an approximate value of $g_1(x)$, i.e. the value lies in the interval $[g_1(x) - \delta, g_1(x) + \delta]$, and 40% of the samples give a value in the interval $[g_2(x) - \delta, g_2(x) + \delta]$. We note here that since the derivatives are bounded by $O(d^2)$, the values of the two polynomials are essentially constant over the interval $\left[x - \frac{\delta}{d^3}, x + \frac{\delta}{d^3}\right]$. Hence 60% of the values seen in this interval will lie in $[g_1(x) - \delta, g_1(x) + \delta]$ and 40% of them in $[g_2(x) - \delta, g_2(x) + \delta]$. If the two intervals overlap, we assume $g_1(x)$ and $g_2(x)$ are the same. If the two do not overlap, we call the one containing 60% of the values $g_1(x)$ and the one containing 40% of the points $g_2(x)$. (This is where we use the assumption that the mixing weights are not too similar; it breaks the symmetry between the two polynomials.) Thus at every point $x$ we can reconstruct both $g_1(x)$ and $g_2(x)$. The sample is large enough that we can reconstruct the values of the two polynomials at, say, $O(d^2/\delta)$ equally spaced points. Now applying the techniques from Section 3 enables us to recover the polynomials $g_1, g_2$. □
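The pointwise step of this proof is easy to phrase in code. The sketch below is ours; splitting the observed values at the largest gap is one concrete way to separate the two bands the proof describes.

```python
import numpy as np

def split_values_at(x, xs, ys, delta, d):
    """Read off approximate values of g1(x) and g2(x) (Theorem 12).

    Collects samples with x-coordinate within delta/d^3 of x, where both
    polynomials are essentially constant, then splits the y-values at
    the largest gap. The majority band is labeled g1(x), the minority
    g2(x); if the bands overlap, the two polynomials agree at x.
    """
    vals = np.sort(ys[np.abs(xs - x) <= delta / d**3])
    if vals.size == 0:
        return None
    gaps = np.diff(vals)
    if gaps.size == 0 or gaps.max() <= 2 * delta:
        v = float(vals.mean())
        return v, v                      # bands overlap: g1(x) ~ g2(x)
    cut = int(gaps.argmax()) + 1
    lo, hi = vals[:cut], vals[cut:]
    big, small = (lo, hi) if lo.size >= hi.size else (hi, lo)
    return float(big.mean()), float(small.mean())
```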

We can generalize this theorem to a mixture of any $k = O(1)$ polynomials, provided the mixing weights satisfy the following nondegeneracy condition.


Definition 13. A set of mixing weights $\{w_1, w_2, \ldots, w_k\}$ is called nondegenerate if each weight is at least $1/d^c$ and, for every two distinct subsets $S_1, S_2$ of $\{1, \ldots, k\}$, the sums $\sum_{i \in S_1} w_i$ and $\sum_{i \in S_2} w_i$ differ by at least $1/d^c$. Here $c > 1$ is known.

Theorem 14. There exists a $\mathrm{poly}(d, 1/\delta)$ time algorithm to learn a mixture of $k = O(1)$ polynomials when their mixing weights are nondegenerate.

Proof. First, we describe a polynomial-time algorithm that solves the reconstruction problem assuming the weights are nondegenerate and known to the algorithm. From this the result will follow, since the algorithm can in $\mathrm{poly}(kd)$ time try all choices of mixing weights which are multiples of $1/kd^c$ and proceed assuming that those are the true mixing weights. When it hits upon the correct mixing weights, it will succeed.

The main use of the nondegeneracy condition will be that knowing the sum of the weights in an unknown subset of $\{1, \ldots, k\}$ allows us to find the subset itself. Thus this sum of weights is a unique "signature" for the subset. Formally, for each subset $S \subseteq \{1, \ldots, k\}$ we call $w_S = \sum_{i \in S} w_i$ the signature of $S$.

The following is the algorithm, assuming the mixing weights are known and nondegenerate. As in the algorithm for $k = 2$, we generate for each point $x$ a set of $s \le k$ values that are observed in the data whose $x$-coordinate is in some small interval around $x$. Each value has a certain frequency. In the above 60–40 example, the possible frequencies could only be 40%, 60%, or 100%. More generally, the frequencies correspond to subsets of polynomials that take approximately the same value at that point. The nondegeneracy condition allows us to associate a subset with each frequency; this is just the signature of the subset.

Thus for each polynomial in the mixture we know its value (up to $\delta$) at $\mathrm{poly}(d)$ points, and we can apply our algorithm for the Basic problem. □

7. A counterexample

In this section we prove Theorem 6. We show that, given noisy data which is known to come from a mixture of two degree $d$ polynomials, there could be $2^{d^{0.5-\epsilon}}$ pairs of polynomials $(f, g)$ that can explain the data within noise $\delta = 1/d^\epsilon$. Thus the number of "correct answers" for the polynomial reconstruction problem is superpolynomial, and hence a polynomial-time algorithm cannot enumerate them all. For concreteness, the counterexample below shows this for $\epsilon = 1/6$, but it generalizes to any fixed $0 < \epsilon < 1/2$. (See Figs. 1 and 2.)

We define piecewise continuous functions $f_i$, $0 \le i \le \frac{1}{3} d^{1/3} - 1$, as follows. The support of the function $f_i$ is $\left[\frac{3i+1}{d^{1/3}}, \frac{3i+2}{d^{1/3}}\right]$, so the functions $f_i$ have disjoint supports. We define

$f_i(x) = \begin{cases} 0 & \text{if } x \le \frac{3i+1}{d^{1/3}}, \\ 2 d^{1/3} \left( x - \frac{3i+1}{d^{1/3}} \right) & \text{if } \frac{3i+1}{d^{1/3}} \le x \le \frac{3i+\frac{3}{2}}{d^{1/3}}, \\ 1 - 2 d^{1/3} \left( x - \frac{3i+\frac{3}{2}}{d^{1/3}} \right) & \text{if } \frac{3i+\frac{3}{2}}{d^{1/3}} \le x \le \frac{3i+2}{d^{1/3}}, \\ 0 & \text{if } x \ge \frac{3i+2}{d^{1/3}}. \end{cases}$


The data is generated as follows: pick a random point $x$, and check whether there is an $i$ such that $x$ lies in the support of $f_i$. If so, output $f_i(x)$ and $-f_i(x)$. Otherwise output 0. We claim that this data can be fitted (in the sense of (2), with $\rho = 50\%$) by a degree $d$ polynomial within a noise parameter $1/d^{1/6}$. In fact, we show that $2^{O(d^{1/3})}$ such polynomials exist.

We use the proof of the classical Weierstrass approximation theorem via Bernstein polynomials.

Definition 15. For a continuous function $f : [0, 1] \to \mathbb{R}$ and an integer $d$, the Bernstein polynomial $B_d(f, x)$ is defined as

$B_d(f, x) = \sum_{k=0}^{d} f\left(\frac{k}{d}\right) \binom{d}{k} x^k (1 - x)^{d-k}.$

We use the following approximation theorem [1, p. 30].


Fig. 1. The functions $f_i$ and their approximations $B_d(f_i, x)$.

Fig. 2. The polynomials $g_v$ and $g_{-v}$ for $v = (1, 1, -1, 1)$.


Theorem 16. If $f : [0, 1] \to \mathbb{R}$ satisfies the Lipschitz condition $|f(x) - f(y)| \le L |x - y|$, then the Bernstein polynomials uniformly converge to $f$. Specifically,

$|f(x) - B_d(f, x)| \le \frac{L}{\sqrt{d}} \quad \forall x \in [0, 1].$

Remark. The $\sqrt{d}$ appearing here is the reason why this particular counterexample works only for $\epsilon < 1/2$.

Clearly, the $f_i$'s defined above satisfy the Lipschitz condition with $L = O(d^{1/3})$. Hence the polynomial $B_d(f_i, x)$ approximates $f_i$ everywhere within $O\left(\frac{1}{d^{1/6}}\right)$. We will show that $B_d(f_i, x)$ approximates $f_i$ much better outside the interval $\left[\frac{3i}{d^{1/3}}, \frac{3i+3}{d^{1/3}}\right]$. Note that $f_i$ is zero outside this interval.

We show that $|B_d(f_i, x)| \le \exp(-O(d^{1/6}))$ outside this interval. This is a simple application of the Chernoff bound. Note that

$f_i(x) - B_d(f_i, x) = \sum_{k=0}^{d} \left( f_i(x) - f_i\left(\frac{k}{d}\right) \right) \binom{d}{k} x^k (1 - x)^{d-k}.$

For $x \le \frac{3i}{d^{1/3}}$ we get (the case $x \ge \frac{3i+3}{d^{1/3}}$ follows similarly)

$|B_d(f_i, x)| \le \sum_{k = d^{2/3}(3i+1)}^{d} \binom{d}{k} x^k (1 - x)^{d-k}.$

The right-hand side expression is at most twice the probability that there are more than $d^{2/3}(3i+1)$ heads in $d$ tosses of a coin which has probability $x \le \frac{3i}{d^{1/3}}$ of turning up heads. A standard application of the Chernoff bound gives the result.

Thus the polynomials $B_d(f_i, x)$ are essentially zero outside the interval $\left[\frac{3i}{d^{1/3}}, \frac{3i+3}{d^{1/3}}\right]$. Within the interval they have a dome-like shape with height 1. The polynomials $B_d(f_i, x)$ for different $i$ are essentially nonoverlapping.

Let $r = \frac{1}{3} d^{1/3} - 1$. Now for a vector $v \in \{-1, 1\}^r$, $v = (v_0, v_1, \ldots, v_{r-1})$, define the polynomial

$g_v = \sum_{j=0}^{r-1} v_j \, B_d(f_j, x).$

Clearly, the polynomials $g_v$ are degree $d$ polynomials. Now consider the mixture of the polynomials $g_v$ and $g_{-v}$. Together, they describe the data within an error $O(1/d^{1/6})$. The same is true for all $v$. Thus there are $2^r = 2^{O(d^{1/3})}$ correct pairs of polynomials $(g, g')$ which can correctly "explain" the mixed data.

7.1. Counterexample for occluded curves

Consider the following twist to the Basic problem in Definition 3: there is an interval $[a, b] \subseteq [-1, 1]$ of size at least $O\left(\frac{1}{\sqrt{d}}\right)$ such that $g$ does not fit any of the points in this interval. (In other words, the noise in this interval is not bounded by $\delta$.) Then we show that the solution is not uniquely defined inside $[a, b]$.

The proof consists of noticing that there are degree $d$ polynomials that are very far from zero inside $[a, b]$ but are exponentially close to 0 outside $[a, b]$. Such polynomials can be constructed in a similar manner as the polynomials $B_d(f_i, x)$. We can add any such polynomial to a solution to get another solution that fits the data just as well (since the data excludes points in $[a, b]$).

This justifies the assumption in the mixture model that the set of points that correctly describe the unknown polynomial is an $\epsilon$-net for some small $\epsilon$. It also justifies the assumption in Section 5 that the polynomial is required to fit the data only in the intervals in which correct data is available.

8. A stronger counterexample

The counterexample presented in the last section cannot be pushed through when the additive noise $\delta$ is smaller than $\frac{1}{\sqrt{d}}$. In this section, we present a stronger counterexample where $\delta$ is exponentially small in $d$. We show that if the data is generated from a single degree $d$ polynomial with an additive noise of $\delta$ added to it, where $\delta = \exp(-d^\epsilon)$, then there could be $\exp(d^\epsilon)$ polynomials that fit 50% of the data, for some small constant $\epsilon > 0$.

Let $r = d^\epsilon$ and consider the polynomial $f(x) = T_r(x)^{r+1}$, where $T_r(x)$ is the Chebyshev polynomial of order $r$. It is well known that the polynomial $T_r(x)$ has $r$ distinct zeroes $a_1, a_2, \ldots, a_r$ in $(-1, 1)$. These are precisely the zeroes of $f(x)$, and at these points the first $r$ derivatives of $f(x)$ vanish as well. Let us define $a_0 = -1$ and $a_{r+1} = 1$. For a vector $v \in \{-1, 1\}^{r+1}$, define a function $f_v(x)$ as

$f_v(x) = v_i \, f(x) \quad \text{for } x \in [a_i, a_{i+1}].$  (5)

Since the first $r$ derivatives of $f(x)$ vanish at its zeroes, the function $f_v(x)$ is $r$-times differentiable. We show that each of the $2^{r+1}$ curves $\{f_v : v \in \{-1, 1\}^{r+1}\}$ can be approximated very well by a degree $d$ polynomial. This is shown using the following theorem from [5, Chapter 7, Theorem 6.2].

Definition 17. For any function $f : [-1, 1] \to \mathbb{R}$, define its modulus of continuity as

$\omega(f, t) = \sup_{|x - y| \le t} |f(x) - f(y)|.$

Theorem 18. If a function $f$ is $r$-times differentiable everywhere in $[-1, 1]$, then there is a constant $C_r$ depending upon $r$ such that the error of the best degree $d$ approximation to it satisfies

$E_d(f) \le C_r \, d^{-r} \, \omega(f^{(r)}, 1/d),$

where $f^{(r)}$ denotes the $r$th derivative of $f$. We have $C_r \le 2^{O(r)}$.


We apply this theorem to the function $f_v$. Using Theorem 9 and a simple induction, one can show that the $r$th derivative of $f(x) = T_r(x)^{r+1}$ is at most $r^{O(r)}$, which is also a bound on the $r$th derivative of $f_v$. This gives

$\omega(f_v^{(r)}, 1/d) \le \frac{1}{d} \, r^{O(r)}.$

Recall that $r = d^\epsilon$. For a sufficiently small $\epsilon > 0$ we have

$E_d(f_v) \le C_r \, d^{-r} \, \omega(f_v^{(r)}, 1/d) \le 2^{O(r)} \, d^{-r} \, \frac{1}{d} \, r^{O(r)} \le \exp(-d^\epsilon).$

All the $2^{d^\epsilon}$ pairs $(f_v, f_{-v})$ describe the same data, finishing the proof.
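A sketch of these curves follows (ours; we read the sign vector $v$ as having one entry per interval between consecutive zeros of $T_r$, i.e. length $r + 1$, consistent with the $2^{r+1}$ curves counted above).

```python
import numpy as np

def f_v(v, r):
    """The curve f_v of Section 8: f(x) = T_r(x)^(r+1) with its sign
    flipped independently on each interval [a_i, a_{i+1}] between the
    zeros of T_r, according to the sign vector v of length r + 1."""
    # Zeros of T_r in ascending order: cos((2k-1)*pi/(2r)), k = r..1.
    zeros = np.cos((2 * np.arange(r, 0, -1) - 1) * np.pi / (2 * r))
    def fv(x):
        base = np.cos(r * np.arccos(x)) ** (r + 1)    # T_r(x)^(r+1)
        i = int(np.searchsorted(zeros, x))            # interval index
        return v[i] * base
    return fv
```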

9. Conclusions

We have tried to present a new problem which seems to be of interest in many applied areas, in addition to being interesting for the classical theory of curve fitting.

Our counterexamples of Sections 7 and 8 suggest that the problem as stated has no polynomial-time algorithm, even when the data is generated by a mixture of polynomials. However, we do not see a reason for pessimism yet. For one thing, the counterexamples really rely on the mixing weights being equal, an unrealistic assumption in practice (indeed, when the mixing weights are "nondegenerate," we can do the reconstruction efficiently using the algorithm of Section 6).

All this assumes that the data is generated from a mixture of polynomials. If the data is unstructured, then our counterexample really gives reason for pause, and it appears that no analogue of Sudan's algorithm, which does work for unstructured data, can exist. However, even here it is conceivable that a polynomial-time algorithm exists which generates at least one curve (as opposed to all curves) that fits a $\rho$ fraction of the data. It might even be possible to have a polynomial-time algorithm that parameterizes all the fitting curves, thereby avoiding explicit enumeration. Such approaches may be useful in practice.

Acknowledgments

Thanks to Ravi Kannan, Madhu Sudan, and Alex Samorodnitsky for useful conversations at various stages of this work, and to Dasgupta et al. for sending us their manuscript. We thank Bjorn Poonen for his help with Section 8 and Tamás Erdélyi for useful pointers. Many thanks to the anonymous referees for their excellent comments.


References

[1] N.I. Achieser, Theory of Approximation, Dover Publications, New York, 1992.

[2] S. Ar, R. Lipton, R. Rubinfeld, M. Sudan, Reconstructing algebraic functions from mixed data, SIAM J. Comput. 28 (2) (1999) 487–510 (prelim. version in IEEE FOCS 1992).

[3] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8 (6) (1986) 255–274.

[4] S. Dasgupta, E. Pavlov, Y. Singer, An efficient PAC algorithm for reconstructing a mixture of lines, 13th International Conference on Algorithmic Learning Theory, Lübeck, Germany, 2002.

[5] R.A. DeVore, G.G. Lorentz, Constructive Approximation, Springer, Berlin, Heidelberg, 1993.

[6] R.O. Duda, P.E. Hart, The use of the Hough transformation to detect lines and curves in pictures, Comm. ACM 15 (1972) 11–15.

[7] T. Erdélyi, Markov–Bernstein type inequalities for polynomials under Erdős-type constraints, in: Paul Erdős and His Mathematics, Vol. 1, Springer-Verlag, Budapest, Hungary, 1999.

[8] P. Erdős, On extremal properties of the derivatives of polynomials, Ann. Math. 2 (1940) 310–313.

[9] A. Hueckel, An operator which locates edges in digital pictures, J. ACM 20 (1971) 113–125.

[10] J. Illingworth, J. Kittler, A survey of the Hough transform, Comput. Vision Graph. Image Process. 44 (1988) 87–116.

[11] V. Leavers, Which Hough transform? A survey of Hough transform methods, CVGIP: Image Understanding 58 (2) (1993) 250–264.

[12] G.G. Lorentz, Approximation of Functions, 2nd Edition, Chelsea, New York, NY, 1986.

[13] D. Marr, Vision, W.H. Freeman, New York, 1982.

[14] D. Marr, E. Hildreth, Theory of edge detection, Proc. R. Soc. London 207 (1980) 187–217.

[15] M. Sudan, Decoding of Reed–Solomon codes beyond the error-correction bound, J. Complexity 13 (1) (1997) 180–193 (prelim. version in IEEE FOCS 1996).

[16] L. Welch, E.R. Berlekamp, Error correction of algebraic block codes, US Patent Number 4,633,470, 1986.
