A Probabilistic Theory of Pattern Recognition

Luc Devroye Laszlo Gyorfi Gabor Lugosi


Preface

Life is just a long random walk. Things are created because the circumstances happen to be right. More often than not, creations, such as this book, are accidental. Nonparametric estimation came to life in the fifties and sixties and started developing at a frenzied pace in the late sixties, engulfing pattern recognition in its growth. In the mid-sixties, two young men, Tom Cover and Peter Hart, showed the world that the nearest neighbor rule in all its simplicity was guaranteed to err at most twice as often as the best possible discrimination method. Tom's results had a profound influence on Terry Wagner, who became a Professor at the University of Texas at Austin and brought probabilistic rigor to the young field of nonparametric estimation. Around 1971, Vapnik and Chervonenkis started publishing a revolutionary series of papers with deep implications in pattern recognition, but their work was not well known at the time. However, Tom and Terry had noticed the potential of the work, and Terry asked Luc Devroye to read that work in preparation for his Ph.D. dissertation at the University of Texas. The year was 1974. Luc ended up in Texas quite by accident thanks to a tip by his friend and fellow Belgian Willy Wouters, who matched him up with Terry. By the time Luc's dissertation was published in 1976, pattern recognition had taken off in earnest. On the theoretical side, important properties were still being discovered. In 1977, Stone stunned the nonparametric community by showing that there are nonparametric rules that are convergent for all distributions of the data. This is called distribution-free or universal consistency, and it is what makes nonparametric methods so attractive. Yet, very few researchers were concerned with universal consistency—one notable exception was Laci Gyorfi, who at that time worked in Budapest amid an energetic group of nonparametric specialists that included Sandor Csibi, Jozsef Fritz, and Pal Revesz.

So, linked by a common vision, Luc and Laci decided to join forces in the early eighties. In 1982, they wrote six chapters of a book on nonparametric regression function estimation, but these were never published. In fact, the notes are still in drawers in their offices today. They felt that the subject had not matured yet. A book on nonparametric density estimation saw the light in 1985. Unfortunately, as true baby-boomers, neither Luc nor Laci had the time after 1985 to write a text on nonparametric pattern recognition. Enter Gabor Lugosi, who obtained his doctoral degree under Laci's supervision in 1991. Gabor had prepared a set of rough course notes on the subject around 1992 and proposed to coordinate the project—this book—in 1993. With renewed energy, we set out to write the book that we should have written at least ten years ago. Discussions and work sessions were held in Budapest, Montreal, Leuven, and Louvain-La-Neuve. In Leuven, our gracious hosts were Ed van der Meulen and Jan Beirlant, and in Louvain-La-Neuve, we were gastronomically and spiritually supported by Leopold Simar and Irene Gijbels. We thank all of them. New results accumulated, and we had to resist the temptation to publish these in journals. Finally, in May 1995, the manuscript had bloated to such an extent that it had to be sent to the publisher, for otherwise it would have become an encyclopedia. Some important unanswered questions were quickly turned into masochistic exercises or wild conjectures. We will explain subject selection, classroom use, chapter dependence, and personal viewpoints in the Introduction. We do apologize, of course, for all remaining errors.

We were touched, influenced, guided, and taught by many people. Terry Wagner's rigor and taste for beautiful nonparametric problems have infected us for life. We thank our past and present coauthors on nonparametric papers, Alain Berlinet, Michel Broniatowski, Ricardo Cao, Paul Deheuvels, Andras Farago, Adam Krzyzak, Tamas Linder, Andrew Nobel, Mirek Pawlak, Igor Vajda, Harro Walk, and Ken Zeger. Tamas Linder read most of the book and provided invaluable feedback. His help is especially appreciated. Several chapters were critically read by students in Budapest. We thank all of them, especially Andras Antos, Miklos Csuros, Balazs Kegl, Istvan Pali, and Marti Pinter. Finally, here is an alphabetically ordered list of friends who directly or indirectly contributed to our knowledge and love of nonparametrics: Andrew and Roger Barron, Denis Bosq, Prabhir Burman, Tom Cover, Antonio Cuevas, Pierre Devijver, Ricardo Fraiman, Ned Glick, Wenceslao Gonzalez-Manteiga, Peter Hall, Eiichi Isogai, Ed Mack, Arthur Nadas, Georg Pflug, George Roussas, Winfried Stute, Tamas Szabados, Godfried Toussaint, Sid Yakowitz, and Yannis Yatracos.

Gabor diligently typed the entire manuscript and coordinated all contributions. He became quite a TeXpert in the process. Several figures were made with idraw and xfig by Gabor and Luc. Most of the drawings were directly programmed in PostScript by Luc and an undergraduate student at McGill University, Hisham Petry, to whom we are grateful. For Gabor, this book comes at the beginning of his career. Unfortunately, the other two authors are not so lucky. As both Luc and Laci felt that they would probably not write another book on nonparametric pattern recognition—the random walk must go on—they decided to put their general view of the subject area on paper while trying to separate the important from the irrelevant. Surely, this has contributed to the length of the text.

So far, our random excursions have been happy ones. Coincidentally, Luc is married to Bea, the most understanding woman in the world, and happens to have two great daughters, Natasha and Birgit, who do not stray off their random courses. Similarly, Laci has an equally wonderful wife, Kati, and two children with steady compasses, Kati and Janos. During the preparations of this book, Gabor met a wonderful girl, Arrate. They have recently decided to tie their lives together.

On the less amorous and glamorous side, we gratefully acknowledge the research support of NSERC Canada, FCAR Quebec, OTKA Hungary, and the exchange program between the Hungarian Academy of Sciences and the Royal Belgian Academy of Sciences. Early versions of this text were tried out in some classes at the Technical University of Budapest, Katholieke Universiteit Leuven, Universitat Stuttgart, and Universite Montpellier II. We would like to thank those students for their help in making this a better book.

Luc Devroye, Montreal, Quebec, Canada
Laci Gyorfi, Budapest, Hungary

Gabor Lugosi, Budapest, Hungary


Contents

Preface

1 Introduction

2 The Bayes Error
   2.1 The Bayes Problem
   2.2 A Simple Example
   2.3 Another Simple Example
   2.4 Other Formulas for the Bayes Risk
   2.5 Plug-In Decisions
   2.6 Bayes Error Versus Dimension
   Problems and Exercises

3 Inequalities and Alternate Distance Measures
   3.1 Measuring Discriminatory Information
   3.2 The Kolmogorov Variational Distance
   3.3 The Nearest Neighbor Error
   3.4 The Bhattacharyya Affinity
   3.5 Entropy
   3.6 Jeffreys' Divergence
   3.7 F-Errors
   3.8 The Mahalanobis Distance
   3.9 f-Divergences
   Problems and Exercises

4 Linear Discrimination
   4.1 Univariate Discrimination and Stoller Splits
   4.2 Linear Discriminants
   4.3 The Fisher Linear Discriminant
   4.4 The Normal Distribution
   4.5 Empirical Risk Minimization
   4.6 Minimizing Other Criteria
   Problems and Exercises

5 Nearest Neighbor Rules
   5.1 Introduction
   5.2 Notation and Simple Asymptotics
   5.3 Proof of Stone's Lemma
   5.4 The Asymptotic Probability of Error
   5.5 The Asymptotic Error Probability of Weighted Nearest Neighbor Rules
   5.6 k-Nearest Neighbor Rules: Even k
   5.7 Inequalities for the Probability of Error
   5.8 Behavior When L∗ Is Small
   5.9 Nearest Neighbor Rules When L∗ = 0
   5.10 Admissibility of the Nearest Neighbor Rule
   5.11 The (k, l)-Nearest Neighbor Rule
   Problems and Exercises

6 Consistency
   6.1 Universal Consistency
   6.2 Classification and Regression Estimation
   6.3 Partitioning Rules
   6.4 The Histogram Rule
   6.5 Stone's Theorem
   6.6 The k-Nearest Neighbor Rule
   6.7 Classification Is Easier Than Regression Function Estimation
   6.8 Smart Rules
   Problems and Exercises

7 Slow Rates of Convergence
   7.1 Finite Training Sequence
   7.2 Slow Rates
   Problems and Exercises

8 Error Estimation
   8.1 Error Counting
   8.2 Hoeffding's Inequality
   8.3 Error Estimation Without Testing Data
   8.4 Selecting Classifiers
   8.5 Estimating the Bayes Error
   Problems and Exercises

9 The Regular Histogram Rule
   9.1 The Method of Bounded Differences
   9.2 Strong Universal Consistency
   Problems and Exercises

10 Kernel Rules
   10.1 Consistency
   10.2 Proof of the Consistency Theorem
   10.3 Potential Function Rules
   Problems and Exercises

11 Consistency of the k-Nearest Neighbor Rule
   11.1 Strong Consistency
   11.2 Breaking Distance Ties
   11.3 Recursive Methods
   11.4 Scale-Invariant Rules
   11.5 Weighted Nearest Neighbor Rules
   11.6 Rotation-Invariant Rules
   11.7 Relabeling Rules
   Problems and Exercises

12 Vapnik-Chervonenkis Theory
   12.1 Empirical Error Minimization
   12.2 Fingering
   12.3 The Glivenko-Cantelli Theorem
   12.4 Uniform Deviations of Relative Frequencies from Probabilities
   12.5 Classifier Selection
   12.6 Sample Complexity
   12.7 The Zero-Error Case
   12.8 Extensions
   Problems and Exercises

13 Combinatorial Aspects of Vapnik-Chervonenkis Theory
   13.1 Shatter Coefficients and VC Dimension
   13.2 Shatter Coefficients of Some Classes
   13.3 Linear and Generalized Linear Discrimination Rules
   13.4 Convex Sets and Monotone Layers
   Problems and Exercises

14 Lower Bounds for Empirical Classifier Selection
   14.1 Minimax Lower Bounds
   14.2 The Case L_C = 0
   14.3 Classes with Infinite VC Dimension
   14.4 The Case L_C > 0
   14.5 Sample Complexity
   Problems and Exercises

15 The Maximum Likelihood Principle
   15.1 Maximum Likelihood: The Formats
   15.2 The Maximum Likelihood Method: Regression Format
   15.3 Consistency
   15.4 Examples
   15.5 Classical Maximum Likelihood: Distribution Format
   Problems and Exercises

16 Parametric Classification
   16.1 Example: Exponential Families
   16.2 Standard Plug-In Rules
   16.3 Minimum Distance Estimates
   16.4 Empirical Error Minimization
   Problems and Exercises

17 Generalized Linear Discrimination
   17.1 Fourier Series Classification
   17.2 Generalized Linear Classification
   Problems and Exercises

18 Complexity Regularization
   18.1 Structural Risk Minimization
   18.2 Poor Approximation Properties of VC Classes
   18.3 Simple Empirical Covering
   Problems and Exercises

19 Condensed and Edited Nearest Neighbor Rules
   19.1 Condensed Nearest Neighbor Rules
   19.2 Edited Nearest Neighbor Rules
   19.3 Sieves and Prototypes
   Problems and Exercises

20 Tree Classifiers
   20.1 Invariance
   20.2 Trees with the X-Property
   20.3 Balanced Search Trees
   20.4 Binary Search Trees
   20.5 The Chronological k-d Tree
   20.6 The Deep k-d Tree
   20.7 Quadtrees
   20.8 Best Possible Perpendicular Splits
   20.9 Splitting Criteria Based on Impurity Functions
   20.10 A Consistent Splitting Criterion
   20.11 BSP Trees
   20.12 Primitive Selection
   20.13 Constructing Consistent Tree Classifiers
   20.14 A Greedy Classifier
   Problems and Exercises

21 Data-Dependent Partitioning
   21.1 Introduction
   21.2 A Vapnik-Chervonenkis Inequality for Partitions
   21.3 Consistency
   21.4 Statistically Equivalent Blocks
   21.5 Partitioning Rules Based on Clustering
   21.6 Data-Based Scaling
   21.7 Classification Trees
   Problems and Exercises

22 Splitting the Data
   22.1 The Holdout Estimate
   22.2 Consistency and Asymptotic Optimality
   22.3 Nearest Neighbor Rules with Automatic Scaling
   22.4 Classification Based on Clustering
   22.5 Statistically Equivalent Blocks
   22.6 Binary Tree Classifiers
   Problems and Exercises

23 The Resubstitution Estimate
   23.1 The Resubstitution Estimate
   23.2 Histogram Rules
   23.3 Data-Based Histograms and Rule Selection
   Problems and Exercises

24 Deleted Estimates of the Error Probability
   24.1 A General Lower Bound
   24.2 A General Upper Bound for Deleted Estimates
   24.3 Nearest Neighbor Rules
   24.4 Kernel Rules
   24.5 Histogram Rules
   Problems and Exercises

25 Automatic Kernel Rules
   25.1 Consistency
   25.2 Data Splitting
   25.3 Kernel Complexity
   25.4 Multiparameter Kernel Rules
   25.5 Kernels of Infinite Complexity
   25.6 On Minimizing the Apparent Error Rate
   25.7 Minimizing the Deleted Estimate
   25.8 Sieve Methods
   25.9 Squared Error Minimization
   Problems and Exercises

26 Automatic Nearest Neighbor Rules
   26.1 Consistency
   26.2 Data Splitting
   26.3 Data Splitting for Weighted NN Rules
   26.4 Reference Data and Data Splitting
   26.5 Variable Metric NN Rules
   26.6 Selection of k Based on the Deleted Estimate
   Problems and Exercises

27 Hypercubes and Discrete Spaces
   27.1 Multinomial Discrimination
   27.2 Quantization
   27.3 Independent Components
   27.4 Boolean Classifiers
   27.5 Series Methods for the Hypercube
   27.6 Maximum Likelihood
   27.7 Kernel Methods
   Problems and Exercises

28 Epsilon Entropy and Totally Bounded Sets
   28.1 Definitions
   28.2 Examples: Totally Bounded Classes
   28.3 Skeleton Estimates
   28.4 Rate of Convergence
   Problems and Exercises

29 Uniform Laws of Large Numbers
   29.1 Minimizing the Empirical Squared Error
   29.2 Uniform Deviations of Averages from Expectations
   29.3 Empirical Squared Error Minimization
   29.4 Proof of Theorem 29.1
   29.5 Covering Numbers and Shatter Coefficients
   29.6 Generalized Linear Classification
   Problems and Exercises

30 Neural Networks
   30.1 Multilayer Perceptrons
   30.2 Arrangements
   30.3 Approximation by Neural Networks
   30.4 VC Dimension
   30.5 L1 Error Minimization
   30.6 The Adaline and Padaline
   30.7 Polynomial Networks
   30.8 Kolmogorov-Lorentz Networks and Additive Models
   30.9 Projection Pursuit
   30.10 Radial Basis Function Networks
   Problems and Exercises

31 Other Error Estimates
   31.1 Smoothing the Error Count
   31.2 Posterior Probability Estimates
   31.3 Rotation Estimate
   31.4 Bootstrap
   Problems and Exercises

32 Feature Extraction
   32.1 Dimensionality Reduction
   32.2 Transformations with Small Distortion
   32.3 Admissible and Sufficient Transformations
   Problems and Exercises

Appendix
   A.1 Basics of Measure Theory
   A.2 The Lebesgue Integral
   A.3 Denseness Results
   A.4 Probability
   A.5 Inequalities
   A.6 Convergence of Random Variables
   A.7 Conditional Expectation
   A.8 The Binomial Distribution
   A.9 The Hypergeometric Distribution
   A.10 The Multinomial Distribution
   A.11 The Exponential and Gamma Distributions
   A.12 The Multivariate Normal Distribution

Notation

References

Subject Index

Author Index


1 Introduction

Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake. An observation is a collection of numerical measurements such as an image (which is a sequence of bits, one per pixel), a vector of weather data, an electrocardiogram, or a signature on a check suitably digitized. More formally, an observation is a d-dimensional vector x. The unknown nature of the observation is called a class. It is denoted by y and takes values in a finite set {1, 2, . . . , M}. In pattern recognition, one creates a function g(x) : Rd → {1, . . . , M} which represents one's guess of y given x. The mapping g is called a classifier. Our classifier errs on x if g(x) ≠ y.

How one creates a rule g depends upon the problem at hand. Experts can be called for medical diagnoses or earthquake predictions—they try to mold g to their own knowledge and experience, often by trial and error. Theoretically, each expert operates with a built-in classifier g, but describing this g explicitly in mathematical form is not a sinecure. The sheer magnitude and richness of the space of x may defeat even the best expert—it is simply impossible to specify g for all possible x's one is likely to see in the future. We have to be prepared to live with imperfect classifiers. In fact, how should we measure the quality of a classifier? We cannot dismiss a classifier just because it misclassifies a particular x. For one thing, if the observation does not fully describe the underlying process (that is, if y is not a deterministic function of x), it is possible that the same x may give rise to two different y's on different occasions. For example, if we just measure the water content of a person's body, and we find that the person is dehydrated, then the cause (the class) may range from a low water intake in hot weather to severe diarrhea. Thus, we introduce a probabilistic setting, and let (X,Y) be an Rd × {1, . . . , M}-valued random pair. The distribution of (X,Y) describes the frequency of encountering particular pairs in practice. An error occurs if g(X) ≠ Y, and the probability of error for a classifier g is

L(g) = P{g(X) ≠ Y}.

There is a best possible classifier, g∗, which is defined by

g∗ = arg min_{g: Rd → {1,...,M}} P{g(X) ≠ Y}.

Note that g∗ depends upon the distribution of (X,Y). If this distribution is known, g∗ may be computed. The problem of finding g∗ is Bayes' problem, and the classifier g∗ is called the Bayes classifier (or the Bayes rule). The minimal probability of error is called the Bayes error and is denoted by L∗ = L(g∗). Mostly, the distribution of (X,Y) is unknown, so that g∗ is unknown too.

We do not consult an expert to try to reconstruct g∗, but have access to a good database of pairs (Xi, Yi), 1 ≤ i ≤ n, observed in the past. This database may be the result of experimental observation (as for meteorological data, fingerprint data, ECG data, or handwritten characters). It could also be obtained through an expert or a teacher who filled in the Yi's after having seen the Xi's. To find a classifier g with a small probability of error is hopeless unless there is some assurance that the (Xi, Yi)'s jointly are somehow representative of the unknown distribution. We shall assume in this book that (X1, Y1), . . . , (Xn, Yn), the data, is a sequence of independent identically distributed (i.i.d.) random pairs with the same distribution as that of (X,Y). This is a very strong assumption indeed. However, some theoretical results are emerging that show that classifiers based on slightly dependent data pairs and on i.i.d. data pairs behave roughly the same. Also, simple models are easier to understand and are more amenable to interpretation.

A classifier is constructed on the basis of X1, Y1, . . . , Xn, Yn and is denoted by gn; Y is guessed by gn(X; X1, Y1, . . . , Xn, Yn). The process of constructing gn is called learning, supervised learning, or learning with a teacher. The performance of gn is measured by the conditional probability of error

Ln = L(gn) = P{gn(X; X1, Y1, . . . , Xn, Yn) ≠ Y | X1, Y1, . . . , Xn, Yn}.

This is a random variable because it depends upon the data. So, Ln averages over the distribution of (X,Y), but the data is held fixed. Averaging over the data as well would be unnatural, because in a given application, one has to live with the data at hand. It would be marginally useful to know the number E{Ln} as this number would indicate the quality of an average data sequence, not your data sequence. This text is thus about Ln, the conditional probability of error.

An individual mapping gn : Rd × (Rd × {1, . . . , M})^n → {1, . . . , M} is still called a classifier. A sequence {gn, n ≥ 1} is called a (discrimination) rule. Thus, classifiers are functions, and rules are sequences of functions.

A novice might ask simple questions like this: How does one construct a good classifier? How good can a classifier be? Is classifier A better than classifier B? Can we estimate how good a classifier is? What is the best classifier? This book partially answers such simple questions. A good deal of energy is spent on the mathematical formulations of the novice's questions. For us, a rule—not a classifier—is good if it is consistent, that is, if

lim_{n→∞} E{Ln} = L∗,

or equivalently, if Ln → L∗ in probability as n → ∞. We assume that the reader has a good grasp of the basic elements of probability, including notions such as convergence in probability, strong laws of large numbers for averages, and conditional probability. A selection of results and definitions that may be useful for this text is given in the Appendix. A consistent rule guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution of (X,Y) because Ln can be pushed as close as desired to L∗. In other words, infinite amounts of information can be gleaned from finite samples. Without this guarantee, we would not be motivated to take more samples. We should be careful and not impose conditions on (X,Y) for the consistency of a rule, because such conditions may not be verifiable. If a rule is consistent for all distributions of (X,Y), it is said to be universally consistent.

Interestingly, until 1977, it was not known if a universally consistent rule existed. All pre-1977 consistency results came with restrictions on (X,Y). In 1977, Stone showed that one could just take any k-nearest neighbor rule with k = k(n) → ∞ and k/n → 0. The k-nearest neighbor classifier gn(x) takes a majority vote over the Yi's in the subset of k pairs (Xi, Yi) from (X1, Y1), . . . , (Xn, Yn) that have the smallest values for ‖Xi − x‖ (i.e., for which Xi is closest to x). Since Stone's proof of the universal consistency of the k-nearest neighbor rule, several other rules have been shown to be universally consistent as well. This book stresses universality and hopefully gives a reasonable account of the developments in this direction.
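To make the rule concrete, here is a minimal sketch (ours, not the book's) of the k-nearest neighbor majority vote with the illustrative choice k(n) = ⌈√n⌉, which satisfies k → ∞ and k/n → 0; the sample distribution, random seed, and query point below are arbitrary assumptions.

import numpy as np

def knn_classify(x, X, Y, k):
    # Distances from the query point x to all training points,
    # then a majority vote over the labels of the k closest ones.
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(Y[nearest].mean() > 0.5)

rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
eta = 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))    # assumed a posteriori probability
Y = (rng.uniform(size=n) < eta).astype(int)

k = int(np.ceil(np.sqrt(n)))                  # k(n) -> infinity, k(n)/n -> 0
print(knn_classify(np.array([0.5, -1.0]), X, Y, k))

Distance ties and even values of k are handled crudely here (ties in the vote go to class 0); the chapters on nearest neighbor rules treat these issues carefully.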

Probabilists may wonder why we did not use convergence with probability one in our definition of consistency. Indeed, strong consistency—convergence of Ln to L∗ with probability one—implies convergence for almost every sample as it grows. Fortunately, for most well-behaved rules, consistency and strong consistency are equivalent. For example, for the k-nearest neighbor rule, k → ∞ and k/n → 0 together imply Ln → L∗ with probability one. The equivalence will be dealt with, but it will not be a major focus of attention. Most, if not all, equivalence results are based upon some powerful concentration inequalities such as McDiarmid's. For example, we will be able to show that for the k-nearest neighbor rule, there exists a number c > 0, such that for all ε > 0, there exists N(ε) > 0 depending upon the distribution of (X,Y), such that

P{Ln − L∗ > ε} ≤ e^{−cnε²}, n ≥ N(ε).

This illustrates yet another focus of the book—inequalities. Whenever possible, we make a case or conclude a proof via explicit inequalities. Various parameters can be substituted in these inequalities to allow the user to draw conclusions regarding sample size or to permit identification of the most important parameters.

The material in the book is often technical and dry. So, to stay focused on the main issues, we keep the problem simple:

A. We only deal with binary classification (M = 2). The class Y takes values in {0, 1}, and a classifier gn is a mapping Rd × (Rd × {0, 1})^n → {0, 1}.

B. We only consider i.i.d. data sequences. We also disallow active learning, a set-up in which the user can select the Xi's deterministically.

C. We do not consider infinite spaces. For example, X cannot be a random function such as a cardiogram. X must be an Rd-valued random vector. The reader should be aware that many results given here may be painlessly extended to certain metric spaces of infinite dimension.

Let us return to our novice's questions. We know that there are good rules, but just how good can a classifier be? Obviously, Ln ≥ L∗ in all cases. It is thus important to know L∗ or to estimate it, for if L∗ is large, any classifier, including yours, will perform poorly. But even if L∗ were zero, Ln could still be large. Thus, it would be nice to have explicit inequalities for probabilities such as

P{Ln ≥ L∗ + ε}.

However, such inequalities must necessarily depend upon the distribution of (X,Y). That is, for any rule,

lim inf_{n→∞}  sup_{all distributions of (X,Y) with L∗+ε<1/2}  P{Ln ≥ L∗ + ε} > 0.

Universal rate of convergence guarantees do not exist. Rate of convergence studies must involve certain subclasses of distributions of (X,Y). For this reason, with few exceptions, we will steer clear of the rate of convergence quicksand.

Even if there are no universal performance guarantees, we might still be able to satisfy our novice's curiosity if we could satisfactorily estimate Ln for the rule at hand by a function L̂n of the data. Such functions are called error estimates. For example, for the k-nearest neighbor classifier, we could use the deleted estimate

L̂n = (1/n) ∑_{i=1}^{n} I{gni(Xi) ≠ Yi},

where gni(Xi) classifies Xi by the k-nearest neighbor method based upon the data (X1, Y1), . . . , (Xn, Yn) with (Xi, Yi) deleted. If this is done, we have a distribution-free inequality

P{|L̂n − Ln| > ε} ≤ (6k + 1)/(nε²)

(the Rogers-Wagner inequality), provided that distance ties are broken in an appropriate manner. In other words, without knowing the distribution of (X,Y), we can state with a certain confidence that Ln is contained in [L̂n − ε, L̂n + ε]. Thus, for many classifiers, it is indeed possible to estimate Ln from the data at hand. However, it is impossible to estimate L∗ universally well: for any n, and any estimate of L∗ based upon the data sequence, there always exists a distribution of (X,Y) for which the estimate is arbitrarily poor.
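The deleted estimate above is easy to compute directly. The following sketch is ours (the toy data distribution and the choice k = 11 are assumptions for illustration): classify each Xi by the k-nearest neighbor rule built on the remaining n − 1 pairs and report the fraction of errors.

import numpy as np

def knn_vote(x, X, Y, k):
    # Majority vote among the k training points closest to x.
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return int(Y[nearest].mean() > 0.5)

def deleted_estimate(X, Y, k):
    # Leave-one-out: classify X_i with (X_i, Y_i) removed from the data.
    n = len(Y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        errors += knn_vote(X[i], X[keep], Y[keep], k) != Y[i]
    return errors / n

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-3 * X[:, 0]))).astype(int)
print(deleted_estimate(X, Y, k=11))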

Can we compare rules {gn} and {g′n}? Again, the answer is negative: there exists no “best” classifier (or superclassifier), as for any rule {gn}, there exists a distribution of (X,Y) and another rule {g′n} such that for all n, E{L(g′n)} < E{L(gn)}. If there had been a universally best classifier, this book would have been unnecessary: we would all have to use it all the time. This nonexistence implies that the debate between practicing pattern recognizers will never end and that simulations on particular examples should never be used to compare classifiers. As an example, consider the 1-nearest neighbor rule, a simple but not universally consistent rule. Yet, among all k-nearest neighbor classifiers, the 1-nearest neighbor classifier is admissible—there are distributions for which its expected probability of error is better than for any k-nearest neighbor classifier with k > 1. So, it can never be totally dismissed. Thus, we must study all simple rules, and we will reserve many pages for the nearest neighbor rule and its derivatives. We will for example prove the Cover-Hart inequality (Cover and Hart, 1967) which states that for all distributions of (X,Y),

lim sup_{n→∞} E{Ln} ≤ 2L∗,

where Ln is the probability of error with the 1-nearest neighbor rule. As L∗ is usually small (for otherwise, you would not want to do discrimination), 2L∗ is small too, and the 1-nearest neighbor rule will do just fine.

The nonexistence of a best classifier may disappoint our novice. However, we may change the setting somewhat and limit the classifiers to a certain class C, such as all k-nearest neighbor classifiers with all possible values for k. Is it possible to select the best classifier from this class? Phrased in this manner, we cannot possibly do better than

L_C  def=  inf_{gn ∈ C} P{gn(X) ≠ Y}.

Typically, L_C > L∗. Interestingly, there is a general paradigm for picking classifiers from C and obtaining universal performance guarantees. It uses empirical risk minimization, a method studied in great detail in the work of Vapnik and Chervonenkis (1971). For example, if we select gn from C by minimizing

(1/n) ∑_{i=1}^{n} I{gn(Xi) ≠ Yi},

then the corresponding probability of error Ln satisfies the following inequality for all ε > 0:

P{Ln > L_C + ε} ≤ 8(n^V + 1) e^{−nε²/128}.

Here V > 0 is an integer depending upon the massiveness of C only. V is called the VC dimension of C and may be infinite for large classes C. For sufficiently restricted classes C, V is finite and the explicit universal bound given above can be used to obtain performance guarantees for the selected gn (relative to L_C, not L∗). The bound above is only valid if C is independent of the data pairs (X1, Y1), . . . , (Xn, Yn). Fixed classes such as all classifiers that decide 1 on a halfspace and 0 on its complement are fine. We may also sample m more pairs (in addition to the n pairs already present), and use the n pairs as above to select the best k for use in the k-nearest neighbor classifier based on the m pairs. As we will see, the selected rule is universally consistent if both m and n diverge and n/log m → ∞. And we have automatically solved the problem of picking k. Recall that Stone's universal consistency theorem only told us to pick k = o(m) and to let k → ∞, but it does not tell us whether k ≈ m^0.01 is preferable over k ≈ m^0.99. Empirical risk minimization produces a random data-dependent k that is not even guaranteed to tend to infinity or to be o(m), yet the selected rule is universally consistent.
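A minimal sketch of this selection scheme, under assumed toy data (the distribution, the sizes m and n, and the candidate set for k are illustrative choices of ours): the k-nearest neighbor rule is based on m pairs, and k is picked from a finite set by minimizing the empirical error count on n additional pairs.

import numpy as np

def knn_vote(x, X, Y, k):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return int(Y[nearest].mean() > 0.5)

def empirical_error(Xm, Ym, Xn, Yn, k):
    # Fraction of the n selection pairs misclassified by the k-NN rule
    # built on the m reference pairs.
    wrong = sum(knn_vote(x, Xm, Ym, k) != y for x, y in zip(Xn, Yn))
    return wrong / len(Yn)

rng = np.random.default_rng(2)

def sample(size):
    X = rng.normal(size=(size, 2))
    Y = (rng.uniform(size=size) < 1 / (1 + np.exp(-3 * X[:, 0]))).astype(int)
    return X, Y

Xm, Ym = sample(1000)     # the m pairs the classifier is based on
Xn, Yn = sample(300)      # the n pairs used for empirical risk minimization
candidates = [1, 3, 5, 9, 17, 33, 65]
best_k = min(candidates, key=lambda k: empirical_error(Xm, Ym, Xn, Yn, k))
print(best_k)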

We offer virtually no help with algorithms as in standard texts, with two notable exceptions. Ease of computation, storage, and interpretation has spurred the development of certain rules. For example, tree classifiers construct a tree for storing the data, and partition Rd by certain cuts that are typically perpendicular to a coordinate axis. We say that a coordinate axis is “cut.” Such classifiers have obvious computational advantages, and are amenable to interpretation—the components of the vector X that are cut at the early stages of the tree are most crucial in reaching a decision. Expert systems, automated medical diagnosis, and a host of other recognition rules use tree classification. For example, in automated medical diagnosis, one may first check a patient's pulse (component #1). If this is zero, the patient is dead. If it is below 40, the patient is weak. The first component is cut twice. In each case, we may then consider another component, and continue the breakdown into more and more specific cases. Several interesting new universally consistent tree classifiers are described in Chapter 20.

The second group of classifiers whose development was partially based upon easy implementations is the class of neural network classifiers, descendants of Rosenblatt's perceptron (Rosenblatt, 1956). These classifiers have unknown parameters that must be trained or selected by the data, in the way we let the data pick k in the k-nearest neighbor classifier. Most research papers on neural networks deal with the training aspect, but we will not. When we say “pick the parameters by empirical risk minimization,” we will leave the important algorithmic complexity questions unanswered. Perceptrons divide the space by one hyperplane and attach decisions 1 and 0 to the two halfspaces. Such simple classifiers are not consistent except for a few distributions. This is the case, for example, when X takes values on {0, 1}^d (the hypercube) and the components of X are independent. Neural networks with one hidden layer are universally consistent if the parameters are well-chosen. We will see that there is also some gain in considering two hidden layers, but that it is not really necessary to go beyond two.

Complexity of the training algorithm—the phase in which a classifier gn is selected from C—is of course important. Sometimes, one would like to obtain classifiers that are invariant under certain transformations. For example, the k-nearest neighbor classifier is not invariant under nonlinear transformations of the coordinate axes. This is a drawback as components are often measurements in an arbitrary scale. Switching to a logarithmic scale or stretching a scale out by using Fahrenheit instead of Celsius should not affect good discrimination rules. There exist variants of the k-nearest neighbor rule that have the given invariance. In character recognition, sometimes all components of a vector X that represents a character are true measurements involving only vector differences between selected points such as the leftmost and rightmost points, the geometric center, the weighted center of all black pixels, the topmost and bottommost points. In this case, the scale has essential information, and invariance with respect to changes of scale would be detrimental. Here however, some invariance with respect to orthonormal rotations is healthy.

We follow the standard notation from textbooks on probability. Thus, random variables are uppercase characters such as X, Y, and Z. Probability measures are denoted by Greek letters such as µ and ν. Numbers and vectors are denoted by lower case letters such as a, b, c, x, and y. Sets are also denoted by roman capitals, but there are obvious mnemonics: S denotes a sphere, B denotes a Borel set, and so forth. If we need many kinds of sets, we will typically use the beginning of the alphabet (A, B, C). Most functions are denoted by f, g, φ, and ψ. Calligraphic letters such as A, C, and F are used to denote classes of functions or sets. A short list of frequently used symbols is found at the end of the book.

At the end of this chapter, you will find a directed acyclic graph that describes the dependence between chapters. Clearly, prospective teachers will have to select small subsets of chapters. All chapters, without exception, are unashamedly theoretical. We did not scar the pages with backbreaking simulations or quick-and-dirty engineering solutions. The methods gleaned from this text must be supplemented with a healthy dose of engineering savvy. Ideally, students should have a companion text filled with beautiful applications such as automated virus recognition, telephone eavesdropping language recognition, voice recognition in security systems, fingerprint recognition, or handwritten character recognition. To run a real pattern recognition project from scratch, several classical texts on statistical pattern recognition could and should be consulted, as our work is limited to general probability-theoretical aspects of pattern recognition. We have over 430 exercises to help the scholars. These include skill honing exercises, brainteasers, cute puzzles, open problems, and serious mathematical challenges. There is no solution manual. This book is only a start. Use it as a toy—read some proofs, enjoy some inequalities, learn new tricks, and study the art of camouflaging one problem to look like another. Learn for the sake of learning.


Figure 1.1. The dependence between chapters.


2 The Bayes Error

2.1 The Bayes Problem

In this section, we define the mathematical model and introduce the notation we will use for the entire book. Let (X,Y) be a pair of random variables taking their respective values from Rd and {0, 1}. The random pair (X,Y) may be described in a variety of ways: for example, it is defined by the pair (µ, η), where µ is the probability measure for X and η is the regression of Y on X. More precisely, for a Borel-measurable set A ⊆ Rd,

µ(A) = P{X ∈ A},

and for any x ∈ Rd,

η(x) = P{Y = 1 | X = x} = E{Y | X = x}.

Thus, η(x) is the conditional probability that Y is 1 given X = x. To see that this suffices to describe the distribution of (X,Y), observe that for any C ⊆ Rd × {0, 1}, we have

C = (C ∩ (Rd × {0})) ∪ (C ∩ (Rd × {1}))  def=  (C0 × {0}) ∪ (C1 × {1}),

and

P{(X,Y) ∈ C} = P{X ∈ C0, Y = 0} + P{X ∈ C1, Y = 1}
             = ∫_{C0} (1 − η(x)) µ(dx) + ∫_{C1} η(x) µ(dx).


As this is valid for any Borel-measurable set C, the distribution of (X,Y) is determined by (µ, η). The function η is sometimes called the a posteriori probability.

Any function g : Rd → {0, 1} defines a classifier or a decision function. The error probability of g is L(g) = P{g(X) ≠ Y}. Of particular interest is the Bayes decision function

g∗(x) =  1  if η(x) > 1/2,
         0  otherwise.

This decision function minimizes the error probability.

Theorem 2.1. For any decision function g : Rd → {0, 1},

P{g∗(X) ≠ Y} ≤ P{g(X) ≠ Y},

that is, g∗ is the optimal decision.

Proof. Given X = x, the conditional error probability of any decision g may be expressed as

P{g(X) ≠ Y | X = x}
  = 1 − P{Y = g(X) | X = x}
  = 1 − (P{Y = 1, g(X) = 1 | X = x} + P{Y = 0, g(X) = 0 | X = x})
  = 1 − (I{g(x)=1} P{Y = 1 | X = x} + I{g(x)=0} P{Y = 0 | X = x})
  = 1 − (I{g(x)=1} η(x) + I{g(x)=0} (1 − η(x))),

where I{A} denotes the indicator of the set A. Thus, for every x ∈ Rd,

P{g(X) ≠ Y | X = x} − P{g∗(X) ≠ Y | X = x}
  = η(x) (I{g∗(x)=1} − I{g(x)=1}) + (1 − η(x)) (I{g∗(x)=0} − I{g(x)=0})
  = (2η(x) − 1) (I{g∗(x)=1} − I{g(x)=1})
  ≥ 0

by the definition of g∗. The statement now follows by integrating both sides with respect to µ(dx). □

Figure 2.1. The Bayes decision in the example on the left is 1 if x > a, and 0 otherwise.


Remark. g∗ is called the Bayes decision and L∗ = P{g∗(X) ≠ Y} is referred to as the Bayes probability of error, Bayes error, or Bayes risk. The proof given above reveals that

L(g) = 1 − E{I{g(X)=1} η(X) + I{g(X)=0} (1 − η(X))},

and in particular,

L∗ = 1 − E{I{η(X)>1/2} η(X) + I{η(X)≤1/2} (1 − η(X))}. □

We observe that the a posteriori probability

η(x) = P{Y = 1 | X = x} = E{Y | X = x}

minimizes the squared error when Y is to be predicted by f(X) for some function f : Rd → R:

E{(η(X) − Y)²} ≤ E{(f(X) − Y)²}.

To see why the above inequality is true, observe that for each x ∈ Rd,

E{(f(X) − Y)² | X = x}
  = E{(f(x) − η(x) + η(x) − Y)² | X = x}
  = (f(x) − η(x))² + 2(f(x) − η(x)) E{η(x) − Y | X = x} + E{(η(X) − Y)² | X = x}
  = (f(x) − η(x))² + E{(η(X) − Y)² | X = x},

since the cross term vanishes: E{η(x) − Y | X = x} = η(x) − η(x) = 0.

The conditional median, i.e., the function minimizing the absolute error E{|f(X) − Y|}, is even more closely related to the Bayes rule (see Problem 2.12).

2.2 A Simple Example

Let us consider the prediction of a student's performance in a course (pass/fail) when given a number of important factors. First, let Y = 1 denote a pass and let Y = 0 stand for failure. The sole observation X is the number of hours of study per week. This, in itself, is not a foolproof predictor of a student's performance, because for that we would need more information about the student's quickness of mind, health, and social habits. The regression function η(x) = P{Y = 1 | X = x} is probably monotonically increasing in x. If it were known to be η(x) = x/(c + x), c > 0, say, our problem would be solved because the Bayes decision is

g∗(x) =  1  if η(x) > 1/2 (i.e., x > c),
         0  otherwise.


The corresponding Bayes error is

L∗ = L(g∗) = E{min(η(X), 1 − η(X))} = E{min(c, X)/(c + X)}.

While we could deduce the Bayes decision from η alone, the same cannot be said for the Bayes error L∗—it requires knowledge of the distribution of X. If X = c with probability one (as in an army school, where all students are forced to study c hours per week), then L∗ = 1/2. If we have a population that is nicely spread out, say, X is uniform on [0, 4c], then the situation improves:

L∗ = (1/(4c)) ∫_0^{4c} min(c, x)/(c + x) dx = (1/4) log(5e/4) ≈ 0.305785.

Far away from x = c, discrimination is really simple. In general, discrimination is much easier than estimation because of this phenomenon.
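A quick numerical check of this example (our sketch, not part of the book; the value of c, the sample size, and the seed are arbitrary): draw X uniformly on [0, 4c], draw Y with P{Y = 1 | X = x} = x/(c + x), and compare the empirical error of the Bayes decision with (1/4) log(5e/4).

import numpy as np

rng = np.random.default_rng(3)
c = 2.0                                  # any c > 0 gives the same Bayes error
N = 10**6
X = rng.uniform(0.0, 4.0 * c, size=N)
eta = X / (c + X)
Y = (rng.uniform(size=N) < eta).astype(int)

g_star = (X > c).astype(int)             # Bayes decision: 1 iff eta(x) > 1/2
print((g_star != Y).mean())              # Monte Carlo estimate of L*
print(0.25 * np.log(5 * np.e / 4))       # about 0.305785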

2.3 Another Simple Example

Let us work out a second simple example in which Y = 0 or Y = 1 according to whether a student fails or passes a course. X represents one or more observations regarding the student. The components of X in our example will be denoted by T, B, and E respectively, where T is the average number of hours the student watches TV, B is the average number of beers downed each day, and E is an intangible quantity measuring extra negative factors such as laziness and learning difficulties. In our cooked-up example, we have

Y =  1  if T + B + E < 7,
     0  otherwise.

Thus, if T, B, and E are known, Y is known as well. The Bayes classifier decides 1 if T + B + E < 7 and 0 otherwise. The corresponding Bayes probability of error is zero. Unfortunately, E is intangible, and not available to the observer. We only have access to T and B. Given T and B, when should we guess that Y = 1? To answer this question, one must know the joint distribution of (T, B, E), or, equivalently, the joint distribution of (T, B, Y). So, let us assume that T, B, and E are i.i.d. exponential random variables (thus, they have density e^{−u} on [0,∞)). The Bayes rule compares P{Y = 1 | T, B} with P{Y = 0 | T, B} and makes a decision consistent with the maximum of these two values. A simple calculation shows that

P{Y = 1 | T, B} = P{T + B + E < 7 | T, B}
               = P{E < 7 − T − B | T, B}
               = max(0, 1 − e^{−(7−T−B)}).


The crossover between two decisions occurs when this value equals 1/2. Thus, the Bayes classifier is as follows:

g∗(T, B) =  1  if T + B < 7 − log 2 = 6.306852819 . . . ,
            0  otherwise.

Of course, this classifier is not perfect. The probability of error is

P{g∗(T, B) ≠ Y}
  = P{T + B < 7 − log 2, T + B + E ≥ 7} + P{T + B ≥ 7 − log 2, T + B + E < 7}
  = E{e^{−(7−T−B)} I{T+B < 7−log 2}} + E{(1 − e^{−(7−T−B)}) I{7 > T+B ≥ 7−log 2}}
  = ∫_0^{7−log 2} x e^{−x} e^{−(7−x)} dx + ∫_{7−log 2}^{7} x e^{−x} (1 − e^{−(7−x)}) dx
    (since the density of T + B is u e^{−u} on [0,∞))
  = e^{−7} ( (7 − log 2)²/2 + 2(8 − log 2) − 8 − 7²/2 + (7 − log 2)²/2 )
    (as ∫_x^∞ u e^{−u} du = (1 + x) e^{−x})
  = 0.0199611 . . . .

If we have only access to T, then the Bayes classifier is allowed to use T only. First, we find

P{Y = 1 | T} = P{E + B < 7 − T | T}
             = max(0, 1 − (1 + 7 − T) e^{−(7−T)}).

The crossover at 1/2 occurs at T = c  def=  5.321653009 . . . , so that the Bayes classifier is given by

g∗(T) =  1  if T < c,
         0  otherwise.

The probability of error is

P{g∗(T) ≠ Y}
  = P{T < c, T + B + E ≥ 7} + P{T ≥ c, T + B + E < 7}
  = E{(1 + 7 − T) e^{−(7−T)} I{T < c}} + E{(1 − (1 + 7 − T) e^{−(7−T)}) I{7 > T ≥ c}}
  = ∫_0^c e^{−x} (1 + 7 − x) e^{−(7−x)} dx + ∫_c^7 e^{−x} (1 − (1 + 7 − x) e^{−(7−x)}) dx
  = e^{−7} ( 8²/2 − (8 − c)²/2 + e^{−(c−7)} − 1 − (8 − c)²/2 + 1/2 )
  = 0.0270675 . . . .

The Bayes error has increased, but not by much. Finally, if we do not have access to any of the three variables, T, B, and E, the best we can do is see which class is most likely. To this end, we compute

P{Y = 0} = P{T + B + E ≥ 7} = (1 + 7 + 7²/2) e^{−7} = 0.02963616388 . . . .

If we set g ≡ 1 all the time, we make an error with probability 0.02963616388 . . . .

In practice, Bayes classifiers are unknown simply because the distribution of (X,Y) is unknown. Consider a classifier based upon (T, B). Rosenblatt's perceptron (see Chapter 4) looks for the best linear classifier based upon the data. That is, the decision is of the form

g(T, B) =  1  if aT + bB < c,
           0  otherwise

for some data-based choices for a, b, and c. If we have lots of data at our disposal, then it is possible to pick out a linear classifier that is nearly optimal. As we have seen above, the Bayes classifier happens to be linear. That is a sheer coincidence, of course. If the Bayes classifier had not been linear—for example, if we had Y = I{T + B² + E < 7}—then even the best perceptron would be suboptimal, regardless of how many data pairs one would have. If we use the 3-nearest neighbor rule (Chapter 5), the asymptotic probability of error is not more than 1.3155 times the Bayes error, which in our example is about 0.02625882705. The example above also shows the need to look at individual components, and to evaluate how many and which components would be most useful for discrimination. This subject is covered in the chapter on feature extraction (Chapter 32).
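The error probabilities in this example are easy to reproduce by simulation. The following Monte Carlo sketch is ours, not the book's; the sample size and seed are arbitrary, and the values in the comments are only approximate.

import numpy as np

rng = np.random.default_rng(4)
N = 10**6
T, B, E = rng.exponential(size=(3, N))        # i.i.d. standard exponentials
Y = (T + B + E < 7).astype(int)

g_TB = (T + B < 7 - np.log(2)).astype(int)    # Bayes rule given (T, B)
c = 5.321653009
g_T = (T < c).astype(int)                     # Bayes rule given T only

print((g_TB != Y).mean())    # roughly 0.020
print((g_T != Y).mean())     # roughly 0.027
print((Y == 0).mean())       # roughly 0.0296, the error of g ≡ 1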

2.4 Other Formulas for the Bayes Risk

The following forms of the Bayes error are often convenient:

L∗ = infg:Rd→0,1

Pg(X) 6= Y

= E minη(X), 1− η(X)

=12− 1

2E |2η(X)− 1| .

Page 26: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

2.4 Other Formulas for the Bayes Risk 25

Figure 2.2. The Bayes decision whenclass-conditional densities exist. In thefigure on the left, the decision is 0 on[a, b] and 1 elsewhere.

In special cases, we may obtain other helpful forms. For example, if Xhas a density f , then

L∗ =∫

min(η(x), 1− η(x))f(x)dx

=∫

min ((1− p)f0(x), pf1(x)) dx,

where p = PY = 1, and fi(x) is the density of X given that Y =i. p and 1 − p are called the class probabilities, and f0, f1 are the class-conditional densities. If f0 and f1 are nonoverlapping, that is,

∫f0f1 = 0,

then obviously L∗ = 0. Assume moreover that p = 1/2. Then

L∗ =12

∫min (f0(x), f1(x)) dx

=12

∫f1(x)− (f1(x)− f0(x))+ dx

=12− 1

4

∫|f1(x)− f0(x)|dx.

Here g+ denotes the positive part of a function g. Thus, the Bayes error isdirectly related to the L1 distance between the class densities.

Figure 2.3. The shaded areais the L1 distance between theclass-conditional densities.

Page 27: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

2.5 Plug-In Decisions 26

2.5 Plug-In Decisions

The best guess of Y from the observation X is the Bayes decision

g∗(x) =

0 if η(x) ≤ 1/21 otherwise =

0 if η(x) ≤ 1− η(x)1 otherwise.

The function η is typically unknown. Assume that we have access to non-negative functions η(x), 1 − η(x) that approximate η(x) and 1 − η(x) re-spectively. In this case it seems natural to use the plug-in decision function

g(x) =

0 if η(x) ≤ 1/21 otherwise,

to approximate the Bayes decision. The next well-known theorem (see,e.g., Van Ryzin (1966), Wolverton and Wagner (1969a), Glick (1973), Csibi(1971), Gyorfi (1975), (1978), Devroye and Wagner (1976b), Devroye (1982b),and Devroye and Gyorfi (1985)) states that if η(x) is close to the real aposteriori probability in L1-sense, then the error probability of decision gis near the optimal decision g∗.

Theorem 2.2. For the error probability of the plug-in decision g definedabove, we have

Pg(X) 6= Y − L∗ = 2∫Rd

|η(x)− 1/2|Ig(x) 6=g∗(x)µ(dx),

and

Pg(X) 6= Y − L∗ ≤ 2∫Rd

|η(x)− η(x)|µ(dx) = 2E|η(X)− η(X)|.

Proof. If for some x ∈ Rd, g(x) = g∗(x), then clearly the differencebetween the conditional error probabilities of g and g∗ is zero:

Pg(X) 6= Y |X = x −Pg∗(X) 6= Y |X = x = 0.

Otherwise, if g(x) 6= g∗(x), then as seen in the proof of Theorem 2.1, thedifference may be written as

Pg(X) 6= Y |X = x − Pg∗(X) 6= Y |X = x= (2η(x)− 1)

(Ig∗(x)=1 − Ig(x)=1

)= |2η(x)− 1|Ig(x) 6=g∗(x).

Thus,

Pg(X) 6= Y − L∗ =∫Rd

2|η(x)− 1/2|Ig(x) 6=g∗(x)µ(dx)

≤∫Rd

2|η(x)− η(x)|µ(dx),

Page 28: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

2.6 Bayes Error Versus Dimension 27

since g(x) 6= g∗(x) implies |η(x)− η(x)| ≥ |η(x)− 1/2|. 2

When the classifier g(x) can be put in the form

g(x) =

0 if η1(x) ≤ η0(x)1 otherwise,

where η1(x), η0(x) are some approximations of η(x) and 1 − η(x), respec-tively, the situation differs from that discussed in Theorem 2.2 if η0(x) +η1(x) does not necessarily equal to one. However, an inequality analogousto that of Theorem 2.2 remains true:

Theorem 2.3. The error probability of the decision defined above is boundedfrom above by

Pg(X) 6= Y −L∗ ≤∫Rd

|(1−η(x))−η0(x)|µ(dx)+∫Rd

|η(x)−η1(x)|µ(dx).

The proof is left to the reader (Problem 2.9).

Remark. Assume that the class-conditional densities f0, f1 exist and areapproximated by the densities f0(x), f1(x). Assume furthermore that theclass probabilities p = PY = 1 and 1− p = PY = 0 are approximatedby p1 and p0, respectively. Then for the error probability of the plug-indecision function

g(x) =

0 if p1f1(x) ≤ p0f0(x)1 otherwise,

Pg(X) 6= Y − L∗

≤∫Rd

|(1− p)f0(x)− p0f0(x)|dx+∫Rd

|pf1(x)− p1f1(x)|dx.

See Problem 2.10. 2

2.6 Bayes Error Versus Dimension

The components of X that matter in the Bayes classifier are those thatexplicitly appear in η(X). In fact, then, all discrimination problems are one-dimensional, as we could equally well replace X by η(X) or by any strictlymonotone function of η(X), such as η7(X)+5η3(X)+η(X). Unfortunately,η is unknown in general. In the example in Section 2.3, we had in one case

η(T,B) = max(0, 1− e−(7−T−B)

)

Page 29: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 28

and in another case

η(T ) = max(0, 1− (1 + 7− T )e−(7−T )

).

The former format suggests that we could base all decisions on T +B. Thismeans that if we had no access to T and B individually, but to T + Bjointly, we would be able to achieve the same results! Since η is unknown,all of this is really irrelevant.

In general, the Bayes risk increases if we replace X by T (X) for anytransformation T (see Problem 2.1), as this destroys information. On theother hand, there exist transformations (such as η(X)) that leave the Bayeserror untouched. For more on the relationship between the Bayes error andthe dimension, refer to Chapter 32.

Problems and Exercises

Problem 2.1. Let T : X → X ′ be an arbitrary measurable function. If L∗X andL∗T (X) denote the Bayes error probabilities for (X,Y ) and (T (X), Y ), respectively,then prove that

L∗T (X) ≥ L∗X .

(This shows that transformations of X destroy information, because the Bayesrisk increases.)

Problem 2.2. Let X ′ be independent of (X,Y ). Prove that

L∗(X,X′) = L∗X .

Problem 2.3. Show that L∗ ≤ min(p, 1−p), where p, 1−p are the class probabil-ities. Show that equality holds if X and Y are independent. Exhibit a distributionwhere X is not independent of Y , but L∗ = min(p, 1− p).

Problem 2.4. Neyman-Pearson lemma. Consider again the decision prob-lem, but with a decision g, we now assign two error probabilities,

L(0)(g) = Pg(X) = 1|Y = 0 and L(1)(g) = Pg(X) = 0|Y = 1.

Assume that the class-conditional densities f0, f1 exist. For c > 0, define thedecision

gc(x) =

1 if cf1(x) > f0(x)0 otherwise.

Prove that for any decision g, if L(0)(g) ≤ L(0)(gc), then L(1)(g) ≥ L(1)(gc). Inother words, if L(0) is required to be kept under a certain level, then the decisionminimizing L(1) has the form of gc for some c. Note that g∗ is like that.

Problem 2.5. Decisions with rejection. Sometimes in decision problems,one is allowed to say “I don’t know,” if this does not happen frequently. Thesedecisions are called decisions with a reject option (see, e.g., Forney (1968), Chow

Page 30: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 29

(1970)). Formally, a decision g(x) can have three values: 0, 1, and “reject.” Thereare two performance measures: the probability of rejection Pg(X) = “reject”,and the error probability Pg(X) 6= Y |g(X) 6= “reject”. For a 0 < c < 1/2,define the decision

gc(x) =

1 if η(x) > 1/2 + c0 if η(x) ≤ 1/2− c“reject” otherwise.

Show that for any decision g, if

Pg(X) = “reject” ≤ Pgc(X) = “reject”,

then

Pg(X) 6= Y |g(X) 6= “reject” ≥ Pgc(X) 6= Y |gc(X) 6= “reject”.

Thus, to keep the probability of rejection under a certain level, decisions of theform of gc are optimal (Gyorfi, Gyorfi, and Vajda (1978)).

Problem 2.6. Consider the prediction of a student’s failure based upon variablesT and B, where Y = IT+B+E<7 and E is an inaccessible variable (see Section2.3).

(1) Let T , B, and E be independent. Merely by changing the distributionof E, show that the Bayes error for classification based upon (T,B) canbe made as close as desired to 1/2.

(2) Let T and B be independent and exponentially distributed. Find a jointdistribution of (T,B,E) such that the Bayes classifier is not a linearclassifier.

(3) Let T and B be independent and exponentially distributed. Find a jointdistribution of (T,B,E) such that the Bayes classifier is given by

g∗(T,B) =

1 if T 2 +B2 < 10,0 otherwise .

(4) Find the Bayes classifier and Bayes error for classification based on (T,B)(with Y as above) if (T,B,E) is uniformly distributed on [0, 4]3.

Problem 2.7. Assume that T,B, and E are independent uniform [0, 4] randomvariables with interpretations as in Section 2.3. Let Y = 1 (0) denote whether astudent passes (fails) a course. Assume that Y = 1 if and only if TBE ≤ 8.

(1) Find the Bayes decision if no variable is available, if only T is available,and if only T and B are available.

(2) Determine in all three cases the Bayes error.(3) Determine the best linear classifier based upon T and B only.

Problem 2.8. Let η′, η′′ : Rd → [0, 1] be arbitrary measurable functions, and de-fine the corresponding decisions by g′(x) = Iη′(x)>1/2 and g′′(x) = Iη′′(x)>1/2.Prove that

|L(g′)− L(g′′)| ≤ Pg′(X) 6= g′′(X)and

|L(g′)− L(g′′)| ≤ E|2η(X)− 1|Ig′(X) 6=g′′(X)

.

Page 31: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 30

Problem 2.9. Prove Theorem 2.3.

Problem 2.10. Assume that the class-conditional densities f0 and f1 exist andare approximated by the densities f0 and f1, respectively. Assume furthermorethat the class probabilities p = PY = 1 and 1 − p = PY = 0 are approxi-mated by p1 and p0. Prove that for the error probability of the plug-in decisionfunction

g(x) =

0 if p1f1(x) ≤ p0f0(x)1 otherwise,

we have

Pg(X) 6= Y −L∗ ≤∫Rd

|pf1(x)− p1f1(x)|dx+

∫Rd

|(1− p)f0(x)− p0f0(x)|dx.

Problem 2.11. Using the notation of Problem 2.10, show that if for a sequenceof fm,n(x) and pm,n (m = 0, 1),

limn→∞

∫Rd

|pf1(x)− p1,nf1,n(x)|2dx+

∫Rd

|(1− p)f0(x)− p0,nf0,n(x)|2dx = 0,

then for the corresponding sequence of plug-in decisions limn→∞ Pgn(X) 6=Y = L∗ (Wolverton and Wagner (1969a)). Hint: According to Problem 2.10, itsuffices to show that if we are given a deterministic sequence of density functionsf, f1, f2, f3, . . ., then

limn→∞

∫(fn(x)− f(x))2 dx = 0

implies

limn→∞

∫|fn(x)− f(x)|dx = 0.

(A function f is called a density function if it is nonnegative and∫f(x)dx = 1.)

To see this, observe that∫|fn(x)− f(x)|dx = 2

∫(fn(x)− f(x))+ dx = 2

∑i

∫Ai

(fn(x)− f(x))+ dx,

where A1, A2, . . . is a partition of Rd into unit cubes, and f+ denotes the positivepart of a function f . The key observation is that convergence to zero of each termof the infinite sum implies convergence of the whole integral by the dominatedconvergence theorem, since

∫(fn(x)− f(x))+ dx ≤

∫fn(x)dx = 1. Handle the

right-hand side by the Cauchy-Schwarz inequality.

Problem 2.12. Define the L1 error of a function f : Rd → R by J(f) =E|f(X) − Y |. Show that a function minimizing J(f) is the Bayes rule g∗,that is, J∗ = inff J(f) = J(g∗). Thus, J∗ = L∗. Define a decision by

g(x) =

0 if f(x) ≤ 1/21 otherwise,

Prove that its error probability L(g) = Pg(X) 6= Y satisfies the inequality

L(g)− L∗ ≤ J(f)− J∗.

Page 32: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 31Printer: Opaque this

3Inequalities and AlternateDistance Measures

3.1 Measuring Discriminatory Information

In our two-class discrimination problem, the best rule has (Bayes) proba-bility of error

L∗ = E min(η(X), 1− η(X)) .

This quantity measures how difficult the discrimination problem is. It alsoserves as a gauge of the quality of the distribution of (X,Y ) for patternrecognition. Put differently, if ψ1 and ψ2 are certain many-to-one mappings,L∗ may be used to compare discrimination based on (ψ1(X), Y ) with thatbased on (ψ2(X), Y ). When ψ1 projects Rd to Rd1 by taking the first d1

coordinates, and ψ2 takes the last d2 coordinates, the corresponding Bayeserrors will help us decide which projection is better. In this sense, L∗ is thefundamental quantity in feature extraction.

Other quantities have been suggested over the years that measure thediscriminatory power hidden in the distribution of (X,Y ). These may behelpful in some settings. For example, in theoretical studies or in certainproofs, the relationship between L∗ and the distribution of (X,Y ) maybecome clearer via certain inequalities that link L∗ with other functionalsof the distribution. We all understand moments and variances, but how dothese simple functionals relate to L∗? Perhaps we may even learn a thingor two about what it is that makes L∗ small. In feature selection, someexplicit inequalities involving L∗ may provide just the kind of numericalinformation that will allow one to make certain judgements on what kind offeature is preferable in practice. In short, we will obtain more information

Page 33: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.1 Measuring Discriminatory Information 32

about L∗ with a variety of uses in pattern recognition.

Page 34: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.3 The Nearest Neighbor Error 33

In the next few sections, we avoid putting any conditions on the distri-bution of (X,Y ).

3.2 The Kolmogorov Variational Distance

Inspired by the total variation distance between distributions, the Kol-mogorov variational distance

δKO =12E |PY = 1|X −PY = 0|X|

=12E|2η(X)− 1|

captures the distance between the two classes. We will not need anythingspecial to deal with δKO as

L∗ = E

12− 1

2|2η(X)− 1|

=

12− 1

2E|2η(X)− 1|

=12− δKO.

3.3 The Nearest Neighbor Error

The asymptotic error of the nearest neighbor rule is

LNN = E2η(X)(1− η(X))

(see Chapter 5). Clearly, 2η(1− η) ≥ min(η, 1− η) as 2 max(η, 1− η) ≥ 1.Also, using the notation A = min(η(X), 1− η(X)), we have

L∗ ≤ LNN = 2EA(1−A)≤ 2EA ·E1−A

(by the second association inequality of Theorem A.19)

= 2L∗(1− L∗) ≤ 2L∗, (3.1)

which are well-known inequalities of Cover and Hart (1967). LNN providesus with quite a bit of information about L∗.

The measure LNN has been rediscovered under other guises: Devijver andKittler (1982, p.263) and Vajda (1968) call it the quadratic entropy, andMathai and Rathie (1975) refer to it as the harmonic mean coefficient.

Page 35: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.4 The Bhattacharyya Affinity 34

Figure 3.1. Relationship betweenthe Bayes error and the asymptoticnearest neighbor error. Every pointin the unshaded region is possible.

3.4 The Bhattacharyya Affinity

The Bhattacharyya measure of affinity (Bhattacharyya, (1946)) is − log ρ,where

ρ = E√

η(X)(1− η(X))

will be referred to as the Matushita error. It does not occur naturally as thelimit of any standard discrimination rule (see, however, Problem 6.11). ρwas suggested as a distance measure for pattern recognition by Matushita(1956). It also occurs under other guises in mathematical statistics—see, forexample, the Hellinger distance literature (Le Cam (1970), Beran (1977)).

Clearly, ρ = 0 if and only if η(X) ∈ 0, 1 with probability one, thatis, if L∗ = 0. Furthermore, ρ takes its maximal value 1/2 if and only ifη(X) = 1/2 with probability one. The relationship between ρ and L∗ is notlinear though. We will show that for all distributions, LNN is more usefulthan ρ if it is to be used as an approximation of L∗.

Theorem 3.1. For all distributions, we have

12− 1

2

√1− 4ρ2 ≤ 1

2− 1

2

√1− 2LNN

≤ L∗

≤ LNN

≤ ρ.

Proof. First of all,

ρ2 = E2√

η(X)(1− η(X))

≤ E η(X)(1− η(X)) (by Jensen’s inequality)

Page 36: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.4 The Bhattacharyya Affinity 35

=LNN

2≤ L∗(1− L∗) (by the Cover-Hart inequality (3.1)).

Second, as√η(1− η) ≥ 2η(1− η) for all η ∈ [0, 1], we see that ρ ≥ LNN ≥

L∗. Finally, by the Cover-Hart inequality,

√1− 2LNN ≥

√1− 4L∗(1− L∗) = 1− 2L∗.

Putting all these things together establishes the chain of inequalities. 2

Figure 3.2. The inequalities link-ing ρ to L∗ are illustrated. Notethat the region is larger than thatcut out in the (L∗, LNN) plane inFigure 3.1.

The inequality LNN ≤ ρ is due to Ito (1972). The inequality LNN ≥ 2ρ2

is due to Horibe (1970). The inequality 1/2−√

1− 4ρ2/2 ≤ L∗ ≤ ρ can befound in Kailath (1967). The left-hand side of the last inequality was shownby Hudimoto (1957). All these inequalities are tight (see Problem 3.2). Theappeal of quantities like LNN and ρ is that they involve polynomials ofη, whereas L∗ = Emin(η(X), 1 − η(X)) is nonpolynomial. For certaindiscrimination problems in which X has a distribution that is known up tocertain parameters, one may be able to compute LNN and ρ explicitly asa function of these parameters. Via inequalities, this may then be used toobtain performance guarantees for parametric discrimination rules of theplug-in type (see Chapter 16).

For completeness, we mention a generalization of Bhattacharyya’s mea-sure of affinity, first suggested by Chernoff (1952):

δC = − log(Eηα(X)(1− η(X))1−α

),

where α ∈ (0, 1) is fixed. For α = 1/2, δC = − log ρ. The asymmetryintroduced by taking α 6= 1/2 has no practical interpretation, however.

Page 37: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.5 Entropy 36

3.5 Entropy

The entropy of a discrete probability distribution (p1, p2, . . .) is defined by

H = H(p1, p2, . . .) = −∞∑i=1

pi log pi,

where, by definition, 0 log 0 = 0 (Shannon (1948)). The key quantity ininformation theory (see Cover and Thomas (1991)), it has countless appli-cations in many branches of computer science, mathematical statistics, andphysics. The entropy’s main properties may be summarized as follows.

A. H ≥ 0 with equality if and only if pi = 1 for some i. Proof: log pi ≤ 0for all i with equality if and only if pi = 1 for some i. Thus, entropyis minimal for a degenerate distribution, i.e., a distribution with theleast amount of “spread.”

B. H(p1, . . . , pk) ≤ log k with equality if and only if p1 = p2 = · · · = pk =1/k. In other words, the entropy is maximal when the distribution ismaximally smeared out. Proof:

H(p1, . . . , pk)− log k =k∑i=1

pi log(

1kpi

)≤ 0

by the inequality log x ≤ x− 1, x > 0.

C. For a Bernoulli distribution (p, 1−p), the binary entropyH(p, 1−p) =−p log p−(1− p) log(1− p) is concave in p.

Assume that X is a discrete random variable that must be guessed byasking questions of the type “is X ∈ A?,” for some sets A. Let N bethe minimum expected number of questions required to determine X withcertainty. It is well known that

Hlog 2

≤ N <H

log 2+ 1

(e.g., Cover and Thomas (1991)). Thus, H not only measures how spreadout the mass of X is, but also provides us with concrete computationalbounds for certain algorithms. In the simple example above, H is in factproportional to the expected computational time of the best algorithm.

We are not interested in information theory per se, but rather in itsusefulness in pattern recognition. For our discussion, if we fix X = x, thenY is Bernoulli (η(x)). Hence, the conditional entropy of Y given X = x is

H(η(x), 1− η(x)) = −η(x) log η(x)− (1− η(x)) log(1− η(x)).

Page 38: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.5 Entropy 37

It measures the amount of uncertainty or chaos in Y given X = x. Aswe know, it takes values between 0 (when η(x) ∈ 0, 1) and log 2 (whenη(x) = 1/2), and is concave in η(x). We define the expected conditionalentropy by

E = E H(η(X), 1− η(X))= −E η(X) log η(X) + (1− η(X)) log(1− η(X)) .

For brevity, we will refer to E as the entropy. As pointed out above, E = 0 ifand only if η(X) ∈ 0, 1 with probability one. Thus, E and L∗ are relatedto each other.

Theorem 3.2.A. E ≤ H(L∗, 1 − L∗) = −L∗ logL∗ − (1 − L∗) log(1 − L∗) (Fano’s

inequality; Fano (1952), see Cover and Thomas (1991, p. 39)).B. E ≥ − log(1− LNN) ≥ − log(1− L∗).C. E ≤ log 2− 1

2 (1− 2LNN) ≤ log 2− 12 (1− 2L∗)2.

Figure 3.3. Inequalities (A)and (B) of Theorem 3.2 are il-lustrated here.

Proof.Part A. Define A = min(η(X), 1− η(X)). Then

E = EH(A, 1−A)≤ H(EA, 1−EA)

(because H is concave, by Jensen’s inequality)

= H(L∗, 1− L∗).

Part B.

E = −EA logA+ (1−A) log(1−A)≥ −E

log(A2 + (1−A)2

)(by Jensen’s inequality)

Page 39: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.6 Jeffreys’ Divergence 38

= −Elog(1− 2A(1−A))≥ − log(1−E2A(1−A)) (by Jensen’s inequality)

= − log(1− LNN)

≥ − log(1− L∗).

Part C. By the concavity of H(A, 1−A) as a function of A, and Taylorseries expansion,

H(A, 1−A) ≤ log 2− 12(2A− 1)2.

Therefore, by Part A,

E ≤ log 2−E

12(2A− 1)2

= log 2− 1

2(1− 2LNN)

≤ log 2− 12

+ 2L∗(1− L∗)

(by the Cover-Hart inequality (3.1))

= log 2− 12(1− 2L∗)2. 2

Remark. The nearly monotone relationship between E and L∗ will seelots of uses. We warn the reader that near the origin, L∗ may decreaselinearly in E , but it may also decrease much faster than E209. Such widevariation was not observed in the relationship between L∗ and LNN (whereit is linear) or L∗ and ρ (where it is between linear and quadratic). 2

3.6 Jeffreys’ Divergence

Jeffreys’ divergence (1948) is a symmetric form of the Kullback-Leibler(1951) divergence

δKL = Eη(X) log

η(X)1− η(X)

.

It will be denoted by

J = E

(2η(X)− 1) logη(X)

1− η(X)

.

Page 40: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.6 Jeffreys’ Divergence 39

To understand J , note that the function (2η−1) log η1−η is symmetric about

1/2, convex, and has minimum (0) at η = 1/2. As η ↓ 0, η ↑ 1, the functionbecomes unbounded. Therefore, J = ∞ if Pη(X) ∈ 0, 1 > 0. For thisreason, its use in discrimination is necessarily limited. For generalizationsof J , see Renyi (1961), Burbea and Rao (1982), Taneja (1983; 1987), andBurbea (1984). It is thus impossible to bound J from above by a functionof LNN and/or L∗. However, lower bounds are easy to obtain. As x log((1+x)/(1− x)) is convex in x and

(2η − 1) logη

1− η= |2η − 1| log

(1 + |2η − 1|1− |2η − 1|

),

we note that by Jensen’s inequality,

J ≥ E|2η(X)− 1| log(

1 + E|2η(X)− 1|1−E|2η(X)− 1|

)= (1− 2L∗) log

(1 + (1− 2L∗)1− (1− 2L∗)

)= (1− 2L∗) log

(1− L∗

L∗

)≥ 2(1− 2L∗)2.

The first bound cannot be universally bettered (it is achieved when η(x) isconstant over the space). Also, for fixed L∗, any value of J above the lowerbound is possible for some distribution of (X,Y ). From the definition of J ,we see that J = 0 if and only if η ≡ 1/2 with probability one, or L∗ = 1/2.

Figure 3.4. This figure illus-trates the above lower bound onJeffreys’ divergence in terms ofthe Bayes error.

Related bounds were obtained by Toussaint (1974b):

J ≥√

1− 2LNN log(

1 +√

1− 2LNN

1−√

1− 2LNN

)≥ 2(1− 2LNN).

The last bound is strictly better than our L∗ bound given above. See Prob-lem 3.7.

Page 41: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.7 F-Errors 40

3.7 F-Errors

The error measures discussed so far are all related to expected values ofconcave functions of η(X) = PY = 1|X. In general, if F is a concavefunction on [0, 1], we define the F -error corresponding to (X,Y ) by

dF (X,Y ) = E F (η(X)) .

Examples of F -errors are

(a) the Bayes error L∗: F (x) = min(x, 1− x),

(b) the asymptotic nearest neighbor error LNN: F (x) = 2x(1− x),

(c) the Matushita error ρ: F (x) =√x(1− x),

(d) the expected conditional entropy E : F (x) = −x log x− (1−x) log(1−x),

(e) the negated Jeffreys’ divergence −J : F (x) = −(2x− 1) log x1−x .

Hashlamoun, Varshney, and Samarasooriya (1994) point out that if F (x) ≥min(x, 1−x) for each x ∈ [0, 1], then the corresponding F -error is an upperbound on the Bayes error. The closer F (x) is to min(x, 1− x), the tighterthe upper bound is. For example, F (x) = (1/2) sin(πx) ≤ 2x(1− x) yieldsan upper bound tighter than LNN. All these errors share the property thatthe error increases if X is transformed by an arbitrary function.

Theorem 3.3. Let t : Rd → Rk be an arbitrary measurable function.Then for any distribution of (X,Y ),

dF (X,Y ) ≤ dF (t(X), Y ).

Proof. Define ηt : Rk → [0, 1] by ηt(x) = PY = 1|t(X) = x, andobserve that

ηt(t(X)) = E η(X)|t(X) .Thus,

dF (t(X), Y ) = E F (ηt(t(X)))= E F (E η(X)|t(X))≥ E E F (η(X))|t(X) (by Jensen’s inequality)

= EF (η(X)) = dF (X,Y ). 2

Remark. We also see from the proof that the F -error remains unchangedif the transformation t is invertible. Theorem 3.3 states that F -errors are abit like Bayes errors—when information is lost (by replacing X by t(X)),F -errors increase. 2

Page 42: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.8 The Mahalanobis Distance 41

3.8 The Mahalanobis Distance

Two conditional distributions with about the same covariance matrices andmeans that are far away from each other are probably so well separatedthat L∗ is small. An interesting measure of the visual distance betweentwo random variables X0 and X1 is the so-called Mahalanobis distance(Mahalanobis, (1936)) given by

∆ =√

(m1 −m0)TΣ−1(m1 −m0),

wherem1 = EX1,m0 = EX0, are the means, Σ1 = E(X1 −m1)(X1 −m1)T

and Σ0 = E

(X0 −m0)(X0 −m0)T

are the covariance matrices, Σ =

pΣ1 + (1 − p)Σ0, (·)T is the transpose of a vector, and p = 1 − p is amixture parameter. If Σ1 = Σ0 = σ2I, where I is the identity matrix, then

∆ =‖m1 −m0‖

σ

is a scaled version of the distance between the means. If Σ1 = σ21I,Σ0 =

σ20I, then

∆ =‖m1 −m0‖√pσ2

1 + (1− p)σ20

varies between ‖m1−m0‖/σ1 and ‖m1−m0‖/σ0 as p changes from 1 to 0.Assume that we have a discrimination problem in which given Y = 1, Xis distributed as X1, given Y = 0, X is distributed as X0, and p = PY =1, 1− p are the class probabilities. Then, interestingly, ∆ is related to theBayes error in a general sense. If the Mahalanobis distance between theclass-conditional distributions is large, then L∗ is small.

Theorem 3.4. (Devijver and Kittler (1982, p. 166)). For all dis-tributions of (X,Y ) for which E

‖X‖2

<∞, we have

L∗ ≤ LNN ≤2p(1− p)

1 + p(1− p)∆2.

Remark. For a distribution with mean m and covariance matrix Σ, theMahalanobis distance from a point x ∈ Rd to m is√

(x−m)TΣ−1(x−m).

In one dimension, this is simply interpreted as distance from the mean asmeasured in units of standard deviation. The use of Mahalanobis distancein discrimination is based upon the intuitive notion that we should classifyaccording to the class for which we are within the least units of standard

Page 43: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.8 The Mahalanobis Distance 42

deviations. At least, for distributions that look like nice globular clouds,such a recommendation may make sense. 2

Proof. First assume that d = 1, that is, X is real valued. Let u and c bereal numbers, and consider the quantity E

(u(X − c)− (2η(X)− 1))2

.

We will show that if u and c are chosen to minimize this number, then itsatisfies

0 ≤ E(u(X − c)− (2η(X)− 1))2

= 2

(2p(1− p)

1 + p(1− p)∆2− LNN

), (3.2)

which proves the theorem for d = 1. To see this, note that the expressionE(u(X − c)− (2η(X)− 1))2

is minimized for c = EX−E2η(X)−1/u.

Then

E(u(X − c)− (2η(X)− 1))2

= Var2η(X)− 1+ u2 VarX − 2uCovX, 2η(X)− 1,

—where CovX,Z = E(X − EX)(Z − EZ)—which is, in turn, mini-mized for u = CovX, 2η(X) − 1/VarX. Straightforward calculationshows that (3.2) indeed holds.

To extend the inequality (3.2) to multidimensional problems, apply it tothe one-dimensional decision problem (Z, Y ), where Z = XTΣ−1(m1−m0).Then the theorem follows by noting that by Theorem 3.3,

LNN(X,Y ) ≤ LNN(Z, Y ),

where LNN(X,Y ) denotes the nearest-neighbor error corresponding to (X,Y ).2

In case X1 and X0 are both normal with the same covariance matrices,we have

Theorem 3.5. (Matushita (1973); see Problem 3.11). When X1

and X0 are multivariate normal random variables with Σ1 = Σ0 = Σ, then

ρ = E√

η(X)(1− η(X))

=√p(1− p) e−∆2/8.

If the class-conditional densities f1 and f0 may be written as functionsof (x−m1)TΣ−1

1 (x−m1) and (x−m0)TΣ−10 (x−m0) respectively, then ∆

remains relatively tightly linked with L∗ (Mitchell and Krzanowski (1985)),but such distributions are the exception rather than the rule. In general,when ∆ is small, it is impossible to deduce whether L∗ is small or not (seeProblem 3.12).

Page 44: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.9 f-Divergences 43

3.9 f-Divergences

We have defined error measures as the expected value of a concave func-tion of η(X). This makes it easier to relate these measures to the Bayeserror L∗ and other error probabilities. In this section we briefly make theconnection to the more classical statistical theory of distances betweenprobability measures. A general concept of these distance measures, calledf -divergences, was introduced by Csiszar (1967). The corresponding theoryis summarized in Vajda (1989). F -errors defined earlier may be calculated ifone knows the class probabilities p, 1− p, and the conditional distributionsµ0, µ1 of X, given Y = 0 and Y = 1, that is,

µi(A) = PX ∈ A|Y = i i = 0, 1.

For fixed class probabilities, an F -error is small if the two conditional distri-butions are “far away” from each other. A metric quantifying this distancemay be defined as follows. Let f : [0,∞) → R∪−∞,∞ be a convex func-tion with f(1) = 0. The f-divergence between two probability measures µand ν on Rd is defined by

Df (µ, ν) = supA=Aj

∑j

ν(Aj)f(µ(Aj)ν(Aj)

),

where the supremum is taken over all finite measurable partitions A of Rd.If λ is a measure dominating µ and ν—that is, both µ and ν are absolutelycontinuous with respect to λ—and p = dµ/dλ and q = dν/dλ are thecorresponding densities, then the f -divergence may be put in the form

Df (µ, ν) =∫q(x)f

(p(x)q(x)

)λ(dx).

Clearly, this quantity is independent of the choice of λ. For example, wemay take λ = µ + ν. If µ and ν are absolutely continuous with respect tothe Lebesgue measure, then λ may be chosen to be the Lebesgue measure.By Jensen’s inequality, Df (µ, ν) ≥ 0, and Df (µ, µ) = 0.

An important example of f -divergences is the total variation, or varia-tional distance obtained by choosing f(x) = |x− 1|, yielding

V (µ, ν) = supA=Aj

∑j

|µ(Aj)− ν(Aj)|.

For this divergence, the equivalence of the two definitions is stated byScheffe’s theorem (see Problem 12.13).

Theorem 3.6. (Scheffe (1947)).

V (µ, ν) = 2 supA|µ(A)− ν(A)| =

∫|p(x)− q(x)|λ(dx),

where the supremum is taken over all Borel subsets of Rd.

Page 45: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.9 f-Divergences 44

Another important example is the Hellinger distance, given by f(x) =(1−

√x)2:

H2(µ, ν) = supA=Aj

2

1−∑j

√µ(Aj)ν(Aj)

= 2

(1−

∫ √p(x)q(x)λ(dx)

).

The quantity I2(µ, ν) =∫ √

p(x)q(x)λ(dx) is often called the Hellingerintegral. We mention two useful inequalities in this respect. For the sake ofsimplicity, we state their discrete form. (The integral forms are analogous,see Problem 3.21.)

Lemma 3.1. (LeCam (1973)). For positive sequences ai and bi, bothsumming to one,

∑i

min(ai, bi) ≥12

(∑i

√aibi

)2

.

Proof. By the Cauchy-Schwarz inequality,

∑i:ai<bi

√aibi ≤

√ ∑i:ai<bi

ai

√ ∑i:ai<bi

bi ≤√ ∑i:ai<bi

ai .

This, together with the inequality (x + y)2 ≤ 2x2 + 2y2, and symmetry,implies

(∑i

√aibi

)2

=

∑i:ai<bi

√aibi +

∑i:ai≥bi

√aibi

2

≤ 2

( ∑i:ai<bi

√aibi

)2

+ 2

∑i:ai≥bi

√aibi

2

≤ 2

∑i:ai<bi

ai +∑

i:ai≥bi

bi

= 2

∑i

min(ai, bi). 2

Page 46: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

3.9 f-Divergences 45

Lemma 3.2. (Devroye and Gyorfi (1985, p. 225)). Let a1, . . . , ak,b1, . . . bk be nonnegative numbers such that

∑ki=1 ai =

∑ki=1 bi = 1. Then

k∑i=1

√aibi ≤ 1−

(∑ki=1 |ai − bi|

)2

8.

Proof.(k∑i=1

|ai − bi|

)2

(k∑i=1

(√ai −

√bi

)2)(

k∑i=1

(√ai +

√bi

)2)

(by the Cauchy-Schwarz inequality)

≤ 4k∑i=1

(√ai −

√bi

)2

= 8

(1−

k∑i=1

√aibi

),

which proves the lemma. 2

Information divergence is obtained by taking f(x) = x log x:

I(µ, ν) = supA=Aj

∑j

µ(Aj) log(µ(Aj)ν(Aj)

).

I(µ, ν) is also called the Kullback-Leibler number.Our last example is the χ2-divergence, defined by f(x) = (x− 1)2:

χ2(µ, ν) = supA=Aj

∑j

(µ(Aj)− ν(Aj))2

ν(Aj)

=∫p2(x)q(x)

λ(dx)− 1.

Next, we highlight the connection between F -errors and f -divergences.Let µ0 and µ1 denote the conditional distributions of X given Y = 0 andY = 1. Assume that the class probabilities are equal: p = 1/2. If F is aconcave function, then the F -error dF (X,Y ) may be written as

dF (X,Y ) = F

(12

)−Df (µ0, µ1),

where

f(x) = −12F

(x

1 + x

)(1 + x) + F

(12

),

Page 47: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 46

and Df is the corresponding f -divergence. It is easy to see that f is convex,whenever F is concave. A special case of this correspondence is

L∗ =12

(1− 1

2V (µ0, µ1)

),

if p = 1/2. Also, it is easy to verify, that

ρ =√p(1− p) I2(µ0, µ1),

where ρ is the Matushita error. For further connections, we refer the readerto the exercises.

Problems and Exercises

Problem 3.1. Show that for every (l1, l∗) with 0 ≤ l∗ ≤ l1 ≤ 2l∗(1− l∗) ≤ 1/2,

there exists a distribution of (X,Y ) with LNN = l1 and L∗ = l∗. Therefore, theCover-Hart inequalities are not universally improvable.

Problem 3.2. Tightness of the bounds. Theorem 3.1 cannot be improved.(1) Show that for all α ∈ [0, 1/2], there exists a distribution of (X,Y ) such

that LNN = L∗ = α.(2) Show that for all α ∈ [0, 1/2], there exists a distribution of (X,Y ) such

that LNN = α,L∗ = 12− 1

2

√1− 2α.

(3) Show that for all α ∈ [0, 1/2], there exists a distribution of (X,Y ) suchthat L∗ = ρ = α.

(4) Show that for all α ∈ [0, 1/2], there exists a distribution of (X,Y ) such

that LNN = α,L∗ = 12− 1

2

√1− 4ρ2.

Problem 3.3. Show that E ≥ L∗.

Problem 3.4. For any α ≤ 1, find a sequence of distributions of (Xn, Yn) havingexpected conditional entropies En and Bayes errors L∗n such that L∗n → 0 asn→∞, and En decreases to zero at the same rate as (L∗n)α.

Problem 3.5. Concavity of error measures. Let Y denote the mixturerandom variable taking the value Y1 with probability p and the value Y2 withprobability 1−p. Let X be a fixed Rd-valued random variable, and define η1(x) =PY1 = 1|X = x, η2(x) = PY2 = 1|X = x, where Y1, Y2 are Bernoulli randomvariables. Clearly, η(x) = pη1(x) + (1 − p)η2(x). Which of the error measuresL∗, ρ, LNN, E are concave in p for fixed joint distribution of X,Y1, Y2? Can everydiscrimination problem (X,Y ) be decomposed this way for some Y1, p, Y2, whereη1(x), η2(x) ∈ 0, 1 for all x? If not, will the condition η1(x), η2(x) ∈ 0, 1/2, 1for all x do?

Problem 3.6. Show that for every l∗ ∈ [0, 1/2], there exists a distribution of(X,Y ) with L∗ = l∗ and E = H(L∗, 1− L∗). Thus, Fano’s inequality is tight.

Page 48: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 47

Problem 3.7. Toussaint’s inequalities (1974b). Mimic a proof in the textto show that

J ≥√

1− 2LNN log

(1 +

√1− 2LNN

1−√

1− 2LNN

)≥ 2(1− 2LNN).

Problem 3.8. Show that L∗ ≤ e−δC , where δC is Chernoff’s measure of affinitywith parameter α ∈ (0, 1).

Problem 3.9. Prove that L∗ = p − E (2η(X)− 1)+, where p = PY = 1,and (x)+ = max(x, 0).

Problem 3.10. Show that J ≥ −2 log ρ − 2H(p, 1 − p), where p = PY = 1(Toussaint (1974b)).

Problem 3.11. Let f1 and f0 be two multivariate normal densities with meansm0,m1 and common covariance matrix Σ. If PY = 1 = p, and f1, f0 are theconditional densities of X given Y = 1 and Y = 0 respectively, then show that

ρ =√p(1− p) e−∆2/8,

where ρ is the Matushita error and ∆ is the Mahalanobis distance.

Problem 3.12. For every δ ∈ [0,∞) and l∗ ∈ [0, 1/2] with l∗ ≤ 2/(4 + δ2), finddistributions µ0 and µ1 for X given Y = 1, Y = 0, such that the Mahalanobis dis-tance ∆ = δ, yet L∗ = l∗. Therefore, the Mahalanobis distance is not universallyrelated to the Bayes risk.

Problem 3.13. Show that the Mahalanobis distance ∆ is invariant under linearinvertible transformations of X.

Problem 3.14. Lissack and Fu (1976) have suggested the measures

δLF = E |2η(X)− 1|α , 0 < α <∞.

For α = 1, this is twice the Kolmogorov distance δKO. Show the following:(1) If 0 < α ≤ 1, then 1

2(1− δLF) ≤ L∗ ≤ (1− δ

1/αLF ).

(2) If 1 ≤ α <∞, then 12(1− δ

1/αLF ) ≤ L∗ ≤ (1− δLF).

Problem 3.15. Hashlamoun, Varshney, and Samarasooriya (1994) suggest usingthe F -error with the function

F (x) =1

2sin(πx)e−1.8063(x− 1

2 )2

to obtain tight upper bounds on L∗. Show that F (x) ≥ min(x, 1−x), so that thecorresponding F -error is indeed an upper bound on the Bayes risk.

Problem 3.16. Prove that L∗ ≤ max(p(1− p))(1− 1

2V (µ0, µ1)

).

Problem 3.17. Prove that L∗ ≤√p(1− p)I2(µ0µ1). Hint: min(a, b) ≤

√ab.

Page 49: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 48

Problem 3.18. Assume that the components of X = (X(1), . . . , X(d)) are con-ditionally independent (given Y ), and identically distributed, that is, PX(i) ∈A|Y = j = νj(A) for i = 1, . . . , d and j = 0, 1. Use the previous exercise to showthat

L∗ ≤√p(1− p) (I2(ν0, ν1))

d .

Problem 3.19. Show that χ2(µ1, µ2) ≥ I(µ1, µ2). Hint: x− 1 ≥ log x.

Problem 3.20. Show the following analog of Theorem 3.3. Let t : Rd → Rk bea measurable function, and µ, ν probability measures on Rd. Define the measuresµt and νt on Rk by µt(A) = µ(t−1(A)) and νt(A) = ν(t−1(A)). Show that forany convex function f , Df (µ, ν) ≥ Df (µt, νt).

Problem 3.21. Prove the following connections between the Hellinger integraland the total variation: (

1− 1

2V (µ, ν)

)≥ 1

2(I2(µ, ν))

2 ,

and(V (µ, ν))2 ≤ 8(1− I2(µ, ν)).

Hint: Proceed analogously to Lemmas 3.1 and 3.2.

Problem 3.22. Pinsker’s inequality. Show that

(V (µ, ν))2 ≤ 2I(µ, ν)

(Csiszar (1967), Kullback (1967), and Kemperman (1969)). Hint: First provethe inequality if µ and ν are concentrated on the same two atoms. Then defineA = x : p(x) ≥ q(x), and the measures µ∗, ν∗ on the set 0, 1 by µ∗(0) =1 − µ∗(1) = µ(A) and ν∗(0) = 1 − ν∗(1) = ν(A), and apply the previous result.Conclude by pointing out that Scheffe’s theorem states V (µ∗, ν∗) = V (µ, ν), andthat I(µ∗, ν∗) ≤ I(µ, ν).

Page 50: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 49Printer: Opaque this

4Linear Discrimination

In this chapter, we split the space by a hyperplane and assign a differentclass to each halfspace. Such rules offer tremendous advantages—they areeasy to interpret as each decision is based upon the sign of

∑di=1 aix

(i)+a0,where x = (x(1), . . . , x(d)) and the ai’s are weights. The weight vectordetermines the relative importance of the components. The decision is alsoeasily implemented—in a standard software solution, the time of a decisionis proportional to d—and the prospect that a small chip can be built tomake a virtually instantaneous decision is particularly exciting.

Rosenblatt (1962) realized the tremendous potential of such linear rulesand called them perceptrons. Changing one or more weights as new dataarrive allows us to quickly and easily adapt the weights to new situations.Training or learning patterned after the human brain thus became a reality.This chapter merely looks at some theoretical properties of perceptrons. Webegin with the simple one-dimensional situation, and deal with the choice ofweights in Rd further on. Unless one is terribly lucky, linear discriminationrules cannot provide error probabilities close to the Bayes risk, but thatshould not diminish the value of this chapter. Linear discrimination is atthe heart of nearly every successful pattern recognition method, includingtree classifiers (Chapters 20 and 21), generalized linear classifiers (Chapter17), and neural networks (Chapter 30). We also encounter for the first timerules in which the parameters (weights) are dependent upon the data.

Page 51: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.1 Univariate Discrimination and Stoller Splits 50

Figure 4.1. Rosenblatt’s perceptron. The decision is basedupon a linear combination of the components of the inputvector.

4.1 Univariate Discrimination and Stoller Splits

As an introductory example, let X be univariate. The crudest and simplestpossible rule is the linear discrimination rule

g(x) =y′ if x ≤ x′

1− y′ otherwise,

where x′ is a split point and y′ ∈ 0, 1 is a class. In general, x′ and y′

are measurable functions of the data Dn. Within this class of simple rules,there is of course a best possible rule that can be determined if we know thedistribution. Assume for example that (X,Y ) is described in the standardmanner: let PY = 1 = p. Given Y = 1, X has a distribution functionF1(x) = PX ≤ x|Y = 1, and given Y = 0, X has a distribution functionF0(x) = PX ≤ x|Y = 0, where F0 and F1 are the class-conditionaldistribution functions. Then a theoretically optimal rule is determined bythe split point x∗ and class y∗ given by

(x∗, y∗) = arg min(x′,y′)

Pg(X) 6= Y

(the minimum is always reached if we allow the values x′ = ∞ and x′ =−∞). We call the corresponding minimal probability of error L and notethat

L = inf(x′,y′)

Iy′=0 (pF1(x′) + (1− p)(1− F0(x′)))

+ Iy′=1 (p(1− F1(x′)) + (1− p)F0(x′)).

Page 52: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.1 Univariate Discrimination and Stoller Splits 51

A split defined by (x∗, y∗) will be called a theoretical Stoller split (Stoller(1954)).

Lemma 4.1. L ≤ 1/2 with equality if and only if L∗ = 1/2.

Proof. Take (x′, y′) = (−∞, 0). Then the probability of error is 1 − p =PY = 0. Take (x′, y′) = (−∞, 1). Then the probability of error is p.Clearly,

L∗ ≤ L ≤ min(p, 1− p).

This proves the first part of the lemma. For the second part, if L = 1/2,then p = 1/2, and for every x, pF1(x) + (1 − p)(1 − F0(x)) ≥ 1/2 andp(1−F1(x))+(1−p)F0(x) ≥ 1/2. The first inequality implies pF1(x)−(1−p)F0(x) ≥ p−1/2, while the second implies pF1(x)−(1−p)F0(x) ≤ p−1/2.Therefore, L = 1/2 means that for every x, pF1(x)−(1−p)F0(x) = p−1/2.Thus, for all x, F1(x) = F0(x), and therefore L∗ = 1/2. 2

Lemma 4.2.

L =12− sup

x

∣∣∣∣pF1(x)− (1− p)F0(x)− p+12

∣∣∣∣ .In particular, if p = 1/2, then

L =12− 1

2supx|F1(x)− F0(x)| .

Proof. Set ρ(x) = pF1(x)− (1− p)F0(x). Then, by definition,

L = infx

min ρ(x) + 1− p, p− ρ(x)

=12− sup

x

∣∣∣∣ρ(x)− p+12

∣∣∣∣(since mina, b = (a+ b− |a− b|)/2). 2

The last property relates the quality of theoretical Stoller splits to theKolmogorov-Smirnov distance supx |F1(x)−F0(x)| between the class-condi-tional distribution functions. As a fun exercise, consider two classes withmeans m0 = EX|Y = 0, m1 = EX|Y = 1, and variances σ2

0 =VarX|Y = 0 and σ2

1 = VarX|Y = 1. Then the following inequalityholds.Theorem 4.1.

L∗ ≤ L ≤ 1

1 + (m0−m1)2

(σ0+σ1)2

.

Page 53: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.1 Univariate Discrimination and Stoller Splits 52

Remark. When p = 1/2, Chernoff (1971) proved

L ≤ 1

2 + 2 (m0−m1)2

(σ0+σ1)2

.

Moreover, Becker (1968) pointed out that this is the best possible bound(see Problem 4.2). 2

Proof. Assume without loss of generality that m0 < m1. Clearly, L issmaller than the probability of error for the rule that decides 0 when x ≤m0 + ∆0, and 1 otherwise, where m1 −m0 = ∆0 + ∆1, ∆0,∆1 > 0.

Figure 4.2. The split providing thebound of Theorem 4.1.

The probability of error of the latter rule is

pF1(m0 + ∆0) + (1− p)(1− F0(m0 + ∆0))

= pPX ≤ m1 −∆1|Y = 1+ (1− p)PX > m0 + ∆0|Y = 0

≤ pσ2

1

σ21 + ∆2

1

+ (1− p)σ2

0

σ20 + ∆2

0

(by the Chebyshev-Cantelli inequality; see Appendix, Theorem A.17)

=p

1 + ∆21

σ21

+1− p

1 + ∆20

σ20

(take ∆1 = (σ1/σ0)∆0, and ∆0 = |m1 −m0|σ0/(σ0 + σ1) )

=1

1 + (m0−m1)2

(σ0+σ1)2

. 2

We have yet another example of the principle that well-separated classesyield small values for L and thus L∗. Separation is now measured in termsof the largeness of |m1 −m0| with respect to σ0 + σ1. Another inequalityin the same spirit is given in Problem 4.1.

Page 54: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.1 Univariate Discrimination and Stoller Splits 53

The limitations of theoretical Stoller splits are best shown in a simpleexample. Consider a uniform [0, 1] random variable X, and define

Y =

1 if 0 ≤ X ≤ 13 + ε

0 if 13 + ε < X ≤ 2

3 − ε1 if 2

3 − ε ≤ X ≤ 1

for some small ε > 0. As Y is a function of X, we have L∗ = 0. If we areforced to make a trivial X-independent decision, then the best we can dois to set g(x) ≡ 1. The probability of error is P1/3 + ε < X < 2/3− ε =1/3 − 2ε. Consider next a theoretical Stoller split. One sees quickly thatthe best split occurs at x′ = 0 or x′ = 1, and thus that L = 1/3 − 2ε. Inother words, even the best theoretical split is superfluous. Note also thatin the above example, m0 = m1 = 1/2 so that the inequality of Theorem4.1 says L ≤ 1—it degenerates.

We now consider what to do when a split must be data-based. Stoller(1954) suggests taking (x′, y′) such that the empirical error is minimal. Hefinds (x′, y′) such that

(x′, y′) = arg min(x,y)∈R×0,1

1n

n∑i=1

(IXi≤x,Yi 6=y + IXi>x,Yi 6=1−y

).

(x′ and y′ are now random variables, but in spite of our convention, wekeep the lowercase notation for now.) We will call this Stoller’s rule. Thesplit is referred to as an empirical Stoller split. Denote the set (−∞, x]×y ∪ (x,∞)× 1− y by C(x, y). Then

(x′, y′) = arg min(x,y)

νn(C(x, y)),

where νn is the empirical measure for the dataDn = (X1, Y1), . . . , (Xn, Yn),that is, for every measurable setA ∈ R×0, 1, νn(A) = (1/n)

∑ni=1 I(Xi,Yi)∈A.

Denoting the measure of (X,Y ) inR×0, 1 by ν, it is clear that Eνn(C) =ν(C) = PX ≤ x, Y 6= y + PX > x, Y 6= 1 − y. Let Ln = Pgn(X) 6=Y |Dn be the error probability of the splitting rule gn with the data-dependent choice (x′, y′) given above, conditioned on the data. Then

Ln = ν(C(x′, y′))

= ν(C(x′, y′))− νn(C(x′, y′)) + νn(C(x′, y′))

≤ sup(x,y)

(ν(C(x, y))− νn(C(x, y))) + νn(C(x∗, y∗))

(where (x∗, y∗) minimizes ν(C(x, y)) over all (x, y))

≤ 2 sup(x,y)

|ν(C(x, y))− νn(C(x, y))|+ ν(C(x∗, y∗))

= 2 sup(x,y)

|ν(C(x, y))− νn(C(x, y))|+ L.

Page 55: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.2 Linear Discriminants 54

From the next theorem we see that the supremum above is small even formoderately large n, and therefore, Stoller’s rule performs closely to the bestsplit regardless of the distribution of (X,Y ).

Theorem 4.2. For Stoller’s rule, and ε > 0,

PLn − L ≥ ε ≤ 4e−nε2/2,

and

ELn − L ≤√

2 log(4e)n

.

Proof. By the inequality given just above the theorem,

PLn − L ≥ ε ≤ P

sup(x,y)

|ν(C(x, y))− νn(C(x, y))| ≥ ε

2

≤ P

supx|ν(C(x, 0))− νn(C(x, 0))| ≥ ε

2

+P

supx|ν(C(x, 1))− νn(C(x, 1))| ≥ ε

2

≤ 4e−2n(ε/2)2

by a double application of Massart’s (1990) tightened version of the Dvoretzky-Kiefer-Wolfowitz inequality (1956) (Theorem 12.9). See Problem 4.5. We donot prove this inequality here, but we will thoroughly discuss several suchinequalities in Chapter 12 in a greater generality. The second inequalityfollows from the first via Problem 12.1. 2

The probability of error of Stoller’s rule is uniformly close to L over allpossible distributions. This is just a preview of things to come, as we maybe able to obtain good performance guarantees within a limited class ofrules.

4.2 Linear Discriminants

Rosenblatt’s perceptron (Rosenblatt (1962); see Nilsson (1965) for a gooddiscussion) is based upon a dichotomy ofRd into two parts by a hyperplane.The linear discrimination rule with weights a0, a1, . . . , ad is given by

g(x) =

1 ifd∑i=1

aix(i) + a0 > 0

0 otherwise,

Page 56: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.2 Linear Discriminants 55

where x = (x(1), . . . , x(d)).

Figure 4.3. A linear discriminant in R2

that correctly classifies all but four datapoints.

Its probability of error is for now denoted by L(a, a0), where a = (a1, . . . , ad).Again, we set

L = infa∈Rd,a0∈R

L(a, a0)

for the best possible probability of error within this class. Let the class-conditional distribution functions of a1X

(1) + · · · + adX(d) be denoted by

F0,a and F1,a, depending upon whether Y = 0 or Y = 1. For L(a, a0), wemay use the bounds of Lemma 4.2, and apply them to F0,a and F1,a. Thus,

L =12− sup

asupx

∣∣∣∣pF1,a(x)− (1− p)F0,a(x)− p+12

∣∣∣∣ ,which, for p = 1/2, reduces to

L =12− 1

2supa

supx|F1,a(x)− F0,a(x)| .

Therefore, L = 1/2 if and only if p = 1/2 and for all a, F1,a ≡ F0,a. Thenapply the following simple lemma.

Lemma 4.3. (Cramer and Wold (1936)). X1 and X2, random vari-ables taking values in Rd, are identically distributed if and only if aTX1

and aTX2 have the same distribution for all vectors a ∈ Rd.

Proof. Two random variables have identical distributions if and only ifthey have the same characteristic function—see, for example, Lukacs andLaha (1964). Now, the characteristic function of X1 = (X(1)

1 , . . . , X(d)1 ) is

ψ1(a) = ψ1(a1, . . . , ad)

= Eei(a1X

(1)1 +···+adX

(d)1 )

= Eei(a1X

(1)2 +···+adX

(d)2 )

(by assumption)

= ψ2(a1, . . . , ad),

Page 57: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.3 The Fisher Linear Discriminant 56

the characteristic function of X2. 2

Thus, we have proved the following:

Theorem 4.3. L ≤ 1/2 with equality if and only if L∗ = 1/2.

Thus, as in the one-dimensional case, whenever L∗ < 1/2, a meaningful(L < 1/2) cut by a hyperplane is possible. There are also examples inwhich no cut improves over a rule in which g(x) ≡ y for some y and allx, yet L∗ = 0 and L > 1/4 (say). To generalize Theorem 4.1, we offer thefollowing result. A related inequality is shown in Problem 4.7. The idea ofusing Chebyshev’s inequality to obtain such bounds is due to Yau and Lin(1968) (see also Devijver and Kittler (1982, p.162)).

Theorem 4.4. Let X0 and X1 be random variables distributed as X givenY = 0, and Y = 1 respectively. Set m0 = EX0, m1 = EX1. Definealso the covariance matrices Σ1 = E

(X1 −m1)(X1 −m1)T

and Σ0 =

E(X0 −m0)(X0 −m0)T

. Then

L∗ ≤ L ≤ infa∈Rd

1

1 + (aT (m1−m0))2

((aT Σ0a)1/2+(aT Σ1a)1/2)2

.

Proof. For any a ∈ Rd we may apply Theorem 4.1 to aTX0 and aTX1.Theorem 4.4 follows by noting that

EaTX0

= aTEX0 = aTm0,

EaTX1

= aTm1,

and that

VaraTX0

= E

aT (X0 −m0)(X0 −m0)Ta

= aTΣ0a,

VaraTX1

= aTΣ1a. 2

We may obtain explicit inequalities by different choices of a. a = m1−m0

yields a convenient formula. We see from the next section that a = Σ(m1−m0) with Σ = pΣ1 +(1−p)Σ0 is also a meaningful choice (see also Problem4.7).

4.3 The Fisher Linear Discriminant

Data-based values for a may be found by various criteria. One of the firstmethods was suggested by Fisher (1936). Let m1 and m0 be the sample

Page 58: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.4 The Normal Distribution 57

means for the two classes (e.g., m1 =∑i:Yi=1Xi/|i : Yi = 1|.) Picture

projecting X1, . . . , Xn to a line in the direction of a. Note that this isperpendicular to the hyperplane given by aTx + a0 = 0. The projectedvalues are aTX1, . . . , a

TXn. These are all equal to 0 for those Xi on thehyperplane aTx = 0 through the origin, and grow in absolute value as weflee that hyperplane. Let σ2

1 and σ20 be the sample scatters for classes 1 and

0, respectively, that is,

σ21 =

∑i:Yi=1

(aTXi − aT m1

)2= aTS1a

and similarly for σ20 , where

S1 =∑i:Yi=1

(Xi − m1)(Xi − m1)T

is the scatter matrix for class 1.The Fisher linear discriminant is that linear function aTx for which the

criterion

J(a) =

(aT m1 − aT m0

)2σ2

1 + σ20

=

(aT (m1 − m0)

)2aT (S1 + S0)a

is maximum. This corresponds to finding a direction a that best separatesaT m1 from aT m0 relative to the sample scatter. Luckily, to find that a, weneed not resort to numerical iteration—the solution is given by

a = (S1 + S0)−1(m1 − m0).

Fisher’s suggestion is to replace (X1, Y1), . . . , (Xn, Yn) by (aTX1, Y1), . . .,(aTXn, Yn) and to perform one-dimensional discrimination. Usually, therule uses a simple split

ga0(x) =

1 if aTx+ a0 > 00 otherwise (4.1)

for some constant a0. Unfortunately, Fisher discriminants can be arbitrar-ily bad: there are distributions such that even though the two classes arelinearly separable (i.e., L = 0), the Fisher linear discriminant has an errorprobability close to 1 (see Problem 4.9).

4.4 The Normal Distribution

There are a few situations in which, by sheer accident, the Bayes ruleis a linear discriminant. While this is not a major issue, it is interesting

Page 59: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.4 The Normal Distribution 58

to identify the most important case, i.e., that of the multivariate normaldistribution. The general multivariate normal density is written as

f(x) =1√

(2π)d det(Σ)e−

12 (x−m)T Σ−1(x−m),

where m is the mean (both x and m are d-component column vectors),Σ is the d × d covariance matrix, Σ−1 is the inverse of Σ, and det(Σ) isits determinant. We write f ∼ N(m,Σ). Clearly, if X has density f , thenm = EX and Σ = E(X −m)(X −m)T .

The multivariate normal density is completely specified by d+(d2

)formal

parameters (m and Σ). A sample from the density is clustered in an ellip-tical cloud. The loci of points of constant density are ellipsoids describedby

(x−m)TΣ−1(x−m) = r2

for some constant r ≥ 0. The number r is the Mahalanobis distance fromx to m, and is in fact useful even when the underlying distribution is notnormal. It takes into account the directional stretch of the space determinedby Σ.

Figure 4.4. Points at equal Mahala-nobis distance from m.

Given a two-class problem in which X has a density (1−p)f0(x)+pf1(x)and f0 and f1 are both multivariate normal with parameters mi,Σi, i =0, 1, the Bayes rule is easily described by

g∗(x) =

1 if pf1(x) > (1− p)f0(x)0 otherwise.

Take logarithms and note that g∗(x) = 1 if and only if

(x−m1)TΣ−11 (x−m1)− 2 log p+ log(det(Σ1))

< (x−m0)TΣ−10 (x−m0)− 2 log(1− p) + log(det(Σ0)).

In practice, one might wish to estimate m1,m0,Σ1,Σ0 and p from thedata and use these estimates in the formula for g∗. Interestingly, as (x −mi)TΣ−1

i (x − mi) is the squared Mahalanobis distance from x to mi inclass i (called r2i ), the Bayes rule is simply

g∗(x) =

1 if r21 < r20 − 2 log((1− p)/p) + log(det(Σ0)/det(Σ1))0 otherwise.

Page 60: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 59

In particular, when p = 1/2, Σ0 = Σ1 = Σ, we have

g∗(x) =

1 if r21 < r200 otherwise;

just classify according to the class whose mean is at the nearest Maha-lanobis distance from x. When Σ0 = Σ1 = Σ, the Bayes rule becomeslinear:

g∗(x) =

1 if aTx+ a0 > 00 otherwise,

where a = (m1 − m0)Σ−1, and a0 = 2 log(p/(1 − p)) + mT0 Σ−1m0 −

mT1 Σ−1m1. Thus, linear discrimination rules occur as special cases of Bayes

rules for multivariate normal distributions.Our intuition that a should be in the direction m1−m0 to best separate

the classes is almost right. Note nevertheless that a is not perpendicularin general to the hyperplane of loci at equal distance from m0 and m1.When Σ is replaced by the standard data-based estimate, we obtain infact the Fisher linear discriminant. Furthermore, when Σ1 6= Σ0, the deci-sion boundary is usually not linear, and Fisher’s linear discriminant musttherefore be suboptimal.

Historical remarks. In early statistical work on discrimination, thenormal distribution plays a central role (Anderson (1958)). For a simpleintroduction, we refer to Duda and Hart (1973). McLachlan (1992) hasmore details, and Raudys (1972; 1976) relates the error, dimensionalityand sample size for normal and nearly normal models. See also Raudysand Pikelis (1980; 1982). 2

4.5 Empirical Risk Minimization

In this section we present an algorithm that yields a classifier whose er-ror probability is very close to the minimal error probability L achievableby linear classifiers, provided that X has a density. The algorithm selectsa classifier by minimizing the empirical error over finitely many—2

(nd

)—

linear classifiers. For a rule

φ(x) =

1 if aTx+ a0 > 00 otherwise,

the probability of error is

L(φ) = Pφ(X) 6= Y .

Page 61: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 60

L(φ) may be estimated by the empirical risk

Ln(φ) =1n

n∑i=1

Iφ(Xi) 6=Yi,

that is, the number of errors made by the classifier φ is counted and nor-malized.

Assume that X has a density, and consider d arbitrary data pointsXi1 , Xi2 , . . . , Xid among X1, . . . , Xn, and let aTx + a0 = 0 be a hy-perplane containing these points. Because of the density assumption, the dpoints are in general position with probability one, and this hyperplane isunique. This hyperplane determines two classifiers:

φ1(x) =

1 if aTx+ a0 > 00 otherwise,

and

φ2(x) =

1 if aTx+ a0 < 00 otherwise,

whose empirical errors Ln(φ1) and Ln(φ2) may be calculated. To eachd-tuple Xi1 , Xi2 , . . . , Xid of data points, we may assign two classifiers inthis manner, yielding altogether 2

(nd

)classifiers. Denote these classifiers by

φ1, . . . , φ2(nd). Let φ be a linear classifier that minimizes Ln(φi) over all

i = 1, . . . , 2(nd

).

We denote the best possible error probability by

L = infφL(φ)

over the class of all linear rules, and define φ∗ = arg minφ L(φ) as the bestlinear rule. If there are several classifiers with L(φ) = L, then we choose φ∗

among these in an arbitrary fixed manner. Next we show that the classifiercorresponding to φ is really very good.

Figure 4.5. If the data points are ingeneral position, then for each linear rulethere exists a linear split defined by a hy-perplane crossing d points such that thedifference between the empirical errors isat most d/n.

First note that there is no linear classifier φ whose empirical error Ln(φ)is smaller than L(φ)− d/n. This follows from the fact that since the data

Page 62: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 61

points are in general position (recall the density assumption), then for eachlinear classifier we may find one whose defining hyperplane contains exactlyd data points such that the two decisions agree on all data points exceptpossibly for these d points—see Figure 4.5. Thus, we may view minimizationof the empirical error over the finite set φ1, . . . , φ2(n

d) as approximateminimization over the infinite set of linear classifiers. In Chapters 12 and13 we will develop the full theory for rules that are found by empiricalrisk minimization. Theorem 4.5 just gives you a taste of things to come.Other—more involved, but also more general—proofs go back to Vapnikand Chervonenkis (1971; 1974c).

Theorem 4.5. Assume that X has a density. If φ is found by empiricalerror minimization as described above, then, for all possible distributionsof (X,Y ), if n ≥ d and 2d/n ≤ ε ≤ 1, we have

PL(φ) > L+ ε

≤ e2dε

(2(n

d

)+ 1)e−nε

2/2.

Moreover, if n ≥ d, then

EL(φ)− L

≤√

2n

((d+ 1) log n+ (2d+ 2)).

Remark. With some care Theorem 4.5 and Theorem 4.6 below can beextended so that the density assumption may be dropped. One needs toensure that the selected linear rule has empirical error close to that of thebest possible linear rule. With the classifier suggested above this propertymay fail to hold if the data points are not necessarily of general position.The ideas presented here are generalized in Chapter 12 (see Theorem 12.2).2

Proof. We begin with the following simple inequality:

L(φ)− L = L(φ)− Ln(φ) + Ln(φ)− L(φ∗)

≤ L(φ)− Ln(φ) + Ln(φ∗)− L(φ∗) +d

n

(since Ln(φ) ≤ Ln(φ) + d/n for any φ)

≤ maxi=1,...,2(n

d)

(L(φi)− Ln(φi)

)+ Ln(φ∗)− L(φ∗) +

d

n.

Therefore, by the union-of-events bound, we have

PL(φ)− L > ε

2(nd)∑

i=1

PL(φi)− Ln(φi) >

ε

2

+ P

Ln(φ∗)− L(φ∗) +

d

n>ε

2

.

Page 63: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 62

To bound the second term on the right-hand side, observe that nLn(φ∗)is binomially distributed with parameters n and L(φ∗). By an inequalitydue to Chernoff (1952) and Okamoto (1958) for the tail of the binomialdistribution,

PLn(φ∗)− L(φ∗) >

ε

2− d

n

≤ e−2n( ε

2−dn )2

≤ e2dε · e−nε2/2.

We prove this inequality later (see Theorem 8.1). Next we bound one termof the sum on the right-hand side. Note that by symmetry all 2

(nd

)terms

are equal. Assume that the classifier φ1 is determined by the d-tuple of thefirst d data points X1, . . . , Xd. We write

PL(φ1)− Ln(φ1) >

ε

2

= E

PL(φ1)− Ln(φ1) >

ε

2

∣∣∣X1, . . . , Xd

,

and bound the conditional probability inside. Let (X ′′1 , Y

′′1 ), . . . , (X ′′

d , Y′′d )

be independent of the data and be distributed as the data (X1, Y1), . . .,(Xd, Yd). Define

(X ′i, Y

′i ) =

(X ′′

i , Y′′i ) if i ≤ d

(Xi, Yi) if i > d.

Then

PL(φ1)− Ln(φ1) >

ε

2

∣∣∣X1, . . . , Xd

≤ P

L(φ1)−

1n

n∑i=d+1

Iφ1(Xi) 6=Yi >ε

2

∣∣∣∣∣X1, . . . , Xd

≤ P

L(φ1)−

1n

n∑i=1

Iφ1(X′i) 6=Y ′

i +

d

n>ε

2

∣∣∣∣∣X1, . . . , Xd

= PL(φ1)−

1n

Binomial(n,L(φ1)) >ε

2− d

n

∣∣∣∣X1, . . . , Xd

(as L(φ1) depends upon X1, . . . , Xd only and

(X ′1, Y

′1) . . . , (X ′

d, Y′d) are independent of X1, . . . , Xd)

≤ e−2n( ε2−

dn )2

(by Theorem 8.1; use the fact that ε ≥ 2d/n)

≤ e−nε2/2e2dε.

The inequality for the expected value follows from the probability inequalityby the following simple argument: by the Cauchy-Schwarz inequality,(

EL(φ)− L

)2

≤ E(

L(φ)− L)2.

Page 64: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 63

Denoting Z =(L(φ)− L

)2

, for any u > 0,

E Z = EZ|Z > uPZ > u+ EZ|Z ≤ uPZ ≤ u≤ PZ > u+ u

≤ e2d(2nd + 1

)e−nu/2 + u if u ≥

(2dn

)2

by the probability inequality, and since(nd

)≤ nd. Choosing u to minimize

the obtained expression yields the desired inequality: first verify that theminimum occurs for

u =2n

lognc

2,

where c = e2d(2nd + 1). Check that if n ≥ d, then u ≥ (2d/n)2. Then notethat the bound ce−nu/2 + u equals

2n

lognec

2≤ 2n

log(e2d+2nd+1

)=

2n

((d+ 1) log n+ (2d+ 2)). 2

Observe for now that the bound on PL(φ) > L+ ε

decreases rapidly

with n. To have an impact, it must become less than δ for small δ. Thishappens, roughly speaking, when

n ≥ c · dε2

(log

d

ε2+ log

)for some constant c. Doubling d, the dimension, causes this minimal samplesize to roughly double as well.

An important special case is when the distribution is linearly separable,that is, L = 0. In such cases the empirical risk minimization above per-forms even better as the size of the error improves to O(d log n/n) fromO(√

d log n/n). Clearly, the data points are linearly separable as well,

that is, Ln(φ∗) = 0 with probability one, and therefore Ln(φ) ≤ d/n withprobability one.

Theorem 4.6. Assume that X has a density, and that the best linear clas-sifier has zero probability of error (L = 0). Then for the empirical riskminimization algorithm of Theorem 4.5, for all n > d and ε ≤ 1,

PL(φ) > ε

≤ 2(n

d

)e−(n−d)ε,

andEL(φ)

≤ d log n+ 2

n− d.

Page 65: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.5 Empirical Risk Minimization 64

Proof. By the union bound,

PL(φ) > ε ≤ P

max

i=1,2,...,2(nd):Ln(φi)≤ d

n

L(φi) > ε

≤2(n

d)∑i=1

PLn(φi) ≤

d

n, L(φi) > ε

.

By symmetry, this sum equals

2(n

d

)PLn(φ1) ≤

d

n, L(φ1) > ε

= 2

(n

d

)EPLn(φ1) ≤

d

n, L(φ1) > ε

∣∣∣∣X1, . . . , Xd

,

where, as in Theorem 4.5, φ1 is determined by the data points X1, . . . , Xd.However,

PLn(φ1) ≤

d

n, L(φ1) > ε

∣∣∣∣X1, . . . , Xd

≤ P φ1(Xd+1) = Yd+1, . . . , φ1(Xn) = Yn, L(φ1) > ε|X1, . . . , Xd

(since all of the (at most d) errors committed by φ1

occur for (X1, Y1), . . . , (Xd, Yd))

≤ (1− ε)n−d,

since the probability that no (Xi, Yi), pair i = d + 1, . . . , n falls in the set(x, y) : φ1(x) 6= y is less than (1 − ε)n−d if the probability of the set islarger than ε. The proof of the probability inequality may be completed bynoting that 1− x ≤ e−x.

For the expected error probability, note that for any u > 0,

EL(φ) =∫ ∞

0

PL(φ) > tdt

≤ u+∫ ∞

u

PL(φ) > tdt

≤ u+ 2nd∫ ∞

u

e−(n−d)tdt

(by the probability inequality and(nd

)≤ nd)

= u+2nd

ne−(n−d)u.

We choose u to minimize the obtained bound, which yields the desiredinequality. 2

Page 66: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.6 Minimizing Other Criteria 65

4.6 Minimizing Other Criteria

Empirical risk minimization uses extensive computations, because Ln(φ)is not a unimodal function in general (see Problems 4.10 and 4.11). Also,gradient optimization is difficult because the gradients are zero almost ev-erywhere. In fact, given n labeled points in Rd, finding the best lineardichotomy is np hard (see Johnson and Preparata (1978)). To aid in theoptimization, some have suggested minimizing a modified empirical error,such as

Ln(φ) =1n

n∑i=1

Ψ((2Yi − 1)− aTXi − a0

)IYi 6=ga(Xi)

or

Ln(φ) =1n

n∑i=1

Ψ((2Yi − 1)− aTXi − a0

),

where Ψ is a positive convex function. Of particular importance here isthe mean square error criterion Ψ(u) = u2 (see, e.g., Widrow and Hoff(1960)). One can easily verify that Ln(φ) has a gradient (with respect to(a, a0)) that may aid in locating a local minimum. Let φ denote the lineardiscrimination rule minimizing

E(

(2Y − 1)− aTX − a0

)2over all a and a0. A description of the solution is given in Problem 4.14.

Even in a one-dimensional situation, the mean square error criterionmuddles the issue and does not give any performance guarantees:

Theorem 4.7. If sup(X,Y ) denotes the supremum with respect to all dis-tributions on R× 0, 1, then

sup(X,Y )

(L(φ)− L

)= 1,

where φ is a linear discriminant obtained by minimizing

E

((2Y − 1)− a1X − a0)2

over all a1 and a0.

Remark. This theorem establishes the existence of distributions of (X,Y )for which L(φ) > 1−ε and L < ε simultaneously for arbitrarily small ε > 0.Therefore, minimizing the mean square error criterion is not recommendedunless one has additional information regarding the distribution of (X,Y ).2

Page 67: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

4.6 Minimizing Other Criteria 66

Proof. Let ε > 0 and θ > 0. Consider a triatomic distribution of (X,Y ):

P(X,Y ) = (−θ, 1) = P(X,Y ) = (1, 1) = ε/2,

P(X,Y ) = (0, 0) = 1− ε.

Figure 4.6. A distribution forwhich squared error minimizationfails.

For ε < 1/2, the best linear rule decides class 0 on [−θ/2,∞) and 1 else-where, for a probability of error L = ε/2. The mean square error criterionasks that we minimize

L(φ) =

(1− ε)(−1− v)2 +ε

2(1− u− v)2 +

ε

2(1 + uθ − v)2

with respect to a0 = v and a1 = u. Setting the derivatives with respect tou and v equal to zero yields

u =(v − 1)θ − v

1 + θ2, and v = 2ε− 1 +

ε

2u(θ − 1),

for

v =(2ε− 1)(1 + θ2)− ε

2θ(θ − 1)1 + θ2 − ε

2 (1− θ)2.

If we let ε ↓ 0 and let θ ↑ ∞, then v ∼ 3ε/2. Thus, for ε small enough andθ large enough, considering the decision at 0 only, L(φ) ≥ 1− ε, because atx = 0, ux+ v = v > 0. Thus, L(φ)− L ≥ 1− 3ε/2 for ε small enough andθ large enough. 2

Others have suggested minimizing

n∑i=1

(σ(aTXi + a0)− Yi

)2,

where σ(u) is a sigmoid, that is, an increasing function from 0 to 1 suchas 1/(1 + e−u), see, for example, Wassel and Sklansky (1972), Do Tu andInstalle (1975), Fritz and Gyorfi (1976), and Sklansky and Wassel (1979).Clearly, σ(u) = Iu≥0 provides the empirical error probability. However,the point here is to use smooth sigmoids so that gradient algorithms maybe used to find the optimum. This may be viewed as a compromise betweenthe mean squared error criteria and empirical error minimization. Here, too,anomalies can occur, and the error space is not well behaved, displayingmany local minima (Hertz, Krogh, and Palmer (1991, p.108)). See, however,Problems 4.16 and 4.17.

Page 68: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 67

Problems and Exercises

Problem 4.1. With the notation of Theorem 4.1, show that the error probabilityL of a one-dimensional theoretical Stoller split satisfies

L ≤ 4p(1− p)

1 + p(1− p) (m0−m1)2

(1−p)σ20+pσ2

1

(Gyorfi and Vajda (1980)). Is this bound better than that of Theorem 4.1? Hint:For any threshold rule gc(x) = Ix≥c and u > 0, write

L(gc) = PX − c ≥ 0, 2Y − 1 = −1+ PX − c < 0, 2Y − 1 = 1

≤ P|u(X − c)− (2Y − 1)| ≥ 1

≤ E(u(X − c)− (2Y − 1))2

by Chebyshev’s inequality. Choose u and c to minimize the upper bound.

Problem 4.2. Let p = 1/2. If L is the error probability of the one-dimensionaltheoretical Stoller split, show that

L ≤ 1

2 + 2 (m0−m1)2

(σ0+σ1)2

.

Show that the bound is achieved for some distribution when the class-conditionaldistributions of X (that is, given Y = 0 and Y = 1) are concentrated on twopoints each, one of which is shared by both classes (Chernoff (1971), Becker(1968)).

Problem 4.3. Let X be a univariate random variable. The distribution func-tions for X given Y = 1 and Y = 0 are F1 and F0 respectively. Assume thatthe moment generating functions for X exist, that is, E

etX |Y = 1

= ψ1(t),

EetX |Y = 0

= ψ0(t), t ∈ R, where ψ1, ψ0 are finite for all t. In the spirit

of Theorem 4.1, derive an upper bound for L in function of ψ1, ψ0. Apply yourbound to the case that F1 and F0 are both normal with possibly different meansand variances.

Problem 4.4. Signals in additive gaussian noise. Let s0, s1 ∈ Rd be fixed,and let N be a multivariate gaussian random variable with zero mean and co-variance matrix Σ. Let PY = 0 = PY = 1 = 1/2, and define

X =

s0 +N if Y = 0s1 +N if Y = 1.

Construct the Bayes decision and calculate L∗. Prove that if Σ is the identitymatrix, and s0 and s1 have constant components, then L∗ → 0 exponentiallyrapidly as d→∞.

Problem 4.5. In the last step of the proof of Theorem 4.2, we used the Dvoretzky-Kiefer-Wolfowitz-Massart inequality (Theorem 12.9). This result states that ifZ1, . . . , Zn are i.i.d. random variables on the real line with distribution function

Page 69: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 68

F (z) = PZ1 ≤ z and empirical distribution function Fn(z) = (1/n)∑n

i=1IZi≤z,

then

P

supz∈R

|F (z)− Fn(z)| ≥ ε

≤ 2e−2nε2 .

Use this inequality to conclude that

P

supx

|ν(C(x, 1))− νn(C(x, 1))| ≥ ε

2

≤ 2e−2n(ε/2)2 .

Hint: Map (X,Y ) on the real line by a one-to-one function ψ : (R×0, 1) →Rsuch that Z = ψ((X,Y )) < 0 if and only if Y = 0. Use the Dvoretzky-Kiefer-Wolfowitz-Massart inequality for Z.

Problem 4.6. Let L be the probability of error for the best sphere rule, that is,for the rule that associates a class with the inside of a sphere Sx,r, and the otherclass with the outside. Here the center x, and radius r are both variable. Showthat L = 1/2 if and only if L∗ = 1/2, and that L ≤ 1/2.

Problem 4.7. With the notation of Theorem 4.4, show that the probability oferror L of the best linear discriminant satisfies

L ≤ 4p(1− p)

1 + p(1− p)∆2,

where

∆ =√

(m1 −m0)TΣ−1(m1 −m0),

is the Mahalanobis distance (Chapter 3) with Σ = pΣ1 + (1− p)Σ0 (Gyorfi andVajda (1980)). Interestingly, the upper bound is just twice the bound of Theorem3.4 for the asymptotic nearest neighbor error. Thus, a large Mahalanobis distancedoes not only imply that the Bayes error is small, but also, small error probabil-ities may be achieved by simple linear classifiers. Hint: Apply the inequality ofProblem 4.1 for the univariate random variable X ′ = aTX a = Σ−1(m1 −m0).

Problem 4.8. If mi and σ2i are the mean and variance of aTX, given that Y = i,

i = 0, 1, where a is a column vector of weights, then show that the criterion

J1(a) =(m1 −m0)

2

σ21 + σ2

0

is minimized for a = (Σ1 + Σ0)−1(M1 −M0), where Mi and Σi are the mean

vector and covariance matrix of X, given Y = i. Also, show that

J2(a) =(m1 −m0)

2

pσ21 + (1− p)σ2

0

is minimized for a = (pΣ1 + (1 − p)Σ0)−1(M1 − M0), where p, 1 − p are the

class probabilities. This exercise shows that if discrimination is attempted in onedimension, we might consider projections aTX where a maximizes the weighteddistance between the projected means.

Page 70: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 69

Problem 4.9. In the Fisher linear discriminant rule (4.1) with free parametera0, show that for any ε > 0, there exists a distribution for (X,Y ), X ∈ R2, withL = 0 and E‖X‖2 < ∞ such that infa0 EL(ga0) > 1/2 − ε. Moreover, if a0

is chosen to minimize the squared error

E(

(2Y − 1)− aTX − a0

)2,

then EL(ga0) > 1− ε.

Problem 4.10. Find a distribution of (X,Y ) with X ∈ R2 such that with prob-

ability at least one half, Ln(φ) is not unimodal with respect to the weight vector(a, a0).

Problem 4.11. The following observation may help in developing a fast algo-rithm to find the best linear classifier in certain cases. Assume that the Bayes ruleis a linear split cutting through the origin, that is, L∗ = L(a∗) for some coefficientvector a∗ ∈ Rd, where L(a) denotes the error probability of the classifier

ga(x) =

1 if

∑d

i=1aix

(i) ≥ 00 otherwise,

and a = (a1, . . . , ad). Show that L(a) is unimodal as a function of a ∈ Rd, andL(a) is monotone increasing along rays pointing from a∗, that is, for any λ ∈ (0, 1)and a ∈ Rd, L(a)− L(λa+ (1− λ)a∗) ≥ 0 (Fritz and Gyorfi (1976)). Hint: Usethe expression

L(a) = 1/2−∫

(η(x)− 1/2) sign

(d∑i=1

aix(i)

)µ(dx)

to show that L(a)−L(λa+(1−λ)a∗) =∫A|η(x)−1/2|µ(dx) for some set A ⊂ Rd.

Problem 4.12. Let a = (a0, a1) and

a = arg mina

E((2Y − 1)− a1X − a0)

2 IYi 6=ga(Xi),

and ga(x) = Ia1x+a0>0. Show that for every ε > 0, there exists a distributionof (X,Y ) on R × 0, 1 such that L(a) − L ≥ 1 − ε, where L(a) is the errorprobability for g

a. Hint: Argue as in the proof of Theorem 4.7. A distribution

with four atoms suffices.

Problem 4.13. Repeat the previous exercise for

a = arg mina

E |(2Y − 1)− a1X − a0| .

Problem 4.14. Let φ∗ denote the linear discrimination rule that minimizes themean square error E

(2Y − 1− aTX − a0)

2

over all a and a0. As this criterion

is quadratic in (a, a0), it is unimodal. One usually approximates φ∗ by φ by

Page 71: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 70

minimizing∑

i(2Yi − 1− aTXi − a0)

2 over all a and a0. Show that the minimalcolumn vector (a, a0) is given by(∑

i

X ′iX

′iT

)−1(∑i

(2Yi − 1)X ′i

),

where X ′i = (Xi, 1) is a (d+ 1)-dimensional column vector.

Problem 4.15. The perceptron criterion is

J =∑

i:2Yi−1 6=sign(aTXi+a0)

|aTXi + a0|.

Find a distribution for which L∗ = 0, L ≤ 1/4, yet lim infn→∞ ELn(φ) ≥ 1/2,where φ is the linear discrimination rule obtained by using the a and a0 thatminimize J .

Problem 4.16. Let σ be a monotone nondecreasing function on R satisfyinglimu→−∞ σ(u) = 0 and limu→∞ σ(u) = 1. For h > 0, define σh(u) = σ(hu).

Consider the linear discrimination rule φ with a and a0 chosen to minimize

n∑i=1

(σh(a

TXi + a0)− Yi)2.

For every fixed h > 0 and 0 < ε < 1, exhibit a distribution with L < ε and

lim infn→∞

ELn(φ) > 1− ε.

On the other hand, show that if h depends on the sample size n such that h→∞as n→∞, then for all distributions, ELn(φ) → L.

Problem 4.17. Given Y = i, let X be normal with mean mi and covariancematrix Σi, i = 0, 1. Consider discrimination based upon the minimization of thecriterion

E(Y − σ(XTAX + wTX + c)

)2with respect to A,w, and c, a d×d matrix, d×1 vector and constant respectively,where σ(u) = 1/(1 + e−u) is the standard sigmoid function. Show that this isminimized for the same A,w, and c that minimize the probability of error

P2Y − 1 6= sign(XTAX + wTX + c)

,

and conclude that in this particular case, the squared error criterion may be usedto obtain a Bayes-optimal classifier (Horne and Hush (1990)).

Page 72: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 71Printer: Opaque this

5Nearest Neighbor Rules

5.1 Introduction

Simple rules survive. The k-nearest neighbor rule, since its conception in1951 and 1952 (Fix and Hodges (1951; 1952; 1991a; 1991b)), has thus at-tracted many followers and continues to be studied by many researchers.Formally, we define the k-nn rule by

gn(x) =

1 if∑ni=1 wniIYi=1 >

∑ni=1 wniIYi=0

0 otherwise,

where wni = 1/k if Xi is among the k nearest neighbors of x, and wni = 0elsewhere. Xi is said to be the k-th nearest neighbor of x if the distance‖x − Xi‖ is the k-th smallest among ‖x − X1‖, . . . , ‖x − Xn‖. In case ofa distance tie, the candidate with the smaller index is said to be closer tox. The decision is based upon a majority vote. It is convenient to let k beodd, to avoid voting ties. Several issues are worth considering:

(A) Universal consistency. Establish convergence to the Bayes rule if k →∞ and k/n→ 0 as n→∞. This is dealt with in Chapter 11.

(B) Finite k performance. What happens if we hold k fixed and let n tendto infinity?

(C) The choice of the weight vector (wn1, . . . , wnn). Are equal weights forthe k nearest neighbors better than unequal weights in some sense?

Page 73: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.1 Introduction 72

(D) The choice of a distance metric. Achieve invariance with respect to acertain family of transformations.

(E) The reduction of the data size. Can we obtain good performance whenthe data set is edited and/or reduced in size to lessen the storage load?

Figure 5.1. At every point the de-cision is the label of the closest datapoint. The set of points whose nearestneighbor is Xi is called the Voronoicell of Xi. The partition induced bythe Voronoi cells is a Voronoi parti-tion. A Voronoi partition of 15 ran-dom points is shown here.

In the first couple of sections, we will be concerned with convergenceissues for k nearest neighbor rules when k does not change with n. In par-ticular, we will see that for all distributions, the expected error probabilityELn tends to a limit LkNN that is in general close to but larger than L∗.The methodology for obtaining this result is interesting in its own right.The expression for LkNN is then studied, and several key inequalities suchas LNN ≤ 2L∗ (Cover and Hart (1967)) and LkNN ≤ L∗

(1 +

√2/k)

areproved and applied. The other issues mentioned above are dealt with in theremaining sections. For surveys of various aspects of the nearest neighboror related methods, see Dasarathy (1991), Devijver (1980), or Devroye andWagner (1982).

Remark. Computational concerns. Storing the n data pairs in an ar-ray and searching for the k nearest neighbors may take time proportional tonkd if done in a naive manner—the “d” accounts for the cost of one distancecomputation. This complexity may be reduced in terms of one or more ofthe three factors involved. Typically, with k and d fixed, O(n1/d) worst-case time (Papadimitriou and Bentley (1980)) and O(log n) expected time(Friedman, Bentley, and Finkel (1977)) may be achieved. Multidimensionalsearch trees that partition the space and guide the search are invaluable—for this approach, see Fukunaga and Narendra (1975), Friedman, Bentley,and Finkel (1977), Niemann and Goppert (1988), Kim and Park (1986),and Broder (1990). We refer to a survey in Dasarathy (1991) for more refer-ences. Other approaches are described by Yunck (1976), Friedman, Baskett,and Shustek (1975), Vidal (1986), Sethi (1981), and Farago, Linder, andLugosi (1993). Generally, with preprocessing, one may considerably reducethe overall complexity in terms of n and d. 2

Page 74: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.2 Notation and Simple Asymptotics 73

5.2 Notation and Simple Asymptotics

We fix x ∈ Rd, and reorder the data (X1, Y1), . . . , (Xn, Yn) according toincreasing values of ‖Xi − x‖. The reordered data sequence is denoted by

(X(1)(x), Y(1)(x)), . . . , (X(n)(x), Y(n)(x)) or by (X(1), Y(1)), . . . , (X(n), Y(n))

if no confusion is possible. X(k)(x) is the k-th nearest neighbor of x.

Remark. We note here that, rather arbitrarily, we defined neighbors interms of the Euclidean distance ‖x−y‖. Surprisingly, the asymptotic prop-erties derived in this chapter remain valid to a wide variety of metrics—theasymptotic probability of error is independent of the distance measure (seeProblem 5.1). 2

Denote the probability measure for X by µ, and let Sx,ε be the closedball centered at x of radius ε > 0. The collection of all x with µ(Sx,ε) > 0for all ε > 0 is called the support of X or µ. This set plays a key rolebecause of the following property.

Lemma 5.1. If x ∈ support(µ) and limn→∞ k/n = 0, then ‖X(k)(x) −x‖ → 0 with probability one. If X is independent of the data and has prob-ability measure µ, then ‖X(k)(X)−X‖ → 0 with probability one wheneverk/n→ 0.

Proof. Take ε > 0. By definition, x ∈ support(µ) implies that µ(Sx,ε) > 0.Observe that ‖X(k)(x)− x‖ > ε if and only if

1n

n∑i=1

IXi∈Sx,ε <k

n.

By the strong law of large numbers, the left-hand side converges to µ(Sx,ε) >0 with probability one, while, by assumption, the right-hand side tends tozero. Therefore, ‖X(k)(x)− x‖ → 0 with probability one.

The second statement follows from the previous argument as well. Firstnote that by Lemma A.1 in the Appendix, PX ∈ support(µ) = 1, there-fore for every ε > 0,

P‖X(k)(X)−X‖ > ε

= E

IX∈ support(µ)P

‖X(k)(X)−X‖ > ε|X ∈ support(µ)

,

which converges to zero by the dominated convergence theorem, provingconvergence in probability. If k does not change with n, then ‖X(k)(X)−X‖is monotone decreasing for n ≥ k; therefore, it converges with probabilityone as well. If k = kn is allowed to grow with n such that k/n → 0, then

Page 75: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.2 Notation and Simple Asymptotics 74

using the notation X(kn,n)(X) = X(k)(X), we see by a similar argumentthat the sequence of monotone decreasing random variables

supm≥n

‖X(km,m)(X)−X‖ ≥ ‖X(kn,n)(X)−X‖

converges to zero in probability, and therefore, with probability one as well.This completes the proof. 2

Because η is measurable (and thus well-behaved in a general sense) and‖X(k)(x)− x‖ is small, the values η(X(i)(x)) should be close to η(x) for alli small enough. We now introduce a proof method that exploits this fact,and will make subsequent analyses very simple—it suffices to look at datasamples in a new way via embedding. The basic idea is to define an auxiliaryrule g′n(x) in which the Y(i)(x)’s are replaced by k i.i.d. Bernoulli randomvariables with parameter η(x)—locally, the Y(i)(x)’s behave in such a way.It is easy to show that the error probabilities of the two rules are close, andanalyzing the behavior of the auxiliary rule is much more convenient.

To make things more precise, we assume that we are given i.i.d. data pairs(X1, U1), . . . , (Xn, Un), all distributed as (X,U), where X is as before (andhas probability measure µ on the Borel sets of Rd), and U is uniformlydistributed on [0, 1] and independent of X. If we set Yi = IUi≤η(Xi),then (X1, Y1), . . . , (Xn, Yn) are i.i.d. and distributed as the prototype pair(X,Y ). So why bother with the Ui’s? In embedding arguments, we will usethe same Ui’s to construct a second data sequence that is heavily corre-lated (coupled) with the original data sequence, and is more convenient toanalyze. For example, for fixed x ∈ Rd, we may define

Y ′i (x) = IUi≤η(x).

We now have an i.i.d. sequence with i-th vector given by byXi, Yi, Y′i (x), Ui.

Reordering the data sequence according to increasing values of ‖Xi − x‖yields a new sequence with the i-th vector denoted by X(i)(x), Y(i)(x),Y ′(i)(x), U(i)(x). If no confusion is possible, the argument x will be dropped.A rule is called k-local if for n ≥ k, gn is of the form

gn(x) =

1 if ψ(x, Y(1)(x), . . . , Y(k)(x)) > 0,0 otherwise, (5.1)

for some function ψ. For the k-nn rule, we have, for example,

ψ(x, Y(1), . . . , Y(k)) =k∑i=1

Y(i)(x)−k

2.

In other words, gn takes a majority vote over the k nearest neighbors of xand breaks ties in favor of class 0.

Page 76: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.2 Notation and Simple Asymptotics 75

To study gn turns out to be almost equivalent to studying the approxi-mate rule g′n:

g′n(x) =

1 if ψ(x, Y ′(1)(x), . . . , Y′(k)(x)) > 0

0 otherwise.

The latter rule is of no practical value because it requires the knowledge ofη(x). Interestingly however, it is easier to study, as Y ′(1)(x), . . . , Y

′(k)(x) are

i.i.d., whereas Y(1)(x), . . . , Y(k)(x) are not. Note, in particular, the following:

Lemma 5.2. For all x, n ≥ k,

Pψ(x, Y(1)(x), . . . , Y(k)(x)) 6= ψ(x, Y ′(1)(x), . . . , Y

′(k)(x))

k∑i=1

E|η(x)− η(X(i)(x))|

and

Pgn(x) 6= g′n(x) ≤k∑i=1

E|η(x)− η(X(i)(x))|

.

Proof. Both statements follow directly from the observation thatψ(x, Y(1)(x), . . . , Y(k)(x)) 6= ψ(x, Y ′(1)(x), . . . , Y

′(k)(x))

(Y(1)(x), . . . , Y(k)(x)) 6= (Y ′(1)(x), . . . , Y

′(k)(x))

k⋃i=1

η(X(i)(x)) ≤ U(i)(x) ≤ η(x)

k⋃i=1

η(x) ≤ U(i)(x) ≤ η(X(i)(x))

,

and using the union bound and the fact that the U(i)(x)’s are uniform [0, 1].2

We need the following result, in which X is distributed as X1, but inde-pendent of the data sequence:

Lemma 5.3. (Stone (1977)). For any integrable function f , any n,and any k ≤ n,

k∑i=1

E|f(X(i)(X))|

≤ kγdE|f(X)|,

where γd ≤(1 + 2/

√2−

√3)d− 1 depends upon the dimension only.

Page 77: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.3 Proof of Stone’s Lemma 76

The proof of this lemma is beautiful but a bit technical—it is given ina separate section. Here is how it is applied, and why, for fixed k, we maythink of f(X(k)(X)) as f(X) for all practical purposes.

Lemma 5.4. For any integrable function f ,

1k

k∑i=1

E|f(X)− f(X(i)(X))|

→ 0

as n→∞ whenever k/n→ 0.

Proof. Given ε > 0, find a uniformly continuous function g vanishing offa bounded set A, such that E|g(X)−f(X)| < ε (see Theorem A.8 in theAppendix). Then for each ε > 0, there is a δ > 0 such that ‖x − z‖ < δimplies |g(x)− g(z)| < ε. Thus,

1k

k∑i=1

E|f(X)− f(X(i)(X))|

≤ E |f(X)− g(X)|+

1k

k∑i=1

E|g(X)− g(X(i)(X))|

+

1k

k∑i=1

E|g(X(i)(X))− f(X(i)(X))|

≤ (1 + γd)E |f(X)− g(X)|+ ε+ ‖g‖∞P

‖X −X(k)(X)‖ > δ

(by Lemma 5.3, where δ depends on ε only)

≤ (2 + γd)ε+ o(1) (by Lemma 5.1). 2

5.3 Proof of Stone’s Lemma

In this section we prove Lemma 5.3. For θ ∈ (0, π/2), a cone C(x, θ) is thecollection of all y ∈ Rd for which angle(x, y) ≤ θ. Equivalently, in vectornotation, xT y/‖x‖‖y‖ ≥ cos θ. The set z + C(x, θ) is the translation ofC(x, θ) by z.

Page 78: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.3 Proof of Stone’s Lemma 77

Figure 5.2. A cone of angleθ.

If y, z ∈ C(x, π/6), and ‖y‖ < ‖z‖, then ‖y − z‖ < ‖z‖, as we will nowshow. Indeed,

‖y − z‖2 = ‖y‖2 + ‖z‖2 − 2‖y‖‖z‖ yT z

‖y‖‖z‖

≤ ‖y‖2 + ‖z‖2 − 2‖y‖‖z‖ cos(π/3)

= ‖z‖2(

1 +‖y‖2

‖z‖2− ‖y‖‖z‖

)< ‖z‖2 (see Figure 5.3).

Figure 5.3. Thekey geometrical property of cones ofangle < π/2.

The following covering lemma is needed in what follows:

Lemma 5.5. Let θ ∈ (0, π/2) be fixed. Then there exists a set x1, . . . , xγd ⊂

Rd such that

Rd =γd⋃i=1

C(xi, θ).

Page 79: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.3 Proof of Stone’s Lemma 78

Furthermore, it is always possible to take

γd ≤(

1 +1

sin(θ/2)

)d− 1.

For θ = π/6, we have

γd ≤

(1 +

2√2−

√3

)d− 1.

Proof. We assume without loss of generality that ‖xi‖ = 1 for all i. Eachxi is the center of a sphere Si of radius r = 2 sin(θ/2). Si has the propertythat

x : ‖x‖ = 1 ∩ Si = x : ‖x‖ = 1 ∩ C(xi, θ).

Let us only look at xi’s such that ‖xi − xj‖ ≥ r for all j 6= i. In thatcase,

⋃C(xi, θ) covers Rd if and only if

⋃Si covers x : ‖x‖ = 1. Then

the spheres S′i of radius r/2 centered at the xi’s are disjoint and⋃S′i ⊆

S0,1+r/2 − S0,r/2 (see Figure 5.4).

Figure 5.4. Bounding γd.

Thus, if vd = volume(S0,1),

γdvd

(r2

)d≤ vd

(1 +

r

2

)d− vd

(r2

)dor

γd ≤(

1 +2r

)d− 1 =

(1 +

1sin(θ/2)

)d− 1.

The last inequality follows from the fact that

sinπ

12=

√1− cos(π/6)

2=

√2−

√3

4. 2

Page 80: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.3 Proof of Stone’s Lemma 79

Figure 5.5. Covering thespace by cones.

With the preliminary results out of the way, we coverRd by γd cones X+C(xj , π/6), 1 ≤ j ≤ γd, and mark in each cone theXi that is nearest toX, ifsuch anXi exists. IfXi belongs toX+C(xj , π/6) and is not marked, thenXcannot be the nearest neighbor of Xi in X1, . . . , Xi−1, X,Xi+1, . . . , Xn.Similarly, we might mark all k nearest neighbors of X in each cone (if thereare less than k points in a cone, mark all of them). By a similar argument, ifXi ∈ X+C(xj , π/6) is not marked, then X cannot be among the k nearestneighbors of Xi in X1, . . . , Xi−1, X,Xi+1, . . . , Xn. (The order of this setof points is important if distance ties occur with positive probability, andthey are broken by comparing indices.) Therefore, if f is a nonnegativefunction,

k∑i=1

Ef(X(i)(X))

= E

n∑i=1

IXi is among the k nearest neighbors of X in X1,...,Xnf(Xi)

= E

f(X)

n∑i=1

IX is among the k nearest neighbors of Xi in X1,...,Xi−1,X,Xi+1,...,Xn

(by exchanging X and Xi)

≤ E

f(X)

n∑i=1

IXi is marked

≤ kγdEf(X),

as we can mark at most k nodes in each cone, and the number of cones isat most γd—see Lemma 5.5. This concludes the proof of Stone’s lemma. 2

Page 81: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.4 The Asymptotic Probability of Error 80

5.4 The Asymptotic Probability of Error

We return to k-local rules (and in particular, to k-nearest neighbor rules).Let D′

n = ((X1, Y1, U1), . . . , (Xn, Yn, Un)) be the i.i.d. data augmented bythe uniform random variables U1, . . . , Un as described earlier. For a decisiongn based on Dn, we have the probability of error

Ln = Pgn(X) 6= Y |D′n

= Psign

(ψ(X,Y(1)(X), . . . , Y(k)(X))

)6= sign(2Y − 1)|D′

n

,

where ψ is the function whose sign determines gn; see (5.1). Define therandom variables Y ′(1)(X), . . . , Y ′(k)(X) as we did earlier, and set

L′n = P

sign(ψ(X,Y ′(1)(X), . . . , Y ′(k)(X))

)6= sign(2Y − 1)|D′

n

.

By Lemmas 5.2 and 5.4,

E|Ln − L′n|

≤ Pψ(X,Y(1)(X), . . . , Y(k)(X)) 6= ψ(X,Y ′(1)(X), . . . , Y ′(k)(X))

k∑i=1

E|η(X)− η(X(i)(X))|

= o(1).

Because limn→∞(EL′n −ELn) = 0, we need only study the rule g′n

g′n(x) =

1 if ψ(x, Z1, . . . , Zk) > 00 otherwise

(Z1, . . . , Zk are i.i.d. Bernoulli (η(x))) unless we are concerned with thecloseness of Ln to ELn as well.

We now illustrate this important time-saving device on the 1-nearestneighbor rule. Clearly, ψ(x,Z1) = 2Z1 − 1, and therefore

EL′n = PZ1 6= Y = E 2η(X)(1− η(X)) .

We have, without further work:

Theorem 5.1. For the nearest neighbor rule, for any distribution of (X,Y ),

limn→∞

ELn = E 2η(X)(1− η(X)) = LNN.

Under various continuity conditions (X has a density f and both f andη are almost everywhere continuous), this result is due to Cover and Hart

Page 82: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.4 The Asymptotic Probability of Error 81

(1967). In the present generality, it essentially appears in Stone (1977). Seealso Devroye (1981c). Elsewhere (Chapter 3), we show that

L∗ ≤ LNN ≤ 2L∗(1− L∗) ≤ 2L∗.

Hence, the previous result says that the nearest neighbor rule is asymptot-ically at most twice as bad as the Bayes rule—especially for small L∗, thisproperty should be useful.

We formally define the quantity, when k is odd,

LkNN

= E

k∑j=1

(k

j

)ηj(X)(1− η(X))k−j

(η(X)Ij<k/2 + (1− η(X))Ij>k/2

) .

We have the following result:

Theorem 5.2. Let k be odd and fixed. Then, for the k-nn rule,

limn→∞

ELn = LkNN.

Proof. We note that it suffices to show that limn→∞EL′n = LkNN (inthe previously introduced notation). But for every n,

EL′n

= PZ1 + · · ·+ Zk >

k

2, Y = 0

+ P

Z1 + · · ·+ Zk <

k

2, Y = 1

= P

Z1 + · · ·+ Zk >

k

2, Z0 = 0

+ P

Z1 + · · ·+ Zk <

k

2, Z0 = 1

(where Z0, . . . , Zk are i.i.d. Bernoulli (η(X)) random variables),

which leads directly to the sought result. 2

Several representations of LkNN will be useful for later analysis. For ex-ample, we have

LkNN

= Eη(X)P

Binomial(k, η(X)) <

k

2

∣∣∣∣X+E

(1− η(X))P

Binomial(k, η(X)) >

k

2

∣∣∣∣X= E min(η(X), 1− η(X))

+E(

1− 2 min(η(X), 1− η(X)))P

Binomial(k, η(X)) >k

2

∣∣∣∣X .

Page 83: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.5 The Asymptotic Error Probability of Weighted Nearest Neighbor Rules 82

It should be stressed that the limit result in Theorem 5.2 is distribution-free. The limit LkNN depends upon η(X) (or min(η(X), 1 − η(X))) only.The continuity or lack of smoothness of η is immaterial—it only mattersfor the speed with which ELn approaches the limit LkNN.

5.5 The Asymptotic Error Probability ofWeighted Nearest Neighbor Rules

Following Royall (1966), a weighted nearest neighbor rule with weightsw1, . . . , wk makes a decision according to

gn(x) =

1 if

∑ki:Y(i)(x)=1 wi >

∑ki:Y(i)(x)=0 wi

0 otherwise.

In case of a voting tie, this rule is not symmetric. We may modify it so thatgn(x)

def= −1 if we have a voting tie. The “−1” should be considered as anindecision. By previous arguments the asymptotic probability of error is afunction of w1, . . . , wk given by L(w1, . . . , wk) = Eα(η(X)), where

α(p) = P

k∑i=1

wiY′i >

k∑i=1

wi(1− Y ′i )

(1− p)

+P

k∑i=1

wiY′i ≤

k∑i=1

wi(1− Y ′i )

p,

where now Y ′1 , . . . , Y′k are i.i.d. Bernoulli (p). Equivalently, with Zi = 2Y ′i −

1 ∈ −1, 1,

α(p) = (1− p)P

k∑i=1

wiZi > 0

+ pP

k∑i=1

wiZi ≤ 0

.

Assume that P∑k

i=1 wiZi = 0

= 0 for now. Then, if p < 1/2,

α(p) = p+ (1− 2p)P

k∑i=1

wiZi > 0

,

and an antisymmetric expression is valid when p > 1/2. Note next thefollowing. If we let Nl be the number of vectors z = (z1, . . . , zk) ∈ −1, 1k

Page 84: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.5 The Asymptotic Error Probability of Weighted Nearest Neighbor Rules 83

with∑Izi=1 = l and

∑wizi > 0, then Nl +Nk−l =

(kl

). Thus,

P

k∑i=1

wiZi > 0

=k∑l=0

Nlpl(1− p)k−l

=∑l<k/2

(k

l

)pk−l(1− p)l +

∑l<k/2

Nl(pl(1− p)k−l − pk−l(1− p)l

)+

12

(k

k/2

)pk/2(1− p)k/2Ik even

= I + II + III.

Note that I+III does not depend on the vector of weights, and represents

PBinomial(k, 1− p) ≤ k/2 = PBinomial(k, p) ≥ k/2.

Finally, since p ≤ 1/2,

II =∑l<k/2

Nl(pl(1− p)k−l − pk−l(1− p)l

)=

∑l<k/2

Nlpl(1− p)l

((1− p)k−2l − pk−2l

)≥ 0.

This term is zero if and only if Nl = 0 for all l < k/2. In other words,it vanishes if and only if no numerical minority of wi’s can sum to a ma-jority (as in the case (0.7, 0.2, 0.1), where 0.7 alone, a numerical minority,outweighs the others). But such cases are equivalent to ordinary k-nearestneighbor rules if k is odd. When k is even, and we add a tiny weight to onewi, as in (

1 + ε

k,1− ε/(k − 1)

k, . . . ,

1− ε/(k − 1)k

),

for small ε > 0, then no numerical minority can win either, and we havean optimal rule (II = 0). We have thus shown the following:

Theorem 5.3. (Bailey and Jain (1978)). Let L(w1, . . . , wk) be theasymptotic probability of error of the weighted k-nn rule with weights w1, . . .,wk. Let the k-nn rule be defined by (1/k, 1/k, . . . , 1/k) if k is odd, and by

(1/k, 1/k, . . . , 1/k) + ε(1,−1/(k − 1),−1/(k − 1), . . . ,−1/(k − 1))

Page 85: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.5 The Asymptotic Error Probability of Weighted Nearest Neighbor Rules 84

for 0 < ε < 1/k when k is even. Denoting the asymptotic probability oferror by LkNN for the latter rule, we have

L(w1, . . . , wk) ≥ LkNN.

If Pη(X) = 1/2 < 1, then equality occurs if and only if every numericalminority of the wi’s carries less than half of the total weight.

The result states that standard k-nearest neighbor rules are to be pre-ferred in an asymptotic sense. This does not mean that for a particularsample size, one should steer clear of nonuniform weights. In fact, if k isallowed to vary with n, then nonuniform weights are advantageous (Royall(1966)).

Consider the space W of all weight vectors (w1, . . . , wk) with wi ≥ 0,∑ki=1 wi = 1. Is it totally ordered with respect to L(w1, . . . , wk) or not?

To answer this question, we must return to α(p) once again. The weightvector only influences the term II given there. Consider, for example, theweight vectors

(0.3, 0.22, 0.13, 0.12, 0.071, 0.071, 0.071, 0.017)

and (0.26, 0.26, 0.13, 0.12, 0.071, 0.071, 0.071, 0.017).

Numerical minorities are made up of one, two, or three components. Forboth weight vectors, N1 = 0, N2 = 1. However, N3 = 6 + 4 in the formercase, and N3 = 6+2 in the latter. Thus, the “II” term is uniformly smallerover all p < 1/2 in the latter case, and we see that for all distributions, thesecond weight vector is better. When the Nl’s are not strictly nested, sucha universal comparison becomes impossible, as in the example of Problem5.8. Hence, W is only partially ordered.

Unwittingly, we have also shown the following theorem:

Theorem 5.4. For all distributions,

L∗ ≤ · · · ≤ L(2k+1)NN ≤ L(2k−1)NN ≤ · · · ≤ L3NN ≤ LNN ≤ 2L∗.

Proof. It suffices once again to look at α(p). Consider the weight vectorw1 = · · · = w2k+1 = 1 (ignoring normalization) as for the (2k+ 1)-nn rule.The term II is zero, as N0 = N1 = · · · = Nk = 0. However, the (2k−1)-nnrule with vector w1 = · · · = w2k−1 = 1, w2k = w2k+1 = 0, has a nonzeroterm II, because N0 = · · · = Nk−1 = 0, yet Nk =

(2k−1k

)> 0. Hence,

L(2k+1)NN ≤ L(2k−1)NN. 2

Remark. We have strict inequality L(2k+1)NN < L(2k−1)NN whenever Pη(X) /∈0, 1, 1/2 > 0. When L∗ = 0, we have LNN = L3NN = L5NN = · · · = 0 aswell. 2

Page 86: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.6 k-Nearest Neighbor Rules: Even k 85

5.6 k-Nearest Neighbor Rules: Even k

Until now we assumed throughout that k was odd, so that voting ties wereavoided. The tie-breaking procedure we follow for the 2k-nearest neighborrule is as follows:

gn(x) =

1 if

∑2ki=1 Y(i)(x) > k

0 if∑2ki=1 Y(i)(x) < k

Y(1)(x) if∑2ki=1 Y(i)(x) = k.

Formally, this is equivalent to a weighted 2k-nearest neighbor rule withweight vector (3, 2, 2, 2, . . . , 2, 2). It is easy to check from Theorem 5.3 thatthis is the asymptotically best weight vector. Even values do not decreasethe probability of error. In particular, we have the following:

Theorem 5.5. (Devijver (1978)). For all distributions, and all inte-gers k,

L(2k−1)NN = L(2k)NN.

Proof. Recall that LkNN may be written in the form LkNN = Eα(η(X)),where

α(η(x)) = limn→∞

Pg(k)n (X) 6= Y |X = x

is the pointwise asymptotic error probability of the k-nn rule g(k)

n . It isconvenient to consider Z1, . . . , Z2k i.i.d. −1, 1-valued random variableswith PZi = 1 = p = η(x), and to base the decision upon the sign of∑2ki=1 Zi. From the general formula for weighted nearest neighbor rules,

the pointwise asymptotic error probability of the (2k)-nn rule is

limn→∞

Pg(2k)n (X) 6= Y |X = x

= pP

2k∑i=1

Zi < 0

+ pP

2k∑i=1

Zi = 0, Z1 < 0

+(1− p)P

2k∑i=1

Zi > 0

+ (1− p)P

2k∑i=1

Zi = 0, Z1 > 0

= pP

2k∑i=2

Zi < 0

+ (1− p)P

2k∑i=2

Zi > 0

= limn→∞

Pg(2k−1)n (X) 6= Y |X = x

.

Therefore, L(2k)NN = L(2k−1)NN. 2

Page 87: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.7 Inequalities for the Probability of Error 86

5.7 Inequalities for the Probability of Error

We return to the case when k is odd. Recall that

LkNN = E αk(η(X)) ,

where

αk(p) = min(p, 1− p) + |2p− 1|P

Binomial(k,min(p, 1− p)) >k

2

.

Since L∗ = E min(η(X), 1− η(X)), we may exploit this representationto obtain a variety of inequalities on LkNN−L∗. We begin with one that isvery easy to prove but perhaps not the strongest.

Theorem 5.6. For all odd k and all distributions,

LkNN ≤ L∗ +1√ke.

Proof. By the above representation,

LkNN − L∗ ≤ sup0≤p≤1/2

(1− 2p)PB >

k

2

(B is Binomial (k, p))

= sup0≤p≤1/2

(1− 2p)PB − kp

k>

12− p

≤ sup

0≤p≤1/2

(1− 2p)e−2k(1/2−p)2

(by the Okamoto-Hoeffding inequality—Theorem 8.1)

= sup0≤u≤1

ue−ku2/2

=1√ke. 2

Theorem 5.7. (Gyorfi and Gyorfi (1978)). For all distributionsand all odd k,

LkNN ≤ L∗ +

√2LNN

k.

Proof. We note that for p ≤ 1/2, with B binomial (k, p),

PB >

k

2

= P

B − kp > k

(12− p

)≤ E |B − kp|

k(1/2− p)(Markov’s inequality)

Page 88: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.7 Inequalities for the Probability of Error 87

≤√

VarBk(1/2− p)

(Cauchy-Schwarz inequality)

=2√p(1− p)√k(1− 2p)

.

Hence,

LkNN − L∗ ≤ E

2√k

√η(X)(1− η(X))

≤ 2√

k

√Eη(X)(1− η(X)) (Jensen’s inequality)

=2√k

√LNN

2

=

√2LNN

k. 2

Remark. For large k, B is approximately normal (k, p(1 − p)), and thusE|B − kp| ≈

√kp(1− p)

√2/π, as the first absolute moment of a nor-

mal random variable is√

2/π (see Problem 5.11). Working this throughyields an approximate bound of

√LNN/(πk). The bound is proportional to√

LNN. This can be improved to L∗ if, instead of bounding it from aboveby Markov’s inequality, we directly approximate P B − kp > k (1/2− p)as shown below. 2

Theorem 5.8. (Devroye (1981b)). For all distributions and k ≥ 3odd,

LkNN ≤ L∗(

1 +γ√k

(1 +O(k−1/6)

)),

where γ = supr>0 2rPN > r = 0.33994241 . . ., N is normal (0, 1), andO(·) refers to k →∞. (Explicit constants are given in the proof.)

The constant γ in the proof cannot be improved. A slightly weaker boundwas obtained by Devijver (1979):

LkNN ≤ L∗ +1

22k′

(2k′

k′

)LNN (where k′ = dk/2e)

= L∗ + LNN

√2πk′

(1 + o(1)) (as k →∞, see Lemma A.3).

See also Devijver and Kittler (1982, p.102).

Page 89: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.7 Inequalities for the Probability of Error 88

Lemma 5.6. (Devroye (1981b)). For p ≤ 1/2 and with k > 3 odd,

P

Binomial(k, p) >k

2

=

k!(k−12

)!(k−12

)!

∫ p

0

(x(1− x))(k−1)/2dx

≤ A

∫ √k−1

(1−2p)√k−1

e−z2/2dz,

where A ≤ 1√2π

(1 + 2

k + 34k2

).

Proof. Consider k i.i.d. uniform random variables on [0, 1]. The numberof values in [0, p] is binomial (k, p). The number exceeds k/2 if and onlyif the (k + 1)/2-th order statistic of the uniform cloud is at most p. Thelatter is beta ((k+1)/2, (k+1)/2) distributed, explaining the first equality(Problem 5.32). Note that we have written a discrete sum as an integral—insome cases, such tricks pay off handsome rewards. To show the inequality,replace x by 1

2

(1− z√

k−1

)and use the inequality 1− u ≤ e−u to obtain a

bound as shown with

A =1

2k√k − 1

× k!(k−12

)!(k−12

)!.

Finally,

A = PB =

k + 12

k + 1

2√k − 3

(B is binomial (k, 1/2))

≤√

k

2π k+12

k−12

k + 12√k − 1

(Problem 5.17)

=1√2π

√k(k + 1)k − 1

≤ 1√2π

(1 +

2k

+3

4k2

)(Problem 5.18). 2

Proof of Theorem 5.8. From earlier remarks,

LkNN − L∗ = E αk(η(X))−min(η(X), 1− η(X))

= E(

αk(η(X))min(η(X), 1− η(X))

− 1)

min(η(X), 1− η(X))

(sup

0<p<1/2

1− 2pp

PB >

k

2

)L∗ (B is binomial (k, p)).

Page 90: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.8 Behavior When L∗ Is Small 89

We merely bound the factor in brackets. Clearly, by Lemma 5.6,

LkNN − L∗ ≤ L∗

(sup

0<p<1/2

1− 2pp

A

∫ √k−1

(1−2p)√k−1

e−z2/2dz

).

Take a < 1 as the solution of(3/(ea2)

)3/2 1√2π(k−1)

= γ, which is possible

if k − 1 > 12π

1γ2

(3e

)6 = 2.4886858 . . .. Setting v = (1− 2p)√k − 1, we have

sup0<p<1/2

1− 2pp

∫ √k−1

(1−2p)√k−1

e−z2/2

√2π

dz

≤ max

(sup

0<v≤a√k−1

2v/√k − 1

1− v/√k − 1

∫ ∞

v

e−z2/2

√2π

dz ,

supa√k−1≤v<

√k−1

2p√k − 1

1− v/√k − 1

e−v2/2

√2π

)

≤ max(

γ

(1− a)√k − 1

,

√k − 1√2π

e−a2(k−1)/2

)

≤ max

(1− a)√k − 1

,

(3ea2

)3/2 1(k − 1)

√2π

)(use u3/2e−cu ≤ (3/(2ce))3/2 for all u > 0)

(1− a)√k − 1

.

Collect all bounds and note that a = O(k−1/6

). 2

5.8 Behavior When L∗ Is Small

In this section, we look more closely at LkNN when L∗ is small. Recallingthat LkNN = Eαk(η(X)) with

αk(p) = min(p, 1−p)+|1−2 min(p, 1−p)|P

Binomial(k,min(p, 1− p)) >k

2

for odd k, it is easily seen that LkNN = Eξk(min(η(X), 1 − η(X))) forsome function ξk. Because

min(p, 1− p) =1−

√1− 4p(1− p)

2,

Page 91: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.8 Behavior When L∗ Is Small 90

we also have LkNN = Eψk(η(X)(1− η(X))) for some other function ψk.Worked-out forms of LkNN include

LkNN = E

∑j<k/2

(k

j

)η(X)j+1(1− η(X))k−j

+∑j>k/2

(k

j

)η(X)j(1− η(X))k−j+1

=

∑j<k/2

(k

j

)E(η(X)(1− η(X)))j+1

((1− η(X))k−2j−1 + η(X)k−2j−1

).

As pa + (1− p)a is a function of p(1− p) for integer a, this may be furtherreduced to simplified forms such as

LNN = E2η(X)(1− η(X)),

L3NN = Eη(X)(1− η(X))+ 4E

(η(X)(1− η(X)))2,

L5NN = Eη(X)(1− η(X))+ E

(η(X)(1− η(X)))2

+12E

(η(X)(1− η(X)))3.

The behavior of αk near zero is very informative. As p ↓ 0, we have

α1(p) = 2p(1− p) ∼ 2p,

α3(p) = p(1− p)(1 + 4p) ∼ p+ 3p2,

α5(p) ∼ p+ 10p3,

while for the Bayes error, L∗ = Emin(η(X), 1− η(X)) = Eα∞(η(X)),where α∞ = min(p, 1 − p) ∼ p as p ↓ 0. Assume that η(x) = p at all x.Then, as p ↓ 0,

LNN ∼ 2L∗ and L3NN ∼ L∗.

Moreover, LNN −L∗ ∼ L∗, L3NN −L∗ ∼ 3L∗2. Assume that L∗ = p = 0.01.Then L1 ≈ 0.02, whereas L3NN − L∗ ≈ 0.0003. For all practical purposes,the 3-nn rule is virtually perfect. For this reason, the 3-nn rule is highlyrecommended. Little is gained by considering the 5-nn rule when p is small,as L5NN − L∗ ≈ 0.00001.

Let ak be the smallest number such that αk(p) ≤ ak min(p, 1− p) for allp (the tangents in Figure 5.6). Then

LkNN = Eαk(η(X)) ≤ akEmin(η(X), (1− η(X))= akL

∗.

Page 92: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

5.9 Nearest Neighbor Rules When L∗ = 0 91

This is precisely at the basis of the inequalities of Theorems 5.6 through5.8, where it was shown that ak = 1 +O(1/

√k).

Figure 5.6. αk(p) as a func-tion of p.

5.9 Nearest Neighbor Rules When L∗ = 0

From Theorem 5.4 we retain that if $L^* = 0$, then $L_{k\mathrm{NN}} = 0$ for all $k$. In fact, then, for every fixed $k$, the $k$-nearest neighbor rule is consistent. Cover has a beautiful example to illustrate this remarkable fact. $L^* = 0$ implies that $\eta(x) \in \{0, 1\}$ for all $x$, and thus, the classes are separated. This does not imply that the support of $X$ given $Y = 0$ is different from the support of $X$ given $Y = 1$. Take for example a random rational number from $[0,1]$ (e.g., generate $I, J$ independently and at random from the geometric distribution on $\{1, 2, 3, \ldots\}$, and set $X = \min(I,J)/\max(I,J)$). Every rational number on $[0,1]$ has positive probability. Given $Y = 1$, $X$ is as above, and given $Y = 0$, $X$ is uniform on $[0,1]$. Let $\mathbf{P}\{Y = 1\} = \mathbf{P}\{Y = 0\} = 1/2$. The support of $X$ is identical in both cases. As
$$\eta(x) = \begin{cases} 1 & \text{if } x \text{ is rational} \\ 0 & \text{if } x \text{ is irrational,} \end{cases}$$
we see that $L^* = 0$ and that the nearest neighbor rule is consistent. If someone shows us a number $X$ drawn from the same distribution as the data, then we may decide the rationality of $X$ merely by looking at the rationality of the nearest neighbor of $X$. Although we did not show this, the same is true if we are given any $x \in [0,1]$:
$$\lim_{n\to\infty} \mathbf{P}\{x \text{ is rational},\ Y_{(1)}(x) = 0\ (\text{i.e., } X_{(1)}(x) \text{ is not rational})\}$$
$$= \lim_{n\to\infty} \mathbf{P}\{x \text{ is not rational},\ Y_{(1)}(x) = 1\ (\text{i.e., } X_{(1)}(x) \text{ is rational})\} = 0$$
(see Problem 5.38).
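A small Monte Carlo illustration of Cover's example is given below. It is not from the text: the geometric parameter 1/2 and the sample sizes are our own choices, and floating-point numbers only approximate the rational/irrational dichotomy, but the simulated error of the 1-nearest neighbor rule is seen to decrease toward $L^* = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Y = 1: X = min(I,J)/max(I,J) with I, J geometric(1/2); Y = 0: X uniform on [0,1]
    y = rng.integers(0, 2, m)
    i, j = rng.geometric(0.5, m), rng.geometric(0.5, m)
    rational = np.minimum(i, j) / np.maximum(i, j)
    return np.where(y == 1, rational, rng.uniform(0, 1, m)), y

for n in (100, 1000, 10000):
    trials, errors = 500, 0
    for _ in range(trials):
        X, Y = sample(n)
        x, y = sample(1)
        errors += Y[np.argmin(np.abs(X - x[0]))] != y[0]   # 1-nearest neighbor decision
    print(n, errors / trials)    # error probability decreases toward L* = 0 as n grows
```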


5.10 Admissibility of the Nearest Neighbor Rule

The consistency theorems of Chapter 11 show us that we should take $k = k_n \to \infty$ in the $k$-nn rule. The decreasing nature of $L_{k\mathrm{NN}}$ corroborates this. Yet, there exist distributions for which, for all $n$, the 1-nn rule is better than the $k$-nn rule for any $k \ge 3$. This observation, due to Cover and Hart (1967), rests on the following class of examples. Let $S_0$ and $S_1$ be two spheres of radius 1 centered at $a$ and $b$, where $\|a - b\| > 2$. Given $Y = 1$, $X$ is uniform on $S_1$, while given $Y = 0$, $X$ is uniform on $S_0$, whereas $\mathbf{P}\{Y = 1\} = \mathbf{P}\{Y = 0\} = 1/2$. We note that given $n$ observations, with the 1-nn rule,
$$\mathbf{E}L_n = \mathbf{P}\{Y = 0, Y_1 = \cdots = Y_n = 1\} + \mathbf{P}\{Y = 1, Y_1 = \cdots = Y_n = 0\} = \frac{1}{2^n}.$$
For the $k$-nn rule, $k$ being odd, we have
$$\mathbf{E}L_n = \mathbf{P}\Bigg\{Y = 0, \sum_{i=1}^n I_{\{Y_i = 0\}} \le \lfloor k/2 \rfloor\Bigg\} + \mathbf{P}\Bigg\{Y = 1, \sum_{i=1}^n I_{\{Y_i = 1\}} \le \lfloor k/2 \rfloor\Bigg\}$$
$$= \mathbf{P}\{\mathrm{Binomial}(n, 1/2) \le \lfloor k/2 \rfloor\} = \frac{1}{2^n}\sum_{j=0}^{\lfloor k/2 \rfloor}\binom{n}{j} > \frac{1}{2^n}$$
when $k \ge 3$.

Hence, the $k$-nn rule is worse than the 1-nn rule for every $n$ when the distribution is given above. We refer to the exercises regarding some interesting admissibility questions for $k$-nn rules.
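The two error formulas above are simple binomial expressions, so the claimed ordering can be checked numerically. The following sketch (our own illustration, not from the text) compares the 1-nn and $k$-nn values for one choice of $n$.

```python
from math import comb

def err_1nn(n):
    return 1 / 2**n

def err_knn(n, k):
    # P{Binomial(n, 1/2) <= floor(k/2)} for the two-sphere example
    return sum(comb(n, j) for j in range(k // 2 + 1)) / 2**n

n = 20
print(err_1nn(n))                 # about 9.5e-07
for k in (3, 5, 7):
    print(k, err_knn(n, k))       # strictly larger than the 1-nn value for every k >= 3
```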

5.11 The (k, l)-Nearest Neighbor Rule

Hellman (1970) proposed the $(k, l)$-nearest neighbor rule, which is identical to the $k$-nearest neighbor rule, but refuses to make a decision unless at least $l > k/2$ observations are from the same class. Formally, we set
$$g_n(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^k Y_{(i)}(x) \ge l \\ 0 & \text{if } \sum_{i=1}^k Y_{(i)}(x) \le k - l \\ -1 & \text{otherwise (no decision).} \end{cases}$$
Define the pseudoprobability of error by
$$L_n = \mathbf{P}\{g_n(X) \notin \{-1, Y\} \mid D_n\},$$


that is, the probability that we reach a decision and incorrectly classify $X$. Clearly, $L_n \le \mathbf{P}\{g_n(X) \neq Y \mid D_n\}$, our standard probability of error. The latter inequality is only superficially interesting, as the probability of not reaching a decision is not taken into account in $L_n$. We may extend Theorem 5.2 to show the following (Problem 5.35):

Theorem 5.9. For the $(k, l)$-nearest neighbor rule, the pseudoprobability of error $L_n$ satisfies
$$\lim_{n\to\infty}\mathbf{E}L_n = \mathbf{E}\big\{\eta(X)\mathbf{P}\{\mathrm{Binomial}(k,\eta(X)) \le k-l \mid X\} + (1-\eta(X))\mathbf{P}\{\mathrm{Binomial}(k,\eta(X)) \ge l \mid X\}\big\} \stackrel{\mathrm{def}}{=} L_{k,l}.$$

The above result is distribution-free. Note that the $k$-nearest neighbor rule for odd $k$ corresponds to $L_{k,(k+1)/2}$. The limit $L_{k,l}$ by itself is not interesting, but it was shown by Devijver (1979) that $L_{k,l}$ holds information regarding the Bayes error $L^*$.

Theorem 5.10. (Devijver (1979)). For all distributions and with $k$ odd,
$$L_{k,k} \le L_{k,k-1} \le \cdots \le L_{k,\lceil k/2 \rceil + 1} \le L^* \le L_{k,\lceil k/2 \rceil} = L_{k\mathrm{NN}}.$$
Also,
$$\frac{L_{k\mathrm{NN}} + L_{k,\lceil k/2 \rceil + 1}}{2} \le L^* \le L_{k\mathrm{NN}}.$$

This theorem (for which we refer to Problem 5.34) shows that $L^*$ is tightly sandwiched between $L_{k\mathrm{NN}}$, the asymptotic probability of error of the $k$-nearest neighbor rule, and the "tennis" rule which requires that the difference of votes between the two classes among the $k$ nearest neighbors be at least two. If $L_n$ is close to its limit, and if we can estimate $L_n$ (see the chapters on error estimation), then we may be able to use Devijver's inequalities to obtain estimates of the Bayes error $L^*$. For additional results, see Loizou and Maybank (1987).

As a corollary of Devijver's inequalities, we note that
$$L_{k\mathrm{NN}} - L^* \le \frac{L_{k\mathrm{NN}} - L_{k,\lceil k/2 \rceil + 1}}{2}.$$

We have
$$L_{k,l} = L^* + \mathbf{E}\big\{|1-2\min(\eta(X),1-\eta(X))| \times \mathbf{P}\{\mathrm{Binomial}(k,\min(\eta(X),1-\eta(X))) \ge l \mid X\}\big\},$$
and therefore
$$L_{k,l} - L_{k,l+1} = \mathbf{E}\big\{|1-2\min(\eta(X),1-\eta(X))| \times \mathbf{P}\{\mathrm{Binomial}(k,\min(\eta(X),1-\eta(X))) = l \mid X\}\big\}$$
$$= \mathbf{E}\bigg\{|1-2\min(\eta(X),1-\eta(X))| \times \binom{k}{l}\min(\eta(X),1-\eta(X))^l(1-\min(\eta(X),1-\eta(X)))^{k-l}\bigg\}$$
$$\le \binom{k}{l}\frac{l^l(k-l)^{k-l}}{k^k} \quad \text{(because } u^l(1-u)^{k-l} \text{ reaches its maximum on } [0,1] \text{ at } u = l/k\text{)}$$
$$\le \sqrt{\frac{k}{2\pi l(k-l)}} \quad \bigg(\text{use } \binom{k}{l} \le \frac{k^k}{l^l(k-l)^{k-l}}\frac{1}{\sqrt{2\pi}}\sqrt{\frac{k}{l(k-l)}}, \text{ by Stirling's formula}\bigg).$$
With $l = \lceil k/2 \rceil$, we thus obtain
$$L_{k\mathrm{NN}} - L^* \le \frac{1}{2}\sqrt{\frac{k}{2\pi\lceil k/2 \rceil\lfloor k/2 \rfloor}} = \sqrt{\frac{k}{2\pi(k^2-1)}} \approx \frac{0.398942}{\sqrt{k}},$$
improving on Theorem 5.6. Various other inequalities may be derived in this manner as well.
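The chain of bounds above can be sanity-checked numerically. The sketch below (an illustration of ours, not part of the text) compares, for a few odd $k$, the exact value of $\binom{k}{l}l^l(k-l)^{k-l}/k^k$ with the Stirling bound $\sqrt{k/(2\pi l(k-l))}$ and with the final bound on $L_{k\mathrm{NN}} - L^*$.

```python
from math import comb, pi, sqrt, ceil

for k in (3, 5, 11, 51, 101):
    l = ceil(k / 2)
    exact = comb(k, l) * l**l * (k - l)**(k - l) / k**k    # bound on L_{k,l} - L_{k,l+1}
    stirling = sqrt(k / (2 * pi * l * (k - l)))            # Stirling-based bound
    final = sqrt(k / (2 * pi * (k**2 - 1)))                # resulting bound on L_kNN - L*
    print(k, round(exact, 4), round(stirling, 4), round(final, 4), round(0.398942 / sqrt(k), 4))
```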

Problems and Exercises

Problem 5.1. Let $\|\cdot\|$ be an arbitrary norm on $\mathbf{R}^d$, and define the $k$-nearest neighbor rule in terms of the distance $\rho(x,z) = \|x-z\|$. Show that Theorems 5.1 and 5.2 remain valid. Hint: Only Stone's lemma needs adjusting. The role of the cones $C(x,\pi/6)$ used in the proof is now played by sets with the following property: $x$ and $z$ belong to the same set if and only if
$$\left\|\frac{x}{\|x\|} - \frac{z}{\|z\|}\right\| < 1.$$

Problem 5.2. Does there exist a distribution for which $\sup_{n\ge 1}\mathbf{E}L_n > 1/2$ for the nearest neighbor rule?

Problem 5.3. Show that $L_{3\mathrm{NN}} \le 1.32L^*$ and that $L_{5\mathrm{NN}} \le 1.22L^*$.


Problem 5.4. Show that if $C^*$ is a compact subset of $\mathbf{R}^d$ and $C$ is the support set for the probability measure $\mu$, then
$$\sup_{x \in C \cap C^*}\|X_{(1)} - x\| \to 0$$
with probability one, where $X_{(1)}$ is the nearest neighbor of $x$ among $X_1, \ldots, X_n$ (Wagner (1971)).

Problem 5.5. Let $\mu$ be the probability measure of $X$ given $Y = 0$, and let $\nu$ be the probability measure of $X$ given $Y = 1$. Assume that $X$ is real-valued and that $\mathbf{P}\{Y = 0\} = \mathbf{P}\{Y = 1\} = 1/2$. Find a pair $(\nu, \mu)$ such that
(1) $\mathrm{support}(\mu) = \mathrm{support}(\nu)$;
(2) $L^* = 0$.
Conclude that $L^* = 0$ does not tell us a lot about the support sets of $\mu$ and $\nu$.

Problem 5.6. Consider the $(2k+1)$-nearest neighbor rule for distributions $(X, Y)$ with $\eta(x) \equiv p$ constant, and $Y$ independent of $X$. This exercise explores the behavior of $L_{(2k+1)\mathrm{NN}}$ as $p \downarrow 0$.
(1) For fixed integer $l > 0$, as $p \downarrow 0$, show that
$$\mathbf{P}\{\mathrm{Binomial}(2k, p) \ge l\} \sim \binom{2k}{l}p^l.$$
(2) Use a convenient representation of $L_{(2k+1)\mathrm{NN}}$ to conclude that as $p \downarrow 0$,
$$L_{(2k+1)\mathrm{NN}} = p + \left(\binom{2k}{k} + \binom{2k}{k+1}\right)p^{k+1} + o(p^{k+1}).$$

Problem 5.7. Das Gupta and Lin (1980) proposed the following rule for data with $X \in \mathbf{R}$. Assume $X$ is nonatomic. First, reorder $X_1, \ldots, X_n, X$ according to increasing values, and denote the ordered set by $X_{(1)} < X_{(2)} < \cdots < X_{(i)} < X < X_{(i+1)} < \cdots < X_{(n)}$. The $Y_i$'s are permuted so that $Y_{(j)}$ is the label of $X_{(j)}$. Take votes among $Y_{(i)}, Y_{(i+1)}, Y_{(i-1)}, Y_{(i+2)}, \ldots$ until for the first time there is agreement ($Y_{(i-j)} = Y_{(i+j+1)}$), at which time we decide that class, that is, $g_n(X) = Y_{(i-j)} = Y_{(i+j+1)}$. This rule is invariant under monotone transformations of the $x$-axis.
(1) If $L$ denotes the asymptotic expected probability of error, show that for all nonatomic $X$,
$$L = \mathbf{E}\left\{\frac{\eta(X)(1-\eta(X))}{1 - 2\eta(X)(1-\eta(X))}\right\}.$$
(2) Show that $L$ is the same as for the rule in which $X \in \mathbf{R}^d$ and we consider 2-nn, 4-nn, 6-nn, etc. rules in turn, stopping at the first $2k$-nn rule for which there is no voting tie. Assume for simplicity that $X$ has a density (with a good distance-tie breaking rule, this may be dropped).
(3) Show that $L - L_{\mathrm{NN}} \ge (1/2)(L_{3\mathrm{NN}} - L_{\mathrm{NN}})$, and thus that $L \ge (L_{\mathrm{NN}} + L_{3\mathrm{NN}})/2$.
(4) Show that $L \le L_{\mathrm{NN}}$. Hence, the rule performs somewhere in between the 1-nn and 3-nn rules.


Problem 5.8. Let $Y$ be independent of $X$, and $\eta(x) \equiv p$ constant. Consider a weighted $(2k+1)$-nearest neighbor rule with weights $(2m+1, 1, 1, \ldots, 1)$ (there are $2k$ "ones"), where $k-1 \ge m \ge 0$. For $m = 0$, we obtain the $(2k+1)$-nn rule. Let $L(k,m)$ be the asymptotic probability of error.
(1) Using results from Problem 5.6, show that
$$L(k,m) = p + \binom{2k}{k-m}p^{k-m+1} + \binom{2k}{k+m+1}p^{k+m+1} + o(p^{k-m+1})$$
as $p \downarrow 0$. Conclude that within this class of rules, for small $p$, the goodness of a rule is measured by $k - m$.
(2) Let $\delta > 0$ be small, and set $p = 1/2 - \delta$. Show that if $X$ is binomial $(2k, 1/2-\delta)$ and $Z$ is binomial $(2k, 1/2)$, then for fixed $l$, as $\delta \downarrow 0$,
$$\mathbf{P}\{X \ge l\} = \mathbf{P}\{Z \ge l\} - 2k\delta\,\mathbf{P}\{Z = l\} + o(\delta^2),$$
and
$$\mathbf{P}\{X \le l\} = \mathbf{P}\{Z \le l\} + 2k\delta\,\mathbf{P}\{Z = l+1\} + o(\delta^2).$$
(3) Conclude that for fixed $k, m$, as $\delta \downarrow 0$,
$$L(k,m) = \frac{1}{2} - 2\delta^2\big(k\mathbf{P}\{Z = k+m\} + k\mathbf{P}\{Z = k+m+1\} + \mathbf{P}\{Z \le k+m\} - \mathbf{P}\{Z \ge k+m+1\}\big) + o(\delta^2).$$
(4) Take weight vector $w$ with $k$ fixed and $m = \lfloor 10\sqrt{k}\rfloor$, and compare it with weight vector $w'$ with $k/2$ components and $m = \lfloor\sqrt{k/2}\rfloor$ as $p \downarrow 0$ and $p \uparrow 1/2$. Assume that $k$ is very large but fixed. In particular, show that $w$ is better as $p \downarrow 0$, and $w'$ is better as $p \uparrow 1/2$. For the last example, note that for fixed $c > 0$,
$$k\mathbf{P}\{Z = k+m\} + k\mathbf{P}\{Z = k+m+1\} \sim 8\sqrt{k}\,\frac{1}{\sqrt{2\pi}}e^{-2c^2} \quad \text{as } k \to \infty$$
by the central limit theorem.
(5) Conclude that there exist different weight vectors $w, w'$ for which there exists a pair of distributions of $(X, Y)$ such that their asymptotic error probabilities are differently ordered. Thus, $W$ is not totally ordered with respect to the probability of error.

Problem 5.9. Patrick and Fisher (1970) find the $k$-th nearest neighbor in each of the two classes and classify according to which is nearest. Show that their rule is equivalent to a $(2k-1)$-nearest neighbor rule.

Problem 5.10. Rabiner et al. (1979) generalize the rule of Problem 5.9 so as to classify according to the average distance to the $k$-th nearest neighbor within each class. Assume that $X$ has a density. For fixed $k$, find the asymptotic probability of error.

Problem 5.11. If $N$ is normal $(0,1)$, then $\mathbf{E}|N| = \sqrt{2/\pi}$. Prove this.

Problem 5.12. Show that if $L_{9\mathrm{NN}} = L_{(11)\mathrm{NN}}$, then $L_{(99)\mathrm{NN}} = L_{(111)\mathrm{NN}}$.


Problem 5.13. Show that
$$L_{3\mathrm{NN}} \le \left(\frac{7\sqrt{7} + 17}{27}\right)L^* \approx 1.3155\ldots L^*$$
for all distributions (Devroye (1981b)). Hint: Find the smallest constant $a$ such that $L_{3\mathrm{NN}} \le L^*(1+a)$ using the representation of $L_{3\mathrm{NN}}$ in terms of the binomial tail.

Problem 5.14. Show that if $X$ has a density $f$, then for all $u > 0$,
$$\lim_{n\to\infty}\mathbf{P}\{n^{1/d}\|X_{(1)}(X) - X\| > u \mid X\} = e^{-f(X)vu^d}$$
with probability one, where $v = \int_{S_{0,1}}dx$ is the volume of the unit ball in $\mathbf{R}^d$ (Gyorfi (1978)).

Problem 5.15. Consider a rule that takes a majority vote over all $Y_i$'s for which $\|X_i - x\| \le (c/(vn))^{1/d}$, where $v = \int_{S_{0,1}}dx$ is the volume of the unit ball, and $c > 0$ is fixed. In case of a tie decide $g_n(x) = 0$.
(1) If $X$ has a density $f$, show that $\liminf_{n\to\infty}\mathbf{E}L_n \ge \mathbf{E}\{\eta(X)e^{-cf(X)}\}$. Hint: Use the obvious inequality $\mathbf{E}L_n \ge \mathbf{P}\{Y = 1, \mu_n(S_{X,(c/(vn))^{1/d}}) = 0\}$.
(2) If $Y$ is independent of $X$ and $\eta \equiv p > 1/2$, then
$$\frac{\mathbf{E}\{\eta(X)e^{-cf(X)}\}}{L^*} = \mathbf{E}\{e^{-cf(X)}\}\frac{p}{1-p} \uparrow \infty$$
as $p \uparrow 1$. Show this.
(3) Conclude that
$$\sup_{(X,Y): L^* > 0}\frac{\liminf_{n\to\infty}\mathbf{E}L_n}{L^*} = \infty,$$
and thus that distribution-free bounds of the form $\lim_{n\to\infty}\mathbf{E}L_n \le c'L^*$ obtained for $k$-nearest neighbor estimates do not exist for these simple rules (Devroye (1981a)).

Problem 5.16. Take an example with $\eta(X) \equiv 1/2 - 1/(2\sqrt{k})$, and show that the bound $L_{k\mathrm{NN}} - L^* \le 1/\sqrt{ke}$ cannot be essentially bettered for large values of $k$, that is, there exists a sequence of distributions (indexed by $k$) for which
$$L_{k\mathrm{NN}} - L^* \ge \frac{1-o(1)}{\sqrt{k}}\mathbf{P}\{N \ge 1\}$$
as $k \to \infty$, where $N$ is a normal $(0,1)$ random variable.

Problem 5.17. If $B$ is binomial $(n,p)$, then
$$\sup_p\mathbf{P}\{B = i\} \le \sqrt{\frac{n}{2\pi i(n-i)}}, \qquad 0 < i < n.$$


Problem 5.18. Show that for $k \ge 3$,
$$\frac{\sqrt{k(k+1)}}{k-1} \le \left(1 + \frac{1}{2k}\right)\left(1 + \frac{3}{2k}\right) = 1 + \frac{2}{k} + \frac{3}{4k^2}.$$

Problem 5.19. Show that there exists a sequence of distributions of $(X, Y)$ (indexed by $k$) in which $Y$ is independent of $X$ and $\eta(x) \equiv p$ (with $p$ depending on $k$ only) such that
$$\liminf_{k\to\infty}\left(\frac{L_{k\mathrm{NN}} - L^*}{L^*}\right)\sqrt{k} \ge \gamma = 0.339942\ldots,$$
where $\gamma$ is the constant of Theorem 5.8 (Devroye (1981b)). Hint: Verify the proof of Theorem 5.8 but bound things from below. Slud's inequality (see Lemma A.6 in the Appendix) may be of use here.

Problem 5.20. Consider a weighted nearest neighbor rule with weights $1, \rho, \rho^2, \rho^3, \ldots$ for $\rho < 1$. Show that the expected probability of error tends for all distributions to a limit $L(\rho)$. Hint: Truncate at $k$ fixed but large, and argue that the tail has asymptotically negligible weight.

Problem 5.21. Continued. With $L(\rho)$ as in the previous exercise, show that $L(\rho) = L_{\mathrm{NN}}$ whenever $\rho < 1/2$.

Problem 5.22. Continued. Prove or disprove: as $\rho$ increases from 1/2 to 1, $L(\rho)$ decreases monotonically from $L_{\mathrm{NN}}$ to $L^*$. (This question is difficult.)

Problem 5.23. Show that in the weighted nn rule with weights $(1, \rho, \rho^2)$, $0 < \rho < 1$, the asymptotic probability of error is $L_{\mathrm{NN}}$ if $\rho < (\sqrt{5}-1)/2$ and is $L_{3\mathrm{NN}}$ if $\rho > (\sqrt{5}-1)/2$.

Problem 5.24. Is there any $k$, other than one, for which the $k$-nn rule is admissible, that is, for which there exists a distribution of $(X, Y)$ such that $\mathbf{E}L_n$ for the $k$-nn rule is smaller than $\mathbf{E}L_n$ for any $k'$-nn rule with $k' \neq k$, for all $n$? Hint: This is difficult. Note that if this is to hold for all $n$, then it must hold for the limits. From this, deduce that with probability one, $\eta(x) \in \{0, 1/2, 1\}$ for any such distribution.

Problem 5.25. For every fixed $n$ and odd $k$ with $n > 1000k$, find a distribution of $(X, Y)$ such that $\mathbf{E}L_n$ for the $k$-nn rule is smaller than $\mathbf{E}L_n$ for any $k'$-nn rule with $k' \neq k$, $k'$ odd. Thus, for a given $n$, no $k$ can be a priori discarded from consideration.

Problem 5.26. Let $X$ be uniform on $[0,1]$, $\eta(x) \equiv x$, and $\mathbf{P}\{Y = 0\} = \mathbf{P}\{Y = 1\} = 1/2$. Show that for the nearest neighbor rule,
$$\mathbf{E}L_n = \frac{1}{3} + \frac{3n+5}{2(n+1)(n+2)(n+3)}$$
(Cover and Hart (1967); Peterson (1970)).


Problem 5.27. For the nearest neighbor rule, if $X$ has a density, then
$$|\mathbf{E}L_n - \mathbf{E}L_{n+1}| \le \frac{1}{n+1}$$
(Cover (1968a)).

Problem 5.28. Let $X$ have a density $f \ge c > 0$ on $[0,1]$, and assume that $f_0'''$ and $f_1'''$ exist and are uniformly bounded. Show that for the nearest neighbor rule, $\mathbf{E}L_n = L_{\mathrm{NN}} + O(1/n^2)$ (Cover (1968a)). For $d$-dimensional problems this result was generalized by Psaltis, Snapp, and Venkatesh (1994).

Problem 5.29. Show that $L_{k\mathrm{NN}} \le (1+\sqrt{2/k})L^*$ is the best possible bound of the form $L_{k\mathrm{NN}} \le (1+a/\sqrt{k})L^*$ valid simultaneously for all $k \ge 1$ (Devroye (1981b)).

Problem 5.30. Show that $L_{k\mathrm{NN}} \le (1+\sqrt{1/k})L^*$ for all $k \ge 3$ (Devroye (1981b)).

Problem 5.31. Let $x = (x(1), x(2)) \in \mathbf{R}^2$. Consider the nearest neighbor rule based upon vectors with components $(x(1)^3, x(2)^7, x(1)x(2))$. Show that this is asymptotically not better than if we had used $(x(1), x(2))$. Show by example that $(x(1)^2, x(2)^3, x(1)^6x(2))$ may yield a worse asymptotic error probability than $(x(1), x(2))$.

Problem 5.32. Uniform order statistics. Let $U_{(1)} < \cdots < U_{(n)}$ be order statistics of $n$ i.i.d. uniform $[0,1]$ random variables. Show the following:
(1) $U_{(k)}$ is beta $(k, n+1-k)$, that is, $U_{(k)}$ has density
$$f(x) = \frac{n!}{(k-1)!(n-k)!}x^{k-1}(1-x)^{n-k}, \qquad 0 \le x \le 1.$$
(2)
$$\mathbf{E}\{U_{(k)}^a\} = \frac{\Gamma(k+a)\Gamma(n+1)}{\Gamma(k)\Gamma(n+1+a)}, \qquad \text{for any } a > 0.$$
(3)
$$1 - \frac{a}{n} \le \frac{\mathbf{E}\{U_{(k)}^a\}}{(k/n)^a} \le 1 + \frac{\psi(a)}{k}$$
for $a \ge 1$, where $\psi(a)$ is a function of $a$ only (Royall (1966)).

Problem 5.33. Dudani's rule. Dudani (1976) proposes a weighted $k$-nn rule where $Y_{(i)}(x)$ receives weight
$$\|X_{(k)}(x) - x\| - \|X_{(i)}(x) - x\|, \qquad 1 \le i \le k.$$
Why is this roughly speaking equivalent to attaching weight $1 - (i/k)^{1/d}$ to the $i$-th nearest neighbor if $X$ has a density? Hint: If $\mu$ is the probability measure of $X$, then
$$\mu\big(S_{x,\|X_{(1)}(x)-x\|}\big), \ldots, \mu\big(S_{x,\|X_{(k)}(x)-x\|}\big)$$
are distributed like $U_{(1)}, \ldots, U_{(k)}$, where $U_{(1)} < \cdots < U_{(n)}$ are the order statistics of $n$ i.i.d. uniform $[0,1]$ random variables. Replace $\mu$ by a good local approximation, and use results from the previous exercise.


Problem 5.34. Show Devijver's theorem (Theorem 5.10) in two parts: first establish the inequality $L^* \ge L_{k,\lceil k/2 \rceil + 1}$ for the tennis rule, and then establish the monotonicity.

Problem 5.35. Show Theorem 5.9 for the (k, l) nearest neighbor rule.

Problem 5.36. Let $R$ be the asymptotic error probability of the $(2,2)$-nearest neighbor rule. Prove that $R = \mathbf{E}\{2\eta(X)(1-\eta(X))\} = L_{\mathrm{NN}}$.

Problem 5.37. For the nearest neighbor rule, show that for all distributions,
$$\lim_{n\to\infty}\mathbf{P}\{g_n(X) = 0, Y = 1\} = \lim_{n\to\infty}\mathbf{P}\{g_n(X) = 1, Y = 0\} = \mathbf{E}\{\eta(X)(1-\eta(X))\}$$
(Devijver and Kittler (1982)). Thus, errors of both kinds are equally likely.

Problem 5.38. Let $\mathbf{P}\{Y = 1\} = \mathbf{P}\{Y = 0\} = 1/2$ and let $X$ be a random rational if $Y = 1$ (as defined in Section 5.9) such that every rational number has positive probability, and let $X$ be uniform $[0,1]$ if $Y = 0$. Show that for every $x \in [0,1]$ not rational, $\mathbf{P}\{Y_{(1)}(x) = 1\} \to 0$ as $n \to \infty$, while for every $x \in [0,1]$ rational, $\mathbf{P}\{Y_{(1)}(x) = 0\} \to 0$ as $n \to \infty$.

Problem 5.39. Let $X_1, \ldots, X_n$ be i.i.d. and have a common density. Show that for fixed $k > 0$,
$$n\,\mathbf{P}\{X_3 \text{ is among the } k \text{ nearest neighbors of } X_1 \text{ and of } X_2 \text{ in } \{X_3, \ldots, X_n\}\} \to 0.$$
Show that the same result remains valid whenever $k$ varies with $n$ such that $k/\sqrt{n} \to 0$.

Problem 5.40. Imperfect training. Let $(X, Y, Z), (X_1, Y_1, Z_1), \ldots, (X_n, Y_n, Z_n)$ be a sequence of i.i.d. triples in $\mathbf{R}^d \times \{0,1\} \times \{0,1\}$ with $\mathbf{P}\{Y = 1 \mid X = x\} = \eta(x)$ and $\mathbf{P}\{Z = 1 \mid X = x\} = \eta'(x)$. Let $Z_{(1)}(X)$ be $Z_j$ if $X_j$ is the nearest neighbor of $X$ among $X_1, \ldots, X_n$. Show that
$$\lim_{n\to\infty}\mathbf{P}\{Z_{(1)}(X) \neq Y\} = \mathbf{E}\{\eta(X) + \eta'(X) - 2\eta(X)\eta'(X)\}$$
(Lugosi (1992)).

Problem 5.41. Improve the bound in Lemma 5.3 to $\gamma_d \le 3^d - 1$.

Problem 5.42. Show that if $C(x_1, \pi/6), \ldots, C(x_{\gamma_d}, \pi/6)$ is a collection of cones covering $\mathbf{R}^d$, then $\gamma_d \ge 2^d$.

Problem 5.43. Recalling that $L_{k\mathrm{NN}} = \mathbf{E}\{\alpha_k(\eta(X))\}$, where
$$\alpha_k(p) = \min(p,1-p) + |1-2\min(p,1-p)|\,\mathbf{P}\Big\{\mathrm{Binomial}(k,\min(p,1-p)) > \frac{k}{2}\Big\},$$
show that for every fixed $p$, $\mathbf{P}\{\mathrm{Binomial}(k,\min(p,1-p)) > \frac{k}{2}\} \downarrow 0$ as $k \to \infty$ (it is the monotonicity that is harder to show). How would you then prove that $\lim_{k\to\infty}L_{k\mathrm{NN}} = L^*$?


Problem 5.44. Show that the asymptotic error probability of the rule that decides $g_n(x) = Y_{(8)}(x)$ is identical to that of the rule in which $g_n(x) = Y_{(3)}(x)$.

Problem 5.45. Show that for all distributions $L_{5\mathrm{NN}} = \mathbf{E}\{\psi_5(\eta(X)(1-\eta(X)))\}$, where $\psi_5(u) = u + u^2 + 12u^3$.

Problem 5.46. Show that for all distributions,
$$L_{5\mathrm{NN}} \ge \frac{L_{\mathrm{NN}}}{2} + \frac{L_{\mathrm{NN}}^2}{4} + \frac{3L_{\mathrm{NN}}^3}{2}$$
and that
$$L_{3\mathrm{NN}} \ge \frac{L_{\mathrm{NN}}}{2} + L_{\mathrm{NN}}^2.$$

Problem 5.47. Let $X_{(1)}$ be the nearest neighbor of $x$ among $X_1, \ldots, X_n$. Construct an example for which $\mathbf{E}\|X_{(1)} - x\| = \infty$ for all $x \in \mathbf{R}^d$. (Therefore, we have to steer clear of convergence in the mean in Lemma 5.1.) Let $X_{(1)}$ be the nearest neighbor of $X_1$ among $X_2, \ldots, X_n$. Construct a distribution such that $\mathbf{E}\|X_{(1)} - X_1\| = \infty$ for all $n$.

Problem 5.48. Consider the weighted nearest neighbor rule with weights $(w_1, \ldots, w_k)$. Define a new weight vector $(w_1, w_2, \ldots, w_{k-1}, v_1, \ldots, v_l)$, where $\sum_{i=1}^l v_i = w_k$. Thus, the weight vectors are partially ordered by the operation "partition." Assume that all weights are nonnegative. Let the asymptotic expected probabilities of error be $L$ and $L'$, respectively. True or false: for all distributions of $(X, Y)$, $L' \le L$.

Problem 5.49. Gabriel neighbors. Given $X_1, \ldots, X_n \in \mathbf{R}^d$, we say that $X_i$ and $X_j$ are Gabriel neighbors if the ball centered at $(X_i+X_j)/2$ of radius $\|X_i - X_j\|/2$ contains no $X_k$, $k \neq i, j$ (Gabriel and Sokal (1969); Matula and Sokal (1980)). Clearly, if $X_j$ is the nearest neighbor of $X_i$, then $X_i$ and $X_j$ are Gabriel neighbors. Show that if $X$ has a density and $X_1, \ldots, X_n$ are i.i.d. and drawn from the distribution of $X$, then the expected number of Gabriel neighbors of $X_1$ tends to $2^d$ as $n \to \infty$ (Devroye (1988c)).

Figure 5.7. The Gabriel graph of 20 points on the plane is shown: Gabriel neighbors are connected by an edge. Note that all circles straddling these edges have no data point in their interior.

Problem 5.50. Gabriel neighbor rule. Define the Gabriel neighbor rule simply as the rule that takes a majority vote over all $Y_i$'s for the Gabriel neighbors of $X$ among $X_1, \ldots, X_n$. Ties are broken by flipping a coin. Let $L_n$ be the conditional probability of error for the Gabriel rule. Using the result of the previous exercise, show that if $L^*$ is the Bayes error, then
(1) $\lim_{n\to\infty}\mathbf{E}L_n = 0$ if $L^* = 0$;
(2) $\limsup_{n\to\infty}\mathbf{E}L_n < L_{\mathrm{NN}}$ if $L^* > 0$, $d > 1$;
(3) $\limsup_{n\to\infty}\mathbf{E}L_n \le cL^*$ for some $c < 2$, if $d > 1$.
For (3), determine the best possible value of $c$. Hint: Use Theorem 5.8 and try obtaining, for $d = 2$, a lower bound for $\mathbf{P}\{N_X \ge 3\}$, where $N_X$ is the number of Gabriel neighbors of $X$ among $X_1, \ldots, X_n$.


6
Consistency

6.1 Universal Consistency

If we are given a sequence $D_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$ of training data, the best we can expect from a classification function is to achieve the Bayes error probability $L^*$. Generally, we cannot hope to obtain a function that exactly achieves the Bayes error probability, but it is possible to construct a sequence of classification functions $\{g_n\}$, that is, a classification rule, such that the error probability
$$L_n = L(g_n) = \mathbf{P}\{g_n(X, D_n) \neq Y \mid D_n\}$$
gets arbitrarily close to $L^*$ with large probability (that is, for "most" $D_n$). This idea is formulated in the definitions of consistency:

Definition 6.1. (Weak and strong consistency). A classification rule is consistent (or asymptotically Bayes-risk efficient) for a certain distribution of $(X, Y)$ if
$$\mathbf{E}L_n = \mathbf{P}\{g_n(X, D_n) \neq Y\} \to L^* \quad \text{as } n \to \infty,$$
and strongly consistent if
$$\lim_{n\to\infty}L_n = L^* \quad \text{with probability 1.}$$

Remark. Consistency is defined as the convergence of the expected value of $L_n$ to $L^*$. Since $L_n$ is a random variable bounded between $L^*$ and 1, this convergence is equivalent to the convergence of $L_n$ to $L^*$ in probability, which means that for every $\epsilon > 0$,
$$\lim_{n\to\infty}\mathbf{P}\{L_n - L^* > \epsilon\} = 0.$$
Obviously, since almost sure convergence always implies convergence in probability, strong consistency implies consistency. $\Box$

A consistent rule guarantees that by increasing the amount of data the probability that the error probability is within a very small distance of the optimal achievable gets arbitrarily close to one. Intuitively, the rule can eventually learn the optimal decision from a large amount of training data with high probability. Strong consistency means that by using more data the error probability gets arbitrarily close to the optimum for every training sequence except for a set of sequences that has zero probability altogether.

A decision rule can be consistent for a certain class of distributions of $(X, Y)$, but may not be consistent for others. It is clearly desirable to have a rule that is consistent for a large class of distributions. Since in many situations we do not have any prior information about the distribution, it is essential to have a rule that gives good performance for all distributions. This very strong requirement of universal goodness is formulated as follows:

Definition 6.2. (Universal consistency). A sequence of decision rules is called universally (strongly) consistent if it is (strongly) consistent for any distribution of the pair $(X, Y)$.

In this chapter we show that such universally consistent classification rules exist. At first, this may seem very surprising, for some distributions are very "strange," and seem hard to learn. For example, let $X$ be uniformly distributed on $[0,1]$ with probability 1/2, and let $X$ be atomic on the rationals with probability 1/2. For example, if the rationals are enumerated $r_1, r_2, r_3, \ldots$, then $\mathbf{P}\{X = r_i\} = 1/2^{i+1}$. Let $Y = 1$ if $X$ is rational and $Y = 0$ if $X$ is irrational. Obviously, $L^* = 0$. If a classification rule $g_n$ is consistent, then the probability of incorrectly guessing the rationality of $X$ tends to zero. Note here that we cannot "check" whether $X$ is rational or not, but we should base our decision solely on the data $D_n$ given to us. One consistent rule is the following: $g_n(x, D_n) = Y_k$ if $X_k$ is the closest point to $x$ among $X_1, \ldots, X_n$. The fact that the rationals are dense in $[0,1]$ makes the statement even more surprising. See Problem 6.3.

6.2 Classification and Regression Estimation

In this section we show how consistency of classification rules can be deduced from consistent regression estimation. In many cases the a posteriori probability $\eta(x)$ is estimated from the training data $D_n$ by some function $\eta_n(x) = \eta_n(x, D_n)$. In this case, the error probability $L(g_n) = \mathbf{P}\{g_n(X) \neq Y \mid D_n\}$ of the plug-in rule
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise} \end{cases}$$
is a random variable. Then a simple corollary of Theorem 2.2 is as follows:

Corollary 6.1. The error probability of the classifier $g_n(x)$ defined above satisfies the inequality
$$L(g_n) - L^* \le 2\int_{\mathbf{R}^d}|\eta(x) - \eta_n(x)|\,\mu(dx) = 2\,\mathbf{E}\{|\eta(X) - \eta_n(X)| \mid D_n\}.$$

The next corollary follows from the Cauchy-Schwarz inequality.

Corollary 6.2. If
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise,} \end{cases}$$
then its error probability satisfies
$$\mathbf{P}\{g_n(X) \neq Y \mid D_n\} - L^* \le 2\sqrt{\int_{\mathbf{R}^d}|\eta(x) - \eta_n(x)|^2\,\mu(dx)}.$$

Clearly, $\eta(x) = \mathbf{P}\{Y = 1 \mid X = x\} = \mathbf{E}\{Y \mid X = x\}$ is just the regression function of $Y$ on $X$. Therefore, the most interesting consequence of Theorem 2.2 is that the mere existence of a regression function estimate $\eta_n(x)$ for which
$$\int|\eta(x) - \eta_n(x)|^2\,\mu(dx) \to 0$$
in probability or with probability one implies that the plug-in decision rule $g_n$ is consistent or strongly consistent, respectively.
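The plug-in construction is purely mechanical: any estimate $\eta_n$ of the a posteriori probability yields a classifier by thresholding at 1/2. A minimal Python sketch is given below; it is not from the text, and the particular $\eta_n$ used in the illustration is an arbitrary stand-in for an estimate fitted from $D_n$.

```python
import numpy as np

def plug_in_classifier(eta_n):
    """Turn an estimate eta_n of the a posteriori probability into the plug-in rule:
    g_n(x) = 1 if eta_n(x) > 1/2, else 0 (Corollaries 6.1 and 6.2)."""
    return lambda x: (np.asarray(eta_n(x)) > 0.5).astype(int)

# Illustration with a hypothetical regression estimate (a fixed logistic curve);
# in practice eta_n would be constructed from the training data D_n.
eta_n = lambda x: 1.0 / (1.0 + np.exp(-x))
g_n = plug_in_classifier(eta_n)
print(g_n(np.array([-2.0, -0.1, 0.3, 4.0])))   # [0 0 1 1]
```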

Clearly, from Theorem 2.3, one can arrive at a conclusion analogous to Corollary 6.1 when the probabilities $\eta_0(x) = \mathbf{P}\{Y = 0 \mid X = x\}$ and $\eta_1(x) = \mathbf{P}\{Y = 1 \mid X = x\}$ are estimated from data separately by some $\eta_{0,n}$ and $\eta_{1,n}$, respectively. Usually, a key part of proving consistency of classification rules is writing the rules in one of the plug-in forms, and showing $L_1$-convergence of the approximating functions to the a posteriori probabilities. Here we have some freedom, as for any positive function $\tau_n(x)$ we have, for example,
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_{1,n}(x) \le \eta_{0,n}(x) \\ 1 & \text{otherwise} \end{cases} = \begin{cases} 0 & \text{if } \frac{\eta_{1,n}(x)}{\tau_n(x)} \le \frac{\eta_{0,n}(x)}{\tau_n(x)} \\ 1 & \text{otherwise.} \end{cases}$$


6.3 Partitioning Rules

Many important classification rules partition $\mathbf{R}^d$ into disjoint cells $A_1, A_2, \ldots$ and classify in each cell according to the majority vote among the labels of the $X_i$'s falling in the same cell. More precisely,
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n I_{\{Y_i = 1\}}I_{\{X_i \in A(x)\}} \le \sum_{i=1}^n I_{\{Y_i = 0\}}I_{\{X_i \in A(x)\}} \\ 1 & \text{otherwise,} \end{cases}$$
where $A(x)$ denotes the cell containing $x$. The decision is zero if the number of ones does not exceed the number of zeros in the cell where $X$ falls, and vice versa. The partitions we consider in this section may change with $n$, and they may also depend on the points $X_1, \ldots, X_n$, but we assume that the labels do not play a role in constructing the partition. The next theorem is a general consistency result for such partitioning rules. It requires two properties of the partition: first, cells should be small enough so that local changes of the distribution can be detected. On the other hand, cells should be large enough to contain a large number of points so that averaging among the labels is effective. $\mathrm{diam}(A)$ denotes the diameter of a set $A$, that is,
$$\mathrm{diam}(A) = \sup_{x,y \in A}\|x - y\|.$$
Let
$$N(x) = n\mu_n(A(x)) = \sum_{i=1}^n I_{\{X_i \in A(x)\}}$$
denote the number of $X_i$'s falling in the same cell as $x$. The conditions of the theorem below require that a random cell (selected according to the distribution of $X$) has a small diameter, and contains many points with large probability.
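In code, a partitioning rule only needs a map from points to cell identifiers and a per-cell tally of labels. The sketch below is our own minimal illustration (not the book's); the partition, supplied as `cell_of`, may depend on $X_1, \ldots, X_n$ but, as required above, not on the labels.

```python
import numpy as np

def partition_rule(cell_of, X, Y):
    """Majority-vote partitioning rule: cell_of maps a point to a hashable cell id."""
    votes = {}
    for xi, yi in zip(X, Y):
        c = cell_of(xi)
        ones, total = votes.get(c, (0, 0))
        votes[c] = (ones + yi, total + 1)
    def g_n(x):
        ones, total = votes.get(cell_of(x), (0, 0))
        return 1 if 2 * ones > total else 0       # 0 when the ones do not exceed the zeros
    return g_n

# Example partition: a cubic grid of cell size h (this previews the histogram rule below).
h = 0.1
cell_of = lambda x: tuple(np.floor(np.atleast_1d(x) / h).astype(int))
```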

Theorem 6.1. Consider a partitioning classification rule as defined above. Then $\mathbf{E}L_n \to L^*$ if
(1) $\mathrm{diam}(A(X)) \to 0$ in probability,
(2) $N(X) \to \infty$ in probability.

Proof. Define $\eta(x) = \mathbf{P}\{Y = 1 \mid X = x\}$. From Corollary 6.1 we recall that we need only show $\mathbf{E}|\eta_n(X) - \eta(X)| \to 0$, where
$$\eta_n(x) = \frac{1}{N(x)}\sum_{i: X_i \in A(x)}Y_i.$$
Introduce $\bar\eta(x) = \mathbf{E}\{\eta(X) \mid X \in A(x)\}$. By the triangle inequality,
$$\mathbf{E}|\eta_n(X) - \eta(X)| \le \mathbf{E}|\eta_n(X) - \bar\eta(X)| + \mathbf{E}|\bar\eta(X) - \eta(X)|.$$


By conditioning on the random variables $I_{\{X_1 \in A(x)\}}, \ldots, I_{\{X_n \in A(x)\}}$, it is easy to see that $N(x)\eta_n(x)$ is distributed as $B(N(x), \bar\eta(x))$, a binomial random variable with parameters $N(x)$ and $\bar\eta(x)$. Thus,
$$\mathbf{E}\big\{|\eta_n(X) - \bar\eta(X)| \,\big|\, X, I_{\{X_1 \in A(X)\}}, \ldots, I_{\{X_n \in A(X)\}}\big\}$$
$$\le \mathbf{E}\bigg\{\bigg|\frac{B(N(X),\bar\eta(X))}{N(X)} - \bar\eta(X)\bigg|I_{\{N(X) > 0\}}\,\bigg|\, X, I_{\{X_1 \in A(X)\}}, \ldots, I_{\{X_n \in A(X)\}}\bigg\} + I_{\{N(X) = 0\}}$$
$$\le \mathbf{E}\Bigg\{\sqrt{\frac{\bar\eta(X)(1-\bar\eta(X))}{N(X)}}I_{\{N(X) > 0\}}\,\Bigg|\, X, I_{\{X_1 \in A(X)\}}, \ldots, I_{\{X_n \in A(X)\}}\Bigg\} + I_{\{N(X) = 0\}}$$
by the Cauchy-Schwarz inequality. Taking expectations, we see that
$$\mathbf{E}|\eta_n(X) - \bar\eta(X)| \le \mathbf{E}\bigg\{\frac{1}{2\sqrt{N(X)}}I_{\{N(X) > 0\}}\bigg\} + \mathbf{P}\{N(X) = 0\} \le \frac{1}{2}\mathbf{P}\{N(X) \le k\} + \frac{1}{2\sqrt{k}} + \mathbf{P}\{N(X) = 0\}$$
for any $k$, and this can be made small, first by choosing $k$ large enough and then by using condition (2).

For $\epsilon > 0$, find a uniformly continuous $[0,1]$-valued function $\eta_\epsilon$ on a bounded set $C$ and vanishing off $C$ so that $\mathbf{E}|\eta_\epsilon(X) - \eta(X)| < \epsilon$. Next, we employ the triangle inequality:
$$\mathbf{E}|\bar\eta(X) - \eta(X)| \le \mathbf{E}|\bar\eta(X) - \bar\eta_\epsilon(X)| + \mathbf{E}|\bar\eta_\epsilon(X) - \eta_\epsilon(X)| + \mathbf{E}|\eta_\epsilon(X) - \eta(X)| = I + II + III,$$
where $\bar\eta_\epsilon(x) = \mathbf{E}\{\eta_\epsilon(X) \mid X \in A(x)\}$. Clearly, $III < \epsilon$ by the choice of $\eta_\epsilon$. Since $\eta_\epsilon$ is uniformly continuous, we can find a $\theta = \theta(\epsilon) > 0$ such that
$$II \le \epsilon + \mathbf{P}\{\mathrm{diam}(A(X)) > \theta\}.$$
Therefore, $II < 2\epsilon$ for $n$ large enough, by condition (1). Finally, $I \le III < \epsilon$. Taken together these steps prove the theorem. $\Box$

6.4 The Histogram Rule

In this section we describe the cubic histogram rule and show its universal consistency by checking the conditions of Theorem 6.1. The rule partitions $\mathbf{R}^d$ into cubes of the same size, and makes a decision according to the majority vote among the $Y_i$'s such that the corresponding $X_i$ falls in the same cube as $X$. Formally, let $\mathcal{P}_n = \{A_{n1}, A_{n2}, \ldots\}$ be a partition of $\mathbf{R}^d$ into cubes of size $h_n > 0$, that is, into sets of the type $\prod_{i=1}^d[k_ih_n, (k_i+1)h_n)$, where the $k_i$'s are integers. For every $x \in \mathbf{R}^d$ let $A_n(x) = A_{ni}$ if $x \in A_{ni}$. The histogram rule is defined by
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n I_{\{Y_i = 1\}}I_{\{X_i \in A_n(x)\}} \le \sum_{i=1}^n I_{\{Y_i = 0\}}I_{\{X_i \in A_n(x)\}} \\ 1 & \text{otherwise.} \end{cases}$$

Figure 6.1. A cubic histogram rule: The decision is 1 in the shaded area.
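A small simulation illustrates the behavior of the cubic histogram rule. The distribution, the choice $d = 1$, and the cell size $h_n = n^{-1/3}$ (which satisfies $h_n \to 0$ and $nh_n \to \infty$, as required by Theorem 6.2 below) are our own choices for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_rule(X, Y, h):
    # cubic histogram rule on R: majority vote inside the cells [k*h, (k+1)*h)
    cells = {}
    for xi, yi in zip(X, Y):
        c = int(np.floor(xi / h))
        ones, total = cells.get(c, (0, 0))
        cells[c] = (ones + yi, total + 1)
    def g(x):
        ones, total = cells.get(int(np.floor(x / h)), (0, 0))
        return 1 if 2 * ones > total else 0
    return g

def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (rng.uniform(0, 1, n) < 0.8 * (X > 0.5)).astype(int)   # eta(x) = 0.8*1{x>0.5}, L* = 0.1
    return X, Y

for n in (100, 1000, 10000):
    g = histogram_rule(*sample(n), h=n ** (-1.0 / 3))
    Xt, Yt = sample(20000)
    print(n, np.mean([g(x) != y for x, y in zip(Xt, Yt)]))   # error approaches L* = 0.1
```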

Consistency of the histogram rule was established under some additional conditions by Glick (1973). Universal consistency follows from the results of Gordon and Olshen (1978), (1980). A direct proof of strong universal consistency is given in Chapter 9.

The next theorem establishes universal consistency of certain cubic histogram rules.

Theorem 6.2. If $h_n \to 0$ and $nh_n^d \to \infty$ as $n \to \infty$, then the cubic histogram rule is universally consistent.

Proof. We check the two simple conditions of Theorem 6.1. Clearly, the diameter of each cell is $\sqrt{d}\,h_n$. Therefore condition (1) follows trivially. To show condition (2), we need to prove that for any $M < \infty$, $\mathbf{P}\{N(X) \le M\} \to 0$. Let $S$ be an arbitrary ball centered at the origin. Then the number of cells intersecting $S$ is not more than $c_1 + c_2/h_n^d$ for some positive constants $c_1, c_2$. Then
$$\mathbf{P}\{N(X) \le M\} \le \sum_{j: A_{nj} \cap S \neq \emptyset}\mathbf{P}\{X \in A_{nj}, N(X) \le M\} + \mathbf{P}\{X \in S^c\}$$
$$\le \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) \le 2M/n}}\mu(A_{nj}) + \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) > 2M/n}}\mu(A_{nj})\mathbf{P}\{n\mu_n(A_{nj}) \le M\} + \mu(S^c)$$
$$\le \frac{2M}{n}\left(c_1 + \frac{c_2}{h_n^d}\right) + \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) > 2M/n}}\mu(A_{nj})\mathbf{P}\{\mu_n(A_{nj}) - \mathbf{E}\mu_n(A_{nj}) \le M/n - \mu(A_{nj})\} + \mu(S^c)$$
$$\le \frac{2M}{n}\left(c_1 + \frac{c_2}{h_n^d}\right) + \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) > 2M/n}}\mu(A_{nj})\mathbf{P}\left\{\mu_n(A_{nj}) - \mathbf{E}\mu_n(A_{nj}) \le -\frac{\mu(A_{nj})}{2}\right\} + \mu(S^c)$$
$$\le \frac{2M}{n}\left(c_1 + \frac{c_2}{h_n^d}\right) + \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) > 2M/n}}\frac{4\mu(A_{nj})\mathbf{Var}\{\mu_n(A_{nj})\}}{(\mu(A_{nj}))^2} + \mu(S^c)$$
(by Chebyshev's inequality)
$$\le \frac{2M}{n}\left(c_1 + \frac{c_2}{h_n^d}\right) + \sum_{\substack{j: A_{nj} \cap S \neq \emptyset \\ \mu(A_{nj}) > 2M/n}}4\mu(A_{nj})\frac{1}{n\mu(A_{nj})} + \mu(S^c)$$
$$\le \frac{2M+4}{n}\left(c_1 + \frac{c_2}{h_n^d}\right) + \mu(S^c) \to \mu(S^c),$$
because $nh_n^d \to \infty$. Since $S$ is arbitrary, the proof of the theorem is complete. $\Box$

6.5 Stone’s Theorem

A general theorem by Stone (1977) allows us to deduce universal consistency of several classification rules. Consider a rule based on an estimate of the a posteriori probability $\eta$ of the form
$$\eta_n(x) = \sum_{i=1}^n I_{\{Y_i = 1\}}W_{ni}(x) = \sum_{i=1}^n Y_iW_{ni}(x),$$
where the weights $W_{ni}(x) = W_{ni}(x, X_1, \ldots, X_n)$ are nonnegative and sum to one:
$$\sum_{i=1}^n W_{ni}(x) = 1.$$
The classification rule is defined as
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n I_{\{Y_i = 1\}}W_{ni}(x) \le \sum_{i=1}^n I_{\{Y_i = 0\}}W_{ni}(x) \\ 1 & \text{otherwise,} \end{cases} = \begin{cases} 0 & \text{if } \sum_{i=1}^n Y_iW_{ni}(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
$\eta_n$ is a weighted average estimator of $\eta$. It is intuitively clear that pairs $(X_i, Y_i)$ such that $X_i$ is close to $x$ should provide more information about $\eta(x)$ than those far from $x$. Thus, the weights are typically much larger in the neighborhood of $X$, so $\eta_n$ is roughly a (weighted) relative frequency of the $X_i$'s that have label 1 among points in the neighborhood of $X$. Thus, $\eta_n$ might be viewed as a local average estimator, and $g_n$ a local (weighted) majority vote. Examples of such rules include the histogram, kernel, and nearest neighbor rules. These rules will be studied in depth later.
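The weight formulation is easy to make concrete. The sketch below (ours, one-dimensional for brevity, not code from the text) implements the local-average rule for two of the weight families just mentioned: $k$-nn weights equal to $1/k$ on the $k$ nearest points, and normalized Gaussian kernel weights.

```python
import numpy as np

def weighted_average_rule(X, Y, weights):
    """Local-average rule of Stone's form: eta_n(x) = sum_i Y_i * W_ni(x),
    with nonnegative weights summing to one; decide 1 when eta_n(x) > 1/2."""
    def g(x):
        w = weights(x, X)                # vector W_n1(x), ..., W_nn(x)
        return 1 if np.dot(w, Y) > 0.5 else 0
    return g

def knn_weights(k):
    def w(x, X):
        idx = np.argsort(np.abs(X - x))[:k]   # k nearest neighbors receive weight 1/k
        out = np.zeros(len(X))
        out[idx] = 1.0 / k
        return out
    return w

def kernel_weights(h):
    def w(x, X):
        raw = np.exp(-((X - x) / h) ** 2 / 2)   # Gaussian kernel weights
        return raw / raw.sum()                  # normalized to sum to one
    return w
```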

Theorem 6.3. (Stone (1977)). Assume that for any distribution of $X$, the weights satisfy the following three conditions:
(i) There is a constant $c$ such that, for every nonnegative measurable function $f$ satisfying $\mathbf{E}f(X) < \infty$,
$$\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)f(X_i)\Bigg\} \le c\,\mathbf{E}f(X).$$
(ii) For all $a > 0$,
$$\lim_{n\to\infty}\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)I_{\{\|X_i - X\| > a\}}\Bigg\} = 0.$$
(iii)
$$\lim_{n\to\infty}\mathbf{E}\Big\{\max_{1 \le i \le n}W_{ni}(X)\Big\} = 0.$$
Then $g_n$ is universally consistent.

Remark. Condition (ii) requires that the overall weight of the $X_i$'s outside of any ball of a fixed radius centered at $X$ must go to zero. In other words, only points in a shrinking neighborhood of $X$ should be taken into account in the averaging. Condition (iii) requires that no single $X_i$ has too large a contribution to the estimate. Hence, the number of points encountered in the averaging must tend to infinity. Condition (i) is technical. $\Box$

Proof. By Corollary 6.2 it suffices to show that for every distribution of $(X, Y)$,
$$\lim_{n\to\infty}\mathbf{E}\big\{(\eta(X) - \eta_n(X))^2\big\} = 0.$$


Introduce the notation
$$\hat\eta_n(x) = \sum_{i=1}^n\eta(X_i)W_{ni}(x).$$
Then by the simple inequality $(a+b)^2 \le 2(a^2+b^2)$ we have
$$\mathbf{E}\big\{(\eta(X) - \eta_n(X))^2\big\} = \mathbf{E}\big\{((\eta(X) - \hat\eta_n(X)) + (\hat\eta_n(X) - \eta_n(X)))^2\big\}$$
$$\le 2\Big(\mathbf{E}\big\{(\eta(X) - \hat\eta_n(X))^2\big\} + \mathbf{E}\big\{(\hat\eta_n(X) - \eta_n(X))^2\big\}\Big). \qquad (6.1)$$

Therefore, it is enough to show that both terms on the right-hand side tend to zero. Since the $W_{ni}$'s are nonnegative and sum to one, by Jensen's inequality, the first term is
$$\mathbf{E}\big\{(\eta(X) - \hat\eta_n(X))^2\big\} = \mathbf{E}\Bigg\{\bigg(\sum_{i=1}^n W_{ni}(X)(\eta(X) - \eta(X_i))\bigg)^2\Bigg\} \le \mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)(\eta(X) - \eta(X_i))^2\Bigg\}.$$

If the function $0 \le \eta^* \le 1$ is continuous with bounded support, then it is uniformly continuous as well: for every $\epsilon > 0$, there is an $a > 0$ such that for $\|x_1 - x\| < a$, $|\eta^*(x_1) - \eta^*(x)|^2 < \epsilon$. Recall here that $\|x\|$ denotes the Euclidean norm of a vector $x \in \mathbf{R}^d$. Thus, since $|\eta^*(x_1) - \eta^*(x)| \le 1$,
$$\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)(\eta^*(X) - \eta^*(X_i))^2\Bigg\} \le \mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)I_{\{\|X - X_i\| \ge a\}}\Bigg\} + \mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)\epsilon\Bigg\} \to \epsilon,$$
by (ii). Since the set of continuous functions with bounded support is dense in $L_2(\mu)$, for every $\epsilon > 0$ we can choose $\eta^*$ such that
$$\mathbf{E}\big\{(\eta(X) - \eta^*(X))^2\big\} < \epsilon.$$

By this choice, using the inequality $(a+b+c)^2 \le 3(a^2+b^2+c^2)$ (which follows from the Cauchy-Schwarz inequality),
$$\mathbf{E}\big\{(\eta(X) - \hat\eta_n(X))^2\big\} \le \mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)(\eta(X) - \eta(X_i))^2\Bigg\}$$
$$\le 3\,\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)\big((\eta(X) - \eta^*(X))^2 + (\eta^*(X) - \eta^*(X_i))^2 + (\eta^*(X_i) - \eta(X_i))^2\big)\Bigg\}$$
$$\le 3\,\mathbf{E}\big\{(\eta(X) - \eta^*(X))^2\big\} + 3\,\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)(\eta^*(X) - \eta^*(X_i))^2\Bigg\} + 3c\,\mathbf{E}\big\{(\eta(X) - \eta^*(X))^2\big\},$$
where we used (i). Therefore,
$$\limsup_{n\to\infty}\mathbf{E}\big\{(\eta(X) - \hat\eta_n(X))^2\big\} \le 3\epsilon(1 + 1 + c).$$

To handle the second term of the right-hand side of (6.1), observe that
$$\mathbf{E}\{(Y_i - \eta(X_i))(Y_j - \eta(X_j)) \mid X, X_1, \ldots, X_n\} = 0 \quad \text{for all } i \neq j$$
by independence. Therefore,
$$\mathbf{E}\big\{(\hat\eta_n(X) - \eta_n(X))^2\big\} = \mathbf{E}\Bigg\{\bigg(\sum_{i=1}^n W_{ni}(X)(\eta(X_i) - Y_i)\bigg)^2\Bigg\}$$
$$= \sum_{i=1}^n\sum_{j=1}^n\mathbf{E}\{W_{ni}(X)(\eta(X_i) - Y_i)W_{nj}(X)(\eta(X_j) - Y_j)\}$$
$$= \sum_{i=1}^n\mathbf{E}\{W_{ni}^2(X)(\eta(X_i) - Y_i)^2\} \le \mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}^2(X)\Bigg\}$$
$$\le \mathbf{E}\Bigg\{\max_{1 \le i \le n}W_{ni}(X)\sum_{j=1}^n W_{nj}(X)\Bigg\} = \mathbf{E}\Big\{\max_{1 \le i \le n}W_{ni}(X)\Big\} \to 0$$
by (iii), and the theorem is proved. $\Box$

6.6 The k-Nearest Neighbor Rule

In Chapter 5 we discussed asymptotic properties of the $k$-nearest neighbor rule when $k$ remains fixed as the sample size $n$ grows. In such cases the expected probability of error converges to a number between $L^*$ and $2L^*$. In this section we show that if $k$ is allowed to grow with $n$ such that $k/n \to 0$, the rule is weakly universally consistent. The proof is a very simple application of Stone's theorem. This result, appearing in Stone's paper (1977), was the first universal consistency result for any rule. Strong consistency, and many other different aspects of the $k$-nn rule, are studied in Chapters 11 and 26.

Recall the definition of the $k$-nearest neighbor rule: first the data are ordered according to increasing Euclidean distances of the $X_j$'s to $x$:
$$(X_{(1)}(x), Y_{(1)}(x)), \ldots, (X_{(n)}(x), Y_{(n)}(x)),$$
that is, $X_{(i)}(x)$ is the $i$-th nearest neighbor of $x$ among the points $X_1, \ldots, X_n$. Distance ties are broken by comparing indices, that is, in case of $\|X_i - x\| = \|X_j - x\|$, $X_i$ is considered to be "closer" to $x$ if $i < j$.

The $k$-nn classification rule is defined as
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^k I_{\{Y_{(i)}(x) = 1\}} \le \sum_{i=1}^k I_{\{Y_{(i)}(x) = 0\}} \\ 1 & \text{otherwise.} \end{cases}$$
In other words, $g_n(x)$ is a majority vote among the labels of the $k$ nearest neighbors of $x$.
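A direct implementation of this definition, including the index-based tie-breaking, looks as follows (a minimal sketch of ours, not code from the text).

```python
import numpy as np

def knn_rule(X, Y, k):
    """k-nearest neighbor rule; ties in distance are broken by index (smaller index wins).
    X is an (n, d) array of points, Y an array of 0/1 labels."""
    def g(x):
        d = np.linalg.norm(X - x, axis=1)
        order = np.lexsort((np.arange(len(X)), d))   # sort by distance, then by index
        votes = Y[order[:k]]
        return 1 if votes.sum() > k / 2 else 0       # decision 0 when ones do not exceed zeros
    return g
```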

Theorem 6.4. (Stone (1977)). If $k \to \infty$ and $k/n \to 0$, then for all distributions $\mathbf{E}L_n \to L^*$.

Proof. We proceed by checking the conditions of Stone's weak convergence theorem (Theorem 6.3). The weight $W_{ni}(X)$ in Theorem 6.3 equals $1/k$ iff $X_i$ is among the $k$ nearest neighbors of $X$, and equals 0 otherwise.

Condition (iii) is obvious since $k \to \infty$. For condition (ii) observe that
$$\mathbf{E}\Bigg\{\sum_{i=1}^n W_{ni}(X)I_{\{\|X_i - X\| > \epsilon\}}\Bigg\} \to 0$$
holds whenever
$$\mathbf{P}\{\|X_{(k)}(X) - X\| > \epsilon\} \to 0,$$
where $X_{(k)}(x)$ denotes the $k$-th nearest neighbor of $x$ among $X_1, \ldots, X_n$. But we know from Lemma 5.1 that this is true for all $\epsilon > 0$ whenever $k/n \to 0$.

Finally, we consider condition (i). We have to show that for any nonnegative measurable function $f$ with $\mathbf{E}f(X) < \infty$,
$$\mathbf{E}\Bigg\{\sum_{i=1}^n\frac{1}{k}I_{\{X_i \text{ is among the } k \text{ nearest neighbors of } X\}}f(X_i)\Bigg\} \le c\,\mathbf{E}f(X)$$
for some constant $c$. But we have shown in Lemma 5.3 that this inequality always holds with $c = \gamma_d$. Thus, condition (i) is verified. $\Box$


6.7 Classification Is Easier Than Regression Function Estimation

Once again assume that our decision is based on some estimate $\eta_n$ of the a posteriori probability function $\eta$, that is,
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
The bounds of Theorems 2.2, 2.3, and Corollary 6.2 point out that if $\eta_n$ is a consistent estimate of $\eta$, then the resulting rule is also consistent. For example, writing $L_n = \mathbf{P}\{g_n(X) \neq Y \mid D_n\}$, we have
$$\mathbf{E}L_n - L^* \le 2\sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}},$$
that is, $L_2$-consistent estimation of the regression function $\eta$ leads to consistent classification, and in fact, this is the main tool used in the proof of Theorem 6.3. While the said bounds are useful for proving consistency, they are almost useless when it comes to studying rates of convergence. As Theorem 6.5 below shows, for consistent rules the rate of convergence of $\mathbf{P}\{g_n(X) \neq Y\}$ to $L^*$ is always orders of magnitude better than the rate of convergence of $\sqrt{\mathbf{E}\{(\eta(X) - \eta_n(X))^2\}}$ to zero.

Figure 6.2. The difference between the error probabilities grows roughly in proportion to the shaded area. Elsewhere $\eta_n(x)$ does not need to be close to $\eta(x)$.

Pattern recognition is thus easier than regression function estimation. This will be a recurring theme: to achieve acceptable results in pattern recognition, we can do more with smaller sample sizes than in regression function estimation. This is really just a consequence of the fact that less is required in pattern recognition. It also corroborates our belief that pattern recognition is dramatically different from regression function estimation, and that it deserves separate treatment in the statistical community.


Theorem 6.5. Let $\eta_n$ be a weakly consistent regression estimate, that is,
$$\lim_{n\to\infty}\mathbf{E}\big\{(\eta_n(X) - \eta(X))^2\big\} = 0.$$
Define
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Then
$$\lim_{n\to\infty}\frac{\mathbf{E}L_n - L^*}{\sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}} = 0,$$
that is, $\mathbf{E}L_n - L^*$ converges to zero faster than the $L_2$-error of the regression estimate.

Proof. We start with the equality of Theorem 2.2:
$$\mathbf{E}L_n - L^* = 2\,\mathbf{E}\big\{|\eta(X) - 1/2|I_{\{g_n(X) \neq g^*(X)\}}\big\}.$$
Fix $\epsilon > 0$. We may bound the last factor by
$$\mathbf{E}\big\{|\eta(X) - 1/2|I_{\{g_n(X) \neq g^*(X)\}}\big\} \le \mathbf{E}\big\{I_{\{\eta(X) \neq 1/2\}}|\eta(X) - \eta_n(X)|I_{\{g_n(X) \neq g^*(X)\}}\big\}$$
$$= \mathbf{E}\big\{|\eta(X) - \eta_n(X)|I_{\{g_n(X) \neq g^*(X)\}}I_{\{|\eta(X) - 1/2| \le \epsilon\}}I_{\{\eta(X) \neq 1/2\}}\big\} + \mathbf{E}\big\{|\eta(X) - \eta_n(X)|I_{\{g_n(X) \neq g^*(X)\}}I_{\{|\eta(X) - 1/2| > \epsilon\}}\big\}$$
$$\le \sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}\left(\sqrt{\mathbf{P}\{|\eta(X) - 1/2| \le \epsilon, \eta(X) \neq 1/2\}} + \sqrt{\mathbf{P}\{g_n(X) \neq g^*(X), |\eta(X) - 1/2| > \epsilon\}}\right)$$
(by the Cauchy-Schwarz inequality).

Since $g_n(X) \neq g^*(X)$ and $|\eta(X) - 1/2| > \epsilon$ imply that $|\eta_n(X) - \eta(X)| > \epsilon$, consistency of the regression estimate implies that for any fixed $\epsilon > 0$,
$$\lim_{n\to\infty}\mathbf{P}\{g_n(X) \neq g^*(X), |\eta(X) - 1/2| > \epsilon\} = 0.$$
On the other hand,
$$\mathbf{P}\{|\eta(X) - 1/2| \le \epsilon, \eta(X) \neq 1/2\} \to 0 \quad \text{as } \epsilon \to 0,$$
which completes the proof. $\Box$

The actual value of the ratio
$$\rho_n = \frac{\mathbf{E}L_n - L^*}{\sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}}$$
cannot be universally bounded. In fact, $\rho_n$ may tend to zero arbitrarily slowly (see Problem 6.5). On the other hand, $\rho_n$ may tend to zero extremely quickly. In Problems 6.6 and 6.7 and in the theorem below, upper bounds on $\rho_n$ are given that may be used in deducing rate-of-convergence results. Theorem 6.6, in particular, states that $\mathbf{E}L_n - L^*$ tends to zero as fast as the square of the $L_2$ error of the regression estimate, i.e., $\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}$, whenever $L^* = 0$. Just how slowly $\rho_n$ tends to zero depends upon two things, basically: (1) the rate of convergence of $\eta_n$ to $\eta$, and (2) the behavior of $\mathbf{P}\{|\eta(X) - 1/2| \le \epsilon, \eta(X) \neq 1/2\}$ as a function of $\epsilon$ when $\epsilon \downarrow 0$ (i.e., the behavior of $\eta(x)$ at those $x$'s where $\eta(x) \approx 1/2$).

Theorem 6.6. Assume that $L^* = 0$, and consider the decision
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Then
$$\mathbf{P}\{g_n(X) \neq Y\} \le 4\,\mathbf{E}\big\{(\eta_n(X) - \eta(X))^2\big\}.$$

Proof. By Theorem 2.2,
$$\mathbf{P}\{g_n(X) \neq Y\} = 2\,\mathbf{E}\big\{|\eta(X) - 1/2|I_{\{g_n(X) \neq g^*(X)\}}\big\} = 2\,\mathbf{E}\big\{|\eta(X) - 1/2|I_{\{g_n(X) \neq Y\}}\big\}$$
(since $g^*(X) = Y$ by the assumption $L^* = 0$)
$$\le 2\sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}\sqrt{\mathbf{P}\{g_n(X) \neq Y\}}$$
(by the Cauchy-Schwarz inequality, noting that $|\eta(X) - 1/2| \le |\eta_n(X) - \eta(X)|$ on the event $\{g_n(X) \neq g^*(X)\}$).

Dividing both sides by $\sqrt{\mathbf{P}\{g_n(X) \neq Y\}}$ yields the result. $\Box$

The results above show that the bounds of Theorems 2.2, 2.3, and Corollary 6.2 may be arbitrarily loose, and the error probability converges to $L^*$ faster than the $L_2$-error of the regression estimate converges to zero. In some cases, consistency may even occur without convergence of $\mathbf{E}|\eta_n(X) - \eta(X)|$ to zero. Consider for example a strictly separable distribution, that is, a distribution such that there exist two sets $A, B \subset \mathbf{R}^d$ with
$$\inf_{x \in A, y \in B}\|x - y\| \ge \delta > 0$$
for some $\delta > 0$, and having the property that
$$\mathbf{P}\{X \in A \mid Y = 1\} = \mathbf{P}\{X \in B \mid Y = 0\} = 1.$$
In such cases, there is a version of $\eta$ that has $\eta(x) = 1$ on $A$ and $\eta(x) = 0$ on $B$. We say version because $\eta$ is not defined on sets of measure zero. For such strictly separable distributions, $L^* = 0$. Let $\tilde\eta$ be $1/2 - \epsilon$ on $B$ and $1/2 + \epsilon$ on $A$. Then, with
$$g(x) = \begin{cases} 0 & \text{if } \tilde\eta(x) \le 1/2 \\ 1 & \text{otherwise,} \end{cases} = \begin{cases} 0 & \text{if } x \in B \\ 1 & \text{if } x \in A, \end{cases}$$
we have $\mathbf{P}\{g(X) \neq Y\} = L^* = 0$. Yet,
$$2\,\mathbf{E}|\eta(X) - \tilde\eta(X)| = 1 - 2\epsilon$$
is arbitrarily close to one.

In a more realistic example, we consider the kernel rule (see Chapter 10),

$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise,} \end{cases}$$
in which
$$\eta_n(x) = \frac{\sum_{i=1}^n Y_iK(x - X_i)}{\sum_{i=1}^n K(x - X_i)},$$
where $K$ is the standard normal density in $\mathbf{R}^d$:
$$K(u) = \frac{1}{(2\pi)^{d/2}}e^{-\|u\|^2/2}.$$
Assume that $A$ and $B$ consist of one point each, at distance $\delta$ from each other, that is, the distribution of $X$ is concentrated on two points. If $\mathbf{P}\{Y = 0\} = \mathbf{P}\{Y = 1\} = 1/2$, we see that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n K(x - X_i) = \frac{K(0) + K(\delta)}{2} \quad \text{with probability one}$$
at $x \in A \cup B$, by the law of large numbers. Also,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n Y_iK(x - X_i) = \begin{cases} K(0)/2 & \text{if } x \in A \\ K(\delta)/2 & \text{if } x \in B \end{cases} \quad \text{with probability one.}$$
Thus,
$$\lim_{n\to\infty}\eta_n(x) = \begin{cases} \dfrac{K(0)}{K(0)+K(\delta)} & \text{if } x \in A \\[2mm] \dfrac{K(\delta)}{K(0)+K(\delta)} & \text{if } x \in B \end{cases} \quad \text{with probability one.}$$
Hence, as $\eta(x) = 1$ on $A$ and $\eta(x) = 0$ on $B$,
$$\lim_{n\to\infty}2\,\mathbf{E}|\eta(X) - \eta_n(X)| = 2\cdot\frac{1}{2}\left(\left|1 - \frac{K(0)}{K(0)+K(\delta)}\right| + \left|0 - \frac{K(\delta)}{K(0)+K(\delta)}\right|\right) = \frac{2K(\delta)}{K(0)+K(\delta)}.$$


Yet, $L^* = 0$ and $\mathbf{P}\{g_n(X) \neq Y\} \to 0$. In fact, if $D_n$ denotes the training data,
$$\lim_{n\to\infty}\mathbf{P}\{g_n(X) \neq Y \mid D_n\} = L^* \quad \text{with probability one,}$$
and
$$\lim_{n\to\infty}2\,\mathbf{E}\{|\eta_n(X) - \eta(X)| \mid D_n\} = \frac{2K(\delta)}{K(0)+K(\delta)} \quad \text{with probability one.}$$
This shows very strongly that for any $\delta > 0$, for many practical classification rules, we do not need convergence of $\eta_n$ to $\eta$ at all!
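The two-atom computation above is easy to reproduce numerically. In the sketch below (ours; $d = 1$ and $\delta = 1$ are arbitrary choices) the empirical $\eta_n$ settles near $K(0)/(K(0)+K(\delta)) \approx 0.62$ and $K(\delta)/(K(0)+K(\delta)) \approx 0.38$ rather than near $\eta = 1$ and $\eta = 0$, yet the plug-in decision is correct at both atoms.

```python
import numpy as np

rng = np.random.default_rng(1)
delta, n = 1.0, 100000
a, b = 0.0, delta                    # the two atoms: A = {a}, B = {b}

K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # standard normal kernel, d = 1

Y = rng.integers(0, 2, n)
X = np.where(Y == 1, a, b)           # given Y = 1, X = a; given Y = 0, X = b

def eta_n(x):
    w = K(x - X)
    return np.dot(Y, w) / w.sum()

print(eta_n(a), K(0) / (K(0) + K(delta)))       # both near 0.62  (> 1/2: decide 1, correct)
print(eta_n(b), K(delta) / (K(0) + K(delta)))   # both near 0.38  (< 1/2: decide 0, correct)
```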

As all the consistency proofs in Chapters 6 through 11 rely on the convergence of $\eta_n$ to $\eta$, we will create unnecessary conditions for some distributions, although it will always be possible to find distributions of $(X, Y)$ for which the conditions are needed; in the latter sense, the conditions of these universal consistency results are not improvable.

6.8 Smart Rules

A rule is a sequence of mappings $g_n: \mathbf{R}^d \times (\mathbf{R}^d \times \{0,1\})^n \to \{0,1\}$. Most rules are expected to perform better when $n$ increases. So, we say that a rule is smart if for all distributions of $(X, Y)$, $\mathbf{E}L(g_n)$ is nonincreasing, where
$$L(g_n) = \mathbf{P}\{g_n(X, D_n) \neq Y \mid D_n\}.$$
Some dumb rules are smart, such as the (useless) rule that, for each $n$, takes a majority vote over all $Y_i$'s, ignoring the $X_i$'s. This follows from the fact that
$$\mathbf{P}\Bigg\{\sum_{i=1}^n(2Y_i - 1) > 0, Y = 0 \ \text{ or }\ \sum_{i=1}^n(2Y_i - 1) \le 0, Y = 1\Bigg\}$$
is monotone in $n$. This is a property of the binomial distribution (see Problem 6.12). A histogram rule with a fixed partition is smart (Problem 6.13). The 1-nearest neighbor rule is not smart. To see this, let $(X, Y)$ be $(0, 1)$ and $(Z, 0)$ with probabilities $p$ and $1-p$, respectively, where $Z$ is uniform on $[-1000, 1000]$. Verify that for $n = 1$, $\mathbf{E}L_n = 2p(1-p)$, while for $n = 2$,
$$\mathbf{E}L_n = 2p(1-p)^2\left(\frac{1}{2} + \frac{\mathbf{E}|Z|}{4000}\right) + p^2(1-p) + (1-p)^2p = 2p(1-p)\left(\frac{5(1-p)}{8} + \frac{1}{2}\right),$$
which is larger than $2p(1-p)$ whenever $p \in (0, 1/5)$. This shows that in all these cases it is better to have $n = 1$ than $n = 2$. Similarly, the standard kernel rule (discussed in Chapter 10) with fixed $h$ is not smart (see Problems 6.14, 6.15).
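The 1-nearest neighbor computation above can be confirmed by simulation. The following sketch (our own illustration, with $p = 0.1 \in (0, 1/5)$) estimates $\mathbf{E}L_n$ for $n = 1$ and $n = 2$ and compares them with the closed-form values.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m, p):
    y = (rng.random(m) < p).astype(int)
    x = np.where(y == 1, 0.0, rng.uniform(-1000, 1000, m))   # (0,1) w.p. p, (Z,0) otherwise
    return x, y

def one_nn_error(n, p, trials=50000):
    err = 0
    for _ in range(trials):
        X, Y = sample(n, p)
        x, y = sample(1, p)
        err += Y[np.argmin(np.abs(X - x[0]))] != y[0]
    return err / trials

p = 0.1
print(one_nn_error(1, p), 2 * p * (1 - p))                              # about 0.180
print(one_nn_error(2, p), 2 * p * (1 - p) * (5 * (1 - p) / 8 + 0.5))    # about 0.191 > 0.180
```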

The error probabilities of the above examples of smart rules do not change dramatically with $n$. However, change is necessary to guarantee Bayes risk consistency. At the places of change (for example, when $h_n$ jumps to a new value in the histogram rule) the monotonicity may be lost. This leads to the conjecture that no universally consistent rule can be smart.

Problems and Exercises

Problem 6.1. Let the i.i.d. random variables $X_1, \ldots, X_n$ be distributed on $\mathbf{R}^d$ according to the density $f$. Estimate $f$ by $f_n$, a function of $x$ and $X_1, \ldots, X_n$, and assume that $\int|f_n(x) - f(x)|\,dx \to 0$ in probability (or with probability one). Then show that there exists a consistent (or strongly consistent) classification rule whenever the conditional densities $f_0$ and $f_1$ exist.

Problem 6.2. Histogram density estimation. Let $X_1, \ldots, X_n$ be i.i.d. random variables in $\mathbf{R}^d$ with density $f$. Let $\mathcal{P}_n$ be a partition of $\mathbf{R}^d$ into cubes of size $h_n$, and define the histogram density estimate by
$$f_n(x) = \frac{1}{nh_n^d}\sum_{i=1}^n I_{\{X_i \in A_n(x)\}},$$
where $A_n(x)$ is the set in $\mathcal{P}_n$ that contains $x$. Prove that the estimate is universally consistent in $L_1$ if $h_n \to 0$ and $nh_n^d \to \infty$ as $n \to \infty$, that is, for any $f$ the $L_1$ error of the estimate $\int|f_n(x) - f(x)|\,dx$ converges to zero in probability, or equivalently,
$$\lim_{n\to\infty}\mathbf{E}\left\{\int|f_n(x) - f(x)|\,dx\right\} = 0.$$
Hint: The following suggestions may be helpful:
(1) $\mathbf{E}\{\int|f_n - f|\} \le \mathbf{E}\{\int|f_n - \mathbf{E}f_n|\} + \int|\mathbf{E}f_n - f|$.
(2) $\mathbf{E}\{\int|f_n - \mathbf{E}f_n|\} = \sum_j\mathbf{E}|\mu(A_{nj}) - \mu_n(A_{nj})|$.
(3) First show $\int|\mathbf{E}f_n - f| \to 0$ for uniformly continuous $f$, and then extend it to arbitrary densities.

Problem 6.3. Let $X$ be uniformly distributed on $[0,1]$ with probability 1/2, and let $X$ be atomic on the rationals with probability 1/2 (e.g., if the rationals are enumerated $r_1, r_2, r_3, \ldots$, then $\mathbf{P}\{X = r_i\} = 1/2^{i+1}$). Let $Y = 1$ if $X$ is rational and $Y = 0$ if $X$ is irrational. Give a direct proof of consistency of the 1-nearest neighbor rule. Hint: Given $Y = 1$, the conditional distribution of $X$ is discrete. Thus, for every $\epsilon > 0$, there is an integer $k$ such that given $Y = 1$, $X$ equals one of $k$ rationals with probability at least $1 - \epsilon$. Now, if $n$ is large enough, every point in this set captures data points with label 1 with large probability. Also, for large $n$, the space between these points is filled with data points labeled with zeros.


Problem 6.4. Prove the consistency of the cubic histogram rule by checking the conditions of Stone's theorem. Hint: To check (i), first bound $W_{ni}(x)$ by
$$\frac{I_{\{X_i \in A_n(x)\}}}{\sum_{j=1}^n I_{\{X_j \in A_n(x)\}}} + \frac{1}{n}.$$
Since
$$\mathbf{E}\Bigg\{\sum_{i=1}^n\frac{1}{n}f(X_i)\Bigg\} = \mathbf{E}f(X),$$
it suffices to show that there is a constant $c' > 0$ such that for any nonnegative function $f$ with $\mathbf{E}f(X) < \infty$,
$$\mathbf{E}\Bigg\{\sum_{i=1}^n\frac{f(X_i)I_{\{X_i \in A_n(X)\}}}{\sum_{j=1}^n I_{\{X_j \in A_n(X)\}}}\Bigg\} \le c'\,\mathbf{E}f(X).$$
In doing so, you may need to use Lemma A.2 (i). To prove that condition (iii) holds, write
$$\mathbf{E}\Big\{\max_{1 \le i \le n}W_{ni}(X)\Big\} \le \frac{1}{n} + \mathbf{P}\{X \in S^c\} + \sum_{j: A_{nj} \cap S \neq \emptyset}\mathbf{E}\bigg\{I_{\{X \in A_{nj}\}}\frac{1}{n\mu_n(A_{nj})}I_{\{\mu_n(A_{nj}) > 0\}}\bigg\}$$
and use Lemma A.2 (ii).

Problem 6.5. Let $\{a_n\}$ be a sequence of positive numbers converging to zero. Give an example of an a posteriori probability function $\eta$, and a sequence of functions $\eta_n$ approximating $\eta$, such that
$$\liminf_{n\to\infty}\frac{\mathbf{P}\{g_n(X) \neq Y\} - L^*}{a_n\sqrt{\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}} > 0,$$
where
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Thus, the rate of convergence in Theorem 6.5 may be arbitrarily slow. Hint: Define $\eta(x) = 1/2 + b(x)$, where $b(x)$ is a very slowly increasing nonnegative function.

Problem 6.6. Let $\delta > 0$, and assume that $|\eta(x) - 1/2| \ge \delta$ for all $x$. Consider the decision
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Prove that
$$\mathbf{P}\{g_n(X) \neq Y\} - L^* \le \frac{2\,\mathbf{E}\{(\eta_n(X) - \eta(X))^2\}}{\delta}.$$
This shows that the rate of convergence implied by the inequality of Theorem 6.6 may be preserved for very general classes of distributions.

Page 122: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 121

Problem 6.7. Assume that $L^* = 0$, and consider the decision
$$g_n(x) = \begin{cases} 0 & \text{if } \eta_n(x) \le 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Show that for all $1 \le p < \infty$,
$$\mathbf{P}\{g_n(X) \neq Y\} \le 2^p\,\mathbf{E}|\eta_n(X) - \eta(X)|^p.$$
Hint: Proceed as in the proof of Theorem 6.6, but use Hölder's inequality.

Problem 6.8. Theorem 6.5 cannot be generalized to the $L_1$ error. In particular, show by example that it is not always true that
$$\lim_{n\to\infty}\frac{\mathbf{E}L_n - L^*}{\mathbf{E}|\eta_n(X) - \eta(X)|} = 0$$
when $\mathbf{E}|\eta_n(X) - \eta(X)| \to 0$ as $n \to \infty$ for some regression function estimate $\eta_n$. Thus, the inequality (Corollary 6.2)
$$\mathbf{E}L_n - L^* \le 2\,\mathbf{E}|\eta_n(X) - \eta(X)|$$
cannot be universally improved.

Problem 6.9. Let $\eta': \mathbf{R}^d \to [0,1]$, and define $g(x) = I_{\{\eta'(x) > 1/2\}}$. Assume that the random variable $X$ satisfies $\mathbf{P}\{\eta'(X) = 1/2\} = 0$. Let $\eta_1, \eta_2, \ldots$ be a sequence of functions such that $\mathbf{E}|\eta'(X) - \eta_n(X)| \to 0$ as $n \to \infty$. Prove that $L(g_n) \to L(g)$ for all distributions of $(X, Y)$ satisfying the condition on $X$ above, where $g_n(x) = I_{\{\eta_n(x) > 1/2\}}$ (Lugosi (1992)).

Problem 6.10. A lying teacher. Sometimes the training labels $Y_1, \ldots, Y_n$ are not available, but can only be observed through a noisy binary channel. Still, we want to decide on $Y$. Consider the following model. Assume that the $Y_i$'s in the training data are replaced by the i.i.d. binary-valued random variables $Z_i$, whose distribution is given by
$$\mathbf{P}\{Z_i = 1 \mid Y_i = 0, X_i = x\} = \mathbf{P}\{Z_i = 1 \mid Y_i = 0\} = p < 1/2,$$
$$\mathbf{P}\{Z_i = 0 \mid Y_i = 1, X_i = x\} = \mathbf{P}\{Z_i = 0 \mid Y_i = 1\} = q < 1/2.$$
Consider the decision
$$g(x) = \begin{cases} 0 & \text{if } \eta'(x) \le 1/2 \\ 1 & \text{otherwise,} \end{cases}$$
where $\eta'(x) = \mathbf{P}\{Z_1 = 1 \mid X_1 = x\}$. Show that
$$\mathbf{P}\{g(X) \neq Y\} \le L^*\left(1 + \frac{2|p-q|}{1-2\max(p,q)}\right).$$
Use Problem 6.9 to conclude that if the binary channel is symmetric (i.e., $p = q < 1/2$), and $\mathbf{P}\{\eta'(X) = 1/2\} = 0$, then $L_1$-consistent estimation leads to a consistent rule, in spite of the fact that the labels $Y_i$ were not available in the training sequence (Lugosi (1992)).

Use Problem 6.9 to conclude that if the binary channel is symmetric (i.e., p =q < 1/2), and Pη′(X) = 1/2 = 0, then L1-consistent estimation leads to aconsistent rule, in spite of the fact that the labels Yi were not available in thetraining sequence (Lugosi (1992)).

Page 123: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 122

Problem 6.11. Develop a discrimination rule which has the property

limn→∞

ELn = ρ = E√

η(X)(1− η(X)),

for all distributions such that X has a density. Note: clearly, since ρ ≥ L∗, thisrule is not universally consistent, but it will aid you in “visualizing” the Matushitaerror!

Problem 6.12. If Zn is binomial (n, p) and Z is Bernoulli (p), independent ofZn, then show that PZn > n/2, Z = 0+PZn ≤ n/2, Z = 1 is nonincreasingin n.

Problem 6.13. Let gn be the histogram rule based on a fixed partition P. Showthat gn is smart.

Problem 6.14. Show that the kernel rule with gaussian kernel and h = 1, d = 1,is not smart (kernel rules are discussed in Chapter 10). Hint: Consider n = 1and n = 2 only.

Problem 6.15. Show that the kernel rule on R, with K(x) = I[−1,1](x), andh ↓ 0, such that nh→∞, is not smart.

Problem 6.16. Conjecture: no universally consistent rule is smart.

Problem 6.17. A rule $g_n: \mathbf{R}^d \times (\mathbf{R}^d \times \{0,1\})^n \to \{0,1\}$ is called symmetric if $g_n(x, D_n) = g_n(x, D_n')$ for every $x$ and every training sequence $D_n$, where $D_n'$ is an arbitrary permutation of the pairs $(X_i, Y_i)$ in $D_n$. Any nonsymmetric rule $g_n$ may be symmetrized by taking a majority vote at every $x \in \mathbf{R}^d$ over all $g_n(x, D_n')$ obtained by the $n!$ permutations of $D_n$. It may intuitively be expected that symmetrized rules perform better. Prove that this is false, that is, exhibit a distribution and a nonsymmetric classifier $g_n$ such that its expected probability of error is smaller than that of the symmetrized version of $g_n$. Hint: Take $g_3(x, D_3) = 1 - Y_1$.


7 Slow Rates of Convergence

In this chapter we consider the general pattern recognition problem: Given the observation X and the training data Dn = ((X1, Y1), . . . , (Xn, Yn)) of independent identically distributed random variable pairs, we estimate the label Y by the decision

gn(X) = gn(X,Dn).

The error probability is

Ln = P{Y ≠ gn(X)|Dn}.

Obviously, the average error probability ELn = P{Y ≠ gn(X)} is completely determined by the distribution of the pair (X,Y ) and the classifier gn. We have seen in Chapter 6 that there exist classification rules, such as the cubic histogram rule with properly chosen cube sizes, such that lim_{n→∞} ELn = L∗ for all possible distributions. The next question is whether there are classification rules with ELn tending to the Bayes risk at a specified rate for all distributions. Disappointingly, such rules do not exist.

7.1 Finite Training Sequence

The first negative result shows that for any classification rule and for any fixed n, there exists a distribution such that the difference between the error probability of the rule and L∗ is larger than 1/4. To explain this, note that for fixed n, we can find a sufficiently complex distribution for which the sample size n is hopelessly small.

Theorem 7.1. (Devroye (1982b)). Let ε > 0 be an arbitrarily small number. For any integer n and classification rule gn, there exists a distribution of (X,Y ) with Bayes risk L∗ = 0 such that

ELn ≥ 1/2 − ε.

Proof. First we construct a family of distributions of (X,Y ). Then we show that the error probability of any classifier is large for at least one member of the family. For every member of the family, X is uniformly distributed on the set {1, 2, . . . ,K} of positive integers:

pi = P{X = i} = 1/K if i ∈ {1, . . . ,K}, and 0 otherwise,

where K is a large integer specified later. Now, the family of distributions of (X,Y ) is parameterized by a number b ∈ [0, 1), that is, every b determines a distribution as follows. Let b ∈ [0, 1) have binary expansion b = 0.b1b2 . . ., and define Y = b_X. As the label Y is a function of X, there exists a perfect decision, and thus L∗ = 0. We show that for any decision rule gn there is a b such that if Y = b_X, then gn has very poor performance. Denote the average error probability corresponding to the distribution determined by b by Rn(b) = ELn.

The proof of the existence of a bad distribution is based on the so-called probabilistic method. Here the key trick is the randomization of b. Define a random variable B which is uniformly distributed in [0, 1) and independent of X and X1, . . . , Xn. Then we may compute the expected value of the random variable Rn(B). Since for any decision rule gn,

sup_{b∈[0,1)} Rn(b) ≥ E{Rn(B)},

a lower bound for E{Rn(B)} proves the existence of a b ∈ [0, 1) whose corresponding error probability exceeds the lower bound.

Since B is uniformly distributed in [0, 1), its binary expansion B = 0.B1B2 . . . is a sequence of independent binary random variables with P{Bi = 0} = P{Bi = 1} = 1/2. But

E{Rn(B)} = P{gn(X,Dn) ≠ Y } = P{gn(X,Dn) ≠ B_X}
    = P{gn(X, X1, B_{X1}, . . . , Xn, B_{Xn}) ≠ B_X}
    = E{ P{gn(X, X1, B_{X1}, . . . , Xn, B_{Xn}) ≠ B_X | X, X1, . . . , Xn} }
    ≥ (1/2) P{X ≠ X1, X ≠ X2, . . . , X ≠ Xn},

since if X ≠ Xi for all i = 1, 2, . . . , n, then given X, X1, . . . , Xn, Y = B_X is conditionally independent of gn(X,Dn) and Y takes values 0 and 1 with probability 1/2. But clearly,

P{X ≠ X1, X ≠ X2, . . . , X ≠ Xn | X} = P{X ≠ X1 | X}^n = (1 − 1/K)^n.

In summary,

sup_{b∈[0,1)} Rn(b) ≥ (1/2)(1 − 1/K)^n.

The lower bound tends to 1/2 as K → ∞. □
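The construction is easy to simulate. The following minimal sketch (ours, not part of the text; it assumes NumPy is available and uses a majority-vote rule merely as a stand-in for an arbitrary gn) estimates E{Rn(B)} by Monte Carlo and compares it with the (1/2)(1 − 1/K)^n lower bound.

```python
import numpy as np

# Monte Carlo illustration of Theorem 7.1: labels are hidden bits b_1,...,b_K,
# X is uniform on {1,...,K}, and any rule must guess blindly on unseen points.
rng = np.random.default_rng(0)

def majority_rule(x, xs, ys):
    # an arbitrary classifier: majority vote over training points equal to x;
    # ties and unseen points are decided as 0
    hits = ys[xs == x]
    return int(hits.sum() * 2 > hits.size)

def avg_error(K, n, trials=2000):
    errs = 0.0
    for _ in range(trials):
        b = rng.integers(0, 2, size=K)       # random distribution (random bits B_i)
        xs = rng.integers(1, K + 1, size=n)  # training inputs
        ys = b[xs - 1]                       # training labels Y_i = b_{X_i}
        x = rng.integers(1, K + 1)           # test point
        errs += (majority_rule(x, xs, ys) != b[x - 1])
    return errs / trials

n = 25
for K in (50, 500, 5000):
    print(K, avg_error(K, n), 0.5 * (1 - 1 / K) ** n)  # empirical vs. lower bound
```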

Theorem 7.1 states that even though we have rules that are universally consistent, that is, they asymptotically provide the optimal performance for any distribution, their finite sample performance is always extremely bad for some distributions. This means that no classifier guarantees that with a sample size of (say) n = 10^8 we get within 1/4 of the Bayes error probability for all distributions. However, as the bad distribution depends upon n, Theorem 7.1 does not allow us to conclude that there is one distribution for which the error probability is more than L∗ + 1/4 for all n. Indeed, that would contradict the very existence of universally consistent rules.

7.2 Slow Rates

The next question is whether a certain universal rate of convergence to L∗ is achievable for some classifier. For example, Theorem 7.1 does not exclude the existence of a classifier such that for every n, ELn − L∗ ≤ c/n for all distributions, for some constant c depending upon the actual distribution. The next negative result is that this cannot be the case. Theorem 7.2 below states that the error probability ELn of any classifier is larger than (say) L∗ + c/(log log log n) for every n for some distribution, even if c depends on the distribution. (This can be seen by considering that by Theorem 7.2, there exists a distribution of (X,Y ) such that ELn ≥ L∗ + 1/√(log log log n) for every n.) Moreover, there is no sequence of numbers an converging to zero such that there is a classification rule with error probability below L∗ plus an for all distributions.

Thus, in practice, no classifier assures us that its error probability is close to L∗, unless the actual distribution is known to be a member of a restricted class of distributions. Now, it is easily seen that in the proof of both theorems we could take X to have uniform distribution on [0, 1], or any other density (see Problem 7.2). Therefore, putting restrictions on the distribution of X alone does not suffice to obtain rate-of-convergence results. For such results, one needs conditions on the a posteriori probability η(x) as well. However, if only training data give information about the joint distribution, then theorems with extra conditions on the distribution have little practical value, as it is impossible to detect whether, for example, the a posteriori probability η(x) is twice differentiable or not.

Now, the situation may look hopeless, but this is not so. Simply put, the Bayes error is too difficult a target to shoot at.

Weaker versions of Theorem 7.2 appeared earlier in the literature. First Cover (1968b) showed that for any sequence of classification rules, for sequences {an} converging to zero at arbitrarily slow algebraic rates (i.e., as 1/n^δ for arbitrarily small δ > 0), there exists a distribution such that ELn ≥ L∗ + an infinitely often. Devroye (1982b) strengthened Cover's result allowing sequences tending to zero arbitrarily slowly. The next result asserts that ELn ≥ L∗ + an for every n.

Theorem 7.2. Let {an} be a sequence of positive numbers converging to zero with 1/16 ≥ a1 ≥ a2 ≥ . . .. For every sequence of classification rules, there exists a distribution of (X,Y ) with L∗ = 0, such that

ELn ≥ an

for all n.

This result shows that universally good classification rules do not exist. Rate of convergence studies for particular rules must necessarily be accompanied by conditions on (X,Y ). That these conditions too are necessarily restrictive follows from examples suggested in Problem 7.2. Under certain regularity conditions it is possible to obtain upper bounds for the rates of convergence of the probability of error of certain rules to L∗. Then it is natural to ask what the fastest achievable rate is for the given class of distributions. A theory for regression function estimation was worked out by Stone (1982). Related results for classification were obtained by Marron (1983). In the proof of Theorem 7.2 we will need the following simple lemma:

Lemma 7.1. For any monotone decreasing sequence {an} of positive numbers converging to zero with a1 ≤ 1/16, a probability distribution (p1, p2, . . .) may be found such that p1 ≥ p2 ≥ . . ., and for all n

∑_{i=n+1}^∞ pi ≥ max(8an, 32np_{n+1}).

Proof. It suffices to look for pi's such that

∑_{i=n+1}^∞ pi ≥ max(8an, 32npn).

These conditions are easily satisfied. For positive integers u < v, define the function H(v, u) = ∑_{i=u}^{v−1} 1/i. First we find a sequence 1 = n1 < n2 < . . . of integers with the following properties:

(a) H(n_{k+1}, nk) is monotonically increasing,

(b) H(n2, n1) ≥ 32,

(c) 8a_{nk} ≤ 1/2^k for all k ≥ 1.

Note that (c) may only be satisfied if a_{n1} = a1 ≤ 1/16. To this end, define constants c1, c2, . . . by

ck = 32 / (2^k H(n_{k+1}, nk)), k ≥ 1,

so that the ck's are decreasing in k, and

(1/32) ∑_{k=1}^∞ ck H(n_{k+1}, nk) = ∑_{k=1}^∞ 1/2^k = 1.

For n ∈ [nk, n_{k+1}), we define pn = ck/(32n). We claim that these numbers have the required properties. Indeed, pn is decreasing, and

∑_{n=1}^∞ pn = ∑_{k=1}^∞ (ck/32) H(n_{k+1}, nk) = 1.

Finally, if n ∈ [nk, n_{k+1}), then

∑_{i=n+1}^∞ pi ≥ ∑_{j=k+1}^∞ (cj/32) H(n_{j+1}, nj) = ∑_{j=k+1}^∞ 1/2^j = 1/2^k.

Clearly, on the one hand, by the monotonicity of H(n_{k+1}, nk), 1/2^k ≥ ck = 32npn. On the other hand, 1/2^k ≥ 8a_{nk} ≥ 8an. This concludes the proof. □

Proof of Theorem 7.2. We introduce some notation. Let b = 0.b1b2b3 . . . be a real number on [0, 1] with the shown binary expansion, and let B be a random variable uniformly distributed on [0, 1] with expansion B = 0.B1B2B3 . . .. Let us restrict ourselves to a random variable X with

P{X = i} = pi, i ≥ 1,

where p1 ≥ p2 ≥ . . . > 0, and ∑_{i=n+1}^∞ pi ≥ max(8an, 32np_{n+1}) for every n. That such pi's exist follows from Lemma 7.1. Set Y = b_X. As Y is a function of X, we see that L∗ = 0. Each b ∈ [0, 1) however describes a different distribution. With b replaced by B we have a random distribution.

Introduce the short notation ∆n = ((X1, B_{X1}), . . . , (Xn, B_{Xn})), and define Gni = gn(i,∆n). If Ln(B) denotes the probability of error P{gn(X,∆n) ≠ Y | B, X1, . . . , Xn} for the random distribution, then we note that we may write

Ln(B) = ∑_{i=1}^∞ pi I_{Gni≠Bi}.

If Ln(b) is the probability of error for a distribution parametrized by b, then

sup_b inf_n E{Ln(b)}/(2an) ≥ sup_b E{ inf_n Ln(b)/(2an) }
    ≥ E{ inf_n Ln(B)/(2an) }
    = E{ E{ inf_n Ln(B)/(2an) | X1, X2, . . . } }.

We consider only the conditional expectation for now. We have

E{ inf_n Ln(B)/(2an) | X1, X2, . . . }
    ≥ P{ ∩_{n=1}^∞ {Ln(B) ≥ 2an} | X1, X2, . . . }
    ≥ 1 − ∑_{n=1}^∞ P{ Ln(B) < 2an | X1, X2, . . . }
    = 1 − ∑_{n=1}^∞ P{ Ln(B) < 2an | X1, X2, . . . , Xn }
    = 1 − ∑_{n=1}^∞ E{ P{ Ln(B) < 2an | ∆n } | X1, X2, . . . , Xn }.

We bound the conditional probabilities inside the sum:

P{ Ln(B) < 2an | ∆n }
    ≤ P{ ∑_{i∉{X1,...,Xn}} pi I_{Gni≠Bi} < 2an | ∆n }
    = P{ ∑_{i∉{X1,...,Xn}} pi I_{Bi=1} < 2an | ∆n }
      (noting that Gni and X1, . . . , Xn are all functions of ∆n)
    ≤ P{ ∑_{i=n+1}^∞ pi I_{Bi=1} < 2an }
      (since the pi's are decreasing, by stochastic dominance)
    = P{ ∑_{i=n+1}^∞ pi Bi < 2an }.

Now everything boils down to bounding these probabilities from above. We proceed by Chernoff's bounding technique. The idea is the following: For any random variable X, and s > 0, by Markov's inequality,

P{X ≥ ε} = P{e^{sX} ≥ e^{sε}} ≤ E{e^{sX}} / e^{sε}.

By cleverly choosing s one can often obtain very sharp bounds. For more discussion and examples of Chernoff's method, refer to Chapter 8. In our case,

P{ ∑_{i=n+1}^∞ pi Bi < 2an }
    ≤ E{ e^{2san − s ∑_{i=n+1}^∞ pi Bi} }
    = e^{2san} ∏_{i=n+1}^∞ ( 1/2 + (1/2) e^{−spi} )
    ≤ e^{2san} ∏_{i=n+1}^∞ (1/2)( 2 − spi + s²pi²/2 )   (since e^{−x} ≤ 1 − x + x²/2 for x ≥ 0)
    ≤ exp( 2san + ∑_{i=n+1}^∞ ( −spi/2 + s²pi²/4 ) )   (since 1 − x ≤ e^{−x})
    ≤ exp( 2san − sΣ/2 + s²p_{n+1}Σ/4 )   (where Σ = ∑_{i=n+1}^∞ pi)
    = exp( −(Σ − 4an)² / (4Σp_{n+1}) )   (by taking s = (Σ − 4an)/(p_{n+1}Σ), and the fact that Σ > 4an)
    ≤ exp( −Σ/(16p_{n+1}) )   (since Σ ≥ 8an)
    ≤ e^{−2n}   (since Σ ≥ 32p_{n+1}n).

Thus, we conclude that

sup_b inf_n E{Ln(b)}/(2an) ≥ 1 − ∑_{n=1}^∞ e^{−2n} = (e² − 2)/(e² − 1) > 1/2,

so that there exists a b for which E{Ln(b)} ≥ an for all n. □

Problems and Exercises

Problem 7.1. Extend Theorem 7.2 to distributions with 0 < L∗ < 1/2: show that if {an} is a sequence of positive numbers as in Theorem 7.2, then for any classification rule there is a distribution such that ELn − L∗ ≥ an for every n for which L∗ + an < 1/2.

Problem 7.2. Prove Theorems 7.1 and 7.2 under one of the following additional assumptions, which make the case that one will need very restrictive conditions indeed to study rates of convergence.

(1) X has a uniform density on [0, 1).

(2) X has a uniform density on [0, 1) and η is infinitely many times continuously differentiable on [0, 1).

(3) η is unimodal in x ∈ R2, that is, η(λx) decreases as λ > 0 increases for any x ∈ R2.

(4) η is {0, 1}-valued, X is R2-valued, and the set {x : η(x) = 1} is a compact convex set containing the origin.

Problem 7.3. There is no super-classifier. Show that for every sequence of classification rules {gn} there is a universally consistent sequence of rules {g′n}, such that for some distribution of (X,Y ),

P{gn(X) ≠ Y } > P{g′n(X) ≠ Y }

for all n.

Problem 7.4. The next two exercises are intended to demonstrate that the weaponry of pattern recognition can often be successfully used for attacking other statistical problems. For example, a consequence of Theorem 7.2 is that estimating infinite discrete distributions is hard. Consider the problem of estimating a distribution (p1, p2, . . .) on the positive integers 1, 2, 3, . . . from a sample X1, . . . , Xn of i.i.d. random variables with P{X1 = i} = pi, i ≥ 1. Show that for any decreasing sequence {an} of positive numbers converging to zero with a1 ≤ 1/16, and any estimate pi,n, there exists a distribution such that

E{ ∑_{i=1}^∞ |pi − pi,n| } ≥ an.

Hint: Consider a classification problem with L∗ = 0, P{Y = 0} = 1/2, and X concentrated on {1, 2, . . .}. Assume that the class-conditional probabilities p_i^{(0)} = P{X = i|Y = 0} and p_i^{(1)} = P{X = i|Y = 1} are estimated from two i.i.d. samples X_1^{(0)}, . . . , X_n^{(0)} and X_1^{(1)}, . . . , X_n^{(1)}, distributed according to p_i^{(0)} and p_i^{(1)}, respectively. Use Theorem 2.3 to show that for the classification rule obtained from these estimates in a natural way,

ELn ≤ (1/2) E{ ∑_{i=1}^∞ |p_i^{(0)} − p_{i,n}^{(0)}| + ∑_{i=1}^∞ |p_i^{(1)} − p_{i,n}^{(1)}| },

therefore the lower bound of Theorem 7.2 can be applied.

Problem 7.5. A similar slow-rate result appears in density estimation. Consider the problem of estimating a density f on R from an i.i.d. sample X1, . . . , Xn having density f. Show that for any decreasing sequence {an} of positive numbers converging to zero with a1 ≤ 1/16, and any density estimate fn, there exists a distribution such that

E{ ∫ |f(x) − fn(x)| dx } ≥ an.

This result was proved by Birgé (1986) using a different—and in our view much more complicated—argument. Hint: Put pi = ∫_i^{i+1} f(x) dx and pi,n = ∫_i^{i+1} fn(x) dx and apply Problem 7.4.


8 Error Estimation

8.1 Error Counting

Estimating the error probability Ln = P{gn(X) ≠ Y |Dn} of a classification function gn is of essential importance. The designer always wants to know what performance can be expected from a classifier. As the designer does not know the distribution of the data—otherwise there would not be any need to design a classifier—it is important to find error estimation methods that work well without any condition on the distribution of (X,Y ). This motivates us to search for distribution-free performance bounds for error estimation methods.

Suppose that we want to estimate the error probability of a classifier gn designed from the training sequence Dn = ((X1, Y1), . . . , (Xn, Yn)). Assume first that a testing sequence

Tm = ((Xn+1, Yn+1), . . . , (Xn+m, Yn+m))

is available, which is a sequence of i.i.d. pairs that are independent of (X,Y ) and Dn, and that are distributed as (X,Y ). An obvious way to estimate Ln is to count the number of errors that gn commits on Tm. The error-counting estimator Ln,m is defined by the relative frequency

Ln,m = (1/m) ∑_{j=1}^m I_{gn(Xn+j)≠Yn+j}.


The estimator is clearly unbiased in the sense that

E{Ln,m|Dn} = Ln,

and the conditional distribution of mLn,m, given the training data Dn, is binomial with parameters m and Ln. This makes analysis easy, for properties of the binomial distribution are well known. One main tool in the analysis is Hoeffding's inequality, which we will use many times throughout this book.

8.2 Hoeffding’s Inequality

The following inequality was proved for binomial random variables by Chernoff (1952) and Okamoto (1958). The general format is due to Hoeffding (1963):

Theorem 8.1. (Hoeffding (1963)). Let X1, . . . , Xn be independent bounded random variables such that Xi falls in the interval [ai, bi] with probability one. Denote their sum by Sn = ∑_{i=1}^n Xi. Then for any ε > 0 we have

P{Sn − ESn ≥ ε} ≤ e^{−2ε² / ∑_{i=1}^n (bi − ai)²}

and

P{Sn − ESn ≤ −ε} ≤ e^{−2ε² / ∑_{i=1}^n (bi − ai)²}.

The proof uses a simple auxiliary inequality:

Lemma 8.1. Let X be a random variable with EX = 0, a ≤ X ≤ b. Then for s > 0,

E{e^{sX}} ≤ e^{s²(b−a)²/8}.

Proof. Note that by convexity of the exponential function,

e^{sx} ≤ ((x − a)/(b − a)) e^{sb} + ((b − x)/(b − a)) e^{sa}   for a ≤ x ≤ b.

Exploiting EX = 0, and introducing the notation p = −a/(b − a), we get

E{e^{sX}} ≤ (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}
    = (1 − p + p e^{s(b−a)}) e^{−ps(b−a)}
    =: e^{φ(u)},

where u = s(b − a), and φ(u) = −pu + log(1 − p + pe^u). But by straightforward calculation it is easy to see that the derivative of φ is

φ′(u) = −p + p / (p + (1 − p)e^{−u}),

therefore φ(0) = φ′(0) = 0. Moreover,

φ′′(u) = p(1 − p)e^{−u} / (p + (1 − p)e^{−u})² ≤ 1/4.

Thus, by Taylor series expansion with remainder, for some θ ∈ [0, u],

φ(u) = φ(0) + uφ′(0) + (u²/2)φ′′(θ) ≤ u²/8 = s²(b − a)²/8. □

Proof of Theorem 8.1. The proof is based on Chernoff's bounding method (Chernoff (1952)): by Markov's inequality, for any nonnegative random variable X, and any ε > 0,

P{X ≥ ε} ≤ EX/ε.

Therefore, if s is an arbitrary positive number, then for any random variable X,

P{X ≥ ε} = P{e^{sX} ≥ e^{sε}} ≤ E{e^{sX}} / e^{sε}.

In Chernoff's method, we find an s > 0 that minimizes the upper bound or makes the upper bound small. In our case, we have

P{Sn − ESn ≥ ε}
    ≤ e^{−sε} E{ exp( s ∑_{i=1}^n (Xi − EXi) ) }
    = e^{−sε} ∏_{i=1}^n E{e^{s(Xi − EXi)}}   (by independence)
    ≤ e^{−sε} ∏_{i=1}^n e^{s²(bi − ai)²/8}   (by Lemma 8.1)
    = e^{−sε} e^{s² ∑_{i=1}^n (bi − ai)²/8}
    = e^{−2ε² / ∑_{i=1}^n (bi − ai)²}   (by choosing s = 4ε / ∑_{i=1}^n (bi − ai)²).

The second inequality is proved analogously. □
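Chernoff's method is also easy to carry out numerically. The sketch below (ours, for illustration only; it assumes NumPy and uses a Binomial(n, p) sum as the example) minimizes E{e^{sX}}e^{−sε} over a grid of s and compares the result with the exact tail probability and with Hoeffding's bound.

```python
import math
import numpy as np

# Chernoff's bound for S_n ~ Binomial(n, p): P{S_n - np >= eps}
n, p, eps = 100, 0.3, 10.0

def chernoff(s):
    # E exp(s(S_n - np)) / exp(s*eps), with E e^{s X_i} = 1 - p + p e^s
    return math.exp(-s * (eps + n * p)) * (1 - p + p * math.exp(s)) ** n

s_grid = np.linspace(1e-4, 5, 20000)
chernoff_bound = min(chernoff(s) for s in s_grid)   # optimize over s numerically

exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
            for k in range(math.ceil(n * p + eps), n + 1))

hoeffding = math.exp(-2 * eps**2 / n)               # here b_i - a_i = 1

print("exact tail     :", exact)
print("Chernoff bound :", chernoff_bound)
print("Hoeffding bound:", hoeffding)
```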


The two inequalities in Theorem 8.1 may be combined to get

P{|Sn − ESn| ≥ ε} ≤ 2e^{−2ε² / ∑_{i=1}^n (bi − ai)²}.

Now, we can apply this inequality to get a distribution-free performance bound for the counting error estimate:

Corollary 8.1. For every ε > 0,

P{ |Ln,m − Ln| > ε | Dn } ≤ 2e^{−2mε²}.

The variance of the estimate can easily be computed using the fact that, conditioned on the data Dn, mLn,m is binomially distributed:

E{ (Ln,m − Ln)² | Dn } = Ln(1 − Ln)/m ≤ 1/(4m).

These are just the types of inequalities we want, for they are valid for any distribution and data size, and the bounds do not even depend on gn.
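A quick simulation (ours; NumPy assumed, with Ln fixed at an arbitrary value) illustrates both statements: conditionally on Dn the estimate is a binomial average, so its deviations stay within the 2e^{−2mε²} bound and its variance below 1/(4m).

```python
import numpy as np

rng = np.random.default_rng(2)

# Conditionally on D_n, m * L_{n,m} is Binomial(m, L_n); fix L_n and simulate.
Ln, m, eps, reps = 0.23, 400, 0.05, 200_000

Lnm = rng.binomial(m, Ln, size=reps) / m
print("P{|L_{n,m} - L_n| > eps} ~", np.mean(np.abs(Lnm - Ln) > eps))
print("Hoeffding bound          :", 2 * np.exp(-2 * m * eps**2))
print("empirical variance       :", Lnm.var())
print("L_n(1-L_n)/m             :", Ln * (1 - Ln) / m)
print("1/(4m)                   :", 1 / (4 * m))
```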

Consider a special case in which all the Xi's take values in [−c, c] and have zero mean. Then Hoeffding's inequality states that

P{Sn/n > ε} ≤ e^{−nε²/(2c²)}.

This bound, while useful for ε larger than c/√n, ignores variance information. When Var{Xi} ≪ c², it is indeed possible to outperform Hoeffding's inequality. In particular, we have:

Theorem 8.2. (Bennett (1962) and Bernstein (1946)). Let X1, . . . , Xn be independent real-valued random variables with zero mean, and assume that Xi ≤ c with probability one. Let

σ² = (1/n) ∑_{i=1}^n Var{Xi}.

Then, for any ε > 0,

P{ (1/n) ∑_{i=1}^n Xi > ε } ≤ exp( −(nε/(2c)) ( (1 + σ²/(2cε)) log(1 + 2cε/σ²) − 1 ) )

(Bennett (1962)), and

P{ (1/n) ∑_{i=1}^n Xi > ε } ≤ exp( −nε² / (2σ² + 2cε/3) )

(Bernstein (1946)).

The proofs are left as exercises (Problem 8.2). We note that Bernstein's inequality kicks in when ε is larger than about max(σ/√n, c/√n). It is typically better than Hoeffding's inequality when σ ≪ c.


8.3 Error Estimation Without Testing Data

A serious problem concerning the practical applicability of the estimate introduced above is that it requires a large, independent testing sequence. In practice, however, an additional sample is rarely available. One usually wants to incorporate all available (Xi, Yi) pairs in the decision function. In such cases, to estimate Ln, we have to rely on the training data only. There are well-known methods that we will discuss later that are based on cross-validation (or leave-one-out) (Lunts and Brailovsky (1967); Stone (1974)); and holdout, resubstitution, rotation, smoothing, and bootstrapping (Efron (1979), (1983)), which may be employed to construct an empirical risk from the training sequence, thus obviating the need for a testing sequence. (See Kanal (1974), Cover and Wagner (1975), Toussaint (1974a), Glick (1978), Hand (1986), Jain, Dubes, and Chen (1987), and McLachlan (1992) for surveys, discussion, and empirical comparison.)

Analysis of these methods, in general, is clearly a much harder problem, as gn can depend on Dn in a rather complicated way. If we construct some estimate L̂n from Dn, then it would be desirable to obtain distribution-free bounds on

P{|L̂n − Ln| > ε},

or on

E{|L̂n − Ln|^q}

for some q ≥ 1. Conditional probabilities and expectations given Dn are meaningless, since everything is a function of Dn. Here, however, we have to be much more careful, as we do not want L̂n to be optimistically biased because the same data are used both for training and testing.

Distribution-free bounds for the above quantities would be extremely helpful, as we usually do not know the distribution of (X,Y ). While for some rules such estimates exist—we will exhibit several avenues in Chapters 22, 23, 24, 25, 26, and 31—it is disappointing that a single error estimation method cannot possibly work for all discrimination rules. It is therefore important to point out that we have to consider (gn, L̂n) pairs—for every rule one or more error estimates must be found if possible, and vice versa, for every error estimate, its limitations have to be stated. Secondly, rules for which no good error estimates are known should be avoided. Luckily, most popular rules do not fall into this category. On the other hand, proven distribution-free performance guarantees are rarely available—see Chapters 23 and 24 for examples.

8.4 Selecting Classifiers

Probably the most important application of error estimation is in the selection of a classification function from a class C of functions. If a class C of classifiers is given, then it is tempting to pick the one that minimizes an estimate of the error probability over the class. A good method should pick a classifier with an error probability that is close to the minimal error probability in the class. Here we require much more than distribution-free performance bounds of the error estimator for each of the classifiers in the class. Problem 8.8 demonstrates that it is not sufficient to be able to estimate the error probability of all classifiers in the class. Intuitively, if we can estimate the error probability for the classifiers in C uniformly well, then the classification function that minimizes the estimated error probability is likely to have an error probability that is close to the best in the class. To certify this intuition, consider the following situation: Let C be a class of classifiers, that is, a class of mappings of the form φ : Rd → {0, 1}. Assume that the error count

Ln(φ) = (1/n) ∑_{j=1}^n I_{φ(Xj)≠Yj}

is used to estimate the error probability L(φ) = P{φ(X) ≠ Y } of each classifier φ ∈ C. Denote by φ∗n the classifier that minimizes the estimated error probability over the class:

Ln(φ∗n) ≤ Ln(φ) for all φ ∈ C.

Then for the error probability

L(φ∗n) = P{φ∗n(X) ≠ Y |Dn}

of the selected rule we have:

Lemma 8.2. (Vapnik and Chervonenkis (1974c); see also Devroye (1988b)).

L(φ∗n) − inf_{φ∈C} L(φ) ≤ 2 sup_{φ∈C} |Ln(φ) − L(φ)|,

|Ln(φ∗n) − L(φ∗n)| ≤ sup_{φ∈C} |Ln(φ) − L(φ)|.

Proof.

L(φ∗n) − inf_{φ∈C} L(φ) = L(φ∗n) − Ln(φ∗n) + Ln(φ∗n) − inf_{φ∈C} L(φ)
    ≤ L(φ∗n) − Ln(φ∗n) + sup_{φ∈C} |Ln(φ) − L(φ)|
    ≤ 2 sup_{φ∈C} |Ln(φ) − L(φ)|.

The second inequality is trivially true. □

We see that upper bounds for sup_{φ∈C} |Ln(φ) − L(φ)| provide us with upper bounds for two things simultaneously:

(1) An upper bound for the suboptimality of φ∗n within C, that is, a bound for L(φ∗n) − inf_{φ∈C} L(φ).

(2) An upper bound for the error |Ln(φ∗n) − L(φ∗n)| committed when Ln(φ∗n) is used to estimate the probability of error L(φ∗n) of the selected rule.

In other words, by bounding sup_{φ∈C} |Ln(φ) − L(φ)|, we kill two flies at once. It is particularly useful to know that even though Ln(φ∗n) is usually optimistically biased, it is within given bounds of the unknown probability of error of φ∗n, and that no other test sample is needed to estimate this probability of error. Whenever our bounds indicate that we are close to the optimum in C, we must at the same time have a good estimate of the probability of error, and vice versa.

As a simple, but interesting, application of Lemma 8.2 we consider the case when the class C contains finitely many classifiers.

Theorem 8.3. Assume that the cardinality of C is bounded by N. Then we have, for all ε > 0,

P{ sup_{φ∈C} |Ln(φ) − L(φ)| > ε } ≤ 2Ne^{−2nε²}.

Proof.

P{ sup_{φ∈C} |Ln(φ) − L(φ)| > ε } ≤ ∑_{φ∈C} P{ |Ln(φ) − L(φ)| > ε } ≤ 2Ne^{−2nε²},

where we used Hoeffding's inequality, and the fact that the random variable nLn(φ) is binomially distributed with parameters n and L(φ). □
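To see Lemma 8.2 and Theorem 8.3 at work, the sketch below (our example; NumPy assumed, and the finite class of threshold rules on the line with η(x) = x is chosen purely for illustration) selects the classifier minimizing the error count and checks that its excess error is indeed dominated by twice the uniform deviation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite class C of threshold rules phi_t(x) = I{x > t}; eta(x) = x on [0,1).
thresholds = np.linspace(0.05, 0.95, 19)          # N = 19 classifiers

def true_error(t):
    # L(phi_t) = int_0^t eta + int_t^1 (1 - eta) for eta(x) = x, X ~ U[0,1)
    return t**2 / 2 + (1 - t) - (1 - t**2) / 2

n = 300
x = rng.random(n)
y = (rng.random(n) < x).astype(int)

emp = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])  # L_n(phi)
L = np.array([true_error(t) for t in thresholds])                        # L(phi)

best = thresholds[np.argmin(emp)]                 # phi*_n: error-count minimizer
print("selected threshold :", best)
print("L(phi*_n) - inf L  :", true_error(best) - L.min())
print("2 sup |L_n - L|    :", 2 * np.max(np.abs(emp - L)))
```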

Remark. Distribution-free properties. Theorem 8.3 shows that the problem studied here is purely combinatorial. The actual distribution of the data does not play a role at all in the upper bounds. □

Remark. Without testing data. Very often, a class of rules C of the form φn(x) = φn(x,Dn) is given, and the same data Dn are used to select a rule by minimizing some estimates Ln(φn) of the error probabilities L(φn) = P{φn(X) ≠ Y |Dn}. A similar analysis can be carried out in this case. In particular, if φ∗n denotes the selected rule, then we have, similarly to Lemma 8.2:


Theorem 8.4.

L(φ∗n) − inf_{φn∈C} L(φn) ≤ 2 sup_{φn∈C} |Ln(φn) − L(φn)|,

and

|Ln(φ∗n) − L(φ∗n)| ≤ sup_{φn∈C} |Ln(φn) − L(φn)|.

If C is finite, then again, similarly to Theorem 8.3, we have for example

P{ L(φ∗n) − inf_{φn∈C} L(φn) > ε } ≤ N sup_{φn∈C} P{ |Ln(φn) − L(φn)| > ε/2 }. □

8.5 Estimating the Bayes Error

It is also important to have a good estimate of the optimal error probability L∗. First of all, if L∗ is large, we would know beforehand that any rule is going to perform poorly. Then perhaps the information might be used to return to the feature selection stage. Also, a comparison of estimates of Ln and L∗ gives us an idea how much room is left for improvement. Typically, L∗ is estimated by an estimate of the error probability of some consistent classification rule (see Fukunaga and Kessel (1971), Chen and Fu (1973), Fukunaga and Hummels (1987), and Garnett and Yau (1977)). Clearly, if the estimate L̂n we use is consistent in the sense that L̂n − Ln → 0 with probability one as n → ∞, and the rule is strongly consistent, then

L̂n → L∗

with probability one. In other words, we have a consistent estimate of the Bayes error probability.

There are two problems with this approach. The first problem is that if our purpose is comparing L∗ with Ln, then using the same estimate for both of them does not give any information. The other problem is that even though for many classifiers, L̂n − Ln can be guaranteed to converge to zero rapidly, regardless of what the distribution of (X,Y ) is (see Chapters 23 and 24), in view of the results of Chapter 7, the rate of convergence of L̂n to L∗ using such a method may be arbitrarily slow. Thus, we cannot expect good performance for all distributions from such a method. The question is whether it is possible to come up with a method of estimating L∗ such that the difference L̂n − L∗ converges to zero rapidly for all distributions. Unfortunately, there is no method that guarantees a certain finite sample performance for all distributions. This disappointing fact is reflected in the following negative result:

Theorem 8.5. For every n, for any estimate L̂n of the Bayes error probability L∗, and for every ε > 0, there exists a distribution of (X,Y ) such that

E{|L̂n − L∗|} ≥ 1/4 − ε.

Proof. For a fixed n, we construct a family F of distributions, and show that for at least one member of the family, E{|L̂n − L∗|} ≥ 1/4 − ε. The family contains 2^m + 1 distributions, where m is a large integer specified later. In all cases, X1, . . . , Xn are drawn independently by a uniform distribution from the set {1, . . . ,m}. Let B0, B1, B2, . . . , Bn be i.i.d. Bernoulli random variables, independent of the Xi's, with P{Bi = 0} = P{Bi = 1} = 1/2. For the first member of the family F, let Yi = Bi for i = 1, . . . , n. Thus, for this distribution, L∗ = 1/2. The Bayes error for the other 2^m members of the family is zero. These distributions are determined by m binary parameters a1, a2, . . . , am ∈ {0, 1} as follows:

η(i) = P{Y = 1|X = i} = ai.

In other words, Yi = a_{Xi} for every i = 1, . . . , n. Clearly, L∗ = 0 for these distributions. Note also that all distributions with X distributed uniformly on {1, . . . ,m} and L∗ = 0 are members of the family. Just as in the proofs of Theorems 7.1 and 7.2, we randomize over the family F of distributions. However, the way of randomization is different here. The trick is to use B0, B1, B2, . . . , Bn in randomly picking a distribution. (Recall that these random variables are just the labels Y1, . . . , Yn in the training sequence for the first distribution in the family.) We choose a distribution randomly, as follows: If B0 = 0, then we choose the first member of F (the one with L∗ = 1/2). If B0 = 1, then the labels of the training sequence are given by

Yi = Bi if Xi ≠ X1, Xi ≠ X2, . . . , Xi ≠ X_{i−1}, and Yi = Bj if j < i is the smallest index such that Xi = Xj.

Note that in case of B0 = 1, for any fixed realization b1, . . . , bn ∈ {0, 1} of B1, . . . , Bn, the Bayes risk is zero. Therefore, the distribution is in the family F.

Now, let A be the event that all the Xi's are different. Observe that under A, L̂n is a function of X1, . . . , Xn, B1, . . . , Bn only, but not of B0. Therefore,

sup_F E{|L̂n − L∗|}
    ≥ E{|L̂n − L∗|}   (with B0, B1, . . . , Bn random)
    ≥ E{ I_A |L̂n − L∗| }
    = E{ I_A ( I_{B0=0} |L̂n − 1/2| + I_{B0=1} |L̂n − 0| ) }
    = E{ I_A (1/2) ( |L̂n − 1/2| + |L̂n − 0| ) }
    ≥ E{ I_A (1/4) } = (1/4) P{A}.

Now, if we pick m large enough, P{A} can be as close to 1 as desired. Hence,

sup E{|L̂n − L∗|} ≥ 1/4,

where the supremum is taken over all distributions of (X,Y ). □

Problems and Exercises

Problem 8.1. Let B be a binomial random variable with parameters n and p. Show that

P{B > ε} ≤ e^{ε − np − ε log(ε/np)}   (ε > np)

and

P{B < ε} ≤ e^{ε − np − ε log(ε/np)}   (ε < np).

(Chernoff (1952).) Hint: Proceed by Chernoff's bounding method.

Problem 8.2. Prove the inequalities of Bennett and Bernstein given in Theorem 8.2. To help you, we will guide you through different stages:

(1) Show that for any s > 0, and any random variable X with EX = 0, EX² = σ², X ≤ c,

E{e^{sX}} ≤ e^{f(σ²/c²)},

where

f(u) = log( (1/(1 + u)) e^{−csu} + (u/(1 + u)) e^{cs} ).

(2) Show that f′′(u) ≤ 0 for u ≥ 0.

(3) By Chernoff's bounding method, show that

P{ (1/n) ∑_{i=1}^n Xi ≥ ε } ≤ e^{−snε + ∑_{i=1}^n f(Var{Xi}/c²)}.

(4) Show that f(u) ≤ f(0) + uf′(0) = (e^{cs} − 1 − cs)u.

(5) Using the bound of (4), find the optimal value of s and derive Bennett's inequality.

Problem 8.3. Use Bernstein's inequality to show that if B is a binomial (n, p) random variable, then for ε > 0,

P{B/n − p > εp} ≤ e^{−3npε²/8}

and

P{B/n − p < −εp} ≤ e^{−3npε²/8}.


Problem 8.4. Let X1, . . . , Xn be independent binary-valued random variables with P{Xi = 1} = 1 − P{Xi = 0} = pi. Set p = (1/n) ∑_{i=1}^n pi and Sn = ∑_{i=1}^n Xi. Prove that

P{Sn − np ≥ nε} ≤ e^{−npε²/3}   and   P{Sn − np ≤ −nε} ≤ e^{−npε²/2}

(Angluin and Valiant (1979), see also Hagerup and Rüb (1990)). Compare the results with Bernstein's inequality for this case. Hint: Put s = log(1 + ε) and s = −log(1 − ε) in the Chernoff bounding argument. Prove and exploit the elementary inequalities

−ε²/2 ≤ ε − (1 + ε) log(1 + ε) ≤ −ε²/3, ε ∈ [0, 1],

and

−ε²/2 ≥ ε − (1 + ε) log(1 + ε), ε ∈ (−1, 0].

Problem 8.5. Let B be a Binomial (n, p) random variable. Show that for p ≤ a < 1,

P{B > an} ≤ ( (p/a)^a ((1 − p)/(1 − a))^{1−a} )^n ≤ ( (p/a)^a e^{a−p} )^n.

Show that for 0 < a < p the same upper bounds hold for P{B ≤ an} (Karp (1988), see also Hagerup and Rüb (1990)). Hint: Use Chernoff's method with parameter s and set s = log(a(1 − p)/(p(1 − a))).

Problem 8.6. Let B be a Binomial (n, p) random variable. Show that if p ≥ 1/2,

P{B − np ≥ nε} < e^{−nε²/(2p(1−p))},

and if p ≤ 1/2,

P{B − np ≤ −nε} < e^{−nε²/(2p(1−p))}

(Okamoto (1958)). Hint: Use Chernoff's method, and the inequality

x log(x/p) + (1 − x) log((1 − x)/(1 − p)) ≥ (x − p)²/(2p(1 − p))

for 1/2 ≤ p ≤ x ≤ 1, and for 0 ≤ x ≤ p ≤ 1/2.

Problem 8.7. Let B be a Binomial (n, p) random variable. Prove that

P{√B − √(np) ≥ ε√n} < e^{−2nε²},

and

P{√B − √(np) ≤ −ε√n} < e^{−nε²}

(Okamoto (1958)). Hint: Use Chernoff's method, and the inequalities

x log(x/p) + (1 − x) log((1 − x)/(1 − p)) ≥ 2(√x − √p)², x ∈ [p, 1],

and

x log(x/p) + (1 − x) log((1 − x)/(1 − p)) ≥ (√x − √p)², x ∈ [0, p].


Problem 8.8. Give a class C of decision functions of the form φ : Rd → {0, 1} (i.e., the training data do not play any role in the decision) such that for every ε > 0

sup_{φ∈C} P{ |Ln(φ) − L(φ)| > ε } ≤ 2e^{−2nε²}

for every distribution, where Ln(φ) is the error-counting estimate of the error probability L(φ) = P{φ(X) ≠ Y } of decision φ, and at the same time, if Fn is the class of mappings φ∗n minimizing the error count Ln(φ) over the class C, then there exists one distribution such that

P{ sup_{φ∈Fn} L(φ) − inf_{φ∈C} L(φ) = 1 } = 1

for all n.

Problem 8.9. Let C be a class of classifiers, that is, a class of mappings of the form φn(x,Dn) = φn(x). Assume that an independent testing sequence Tm is given, and that the error count

Ln,m(φn) = (1/m) ∑_{j=1}^m I_{φn(Xn+j)≠Yn+j}

is used to estimate the error probability L(φn) = P{φn(X) ≠ Y |Dn} of each classifier φn ∈ C. Denote by φ∗n,m the classifier that minimizes the estimated error probability over the class. Prove that for the error probability

L(φ∗n,m) = P{φ∗n,m(X) ≠ Y |Dn}

of the selected rule we have

L(φ∗n,m) − inf_{φn∈C} L(φn) ≤ 2 sup_{φn∈C} |Ln,m(φn) − L(φn)|,

|Ln,m(φ∗n,m) − L(φ∗n,m)| ≤ sup_{φn∈C} |Ln,m(φn) − L(φn)|.

Also, if C is of finite cardinality with |C| = N, then

P{ sup_{φn∈C} |Ln,m(φn) − L(φn)| > ε | Dn } ≤ 2Ne^{−2mε²}.

Problem 8.10. Show that if a rule gn is consistent, then we can always find an estimate L̂n of the error such that E{|L̂n − Ln|^q} → 0 for all q > 0. Hint: Split the data sequence Dn and use the second half to estimate the error probability of g_{⌈n/2⌉}.

Problem 8.11. Open-ended problem. Is there a rule for which no error estimate works for all distributions? More specifically, is there a sequence of classification rules {gn} such that for all n large enough,

inf_{L̂n} sup_{(X,Y)} E{(L̂n − Ln)²} ≥ c

for some constant c > 0, where the infimum is taken over all possible error estimates? Are such rules necessarily inconsistent?


Problem 8.12. Consider the problem of estimating the asymptotic probability of error of the nearest neighbor rule L_{NN} = 2E{η(X)(1 − η(X))}. Show that for every n, for any estimate L̂n of L_{NN}, and for every ε > 0, there exists a distribution of (X,Y ) such that

E{|L̂n − L_{NN}|} ≥ 1/4 − ε.


9 The Regular Histogram Rule

In this chapter we study the cubic histogram rule. Recall that this rule partitions Rd into cubes of the same size, and gives the decision according to the number of zeros and ones among the Yi's such that the corresponding Xi falls in the same cube as X. Pn = {An1, An2, . . .} denotes a partition of Rd into cubes of size hn > 0, that is, into sets of the type ∏_{i=1}^d [ki hn, (ki + 1)hn), where the ki's are integers, and the histogram rule is defined by

gn(x) = 0 if ∑_{i=1}^n I_{Yi=1} I_{Xi∈An(x)} ≤ ∑_{i=1}^n I_{Yi=0} I_{Xi∈An(x)}, and 1 otherwise,

where for every x ∈ Rd, An(x) = Ani if x ∈ Ani. That is, the decision is zero if the number of ones does not exceed the number of zeros in the cell in which x falls. Weak universal consistency of this rule was shown in Chapter 6 under the conditions hn → 0 and nh_n^d → ∞ as n → ∞. The purpose of this chapter is to introduce some techniques by proving strong universal consistency of this rule. These techniques will prove very useful in handling other problems as well. First we introduce the method of bounded differences.
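A direct implementation of the rule just described might look as follows (a minimal sketch of ours, not from the text; NumPy is assumed, and cells are indexed by the integer vectors (k1, . . . , kd)):

```python
import numpy as np

def histogram_rule(train_x, train_y, query_x, h):
    """Cubic histogram rule: majority vote of the Y_i whose X_i share the
    query's cell of side h; empty cells (and ties) are decided as 0."""
    key = lambda pts: [tuple(k) for k in np.floor(pts / h).astype(int)]
    ones, totals = {}, {}
    for cell, y in zip(key(train_x), train_y):
        ones[cell] = ones.get(cell, 0) + y
        totals[cell] = totals.get(cell, 0) + 1
    out = []
    for cell in key(query_x):
        n1, n = ones.get(cell, 0), totals.get(cell, 0)
        out.append(1 if 2 * n1 > n else 0)
    return np.array(out)

# tiny usage example: eta(x) = I{x_1 + x_2 > 1} on the unit square
rng = np.random.default_rng(4)
X = rng.random((2000, 2))
Y = (X.sum(axis=1) > 1).astype(int)
Xt = rng.random((1000, 2))
Yt = (Xt.sum(axis=1) > 1).astype(int)
pred = histogram_rule(X, Y, Xt, h=0.1)
print("test error:", np.mean(pred != Yt))
```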

9.1 The Method of Bounded Differences

In this section we present a generalization of Hoeffding's inequality, due to McDiarmid (1989). The result will equip us with a powerful tool to handle complicated functions of independent random variables. This inequality follows by results of Hoeffding (1963) and Azuma (1967), who observed that Theorem 8.1 can be generalized to bounded martingale difference sequences. The inequality has found many applications in combinatorics, as well as in nonparametric statistics (see McDiarmid (1989) and Devroye (1991a) for surveys).

Let us first recall the notion of martingales. Consider a probability space (Ω, F, P).

Definition 9.1. A sequence of random variables Z1, Z2, . . . is called a martingale if

E{Zi+1|Z1, . . . , Zi} = Zi with probability one

for each i > 0.

Let X1, X2, . . . be an arbitrary sequence of random variables. Z1, Z2, . . . is called a martingale with respect to the sequence X1, X2, . . . if for every i > 0, Zi is a function of X1, . . . , Xi, and

E{Zi+1|X1, . . . , Xi} = Zi with probability one.

Obviously, if Z1, Z2, . . . is a martingale with respect to X1, X2, . . ., then Z1, Z2, . . . is a martingale, since

E{Zi+1|Z1, . . . , Zi} = E{ E{Zi+1|X1, . . . , Xi} | Z1, . . . , Zi } = E{Zi|Z1, . . . , Zi} = Zi.

The most important examples of martingales are sums of independent zero-mean random variables. Let U1, U2, . . . be independent random variables with zero mean. Then the random variables

Si = ∑_{j=1}^i Uj, i > 0,

form a martingale (see Problem 9.1). Martingales share many properties of sums of independent variables. Our purpose here is to extend Hoeffding's inequality to martingales. The role of the independent random variables is played here by a so-called martingale difference sequence.

Definition 9.2. A sequence of random variables V1, V2, . . . is a martingale difference sequence if

E{Vi+1|V1, . . . , Vi} = 0 with probability one

for every i > 0.


A sequence of random variables V1, V2, . . . is called a martingale difference sequence with respect to a sequence of random variables X1, X2, . . . if for every i > 0, Vi is a function of X1, . . . , Xi, and

E{Vi+1|X1, . . . , Xi} = 0 with probability one.

Again, it is easily seen that if V1, V2, . . . is a martingale difference sequence with respect to a sequence X1, X2, . . . of random variables, then it is a martingale difference sequence. Also, any martingale Z1, Z2, . . . leads naturally to a martingale difference sequence by defining

Vi = Zi − Zi−1

for i > 0.

The key result in the method of bounded differences is the following inequality that relaxes the independence assumption in Theorem 8.1, allowing martingale difference sequences:

Theorem 9.1. (Hoeffding (1963), Azuma (1967)). Let X1, X2, . . . be a sequence of random variables, and assume that V1, V2, . . . is a martingale difference sequence with respect to X1, X2, . . .. Assume furthermore that there exist random variables Z1, Z2, . . . and nonnegative constants c1, c2, . . . such that for every i > 0, Zi is a function of X1, . . . , Xi−1, and

Zi ≤ Vi ≤ Zi + ci with probability one.

Then for any ε > 0 and n

P{ ∑_{i=1}^n Vi ≥ ε } ≤ e^{−2ε² / ∑_{i=1}^n ci²}

and

P{ ∑_{i=1}^n Vi ≤ −ε } ≤ e^{−2ε² / ∑_{i=1}^n ci²}.

The proof is a rather straightforward extension of that of Hoeffding's inequality. First we need an analog of Lemma 8.1:

Lemma 9.1. Assume that the random variables V and Z satisfy with probability one that E{V |Z} = 0, and for some function f and constant c ≥ 0

f(Z) ≤ V ≤ f(Z) + c.

Then for every s > 0,

E{e^{sV}|Z} ≤ e^{s²c²/8}.


The proof of the lemma is left as an exercise (Problem 9.2).

Proof of Theorem 9.1. As in the proof of Hoeffding's inequality, we proceed by Chernoff's bounding method. Set Sk = ∑_{i=1}^k Vi. Then for any s > 0,

P{Sn ≥ ε} ≤ e^{−sε} E{e^{sSn}}
    = e^{−sε} E{ e^{sS_{n−1}} E{e^{sVn}|X1, . . . , Xn−1} }
    ≤ e^{−sε} E{e^{sS_{n−1}}} e^{s²cn²/8}   (by Lemma 9.1)
    ≤ e^{−sε} e^{s² ∑_{i=1}^n ci²/8}   (iterate the previous argument)
    = e^{−2ε² / ∑_{i=1}^n ci²}   (choose s = 4ε / ∑_{i=1}^n ci²).

The second inequality is proved analogously. □

Now, we are ready to state the main inequality of this section. It is a large-deviation-type inequality for functions of independent random variables such that the function is relatively robust to individual changes in the values of the random variables. The condition on the function requires that by changing the value of its i-th variable, the value of the function cannot change by more than a constant ci.

Theorem 9.2. (McDiarmid (1989)). Let X1, . . . , Xn be independent random variables taking values in a set A, and assume that f : A^n → R satisfies

sup_{x1,...,xn, x′i ∈ A} |f(x1, . . . , xn) − f(x1, . . . , xi−1, x′i, xi+1, . . . , xn)| ≤ ci, 1 ≤ i ≤ n.

Then for all ε > 0,

P{ f(X1, . . . , Xn) − Ef(X1, . . . , Xn) ≥ ε } ≤ e^{−2ε² / ∑_{i=1}^n ci²},

and

P{ Ef(X1, . . . , Xn) − f(X1, . . . , Xn) ≥ ε } ≤ e^{−2ε² / ∑_{i=1}^n ci²}.

Proof. Define V = f(X1, . . . , Xn) − Ef(X1, . . . , Xn). Introduce V1 = E{V |X1} − EV , and for k > 1,

Vk = E{V |X1, . . . , Xk} − E{V |X1, . . . , Xk−1},

so that V = ∑_{k=1}^n Vk. Clearly, V1, . . . , Vn form a martingale difference sequence with respect to X1, . . . , Xn. Define the random variables

Hk(X1, . . . , Xk) = E{f(X1, . . . , Xn)|X1, . . . , Xk}

and

Vk = Hk(X1, . . . , Xk) − ∫ Hk(X1, . . . , Xk−1, x) Fk(dx),

where the integration is with respect to Fk, the probability measure of Xk. Introduce the random variables

Wk = sup_u ( Hk(X1, . . . , Xk−1, u) − ∫ Hk(X1, . . . , Xk−1, x) Fk(dx) ),

and

Zk = inf_v ( Hk(X1, . . . , Xk−1, v) − ∫ Hk(X1, . . . , Xk−1, x) Fk(dx) ).

Clearly, Zk ≤ Vk ≤ Wk with probability one. Since for every k, Zk is a function of X1, . . . , Xk−1, we can apply Theorem 9.1 directly to V = ∑_{k=1}^n Vk if we can show that Wk − Zk ≤ ck. But this follows from

Wk − Zk = sup_u sup_v ( Hk(X1, . . . , Xk−1, u) − Hk(X1, . . . , Xk−1, v) ) ≤ ck,

by the condition of the theorem. □

Clearly, if the Xi’s are bounded, then the choice f(x1, . . . , xn) =∑ni=1 xi

yields Hoeffding’s inequality. Many times the inequality can be used tohandle very complicated functions of independent random variables withgreat elegance. For examples in nonparametric statistics, see Problems 9.3,9.6, 10.3.

Similar methods to those used in the proof of Theorem 9.2 may be used to bound the variance Var{f(X1, . . . , Xn)}. Other inequalities for the variance of general functions of independent random variables were derived by Efron and Stein (1981) and Steele (1986).

Theorem 9.3. (Devroye (1991a)). Assume that the conditions of Theorem 9.2 hold. Then

Var{f(X1, . . . , Xn)} ≤ (1/4) ∑_{i=1}^n ci².

Proof. Using the notation of the proof of Theorem 9.2, we have to show that

Var{V} ≤ (1/4) ∑_{i=1}^n ci².


Observe that

Var{V} = E{V²} = E{ ( ∑_{i=1}^n Vi )² } = E{ ∑_{i=1}^n Vi² } + 2 ∑_{1≤i<j≤n} E{ViVj} = E{ ∑_{i=1}^n Vi² },

where in the last step we used the martingale property in the following way: for i < j we have

E{ViVj|X1, . . . , Xj−1} = Vi E{Vj|X1, . . . , Xj−1} = 0 with probability one.

Thus, the theorem follows if we can show that

E{Vi²|X1, . . . , Xi−1} ≤ ci²/4.

Introducing Wi and Zi as in the proof of Theorem 9.2, we see that with probability one Zi ≤ Vi ≤ Zi + ci. Since Zi is a function of X1, . . . , Xi−1, conditioned on X1, . . . , Xi−1, Vi is a zero-mean random variable taking values in the interval [Zi, Zi + ci]. But an arbitrary random variable U taking values in an interval [a, b] has variance not exceeding

E{(U − (a + b)/2)²} ≤ (b − a)²/4,

so that

E{Vi²|X1, . . . , Xi−1} ≤ ci²/4,

which concludes the proof. □

9.2 Strong Universal Consistency

The purpose of this section is to prove strong universal consistency of the histogram rule. This is the first such result that we mention. Later we will prove the same property for other rules, too. The theorem, stated here for cubic partitions, is essentially due to Devroye and Györfi (1983). For more general sequences of partitions, see Problem 9.7. An alternative proof of the theorem based on the Vapnik-Chervonenkis inequality will be given later—see the remark following Theorem 17.2.

Theorem 9.4. Assume that the sequence of partitions {Pn} satisfies the following two conditions as n → ∞:

hn → 0

and

nh_n^d → ∞.

For any distribution of (X,Y ) and for every ε > 0 there is an integer n0 such that for n > n0, for the error probability Ln of the histogram rule,

P{Ln − L∗ > ε} ≤ 2e^{−nε²/32}.

Thus, the cubic histogram rule is strongly universally consistent.

Proof. Define

η∗n(x) = ∑_{i=1}^n Yi I_{Xi∈An(x)} / (nµ(An(x))).

Clearly, the decision based on η∗n,

gn(x) = 0 if ∑_{i=1}^n Yi I_{Xi∈An(x)} / (nµ(An(x))) ≤ ∑_{i=1}^n (1 − Yi) I_{Xi∈An(x)} / (nµ(An(x))), and 1 otherwise,

is just the histogram rule. Therefore, by Theorem 2.3, it suffices to prove that for n large enough,

P{ ∫ |η(x) − η∗n(x)| µ(dx) > ε/2 } ≤ e^{−nε²/32}.

Decompose the difference as

|η(x) − η∗n(x)| = E|η(x) − η∗n(x)| + ( |η(x) − η∗n(x)| − E|η(x) − η∗n(x)| ).   (9.1)

The convergence of the first term on the right-hand side implies weak consistency of the histogram rule. The technique we use to bound this term is similar to that which we already saw in the proof of Theorem 6.3. For completeness, we give the details here. However, new ideas have to appear in our handling of the second term.


We begin with the first term. Since the set of continuous functions with bounded support is dense in L1(µ), it is possible to find a continuous function of bounded support r(x) such that

∫ |η(x) − r(x)| µ(dx) < ε/16.

Note that r(x) is uniformly continuous. Introduce the function

r∗n(x) = E{r(X) I_{X∈An(x)}} / µ(An(x)).

Then we can further decompose the first term on the right-hand side of (9.1) as

E|η(x) − η∗n(x)| ≤ |η(x) − r(x)| + |r(x) − r∗n(x)| + |r∗n(x) − Eη∗n(x)| + E|Eη∗n(x) − η∗n(x)|.   (9.2)

We proceed term by term:

First term: The integral of |η(x) − r(x)| (with respect to µ) is smaller than ε/16 by the definition of r(x).

Second term: Using Fubini's theorem we have

∫ |r(x) − r∗n(x)| µ(dx)
    = ∑_{Aj: µ(Aj)≠0} ∫_{Aj} | r(x) − E{r(X) I_{X∈Aj}} / µ(Aj) | µ(dx)
    = ∑_{Aj: µ(Aj)≠0} (1/µ(Aj)) ∫_{Aj} | r(x)µ(Aj) − E{r(X) I_{X∈Aj}} | µ(dx)
    = ∑_{Aj: µ(Aj)≠0} (1/µ(Aj)) ∫_{Aj} | r(x) ∫_{Aj} µ(dy) − ∫_{Aj} r(y) µ(dy) | µ(dx)
    ≤ ∑_{Aj: µ(Aj)≠0} (1/µ(Aj)) ∫_{Aj} ∫_{Aj} |r(x) − r(y)| µ(dx) µ(dy).

As r(x) is uniformly continuous, if hn is small enough, then |r(x) − r(y)| < ε/16 for every x, y ∈ A, for any cell A ∈ Pn. Then the double integral in the above expression can be bounded from above as follows:

∫_{Aj} ∫_{Aj} |r(x) − r(y)| µ(dx) µ(dy) ≤ ε µ²(Aj)/16.

Note that we used the condition hn → 0 here. Summing over the cells we get

∫ |r(x) − r∗n(x)| µ(dx) ≤ ε/16.

Third term: We have

∫ |r∗n(x) − Eη∗n(x)| µ(dx) = ∑_{Aj∈Pn} | E{r(X) I_{X∈Aj} − Y I_{X∈Aj}} |
    = ∑_{Aj∈Pn} | ∫_{Aj} r(x) µ(dx) − ∫_{Aj} η(x) µ(dx) |
    ≤ ∫ |r(x) − η(x)| µ(dx) < ε/16.

Fourth term: Our aim is to show that for n large enough,

E ∫ |Eη∗n(x) − η∗n(x)| µ(dx) < ε/16.

To this end, let S be an arbitrary large ball centered at the origin. Denote by mn the number of cells of the partition Pn that intersect S. Clearly, mn is proportional to 1/h_n^d as hn → 0. Using the notation νn(A) = (1/n) ∑_{i=1}^n I_{Yi=1, Xi∈A}, it is clear that νn(A) = ∫_A η∗n(x) µ(dx). Now, we can write

E ∫ |Eη∗n(x) − η∗n(x)| µ(dx)
    = E{ ∑_j ∫_{An,j} |Eη∗n(x) − η∗n(x)| µ(dx) }
    = E{ ∑_j |Eνn(An,j) − νn(An,j)| }
    ≤ E{ ∑_{j: An,j∩S≠∅} |Eνn(An,j) − νn(An,j)| } + 2µ(S^c)
      (where S^c denotes the complement of S)
    ≤ ∑_{j: An,j∩S≠∅} √( E{ |Eνn(An,j) − νn(An,j)|² } ) + 2µ(S^c)
      (by the Cauchy-Schwarz inequality)
    ≤ mn · (1/mn) ∑_{j: An,j∩S≠∅} √( µ(An,j)/n ) + 2µ(S^c)
    ≤ mn √( (1/mn) ∑_{j: An,j∩S≠∅} µ(An,j)/n ) + 2µ(S^c)
      (by Jensen's inequality)
    ≤ √(mn/n) + 2µ(S^c)
    ≤ ε/16

if n and the radius of S are large enough, since mn/n converges to zero by the condition nh_n^d → ∞, and µ(S^c) can be made arbitrarily small by the choice of S.

We have proved for the first term on the right-hand side of (9.1) that for n large enough,

E ∫ |η(x) − η∗n(x)| µ(dx) < ε/4.

Finally, we handle the second term on the right-hand side of (9.1) by obtaining an exponential bound for

∫ |η(x) − η∗n(x)| µ(dx) − E ∫ |η(x) − η∗n(x)| µ(dx)

using Theorem 9.2. Fix the training data (x1, y1), . . . , (xn, yn) ∈ Rd × {0, 1}, and replace (xi, yi) by (x̂i, ŷi), changing the value of η∗n(x) to η∗ni(x). Then η∗n(x) − η∗ni(x) differs from zero only on An(xi) and An(x̂i), and thus

∫ |η(x) − η∗n(x)| µ(dx) − ∫ |η(x) − η∗ni(x)| µ(dx)
    ≤ ∫ |η∗n(x) − η∗ni(x)| µ(dx)
    ≤ (1/(nµ(An(xi)))) µ(An(xi)) + (1/(nµ(An(x̂i)))) µ(An(x̂i))
    ≤ 2/n.

So by Theorem 9.2, for sufficiently large n,

P{ ∫ |η(x) − η∗n(x)| µ(dx) > ε/2 }
    ≤ P{ ∫ |η(x) − η∗n(x)| µ(dx) − E ∫ |η(x) − η∗n(x)| µ(dx) > ε/4 }
    ≤ e^{−nε²/32}. □

Remark. Strong universal consistency follows from the exponential bound on the probability P{Ln − L∗ > ε} via the Borel-Cantelli lemma. The inequality in Theorem 9.4 may seem universal in nature. However, it is distribution-dependent in a surreptitious way because its range of validity, n ≥ n0, depends heavily on ε, hn, and the distribution. We know that distribution-free upper bounds could not exist anyway, in view of Theorem 7.2. □

Problems and Exercises

Problem 9.1. Let U1, U2, . . . be independent random variables with zero mean. Show that the random variables Si = ∑_{j=1}^i Uj, i > 0, form a martingale.

Problem 9.2. Prove Lemma 9.1.

Problem 9.3. Let X1, . . . , Xn be real-valued i.i.d. random variables with distribution function F (x), and corresponding empirical distribution function Fn(x) = (1/n) ∑_{i=1}^n I_{Xi≤x}. Denote the Kolmogorov-Smirnov statistic by

Vn = sup_{x∈R} |Fn(x) − F (x)|.

Use Theorem 9.2 to show that

P{|Vn − EVn| ≥ ε} ≤ 2e^{−2nε²}.

Compare this result with Theorem 12.9. (Neither of them implies the other.)

Also, consider a class A of subsets of Rd. Let Z1, . . . , Zn be i.i.d. random variables in Rd with common distribution P{Z1 ∈ A} = ν(A), and consider the random variable

Wn = sup_{A∈A} |νn(A) − ν(A)|,

where νn(A) = (1/n) ∑_{i=1}^n I_{Zi∈A} denotes the standard empirical measure of A. Prove that

P{|Wn − EWn| ≥ ε} ≤ 2e^{−2nε²}.

Compare this result with Theorem 12.5, and note that this result is true even if s(A, n) = 2^n for all n.

Problem 9.4. The lazy histogram rule. Let Pn = {An1, An2, . . .} be a sequence of partitions satisfying the conditions of Theorem 9.4. Define the lazy histogram rule as follows:

gn(x) = Yj, x ∈ Ani,

where Xj is the minimum-index point among X1, . . . , Xn for which Xj ∈ Ani. In other words, we ignore all but one point in each set of the partition. If Ln is the conditional probability of error for the lazy histogram rule, then show that for any distribution of (X,Y ),

lim sup_{n→∞} ELn ≤ 2L∗.


Problem 9.5. Assume that P_n = P = {A_1, ..., A_k} is a fixed partition into k sets. Consider the lazy histogram rule defined in Problem 9.4 based on P. Show that for all distributions of (X, Y), lim_{n→∞} EL_n exists and satisfies

  lim_{n→∞} EL_n = ∑_{i=1}^k 2p_i(1 − p_i)µ(A_i),

where µ is the probability measure for X, and p_i = ∫_{A_i} η(x)µ(dx)/µ(A_i). Show that the limit of the probability of error EL'_n for the ordinary histogram rule is

  lim_{n→∞} EL'_n = ∑_{i=1}^k min(p_i, 1 − p_i)µ(A_i),

and show that

  lim_{n→∞} EL_n ≤ 2 lim_{n→∞} EL'_n.

Problem 9.6. Histogram density estimation. Let X_1, ..., X_n be i.i.d. random variables in R^d with density f. Let P be a partition of R^d, and define the histogram density estimate by

  f_n(x) = (1/(nλ(A(x)))) ∑_{i=1}^n I_{X_i∈A(x)},

where A(x) is the set in P that contains x and λ is the Lebesgue measure. Prove for the L_1-error of the estimate that

  P{ |∫|f_n(x) − f(x)|dx − E∫|f_n(x) − f(x)|dx| > ε } ≤ 2e^{−nε²/2}

(Devroye (1991a)). Conclude that weak L_1-consistency of the estimate implies strong consistency (Abou-Jaoude (1976a; 1976c), see also Problem 6.2).
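A minimal sketch of the histogram density estimate of Problem 9.6, assuming a cubic partition of cell width h (the cubic choice is ours for illustration; the problem allows any partition):

import numpy as np

def histogram_density(x, X, h):
    """f_n(x) = #{i : X_i in A(x)} / (n * lambda(A(x))) for cubic cells of side h."""
    X = np.atleast_2d(X)
    n, d = X.shape
    cell = np.floor(np.asarray(x, dtype=float) / h)   # the cell A(x) containing x
    in_cell = np.all(np.floor(X / h) == cell, axis=1)
    return in_cell.sum() / (n * h**d)                 # lambda(A(x)) = h^d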

Problem 9.7. General partitions. Extend the consistency result of Theorem 9.4 to sequences of general, not necessarily cubic partitions. Actually, cells of the partitions need not even be hyperrectangles. Assume that the sequence of partitions P_n satisfies the following two conditions. For every ball S centered at the origin,

  lim_{n→∞} max_{i: A_{ni}∩S≠∅} ( sup_{x,y∈A_{ni}} ‖x − y‖ ) = 0

and

  lim_{n→∞} (1/n) |{i : A_{ni} ∩ S ≠ ∅}| = 0.

Prove that the corresponding histogram classification rule is strongly universally consistent.

Problem 9.8. Show that for cubic histograms the conditions of Problem 9.7 on the partition are equivalent to the conditions h_n → 0 and nh_n^d → ∞, respectively.


Problem 9.9. Linear scaling. Partition R^d into congruent rectangles of the form

  [k_1h_1, (k_1 + 1)h_1) × ··· × [k_dh_d, (k_d + 1)h_d),

where k_1, ..., k_d are integers, and h_1, ..., h_d > 0 denote the sizes of the edges of the rectangles. Prove that the corresponding histogram rule is strongly universally consistent if h_i → 0 for every i = 1, ..., d, and nh_1h_2···h_d → ∞ as n → ∞. Hint: This is a corollary of Problem 9.7.

Problem 9.10. Nonlinear scaling. Let F_1, ..., F_d : R → R be invertible, strictly monotone increasing functions. Consider the partition of R^d whose cells are rectangles of the form

  [F_1^{-1}(k_1h_1), F_1^{-1}((k_1 + 1)h_1)) × ··· × [F_d^{-1}(k_dh_d), F_d^{-1}((k_d + 1)h_d)).

(See Problem 9.9.) Prove that the histogram rule corresponding to this partition is strongly universally consistent under the conditions of Problem 9.9. Hint: Use Problem 9.7.

Problem 9.11. Necessary and sufficient conditions for the bias. A sequence of partitions P_n is called µ-approximating if for every measurable set A, for every ε > 0, and for all sufficiently large n there is a set A_n ∈ σ(P_n) (σ(P) denotes the σ-algebra generated by the cells of the partition P) such that µ(A_n △ A) < ε. Prove that the bias term ∫|η(x) − Eη*_n(x)|µ(dx) converges to zero for all distributions of (X, Y) if and only if the sequence of partitions P_n is µ-approximating for every probability measure µ on R^d (Abou-Jaoude (1976a)). Conclude that the first condition of Problem 9.7 implies that P_n is µ-approximating for every probability measure µ (Csiszar (1973)).

Problem 9.12. Necessary and sufficient conditions for the variation. Assume that for every probability measure µ on R^d, every measurable set A, every c > 0, and every ε > 0, there is an N(ε, c, A, µ) such that for all n > N(ε, c, A, µ),

  ∑_{j: µ(A_{n,j}∩A)≤c/n} µ(A_{n,j} ∩ A) < ε.

Prove that the variation term ∫|Eη*_n(x) − η*_n(x)|µ(dx) converges to zero for all distributions of (X, Y) if and only if the sequence of partitions P_n satisfies the condition above (Abou-Jaoude (1976a)).

Problem 9.13. The ε-effective cardinality m(P, µ, A, ε) of a partition P with respect to the probability measure µ, restricted to a set A, is the minimum number of sets in P such that the union of the remaining sets, intersected with A, has µ-measure less than ε. Prove that the sequence of partitions P_n satisfies the condition of Problem 9.12 if and only if for every ε > 0,

  lim_{n→∞} m(P_n, µ, A, ε)/n = 0

(Barron, Gyorfi, and van der Meulen (1992)).


Problem 9.14. In R², partition the plane by taking three fixed points not on a line, x, y, and z. At each of these points, partition R² by considering k equal sectors of angle 2π/k each. Sets in the histogram partition are obtained as intersections of cones. Is the induced histogram rule strongly universally consistent? If yes, state the conditions on k, and if no, provide a counterexample.

Problem 9.15. Partition R² into shells of width h each: the i-th shell contains all points whose distance from the origin lies in [(i − 1)h, ih). Let h → 0 and nh → ∞ as n → ∞. Consider the histogram rule. As n → ∞, to what does EL_n converge?


10
Kernel Rules

Histogram rules have the somewhat undesirable property that the rule is less accurate at borders of cells of the partition than in the middle of cells. Looked at intuitively, this is because points near the border of a cell should have less weight in a decision regarding the cell's center. To remedy this problem, one might introduce the moving window rule, which is smoother than the histogram rule. This classifier simply takes the data points within a certain distance of the point to be classified, and decides according to majority vote. Working formally, let h be a positive number. Then the moving window rule is defined as

  g_n(x) = 0 if ∑_{i=1}^n I_{Y_i=0, X_i∈S_{x,h}} ≥ ∑_{i=1}^n I_{Y_i=1, X_i∈S_{x,h}},
           1 otherwise,

where S_{x,h} denotes the closed ball of radius h centered at x.

It is possible to make the decision even smoother by giving more weight to closer points than to more distant ones. Let K : R^d → R be a kernel function, which is usually nonnegative and monotone decreasing along rays starting from the origin. The kernel classification rule is given by

  g_n(x) = 0 if ∑_{i=1}^n I_{Y_i=0} K((x − X_i)/h) ≥ ∑_{i=1}^n I_{Y_i=1} K((x − X_i)/h),
           1 otherwise.


Figure 10.1. The moving window rule in R². The decision is 1 in the shaded area.

The number h is called the smoothing factor, or bandwidth. It provides some form of distance weighting.

Figure 10.2. Kernel rule on the real line. The figure shows ∑_{i=1}^n (2Y_i − 1)K((x − X_i)/h) for n = 20, K(u) = (1 − u²)I_{|u|≤1} (the Epanechnikov kernel), and three smoothing factors h. One definitely undersmooths and one oversmooths. We took p = 1/2, and the class-conditional densities are f_0(x) = 2(1 − x) and f_1(x) = 2x on [0, 1].

Clearly, the kernel rule is a generalization of the moving window rule, since taking the special kernel K(x) = I_{x∈S_{0,1}} yields the moving window rule. This kernel is sometimes called the naive kernel. Other popular kernels include the Gaussian kernel, K(x) = e^{−‖x‖²}; the Cauchy kernel, K(x) = 1/(1 + ‖x‖^{d+1}); and the Epanechnikov kernel, K(x) = (1 − ‖x‖²)I_{‖x‖≤1}, where ‖·‖ denotes Euclidean distance.
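To make the definitions concrete, here is a minimal Python sketch (not taken from the book) of the kernel classification rule; with the naive kernel it reduces to the moving window rule. The sample-generating step, the kernel choices, and the bandwidth value are purely illustrative assumptions.

import numpy as np

def naive_kernel(u):
    # indicator of the closed unit ball: this choice yields the moving window rule
    return (np.linalg.norm(u, axis=-1) <= 1.0).astype(float)

def epanechnikov_kernel(u):
    r2 = np.sum(u * u, axis=-1)
    return np.maximum(1.0 - r2, 0.0)

def kernel_rule(x, X, Y, h, kernel=naive_kernel):
    """Weighted majority vote with weights K((x - X_i)/h); ties go to class 0."""
    w = kernel((x - X) / h)
    return 0 if np.sum(w[Y == 0]) >= np.sum(w[Y == 1]) else 1

# toy usage with the class-conditional densities of Figure 10.2
rng = np.random.default_rng(0)
n = 200
Y = rng.integers(0, 2, size=n)
X = np.where(Y == 1, np.sqrt(rng.random(n)), 1 - np.sqrt(rng.random(n)))[:, None]
print(kernel_rule(np.array([0.8]), X, Y, h=0.1, kernel=epanechnikov_kernel))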

Figure 10.3. Various kernels on R.

Kernel-based rules are derived from the kernel estimate in density estimation originally studied by Parzen (1962), Rosenblatt (1956), Akaike (1954), and Cacoullos (1965) (see Problems 10.2 and 10.3); and in regression estimation, introduced by Nadaraya (1964; 1970) and Watson (1964). For particular choices of K, rules of this sort have been proposed by Fix and Hodges (1951; 1952), Sebestyen (1962), Van Ryzin (1966), and Meisel (1969). Statistical analysis of these rules and/or the corresponding regression function estimate can be found in Nadaraya (1964; 1970), Rejto and Revesz (1973), Devroye and Wagner (1976b; 1980a; 1980b), Greblicki (1974; 1978b; 1978a), Krzyzak and Pawlak (1984b), and Devroye and Krzyzak (1989). Usage of Cauchy kernels in discrimination is investigated by Arkadjew and Braverman (1966), Hand (1981), and Coomans and Broeckaert (1986).

10.1 Consistency

In this section we demonstrate strong universal consistency of kernel-based rules under general conditions on h and K. Let h > 0 be a smoothing factor depending only on n, and let K be a kernel function. If the conditional densities f_0, f_1 exist, then weak and strong consistency follow from Problems 10.2 and 10.3, respectively, via Problem 2.11. We state the universal consistency theorem for a large class of kernel functions, namely, for all regular kernels.

Definition 10.1. The kernel K is called regular if it is nonnegative, and there is a ball S_{0,r} of radius r > 0 centered at the origin, and a constant b > 0, such that K(x) ≥ bI_{x∈S_{0,r}} and ∫ sup_{y∈x+S_{0,r}} K(y) dx < ∞.

We provide three informative exercises on regular kernels (Problems 10.18, 10.19, 10.20). In all cases, regular kernels are bounded and integrable. The last condition holds whenever K is integrable and uniformly continuous. Introduce the short notation K_h(x) = (1/h)K(x/h). The next theorem states strong universal consistency of kernel rules. The theorem is essentially due to Devroye and Krzyzak (1989). Under the assumption that X has a density, it was proven by Devroye and Gyorfi (1985) and Zhao (1989).

Theorem 10.1. (Devroye and Krzyzak (1989)). Assume that K is a regular kernel. If

  h → 0 and nh^d → ∞ as n → ∞,

then for any distribution of (X, Y), and for every ε > 0, there is an integer n_0 such that for n > n_0 the error probability L_n of the kernel rule satisfies

  P{L_n − L* > ε} ≤ 2e^{−nε²/(32ρ²)},

where the constant ρ depends on the kernel K and the dimension only. Thus, the kernel rule is strongly universally consistent.

Clearly, naive kernels are regular, and moving window rules are thus strongly universally consistent. For the sake of readability, we give the proof for this special case only, and leave the extension to regular kernels to the reader—see Problems 10.14, 10.15, and 10.16. Before we embark on the proof in the next section, we should warn the reader that Theorem 10.1 is of no help whatsoever regarding the choice of K or h. One possible solution is to derive explicit upper bounds for the probability of error as a function of descriptors of the distribution of (X, Y), and of K, n, and h. Minimizing such bounds with respect to K and h will lead to some expedient choices. Typically, such bounds would be based upon the inequality

  EL_n − L* ≤ E{ ∫|(1 − p)f_0(x) − p_{n0}f_{n0}(x)|dx + ∫|pf_1(x) − p_{n1}f_{n1}(x)|dx }

(see Chapter 6), where f_0, f_1 are the class densities, f_{n0}, f_{n1} are their kernel estimates (see Problem 10.2), (1 − p) and p are the class probabilities, and p_{n0}, p_{n1} are their relative-frequency estimates. Bounds for the expected L_1-error in density estimation may be found in Devroye (1987) for d = 1 and Holmstrom and Klemela (1992) for d > 1. Under regularity conditions on the distribution, the choice h = cn^{−1/(d+4)} for some constant c is asymptotically optimal in density estimation. However, c depends upon unknown distributional parameters. Rather than following this roundabout process, we ask the reader to be patient and to wait until Chapter 25, where we study automatic kernel rules, i.e., rules in which h, and sometimes K as well, is picked by the data without intervention from the statistician.


It is still too early to say meaningful things about the choice of a kernel. The kernel density estimate

  f_n(x) = (1/(nh^d)) ∑_{i=1}^n K((x − X_i)/h)

based upon an i.i.d. sample X_1, ..., X_n drawn from an unknown density f is clearly a density in its own right if K ≥ 0 and ∫K = 1. Also, there are certain popular choices of K that are based upon various optimality criteria. In pattern recognition, the story is much more confused, as there is no compelling a priori reason to pick a function K that is nonnegative or integrable. Let us make a few points with the trivial case n = 1. Taking h = 1, the kernel rule is given by

  g_1(x) = 0 if Y_1 = 0, K(x − X_1) ≥ 0, or if Y_1 = 1, K(x − X_1) ≤ 0,
           1 otherwise.

If K ≥ 0, then g_1(x) = 0 if Y_1 = 0, or if Y_1 = 1 and K(x − X_1) = 0. As we would obviously like g_1(x) = 0 if and only if Y_1 = 0, it seems necessary to insist on K > 0 everywhere. However, this restriction makes the kernel estimate nonlocal in nature.
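As a side illustration of the kernel density estimate displayed above, here is a minimal sketch; the Gaussian kernel and the bandwidth value are our own illustrative assumptions, not choices prescribed by the text.

import numpy as np

def kde(x, X, h):
    """f_n(x) = (1/(n h^d)) sum_i K((x - X_i)/h), with K the Gaussian kernel
    exp(-||u||^2) divided by its integral so that f_n integrates to one."""
    X = np.atleast_2d(X)
    n, d = X.shape
    u = (x - X) / h
    K = np.exp(-np.sum(u * u, axis=1))
    const = np.pi ** (d / 2)              # integral of exp(-||u||^2) over R^d
    return K.sum() / (n * h**d * const)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))             # sample from the standard normal density
print(kde(np.array([0.0]), X, h=0.3))     # should be roughly 1/sqrt(2*pi) ≈ 0.4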

For n = 1 and d = 1, consider next a negative-valued kernel such as the Hermite kernel

  K(x) = (1 − x²)e^{−x²}.

Figure 10.4. Hermite kernel.

It is easy to verify that K(x) ≥ 0 if and only if |x| ≤ 1. Also, ∫K = 0. Nevertheless, we note that it yields a simple rule:

  g_1(x) = 0 if Y_1 = 0, |x − X_1| ≤ 1, or if Y_1 = 1, |x − X_1| ≥ 1,
           1 otherwise.

If we have a biatomic distribution for X, with equally likely atoms at 0 and 2, and η(0) = 0 and η(2) = 1 (i.e., Y = 0 if X = 0 and Y = 1 if X = 2), then L* = 0 and the probability of error of this kernel rule (L_1) is 0 as well. Note also that for all n, g_n = g_1 if we keep the same K. Consider now any positive kernel in the same example. If X_1, ..., X_n are all zero, then the decision is g_n(x) = 0 for all x. Hence EL_n ≥ (1/2)P{X_1 = ··· = X_n = 0} = 1/2^{n+1} > 0. Our negative zero-integral kernel is strictly better for all n than any positive kernel! Such kernels should not be discarded without further thought. In density estimation, negative-valued kernels are used to reduce the bias under some smoothness conditions. Here, as shown above, there is an additional reason—negative weights given to points far away from the X_i's may actually be beneficial.

Staying with the same example, if K > 0 everywhere, then

  EL_1 = P{Y_1 = 0, Y = 1} + P{Y_1 = 1, Y = 0} = 2E{η(X)}E{1 − η(X)},

which may be 1/2 (if Eη(X) = 1/2) even if L* = 0 (which happens when η ∈ {0, 1} everywhere). For this particular example, we would have obtained the same result even if K ≡ 1 everywhere. With K ≡ 1, we simply ignore the X_i's and take a majority vote among the Y_i's (with K ≡ −1, it would be a minority vote!):

  g_n(x) = 0 if ∑_{i=1}^n I_{Y_i=0} ≥ ∑_{i=1}^n I_{Y_i=1},
           1 otherwise.

Let N_n be the number of Y_i's equal to zero. As N_n is binomial (n, 1 − p) with p = Eη(X) = P{Y = 1}, we see that

  EL_n = p P{N_n ≥ n/2} + (1 − p) P{N_n < n/2} → min(p, 1 − p),

simply by invoking the law of large numbers. Thus, EL_n → min(p, 1 − p). As in the case with n = 1, the limit is 1/2 when p = 1/2, even though L* = 0 when η ∈ {0, 1} everywhere. It is interesting to note the following though:

  EL_1 = 2p(1 − p)
       = 2 min(p, 1 − p)(1 − min(p, 1 − p))
       ≤ 2 min(p, 1 − p)
       = 2 lim_{n→∞} EL_n.

The expected error with one observation is at most twice as bad as the expected error with an infinite sequence. We have seen various versions of this inequality at work in many instances such as the nearest neighbor rule.

Let us apply the inequality for EL_1 to each part of a fixed partition P of R^d. On each of the k sets A_1, ..., A_k of P, we apply a simple majority vote among the Y_i's, as in the histogram rule. Define the lazy histogram rule as the one in which, in each set A_i, we assign the class according to the Y_j for which X_j ∈ A_i and j is the lowest such index ("the first point to fall in A_i"). It is clear (see Problems 9.4 and 9.5) that

  lim_{n→∞} EL_{LAZY,n} = 2 ∑_{i=1}^k (1/µ(A_i)) ∫_{A_i} η(x)µ(dx) ∫_{A_i} (1 − η(x))µ(dx) ≤ 2 lim_{n→∞} EL_n,

where L_n is the probability of error for the ordinary histogram rule. Again, the vast majority of observations is barely needed to reach a good decision.
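A minimal sketch of the lazy histogram rule just described; the cubic partition of cell width h is our own illustrative assumption (the text allows any fixed partition).

import numpy as np

def lazy_histogram_fit(X, Y, h):
    """Remember, for each occupied cell, the label of the first point to fall in it."""
    labels = {}
    for x, y in zip(X, Y):
        cell = tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))
        if cell not in labels:          # keep only the minimum-index point per cell
            labels[cell] = int(y)
    return labels

def lazy_histogram_predict(x, labels, h, default=0):
    cell = tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))
    return labels.get(cell, default)    # empty cells fall back to a default class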

Just for fun, let us return to a majority vote rule, now applied to the first three observations only. With p = P{Y = 1}, we see that

  EL_3 = p((1 − p)³ + 3(1 − p)²p) + (1 − p)(3(1 − p)p² + p³)

by just writing down binomial probabilities. Observe that

  EL_3 = p(1 − p)(1 + 4p(1 − p))
       ≤ min(p, 1 − p)(1 + 4 min(p, 1 − p))
       = lim_{n→∞} EL_n (1 + 4 lim_{n→∞} EL_n).

If lim_{n→∞} EL_n is small to start with, e.g., lim_{n→∞} EL_n = 0.01, then EL_3 ≤ 0.01 × 1.04 = 0.0104. In such cases, it just does not pay to take more than three observations.
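A quick numerical check of this bound (purely illustrative):

p = 0.01                                    # then lim E L_n = min(p, 1 - p) = 0.01
EL3 = p * (1 - p) * (1 + 4 * p * (1 - p))   # the exact formula derived above
print(EL3)                                  # about 0.0103, below the bound 0.01 * 1.04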

Kernels with fixed smoothing factors have no local sensitivity and, except in some circumstances, have probabilities of error that do not converge to L*. The universal consistency theorem makes a strong case for decreasing smoothing factors—there is no hope in general of approaching L* unless decisions are asymptotically local.

The consistency theorem describes kernel rules with h → 0: these rules become more and more local in nature as n → ∞. The necessity of local rules is not apparent from the previous biatomic example. However, it is clear that if we consider a distribution in which, given Y = 0, X is uniform on {δ, 3δ, ..., (2k+1)δ}, and given Y = 1, X is uniform on {2δ, 4δ, ..., 2kδ}, that is, with the two classes intimately interwoven, a kernel rule with K ≥ 0 of compact support [−1, 1], and h < δ < 1, will have

  L_n ≤ I_{⋃_i {N_i = 0}},

where N_i is the number of X_j's at the i-th atom (the union is over all atoms). Hence EL_n goes to zero exponentially fast. If in the above example we assign X by a geometric distribution on {δ, δ³, δ⁵, ...} when Y = 0, and by a geometric distribution on {δ², δ⁴, δ⁶, ...} when Y = 1, then to obtain EL_n → 0, it is necessary that h → 0 (see Problem 10.1).

Remark. It is worthwhile to investigate what happens for negative-valued kernels K when h → 0, nh^d → ∞, and K has compact support. Every decision becomes an average over many local decisions. If µ has a density f, then at almost all points x, f may be approximated very nicely by ∫_{S_{x,δ}} f / λ(S_{x,δ}) for small δ > 0, where S_{x,δ} is the closed ball of radius δ about x. This implies, roughly speaking, that the number of weighted votes from class 0 observations in a neighborhood of x is about (1 − η(x))f(x)nh^d ∫K, while for class 1 the weight is about η(x)f(x)nh^d ∫K. The correct decision is nearly always made for nh^d large enough, provided that ∫K > 0. See Problem 10.4 on why kernels with ∫K < 0 should be avoided. □

10.2 Proof of the Consistency Theorem

In the proof we can proceed as for the histogram. The crucial difference is captured in the following covering lemmas. Let β_d denote the minimum number of balls of radius 1/2 that cover the ball S_{0,1}. If K is the naive kernel, then ρ = β_d in Theorem 10.1.

Lemma 10.1. (Covering lemma). If K(x) = I_{x∈S_{0,1}}, then for any y ∈ R^d, h > 0, and probability measure µ,

  ∫ [ K_h(x − y) / ∫K_h(x − z)µ(dz) ] µ(dx) ≤ β_d.

Proof. Cover the ball S_{y,h} by β_d balls of radius h/2. Denote their centers by x_1, ..., x_{β_d}. Then x ∈ S_{x_i,h/2} implies S_{x_i,h/2} ⊂ S_{x,h}, and thus

  µ(S_{x,h}) ≥ µ(S_{x_i,h/2}).

We may write

  ∫ [ K_h(x − y) / ∫K_h(x − z)µ(dz) ] µ(dx)
    = ∫ [ I_{x∈S_{y,h}} / µ(S_{x,h}) ] µ(dx)
    ≤ ∑_{i=1}^{β_d} ∫ [ I_{x∈S_{x_i,h/2}} / µ(S_{x,h}) ] µ(dx)
    ≤ ∑_{i=1}^{β_d} ∫ [ I_{x∈S_{x_i,h/2}} / µ(S_{x_i,h/2}) ] µ(dx)
    = β_d.  □

Lemma 10.2. Let 0 < h ≤ R < ∞, and let S ⊂ R^d be a ball of radius R. Then for every probability measure µ,

  ∫_S [ 1/√(µ(S_{x,h})) ] µ(dx) ≤ (1 + R/h)^{d/2} c_d,

where c_d depends upon the dimension d only.


Proof. Cover S with balls of radius h/2, centered at the center points of a regular grid with cells of size h/(2√d) × ··· × h/(2√d). Denote these centers by x_1, ..., x_m, where m is the number of balls that cover S. Clearly,

  m ≤ volume(S_{0,R+h}) / volume(grid cell)
    = V_d(R + h)^d / (h/(2√d))^d     (V_d is the volume of the unit ball in R^d)
    ≤ (1 + R/h)^d c'_d,

where the constant c'_d depends upon the dimension only. Every x gets covered at most k_1 times, where k_1 depends upon d only. Then we have

  ∫_S [ 1/√(µ(S_{x,h})) ] µ(dx)
    ≤ ∑_{i=1}^m ∫_{S_{x_i,h/2}} [ 1/√(µ(S_{x,h})) ] µ(dx)
    = ∑_{i=1}^m ∫ [ I_{x∈S_{x_i,h/2}} / √(µ(S_{x,h})) ] µ(dx)
    ≤ ∑_{i=1}^m ∫ [ I_{x∈S_{x_i,h/2}} / √(µ(S_{x_i,h/2})) ] µ(dx)
        (by the same argument as in Lemma 10.1)
    = ∑_{i=1}^m √(µ(S_{x_i,h/2}))
    ≤ √( m ∑_{i=1}^m µ(S_{x_i,h/2}) )
        (by the Cauchy-Schwarz inequality)
    ≤ √(k_1 m)
    ≤ √((1 + R/h)^d) c_d,

where c_d depends upon the dimension only. □

Proof of Theorem 10.1. Define

  η_n(x) = ∑_{j=1}^n Y_j K_h(x − X_j) / (nEK_h(x − X)).

Since the decision rule can be written as

  g_n(x) = 0 if ∑_{j=1}^n Y_jK_h(x − X_j)/(nEK_h(x − X)) ≤ ∑_{j=1}^n (1 − Y_j)K_h(x − X_j)/(nEK_h(x − X)),
           1 otherwise,

by Theorem 2.3, what we have to prove is that for n large enough,

  P{∫|η(x) − η_n(x)|µ(dx) > ε/2} ≤ e^{−nε²/(32ρ²)}.

We use a decomposition as in the proof of strong consistency of the histogram rule:

  |η(x) − η_n(x)| = E|η(x) − η_n(x)| + (|η(x) − η_n(x)| − E|η(x) − η_n(x)|).   (10.1)

To handle the first term on the right-hand side, fix ε' > 0, and let r : R^d → R be a continuous function of bounded support satisfying

  ∫|η(x) − r(x)|µ(dx) < ε'.

Obviously, we can choose the function r such that 0 ≤ r(x) ≤ 1 for all x ∈ R^d. Then we have the following simple upper bound:

  E|η(x) − η_n(x)|
    ≤ |η(x) − r(x)| + |r(x) − E{r(X)K_h(x − X)}/EK_h(x − X)|
      + |E{r(X)K_h(x − X)}/EK_h(x − X) − Eη_n(x)| + E|Eη_n(x) − η_n(x)|.   (10.2)

Next we bound the integral of each term on the right-hand side of the inequality above.

First term: By the definition of r, ∫|η(x) − r(x)|µ(dx) < ε'.

Second term: Since r(x) is continuous and zero outside of a bounded set, it is also uniformly continuous, that is, there exists a δ > 0 such that ‖x − y‖ < δ implies |r(x) − r(y)| < ε'. Also, r(x) is bounded. Thus,

  ∫ |r(x) − E{r(X)K_h(x − X)}/EK_h(x − X)| µ(dx)
    = ∫ |r(x) − ∫ r(y)K_h(x − y)µ(dy)/EK_h(x − X)| µ(dx)
    ≤ ∫∫_{S_{x,δ}} [K_h(x − y)/EK_h(x − X)] |r(x) − r(y)| µ(dy)µ(dx)
      + ∫∫_{S_{x,δ}^c} [K_h(x − y)/EK_h(x − X)] |r(x) − r(y)| µ(dy)µ(dx)
    ≤ ∫ ( ∫_{S_{x,δ}} [K_h(x − y)/EK_h(x − X)] µ(dy) · sup_{z∈S_{x,δ}} |r(x) − r(z)| ) µ(dx)
      + ∫ ( ∫_{S_{x,δ}^c} [K_h(x − y)/EK_h(x − X)] µ(dy) ) µ(dx).

In the last step we used the fact that sup_{x,y} |r(x) − r(y)| ≤ 1. Clearly, we have ∫_{S_{x,δ}} [K_h(x − y)/EK_h(x − X)] µ(dy) ≤ 1, and by the uniform continuity of r(x), sup_{z∈S_{x,δ}} |r(x) − r(z)| < ε'. Thus, the first term at the end of the chain of inequalities above is bounded by ε'. The second term converges to zero since h < δ for all n large enough, which in turn implies ∫_{S_{x,δ}^c} [K_h(x − y)/EK_h(x − X)] µ(dy) = 0. (This is obvious for the naive kernel. For regular kernels, convergence to zero follows from Problem 10.15.) The convergence of the integral (with respect to µ(dx)) follows from the dominated convergence theorem. In summary, we have shown that

  lim sup_{n→∞} ∫ |r(x) − E{r(X)K_h(x − X)}/EK_h(x − X)| µ(dx) ≤ ε'.

Third term:

  ∫ |E{r(X)K_h(x − X)}/EK_h(x − X) − Eη_n(x)| µ(dx)
    = ∫ |∫ (r(y) − η(y))K_h(x − y)µ(dy)/EK_h(x − X)| µ(dx)
    ≤ ∫∫ |r(y) − η(y)| [K_h(x − y)/∫K_h(x − z)µ(dz)] µ(dy)µ(dx)
    = ∫ ( ∫ [K_h(x − y)/∫K_h(x − z)µ(dz)] µ(dx) ) |r(y) − η(y)| µ(dy)
        (by Fubini's theorem)
    ≤ ∫ ρ|r(y) − η(y)| µ(dy) ≤ ρε',

where in the last two steps we used the covering lemma (see Lemma 10.1 for the naive kernel, and Problem 10.14 for general kernels) and the definition of r(x); ρ is the constant appearing in the covering lemma.

Fourth term: We show that

  E∫|Eη_n(x) − η_n(x)|µ(dx) → 0.

For the naive kernel, we have

  E|Eη_n(x) − η_n(x)|
    ≤ √(E|Eη_n(x) − η_n(x)|²)
    = √( E(∑_{j=1}^n (Y_jK_h(x − X_j) − EY K_h(x − X)))² / (n²(EK_h(x − X))²) )
    = √( E(Y K_h(x − X) − EY K_h(x − X))² / (n(EK_h(x − X))²) )
    ≤ √( E(Y K_h(x − X))² / (n(EK_h(x − X))²) )
    ≤ √( E(K_h(x − X))² / (n(EK_h(x − X))²) )
    = √( EK((x − X)/h) / (n(EK((x − X)/h))²) )
    = √( 1/(nµ(S_{x,h})) ),

where we used the Cauchy-Schwarz inequality and properties of the naive kernel. Extension to regular kernels is straightforward.

Next we use the inequality above to show that the integral converges to zero. Divide the integral over R^d into two terms, namely an integral over a large ball S centered at the origin, of radius R > 0, and an integral over S^c. For the integral outside of the ball we have

  ∫_{S^c} E|Eη_n(x) − η_n(x)|µ(dx) ≤ 2∫_{S^c} Eη_n(x)µ(dx) → 2∫_{S^c} η(x)µ(dx)

with probability one as n → ∞, which can be shown in the same way we proved

  ∫_{R^d} Eη_n(x)µ(dx) → ∫_{R^d} η(x)µ(dx)

(see the first, second, and third terms of (10.2)). Clearly, the radius R of the ball S can be chosen such that 2∫_{S^c} η(x)µ(dx) < ε/4. To bound the integral over S we employ Lemma 10.2:

  ∫_S E|Eη_n(x) − η_n(x)|µ(dx)
    ≤ (1/√n) ∫_S [1/√(µ(S_{x,h}))] µ(dx)     (by the inequality obtained above)
    ≤ (1/√n)(1 + R/h)^{d/2} c_d
    → 0     (since by assumption nh^d → ∞).

Therefore, if n is sufficiently large, then for the first term on the right-hand side of (10.1) we have

  E∫|η(x) − η_n(x)|µ(dx) < ε'(ρ + 3) = ε/4

if we take ε' = ε/(4ρ + 12).

It remains to show that the second term on the right-hand side of (10.1) is small with large probability. To do this, we use McDiarmid's inequality (Theorem 9.2) for

  ∫|η(x) − η_n(x)|µ(dx) − E∫|η(x) − η_n(x)|µ(dx).

Fix the training data at ((x_1, y_1), ..., (x_n, y_n)) and replace the i-th pair (x_i, y_i) by (x̂_i, ŷ_i), changing the value of η_n(x) to η_{ni}(x). Clearly, by the covering lemma (Lemma 10.1),

  ∫|η(x) − η_n(x)|µ(dx) − ∫|η(x) − η_{ni}(x)|µ(dx)
    ≤ ∫|η_n(x) − η_{ni}(x)|µ(dx)
    ≤ sup_{y∈R^d} ∫ [2K_h(x − y)/(nEK_h(x − X))] µ(dx)
    ≤ 2ρ/n.

So by Theorem 9.2,

  P{∫|η(x) − η_n(x)|µ(dx) > ε/2}
    ≤ P{∫|η(x) − η_n(x)|µ(dx) − E∫|η(x) − η_n(x)|µ(dx) > ε/4}
    ≤ e^{−nε²/(32ρ²)}.

The proof is now completed. □

10.3 Potential Function Rules

Kernel classification rules may be formulated in terms of the so-called potential function rules. These rules were originally introduced and studied by Bashkirov, Braverman, and Muchnik (1964), Aizerman, Braverman, and Rozonoer (1964c; 1964b; 1964a; 1970), Braverman (1965), and Braverman and Pyatniskii (1966). The original idea was the following: put a unit of positive electrical charge at every data point X_i where Y_i = 1, and a unit of negative charge at every data point X_i where Y_i = 0. The resulting potential field defines an intuitively appealing rule: the decision at a point x is one if the potential at that point is positive, and zero if it is negative. This idea leads to a rule that can be generalized to obtain rules of the form

  g_n(x) = 0 if f_n(x) ≤ 0,
           1 otherwise,

where

  f_n(x) = ∑_{i=1}^n r_{n,i}(D_n) K_{n,i}(x, X_i),

where the K_{n,i}'s describe the potential field around X_i, and the r_{n,i}'s are their weights. Rules that can be put into this form are often called potential function rules. Here we give a brief survey of these rules.
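The general form lends itself to a direct implementation. A minimal sketch (the function names are ours; the weight and kernel functions are arbitrary placeholders to be supplied by the particular rule):

import numpy as np

def potential_rule(x, X, Y, weight_fn, kernel_fn):
    """g_n(x) = 1 iff f_n(x) = sum_i r_{n,i} K_{n,i}(x, X_i) > 0."""
    n = len(X)
    f = sum(weight_fn(i, Y) * kernel_fn(i, x, X[i]) for i in range(n))
    return 1 if f > 0 else 0

# the ordinary kernel rule as a special case: r_{n,i} = 2 Y_i - 1, K_{n,i}(x,y) = K((x-y)/h)
def kernel_special_case(h, K):
    return (lambda i, Y: 2 * Y[i] - 1,
            lambda i, x, y: K((x - y) / h))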

Kernel rules. Clearly, the kernel rules studied in the previous section are potential function rules with

  K_{n,i}(x, y) = K((x − y)/h_n)  and  r_{n,i}(D_n) = 2Y_i − 1.

Here K is a fixed kernel function, and h_1, h_2, ... is a sequence of positive numbers.

Histogram rules. Similarly, histogram rules (see Chapters 6 and 9) can be put in this form by choosing

  K_{n,i}(x, y) = I_{y∈A_n(x)}  and  r_{n,i}(D_n) = 2Y_i − 1.

Recall that A_n(x) denotes the cell of the partition in which x falls.

Polynomial discriminant functions. Specht (1967) suggested applying a polynomial expansion to the kernel K((x − y)/h). This led to the choice

  K_{n,i}(x, y) = ∑_{j=1}^k ψ_j(x)ψ_j(y)  and  r_{n,i}(D_n) = 2Y_i − 1,

where ψ_1, ..., ψ_k are fixed real-valued functions on R^d. When these functions are polynomials, the corresponding classifier g_n is called a polynomial discriminant function. The potential function rule obtained this way is a generalized linear rule (see Chapter 17) with

  f_n(x) = ∑_{j=1}^k a_{n,j}ψ_j(x),

where the coefficients a_{n,j} depend on the data D_n only, through

  a_{n,j} = ∑_{i=1}^n (2Y_i − 1)ψ_j(X_i).

This choice of the coefficients does not necessarily lead to a consistent rule, unless the functions ψ_1, ..., ψ_k are allowed to change with n, or k is allowed to vary with n. Nevertheless, the rule has some computational advantages over kernel rules. In many practical situations there is enough time to preprocess the data D_n, but once the observation X becomes known, the decision has to be made very quickly. Clearly, the coefficients a_{n,1}, ..., a_{n,k} can be computed by knowing the training data D_n only, and if the values ψ_1(X), ..., ψ_k(X) are easily computable, then f_n(X) can be computed much more quickly than in a kernel-based decision, where all n terms of the sum have to be computed in real time, if no preprocessing is done. However, preprocessing the data may also help with kernel rules, especially when d = 1. For a survey of computational speed-up with kernel methods, see Devroye and Machell (1985).
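A minimal sketch of this precompute-then-evaluate idea; the monomial basis on R is our own illustrative choice of ψ_1, ..., ψ_k, not one prescribed by the text.

import numpy as np

def monomial_features(x, k):
    # psi_j(x) = x^(j-1), j = 1, ..., k  (an illustrative basis)
    return np.array([x ** j for j in range(k)])

def fit_coefficients(X, Y, k):
    """Preprocessing: a_{n,j} = sum_i (2 Y_i - 1) psi_j(X_i)."""
    return sum((2 * y - 1) * monomial_features(x, k) for x, y in zip(X, Y))

def predict(x, a, k):
    # only k feature evaluations at decision time, independent of n
    return 1 if np.dot(a, monomial_features(x, k)) > 0 else 0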

Recursive kernel rules. Consider the choice

  K_{n,i}(x, y) = K((x − y)/h_i)  and  r_{n,i}(D_n) = 2Y_i − 1.   (10.3)

Observe that the only difference between this and the ordinary kernel rule is that in the expression of K_{n,i}, the smoothing parameter h_n is replaced with h_i. With this change, we can compute the rule recursively by observing that

  f_{n+1}(x) = f_n(x) + (2Y_{n+1} − 1)K((x − X_{n+1})/h_{n+1}).


The computational advantage of this rule is that if one collects additional data, then the rule does not have to be entirely recomputed. It can be adjusted using the formula above. Consistency properties of this rule were studied by Devroye and Wagner (1980b), Krzyzak and Pawlak (1984a), Krzyzak (1986), and Greblicki and Pawlak (1987). Several similar recursive kernel rules have been studied in the literature. Wolverton and Wagner (1969b), Greblicki (1974), and Krzyzak and Pawlak (1983) studied the situation when

  K_{n,i}(x, y) = (1/h_i^d)K((x − y)/h_i)  and  r_{n,i}(D_n) = 2Y_i − 1.   (10.4)

The corresponding rule can be computed recursively by

  f_{n+1}(x) = f_n(x) + (2Y_{n+1} − 1)(1/h_{n+1}^d)K((x − X_{n+1})/h_{n+1}).

Motivated by stochastic approximation methods (see Chapter 17), Revesz (1973) suggested and studied the rule obtained from

  f_{n+1}(x) = f_n(x) + (1/(n + 1))(2Y_{n+1} − 1 − f_n(x))(1/h_{n+1}^d)K((x − X_{n+1})/h_{n+1}).

A similar rule was studied by Gyorfi (1981):

  f_{n+1}(x) = f_n(x) + (1/(n + 1))(2Y_{n+1} − 1 − f_n(x))K((x − X_{n+1})/h_{n+1}).
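A minimal sketch of the recursive update (10.3), maintained at a fixed set of query points; the naive kernel and the particular bandwidth sequence are illustrative assumptions only, not taken from the book.

import numpy as np

class RecursiveKernelRule:
    """Maintains f_n at fixed query points via
    f_{n+1}(x) = f_n(x) + (2 Y_{n+1} - 1) K((x - X_{n+1}) / h_{n+1})."""
    def __init__(self, query_points, kernel):
        self.q = np.asarray(query_points, dtype=float)
        self.f = np.zeros(len(self.q))
        self.kernel = kernel
        self.n = 0

    def update(self, x_new, y_new):
        self.n += 1
        h = self.n ** -0.25              # an illustrative bandwidth sequence h_i -> 0
        self.f += (2 * y_new - 1) * self.kernel((self.q - x_new) / h)

    def classify(self):
        return (self.f > 0).astype(int)   # decision 1 where the potential is positive

naive = lambda u: (np.abs(u) <= 1).astype(float)
rule = RecursiveKernelRule(np.linspace(0, 1, 5), naive)
rule.update(0.9, 1); rule.update(0.1, 0)
print(rule.classify())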

Problems and Exercises

Problem 10.1. Let K be a nonnegative kernel with compact support on [−1, 1]. Show that for some distribution, h → 0 is necessary for consistency of the kernel rule. To this end, consider the following example. Given Y = 0, X has a geometric distribution on {δ, δ³, δ⁵, ...}, and given Y = 1, X has a geometric distribution on {δ², δ⁴, δ⁶, ...}. Then show that to obtain EL_n → L* = 0, it is necessary that h → 0.

Problem 10.2. Kernel density estimation. Let X_1, ..., X_n be i.i.d. random variables in R^d with density f. Let K be a kernel function integrating to one, and let h_n > 0 be a smoothing factor. The kernel density estimate is defined by

  f_n(x) = (1/(nh_n^d)) ∑_{i=1}^n K((x − X_i)/h_n)

(Rosenblatt (1956), Parzen (1962)). Prove that the estimate is weakly universally consistent in L_1 if h_n → 0 and nh_n^d → ∞ as n → ∞. Hint: Proceed as in Problem 6.2.


Problem 10.3. Strong consistency of kernel density estimation. Let X_1, ..., X_n be i.i.d. random variables in R^d with density f. Let K be a nonnegative function integrating to one (a kernel) and h > 0 a smoothing factor. As in the previous exercise, the kernel density estimate is defined by

  f_n(x) = (1/(nh^d)) ∑_{i=1}^n K((x − X_i)/h).

Prove for the L_1-error of the estimate that

  P{ |∫|f_n(x) − f(x)|dx − E∫|f_n(x) − f(x)|dx| > ε } ≤ 2e^{−nε²/2}

(Devroye (1991a)). Conclude that weak L_1-consistency of the estimate implies strong consistency (see Problem 10.2). This is a way to show that weak and strong L_1-consistencies of the kernel density estimate are equivalent (Devroye (1983)). Also, for d = 1, if K is nonnegative, then E∫|f_n(x) − f(x)|dx cannot converge to zero faster than n^{−2/5} for any density (see Devroye and Gyorfi (1985)); therefore, the inequality above implies that for any density

  lim_{n→∞} ∫|f_n(x) − f(x)|dx / E∫|f_n(x) − f(x)|dx = 1

with probability one (Devroye (1988d)). This property is called the relative stability of the L_1-error. It means that the asymptotic behavior of the L_1-error is the same as that of its expected value. Hint: Use McDiarmid's inequality.

Problem 10.4. If ∫K < 0, show that under the assumption that µ has a density f, and that h → 0, nh^d → ∞, the kernel rule has

  lim_{n→∞} EL_n = E{max(η(X), 1 − η(X))} = 1 − L*.

Thus, the rule makes the wrong decisions, and such kernels should be avoided. Hint: You may use the fact that for the kernel density estimate with kernel L satisfying ∫L = 1,

  ∫|f_n(x) − f(x)|dx → 0

with probability one, if h → 0 and nh^d → ∞ (see Problems 10.2 and 10.3).

Problem 10.5. Consider a devilish kernel that attaches counterproductive weight to the origin:

  K(x) = −1 if ‖x‖ ≤ 1/3,
          1 if 1/3 < ‖x‖ ≤ 1,
          0 if ‖x‖ > 1.

Assume that h → 0, yet nh^d → ∞. Assume that L* = 0. Show that L_n → 0 with probability one. Concession: if you find that you can't handle the universality, try first proving the statement for strictly separable distributions.

Problem 10.6. Show that for the distribution depicted in Figure 10.2, the kernel rule with kernel K(u) = (1 − u²)I_{|u|≤1} is consistent whenever h, the smoothing factor, remains fixed and 0 < h ≤ 1/2.


Problem 10.7. The limit for fixed h. Consider a kernel rule with fixed h ≡ 1 and fixed kernel K. Find a simple argument that proves

  lim_{n→∞} EL_n = L_∞,

where L_∞ is the probability of error for the decision g_∞ defined by

  g_∞(x) = 0 if E{K(x − X)(2η(X) − 1)} ≤ 0,
           1 if E{K(x − X)(2η(X) − 1)} > 0.

Find a distribution such that for the window kernel, L_∞ = 1/2, yet L* = 0. Is there such a distribution for any kernel? Hint: Try proving a convergence result at each x by invoking the law of large numbers, and then replace x by X.

Problem 10.8. Show that the conditions h_n → 0 and nh_n^d → ∞ of Theorem 10.1 are not necessary for consistency, that is, exhibit a distribution such that the kernel rule is consistent with h_n = 1, and exhibit another distribution for which the kernel rule is consistent with h_n ~ 1/n^{1/d}.

Problem 10.9. Prove that the conditions h_n → 0 and nh_n^d → ∞ of Theorem 10.1 are necessary for universal consistency, that is, show that if one of these conditions is violated, then there is a distribution for which the kernel rule is not consistent (Krzyzak (1991)).

Problem 10.10. This exercise provides an argument in favor of monotonicity of the kernel K. In R², find a nonatomic distribution for (X, Y), and a positive kernel with ∫K > 0, K vanishing off S_{0,δ} for some δ > 0, such that for all h > 0 and all n, the kernel rule has EL_n = 1/2, while L* = 0. This result says that the condition K(x) ≥ bI_{x∈S_{0,δ}} for some b > 0 in the universal consistency theorem cannot be abolished altogether.

Problem 10.11. With K as in the previous problem, and taking h = 1, show that

  lim_{n→∞} EL_n = L*

under the following conditions:
(1) K has compact support, vanishing off S_{0,δ}, K ≥ 0, and K ≥ bI_{x∈S_{0,ε}} for some ε > 0.
(2) We say that we have agreement on S_{x,δ} when either η(z) ≤ 1/2 for all z ∈ S_{x,δ}, or η(z) ≥ 1/2 for all z ∈ S_{x,δ}. We ask that P{Agreement on S_{X,δ}} = 1.

Problem 10.12. The previous exercise shows that at points where there is agreement, we make asymptotically the correct decision with kernels with fixed smoothing factor. Let D be the set {x : η(x) = 1/2}, and let the δ-neighborhood of D be defined by D_δ = {y : ‖y − x‖ ≤ δ for some x ∈ D}. Let µ be the probability measure for X. Take K, h as in the previous exercise. Noting that x ∉ D_δ means that we have agreement on S_{x,δ}, show that for all distributions of (X, Y),

  lim sup_{n→∞} EL_n ≤ L* + µ(D_δ).


Figure 10.5. The δ-neighborhood of a set D.

Problem 10.13. Continuation. Clearly, µ(D_δ) → 0 as δ → 0 when µ(D) = 0. Convince yourself that µ(D) = 0 for most problems. If you knew how fast µ(D_δ) tended to zero, then the previous exercise would enable you to pick h as a function of n such that h → 0 and such that the upper bound for EL_n obtained by analogy from the previous exercise is approximately minimal. If in R^d, D is the surface of the unit ball, X has a bounded density f, and η is Lipschitz, determine a bound for µ(D_δ). By considering the proof of the universal consistency theorem, show how to choose h such that

  EL_n − L* = O(1/n^{1/(2+d)}).

Problem 10.14. Extension of the covering lemma (Lemma 10.1) to regular kernels. Let K be a regular kernel, and let µ be an arbitrary probability measure. Prove that there exists a finite constant ρ = ρ(K), depending only upon K, such that for any y and h,

  ∫ [ K_h(x − y) / ∫K_h(x − z)µ(dz) ] µ(dx) ≤ ρ

(Devroye and Krzyzak (1989)). Hint: Prove this by checking the following:
(1) First take a bounded overlap cover of R^d with translates of S_{0,r/2}, where r > 0 is the constant appearing in the definition of a regular kernel. This cover has an infinite number of member balls, but every x gets covered at most k_1 times, where k_1 depends upon d only.
(2) The centers of the balls are called x_i, i = 1, 2, .... The integral condition on K implies that

  ∑_{i=1}^∞ sup_{x∈x_i+S_{0,r/2}} K(x) ≤ (k_1/∫_{S_{0,r/2}} dx) ∫ sup_{y∈x+S_{0,r/2}} K(y)dx ≤ k_2

for another finite constant k_2.
(3) Show that

  K_h(x − y) ≤ ∑_{i=1}^∞ sup_{x∈y+hx_i+S_{0,rh/2}} K_h(x − y) I_{x∈y+hx_i+S_{0,rh/2}},

and

  ∫K_h(x − z)µ(dz) ≥ bµ(y + hx_i + S_{0,rh/2})   (x ∈ y + hx_i + S_{0,rh/2}).

From (3) conclude

  ∫ [ K_h(x − y) / ∫K_h(x − z)µ(dz) ] µ(dx)
    ≤ ∑_{i=1}^∞ ∫_{x∈y+hx_i+S_{0,rh/2}} [ sup_{z∈hx_i+S_{0,rh/2}} K_h(z) / (bµ(y + hx_i + S_{0,rh/2})) ] µ(dx)
    = ∑_{i=1}^∞ µ(y + hx_i + S_{0,rh/2}) sup_{z∈hx_i+S_{0,rh/2}} K_h(z) / (bµ(y + hx_i + S_{0,rh/2}))
    = (1/b) ∑_{i=1}^∞ sup_{z∈hx_i+S_{0,rh/2}} K_h(z) ≤ k_2/b,

where k_2 depends on K and d only.

Problem 10.15. Let K be a regular kernel, and let µ be an arbitrary probability measure. Prove that for any δ > 0,

  lim_{h→0} sup_y ∫ [ K_h(x − y)I_{‖x−y‖≥δ} / ∫K_h(x − z)µ(dz) ] µ(dx) = 0.

Hint: Substitute K_h(z) in the proof of Problem 10.14 by K_h(z)I_{‖z‖≥δ} and notice that

  sup_y ∫ [ K_h(x − y)I_{‖x−y‖≥δ} / ∫K_h(x − z)µ(dz) ] µ(dx) ≤ ∑_{i=1}^∞ sup_{z∈hx_i+S_{0,rh/2}} K_h(z)I_{‖z‖≥δ} → 0

as h → 0.

Problem 10.16. Use Problems 10.14 and 10.15 to extend the proof of Theorem 10.1 to arbitrary regular kernels.

Problem 10.17. Show that the constant β_d in Lemma 10.1 is never more than 4^d.

Problem 10.18. Show that if L ≥ 0 is a bounded function that is monotonically decreasing on [0, ∞) with the property that ∫u^{d−1}L(u)du < ∞, and if K : R^d → [0, ∞) is a function with K(x) ≤ L(‖x‖), then K is regular.

Problem 10.19. Find a kernel K ≥ 0 that is monotonically decreasing along rays (i.e., K(rx) ≤ K(x) for all x ∈ R^d and all r ≥ 1) such that K is not regular. (This exercise is intended to convince you that it is very difficult to find well-behaved kernels that are not regular.)

Problem 10.20. Let K(x) = L(‖x‖) for some bounded function L ≥ 0. Show that K is regular if L is decreasing on [0, ∞) and ∫K(x)dx < ∞. Conclude that the Gaussian and Cauchy kernels are regular.

Problem 10.21. Regularity of the kernel is not necessary for universal consistency. Investigate universal consistency with a nonintegrable kernel—that is, one for which ∫K(x)dx = ∞—such as K(x) = 1/(1 + |x|). Greblicki, Krzyzak, and Pawlak (1984) proved consistency of the kernel rule with smoothing factor h_n satisfying h_n → 0 and nh_n^d → ∞ if the kernel K satisfies the following conditions: K(x) ≥ cI_{‖x‖≤1} for some c > 0, and for some c_1, c_2 > 0, c_1H(‖x‖) ≤ K(x) ≤ c_2H(‖x‖), where H is a nonincreasing function on [0, ∞) with u^dH(u) → 0 as u → ∞.

Problem 10.22. Consider the kernel rule with kernel K(x) = 1/‖x‖^r, r > 0. Such kernels are useless for atomic distributions unless we take limits and define g_n as usual when x ∉ S, the collection of points z with X_i = X_j = z for some pair (i ≠ j). For x ∈ S, we take a majority vote over the Y_i's for which X_i = x. Discuss the weak universal consistency of this rule, which has the curious property that g_n is invariant to the smoothing factor h—so we might as well set h = 1 without loss of generality. Note also that for r ≥ d, ∫_{S_{0,1}} K(x)dx = ∞, and for r ≤ d, ∫_{S_{0,1}^c} K(x)dx = ∞, where S_{0,1} is the unit ball of R^d centered at the origin. In particular, if r ≤ d, by considering X uniform on S_{0,1} if Y = 1 and X uniform on the surface of S_{0,1} if Y = 0, show that even though L* = 0, the probability of error of the rule may tend to a nonzero limit for certain values of P{Y = 1}. Hence, the rule is not universally consistent. For r ≥ d, prove or disprove the weak universal consistency, noting that the rule's decisions are by-and-large based on the few nearest neighbors.

Problem 10.23. Assume that the class densities coincide, that is, f_0(x) = f_1(x) for every x ∈ R, and assume p = P{Y = 1} > 1/2. Show that the expected probability of error of the kernel rule with K ≡ 1 is smaller than that with any unimodal regular kernel for every n and h small enough. Exhibit a distribution such that the kernel rule with a symmetric kernel for which K(|x|) is monotone increasing has smaller expected error probability than that with any unimodal regular kernel.

Problem 10.24. Scaling. Assume that the kernel K can be written in the following product form of one-dimensional kernels:

  K(x) = K(x^{(1)}, ..., x^{(d)}) = ∏_{i=1}^d K_i(x^{(i)}).

Assume also that K is regular. One can use different smoothing factors along the different coordinate axes to define a kernel rule by

  g_n(x) = 0 if ∑_{i=1}^n (2Y_i − 1) ∏_{j=1}^d K_j((x^{(j)} − X_i^{(j)})/h_{jn}) ≤ 0,
           1 otherwise,

where X_i^{(j)} denotes the j-th component of X_i. Prove that g_n is strongly universally consistent if h_{in} → 0 for all i = 1, ..., d, and nh_{1n}h_{2n}···h_{dn} → ∞.

Problem 10.25. Let K : [0, ∞) → [0, ∞) be a function, and Σ a symmetric positive definite d × d matrix. For x ∈ R^d define K'(x) = K(x^TΣx). Find conditions on K such that the kernel rule with kernel K' is universally consistent.


Problem 10.26. Prove that the recursive kernel rule defined by (10.4) is strongly universally consistent if K is a regular kernel, h_n → 0, and nh_n^d → ∞ as n → ∞ (Krzyzak and Pawlak (1983)).

Problem 10.27. Show that the recursive kernel rule of (10.3) is strongly universally consistent whenever K is a regular kernel, lim_{n→∞} h_n = 0, and ∑_{n=1}^∞ h_n^d = ∞ (Greblicki and Pawlak (1987)). Note: Greblicki and Pawlak (1987) showed convergence under significantly weaker assumptions on the kernel. They assume that K(x) ≥ cI_{‖x‖≤1} for some c > 0 and that for some c_1, c_2 > 0, c_1H(‖x‖) ≤ K(x) ≤ c_2H(‖x‖), where H is a nonincreasing function on [0, ∞) with u^dH(u) → 0 as u → ∞. They also showed that under the additional assumption ∫K(x)dx < ∞ the following conditions on h_n are necessary and sufficient for universal consistency:

  ∑_{i=1}^n h_i^d I_{h_i>a} / ∑_{i=1}^n h_i^d → 0 for all a > 0,  and  ∑_{i=1}^∞ h_i^d = ∞.

Problem 10.28. Open-ended problem. Let P{Y = 1} = 1/2. Given Y = 1, let X be uniformly distributed on [0, 1]. Given Y = 0, let X be atomic on the rationals with the following distribution: let X = V/W, where V and W are independent and identically distributed with P{V = i} = 1/2^i, i ≥ 1. Consider the kernel rule with the window kernel. What is the behavior of the smoothing factor h*_n that minimizes the expected probability of error EL_n?


11
Consistency of the k-Nearest Neighbor Rule

In Chapter 5 we discuss results about the asymptotic behavior of k-nearest neighbor classification rules, where the value of k—the number of neighbors taken into account in the decision—is kept fixed as the size n of the training data increases. This choice leads to asymptotic error probabilities smaller than 2L*, but not to universal consistency. In Chapter 6 we showed that if we let k grow to infinity as n → ∞ such that k/n → 0, then the resulting rule is weakly consistent. The main purpose of this chapter is to demonstrate strong consistency, and to discuss various versions of the rule.

We are not concerned here with the data-based choice of k—that subject deserves a chapter of its own (Chapter 26). We are also not tackling the problem of the selection of a suitable—even data-based—metric. At the end of this chapter and in the exercises, we draw attention to 1-nearest neighbor relabeling rules, which combine the computational comfort of the 1-nearest neighbor rule with the asymptotic performance of variable-k nearest neighbor rules.

Consistency of k-nearest neighbor classification, and of the corresponding regression and density estimation, has been studied by many researchers. See Fix and Hodges (1951; 1952), Cover (1968a), Stone (1977), Beck (1979), Gyorfi and Gyorfi (1975), Devroye (1981a; 1982b), Collomb (1979; 1980; 1981), Bickel and Breiman (1983), Mack (1981), Stute (1984), Devroye and Gyorfi (1985), Bhattacharya and Mack (1987), Zhao (1987), and Devroye, Gyorfi, Krzyzak, and Lugosi (1994).


Recall the definition of the k-nearest neighbor rule: first reorder the data

  (X_(1)(x), Y_(1)(x)), ..., (X_(n)(x), Y_(n)(x))

according to increasing Euclidean distances of the X_j's to x. In other words, X_(i)(x) is the i-th nearest neighbor of x among the points X_1, ..., X_n. If distance ties occur, a tie-breaking strategy must be defined. If µ is absolutely continuous with respect to the Lebesgue measure, that is, it has a density, then no ties occur with probability one, so formally we break ties by comparing indices. However, for general µ, the problem of distance ties turns out to be important, and its solution is messy. The issue of tie breaking becomes important when one is concerned with convergence of L_n with probability one. For weak universal consistency, it suffices to break ties by comparing indices.

The k-nn classification rule is defined as

  g_n(x) = 0 if ∑_{i=1}^k I_{Y_(i)(x)=1} ≤ ∑_{i=1}^k I_{Y_(i)(x)=0},
           1 otherwise.

In other words, g_n(x) is a majority vote among the labels of the k nearest neighbors of x.
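A minimal sketch of the k-nearest neighbor rule as just defined, breaking distance ties by comparing indices (a brute-force O(n) neighbor search; nothing here beyond the definition is taken from the book):

import numpy as np

def knn_rule(x, X, Y, k):
    """Majority vote among the labels of the k nearest neighbors of x.
    Distance ties are broken by index order; ties in the vote go to class 0."""
    d = np.linalg.norm(X - x, axis=1)
    order = np.lexsort((np.arange(len(X)), d))   # sort by distance, then by index
    votes = Y[order[:k]]
    return 1 if np.sum(votes == 1) > np.sum(votes == 0) else 0

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a noiseless toy problem
print(knn_rule(np.array([1.0, 1.0]), X, Y, k=9))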

11.1 Strong Consistency

In this section we prove Theorem 11.1. We assume the existence of a density for µ, so that we can avoid the messy technicalities necessary to handle distance ties. We discuss this issue briefly in the next section.

The following result implies strong consistency whenever X has an absolutely continuous distribution. The result was proved by Devroye and Gyorfi (1985), and Zhao (1987). The proof presented here basically appears in Devroye, Gyorfi, Krzyzak, and Lugosi (1994), where strong universal consistency is proved under an appropriate tie-breaking strategy (see the discussion later). Some of the main ideas appeared in the proof of the strong universal consistency of the regular histogram rule (Theorem 9.4).

Theorem 11.1. (Devroye and Gyorfi (1985), Zhao (1987)). Assume that µ has a density. If k → ∞ and k/n → 0, then for every ε > 0 there is an n_0 such that for n > n_0,

  P{L_n − L* > ε} ≤ 2e^{−nε²/(72γ_d²)},

where γ_d is the minimal number of cones of angle π/6 centered at the origin that cover R^d. (For the definition of a cone, see Chapter 5.) Thus, the k-nn rule is strongly consistent.


Remark. At first glance the upper bound in the theorem does not seem to depend on k. It is n_0 that depends on the sequence of k's. What we really prove is the following: for every ε > 0 there exists a β_0 ∈ (0, 1) such that for any β < β_0 there is an n_0 such that if n > n_0, k > 1/β, and k/n < β, the exponential inequality holds. □

For the proof we need a generalization of Lemma 5.3. The role of this covering lemma is analogous to that of Lemma 10.1 in the proof of consistency of kernel rules.

Lemma 11.1. (Devroye and Gyorfi (1985)). Let

  B_a(x') = {x : µ(S_{x,‖x−x'‖}) ≤ a}.

Then for all x' ∈ R^d,

  µ(B_a(x')) ≤ γ_d a.

Proof. For x ∈ R^d let C(x, s) ⊂ R^d be a cone of angle π/6 centered at x. The cone consists of all y with the property that either y = x or angle(y − x, s) ≤ π/6, where s is a fixed direction. If y, y' ∈ C(x, s) and ‖x − y‖ < ‖x − y'‖, then ‖y − y'‖ < ‖x − y'‖. This follows from a simple geometric argument in the vector space spanned by x, y, and y' (see the proof of Lemma 5.3).

Now, let C_1, ..., C_{γ_d} be a collection of cones centered at x' with different central directions covering R^d. Then

  µ(B_a(x')) ≤ ∑_{i=1}^{γ_d} µ(C_i ∩ B_a(x')).

Let x* ∈ C_i ∩ B_a(x'). Then by the property of the cones mentioned above we have

  µ(C_i ∩ S_{x',‖x'−x*‖} ∩ B_a(x')) ≤ µ(S_{x*,‖x'−x*‖}) ≤ a,

where we use the fact that x* ∈ B_a(x'). Since x* is arbitrary,

  µ(C_i ∩ B_a(x')) ≤ a,

which completes the proof of the lemma. □

An immediate consequence of the lemma is that the number of points among X_1, ..., X_n such that X is one of their k nearest neighbors is not more than a constant times k.

Corollary 11.1.

  ∑_{i=1}^n I_{X is among the k NN's of X_i in {X_1,...,X_n,X}−{X_i}} ≤ kγ_d.

Proof. Apply Lemma 11.1 with a = k/n and let µ be the empirical measure µ_n of X_1, ..., X_n, that is, for each Borel set A ⊆ R^d, µ_n(A) = (1/n) ∑_{i=1}^n I_{X_i∈A}. □

Proof of Theorem 11.1. Since the decision rule g_n may be rewritten as

  g_n(x) = 0 if η_n(x) ≤ 1/2,
           1 otherwise,

where η_n is the corresponding regression function estimate

  η_n(x) = (1/k) ∑_{i=1}^k Y_(i)(x),

the statement follows from Theorem 2.2 if we show that for sufficiently large n,

  P{∫|η(x) − η_n(x)|µ(dx) > ε/2} ≤ 2e^{−nε²/(72γ_d²)}.

Define ρ_n(x) as the solution of the equation

  k/n = µ(S_{x,ρ_n(x)}).

Note that the absolute continuity of µ implies that the solution always exists. (This is the only point in the proof where we use this assumption.) Also define

  η*_n(x) = (1/k) ∑_{j=1}^n Y_j I_{‖X_j−x‖<ρ_n(x)}.

The basis of the proof is the following decomposition:

  |η(x) − η_n(x)| ≤ |η_n(x) − η*_n(x)| + |η*_n(x) − η(x)|.

For the first term on the right-hand side, observe that, denoting R_n(x) = ‖X_(k)(x) − x‖,

  |η*_n(x) − η_n(x)| = (1/k) |∑_{j=1}^n Y_j I_{X_j∈S_{x,ρ_n(x)}} − ∑_{j=1}^n Y_j I_{X_j∈S_{x,R_n(x)}}|
    ≤ (1/k) ∑_{j=1}^n |I_{X_j∈S_{x,ρ_n(x)}} − I_{X_j∈S_{x,R_n(x)}}|
    = |(1/k) ∑_{j=1}^n I_{X_j∈S_{x,ρ_n(x)}} − 1| = |η̄*_n(x) − η̄(x)|,

where η̄*_n is defined as η*_n with Y replaced by the constant random variable Y ≡ 1, and η̄ ≡ 1 is the corresponding regression function. Thus,

  |η(x) − η_n(x)| ≤ |η̄*_n(x) − η̄(x)| + |η*_n(x) − η(x)|.   (11.1)

First we show that the expected values of the integrals of both terms on the right-hand side converge to zero. Then we use McDiarmid's inequality to prove that both terms are very close to their expected values with large probability.

For the expected value of the first term on the right-hand side of (11.1), using the Cauchy-Schwarz inequality, we have

  E∫|η*_n(x) − η_n(x)|µ(dx) ≤ E∫|η̄*_n(x) − η̄(x)|µ(dx)
    ≤ ∫ √(E|η̄*_n(x) − η̄(x)|²) µ(dx)
    = ∫ √( (1/k²) n Var{I_{X∈S_{x,ρ_n(x)}}} ) µ(dx)
    ≤ ∫ √( (1/k²) n µ(S_{x,ρ_n(x)}) ) µ(dx)
    = ∫ √( (n/k²)(k/n) ) µ(dx)
    = 1/√k,

which converges to zero.

For the expected value of the second term on the right-hand side of (11.1), note that in the proof of Theorems 6.3 and 6.4 we already showed that

  lim_{n→∞} E∫|η(x) − η_n(x)|µ(dx) = 0.

Therefore,

  E∫|η*_n(x) − η(x)|µ(dx) ≤ E∫|η*_n(x) − η_n(x)|µ(dx) + E∫|η(x) − η_n(x)|µ(dx) → 0.

Assume now that n is so large that

  E∫|η*_n(x) − η(x)|µ(dx) + E∫|η̄*_n(x) − η̄(x)|µ(dx) < ε/6.


Then, by (11.1), we have

  P{∫|η(x) − η_n(x)|µ(dx) > ε/2}   (11.2)
    ≤ P{∫|η*_n(x) − η(x)|µ(dx) − E∫|η*_n(x) − η(x)|µ(dx) > ε/6}
      + P{∫|η̄*_n(x) − η̄(x)|µ(dx) − E∫|η̄*_n(x) − η̄(x)|µ(dx) > ε/6}.

Next we get an exponential bound for the first probability on the right-hand side of (11.2) by McDiarmid's inequality (Theorem 9.2). Fix an arbitrary realization of the data D_n = (x_1, y_1), ..., (x_n, y_n), and replace (x_i, y_i) by (x̂_i, ŷ_i), changing the value of η*_n(x) to η*_{ni}(x). Then

  |∫|η*_n(x) − η(x)|µ(dx) − ∫|η*_{ni}(x) − η(x)|µ(dx)| ≤ ∫|η*_n(x) − η*_{ni}(x)|µ(dx).

But |η*_n(x) − η*_{ni}(x)| is bounded by 2/k and can differ from zero only if ‖x − x_i‖ < ρ_n(x) or ‖x − x̂_i‖ < ρ_n(x). Observe that ‖x − x_i‖ < ρ_n(x) if and only if µ(S_{x,‖x−x_i‖}) < k/n. But the measure of such x's is bounded by γ_d k/n by Lemma 11.1. Therefore,

  sup_{x_1,y_1,...,x_n,y_n,x̂_i,ŷ_i} ∫|η*_n(x) − η*_{ni}(x)|µ(dx) ≤ (2/k)(γ_d k/n) = 2γ_d/n,

and by Theorem 9.2,

  P{ |∫|η(x) − η*_n(x)|µ(dx) − E∫|η(x) − η*_n(x)|µ(dx)| > ε/6 } ≤ 2e^{−nε²/(72γ_d²)}.

Finally, we need a bound for the second term on the right-hand side of (11.2). This probability may be bounded by McDiarmid's inequality in exactly the same way as for the first term, obtaining

  P{ |∫|η̄*_n(x) − η̄(x)|µ(dx) − E∫|η̄*_n(x) − η̄(x)|µ(dx)| > ε/6 } ≤ 2e^{−nε²/(72γ_d²)},

and the proof is completed. □

Remark. The conditions k →∞ and k/n→ 0 are optimal in the sense thatthey are also necessary for consistency for some distributions with a density.However, for some distributions they are not necessary for consistency,and in fact, keeping k = 1 for all n may be a better choice. This latterproperty, dealt with in Problem 11.1, shows that the 1-nearest neighborrule is admissible. 2
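To fix ideas, here is a minimal sketch (ours, not the book's) of the rule analyzed above: a brute-force $k$-nearest neighbor classifier written through the regression function estimate $\eta_n$. The function name and the synthetic data are illustrative only, and distance ties are ignored since they occur with probability zero when $X$ has a density.

    import numpy as np

    def knn_rule(x, X, Y, k):
        # eta_n(x) = (1/k) * sum of the labels of the k nearest neighbors of x;
        # the decision g_n(x) is 0 if eta_n(x) <= 1/2 and 1 otherwise.
        dist = np.linalg.norm(X - x, axis=1)
        nearest = np.argsort(dist)[:k]
        eta_n = Y[nearest].mean()
        return 0 if eta_n <= 0.5 else 1

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))                      # toy sample with a density
    Y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
    print(knn_rule(np.array([0.4, -0.1]), X, Y, k=25))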


11.2 Breaking Distance Ties

Theorem 11.1 provides strong consistency under the assumption that $X$ has a density. This assumption was needed to avoid problems caused by equal distances. Turning to the general case, we see that if $\mu$ does not have a density, then distance ties can occur with nonzero probability, so we have to deal with the problem of breaking them. To see that the density assumption cannot be relaxed to the condition that $\mu$ is merely nonatomic without facing frequent distance ties, consider the following distribution on $\mathcal{R}^d \times \mathcal{R}^{d'}$ with $d, d' \ge 2$:
\[
\mu = \frac{1}{2}\left(\tau_d \times \sigma_{d'}\right) + \frac{1}{2}\left(\sigma_d \times \tau_{d'}\right),
\]
where $\tau_d$ denotes the uniform distribution on the surface of the unit sphere of $\mathcal{R}^d$ and $\sigma_d$ denotes the unit point mass at the origin of $\mathcal{R}^d$. Observe that if $X$ has distribution $\tau_d \times \sigma_{d'}$ and $X'$ has distribution $\sigma_d \times \tau_{d'}$, then $\|X - X'\| = \sqrt{2}$. Hence, if $X_1, X_2, X_3, X_4$ are independent with distribution $\mu$, then $\mathbf{P}\{\|X_1 - X_2\| = \|X_3 - X_4\|\} = 1/4$.

Next we list some methods of breaking distance ties.

• Tie-breaking by indices: If $X_i$ and $X_j$ are equidistant from $x$, then $X_i$ is declared closer if $i < j$. This method has some undesirable properties. For example, if $X$ is monoatomic, with $\eta < 1/2$, then $X_1$ is the nearest neighbor of all $X_j$'s, $j > 1$, but $X_j$ is only the $(j-1)$-st nearest neighbor of $X_1$. The influence of $X_1$ in such a situation is too large, making the estimate very unstable and thus undesirable. In fact, in this monoatomic case, if $L^* + \varepsilon < 1/2$,
\[
\mathbf{P}\{L_n - L^* > \varepsilon\} \ge e^{-ck}
\]
for some $c > 0$ (see Problem 11.2). Thus, we cannot expect a distribution-free version of Theorem 11.1 with this tie-breaking method.

• Stone's tie-breaking: Stone (1977) introduced a version of the nearest neighbor rule, where the labels of the points having the same distance from $x$ as the $k$-th nearest neighbor are averaged. If we denote the distance of the $k$-th nearest neighbor to $x$ by $R_n(x)$, then Stone's rule is the following:
\[
g_n(x) = \begin{cases} 0 & \text{if } \displaystyle\sum_{i:\|x-x_i\|<R_n(x)} I_{\{Y_i=0\}} + \frac{k - |\{i: \|x-x_i\| < R_n(x)\}|}{|\{i: \|x-x_i\| = R_n(x)\}|}\sum_{i:\|x-x_i\|=R_n(x)} I_{\{Y_i=0\}} \ge \frac{k}{2} \\ 1 & \text{otherwise.} \end{cases}
\]
This is not a $k$-nearest neighbor rule in a strict sense, since this estimate, in general, uses more than $k$ neighbors. Stone (1977) proved weak universal consistency of this rule.


• Adding a random component: To circumvent the aforementioned difficulties, we may artificially increase the dimension of the feature vector by one. Define the $(d+1)$-dimensional random vectors
\[
X' = (X, U),\quad X_1' = (X_1, U_1),\ \ldots,\ X_n' = (X_n, U_n),
\]
where the randomizing variables $U, U_1, \ldots, U_n$ are real-valued i.i.d. random variables independent of $X$, $Y$, and $D_n$, and their common distribution has a density. Clearly, because of the independence of $U$, the Bayes error corresponding to the pair $(X', Y)$ is the same as that of $(X, Y)$. The algorithm performs the $k$-nearest neighbor rule on the modified data set
\[
D_n' = ((X_1', Y_1), \ldots, (X_n', Y_n)).
\]
It finds the $k$ nearest neighbors of $X'$, and uses a majority vote among these labels to guess $Y$. Since $U$ has a density and is independent of $X$, distance ties occur with zero probability. Strong universal consistency of this rule can be seen by observing that the proof of Theorem 11.1 used the existence of the density only in the definition of $\rho_n(x)$. With our randomization, $\rho_n(x)$ is well-defined, and the same proof yields strong universal consistency. Interestingly, this rule is consistent whenever $U$ has a density and is independent of $(X, Y)$. If, for example, the magnitude of $U$ is much larger than that of $\|X\|$, then the rule defined this way will differ significantly from the $k$-nearest neighbor rule, though it still preserves universal consistency. One should expect, however, a dramatic decrease in performance. Of course, if $U$ is very small, then the rule remains intuitively appealing. (A small illustrative sketch of this device is given at the end of this section.)

• Tie-breaking by randomization: There is another, perhaps more natural way of breaking ties via randomization. We assume that $(X, U)$ is a random vector independent of the data, where $U$ is independent of $X$ and uniformly distributed on $[0,1]$. We also artificially enlarge the data by introducing $U_1, U_2, \ldots, U_n$, where the $U_i$'s are i.i.d. uniform $[0,1]$ as well. Thus, each $(X_i, U_i)$ is distributed as $(X, U)$. Let
\[
(X_{(1)}(x,u), Y_{(1)}(x,u)), \ldots, (X_{(n)}(x,u), Y_{(n)}(x,u))
\]
be a reordering of the data according to increasing values of $\|x - X_i\|$. In case of distance ties, we declare $(X_i, U_i)$ closer to $(x, u)$ than $(X_j, U_j)$ provided that
\[
|U_i - u| \le |U_j - u|.
\]
Define the $k$-nn classification rule as
\[
g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^k I_{\{Y_{(i)}(x,u)=1\}} \le \sum_{i=1}^k I_{\{Y_{(i)}(x,u)=0\}} \\ 1 & \text{otherwise,} \end{cases} \qquad (11.3)
\]


and denote the error probability of $g_n$ by
\[
L_n = \mathbf{P}\{g_n(X,U) \neq Y \mid X_1, U_1, Y_1, \ldots, X_n, U_n, Y_n\}.
\]
Devroye, Gyorfi, Krzyzak, and Lugosi (1994) proved that $L_n \to L^*$ with probability one for all distributions if $k \to \infty$ and $k/n \to 0$. The basic argument in that paper is the same as that of Theorem 11.1, except that the covering lemma (Lemma 11.1) has to be appropriately modified.

It should be stressed again that if $\mu$ has a density, or just has an absolutely continuous component, then tie-breaking is needed with zero probability and is therefore irrelevant.
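As a concrete illustration of the tie-breaking device that adds a random component, here is a small sketch of ours: an independent uniform coordinate is appended to every feature vector, after which ties among the augmented distances occur with probability zero. The scaling parameter is our own addition; as noted above, a large $U$ degrades performance, while a small $U$ leaves the rule essentially unchanged.

    import numpy as np

    def knn_with_random_coordinate(x, X, Y, k, rng, scale=1e-6):
        # Append U, U_1, ..., U_n (i.i.d. with a density, independent of the data)
        # as an extra coordinate and run the ordinary k-NN rule in dimension d+1.
        n = len(X)
        U = scale * rng.uniform(size=n)
        u = scale * rng.uniform()
        Xp = np.hstack([X, U[:, None]])          # the augmented points (X_i, U_i)
        xp = np.append(x, u)                     # the augmented query (x, U)
        dist = np.linalg.norm(Xp - xp, axis=1)
        nearest = np.argsort(dist)[:k]
        return 0 if Y[nearest].mean() <= 0.5 else 1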

11.3 Recursive Methods

To find the nearest neighbor of a point $x$ among $X_1, \ldots, X_n$, we may preprocess the data in $O(n\log n)$ time, such that each query may be answered in $O(\log n)$ worst-case time (see, for example, Preparata and Shamos (1985)). Other recent developments in computational geometry have made the nearest neighbor rules computationally feasible even when $n$ is formidable. Without preprocessing, however, one must resort to slow methods. If we need to find a decision at $x$ and want to process the data file once when doing so, a simple rule was proposed by Devroye and Wise (1980). It is a fully recursive rule that may be updated as more observations become available.

Split the data sequence $D_n$ into disjoint blocks of lengths $l_1, \ldots, l_N$, where $l_1, \ldots, l_N$ are positive integers satisfying $\sum_{i=1}^N l_i = n$. In each block find the nearest neighbor of $x$, and denote the nearest neighbor of $x$ from the $i$-th block by $X_i^*(x)$. Let $Y_i^*(x)$ be the corresponding label. Ties are broken by comparing indices. The classification rule is defined as a majority vote among the nearest neighbors from each block:
\[
g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^N I_{\{Y_i^*(x)=1\}} \le \sum_{i=1}^N I_{\{Y_i^*(x)=0\}} \\ 1 & \text{otherwise.} \end{cases}
\]
Note that we have only defined the rule $g_n$ for $n$ satisfying $\sum_{i=1}^N l_i = n$ for some $N$. A possible extension for all $n$'s is given by $g_n(x) = g_m(x)$, where $m$ is the largest integer not exceeding $n$ that can be written as $\sum_{i=1}^N l_i$ for some $N$. The rule is weakly universally consistent if
\[
\lim_{N\to\infty} l_N = \infty
\]
(Devroye and Wise (1980), see Problem 11.3).
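A minimal sketch of this block rule follows. The code is ours; the block lengths are supplied by the caller, and the one-pass character of the rule is reflected in the single loop over blocks.

    import numpy as np

    def block_nn_rule(x, X, Y, block_lengths):
        # One pass over the data: within each block of length l_i find the
        # nearest neighbor of x, then take a majority vote over the N labels.
        votes = []
        start = 0
        for l in block_lengths:
            Xb, Yb = X[start:start + l], Y[start:start + l]
            j = np.argmin(np.linalg.norm(Xb - x, axis=1))
            votes.append(Yb[j])
            start += l
        return 0 if sum(votes) <= len(votes) / 2 else 1

    # e.g., blocks of growing length l_i = i (so that l_N -> infinity),
    # with sum l_1 + ... + l_N equal to the sample size n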


11.4 Scale-Invariant Rules

A scale-invariant rule is a rule that is invariant under rescalings of the components. It is motivated by the lack of a universal yardstick when components of a vector represent physically different quantities, such as temperature, blood pressure, alcohol, and the number of lost teeth. More formally, let $x^{(1)}, \ldots, x^{(d)}$ be the $d$ components of a vector $x$. If $\psi_1, \ldots, \psi_d$ are strictly monotone mappings $\mathcal{R} \to \mathcal{R}$, and if we define
\[
\psi(x) = \left(\psi_1(x^{(1)}), \ldots, \psi_d(x^{(d)})\right),
\]
then $g_n$ is scale-invariant if
\[
g_n(x, D_n) = g_n(\psi(x), D_n'),
\]
where $D_n' = ((\psi(X_1), Y_1), \ldots, (\psi(X_n), Y_n))$. In other words, if all the $X_i$'s and $x$ are transformed in the same manner, the decision does not change.

Some rules based on statistically equivalent blocks (discussed in Chapters 21 and 22) are scale-invariant, while the $k$-nearest neighbor rule clearly is not. Here we describe a scale-invariant modification of the $k$-nearest neighbor rule, suggested by Olshen (1977) and Devroye (1978).

The scale-invariant $k$-nearest neighbor rule is based upon empirical distances that are defined in terms of the order statistics along the $d$ coordinate axes. First order the points $x, X_1, \ldots, X_n$ according to increasing values of their first components $x^{(1)}, X_1^{(1)}, \ldots, X_n^{(1)}$, breaking ties via randomization. Denote the rank of $X_i^{(1)}$ by $r_i^{(1)}$, and the rank of $x^{(1)}$ by $r^{(1)}$. Repeating the same procedure for the other coordinates, we obtain the ranks
\[
r_i^{(j)},\ r^{(j)}, \qquad j = 1, \ldots, d,\ i = 1, \ldots, n.
\]
Define the empirical distance between $x$ and $X_i$ by
\[
\rho(x, X_i) = \max_{1 \le j \le d} |r_i^{(j)} - r^{(j)}|.
\]
A $k$-nn rule can be defined based on these distances, by a majority vote among the $Y_i$'s corresponding to those $X_i$'s whose empirical distance from $x$ is among the $k$ smallest. Since these distances are integer-valued, ties frequently occur. These ties should be broken by randomization. Devroye (1978) proved that this rule (with randomized tie-breaking) is weakly universally consistent when $k \to \infty$ and $k/n \to 0$ (see Problem 11.5). For another consistent scale-invariant nearest neighbor rule we refer to Problem 11.6.
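The rank-based distance is easy to compute. The sketch below is ours and, to stay short, breaks rank ties and distance ties by index order rather than by the randomization required for the consistency result above. Applying a strictly increasing map to any coordinate leaves all ranks, and hence the decision, unchanged, which is exactly the scale-invariance described earlier.

    import numpy as np

    def scale_invariant_knn(x, X, Y, k):
        # Rank x together with X_1,...,X_n separately along each coordinate;
        # rho(x, X_i) is the largest coordinatewise rank difference.
        Z = np.vstack([X, x])                          # x is the last row
        ranks = Z.argsort(axis=0).argsort(axis=0)      # per-column ranks (0-based)
        rho = np.max(np.abs(ranks[:-1] - ranks[-1]), axis=1)
        nearest = np.argsort(rho)[:k]                  # ties broken by index here
        return 0 if Y[nearest].mean() <= 0.5 else 1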


Figure 11.1. Scale-invariant distances of 15 points from a fixed point are shown here.

11.5 Weighted Nearest Neighbor Rules

In the $k$-nn rule, each of the $k$ nearest neighbors of a point $x$ plays an equally important role in the decision. However, intuitively speaking, nearer neighbors should provide more information than more distant ones. Royall (1966) first suggested using rules in which the labels $Y_i$ are given unequal voting powers in the decision according to the distances of the $X_i$'s from $x$: the $i$-th nearest neighbor receives weight $w_{ni}$, where usually $w_{n1} \ge w_{n2} \ge \cdots \ge w_{nn} \ge 0$ and $\sum_{i=1}^n w_{ni} = 1$. The rule is defined as
\[
g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n w_{ni} I_{\{Y_{(i)}(x)=1\}} \le \sum_{i=1}^n w_{ni} I_{\{Y_{(i)}(x)=0\}} \\ 1 & \text{otherwise.} \end{cases}
\]
We get the ordinary $k$-nearest neighbor rule back by the choice
\[
w_{ni} = \begin{cases} 1/k & \text{if } i \le k \\ 0 & \text{otherwise.} \end{cases}
\]
The following conditions for consistency were established by Stone (1977):
\[
\lim_{n\to\infty} \max_{1\le i\le n} w_{ni} = 0,
\]
and
\[
\lim_{n\to\infty} \sum_{k\le i\le n} w_{ni} = 0
\]
for some $k$ with $k/n \to 0$ (see Problem 11.7). Weighted versions of the recursive and scale-invariant methods described above can also be defined similarly.
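The weighted vote is a one-line change to the $k$-nn rule. The sketch below is ours, with the weight vector supplied by the caller; it should be nonincreasing, nonnegative, and sum to one.

    import numpy as np

    def weighted_nn_rule(x, X, Y, w):
        # w[i-1] is the voting weight w_{ni} of the i-th nearest neighbor of x.
        order = np.argsort(np.linalg.norm(X - x, axis=1))
        Ys = Y[order]                    # labels sorted by increasing distance
        vote1 = np.sum(w * Ys)           # total weight voting for label 1
        vote0 = np.sum(w * (1 - Ys))     # total weight voting for label 0
        return 0 if vote1 <= vote0 else 1

    # the ordinary k-NN rule corresponds to
    # w = np.r_[np.ones(k) / k, np.zeros(n - k)]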

11.6 Rotation-Invariant Rules

Assume that an affine transformation $T$ is applied to $x$ and $X_1, \ldots, X_n$ (i.e., any number of combinations of rotations, translations, and linear rescalings), and that for any such transformation $T$,
\[
g_n(x, D_n) = g_n(T(x), D_n'),
\]
where $D_n' = ((T(X_1), Y_1), \ldots, (T(X_n), Y_n))$. Then we call $g_n$ rotation-invariant. Rotation-invariance is indeed a very strong property. In $\mathcal{R}^d$, in the context of $k$-nn estimates, it suffices to be able to define a rotation-invariant distance measure. These are necessarily data-dependent. An example of this goes as follows. Any collection of $d$ points in general position defines a polyhedron in a hyperplane of $\mathcal{R}^d$. For points $(X_{i_1}, \ldots, X_{i_d})$, we denote this polyhedron by $P(i_1, \ldots, i_d)$. Then we define the distance
\[
\rho(X_i, x) = \sum_{i_1,\ldots,i_d:\ i \notin \{i_1,\ldots,i_d\}} I_{\{\text{segment } (X_i, x) \text{ intersects } P(i_1,\ldots,i_d)\}}.
\]
Near points have few intersections. Using $\rho(\cdot,\cdot)$ in a $k$-nn rule with $k \to \infty$ and $k/n \to 0$, we expect weak universal consistency under an appropriate scheme of tie-breaking. The answer to this is left as an open problem for the scholars.

Figure 11.2. Rotation-invariant distances from x.

11.7 Relabeling Rules

The 1-nn rule appeals to the masses who crave simplicity and attracts the programmers who want to write short, understandable code. Nearest neighbors may be found efficiently if the data are preprocessed (see Chapter 5 for references). Can we make the 1-nn rule universally consistent as well? In this section we introduce a tool called relabeling, which works as follows. Assume that we have a classification rule $g_n(x, D_n)$, $x \in \mathcal{R}^d$, $n \ge 1$, where $D_n$ is the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. This rule will be called the ancestral rule. Define the labels
\[
Z_i = g_n(X_i, D_n), \qquad 1 \le i \le n.
\]
These are the decisions for the $X_i$'s themselves obtained by mere resubstitution. In the relabeling method, we apply the 1-nn rule to the new data $(X_1, Z_1), \ldots, (X_n, Z_n)$. If all goes well, when the ancestral rule $g_n$ is universally consistent, so should be the relabeling rule. We will show this by example, starting from a consistent $k$-nn rule as ancestral rule (with $k \to \infty$, $k/n \to 0$).

Unfortunately, relabeling rules do not always inherit consistency from their ancestral rules, so a more general theorem is more difficult to obtain unless one adds in a lot of regularity conditions; this does not seem to be the right time for that sort of effort. To see that universal consistency of the ancestral rule does not imply consistency of the relabeling rule, consider the following rule $h_n$:
\[
h_n(x) = \begin{cases} 1 - Y_i & \text{if } x = X_i \text{ and } x \neq X_j \text{ for all } j \neq i \\ g_n(x, D_n) & \text{otherwise,} \end{cases}
\]
where $g_n$ is a weakly universally consistent rule. It is easy to show (see Problem 11.15) that $h_n$ is universally consistent as well. Changing a rule on a set of measure zero indeed does not affect $L_n$. Also, if $x = X_i$ is at an atom of the distribution of $X$, we only change $g_n$ to $1 - Y_i$ if $X_i$ is the sole occurrence of that atom in the data. This has asymptotically no impact on $L_n$. However, if $h_n$ is used as an ancestral rule, and $X$ is nonatomic, then $h_n(X_i, D_n) = 1 - Y_i$ for all $i$, and therefore the relabeling rule is a 1-nn rule based on the data $(X_1, 1 - Y_1), \ldots, (X_n, 1 - Y_n)$. If $L^* = 0$ for the distribution of $(X, Y)$, then the relabeling rule has probability of error converging to one!

For most nonpathological ancestral rules, relabeling does indeed preserve universal consistency. We offer a prototype proof for the $k$-nn rule.

Theorem 11.2. Let $g_n$ be the $k$-nn rule in which tie-breaking is done by randomization as in (11.3). Assume that $k \to \infty$ and $k/n \to 0$ (so that $g_n$ is weakly universally consistent). Then the relabeling rule based upon $g_n$ is weakly universally consistent as well.

Proof. We verify the conditions of Stone's weak convergence theorem (see Theorem 6.3). To keep things simple, we assume that the distribution of $X$ has a density so that distance ties happen with probability zero.


In our case, the weight $W_{ni}(X)$ of Theorem 6.3 equals $1/k$ if and only if $X_i$ is among the $k$ nearest neighbors of $X_{(1)}(X)$, where $X_{(1)}(X)$ is the nearest neighbor of $X$. It is zero otherwise.

Condition (iii) of Theorem 6.3 is trivially satisfied since $k \to \infty$. For condition (ii), we note that if $X_{(i)}(x)$ denotes the $i$-th nearest neighbor of $x$ among $X_1, \ldots, X_n$, then
\[
\|X_{(1)}(x) - X_{(k)}(X_{(1)}(x))\| \le 2\|x - X_{(k)}(x)\|.
\]
Just note that $S_{X_{(1)}(x),\,2\|x - X_{(k)}(x)\|} \supseteq S_{x,\,\|x - X_{(k)}(x)\|}$ and that the latter sphere contains $k$ data points. But we already know from the proof of weak consistency of the $k$-nn rule that if $k/n \to 0$, then for all $\varepsilon > 0$,
\[
\mathbf{P}\left\{\|X_{(k)}(X) - X\| > \varepsilon\right\} \to 0.
\]

Finally, we consider condition (i) of Theorem 6.3. Here we have, arguing partially as in Stone (1977),
\[
E\left\{\sum_{i=1}^n \frac{1}{k} I_{\{X_i \text{ is among the } k \text{ NN's of } X_{(1)}(X)\}} f(X_i)\right\}
= E\left\{\frac{1}{k}\sum_{i=1}^n I_{\{X \text{ is among the } k \text{ NN's of } X_{(1)}(X_i) \text{ in } \{X_1,\ldots,X_n,X\}-\{X_i\}\}} f(X)\right\}
\]
(reverse the roles of $X_i$ and $X$)
\[
= E\left\{\frac{1}{k}\sum_{i=1}^n \sum_{j=1}^n I_{\{X_j \text{ is the NN of } X_i \text{ in } \{X_1,\ldots,X_n,X\}-\{X_i\}\}}\cdot I_{\{X \text{ is among the } k \text{ NN's of } X_j \text{ in } \{X_1,\ldots,X_n,X\}-\{X_i\}\}} f(X)\right\}.
\]
However, by Lemma 11.1,
\[
\sum_{i=1}^n I_{\{X_j \text{ is the NN of } X_i \text{ in } \{X_1,\ldots,X_n,X\}-\{X_i\}\}} \le \gamma_d.
\]
Also,
\[
\sum_{j=1}^n I_{\{X \text{ is among the } k \text{ NN's of } X_j \text{ in } \{X_1,\ldots,X_n,X\}-\{X_i\}\}} \le k\gamma_d.
\]
Therefore, by a double application of Lemma 11.1,
\[
E\left\{\sum_{i=1}^n \frac{1}{k} I_{\{X_i \text{ is among the } k \text{ NN's of } X_{(1)}(X)\}} f(X_i)\right\} \le \gamma_d^2\,E\{f(X)\},
\]
and condition (i) of Theorem 6.3 is verified. $\Box$


Problems and Exercises

Problem 11.1. Show that the conditions $k \to \infty$ and $k/n \to 0$ are necessary for universal consistency of the $k$-nearest neighbor rule. That is, exhibit a distribution such that if $k$ remains bounded, $\liminf_{n\to\infty} E\{L_n\} > L^*$. Exhibit a second distribution such that if $k/n \ge \varepsilon > 0$ for all $n$, and some $\varepsilon$, then $\liminf_{n\to\infty} E\{L_n\} > L^*$.

Problem 11.2. Let $X$ be monoatomic, with $\eta < 1/2$. Show that for $\varepsilon < 1/2 - \eta$,
\[
\mathbf{P}\{L_n - L^* > \varepsilon\} \ge e^{-ck}
\]
for some $c > 0$, where $L_n$ is the error probability of the $k$-nn rule with tie-breaking by indices.

Problem 11.3. Prove that the recursive nearest neighbor rule is universally consistent provided that $\lim_{N\to\infty} l_N = \infty$ (Devroye and Wise (1980)). Hint: Check the conditions of Theorem 6.3.

Problem 11.4. Prove that the nearest neighbor rule defined by any $L_p$-distance measure, $0 < p \le \infty$, is universally consistent under the usual conditions on $k$. The $L_p$-distance between $x, y \in \mathcal{R}^d$ is defined by $\left(\sum_{i=1}^d |x^{(i)} - y^{(i)}|^p\right)^{1/p}$ for $0 < p < \infty$, and by $\sup_i |x^{(i)} - y^{(i)}|$ for $p = \infty$, where $x = (x^{(1)},\ldots,x^{(d)})$. Hint: Check the conditions of Theorem 6.3.

Problem 11.5. Let $\sigma(x, z, X_i, Z_i) = \rho(x, X_i) + |Z_i - z|$ be a generalized distance between $(x, z)$ and $(X_i, Z_i)$, where $x, X_1, \ldots, X_n$ are as in the description of the scale-invariant $k$-nn rule, $\rho(x, X_i)$ is the empirical distance defined there, and $z, Z_i \in [0,1]$ are real numbers added to break ties at random. The sequence $Z_1, \ldots, Z_n$ is i.i.d. uniform $[0,1]$ and is independent of the data $D_n$. With the $k$-nn rule based on the artificial distances $\sigma$, show that the rule is universally consistent by verifying the conditions of Theorem 6.3, when $k \to \infty$ and $k/n \to 0$. In particular, show first that if $Z$ is uniform $[0,1]$ and independent of $X, Y, D_n$ and $Z_1, \ldots, Z_n$, and if $W_{ni}(X, Z)$ is the weight of $(X_i, Z_i)$ in this $k$-nn rule (i.e., it is $1/k$ if and only if $(X_i, Z_i)$ is among the $k$ nearest neighbors of $(X, Z)$ according to $\sigma$), then

(1) $E\left\{\sum_{i=1}^n W_{ni}(X, Z) f(X_i)\right\} \le 2d\,E\{f(X)\}$ for all nonnegative measurable $f$ with $E\{f(X)\} < \infty$.

(2) If $k/n \to 0$, then $E\left\{\sum_{i=1}^n W_{ni}(X, Z) I_{\{\|X_i - X\| > a\}}\right\} \to 0$ for all $a > 0$.

Hint: Check the conditions of Theorem 6.3 (Devroye (1978)).

Problem 11.6. The layered nearest neighbor rule partitions the space at $x$ into $2^d$ quadrants. In each quadrant, the outer-layer points are marked, that is, those $X_i$ for which the hyperrectangle defined by $x$ and $X_i$ contains no other data point. Then it takes a majority vote over the $Y_i$'s for the marked points. Observe that this rule is scale-invariant. Show that whenever $X$ has nonatomic marginals (to avoid ties), $L_n \to L^*$ in probability.

Figure 11.3. The layered nearest neighbor rule takes a majority vote over the marked points.

Hint: It suffices to show that the number of marked points increases unboundedly in probability, and that its proportion to unmarked points tends to zero in probability.

Problem 11.7. Prove weak universal consistency of the weighted nearest neighbor rule if the weights satisfy
\[
\lim_{n\to\infty} \max_{1\le i\le n} w_{ni} = 0,
\]
and
\[
\lim_{n\to\infty} \sum_{k\le i\le n} w_{ni} = 0
\]
for some $k$ with $k/n \to 0$ (Stone (1977)). Hint: Check the conditions of Theorem 6.3.

Problem 11.8. If $(w_{n1},\ldots,w_{nn})$ is a probability vector, then $\lim_{n\to\infty}\sum_{i > n\delta} w_{ni} = 0$ for all $\delta > 0$ if and only if there exists a sequence of integers $k = k_n$ such that $k = o(n)$, $k \to \infty$, and $\sum_{i > k} w_{ni} = o(1)$. Show this. Conclude that the conditions of Problem 11.7 are equivalent to

(i) $\lim_{n\to\infty} \max_{1\le i\le n} w_{ni} = 0$,

(ii) $\lim_{n\to\infty} \sum_{i > n\delta} w_{ni} = 0$ for all $\delta > 0$.

Problem 11.9. Verify the conditions of Problems 11.7 and 11.8 for weight vectors of the form $w_{ni} = c_n/i^{\alpha}$, where $\alpha > 0$ is a constant and $c_n$ is a normalizing constant. In particular, check that they do not hold for $\alpha > 1$, but that they do hold for $0 < \alpha \le 1$.

Problem 11.10. Consider
\[
w_{ni} = \frac{\rho_n}{(1+\rho_n)^i}\cdot\frac{1}{1 - (1+\rho_n)^{-n}}, \qquad 1 \le i \le n,
\]
as weight vector. Show that the conditions of Problem 11.8 hold if $\rho_n \to 0$ and $n\rho_n \to \infty$.


Problem 11.11. Let $w_{ni} = \mathbf{P}\{Z = i\}/\mathbf{P}\{Z \le n\}$, $1 \le i \le n$, where $Z$ is a Poisson($\lambda_n$) random variable. Show that there is no choice of $\lambda_n$ such that $\{w_{ni}, 1 \le i \le n\}$ is a consistent weight sequence following the conditions of Problem 11.8.

Problem 11.12. Let $w_{ni} = \mathbf{P}\{\mathrm{Binomial}(n,p_n) = i\} = \binom{n}{i} p_n^i (1-p_n)^{n-i}$. Derive conditions on $p_n$ for this choice of weight sequence to be consistent in the sense of Problem 11.8.

Problem 11.13. k-nn density estimation. We recall from Problem 2.11 that if the conditional densities $f_0, f_1$ exist, then $L_1$-consistent density estimation leads to a consistent classification rule. Consider now the $k$-nearest neighbor density estimate introduced by Loftsgaarden and Quesenberry (1965). Let $X_1,\ldots,X_n$ be independent, identically distributed random variables in $\mathcal{R}^d$, with common density $f$. The $k$-nn estimate of $f$ is defined by
\[
f_n(x) = \frac{k}{\lambda\left(S_{x,\|x - X_{(k)}(x)\|}\right)},
\]
where $X_{(k)}(x)$ is the $k$-th nearest neighbor of $x$ among $X_1,\ldots,X_n$. Show that for every $n$, $\int|f(x) - f_n(x)|\,dx = \infty$, so that the density estimate is never consistent in $L_1$. On the other hand, according to Problem 11.14, the corresponding classification rule is consistent, so read on.

Problem 11.14. Assume that the conditional densities $f_0$ and $f_1$ exist. Then we can use a rule suggested by Patrick and Fischer (1970): let $N_0 = \sum_{i=1}^n I_{\{Y_i=0\}}$ and $N_1 = n - N_0$ be the number of zeros and ones in the training data. Denote by $X_0^{(k)}(x)$ the $k$-th nearest neighbor of $x$ among the $X_i$'s with $Y_i = 0$. Define $X_1^{(k)}(x)$ similarly. If $\lambda(A)$ denotes the volume of a set $A \subset \mathcal{R}^d$, then the rule is defined as
\[
g_n(x) = \begin{cases} 0 & \text{if } N_0\big/\lambda\left(S_{x,\|x - X_0^{(k)}(x)\|}\right) \ge N_1\big/\lambda\left(S_{x,\|x - X_1^{(k)}(x)\|}\right) \\ 1 & \text{otherwise.} \end{cases}
\]
This estimate is based on the $k$-nearest neighbor density estimate introduced by Loftsgaarden and Quesenberry (1965). Their estimate of $f_0$ is
\[
f_{0,n}(x) = \frac{k}{\lambda\left(S_{x,\|x - X_0^{(k)}(x)\|}\right)}.
\]
Then the rule $g_n$ can be rewritten as
\[
g_n(x) = \begin{cases} 0 & \text{if } p_{0,n} f_{0,n} \ge p_{1,n} f_{1,n} \\ 1 & \text{otherwise,} \end{cases}
\]
where $p_{0,n} = N_0/n$ and $p_{1,n} = N_1/n$ are the obvious estimates of the class probabilities. Show that $g_n$ is weakly consistent if $k \to \infty$ and $k/n \to 0$, whenever the conditional densities exist.


Problem 11.15. Consider a weakly universally consistent rule $g_n$, and define the rule
\[
h_n(x) = \begin{cases} 1 - Y_i & \text{if } x = X_i \text{ and } x \neq X_j \text{ for all } j \neq i \\ g_n(x, D_n) & \text{otherwise.} \end{cases}
\]
Show that $h_n$ too is weakly universally consistent. Note: it is the atomic (or partially atomic) distributions of $X$ that make this exercise interesting. Hint: The next exercise may help.

Problem 11.16. Let $X$ have an atomic distribution which puts probability $p_i$ at atom $i$. Let $X_1,\ldots,X_n$ be an i.i.d. sample drawn from this distribution. Then show the following.

(1) $\mathbf{P}\{|N - E\{N\}| > \varepsilon\} \le 2e^{-2\varepsilon^2/n}$, where $\varepsilon > 0$, and $N$ is the number of "occupied" atoms (the number of different values in the data sequence).

(2) $E\{N\}/n \to 0$.

(3) $N/n \to 0$ almost surely.

(4) $\sum_{i: X_j \neq i \text{ for all } j \le n} p_i \to 0$ almost surely.

Problem 11.17. Royall’s rule. Royall (1966) proposes the regression functionestimate

ηn(x) =1

nhn

bnhnc∑i=1

J(

i

nhn

)Y(i)(x),

where J(u) is a smooth kernel-like function on [0, 1] with∫ 1

0J(u)du = 1 and

hn > 0 is a smoothing factor. Suggestions included

(i) J(u) ≡ 1;

(ii) J(u) =(d+ 2)2

4

(1− d+ 4

d+ 2u2/d

)(note that this function becomes negative).

Define

gn(x) =

1 if ηn(x) > 1/20 otherwise.

Assume that hn → 0, nhn →∞. Derive sufficient conditions on J that guaranteethe weak universal consistency of Royall’s rule. In particular, insure that choice(ii) is weakly universally consistent. Hint: Try adding an appropriate smoothnesscondition to J .

Problem 11.18. Let $K$ be a kernel and let $R_n(x)$ denote the distance between $x$ and its $k$-th nearest neighbor, $X_{(k)}(x)$, among $X_1,\ldots,X_n$. The discrimination rule that corresponds to a kernel-type nearest neighbor regression function estimate of Mack (1981) is
\[
g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n (2Y_i - 1) K\!\left(\dfrac{x - X_i}{R_n(x)}\right) \le 0 \\ 1 & \text{otherwise.} \end{cases}
\]
(The idea of replacing the smoothing factor in the kernel estimate by a local rank-based value such as $R_n(x)$ is due to Breiman, Meisel, and Purcell (1977).) For the kernel $K = I_{S_{0,1}}$, this rule coincides with the $k$-nn rule. For regular kernels (see Chapter 10), show that the rule remains weakly universally consistent whenever $k \to \infty$ and $k/n \to 0$ by verifying Stone's conditions of Theorem 6.3.

Problem 11.19. Let $g_n$ be a weakly universally consistent sequence of classifiers. Split the data sequence $D_n$ into two parts: $D_m = ((X_1,Y_1),\ldots,(X_m,Y_m))$ and $T_{n-m} = ((X_{m+1},Y_{m+1}),\ldots,(X_n,Y_n))$. Use $g_{n-m}$ and the second part $T_{n-m}$ to relabel the first part, i.e., define $Y_i' = g_{n-m}(X_i, T_{n-m})$ for $i = 1,\ldots,m$. Prove that the 1-nn rule based on the data $(X_1,Y_1'),\ldots,(X_m,Y_m')$ is weakly universally consistent whenever $m \to \infty$ and $n - m \to \infty$. Hint: Use Problem 5.40.

Problem 11.20. Consider the $k$-nn rule with a fixed $k$ as the ancestral rule, and apply the 1-nn rule using the relabeled data. Investigate the convergence of $E\{L_n\}$. Is the limit $L_{k\mathrm{NN}}$ or something else?


12 Vapnik-Chervonenkis Theory

12.1 Empirical Error Minimization

In this chapter we select a decision rule from a class of rules with the help of training data. Working formally, let $\mathcal{C}$ be a class of functions $\phi: \mathcal{R}^d \to \{0,1\}$. One wishes to select a function from $\mathcal{C}$ with small error probability. Assume that the training data $D_n = ((X_1,Y_1),\ldots,(X_n,Y_n))$ are given to pick one of the functions from $\mathcal{C}$ to be used as a classifier. Perhaps the most natural way of selecting a function is to minimize the empirical error probability
\[
L_n(\phi) = \frac{1}{n}\sum_{i=1}^n I_{\{\phi(X_i) \neq Y_i\}}
\]
over the class $\mathcal{C}$. Denote the empirically optimal rule by $\phi_n^*$:
\[
\phi_n^* = \arg\min_{\phi\in\mathcal{C}} L_n(\phi).
\]
Thus, $\phi_n^*$ is the classifier that, according to the data $D_n$, "looks best" among the classifiers in $\mathcal{C}$. This idea of minimizing the empirical risk in the construction of a rule was developed to great extent by Vapnik and Chervonenkis (1971; 1974c; 1974a; 1974b).
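For a finite class, empirical error minimization is a direct search. The following sketch is ours; the class of decision stumps thresholding the first coordinate is only a placeholder for $\mathcal{C}$. Which minimizer is returned when several classifiers tie is irrelevant for the bounds derived in this chapter.

    import numpy as np

    def empirical_error(phi, X, Y):
        # L_n(phi) = (1/n) * number of training points on which phi errs
        return np.mean(phi(X) != Y)

    def empirical_minimizer(C, X, Y):
        errors = [empirical_error(phi, X, Y) for phi in C]
        return C[int(np.argmin(errors))]       # some phi_n^* attaining the minimum

    # placeholder class C: stumps thresholding the first coordinate
    thresholds = np.linspace(-2.0, 2.0, 41)
    C = [lambda X, t=t: (X[:, 0] > t).astype(int) for t in thresholds]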

Intuitively, the selected classifier $\phi_n^*$ should be good in the sense that its true error probability $L(\phi_n^*) = \mathbf{P}\{\phi_n^*(X) \neq Y \mid D_n\}$ is expected to be close to the optimal error probability within the class. Their difference is the


quantity that primarily interests us in this chapter:
\[
L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi).
\]
The latter difference may be bounded in a distribution-free manner, and a rate of convergence results that depends only on the structure of $\mathcal{C}$. While this is very exciting, we must add that $L(\phi_n^*)$ may be far away from the Bayes error $L^*$. Note that
\[
L(\phi_n^*) - L^* = \left(L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi)\right) + \left(\inf_{\phi\in\mathcal{C}} L(\phi) - L^*\right).
\]
The size of $\mathcal{C}$ is a compromise: when $\mathcal{C}$ is large, $\inf_{\phi\in\mathcal{C}} L(\phi)$ may be close to $L^*$, but the former error, the estimation error, is probably large as well. If $\mathcal{C}$ is too small, there is no hope of making the approximation error $\inf_{\phi\in\mathcal{C}} L(\phi) - L^*$ small. For example, if $\mathcal{C}$ is the class of all (measurable) decision functions, then we can always find a classifier in $\mathcal{C}$ with zero empirical error, but it may have arbitrary values outside of the points $X_1,\ldots,X_n$. For example, an empirically optimal classifier is
\[
\phi_n^*(x) = \begin{cases} Y_i & \text{if } x = X_i,\ i = 1,\ldots,n \\ 0 & \text{otherwise.} \end{cases}
\]
This is clearly not what we are looking for. This phenomenon is called overfitting, as the overly large class $\mathcal{C}$ overfits the data. We will give precise conditions on $\mathcal{C}$ that allow us to avoid this anomaly. The choice of $\mathcal{C}$ such that $\inf_{\phi\in\mathcal{C}} L(\phi)$ is close to $L^*$ has been the subject of various chapters on consistency; just assume that $\mathcal{C}$ is allowed to grow with $n$ in some manner. Here we take the point of view that $\mathcal{C}$ is fixed, and that we have to live with the functions in $\mathcal{C}$. The best we may then hope for is to minimize $L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi)$. A typical situation is shown in Figure 12.1.


Figure 12.1. Various errors in empirical classifier selection.

Consider first a finite collection $\mathcal{C}$, and assume that one of the classifiers in $\mathcal{C}$ has zero error probability, that is, $\min_{\phi\in\mathcal{C}} L(\phi) = 0$. Then clearly, $L_n(\phi_n^*) = 0$ with probability one. We then have the following performance bound:

Theorem 12.1. (Vapnik and Chervonenkis (1974c)). Assume $|\mathcal{C}| < \infty$ and $\min_{\phi\in\mathcal{C}} L(\phi) = 0$. Then for every $n$ and $\varepsilon > 0$,
\[
\mathbf{P}\{L(\phi_n^*) > \varepsilon\} \le |\mathcal{C}| e^{-n\varepsilon},
\]
and
\[
E\{L(\phi_n^*)\} \le \frac{1 + \log|\mathcal{C}|}{n}.
\]

Proof. Clearly,
\[
\mathbf{P}\{L(\phi_n^*) > \varepsilon\} \le \mathbf{P}\left\{\max_{\phi\in\mathcal{C}:\,L_n(\phi)=0} L(\phi) > \varepsilon\right\}
= E\left\{I_{\{\max_{\phi\in\mathcal{C}:L_n(\phi)=0} L(\phi) > \varepsilon\}}\right\}
= E\left\{\max_{\phi\in\mathcal{C}} I_{\{L_n(\phi)=0\}} I_{\{L(\phi)>\varepsilon\}}\right\}
\le \sum_{\phi\in\mathcal{C}:\,L(\phi)>\varepsilon} \mathbf{P}\{L_n(\phi) = 0\}
\le |\mathcal{C}|(1-\varepsilon)^n,
\]
since the probability that no $(X_i,Y_i)$ pair falls in the set $\{(x,y): \phi(x) \neq y\}$ is less than $(1-\varepsilon)^n$ if the probability of the set is larger than $\varepsilon$. The probability inequality of the theorem follows from the simple inequality $1 - x \le e^{-x}$.

To bound the expected error probability, note that for any $u > 0$,
\[
E\{L(\phi_n^*)\} = \int_0^\infty \mathbf{P}\{L(\phi_n^*) > t\}\,dt
\le u + \int_u^\infty \mathbf{P}\{L(\phi_n^*) > t\}\,dt
\le u + |\mathcal{C}|\int_u^\infty e^{-nt}\,dt
= u + \frac{|\mathcal{C}|}{n}e^{-nu}.
\]
Since $u$ was arbitrary, we may choose it to minimize the obtained upper bound. The optimal choice is $u = \log|\mathcal{C}|/n$, which yields the desired inequality. $\Box$

Theorem 12.1 shows that empirical selection works very well if the sample size $n$ is much larger than the logarithm of the size of the family $\mathcal{C}$. Unfortunately, the assumption on the distribution of $(X,Y)$, that is, that $\min_{\phi\in\mathcal{C}} L(\phi) = 0$, is very restrictive. In the sequel we drop this assumption and deal with the distribution-free problem.
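As a quick numerical illustration of Theorem 12.1 (our sketch, with made-up numbers): even a class of a million classifiers is harmless once $n$ is a few thousand, provided one of its members has zero error.

    import numpy as np

    card_C, n, eps = 1e6, 5000, 0.01          # hypothetical |C|, n, and epsilon
    print(card_C * np.exp(-n * eps))          # bound on P{ L(phi_n^*) > eps }
    print((1 + np.log(card_C)) / n)           # bound on E{ L(phi_n^*) }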

One of our main tools is taken from Lemma 8.2:
\[
L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) \le 2\sup_{\phi\in\mathcal{C}}\left|L_n(\phi) - L(\phi)\right|.
\]
This leads to the study of uniform deviations of relative frequencies from their probabilities by the following simple observation: let $\nu$ be the probability measure of $(X,Y)$ on $\mathcal{R}^d \times \{0,1\}$, and let $\nu_n$ be the empirical measure based upon $D_n$. That is, for any fixed measurable set $A \subset \mathcal{R}^d \times \{0,1\}$, $\nu(A) = \mathbf{P}\{(X,Y) \in A\}$, and $\nu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{(X_i,Y_i)\in A\}}$. Then
\[
L(\phi) = \nu\left(\{(x,y): \phi(x) \neq y\}\right)
\]
is just the $\nu$-measure of the set of pairs $(x,y) \in \mathcal{R}^d \times \{0,1\}$ where $\phi(x) \neq y$. Formally, $L(\phi)$ is the $\nu$-measure of the set
\[
\{x: \phi(x) = 1\}\times\{0\}\ \cup\ \{x: \phi(x) = 0\}\times\{1\}.
\]


Similarly, $L_n(\phi) = \nu_n(\{(x,y): \phi(x) \neq y\})$. Thus,
\[
\sup_{\phi\in\mathcal{C}} |L_n(\phi) - L(\phi)| = \sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)|,
\]
where $\mathcal{A}$ is the collection of all sets
\[
\{x: \phi(x) = 1\}\times\{0\}\ \cup\ \{x: \phi(x) = 0\}\times\{1\}, \qquad \phi\in\mathcal{C}.
\]
For a fixed set $A$, for any probability measure $\nu$, by the law of large numbers, $\nu_n(A) - \nu(A) \to 0$ almost surely as $n \to \infty$. Moreover, by Hoeffding's inequality (Theorem 8.1),
\[
\mathbf{P}\{|\nu_n(A) - \nu(A)| > \varepsilon\} \le 2e^{-2n\varepsilon^2}.
\]
However, it is a much harder problem to obtain such results for $\sup_{A\in\mathcal{A}}|\nu_n(A) - \nu(A)|$. If the class of sets $\mathcal{A}$ (or, analogously, in the pattern recognition context, $\mathcal{C}$) is of finite cardinality, then the union bound trivially gives
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 2|\mathcal{A}|e^{-2n\varepsilon^2}.
\]
However, if $\mathcal{A}$ contains infinitely many sets (as in many of the interesting cases), then the problem becomes nontrivial, spawning a vast literature. The most powerful weapons to attack these problems are distribution-free large deviation-type inequalities first proved by Vapnik and Chervonenkis (1971) in their pioneering work. However, in some situations, we can handle the problem in a much simpler way. We have already seen such an example in Section 4.5.

12.2 Fingering

Recall that in Section 4.5 we studied a specific rule that selects a linear classifier by minimizing the empirical error. The performance bounds provided by Theorems 4.5 and 4.6 show that the selected rule performs very closely to the best possible linear rule. These bounds apply only to the specific algorithm used to find the empirical minima; we have not shown that any classifier minimizing the empirical error performs well. This matter will be dealt with in later sections. In this section we extend Theorems 4.5 and 4.6 to classes other than linear classifiers.

Let $\mathcal{C}$ be the class of classifiers assigning 1 to those $x$ contained in a closed hyperrectangle, and 0 to all other points. Then a classifier minimizing the empirical error $L_n(\phi)$ over all $\phi\in\mathcal{C}$ may be obtained by the following algorithm: to each $2d$-tuple $(X_{i_1},\ldots,X_{i_{2d}})$ of points from $X_1,\ldots,X_n$, assign the smallest hyperrectangle containing these points. If we assume that $X$ has a density, then the points $X_1,\ldots,X_n$ are in general position with probability one. This way we obtain at most $\binom{n}{2d}$ sets. Let $\phi_i$ be the classifier corresponding to the $i$-th such hyperrectangle, that is, the one assigning 1 to those $x$ contained in the hyperrectangle, and 0 to other points. Clearly, for each $\phi\in\mathcal{C}$, there exists a $\phi_i$, $i = 1,\ldots,\binom{n}{2d}$, such that
\[
\phi(X_j) = \phi_i(X_j)
\]
for all $X_j$, except possibly for those on the boundary of the hyperrectangle. Since the points are in general position, there are at most $2d$ such exceptional points. Therefore, if we select a classifier $\hat\phi$ among $\phi_1,\ldots,\phi_{\binom{n}{2d}}$ to minimize the empirical error, then it approximately minimizes the empirical error over the whole class $\mathcal{C}$ as well. A quick scan through the proof of Theorem 4.5 reveals that by similar arguments we may obtain the performance bound
\[
\mathbf{P}\left\{L(\hat\phi) - \inf_{\phi\in\mathcal{C}} L(\phi) > \varepsilon\right\} \le e^{4d\varepsilon}\left(\binom{n}{2d} + 1\right)e^{-n\varepsilon^2/2},
\]
for $n \ge 2d$ and $\varepsilon \ge 4d/n$.

The idea may be generalized. It always works if, for some $k$, $k$-tuples of points determine classifiers from $\mathcal{C}$ such that no matter where the other data points fall, the minimal empirical error over these sets coincides with the overall minimum. Then we may fix these sets ("put our finger on them") and look for the empirical minima over this finite collection. The next theorems, whose proofs are left as an exercise (Problem 12.2), show that if $\mathcal{C}$ has this property, then "fingering" works extremely well whenever $n \gg k$.
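The hyperrectangle algorithm described above can be sketched in a few lines. The code below is ours and is meant only to illustrate the fingering idea: it enumerates all $2d$-tuples of data points, forms the smallest enclosing rectangle of each, and keeps the candidate with the smallest empirical error. It ignores boundary points and is practical only for small $n$ and $d$.

    import numpy as np
    from itertools import combinations

    def fingering_rectangles(X, Y):
        n, d = X.shape
        best_err, best_rect = np.inf, None
        for idx in combinations(range(n), 2 * d):
            pts = X[list(idx)]
            lo, hi = pts.min(axis=0), pts.max(axis=0)     # smallest enclosing rectangle
            pred = np.all((X >= lo) & (X <= hi), axis=1).astype(int)
            err = np.mean(pred != Y)                      # empirical error of this phi_i
            if err < best_err:
                best_err, best_rect = err, (lo, hi)
        return best_rect, best_err

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(25, 2))
    Y = (np.max(np.abs(X), axis=1) < 0.5).astype(int)     # target concept: a square
    print(fingering_rectangles(X, Y))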

Theorem 12.2. Assume that the class $\mathcal{C}$ of classifiers has the following property: for some integer $k$ there exists a function $\Psi: (\mathcal{R}^d)^k \to \mathcal{C}$ such that for all $x_1,\ldots,x_n \in \mathcal{R}^d$ and all $\phi\in\mathcal{C}$, there exists a $k$-tuple $i_1,\ldots,i_k \in \{1,\ldots,n\}$ of different indices such that
\[
\Psi(x_{i_1},\ldots,x_{i_k})(x_j) = \phi(x_j) \quad \text{for all } j = 1,\ldots,n \text{ with } j \neq i_l,\ l = 1,\ldots,k,
\]
with probability one. Let $\hat\phi$ be found by fingering, that is, by empirical error minimization over the collection of $n!/(n-k)!$ classifiers of the form
\[
\Psi(X_{i_1},\ldots,X_{i_k}), \qquad i_1,\ldots,i_k \in \{1,\ldots,n\} \text{ different.}
\]
Then for $n \ge k$ and $2k/n \le \varepsilon \le 1$,
\[
\mathbf{P}\left\{L(\hat\phi) - \inf_{\phi\in\mathcal{C}} L(\phi) > \varepsilon\right\} \le e^{2k\varepsilon}\left(n^k + 1\right)e^{-n\varepsilon^2/2}.
\]
Moreover, if $n \ge k$, then
\[
E\left\{L(\hat\phi) - \inf_{\phi\in\mathcal{C}} L(\phi)\right\} \le \sqrt{\frac{2(k+1)\log n + (2k+2)}{n}}.
\]


The smallest $k$ for which $\mathcal{C}$ has the property described in the theorem may be called the fingering dimension of $\mathcal{C}$. In most interesting cases, it is independent of $n$. Problem 12.3 offers a few such classes. We will see later in this chapter that the fingering dimension is closely related to the so-called vc dimension of $\mathcal{C}$ (see also Problem 12.4).

Again, we get much smaller errors if $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. The next inequality generalizes Theorem 4.6.

Theorem 12.3. Assume that $\mathcal{C}$ has the property described in Theorem 12.2 with fingering dimension $k$. Assume, in addition, that $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. Then for all $n$ and $\varepsilon$,
\[
\mathbf{P}\left\{L(\hat\phi) > \varepsilon\right\} \le n^k e^{-(n-k)\varepsilon},
\]
and
\[
E\left\{L(\hat\phi)\right\} \le \frac{k\log n + 2}{n - k}.
\]

Remark. Even though the results in the next few sections based on the Vapnik-Chervonenkis theory supersede those of this section (by requiring less from the class $\mathcal{C}$ and being able to bound the error of any classifier minimizing the empirical risk), we must remark that the exponents in the above probability inequalities are the best possible, and bounds of the same type for the general case can only be obtained with significantly more effort. $\Box$

12.3 The Glivenko-Cantelli Theorem

In the next two sections, we prove the Vapnik-Chervonenkis inequality, a powerful generalization of the classical Glivenko-Cantelli theorem. It provides upper bounds on random variables of the type
\[
\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)|.
\]
As we noted in Section 12.1, such bounds yield performance bounds for any classifier selected by minimizing the empirical error. To make the material more digestible, we first present the main ideas in a simple one-dimensional setting, and then prove the general theorem in the next section.

We drop the pattern recognition setting momentarily, and return to probability theory. The following theorem is sometimes referred to as the fundamental theorem of mathematical statistics, stating uniform almost sure convergence of the empirical distribution function to the true one:

Theorem 12.4. (Glivenko-Cantelli theorem). Let $Z_1,\ldots,Z_n$ be i.i.d. real-valued random variables with distribution function $F(z) = \mathbf{P}\{Z_1 \le z\}$. Denote the standard empirical distribution function by
\[
F_n(z) = \frac{1}{n}\sum_{i=1}^n I_{\{Z_i \le z\}}.
\]
Then
\[
\mathbf{P}\left\{\sup_{z\in\mathcal{R}} |F(z) - F_n(z)| > \varepsilon\right\} \le 8(n+1)e^{-n\varepsilon^2/32},
\]
and, in particular, by the Borel-Cantelli lemma,
\[
\lim_{n\to\infty} \sup_{z\in\mathcal{R}} |F(z) - F_n(z)| = 0 \quad \text{with probability one.}
\]

Proof. The proof presented here is not the simplest possible, but it contains the main ideas leading to a powerful generalization. Introduce the notation $\nu(A) = \mathbf{P}\{Z_1 \in A\}$ and $\nu_n(A) = (1/n)\sum_{j=1}^n I_{\{Z_j \in A\}}$ for all measurable sets $A \subset \mathcal{R}$. Let $\mathcal{A}$ denote the class of sets of the form $(-\infty, z]$ for $z \in \mathcal{R}$. With these notations,
\[
\sup_{z\in\mathcal{R}} |F(z) - F_n(z)| = \sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)|.
\]
We prove the theorem in several steps, following symmetrization ideas of Dudley (1978) and Pollard (1984). We assume that $n\varepsilon^2 \ge 2$, since otherwise the bound is trivial. In the first step we introduce a symmetrization.

Step 1. First symmetrization by a ghost sample. Define the random variables $Z_1',\ldots,Z_n' \in \mathcal{R}$ such that $Z_1,\ldots,Z_n,Z_1',\ldots,Z_n'$ are all independent and identically distributed. Denote by $\nu_n'$ the empirical measure corresponding to the new sample:
\[
\nu_n'(A) = \frac{1}{n}\sum_{i=1}^n I_{\{Z_i' \in A\}}.
\]
Then for $n\varepsilon^2 \ge 2$ we have
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 2\,\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu_n'(A)| > \frac{\varepsilon}{2}\right\}.
\]
To see this, let $A^* \in \mathcal{A}$ be a set for which $|\nu_n(A^*) - \nu(A^*)| > \varepsilon$ if such a set exists, and let $A^*$ be a fixed set in $\mathcal{A}$ otherwise. Then
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu_n'(A)| > \varepsilon/2\right\}
\ge \mathbf{P}\left\{|\nu_n(A^*) - \nu_n'(A^*)| > \varepsilon/2\right\}
\ge \mathbf{P}\left\{|\nu_n(A^*) - \nu(A^*)| > \varepsilon,\ |\nu_n'(A^*) - \nu(A^*)| < \frac{\varepsilon}{2}\right\}
= E\left\{I_{\{|\nu_n(A^*) - \nu(A^*)| > \varepsilon\}}\,\mathbf{P}\left\{|\nu_n'(A^*) - \nu(A^*)| < \frac{\varepsilon}{2}\ \Big|\ Z_1,\ldots,Z_n\right\}\right\}.
\]


The conditional probability inside may be bounded by Chebyshev's inequality as follows:
\[
\mathbf{P}\left\{|\nu_n'(A^*) - \nu(A^*)| < \frac{\varepsilon}{2}\ \Big|\ Z_1,\ldots,Z_n\right\} \ge 1 - \frac{\nu(A^*)(1-\nu(A^*))}{n\varepsilon^2/4} \ge 1 - \frac{1}{n\varepsilon^2} \ge \frac{1}{2}
\]
whenever $n\varepsilon^2 \ge 2$. In summary,
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu_n'(A)| > \varepsilon/2\right\} \ge \frac{1}{2}\mathbf{P}\{|\nu_n(A^*) - \nu(A^*)| > \varepsilon\} \ge \frac{1}{2}\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\}.
\]

Step 2. Second symmetrization by random signs. Let $\sigma_1,\ldots,\sigma_n$ be i.i.d. sign variables, independent of $Z_1,\ldots,Z_n$ and $Z_1',\ldots,Z_n'$, with $\mathbf{P}\{\sigma_i = -1\} = \mathbf{P}\{\sigma_i = 1\} = 1/2$. Clearly, because $Z_1,Z_1',\ldots,Z_n,Z_n'$ are all independent and identically distributed, the distribution of
\[
\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^n \left(I_A(Z_i) - I_A(Z_i')\right)\right|
\]
is the same as the distribution of
\[
\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^n \sigma_i\left(I_A(Z_i) - I_A(Z_i')\right)\right|.
\]
Thus, by Step 1,
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\}
\le 2\,\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \left(I_A(Z_i) - I_A(Z_i')\right)\right| > \frac{\varepsilon}{2}\right\}
= 2\,\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i\left(I_A(Z_i) - I_A(Z_i')\right)\right| > \frac{\varepsilon}{2}\right\}.
\]

Simply applying the union bound, we can remove the auxiliary random variables $Z_1',\ldots,Z_n'$:
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i\left(I_A(Z_i) - I_A(Z_i')\right)\right| > \frac{\varepsilon}{2}\right\}
\le \mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\right\}
+ \mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i')\right| > \frac{\varepsilon}{4}\right\}
= 2\,\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\right\}.
\]

Step 3. Conditioning. To bound the probability
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\right\} = \mathbf{P}\left\{\sup_{z\in\mathcal{R}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_{\{Z_i \le z\}}\right| > \frac{\varepsilon}{4}\right\},
\]
we condition on $Z_1,\ldots,Z_n$. Fix $z_1,\ldots,z_n \in \mathcal{R}$, and note that as $z$ ranges over $\mathcal{R}$, the number of different vectors
\[
\left(I_{\{z_1 \le z\}},\ldots,I_{\{z_n \le z\}}\right)
\]
is at most $n+1$. Thus, conditional on $Z_1,\ldots,Z_n$, the supremum in the probability above is just a maximum taken over at most $n+1$ random variables. Thus, applying the union bound gives

\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}
\le (n+1)\sup_{A\in\mathcal{A}} \mathbf{P}\left\{\frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}.
\]
With the supremum now outside the probability, it suffices to find an exponential bound on the conditional probability
\[
\mathbf{P}\left\{\frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}.
\]

Step 4. Hoeffding’s inequality. With z1, . . . , zn fixed,∑ni=1 σiIA(zi)

is the sum of n independent zero mean random variables bounded between−1 and 1. Therefore, Theorem 8.1 applies in a straightforward manner:

P

1n

∣∣∣∣∣n∑i=1

σiIA(Zi)

∣∣∣∣∣ > ε

4

∣∣∣∣∣Z1, . . . , Zn

≤ 2e−nε

2/32.

Thus,

P

supA∈A

1n

∣∣∣∣∣n∑i=1

σiIA(Zi)

∣∣∣∣∣ > ε

4

∣∣∣∣∣Z1, . . . , Zn

≤ 2(n+ 1)e−nε

2/32.

Taking the expected value on both sides we have

P

supA∈A

1n

∣∣∣∣∣n∑i=1

σiIA(Zi)

∣∣∣∣∣ > ε

4

≤ 2(n+ 1)e−nε

2/32.

In summary,

P

supA∈A

|νn(A)− ν(A)| > ε

≤ 8(n+ 1)e−nε

2/32. 2
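A small Monte Carlo check of the statement of Theorem 12.4 (our sketch, using the standard normal distribution as an example): the maximal deviation $\sup_z |F(z) - F_n(z)|$ shrinks as $n$ grows. For an ordered sample the supremum is attained at a data point, so it can be read off the order statistics.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    for n in [100, 1000, 10000]:
        z = np.sort(rng.normal(size=n))
        F = norm.cdf(z)                                   # true distribution function
        i = np.arange(1, n + 1)
        deviation = np.max(np.maximum(i / n - F, F - (i - 1) / n))
        print(n, deviation)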


12.4 Uniform Deviations of Relative Frequencies from Probabilities

In this section we prove the Vapnik-Chervonenkis inequality, a mighty generalization of Theorem 12.4. In the proof we need only a slight adjustment of the proof above. In the general setting, let the independent identically distributed random variables $Z_1,\ldots,Z_n$ take their values from $\mathcal{R}^d$. Again, we use the notation $\nu(A) = \mathbf{P}\{Z_1 \in A\}$ and $\nu_n(A) = (1/n)\sum_{j=1}^n I_{\{Z_j \in A\}}$ for all measurable sets $A \subset \mathcal{R}^d$. The Vapnik-Chervonenkis theory begins with the concepts of shatter coefficient and Vapnik-Chervonenkis (or vc) dimension:

Definition 12.1. Let $\mathcal{A}$ be a collection of measurable sets. For $(z_1,\ldots,z_n) \in \mathcal{R}^{dn}$, let $N_{\mathcal{A}}(z_1,\ldots,z_n)$ be the number of different sets in
\[
\left\{\{z_1,\ldots,z_n\} \cap A:\ A \in \mathcal{A}\right\}.
\]
The $n$-th shatter coefficient of $\mathcal{A}$ is
\[
s(\mathcal{A}, n) = \max_{(z_1,\ldots,z_n)\in\mathcal{R}^{dn}} N_{\mathcal{A}}(z_1,\ldots,z_n).
\]
That is, the shatter coefficient is the maximal number of different subsets of $n$ points that can be picked out by the class of sets $\mathcal{A}$.
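For simple classes, $N_{\mathcal{A}}(z_1,\ldots,z_n)$ can be computed by brute force. The sketch below is ours and uses the class of halflines $(-\infty, t]$, which reappears as an example after the next definition: sweeping $t$ from left to right picks out exactly $n+1$ distinct subsets of $n$ distinct points, so $s(\mathcal{A}, n) = n + 1$.

    import numpy as np

    def n_subsets_halflines(z):
        # N_A(z_1,...,z_n) for A = { (-inf, t] : t real }: collect the distinct
        # subsets of {z_1,...,z_n} picked out as t sweeps over the real line.
        z = np.asarray(z, dtype=float)
        thresholds = np.r_[z, z.min() - 1.0]              # one value below all points
        picked = {frozenset(np.flatnonzero(z <= t).tolist()) for t in thresholds}
        return len(picked)

    print(n_subsets_halflines([0.3, -1.2, 2.5, 0.9]))     # prints 5, i.e., n + 1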

The shatter coefficients measure the richness of the class $\mathcal{A}$. Clearly, $s(\mathcal{A}, n) \le 2^n$, as there are $2^n$ subsets of a set with $n$ elements. If $N_{\mathcal{A}}(z_1,\ldots,z_n) = 2^n$ for some $(z_1,\ldots,z_n)$, then we say that $\mathcal{A}$ shatters $\{z_1,\ldots,z_n\}$. If $s(\mathcal{A}, n) < 2^n$, then any set of $n$ points has a subset such that there is no set in $\mathcal{A}$ that contains exactly that subset of the $n$ points. Clearly, if $s(\mathcal{A}, k) < 2^k$ for some integer $k$, then $s(\mathcal{A}, n) < 2^n$ for all $n > k$. The first time this happens is important:

Definition 12.2. Let $\mathcal{A}$ be a collection of sets with $|\mathcal{A}| \ge 2$. The largest integer $k \ge 1$ for which $s(\mathcal{A}, k) = 2^k$ is denoted by $V_{\mathcal{A}}$, and it is called the Vapnik-Chervonenkis dimension (or vc dimension) of the class $\mathcal{A}$. If $s(\mathcal{A}, n) = 2^n$ for all $n$, then by definition, $V_{\mathcal{A}} = \infty$.

For example, if $\mathcal{A}$ contains all halflines of the form $(-\infty, x]$, $x \in \mathcal{R}$, then $s(\mathcal{A}, 2) = 3 < 2^2$, and $V_{\mathcal{A}} = 1$. This is easily seen by observing that for any two different points $z_1 < z_2$ there is no set of the form $(-\infty, x]$ that contains $z_2$ but not $z_1$. A class of sets $\mathcal{A}$ for which $V_{\mathcal{A}} < \infty$ is called a Vapnik-Chervonenkis (or vc) class. In a sense, $V_{\mathcal{A}}$ may be considered as the complexity, or size, of $\mathcal{A}$. Several properties of the shatter coefficients and the vc dimension will be shown in Chapter 13. The main purpose of this section is to prove the following important result by Vapnik and Chervonenkis (1971):

Theorem 12.5. (Vapnik and Chervonenkis (1971)). For any probability measure $\nu$ and class of sets $\mathcal{A}$, and for any $n$ and $\varepsilon > 0$,
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 8\,s(\mathcal{A}, n)\,e^{-n\varepsilon^2/32}.
\]

Proof. The proof parallels that of Theorem 12.4. We may again assume that $n\varepsilon^2 \ge 2$. In the first two steps we prove that
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 4\,\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\right\}.
\]
This may be done exactly the same way as in Theorem 12.4; we do not repeat the argument. The only difference appears in Step 3:

Step 3. Conditioning. To bound the probability
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\right\},
\]
again we condition on $Z_1,\ldots,Z_n$. Fix $z_1,\ldots,z_n \in \mathcal{R}^d$, and observe that as $A$ ranges over $\mathcal{A}$, the number of different vectors $(I_A(z_1),\ldots,I_A(z_n))$ is just the number of different subsets of $\{z_1,\ldots,z_n\}$ produced by intersecting it with sets in $\mathcal{A}$, which, by definition, cannot exceed $s(\mathcal{A}, n)$. Therefore, with $Z_1,\ldots,Z_n$ fixed, the supremum in the above probability is a maximum of at most $N_{\mathcal{A}}(Z_1,\ldots,Z_n)$ random variables. This number, by definition, is bounded from above by $s(\mathcal{A}, n)$. By the union bound we get
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} \frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}
\le s(\mathcal{A}, n)\sup_{A\in\mathcal{A}} \mathbf{P}\left\{\frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}.
\]
Therefore, as before, it suffices to bound the conditional probability
\[
\mathbf{P}\left\{\frac{1}{n}\left|\sum_{i=1}^n \sigma_i I_A(Z_i)\right| > \frac{\varepsilon}{4}\ \Bigg|\ Z_1,\ldots,Z_n\right\}.
\]
This may be done by Hoeffding's inequality exactly as in Step 4 of the proof of Theorem 12.4. Finally, we obtain
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 8\,s(\mathcal{A}, n)\,e^{-n\varepsilon^2/32}. \qquad \Box
\]


The bound of Theorem 12.5 is useful when the shatter coefficients do not increase too quickly with $n$. For example, if $\mathcal{A}$ contains all Borel sets of $\mathcal{R}^d$, then we can shatter any collection of $n$ different points at will, and obtain $s(\mathcal{A}, n) = 2^n$. This would be useless, of course. The smaller $\mathcal{A}$, the smaller the shatter coefficient is. To apply the vc bound, it suffices to compute shatter coefficients for certain families of sets. Examples may be found in Cover (1965), Vapnik and Chervonenkis (1971), Devroye and Wagner (1979a), Feinholz (1979), Devroye (1982a), Massart (1983), Dudley (1984), Simon (1991), and Stengle and Yukich (1989). This list of references is far from exhaustive. More information about shatter coefficients is given in Chapter 13.

Remark. Measurability. The supremum in Theorem 12.5 is not always measurable. Measurability must be verified for every family $\mathcal{A}$. For all our examples, the quantities are indeed measurable. For more on the measurability question, see Dudley (1978; 1984), Massart (1983), and Gaenssler (1983). Gine and Zinn (1984) and Yukich (1985) provide further work on suprema of the type shown in Theorem 12.5. $\Box$

Remark. Optimal exponent. For the sake of readability we followed the line of Pollard's proof (1984) instead of the original by Vapnik and Chervonenkis. In particular, the exponent $-n\varepsilon^2/32$ in Theorem 12.5 is worse than the $-n\varepsilon^2/8$ established in the original paper. The best known exponents, together with some other related results, are mentioned in Section 12.8. The basic ideas of the original proof by Vapnik and Chervonenkis appear in the proof of Theorem 12.7 below. $\Box$

Remark. Necessary and Sufficient Conditions. It is clear from the proof of the theorem that it can be strengthened to
\[
\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \varepsilon\right\} \le 8\,E\left\{N_{\mathcal{A}}(Z_1,\ldots,Z_n)\right\}e^{-n\varepsilon^2/32},
\]
where $Z_1,\ldots,Z_n$ are i.i.d. random variables with probability measure $\nu$. Although this upper bound is tighter than that in the stated inequality, it is usually more difficult to handle, since the coefficient in front of the exponential term depends on the distribution of $Z_1$, while $s(\mathcal{A}, n)$ is purely combinatorial in nature. However, this form is important in a different setting: we say that the uniform law of large numbers holds if
\[
\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| \to 0 \quad \text{in probability.}
\]
It follows from this form of Theorem 12.5 that the uniform law of large numbers holds if
\[
\frac{E\left\{\log N_{\mathcal{A}}(Z_1,\ldots,Z_n)\right\}}{n} \to 0.
\]


Vapnik and Chervonenkis showed (1971; 1981) that this condition is also necessary for the uniform law of large numbers. Another characterization of the uniform law of large numbers is given by Talagrand (1987), who showed that the uniform law of large numbers holds if and only if there does not exist a set $A \subset \mathcal{R}^d$ with $\nu(A) > 0$ such that, with probability one, the set $\{Z_1,\ldots,Z_n\} \cap A$ is shattered by $\mathcal{A}$. $\Box$

12.5 Classifier Selection

The following theorem relates the results of the previous sections to empirical classifier selection, that is, when the empirical error probability $L_n(\phi)$ is minimized over a class of classifiers $\mathcal{C}$. We emphasize that, unlike in Section 12.2, here we allow any classifier with minimal empirical error $L_n(\phi)$ in $\mathcal{C}$. First we introduce the shatter coefficients and vc dimension of $\mathcal{C}$:

Definition 12.3. Let $\mathcal{C}$ be a class of decision functions of the form $\phi: \mathcal{R}^d \to \{0,1\}$. Define $\mathcal{A}$ as the collection of all sets
\[
\{x: \phi(x) = 1\}\times\{0\}\ \cup\ \{x: \phi(x) = 0\}\times\{1\}, \qquad \phi\in\mathcal{C}.
\]
Define the $n$-th shatter coefficient $S(\mathcal{C}, n)$ of the class of classifiers $\mathcal{C}$ as
\[
S(\mathcal{C}, n) = s(\mathcal{A}, n).
\]
Furthermore, define the vc dimension $V_{\mathcal{C}}$ of $\mathcal{C}$ as
\[
V_{\mathcal{C}} = V_{\mathcal{A}}.
\]

For the performance of the empirically selected decision $\phi_n^*$, we have the following:

Theorem 12.6. Let $\mathcal{C}$ be a class of decision functions of the form $\phi: \mathcal{R}^d \to \{0,1\}$. Then, using the notation $L_n(\phi) = \frac{1}{n}\sum_{i=1}^n I_{\{\phi(X_i)\neq Y_i\}}$ and $L(\phi) = \mathbf{P}\{\phi(X) \neq Y\}$, we have
\[
\mathbf{P}\left\{\sup_{\phi\in\mathcal{C}} |L_n(\phi) - L(\phi)| > \varepsilon\right\} \le 8\,S(\mathcal{C}, n)\,e^{-n\varepsilon^2/32},
\]
and therefore
\[
\mathbf{P}\left\{L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) > \varepsilon\right\} \le 8\,S(\mathcal{C}, n)\,e^{-n\varepsilon^2/128},
\]
where $\phi_n^*$ denotes the classifier minimizing $L_n(\phi)$ over the class $\mathcal{C}$.


Proof. The statements are immediate consequences of Theorem 12.5 and Lemma 8.2. $\Box$

The next corollary, an easy application of Theorem 12.6 (see Problem 12.1), makes things a little more transparent.

Corollary 12.1. In the notation of Theorem 12.6,
\[
E\{L(\phi_n^*)\} - \inf_{\phi\in\mathcal{C}} L(\phi) \le 16\sqrt{\frac{\log\left(8e\,S(\mathcal{C}, n)\right)}{2n}}.
\]

If $S(\mathcal{C}, n)$ increases polynomially with $n$, then the average error probability of the selected classifier is within $O\left(\sqrt{\log n / n}\right)$ of the error of the best rule in the class. We point out that this result is completely distribution-free. Furthermore, note the nonasymptotic nature of these inequalities: they hold for every $n$. From here on the problem is purely combinatorial; one has to estimate the shatter coefficients. Many properties are given in Chapter 13. In particular, if $V_{\mathcal{C}} > 2$, then $S(\mathcal{C}, n) \le n^{V_{\mathcal{C}}}$, that is, if the class $\mathcal{C}$ has finite vc dimension, then $S(\mathcal{C}, n)$ increases at a polynomial rate, and
\[
E\{L(\phi_n^*)\} - \inf_{\phi\in\mathcal{C}} L(\phi) \le 16\sqrt{\frac{V_{\mathcal{C}}\log n + 4}{2n}}.
\]
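Evaluating the last bound numerically (our sketch, with a hypothetical vc dimension) makes its character plain: it is distribution-free and decreases like $\sqrt{\log n / n}$, but the constants are conservative, so it becomes informative only for fairly large $n$.

    import numpy as np

    def vc_error_bound(vc_dim, n):
        # 16 * sqrt((V_C * log n + 4) / (2n)), valid when S(C, n) <= n^{V_C}
        return 16.0 * np.sqrt((vc_dim * np.log(n) + 4.0) / (2.0 * n))

    for n in [10**4, 10**5, 10**6]:
        print(n, vc_error_bound(10, n))       # hypothetical class with V_C = 10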

Remark. Theorem 12.6 provides a bound for the behavior of the empirically optimal classifier. In practice, finding an empirically optimal classifier is often computationally very expensive. In such cases, the designer is often forced to put up with algorithms yielding suboptimal classifiers. Assume, for example, that we have an algorithm which selects a classifier $g_n$ such that its empirical error is not too far from the optimum with large probability:
\[
\mathbf{P}\left\{L_n(g_n) \le \inf_{\phi\in\mathcal{C}} L_n(\phi) + \varepsilon_n\right\} \ge 1 - \delta_n,
\]
where $\varepsilon_n$ and $\delta_n$ are sequences of positive numbers converging to zero. Then it is easy to see (see Problem 12.6) that
\[
\mathbf{P}\left\{L(g_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \varepsilon\right\} \le \delta_n + \mathbf{P}\left\{2\sup_{\phi\in\mathcal{C}}\left|L_n(\phi) - L(\phi)\right| > \varepsilon - \varepsilon_n\right\},
\]
and Theorem 12.6 may be used to obtain bounds for the error probability of $g_n$. $\Box$

Interestingly, the empirical error probability of the empirically optimal classifier is always close to its expected value, as may be seen from the following example:


Corollary 12.2. Let $\mathcal{C}$ be an arbitrary class of classification rules (that is, functions of the form $\phi: \mathcal{R}^d \to \{0,1\}$). Let $\phi_n^* \in \mathcal{C}$ be the rule that minimizes the number of errors committed on the training sequence $D_n$ among the classifiers in $\mathcal{C}$. In other words,
\[
L_n(\phi_n^*) \le L_n(\phi) \quad \text{for every } \phi\in\mathcal{C},
\]
where $L_n(\phi) = \frac{1}{n}\sum_{i=1}^n I_{\{\phi(X_i)\neq Y_i\}}$. Then for every $n$ and $\varepsilon > 0$,
\[
\mathbf{P}\left\{\left|L_n(\phi_n^*) - E\{L_n(\phi_n^*)\}\right| > \varepsilon\right\} \le 2e^{-2n\varepsilon^2}.
\]

The corollary follows immediately from Theorem 9.1 by observing that changing the value of one $(X_i,Y_i)$ pair in the training sequence results in a change of at most $1/n$ in the value of $L_n(\phi_n^*)$. The corollary is true even if the vc dimension of $\mathcal{C}$ is infinite (!). The result shows that $L_n(\phi_n^*)$ is always very close to its expected value with large probability, even if $E\{L_n(\phi_n^*)\}$ is far from $\inf_{\phi\in\mathcal{C}} L(\phi)$ (see also Problem 12.12).

12.6 Sample Complexity

In his theory of learning, Valiant (1984) rephrases the empirical classifier selection problem as follows. For $\epsilon,\delta > 0$, define an $(\epsilon,\delta)$ learning algorithm as a method that selects a classifier $g_n$ from $\mathcal{C}$ using the data $D_n$ such that for the selected rule
$$\sup_{(X,Y)} \mathbf{P}\left\{L(g_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le \delta,$$
whenever $n \ge N(\epsilon,\delta)$. Here $N(\epsilon,\delta)$ is the sample complexity of the algorithm, defined as the smallest integer with the above property. Since the supremum is taken over all possible distributions of $(X,Y)$, the integer $N(\epsilon,\delta)$ is the number of data pairs that guarantees $\epsilon$ accuracy with $\delta$ confidence for any distribution. Note that we use the notation $g_n$ and not $\phi_n^*$ in the definition above, as the definition does not force us to take the empirical risk minimizer $\phi_n^*$.

We may use Theorem 12.6 to get an upper bound on the sample complexity of the selection algorithm based on empirical error minimization (i.e., the classifier $\phi_n^*$).

Corollary 12.3. The sample complexity of the method based on empirical error minimization is bounded from above by
$$N(\epsilon,\delta) \le \max\left(\frac{512\,V_{\mathcal{C}}}{\epsilon^2}\log\frac{256\,V_{\mathcal{C}}}{\epsilon^2},\ \frac{256}{\epsilon^2}\log\frac{8}{\delta}\right).$$
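The way this bound trades off accuracy, confidence, and vc dimension is easiest to see by plugging in numbers. The following sketch is an illustration only (the vc dimension 3 is a hypothetical choice) and evaluates the right-hand side of Corollary 12.3.

```python
import math

def sample_complexity_erm(eps, delta, vc_dim):
    """Upper bound of Corollary 12.3 on N(eps, delta) for empirical error minimization."""
    term_vc = 512.0 * vc_dim / eps**2 * math.log(256.0 * vc_dim / eps**2)
    term_conf = 256.0 / eps**2 * math.log(8.0 / delta)
    return math.ceil(max(term_vc, term_conf))

# Halving eps roughly quadruples the bound, while shrinking delta
# only enters through a logarithm.
for eps in [0.2, 0.1, 0.05]:
    for delta in [0.1, 0.01]:
        print(eps, delta, sample_complexity_erm(eps, delta, vc_dim=3))
```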


The corollary is a direct consequence of Theorem 12.6. The details are left as an exercise (Problem 12.5). The constants may be improved by using refined versions of Theorem 12.5 (see, e.g., Theorem 12.8). The sample size that guarantees the prescribed accuracy and confidence is proportional to the maximum of
$$\frac{V_{\mathcal{C}}}{\epsilon^2}\log\frac{V_{\mathcal{C}}}{\epsilon^2} \quad \text{and} \quad \frac{1}{\epsilon^2}\log\frac{1}{\delta}.$$

Here we have our first practical interpretation of the vc dimension. Doubling the vc dimension requires that we basically double the sample size to obtain the same accuracy and confidence. Halving the accuracy parameter $\epsilon$, however, forces us to quadruple the sample size. On the other hand, the confidence level has little influence on the sample size, as it is hidden behind a logarithmic term, thanks to the exponential nature of the Vapnik-Chervonenkis inequality.

The Vapnik-Chervonenkis bound and the sample complexity $N(\epsilon,\delta)$ also allow us to compare different classes in a unified manner. For example, if we pick $\phi_n^*$ by minimizing the empirical error over all hyperrectangles of $\mathcal{R}^{18}$, will we need a sample size that exceeds that of the rule that minimizes the empirical error over all linear halfspaces of $\mathcal{R}^{29}$? With the sample complexity in hand, it is just a matter of comparing vc dimensions.

As a function of $\epsilon$, the above bound grows as $O\big((1/\epsilon^2)\log(1/\epsilon^2)\big)$. It is possible, interestingly, to get rid of the "log" term, at the expense of increasing the linear dependence on the vc dimension (see Problem 12.11).

12.7 The Zero-Error Case

Theorem 12.5 is completely general, as it applies to any class of classifiers and all distributions. In some cases, however, when we have some additional information about the distribution, it is possible to obtain even better bounds. For example, in the theory of concept learning one commonly assumes that $L^* = 0$, and that the Bayes decision is contained in $\mathcal{C}$ (see, e.g., Valiant (1984), Blumer, Ehrenfeucht, Haussler, and Warmuth (1989), Natarajan (1991)). The following theorem provides significant improvement. Its various forms have been proved by Devroye and Wagner (1976b), Vapnik (1982), and Blumer, Ehrenfeucht, Haussler, and Warmuth (1989). For a sharper result, see Problem 12.9.

Theorem 12.7. Let $\mathcal{C}$ be a class of decision functions mapping $\mathcal{R}^d$ to $\{0,1\}$, and let $\phi_n^*$ be a function in $\mathcal{C}$ that minimizes the empirical error based on the training sample $D_n$. Suppose that $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$, i.e., the Bayes decision is contained in $\mathcal{C}$, and $L^* = 0$. Then
$$\mathbf{P}\{L(\phi_n^*) > \epsilon\} \le 2\,S(\mathcal{C},2n)\,2^{-n\epsilon/2}.$$


To contrast this with Theorem 12.6, observe that the exponent in the upper bound for the empirically optimal rule is proportional to $-n\epsilon$ instead of $-n\epsilon^2$. To see the significance of this difference, note that Theorem 12.7 implies that the error probability of the selected classifier is within $O(\log n/n)$ of the optimal rule in the class (which equals zero in this case, see Problem 12.8), as opposed to $O\big(\sqrt{\log n/n}\big)$ from Theorem 12.6. We show in Chapter 14 that this is not a technical coincidence: since both bounds are essentially tight, the gap is a mathematical witness to the fact that it is remarkably easier to select a good classifier when $L^* = 0$. The proof is based on the random permutation argument developed in the original proof of the Vapnik-Chervonenkis inequality (1971).

Proof. For $n\epsilon \le 2$, the inequality is clearly true. So, we assume that $n\epsilon > 2$. First observe that since $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$, $L_n(\phi_n^*) = 0$ with probability one. It is easily seen that
$$L(\phi_n^*) \le \sup_{\phi: L_n(\phi)=0} |L(\phi) - L_n(\phi)|.$$
Now, we return to the notation of the previous sections, that is, $Z_i$ denotes the pair $(X_i,Y_i)$, $\nu$ denotes its measure, and $\nu_n$ is the empirical measure based on $Z_1,\ldots,Z_n$. Also, $\mathcal{A}$ consists of all sets of the form $A = \{(x,y): \phi(x)\neq y\}$ for $\phi\in\mathcal{C}$. With these notations,
$$\sup_{\phi: L_n(\phi)=0} |L(\phi) - L_n(\phi)| = \sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu(A)|.$$

Step 1. Symmetrization by a ghost sample. The first step of the proof is similar to that of Theorem 12.5. Introduce the auxiliary sample $Z_1',\ldots,Z_n'$ such that the random variables $Z_1,\ldots,Z_n,Z_1',\ldots,Z_n'$ are i.i.d., and let $\nu_n'$ be the empirical measure for $Z_1',\ldots,Z_n'$. Then for $n\epsilon > 2$,
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le 2\,\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu_n'(A)| > \frac{\epsilon}{2}\right\}.$$
The proof of this inequality parallels that of the corresponding one in the proof of Theorem 12.5. However, an important difference is that the condition $n\epsilon^2 > 2$ there is more restrictive than the condition $n\epsilon > 2$ here. The details of the proof are left to the reader (see Problem 12.7).

Step 2. Symmetrization by permuting. Note that the distribution of
$$\sup_{A:\nu_n(A)=0}|\nu_n(A)-\nu_n'(A)| = \sup_{A:\sum_{i=1}^n I_A(Z_i)=0}\left|\frac{1}{n}\sum_{i=1}^n I_A(Z_i) - \frac{1}{n}\sum_{i=1}^n I_A(Z_i')\right|$$
is the same as the distribution of
$$\beta(\Pi) \stackrel{\rm def}{=} \sup_{A:\sum_{i=1}^n I_A(\Pi(Z_i))=0}\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi(Z_i)) - \frac{1}{n}\sum_{i=1}^n I_A(\Pi(Z_i'))\right|,$$


where $\Pi(Z_1),\ldots,\Pi(Z_n),\Pi(Z_1'),\ldots,\Pi(Z_n')$ is an arbitrary permutation of the random variables $Z_1,\ldots,Z_n,Z_1',\ldots,Z_n'$. The $(2n)!$ possible permutations are denoted by $\Pi_1,\Pi_2,\ldots,\Pi_{(2n)!}$.

Therefore,
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu_n'(A)| > \frac{\epsilon}{2}\right\}
= \mathbf{E}\left\{\frac{1}{(2n)!}\sum_{j=1}^{(2n)!} I_{\{\beta(\Pi_j) > \epsilon/2\}}\right\}$$
$$= \mathbf{E}\left\{\frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sup_{A:\sum_{i=1}^n I_A(\Pi_j(Z_i))=0} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i)) - \frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i'))\right| > \epsilon/2\right\}}\right\}.$$

Step 3. Conditioning. Next we fix $z_1,\ldots,z_n,z_1',\ldots,z_n'$ and bound the value of the random variable above. Let $\mathcal{A}^*\subset\mathcal{A}$ be a collection of sets such that any two sets in $\mathcal{A}^*$ pick different subsets of $\{z_1,\ldots,z_n,z_1',\ldots,z_n'\}$, and its cardinality is $N_{\mathcal{A}}(z_1,\ldots,z_n,z_1',\ldots,z_n')$, that is, all possible subsets are represented exactly once. Then it suffices to take the supremum over $\mathcal{A}^*$ instead of $\mathcal{A}$:

$$\frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sup_{A:\sum_{i=1}^n I_A(\Pi_j(z_i))=0} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i)) - \frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i'))\right| > \epsilon/2\right\}}$$
$$= \frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sup_{A\in\mathcal{A}^*:\sum_{i=1}^n I_A(\Pi_j(z_i))=0} I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}}$$
$$\le \frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sum_{A\in\mathcal{A}^*:\sum_{i=1}^n I_A(\Pi_j(z_i))=0} I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}}$$
$$= \frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sum_{A\in\mathcal{A}^*} I_{\left\{\sum_{i=1}^n I_A(\Pi_j(z_i))=0\right\}}\, I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}}$$
$$= \sum_{A\in\mathcal{A}^*}\ \frac{1}{(2n)!}\sum_{j=1}^{(2n)!} I_{\left\{\sum_{i=1}^n I_A(\Pi_j(z_i))=0\right\}}\, I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}}.$$

Step 4. Counting. Clearly, the expression behind the first summation sign is just the number of permutations of the $2n$ points $z_1,z_1',\ldots,z_n,z_n'$ with the property
$$I_{\left\{\sum_{i=1}^n I_A(\Pi_j(z_i))=0\right\}}\, I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}} = 1,$$

divided by $(2n)!$, the total number of permutations. Observe that if $l = \sum_{i=1}^n(I_A(z_i) + I_A(z_i'))$ is the total number of points in $A$ among $z_1,z_1',\ldots,z_n,z_n'$ for a fixed set $A$, then the number of permutations such that
$$I_{\left\{\sum_{i=1}^n I_A(\Pi_j(z_i))=0\right\}}\, I_{\left\{\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(z_i')) > \epsilon/2\right\}} = 1$$
is zero if $l \le n\epsilon/2$. If $l > n\epsilon/2$, then the fraction of the number of permutations with the above property and the number of all permutations cannot exceed
$$\frac{\binom{n}{l}}{\binom{2n}{l}}.$$
To see this, note that for the above product of indicators to be 1, all the points falling in $A$ have to be in the second half of the permuted sample. Now clearly,
$$\frac{\binom{n}{l}}{\binom{2n}{l}} = \frac{n(n-1)\cdots(n-l+1)}{2n(2n-1)\cdots(2n-l+1)} \le 2^{-l} \le 2^{-n\epsilon/2}.$$
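The counting bound is easy to sanity-check numerically. The following sketch is an illustration only; it compares the exact fraction $\binom{n}{l}/\binom{2n}{l}$ with $2^{-l}$ for a few values.

```python
from math import comb

def permutation_fraction(n, l):
    """Fraction C(n, l) / C(2n, l) of permutations that push all l points of A
    into the second (ghost) half of the permuted sample."""
    return comb(n, l) / comb(2 * n, l)

# The fraction never exceeds 2**(-l).
for n in [10, 50]:
    for l in [1, 3, 7, n]:
        assert permutation_fraction(n, l) <= 2.0 ** (-l)
        print(n, l, permutation_fraction(n, l), 2.0 ** (-l))
```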

Summarizing, we have
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu_n'(A)| > \frac{\epsilon}{2}\right\}
\le \mathbf{E}\left\{\frac{1}{(2n)!}\sum_{j=1}^{(2n)!}\ \sup_{A:\sum_{i=1}^n I_A(\Pi_j(Z_i))=0} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i)) - \frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i'))\right| > \epsilon/2\right\}}\right\}$$
$$\le \mathbf{E}\left\{\sum_{A\in\mathcal{A}^*} 2^{-n\epsilon/2}\right\}
= \mathbf{E}\left\{N_{\mathcal{A}}(Z_1,\ldots,Z_n,Z_1',\ldots,Z_n')\right\} 2^{-n\epsilon/2}
\le s(\mathcal{A},2n)\,2^{-n\epsilon/2},$$
and the theorem is proved. □

Again, we can bound the sample complexity $N(\epsilon,\delta)$ restricted to the class of distributions with $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. Just as Theorem 12.6 implies Corollary 12.3, Theorem 12.7 yields


Corollary 12.4. The sample complexity $N(\epsilon,\delta)$ that guarantees
$$\sup_{(X,Y):\ \inf_{\phi\in\mathcal{C}} L(\phi)=0}\ \mathbf{P}\left\{L(g_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le \delta$$
for $n \ge N(\epsilon,\delta)$, is bounded by
$$N(\epsilon,\delta) \le \max\left(\frac{8V_{\mathcal{C}}}{\epsilon}\log_2\frac{13}{\epsilon},\ \frac{4}{\epsilon}\log_2\frac{2}{\delta}\right).$$

A quick comparison with Corollary 12.3 shows that the $\epsilon^2$ factors in the denominators there are now replaced by $\epsilon$. For the same accuracy, much smaller samples suffice if we know that $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. Interestingly, the sample complexity is still roughly linear in the vc dimension.
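The gap between the two regimes is striking once numbers are plugged in. The sketch below is an illustration only (the vc dimension 3 and confidence level 0.05 are hypothetical choices) and compares the two sample complexity bounds.

```python
import math

def n_agnostic(eps, delta, v):
    """Corollary 12.3: general distributions."""
    return max(512 * v / eps**2 * math.log(256 * v / eps**2),
               256 / eps**2 * math.log(8 / delta))

def n_zero_error(eps, delta, v):
    """Corollary 12.4: distributions with inf_phi L(phi) = 0."""
    return max(8 * v / eps * math.log2(13 / eps),
               4 / eps * math.log2(2 / delta))

v, delta = 3, 0.05
for eps in [0.2, 0.1, 0.05, 0.01]:
    print(eps, round(n_agnostic(eps, delta, v)), round(n_zero_error(eps, delta, v)))
```

The zero-error bound grows like $1/\epsilon$ while the general bound grows like $1/\epsilon^2$, exactly the difference discussed above.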

Remark. We also note that under the condition of Theorem 12.7,
$$\mathbf{P}\{L(\phi_n^*) > \epsilon\} \le 2\,\mathbf{E}\left\{N_{\mathcal{A}}(Z_1,\ldots,Z_n,Z_1',\ldots,Z_n')\right\} 2^{-n\epsilon/2} = 2\,\mathbf{E}\left\{N_{\mathcal{A}}(Z_1,\ldots,Z_{2n})\right\} 2^{-n\epsilon/2},$$
where the class $\mathcal{A}$ of sets is defined in the proof of Theorem 12.7. □

Remark. As Theorem 12.7 shows, the difference between the error probability of an empirically optimal classifier and that of the optimal in the class is much smaller if the latter quantity is known to be zero than if no restriction is imposed on the distribution. To bridge the gap between the two bounds, one may put the restriction $\inf_{\phi\in\mathcal{C}} L(\phi) \le L$ on the distribution, where $L\in(0,1/2)$ is a fixed number. Devroye and Wagner (1976b) and Vapnik (1982) obtained such bounds. For example, it follows from a result by Vapnik (1982) that
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}:\nu(A)\le L} |\nu(A) - \nu_n(A)| > \epsilon\right\} \le 8\,s(\mathcal{A},2n)\,e^{-n\epsilon^2/(16L)}.$$
As expected, the bound becomes smaller as $L$ decreases. We face the same phenomenon in Chapter 14, where lower bounds are obtained for the probability above. □

12.8 Extensions

We mentioned earlier that the constant in the exponent in Theorem 12.5 can be improved at the expense of a more complicated argument. The best possible exponent appears in the following result, whose proof is left as an exercise (Problem 12.15):


Theorem 12.8. (Devroye (1982a)). For all $\epsilon \le 1$,
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le c\,s(\mathcal{A},n^2)\,e^{-2n\epsilon^2},$$
where the constant $c$ does not exceed $4e^{4\epsilon+4\epsilon^2} \le 4e^8$.

Even though the coefficient in front is larger than in Theorem 12.5, it becomes very quickly absorbed by the exponential term. We will see in Chapter 13 that for $V_{\mathcal{A}} > 2$, $s(\mathcal{A},n) \le n^{V_{\mathcal{A}}}$, so $s(\mathcal{A},n^2) \le n^{2V_{\mathcal{A}}}$. This difference is negligible compared to the difference between the exponential terms, even for moderately large values of $n\epsilon^2$.

Both Theorem 12.5 and Theorem 12.8 imply that
$$\mathbf{E}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)|\right\} = O\big(\sqrt{\log n/n}\big)$$
(see Problem 12.1). However, it is possible to get rid of the logarithmic term to obtain $O(1/\sqrt{n})$. For example, for the Kolmogorov-Smirnov statistic we have the following result by Dvoretzky, Kiefer, and Wolfowitz (1956), sharpened by Massart (1990):

Theorem 12.9. (Dvoretzky, Kiefer, and Wolfowitz (1956); Massart (1990)). Using the notation of Theorem 12.4, we have for every $n$ and $\epsilon > 0$,
$$\mathbf{P}\left\{\sup_{z\in\mathcal{R}} |F(z) - F_n(z)| > \epsilon\right\} \le 2e^{-2n\epsilon^2}.$$
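A small simulation shows how tight the constant in Theorem 12.9 is. The sketch below is an illustration only, with uniform data on $[0,1]$ and hypothetical choices of $n$ and $\epsilon$; it compares the empirical tail of the Kolmogorov-Smirnov statistic with the right-hand side of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic(sample):
    """sup_z |F(z) - F_n(z)| for the uniform distribution on [0, 1]."""
    x = np.sort(sample)
    n = len(x)
    grid = np.arange(1, n + 1) / n
    return max(np.max(grid - x), np.max(x - (grid - 1.0 / n)))

n, eps, trials = 200, 0.1, 2000
exceed = sum(ks_statistic(rng.uniform(size=n)) > eps for _ in range(trials)) / trials
print("empirical tail:", exceed, "  DKW-Massart bound:", 2 * np.exp(-2 * n * eps**2))
```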

For the general case, we also have Alexander's bound:

Theorem 12.10. (Alexander (1984)). For $n\epsilon^2 \ge 64$,
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le 16\,(\sqrt{n}\,\epsilon)^{4096 V_{\mathcal{A}}}\, e^{-2n\epsilon^2}.$$

The theorem implies the following (Problem 12.10):

Corollary 12.5.
$$\mathbf{E}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)|\right\} \le \frac{8 + \sqrt{2048\, V_{\mathcal{A}}\log(4096\, V_{\mathcal{A}})}}{\sqrt{n}}.$$

The bound in Theorem 12.10 is theoretically interesting, since it implies (see Problem 12.10) that for fixed $V_{\mathcal{A}}$ the expected value of the supremum decreases as $a/\sqrt{n}$ instead of $\sqrt{\log n/n}$. However, a quick comparison reveals that Alexander's bound is larger than that of Theorem 12.8, unless


$n > 2^{6144}$, an astronomically large value. Recently, Talagrand (1994) obtained a very strong result. He proved that there exists a universal constant $c$ such that
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le \frac{c}{\epsilon\sqrt{n}}\left(\frac{cn\epsilon^2}{V_{\mathcal{A}}}\right)^{V_{\mathcal{A}}} e^{-2n\epsilon^2}.$$
For more information about these inequalities, see also Vapnik (1982), Gaenssler (1983), Gaenssler and Stute (1979), and Massart (1983).

It is only natural to ask whether the uniform law of large numbers
$$\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| \to 0$$
holds if we allow $\mathcal{A}$ to be the class of all measurable subsets of $\mathcal{R}^d$. In this case the supremum above is called the total variation between the measures $\nu_n$ and $\nu$. The convergence clearly cannot hold if $\nu_n$ is the standard empirical measure
$$\nu_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{Z_i\in A\}}.$$
But is there another empirical measure such that the convergence holds? The somewhat amusing answer is no. As Devroye and Gyorfi (1992) proved, for any empirical measure $\nu_n$ (that is, a function depending on $Z_1,\ldots,Z_n$ assigning a nonnegative number to any measurable set) there exists a distribution of $Z$ such that for all $n$
$$\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > 1/4$$
almost surely. Thus, in this generality, the problem is hopeless. For meaningful results, either $\mathcal{A}$ or $\nu$ must be restricted. For example, if we assume that $\nu$ is absolutely continuous with density $f$, and that $\nu_n$ is absolutely continuous too (with density $f_n$), then by Scheffe's theorem (1947),
$$\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| = \frac{1}{2}\int_{\mathcal{R}^d} |f_n(x) - f(x)|\,dx$$
(see Problem 12.13). But as we see from Problems 6.2, 10.2, 9.6, and 10.3, there exist density estimators (such as histogram and kernel estimates) such that the $L_1$-error converges to zero almost surely for all possible densities. Therefore, the total variation between the empirical measures derived from these density estimates and the true measure converges to zero almost surely for all distributions with a density. For other large classes of distributions that can be estimated consistently in total variation, we refer to Barron, Gyorfi, and van der Meulen (1992).
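Scheffe's identity is easy to check numerically for two explicit densities. The sketch below is an illustration only (two Gaussian densities on a truncated grid, so there is a small discretization and truncation error); it compares half the $L_1$ distance with the difference in measure of the optimal set $\{x: f(x) > g(x)\}$.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 200_001)      # truncation at +-8 introduces a negligible error
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)                 # N(0,1) density
g = np.exp(-0.5 * (x - 1.0)**2 / 4.0) / np.sqrt(8 * np.pi)   # N(1,4) density

half_l1 = 0.5 * np.sum(np.abs(f - g)) * dx
# Scheffe: the supremum over all Borel sets is attained at A = {x: f(x) > g(x)}.
A = f > g
tv_at_optimal_set = np.sum((f - g)[A]) * dx
print(half_l1, tv_at_optimal_set)        # the two numbers agree up to discretization error
```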


Problems and Exercises

Problem 12.1. Prove that if a nonnegative random variable $Z$ satisfies $\mathbf{P}\{Z > t\} \le ce^{-2nt^2}$ for all $t > 0$ and some $c > 0$, then
$$\mathbf{E}\{Z^2\} \le \frac{\log(ce)}{2n}.$$
Furthermore,
$$\mathbf{E}\{Z\} \le \sqrt{\mathbf{E}\{Z^2\}} \le \sqrt{\frac{\log(ce)}{2n}}.$$
Hint: Use the identity $\mathbf{E}\{Z^2\} = \int_0^\infty\mathbf{P}\{Z^2 > t\}\,dt$, and set $\int_0^\infty = \int_0^u + \int_u^\infty$. Bound the first integral by $u$, and the second by the exponential inequality. Find the value of $u$ that minimizes the obtained upper bound.

Problem 12.2. Generalize the arguments of Theorems 4.5 and 4.6 to prove Theorems 12.2 and 12.3.

Problem 12.3. Determine the fingering dimension of classes of classifiers $\mathcal{C} = \{\phi:\ \phi(x) = I_{\{x\in A\}};\ A\in\mathcal{A}\}$ if the class $\mathcal{A}$ is
(1) the class of all closed intervals in $\mathcal{R}$,
(2) the class of all sets obtained as the union of $m$ closed intervals in $\mathcal{R}$,
(3) the class of balls in $\mathcal{R}^d$ centered at the origin,
(4) the class of all balls in $\mathcal{R}^d$,
(5) the class of sets of the form $(-\infty,x_1]\times\cdots\times(-\infty,x_d]$ in $\mathcal{R}^d$, and
(6) the class of all convex polygons in $\mathcal{R}^2$.

Problem 12.4. Let $\mathcal{C}$ be a class of classifiers with fingering dimension $k > 4$ (independently of $n$). Show that $V_{\mathcal{C}} \le k\log_2^2 k$.

Problem 12.5. Prove that Theorem 12.6 implies Corollary 12.3. Hint: Find $N(\epsilon,\delta)$ such that $8n^{V_{\mathcal{C}}}e^{-n\epsilon^2/128} \le \delta$ whenever $n > N(\epsilon,\delta)$. To see this, first show that $n^{V_{\mathcal{C}}} \le e^{n\epsilon^2/256}$ is satisfied for $n \ge \frac{512V_{\mathcal{C}}}{\epsilon^2}\log\frac{256V_{\mathcal{C}}}{\epsilon^2}$, which follows from the fact that $2\log x \le x$ if $x \ge e^2$. But in this case $8n^{V_{\mathcal{C}}}e^{-n\epsilon^2/128} \le 8e^{-n\epsilon^2/256}$. The upper bound does not exceed $\delta$ if $n \ge \frac{256}{\epsilon^2}\log\frac{8}{\delta}$.

Problem 12.6. Let $\mathcal{C}$ be a class of classifiers $\phi: \mathcal{R}^d\to\{0,1\}$, and let $\phi_n^*$ be a classifier minimizing the empirical error probability measured on $D_n$. Assume that we have an algorithm which selects a classifier $g_n$ such that
$$\mathbf{P}\left\{L_n(g_n) \le \inf_{\phi\in\mathcal{C}} L_n(\phi) + \epsilon_n\right\} \ge 1 - \delta_n,$$
where $\epsilon_n$ and $\delta_n$ are sequences of positive numbers converging to zero. Show that
$$\mathbf{P}\left\{L(g_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le \delta_n + \mathbf{P}\left\{2\sup_{\phi\in\mathcal{C}} |L_n(\phi) - L(\phi)| > \epsilon - \epsilon_n\right\}.$$


Find conditions on $\epsilon_n$ and $\delta_n$ so that
$$\mathbf{E}L(g_n) - \inf_{\phi\in\mathcal{C}} L(\phi) = O\left(\sqrt{\frac{\log n}{n}}\right),$$
that is, $\mathbf{E}L(g_n)$ converges to the optimum at the same order as the error probability of $\phi_n^*$.

Problem 12.7. Prove that
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le 2\,\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu_n'(A)| > \frac{\epsilon}{2}\right\}$$
holds if $n\epsilon > 2$. This inequality is needed to complete the proof of Theorem 12.7. Hint: Proceed as in the proof of Theorem 12.5. Introduce $A^*$ with $\nu_n(A^*) = 0$ and justify the validity of the steps of the following chain of inequalities:
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} |\nu_n(A) - \nu_n'(A)| > \epsilon/2\right\}
\ge \mathbf{E}\left\{I_{\{\nu(A^*)>\epsilon\}}\,\mathbf{P}\left\{\nu_n'(A^*) \ge \frac{\epsilon}{2}\,\Big|\,Z_1,\ldots,Z_n\right\}\right\}
\ge \mathbf{P}\left\{B(n,\epsilon) > \frac{n\epsilon}{2}\right\}\,\mathbf{P}\{|\nu_n(A^*) - \nu(A^*)| > \epsilon\},$$
where $B(n,\epsilon)$ is a binomial random variable with parameters $n$ and $\epsilon$. Finish the proof by showing that the probability on the right-hand side is greater than or equal to $1/2$ if $n\epsilon > 2$. (Under the slightly more restrictive condition $n\epsilon > 8$, this follows from Chebyshev's inequality.)

Problem 12.8. Prove that Theorem 12.7 implies that if $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$, then
$$\mathbf{E}L(\phi_n^*) \le \frac{2V_{\mathcal{C}}\log(2n) + 4}{n\log 2}.$$
We note here that Haussler, Littlestone, and Warmuth (1988) demonstrated the existence of a classifier $\phi_n^*$ with $\mathbf{E}L(\phi_n^*) \le 2V_{\mathcal{C}}/n$ when $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. Hint: Use the identity $\mathbf{E}X = \int_0^\infty\mathbf{P}\{X > t\}\,dt$ for nonnegative random variables $X$, and employ the fact that $S(\mathcal{C},n) \le n^{V_{\mathcal{C}}}$ (see Theorem 13.3).

Problem 12.9. Prove the following version of Theorem 12.7. Let $\phi_n^*$ be a function that minimizes the empirical error over a class $\mathcal{C}$. Assume that $\inf_{\phi\in\mathcal{C}} L(\phi) = 0$. Then
$$\mathbf{P}\{L(\phi_n^*) > \epsilon\} \le 2\,S(\mathcal{C},n^2)\,e^{-n\epsilon}$$
(Lugosi (1995)). Hint: Modify the proof of Theorem 12.7 by introducing a ghost sample $Z_1',\ldots,Z_m'$ of size $m$ (to be specified after optimization). Only the first symmetrization step (Problem 12.7) needs adjusting: show that for any $\alpha\in(0,1)$,
$$\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} \nu(A) > \epsilon\right\} \le \frac{1}{1 - e^{-m\alpha\epsilon}}\,\mathbf{P}\left\{\sup_{A:\nu_n(A)=0} \nu_m'(A) > (1-\alpha)\epsilon\right\},$$
where $\nu_m'$ is the empirical measure based on the ghost sample (use Bernstein's inequality, Theorem 8.2). The rest of the proof is similar to that of Theorem 12.7. Choose $m = n^2 - n$ and $\alpha = n/(n+m)$.


Problem 12.10. Prove that Alexander’s bound (Theorem 12.10) implies that ifA is a Vapnik-Chervonenkis class with vc dimension V = VA, then

E

supA∈A

|νn(A)− ν(A)|≤

8 +√

2048V log(4096V )√n

.

Hint: Justify the following steps:(1) If ψ is a negative decreasing concave function, then∫ ∞

u

eψ(t)dt ≤ eψ(u)

−ψ′(u) .

Hint: Bound ψ by using its Taylor series expansion.(2) Let b > 0 be fixed. Then for u ≥

√b/2,∫ ∞

u

tbe−2t2dt ≤ ub−1e−2u2

2.

Hint: Use the previous step.(3) Let X be a positive random variable for which

PX > u ≤ aube−2u2, u ≥

√c,

where a, b, c are positive constants. Then, if b ≥ 2c ≥ e,

EX ≤ a

2+

√b log b

2.

Hint: Use EX =∫∞0

PX > udu, and bound the probability either byone, or by the bound of the previous step.

Problem 12.11. Use Alexander’s inequality to obtain the following sample sizebound for empirical error minimization:

N(ε, δ) ≤ max

(214VC log

(214VC

)ε2

,4

ε2log

16

δ

).

For what values of ε does this bound beat Corollary 12.3?

Problem 12.12. Let $Z_1,\ldots,Z_n$ be i.i.d. random variables in $\mathcal{R}^d$, with measure $\nu$, and standard empirical measure $\nu_n$. Let $\mathcal{A}$ be an arbitrary class of subsets of $\mathcal{R}^d$. Show that as $n\to\infty$,
$$\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| \to 0 \quad \text{in probability}$$
if and only if
$$\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| \to 0 \quad \text{with probability one.}$$
Hint: Use McDiarmid's inequality.


Problem 12.13. Prove Scheffe’s theorem (1947): let µ and ν be absolute con-tinuous probability measures on Rd with densities f and g, respectively. Provethat

supA∈A

|µ(A)− ν(A)| = 1

2

∫|f(x)− g(x)|dx,

where A is the class of all Borel-measurable sets. Hint: Show that the supremumis achieved for the set x : f(x) > g(x).

Problem 12.14. Learning based on empirical covering. This problem demonstrates an alternative method of picking a classifier which works as well as empirical error minimization. The method, based on empirical covering of the class of classifiers, was introduced by Buescher and Kumar (1994a). The idea of covering the class goes back to Vapnik (1982). See also Benedek and Itai (1988), Kulkarni (1991), and Dudley, Kulkarni, Richardson, and Zeitouni (1994). Let $\mathcal{C}$ be a class of classifiers $\phi: \mathcal{R}^d\to\{0,1\}$. The data set $D_n$ is split into two parts, $D_m = ((X_1,Y_1),\ldots,(X_m,Y_m))$ and $T_l = ((X_{m+1},Y_{m+1}),\ldots,(X_n,Y_n))$, where $n = m + l$. We use the first part $D_m$ to cover $\mathcal{C}$ as follows. Define the random variable $N$ as the number of different values the binary vector $b_m(\phi) = (\phi(X_1),\ldots,\phi(X_m))$ takes as $\phi$ is varied over $\mathcal{C}$. Clearly, $N \le S(\mathcal{C},m)$. Take $N$ classifiers from $\mathcal{C}$, such that all $N$ possible values of the binary vector $b_m(\phi)$ are represented exactly once. Denote these classifiers by $\phi_1,\ldots,\phi_N$. Among these functions, pick one that minimizes the empirical error on the second part of the data set $T_l$:
$$L_l(\phi_i) = \frac{1}{l}\sum_{j=1}^l I_{\{\phi_i(X_{m+j})\neq Y_{m+j}\}}.$$
Denote the selected classifier by $\phi_n$. Show that for every $n$, $m$, and $\epsilon > 0$, the difference between the error probability $L(\phi_n) = \mathbf{P}\{\phi_n(X)\neq Y|D_n\}$ and the minimal error probability in the class satisfies
$$\mathbf{P}\left\{L(\phi_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le 2S(\mathcal{C},m)e^{-(n-m)\epsilon^2/8} + 2S^4(\mathcal{C},2m)e^{-m\epsilon\log 2/4}$$
(Buescher and Kumar (1994a)). For example, by taking $m\sim\sqrt{n}$, we get
$$\mathbf{E}\left\{L(\phi_n) - \inf_{\phi\in\mathcal{C}} L(\phi)\right\} \le c\sqrt{\frac{V_{\mathcal{C}}\log n}{n}},$$
where $c$ is a universal constant. The fact that the number of samples $m$ used for covering $\mathcal{C}$ is very small compared to $n$ may make the algorithm computationally more attractive than the method of empirical error minimization. Hint: Use the decomposition
$$\mathbf{P}\left\{L(\phi_n) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le \mathbf{P}\left\{L(\phi_n) - \inf_{i=1,\ldots,N} L(\phi_i) > \frac{\epsilon}{2}\right\} + \mathbf{P}\left\{\inf_{i=1,\ldots,N} L(\phi_i) - \inf_{\phi\in\mathcal{C}} L(\phi) > \frac{\epsilon}{2}\right\}.$$


Bound the first term on the right-hand side by using Lemma 8.2 and Hoeffding's inequality:
$$\mathbf{P}\left\{L(\phi_n) - \inf_{i=1,\ldots,N} L(\phi_i) > \frac{\epsilon}{2}\right\} \le 2S(\mathcal{C},m)e^{-l\epsilon^2/8}.$$
To bound the second term of the decomposition, observe that
$$\inf_{i=1,\ldots,N} L(\phi_i) - \inf_{\phi\in\mathcal{C}} L(\phi) \le \sup_{\phi,\phi'\in\mathcal{C}:\ b_m(\phi)=b_m(\phi')} |L(\phi) - L(\phi')| \le \sup_{\phi,\phi'\in\mathcal{C}:\ b_m(\phi)=b_m(\phi')} \mathbf{P}\{\phi(X)\neq\phi'(X)\} = \sup_{A\in\mathcal{A}:\nu_m(A)=0} \nu(A),$$
where
$$\mathcal{A} = \left\{\{x: \phi(x)=1\}:\ \phi(x) = |\phi_1(x) - \phi_2(x)|,\ \phi_1,\phi_2\in\mathcal{C}\right\},$$
and $\nu_m(A) = \frac{1}{m}\sum_{i=1}^m I_{\{X_i\in A\}}$. Bound the latter quantity by applying Theorem 12.7. To do this, you will need to bound the shatter coefficients $s(\mathcal{A},2m)$. In Chapter 13 we introduce simple tools for this. For example, it is easy to deduce from parts (ii), (iii), and (iv) of Theorem 13.5 that $s(\mathcal{A},2m) \le S^4(\mathcal{C},2m)$.

Problem 12.15. Prove that for all $\epsilon\in(0,1)$,
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \epsilon\right\} \le c\,s(\mathcal{A},n^2)\,e^{-2n\epsilon^2},$$
where $c \le 4e^{4\epsilon+4\epsilon^2}$ (Devroye (1982a)). Hint: Proceed as indicated by the following steps:
(1) Introduce an i.i.d. ghost sample $Z_1',\ldots,Z_m'$ of size $m = n^2 - n$, where $Z_1'$ is distributed as $Z_1$. Denote the corresponding empirical measure by $\nu_m'$. As in the proof of the first step of Theorem 12.4, prove that for $\alpha,\epsilon\in(0,1)$,
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu_m'(A)| > (1-\alpha)\epsilon\right\} \ge \left(1 - \frac{1}{4\alpha^2\epsilon^2 m}\right)\mathbf{P}\left\{\sup_{A\in\mathcal{A}} |\nu_n(A) - \nu(A)| > \epsilon\right\}.$$
(2) Introduce the $n^2!$ permutations $\Pi_1,\ldots,\Pi_{n^2!}$ of the $n+m$ random variables as in Step 2 of the proof of Theorem 12.5. Show that
$$\frac{1}{n^2!}\sum_{j=1}^{n^2!}\ \sup_{A\in\mathcal{A}} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i)) - \frac{1}{m}\sum_{i=1}^m I_A(\Pi_j(Z_i'))\right| > (1-\alpha)\epsilon\right\}}
\le s(\mathcal{A},n^2)\,\max_{A\in\mathcal{A}}\frac{1}{n^2!}\sum_{j=1}^{n^2!} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i)) - \frac{1}{m}\sum_{i=1}^m I_A(\Pi_j(Z_i'))\right| > (1-\alpha)\epsilon\right\}}.$$


(3) Show that for each $A\in\mathcal{A}$,
$$\frac{1}{n^2!}\sum_{j=1}^{n^2!} I_{\left\{\left|\frac{1}{n}\sum_{i=1}^n I_A(\Pi_j(Z_i)) - \frac{1}{m}\sum_{i=1}^m I_A(\Pi_j(Z_i'))\right| > (1-\alpha)\epsilon\right\}} \le 2e^{-2n\epsilon^2 + 4\alpha n\epsilon^2 + 4\epsilon^2}$$
by using Hoeffding's inequality for sampling without replacement from $n^2$ binary-valued elements (see Theorem A.25). Choose $\alpha = 1/(n\epsilon)$.


13 Combinatorial Aspects of Vapnik-Chervonenkis Theory

13.1 Shatter Coefficients and VC Dimension

In this section we list a few interesting properties of shatter coefficients $s(\mathcal{A},n)$ and of the vc dimension $V_{\mathcal{A}}$ of a class of sets $\mathcal{A}$. We begin with a property that makes things easier. In Chapter 12 we noted the importance of classes of the form
$$\tilde{\mathcal{A}} = \left\{A\times\{0\}\cup A^c\times\{1\};\ A\in\mathcal{A}\right\}.$$
(The sets $A$ are of the form $\{x: \phi(x)=1\}$, and the sets in $\tilde{\mathcal{A}}$ are sets of pairs $(x,y)$ for which $\phi(x)\neq y$.) Recall that if $\mathcal{C}$ is a class of classifiers $\phi: \mathcal{R}^d\to\{0,1\}$, then by definition, $S(\mathcal{C},n) = s(\tilde{\mathcal{A}},n)$ and $V_{\mathcal{C}} = V_{\tilde{\mathcal{A}}}$. The first result states that $S(\mathcal{C},n) = s(\mathcal{A},n)$, so it suffices to investigate properties of $\mathcal{A}$, a class of subsets of $\mathcal{R}^d$.

Theorem 13.1. For every $n$ we have $s(\tilde{\mathcal{A}},n) = s(\mathcal{A},n)$, and therefore $V_{\tilde{\mathcal{A}}} = V_{\mathcal{A}}$.

Proof. Let $N$ be a positive integer. We show that for any $n$ pairs from $\mathcal{R}^d\times\{0,1\}$, if $N$ sets from $\tilde{\mathcal{A}}$ pick $N$ different subsets of the $n$ pairs, then there are $N$ corresponding sets in $\mathcal{A}$ that pick $N$ different subsets of $n$ points in $\mathcal{R}^d$, and vice versa. Fix $n$ pairs $(x_1,0),\ldots,(x_m,0),(x_{m+1},1),\ldots,(x_n,1)$. Note that since ordering does not matter, we may arrange any $n$ pairs in this manner. Assume that for a certain set $A\in\mathcal{A}$, the corresponding set $\tilde{A} = A\times\{0\}\cup A^c\times\{1\}\in\tilde{\mathcal{A}}$ picks out the pairs $(x_1,0),\ldots,(x_k,0),(x_{m+1},1),\ldots,$


$(x_{m+l},1)$, that is, the set of these pairs is the intersection of $\tilde{A}$ and the $n$ pairs. Again, we can assume without loss of generality that the pairs are ordered in this way. This means that $A$ picks from the set $\{x_1,\ldots,x_n\}$ the subset $\{x_1,\ldots,x_k,x_{m+l+1},\ldots,x_n\}$, and the two subsets uniquely determine each other. This proves $s(\tilde{\mathcal{A}},n) \le s(\mathcal{A},n)$. To prove the other direction, notice that if $A$ picks a subset of $k$ points $\{x_1,\ldots,x_k\}$, then the corresponding set $\tilde{A}\in\tilde{\mathcal{A}}$ picks the pairs with the same indices from $(x_1,0),\ldots,(x_k,0)$. Equality of the vc dimensions follows from the equality of the shatter coefficients. □

The following theorem, attributed to Vapnik and Chervonenkis (1971) and Sauer (1972), describes the relationship between the vc dimension and shatter coefficients of a class of sets. This is the most important tool for obtaining useful upper bounds on the shatter coefficients in terms of the vc dimension.

Theorem 13.2. If $\mathcal{A}$ is a class of sets with vc dimension $V_{\mathcal{A}}$, then for every $n$,
$$s(\mathcal{A},n) \le \sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i}.$$

Proof. Recall the definition of the shatter coefficients
$$s(\mathcal{A},n) = \max_{(x_1,\ldots,x_n)} N_{\mathcal{A}}(x_1,\ldots,x_n),$$
where
$$N_{\mathcal{A}}(x_1,\ldots,x_n) = \left|\left\{\{x_1,\ldots,x_n\}\cap A;\ A\in\mathcal{A}\right\}\right|.$$
Clearly, it suffices to prove that for every $x_1,\ldots,x_n$,
$$N_{\mathcal{A}}(x_1,\ldots,x_n) \le \sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i}.$$
But since $N_{\mathcal{A}}(x_1,\ldots,x_n)$ is just the shatter coefficient of the class of finite sets
$$\left\{\{x_1,\ldots,x_n\}\cap A;\ A\in\mathcal{A}\right\},$$
we need only to prove the theorem for finite sets. We assume without loss of generality that $\mathcal{A}$ is a class of subsets of $\{x_1,\ldots,x_n\}$ with vc dimension $V_{\mathcal{A}}$. Note that in this case $s(\mathcal{A},n) = |\mathcal{A}|$.

We prove the theorem by induction with respect to $n$ and $V_{\mathcal{A}}$. The statement is obviously true for $n = 1$ for any class with $V_{\mathcal{A}} \ge 1$. It is also true for any $n \ge 1$ if $V_{\mathcal{A}} = 0$, since $s(\mathcal{A},n) = 1$ for all $n$ in this case. Thus, we assume $V_{\mathcal{A}} \ge 1$. Assume that the statement is true for all $k < n$ for all classes of subsets of $\{x_1,\ldots,x_k\}$ with vc dimension not exceeding $V_{\mathcal{A}}$,


and for $n$ for all classes with vc dimension smaller than $V_{\mathcal{A}}$. Define the following two classes of subsets of $\{x_1,\ldots,x_n\}$:
$$\mathcal{A}' = \left\{A - \{x_n\};\ A\in\mathcal{A}\right\},$$
and
$$\tilde{\mathcal{A}} = \left\{A\in\mathcal{A}:\ x_n\notin A,\ A\cup\{x_n\}\in\mathcal{A}\right\}.$$
Note that both $\mathcal{A}'$ and $\tilde{\mathcal{A}}$ contain subsets of $\{x_1,\ldots,x_{n-1}\}$. $\tilde{\mathcal{A}}$ contains all sets $A$ that are members of $\mathcal{A}$ such that $A\cup\{x_n\}$ is also in $\mathcal{A}$, but $x_n\notin A$.

Then $|\mathcal{A}| = |\mathcal{A}'| + |\tilde{\mathcal{A}}|$. To see this, write
$$\mathcal{A}' = \left\{A-\{x_n\}: x_n\in A,\ A\in\mathcal{A}\right\}\cup\left\{A-\{x_n\}: x_n\notin A,\ A\in\mathcal{A}\right\} = \mathcal{B}_1\cup\mathcal{B}_2.$$
Thus,
$$|\mathcal{A}'| = |\mathcal{B}_1| + |\mathcal{B}_2| - |\mathcal{B}_1\cap\mathcal{B}_2|
= |\{A-\{x_n\}: x_n\in A,\ A\in\mathcal{A}\}| + |\{A-\{x_n\}: x_n\notin A,\ A\in\mathcal{A}\}| - |\tilde{\mathcal{A}}|
= |\{A: x_n\in A,\ A\in\mathcal{A}\}| + |\{A: x_n\notin A,\ A\in\mathcal{A}\}| - |\tilde{\mathcal{A}}|
= |\mathcal{A}| - |\tilde{\mathcal{A}}|.$$
Since $V_{\mathcal{A}'} \le V_{\mathcal{A}}$, and $\mathcal{A}'$ is a class of subsets of $\{x_1,\ldots,x_{n-1}\}$, the induction hypothesis implies that
$$|\mathcal{A}'| = s(\mathcal{A}',n-1) \le \sum_{i=0}^{V_{\mathcal{A}}}\binom{n-1}{i}.$$
Next we show that $V_{\tilde{\mathcal{A}}} \le V_{\mathcal{A}} - 1$, which will imply
$$|\tilde{\mathcal{A}}| = s(\tilde{\mathcal{A}},n-1) \le \sum_{i=0}^{V_{\mathcal{A}}-1}\binom{n-1}{i}$$
by the induction hypothesis. To see this, consider a set $S\subset\{x_1,\ldots,x_{n-1}\}$ that is shattered by $\tilde{\mathcal{A}}$. Then $S\cup\{x_n\}$ is shattered by $\mathcal{A}$. To prove this we have to show that for any set $S'\subset S$, both $S'$ and $S'\cup\{x_n\}$ are the intersection of $S\cup\{x_n\}$ and a set from $\mathcal{A}$. Since $S$ is shattered by $\tilde{\mathcal{A}}$, if $S'\subset S$, then there exists a set $A\in\tilde{\mathcal{A}}$ such that $S' = S\cap A$. But since by definition $x_n\notin A$, we must have
$$S' = (S\cup\{x_n\})\cap A$$
and
$$S'\cup\{x_n\} = (S\cup\{x_n\})\cap\left(A\cup\{x_n\}\right).$$
Since by the definition of $\tilde{\mathcal{A}}$ both $A$ and $A\cup\{x_n\}$ are in $\mathcal{A}$, we see that $S\cup\{x_n\}$ is indeed shattered by $\mathcal{A}$. But any set that is shattered by $\mathcal{A}$ must have cardinality not exceeding $V_{\mathcal{A}}$, therefore $|S| \le V_{\mathcal{A}} - 1$. But $S$ was an arbitrary set shattered by $\tilde{\mathcal{A}}$, which means $V_{\tilde{\mathcal{A}}} \le V_{\mathcal{A}} - 1$. Thus, we have shown that
$$s(\mathcal{A},n) = |\mathcal{A}| = |\mathcal{A}'| + |\tilde{\mathcal{A}}| \le \sum_{i=0}^{V_{\mathcal{A}}}\binom{n-1}{i} + \sum_{i=0}^{V_{\mathcal{A}}-1}\binom{n-1}{i}.$$
Straightforward application of the identity $\binom{n}{i} = \binom{n-1}{i} + \binom{n-1}{i-1}$ shows that
$$\sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} = \sum_{i=0}^{V_{\mathcal{A}}}\binom{n-1}{i} + \sum_{i=0}^{V_{\mathcal{A}}-1}\binom{n-1}{i}. \qquad\Box$$

Theorem 13.2 has some very surprising implications. For example, it follows immediately from the binomial theorem that $s(\mathcal{A},n) \le (n+1)^{V_{\mathcal{A}}}$. This means that a shatter coefficient falls in one of two categories: either $s(\mathcal{A},n) = 2^n$ for all $n$, or $s(\mathcal{A},n) \le (n+1)^{V_{\mathcal{A}}}$, which happens if the vc dimension of $\mathcal{A}$ is finite. We cannot have $s(\mathcal{A},n) \sim 2^{\sqrt{n}}$, for example. If $V_{\mathcal{A}} < \infty$, the upper bound in Theorem 12.5 decreases exponentially quickly with $n$. Other sharper bounds are given below.

Theorem 13.3. For all $n > 2V_{\mathcal{A}}$,
$$s(\mathcal{A},n) \le \sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} \le \left(\frac{en}{V_{\mathcal{A}}}\right)^{V_{\mathcal{A}}}.$$
For $V_{\mathcal{A}} > 2$, $s(\mathcal{A},n) \le n^{V_{\mathcal{A}}}$, and for all $V_{\mathcal{A}}$, $s(\mathcal{A},n) \le n^{V_{\mathcal{A}}} + 1$.

Theorem 13.3 follows from Theorem 13.4 below. We leave the details as an exercise (see Problem 13.2).

Theorem 13.4. For all $n \ge 1$ and $V_{\mathcal{A}} < n/2$,
$$s(\mathcal{A},n) \le e^{nH\left(\frac{V_{\mathcal{A}}}{n}\right)},$$
where $H(x) = -x\log x - (1-x)\log(1-x)$ for $x\in(0,1)$, and $H(0) = H(1) = 0$, is the binary entropy function.

Theorem 13.4 is a consequence of Theorem 13.2 and the inequality below. A different, probabilistic proof is sketched in Problem 13.3 (see also Problem 13.4).

Lemma 13.1. For $k < n/2$,
$$\sum_{i=0}^k\binom{n}{i} \le e^{nH\left(\frac{k}{n}\right)}.$$


Proof. Introduce $\lambda = k/n \le 1/2$. By the binomial theorem,
$$1 = (\lambda + (1-\lambda))^n \ge \sum_{i=0}^{\lambda n}\binom{n}{i}\lambda^i(1-\lambda)^{n-i}
\ge \sum_{i=0}^{\lambda n}\binom{n}{i}\left(\frac{\lambda}{1-\lambda}\right)^{\lambda n}(1-\lambda)^n \quad (\text{since } \lambda/(1-\lambda)\le 1)
= e^{-nH(\lambda)}\sum_{i=0}^k\binom{n}{i},$$
the desired inequality. □
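The entropy bound of Lemma 13.1 (and hence the polynomial bounds of Theorem 13.3) can be checked numerically. The following sketch is an illustration only; it compares the partial binomial sum with $e^{nH(k/n)}$ for a few values of $n$ and $k < n/2$.

```python
from math import comb, exp, log

def H(x):
    """Binary entropy with natural logarithms; H(0) = H(1) = 0."""
    return 0.0 if x in (0.0, 1.0) else -x * log(x) - (1 - x) * log(1 - x)

def partial_binomial_sum(n, k):
    return sum(comb(n, i) for i in range(k + 1))

for n in [20, 100, 500]:
    for k in [1, n // 10, n // 3]:
        lhs = partial_binomial_sum(n, k)
        rhs = exp(n * H(k / n))
        assert lhs <= rhs      # Lemma 13.1
        print(n, k, lhs, round(rhs, 1))
```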

Remark. The binary entropy function $H(x)$ plays a central role in information theory (see, e.g., Csiszar and Korner (1981), Cover and Thomas (1991)). Its main properties are the following: $H(x)$ is symmetric around $1/2$, where it takes its maximum. It is continuous, concave, strictly monotone increasing on $[0,1/2]$, decreasing on $[1/2,1]$, and equals zero for $x = 0$ and $x = 1$. □

Next we present some simple results about shatter coefficients of classes that are obtained by combinations of classes of sets.

Theorem 13.5.
(i) If $\mathcal{A} = \mathcal{A}_1\cup\mathcal{A}_2$, then $s(\mathcal{A},n) \le s(\mathcal{A}_1,n) + s(\mathcal{A}_2,n)$.
(ii) Given a class $\mathcal{A}$, define $\mathcal{A}^c = \{A^c;\ A\in\mathcal{A}\}$. Then $s(\mathcal{A}^c,n) = s(\mathcal{A},n)$.
(iii) For $\mathcal{B} = \{A\cap\tilde{A};\ A\in\mathcal{A},\ \tilde{A}\in\tilde{\mathcal{A}}\}$, $s(\mathcal{B},n) \le s(\mathcal{A},n)\,s(\tilde{\mathcal{A}},n)$.
(iv) For $\mathcal{B} = \{A\cup\tilde{A};\ A\in\mathcal{A},\ \tilde{A}\in\tilde{\mathcal{A}}\}$, $s(\mathcal{B},n) \le s(\mathcal{A},n)\,s(\tilde{\mathcal{A}},n)$.
(v) For $\mathcal{B} = \{A\times\tilde{A};\ A\in\mathcal{A},\ \tilde{A}\in\tilde{\mathcal{A}}\}$, $s(\mathcal{B},n) \le s(\mathcal{A},n)\,s(\tilde{\mathcal{A}},n)$.

Proof. (i), (ii), and (v) are trivial. To prove (iii), fix $n$ points $x_1,\ldots,x_n$, and assume that $\mathcal{A}$ picks $N \le s(\mathcal{A},n)$ subsets $C_1,\ldots,C_N$. Then $\tilde{\mathcal{A}}$ picks from $C_i$ at most $s(\tilde{\mathcal{A}},|C_i|)$ subsets. Therefore, sets of the form $A\cap\tilde{A}$ pick at most
$$\sum_{i=1}^N s(\tilde{\mathcal{A}},|C_i|) \le s(\mathcal{A},n)\,s(\tilde{\mathcal{A}},n)$$
subsets. Here we used the obvious monotonicity property $s(\tilde{\mathcal{A}},n) \le s(\tilde{\mathcal{A}},n+m)$. To prove (iv), observe that
$$\left\{A\cup\tilde{A};\ A\in\mathcal{A},\ \tilde{A}\in\tilde{\mathcal{A}}\right\} = \left\{\left(A^c\cap\tilde{A}^c\right)^c;\ A\in\mathcal{A},\ \tilde{A}\in\tilde{\mathcal{A}}\right\}.$$


The statement now follows from (ii) and (iii). □

13.2 Shatter Coefficients of Some Classes

Here we calculate shatter coefficients of some simple but important examples of classes of subsets of $\mathcal{R}^d$. We begin with a simple observation.

Theorem 13.6. If $\mathcal{A}$ contains finitely many sets, then $V_{\mathcal{A}} \le \log_2|\mathcal{A}|$, and $s(\mathcal{A},n) \le |\mathcal{A}|$ for every $n$.

Proof. The first inequality follows from the fact that at least $2^n$ sets are necessary to shatter $n$ points. The second inequality is trivial. □

In the next example, it is interesting to observe that the bound of Theorem 13.2 is tight.

Theorem 13.7.
(i) If $\mathcal{A}$ is the class of all half lines: $\mathcal{A} = \{(-\infty,x];\ x\in\mathcal{R}\}$, then $V_{\mathcal{A}} = 1$, and
$$s(\mathcal{A},n) = n+1 = \binom{n}{0} + \binom{n}{1}.$$
(ii) If $\mathcal{A}$ is the class of all intervals in $\mathcal{R}$, then $V_{\mathcal{A}} = 2$, and
$$s(\mathcal{A},n) = \frac{n(n+1)}{2} + 1 = \binom{n}{0} + \binom{n}{1} + \binom{n}{2}.$$

Proof. (i) is easy. To see that $V_{\mathcal{A}} = 2$ in (ii), observe that if we fix three different points in $\mathcal{R}$, then there is no interval that does not contain the middle point, but does contain the other two. The shatter coefficient can be calculated by counting that there are at most $n-k+1$ sets in $\{A\cap\{x_1,\ldots,x_n\};\ A\in\mathcal{A}\}$ such that $|A\cap\{x_1,\ldots,x_n\}| = k$ for $k = 1,\ldots,n$, and one set (namely $\emptyset$) such that $|A\cap\{x_1,\ldots,x_n\}| = 0$. This gives altogether
$$1 + \sum_{k=1}^n(n-k+1) = \frac{n(n+1)}{2} + 1. \qquad\Box$$
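Part (ii) is easy to verify by brute force for small point sets. The sketch below is an illustration only; it enumerates the subsets of $n$ distinct points picked by closed intervals and compares the count with $n(n+1)/2 + 1$.

```python
def interval_shatter_coefficient(points):
    """Number of distinct subsets of `points` picked by closed intervals [a, b]."""
    pts = sorted(points)
    picked = {frozenset()}              # the empty set is always picked
    n = len(pts)
    for i in range(n):
        for j in range(i, n):
            picked.add(frozenset(pts[i:j + 1]))   # an interval picks a contiguous run
    return len(picked)

for n in [1, 2, 5, 10]:
    points = list(range(n))
    assert interval_shatter_coefficient(points) == n * (n + 1) // 2 + 1
    print(n, interval_shatter_coefficient(points), n * (n + 1) // 2 + 1)
```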

Now we can generalize the result above for classes of intervals and rectangles in $\mathcal{R}^d$:

Theorem 13.8.
(i) If $\mathcal{A}$ is the class of all sets of the form $(-\infty,x_1]\times\cdots\times(-\infty,x_d]$, then $V_{\mathcal{A}} = d$.


(ii) If $\mathcal{A}$ is the class of all rectangles in $\mathcal{R}^d$, then $V_{\mathcal{A}} = 2d$.

Proof. We prove (ii). The first part is left as an exercise (Problem 13.5). We have to show that there are $2d$ points that can be shattered by $\mathcal{A}$, but for any set of $2d+1$ points there is a subset of it that cannot be picked by sets in $\mathcal{A}$. To see the first part just consider the following $2d$ points:
$$(1,0,0,\ldots,0),(0,1,0,0,\ldots,0),\ldots,(0,0,\ldots,0,1),$$
$$(-1,0,0,\ldots,0),(0,-1,0,0,\ldots,0),\ldots,(0,0,\ldots,0,-1)$$
(see Figure 13.1).

Figure 13.1. $2^4 = 16$ rectangles shatter 4 points in $\mathcal{R}^2$.

On the other hand, for any given set of $2d+1$ points we can choose a subset of at most $2d$ points with the property that it contains a point with largest first coordinate, a point with smallest first coordinate, a point with largest second coordinate, and so forth. Clearly, there is no set in $\mathcal{A}$ that contains these points, but does not contain the others (Figure 13.2). □

Figure 13.2. No 5 points can be shattered by rectangles in $\mathcal{R}^2$.


Theorem 13.9. (Steele (1975), Dudley (1978)). Let $\mathcal{G}$ be a finite-dimensional vector space of real functions on $\mathcal{R}^d$. The class of sets
$$\mathcal{A} = \left\{\{x: g(x)\ge 0\}:\ g\in\mathcal{G}\right\}$$
has vc dimension $V_{\mathcal{A}} \le r$, where $r = \mathrm{dimension}(\mathcal{G})$.

Proof. It suffices to show that no set of size $m = r+1$ can be shattered by sets of the form $\{x: g(x)\ge 0\}$. Fix $m$ arbitrary points $x_1,\ldots,x_m$, and define the linear mapping $L: \mathcal{G}\to\mathcal{R}^m$ as
$$L(g) = (g(x_1),\ldots,g(x_m)).$$
Then the image of $\mathcal{G}$, $L(\mathcal{G})$, is a linear subspace of $\mathcal{R}^m$ of dimension not exceeding the dimension of $\mathcal{G}$, that is, $m-1$. Then there exists a nonzero vector $\gamma = (\gamma_1,\ldots,\gamma_m)\in\mathcal{R}^m$ that is orthogonal to $L(\mathcal{G})$, that is, for every $g\in\mathcal{G}$,
$$\gamma_1 g(x_1) + \cdots + \gamma_m g(x_m) = 0.$$
We can assume that at least one of the $\gamma_i$'s is negative. Rearrange this equality so that terms with nonnegative $\gamma_i$ stay on the left-hand side:
$$\sum_{i:\gamma_i\ge 0}\gamma_i g(x_i) = \sum_{i:\gamma_i<0}-\gamma_i g(x_i).$$
Now, suppose that there exists a $g\in\mathcal{G}$ such that the set $\{x: g(x)\ge 0\}$ picks exactly the $x_i$'s on the left-hand side. Then all terms on the left-hand side are nonnegative, while the terms on the right-hand side must be negative, which is a contradiction, so $x_1,\ldots,x_m$ cannot be shattered, and the proof is completed. □

Remark. Theorem 13.2 implies that the shatter coefficients of the class of sets in Theorem 13.9 are bounded as follows:
$$s(\mathcal{A},n) \le \sum_{i=0}^r\binom{n}{i}.$$
In many cases it is possible to get sharper estimates. Let
$$\mathcal{G} = \left\{\sum_{i=1}^r a_i\psi_i:\ a_1,\ldots,a_r\in\mathcal{R}\right\}$$
be the linear space of functions spanned by some fixed functions $\psi_1,\ldots,\psi_r: \mathcal{R}^d\to\mathcal{R}$, and define $\Psi(x) = (\psi_1(x),\ldots,\psi_r(x))$. Cover (1965) showed that if for some $x_1,\ldots,x_n\in\mathcal{R}^d$, every $r$-element subset of $\Psi(x_1),\ldots,\Psi(x_n)$ is


linearly independent, then the $n$-th shatter coefficient of the class of sets $\mathcal{A} = \{\{x: g(x)\ge 0\}: g\in\mathcal{G}\}$ actually equals
$$s(\mathcal{A},n) = 2\sum_{i=0}^{r-1}\binom{n-1}{i}$$
(see Problem 13.6). By using the last identity in the proof of Theorem 13.2, it is easily seen that the difference between the bound obtained from Theorem 13.9 and the true value is $\binom{n-1}{r}$. Using Cover's result for the shatter coefficients we can actually improve Theorem 13.9, in that the vc dimension of the class $\mathcal{A}$ equals $r$. To see this, note that
$$s(\mathcal{A},r) = 2\sum_{i=0}^{r-1}\binom{r-1}{i} = 2\cdot 2^{r-1} = 2^r,$$
while
$$s(\mathcal{A},r+1) = 2\sum_{i=0}^{r-1}\binom{r}{i} = 2\sum_{i=0}^r\binom{r}{i} - 2\binom{r}{r} = 2\cdot 2^r - 2 < 2^{r+1}.$$
Therefore no $r+1$ points are shattered. It is interesting that Theorem 13.9 above combined with Theorem 13.2 and Theorem 13.3 gives the bound
$$s(\mathcal{A},n) \le n^r + 1$$
when $r > 2$. Cover's result, however, improves it to
$$s(\mathcal{A},n) \le 2(n-1)^{r-1} + 2. \qquad\Box$$

Perhaps the most important class of sets is the class of halfspaces in $\mathcal{R}^d$, that is, sets containing points falling on one side of a hyperplane. The shatter coefficients of this class can be obtained from the results above:

Corollary 13.1. Let $\mathcal{A}$ be the class of halfspaces, that is, subsets of $\mathcal{R}^d$ of the form $\{x: a^Tx \ge b\}$, where $a\in\mathcal{R}^d$, $b\in\mathcal{R}$ take all possible values. Then $V_{\mathcal{A}} = d+1$, and
$$s(\mathcal{A},n) = 2\sum_{i=0}^d\binom{n-1}{i} \le 2(n-1)^d + 2.$$

Proof. This is an immediate consequence of the remark above if we take $\mathcal{G}$ to be the linear space spanned by the functions
$$\phi_1(x) = x^{(1)},\ \phi_2(x) = x^{(2)},\ \ldots,\ \phi_d(x) = x^{(d)},\quad\text{and}\quad \phi_{d+1}(x) = 1,$$


where $x^{(1)},\ldots,x^{(d)}$ denote the $d$ components of the vector $x$. □
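For $d = 1$ the corollary can be verified by brute force: half-lines of both orientations pick exactly $2\sum_{i=0}^1\binom{n-1}{i} = 2n$ subsets of $n$ distinct points. The sketch below is an illustration only; it counts the dichotomies directly.

```python
import numpy as np

def halfspace_dichotomies(points):
    """Distinct subsets of 1-d points picked by sets {x: a*x >= b}, a, b real."""
    pts = np.sort(np.asarray(points, dtype=float))
    # thresholds strictly between consecutive points, plus one beyond each end,
    # realize every achievable dichotomy
    cuts = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2.0, [pts[-1] + 1.0]))
    picked = set()
    for b in cuts:
        for a in (+1.0, -1.0):          # orientation of the half-line
            picked.add(frozenset(np.flatnonzero(a * pts >= a * b)))
    return len(picked)

rng = np.random.default_rng(1)
for n in [2, 3, 5, 8]:
    pts = rng.normal(size=n)
    print(n, halfspace_dichotomies(pts), 2 * n)   # the two counts coincide
```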

It is equally simple now to obtain an upper bound on the vc dimension of the class of all closed balls in $\mathcal{R}^d$ (see Cover (1965), Devroye (1978), or Dudley (1979) for more information).

Corollary 13.2. Let $\mathcal{A}$ be the class of all closed balls in $\mathcal{R}^d$, that is, subsets of $\mathcal{R}^d$ of the form
$$\left\{x = (x^{(1)},\ldots,x^{(d)}):\ \sum_{i=1}^d|x^{(i)} - a_i|^2 \le b\right\},$$
where $a_1,\ldots,a_d,b\in\mathcal{R}$ take all possible values. Then $V_{\mathcal{A}} \le d+2$.

Proof. If we write
$$\sum_{i=1}^d|x^{(i)} - a_i|^2 - b = \sum_{i=1}^d|x^{(i)}|^2 - 2\sum_{i=1}^d x^{(i)}a_i + \sum_{i=1}^d a_i^2 - b,$$
then it is clear that Theorem 13.9 yields the result by setting $\mathcal{G}$ to be the linear space spanned by
$$\phi_1(x) = \sum_{i=1}^d|x^{(i)}|^2,\ \phi_2(x) = x^{(1)},\ \ldots,\ \phi_{d+1}(x) = x^{(d)},\quad\text{and}\quad \phi_{d+2}(x) = 1. \qquad\Box$$

It follows from Theorems 13.9 and 13.5 (iii) that the class of all polytopes with a bounded number of faces has finite vc dimension. The next negative example demonstrates that this boundedness is necessary.

Theorem 13.10. If $\mathcal{A}$ is the class of all convex polygons in $\mathcal{R}^2$, then $V_{\mathcal{A}} = \infty$.

Figure 13.3. Any subset of $n$ points on the unit circle can be picked by a convex polygon.


Proof. Let $x_1,\ldots,x_n\in\mathcal{R}^2$ lie on the unit circle. Then it is easy to see that for any subset of these (different) points there is a polygon that picks that subset. □

13.3 Linear and Generalized Linear Discrimination Rules

Recall from Chapter 4 that a linear classification rule classifies $x$ into one of the two classes according to whether
$$a_0 + \sum_{i=1}^d a_i x^{(i)}$$
is positive or negative, where $x^{(1)},\ldots,x^{(d)}$ denote the components of $x\in\mathcal{R}^d$. The coefficients $a_i$ are determined by the training sequence. These decisions dichotomize the space $\mathcal{R}^d$ by virtue of a halfspace, and assign class 1 to one halfspace, and class 0 to the other. Points on the border are treated as belonging to class 0. Consider a classifier that adjusts the coefficients by minimizing the number of errors committed on $D_n$. In the terminology of Chapter 12, $\mathcal{C}$ is the collection of all linear classifiers.

Figure 13.4. An empirically optimal linear classifier.

Glick (1976) pointed out that for the error probability $L(\phi_n^*)$ of this classifier, $L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) \to 0$ almost surely. However, from Theorems 12.6 and 13.1 and Corollary 13.1, we can now provide more details:

Theorem 13.11. For all $n$ and $\epsilon > 0$, the error probability $L(\phi_n^*)$ of the empirically optimal linear classifier satisfies
$$\mathbf{P}\left\{L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le 8n^{d+1}e^{-n\epsilon^2/128}.$$
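The quantity bounded here is easy to study by simulation in one dimension, where linear classifiers are just thresholds with an orientation. The sketch below is an illustration only; the distribution (with $\eta(x) = 0.3$ for $x < 1/2$ and $0.7$ otherwise, $X$ uniform on $[0,1]$) is a hypothetical choice, and the code estimates $\mathbf{E}\{L(\phi_n^*)\} - \inf_\phi L(\phi)$ for empirical error minimization, printing $\sqrt{\log n/n}$ for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = np.linspace(0.0, 1.0, 10_001)

def eta(x):                         # hypothetical a posteriori probability P{Y=1 | X=x}
    return np.where(x < 0.5, 0.3, 0.7)

def true_risk(t, s):
    """L(phi) for phi(x) = I{s*(x-t) >= 0}, X uniform on [0,1] (numerical integration)."""
    pred = (s * (GRID - t) >= 0).astype(float)
    return float(np.mean(pred * (1.0 - eta(GRID)) + (1.0 - pred) * eta(GRID)))

def erm(x, y):
    """Exhaustive empirical error minimization over thresholds t and orientations s."""
    cand = np.concatenate(([x.min() - 1.0], np.sort(x)))
    best_t, best_s, best_err = 0.0, 1, np.inf
    for s in (1, -1):
        for t in cand:
            err = np.mean((s * (x - t) >= 0).astype(int) != y)
            if err < best_err:
                best_t, best_s, best_err = t, s, err
    return best_t, best_s

L_opt = min(true_risk(t, s) for t in np.linspace(0, 1, 201) for s in (1, -1))
for n in [50, 200, 800]:
    gaps = []
    for _ in range(100):
        x = rng.uniform(size=n)
        y = (rng.uniform(size=n) < eta(x)).astype(int)
        t, s = erm(x, y)
        gaps.append(true_risk(t, s) - L_opt)
    print(n, round(float(np.mean(gaps)), 4), "  sqrt(log n / n) =", round(np.sqrt(np.log(n) / n), 4))
```

The observed gaps are well below the distribution-free rate, which is consistent with the bound being a worst-case guarantee.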


Comparing the above inequality with Theorem 4.5, note that there we selected $\phi_n^*$ by a specific algorithm, while this result holds for any linear classifier whose empirical error is minimal.

Generalized linear classification rules (see Duda and Hart (1973)) are defined by
$$g_n(x) = \begin{cases}0 & \text{if } a_0 + \sum_{i=1}^{d^*} a_i\psi_i(x) \ge 0,\\ 1 & \text{otherwise,}\end{cases}$$
where $d^*$ is a positive integer, the functions $\psi_1,\ldots,\psi_{d^*}$ are fixed, and the coefficients $a_0,\ldots,a_{d^*}$ are functions of the data $D_n$. These include, for example, all quadratic discrimination rules in $\mathcal{R}^d$ when we choose all functions that are either components of $x$, or squares of components of $x$, or products of two components of $x$. That is, the functions $\psi_i(x)$ are of the form either $x^{(j)}$ or $x^{(j)}x^{(k)}$. In all, $d^* = 2d + d(d-1)/2$. The argument used for linear discriminants remains valid, and we obtain

Theorem 13.12. Let $\mathcal{C}$ be the class of generalized linear discriminants (i.e., the coefficients vary, the basis functions $\psi_i$ are fixed). For the error probability $L(\phi_n^*)$ of the empirically optimal classifier, for all $d^* > 1$, $n$, and $\epsilon > 0$, we have
$$\mathbf{P}\left\{L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le 8n^{d^*+1}e^{-n\epsilon^2/128}.$$
Also, for $n > 2d^* + 1$,
$$\mathbf{P}\left\{L(\phi_n^*) - \inf_{\phi\in\mathcal{C}} L(\phi) > \epsilon\right\} \le 8e^{nH\left(\frac{d^*+1}{n}\right)}e^{-n\epsilon^2/128}.$$

The second inequality is obtained by using the bound of Theorem 13.4 for the shatter coefficients. Note nevertheless that unless $d^*$ (and therefore $\mathcal{C}$) is allowed to increase with $n$, there is no hope of obtaining universal consistency. The question of universal consistency will be addressed in Chapters 17 and 18.

13.4 Convex Sets and Monotone Layers

Classes of infinite vc dimension are not hopeless by any means. In this section, we offer examples that will show how they may be useful in pattern recognition. The classes of interest to us for now are
$$\mathcal{C} = \left\{\text{all convex sets of } \mathcal{R}^2\right\}$$
and
$$\mathcal{L} = \left\{\text{all monotone layers of } \mathcal{R}^2,\ \text{i.e., all sets of the form } \{(x_1,x_2): x_2\le\psi(x_1)\} \text{ for some nonincreasing function } \psi\right\}.$$


In discrimination, this corresponds to making decisions of the form $\phi(x) = I_{\{x\in C\}}$, $C\in\mathcal{C}$, or $\phi(x) = I_{\{x\notin C\}}$, $C\in\mathcal{C}$, and similarly for $\mathcal{L}$. Decisions of these forms are important in many situations. For example, if $\eta(x)$ is monotone decreasing in both components of $x\in\mathcal{R}^2$, then the Bayes rule is of the form $g^*(x) = I_{\{x\in L\}}$ for some $L\in\mathcal{L}$. We have pointed out elsewhere (Theorem 13.10, and Problem 13.19) that $V_{\mathcal{C}} = V_{\mathcal{L}} = \infty$. To see this, note that any set of $n$ points on the unit circle is shattered by $\mathcal{C}$, while any set of $n$ points on the antidiagonal $x_2 = -x_1$ is shattered by $\mathcal{L}$. Nevertheless, shattering becomes unlikely if $X$ has a density. Our starting point here is the bound obtained while proving Theorem 12.5:
$$\mathbf{P}\left\{\sup_{A\in\mathcal{A}}|\mu_n(A) - \mu(A)| > \epsilon\right\} \le 8\,\mathbf{E}\left\{N_{\mathcal{A}}(X_1,\ldots,X_n)\right\}e^{-n\epsilon^2/32},$$
where $N_{\mathcal{A}}(X_1,\ldots,X_n)$ is the number of sets in $\{A\cap\{X_1,\ldots,X_n\}:\ A\in\mathcal{A}\}$. The following theorem is essential:

Theorem 13.13. If $X$ has a density $f$ on $\mathcal{R}^2$, then
$$\mathbf{E}\left\{N_{\mathcal{A}}(X_1,\ldots,X_n)\right\} = 2^{o(n)}$$
when $\mathcal{A}$ is either $\mathcal{C}$ or $\mathcal{L}$.

This theorem, a proof of which is a must for the reader, implies the following:

Corollary 13.3. Let $X$ have a density $f$ on $\mathcal{R}^2$. Let $\phi_n^*$ be picked by minimizing the empirical error over all classifiers of the form
$$\phi(x) = \begin{cases}1 & \text{if } x\in A,\\ 0 & \text{if } x\notin A,\end{cases}$$
where $A$ or $A^c$ is in $\mathcal{C}$ (or $\mathcal{L}$). Then
$$L(\phi_n^*) \to \inf_{\phi=I_C \text{ for } C \text{ or } C^c \text{ in } \mathcal{C}} L(\phi)$$
with probability one (and similarly for $\mathcal{L}$).

Proof. This follows from the inequality of Lemma 8.2, Theorem 13.13, and the Borel-Cantelli lemma. □

Remark. Theorem 13.13 and the corollary may be extended to $\mathcal{R}^d$, but this generalization holds nothing new and will only result in tedious notations. □

Proof of Theorem 13.13. We show the theorem for $\mathcal{L}$, and indicate the proof for $\mathcal{C}$. Take two sequences of integers, $m$ and $r$, where $m\sim\sqrt{n}$ and $r\sim$


$m^{1/3}$, so that $m\to\infty$, yet $r^2/m\to 0$, as $n\to\infty$. Consider the set $[-r,r]^2$ and partition each side into $m$ equal intervals, thus obtaining an $m\times m$ grid of square cells $C_1,\ldots,C_{m^2}$. Denote $C_0 = \mathcal{R}^2 - [-r,r]^2$. Let $N_0,N_1,\ldots,N_{m^2}$ be the number of points among $X_1,\ldots,X_n$ that belong to these cells. The vector $(N_0,N_1,\ldots,N_{m^2})$ is clearly multinomially distributed.

Let $\psi$ be a nonincreasing function $\mathcal{R}\to\mathcal{R}$ defining a set in $\mathcal{L}$ by $L = \{(x_1,x_2): x_2\le\psi(x_1)\}$. Let $C(\psi)$ be the collection of all cell sets cut by $\psi$, that is, all cells with a nonempty intersection with both $L$ and $L^c$. The collection $C(\psi)$ is shaded in Figure 13.5.

Figure 13.5. A monotone layer and bordering cells.

We bound $N_{\mathcal{L}}(X_1,\ldots,X_n)$ from above, conservatively, as follows:
$$\sum_{\text{all possible } C(\psi)}\ \left(\prod_{C_i\in C(\psi)} 2^{N_i}\right) 2^{N_0}. \qquad (13.1)$$
The number of different collections $C(\psi)$ cannot exceed $2^{2m}$ because each cell in $C(\psi)$ may be obtained from its predecessor cell by either moving right on the same row or moving down one cell in the same column. For a particular collection, denoting $p_i = \mathbf{P}\{X\in C_i\}$, we have
$$\mathbf{E}\left\{\left(\prod_{C_i\in C(\psi)} 2^{N_i}\right)2^{N_0}\right\}
= \left(\sum_{C_i\in C(\psi)} 2p_i + 2p_0 + 1 - \sum_{C_i\in C(\psi)} p_i - p_0\right)^n$$
(by applying Lemma A.7)
$$\le \exp\left(n\left(\sum_{C_i\in C(\psi)} p_i + p_0\right)\right)$$


$$\le \exp\left(n\left(\sup_{\text{all sets } A \text{ with } \lambda(A)\le 8r^2/m}\int_A f + \int_{C_0} f\right)\right)$$
$$\left(\text{since } \lambda\left(\bigcup_{i:C_i\in C(\psi)} C_i\right) \le 2m(2r/m)^2 = 8r^2/m\right)$$
$$= e^{o(n)},$$
because $r^2/m\to 0$ and $r\to\infty$ and by the absolute continuity of $X$. As this estimate is uniform over all collections $C(\psi)$, we see that the expected value of (13.1) is not more than $2^{2m}e^{o(n)} = e^{o(n)}$. The argument for the collection $\mathcal{C}$ of convex sets is analogous. □

Remark. The theorem implies that if $\mathcal{A}$ is the class of all convex sets, then $\sup_{A\in\mathcal{A}}|\mu_n(A) - \mu(A)|\to 0$ with probability one whenever $\mu$ has a density. This is a special case of a result of Ranga Rao (1962). Assume now that the density $f$ of $X$ is bounded and of bounded support. Then in the proof above we may take $r$ fixed so that $[-r,r]^2$ contains the support of $f$. Then the estimate of Theorem 13.13 is
$$\mathbf{E}\left\{N_{\mathcal{A}}(X_1,\ldots,X_n)\right\} \le 2^{2m}\cdot e^{8n\|f\|_\infty r^2/m} \quad (\text{where } \|f\|_\infty \text{ denotes the bound on } f)$$
$$= 2^{4r\sqrt{n\|f\|_\infty}}\,e^{4r\sqrt{n\|f\|_\infty}} \quad (\text{if we take } m = 2r\sqrt{n\|f\|_\infty})$$
$$\le e^{\alpha\sqrt{n}}$$
for a constant $\alpha$. This implies by Corollary 12.1 that
$$\mathbf{E}\left\{L(\phi_n^*)\right\} - \inf_{\phi=I_C \text{ for } C \text{ or } C^c \text{ in }\mathcal{C}} L(\phi) = O\left(n^{-1/4}\right).$$
This latter inequality was proved by Steele (1975). To see that it cannot be extended to arbitrary densities, observe that the data points falling on the convex hull (or upper layer) of the points $X_1,\ldots,X_n$ can always be shattered by convex sets (or monotone layers, respectively). Thus, $N_{\mathcal{A}}(X_1,\ldots,X_n)$ is at least $2^{M_n}$, where $M_n$ is the number of points among $X_1,\ldots,X_n$ falling on the convex hull (or upper layer) of $X_1,\ldots,X_n$. Thus,
$$\mathbf{E}\left\{N_{\mathcal{A}}(X_1,\ldots,X_n)\right\} \ge \mathbf{E}\left\{2^{M_n}\right\} \ge 2^{\mathbf{E}M_n}.$$
But it follows from results of Carnal (1970) and Devroye (1991b) that for each $a < 1$, there exists a density such that
$$\limsup_{n\to\infty}\frac{\mathbf{E}M_n}{n^a} > 1. \qquad\Box$$
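The number of convex hull points $M_n$ is thus what drives the shattering behavior. For some densities it can grow like $n^a$, while for a uniform density on a square it grows only logarithmically. The sketch below is an illustration only (uniform data on the unit square); it estimates $\mathbf{E}M_n$ by simulation using a standard monotone-chain convex hull.

```python
import numpy as np

rng = np.random.default_rng(2)

def hull_size(points):
    """Number of points on the convex hull (Andrew's monotone chain)."""
    pts = sorted(map(tuple, points))
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = half(pts), half(reversed(pts))
    return len(set(lower[:-1] + upper[:-1]))

for n in [100, 1000, 10000]:
    m = np.mean([hull_size(rng.uniform(size=(n, 2))) for _ in range(20)])
    print(n, round(float(m), 1))   # grows only logarithmically for uniform data
```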


The important point of this interlude is that with infinite vc dimension, we may under some circumstances get expected error rates that are $o(1)$ but larger than $1/\sqrt{n}$. However, the bounds are sometimes rather loose. The reason is the looseness of the Vapnik-Chervonenkis inequality when the collections $\mathcal{A}$ become very big. To get such results for classes with infinite vc dimension it is necessary to impose some conditions on the distribution. We will prove this in Chapter 14.

Problems and Exercises

Problem 13.1. Show that the inequality of Theorem 13.2 is tight, that is, exhibit a class $\mathcal{A}$ of sets such that for each $n$, $s(\mathcal{A},n) = \sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i}$.

Problem 13.2. Show that for all $n > 2V_{\mathcal{A}}$,
$$\sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} \le 2\left(\frac{n^{V_{\mathcal{A}}}}{V_{\mathcal{A}}!}\right) \le \left(\frac{en}{V_{\mathcal{A}}}\right)^{V_{\mathcal{A}}}$$
and that
$$\sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} \le n^{V_{\mathcal{A}}}$$
if $V_{\mathcal{A}} > 2$. Hint: There are several ways to prove the statements. One can proceed directly by using the recurrence $\sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} = \sum_{i=0}^{V_{\mathcal{A}}}\binom{n-1}{i} + \sum_{i=0}^{V_{\mathcal{A}}-1}\binom{n-1}{i}$. A simpler way to prove $\sum_{i=0}^{V_{\mathcal{A}}}\binom{n}{i} \le \left(\frac{en}{V_{\mathcal{A}}}\right)^{V_{\mathcal{A}}}$

is by using Theorem 13.4. The third inequality is an immediate consequence of the first two.

Problem 13.3. Give an alternative proof of Lemma 13.1 by completing the following probabilistic argument. Observe that for $k' = n - k$,
$$\sum_{i=0}^k\binom{n}{i} = \sum_{i=k'}^n\binom{n}{i} = 2^n\,\mathbf{P}\{B_n \ge k'\},$$
where $B_n$ is a binomial random variable with parameters $n$ and $1/2$. Then Chernoff's bounding technique (see the proof of Theorem 8.1) may be used to bound this probability: for all $s > 0$,
$$\mathbf{P}\{B_n\ge k'\} \le e^{-sk'}\,\mathbf{E}\left\{e^{sB_n}\right\} = \exp\left(-n\left(sk'/n - \log\left(\frac{e^s+1}{2}\right)\right)\right).$$
Take the derivative of the exponent with respect to $s$ to minimize the upper bound. Substitute the obtained value into the bound to get the desired inequality.

Problem 13.4. Let $B$ be a binomial random variable with parameters $n$ and $p$. Prove that for $k > np$,
$$\mathbf{P}\{B\ge k\} \le \exp\left(n\left(H(k/n) + \frac{k}{n}\log p + \frac{n-k}{n}\log(1-p)\right)\right).$$
Hint: Use Chernoff's bounding technique as in the previous problem.


Problem 13.5. Prove part (i) of Theorem 13.8.

Problem 13.6. Prove that for the class of sets defined in the remark following Theorem 13.9,
$$s(\mathcal{A},n) = 2\sum_{i=0}^{r-1}\binom{n-1}{i}$$
(Cover (1965)). Hint: Proceed by induction with respect to $n$ and $r$. In particular, show that the recurrence $s(\mathcal{A}_r,n) = s(\mathcal{A}_r,n-1) + s(\mathcal{A}_{r-1},n-1)$ holds, where $\mathcal{A}_r$ denotes the class of sets defined as $\{x: g(x)\ge 0\}$ where $g$ runs through a vector space spanned by the first $r$ of the sequence of functions $\psi_1,\psi_2,\ldots$.

Problem 13.7. Let $\mathcal{A}$ and $\mathcal{B}$ be two families of subsets of $\mathcal{R}^d$. Assume that for some $r\ge 2$,
$$s(\mathcal{A},n) \le n^r s(\mathcal{B},n)$$
for all $n\ge 1$. Show that if $V_{\mathcal{B}} > 2$, then $V_{\mathcal{A}} \le 2(V_{\mathcal{B}}+r-1)\log(V_{\mathcal{B}}+r-1)$. Hint: By Theorem 13.3, $s(\mathcal{A},n) \le n^{V_{\mathcal{B}}+r-1}$. Clearly, $V_{\mathcal{A}}$ is not larger than any $k$ for which $k^{V_{\mathcal{B}}+r-1} < 2^k$.

Problem 13.8. Determine the vc dimension of the class of subsets of the real line such that each set in the class can be written as a union of $k$ intervals.

Problem 13.9. Determine the vc dimension of the collection of all polygons with $k$ vertices in the plane.

Problem 13.10. What is the vc dimension of the collection of all ellipsoids of $\mathcal{R}^d$?

Problem 13.11. Determine the vc dimension of the collection of all subsets of $\{1,\ldots,k\}^d$, where $k$ and $d$ are fixed. How does the answer change if we restrict the subsets to those of cardinality $l \le k^d$?

Problem 13.12. Let $\mathcal{A}$ consist of all simplices of $\mathcal{R}^d$, that is, all sets of the form
$$\left\{x:\ x = \sum_{i=1}^{d+1}\lambda_i x_i\ \text{ for some } \lambda_1,\ldots,\lambda_{d+1}:\ \sum_{i=1}^{d+1}\lambda_i = 1,\ \lambda_i\ge 0\right\},$$
where $x_1,\ldots,x_{d+1}$ are fixed points of $\mathcal{R}^d$. Determine the vc dimension of $\mathcal{A}$.

Problem 13.13. Let $A(x_1,\ldots,x_k)$ be the set of all $x\in\mathcal{R}$ that are of the form
$$x = \psi_1(i)x_1 + \psi_2(i)x_2 + \cdots + \psi_k(i)x_k,\quad i = 1,2,\ldots,$$
where $x_1,\ldots,x_k$ are fixed numbers and $\psi_1,\ldots,\psi_k$ are fixed functions on the integers. Let $\mathcal{A} = \{A(x_1,\ldots,x_k):\ x_1,\ldots,x_k\in\mathcal{R}\}$. Determine the vc dimension of $\mathcal{A}$.

Problem 13.14. In some sense, vc dimension measures the "size" of a class of sets. However, it has little to do with cardinalities, as this exercise demonstrates. Exhibit a class of subsets of the integers with uncountably many sets, yet vc dimension 1. (This property was pointed out to us by Andras Farago.) Note:


On the other hand, the class of all subsets of the integers cannot be written as a countable union of classes with finite vc dimension (Theorem 18.6). Hint: Find a class of subsets of the reals with the desired properties, and make a proper correspondence between sets of integers and sets in the class.

Problem 13.15. Show that if a class of sets $\mathcal{A}$ is linearly ordered by inclusion, that is, for any pair of sets $A,B\in\mathcal{A}$ either $A\subset B$ or $B\subset A$, and $|\mathcal{A}|\ge 2$, then $V_{\mathcal{A}} = 1$. Conversely, assume that $V_{\mathcal{A}} = 1$ and for every set $B$ with $|B| = 2$,
$$\{\emptyset, B\}\subset\{A\cap B:\ A\in\mathcal{A}\}.$$
Prove that then $\mathcal{A}$ is linearly ordered by inclusion (Dudley (1984)).

Problem 13.16. We say that four sets A,B,C,D form a diamond if A ⊂ B ⊂ C,A ⊂ D ⊂ C, but B 6⊂ D and D 6⊂ C. Let A be a class of sets. Show that VA ≥ 2if and only if for some set R, the class A ∩ R : A ∈ A includes a diamond(Dudley (1984)).

Problem 13.17. Let A be a class of sets, and define its density by

DA = inf

r > 0 : sup

n≥1

s(A, n)

nr<∞

.

Verify the following properties:(1) DA ≤ VA;(2) For each positive integer k, there exists a classA of sets such that VA = k,

yet DA = 0; and(3) DA <∞ if and only if VA <∞

(Assouad (1983a)).

Problem 13.18. Continued. Let A and A′ be classes of sets, and define B =A ∪A′. Show that

(1) DB = max (DA, DA′);(2) VB ≤ VA + VA′ + 1; and(3) For every pair of positive integers k,m there exist classes A and A′ such

that VA = k, VA′ = m, and VB = k +m+ 1(Assouad (1983a)).

Problem 13.19. A set A ⊂ Rd is called a monotone layer if x ∈ A impliesthat y ∈ A for all y ≤ x (i.e., each component of x is not larger than thecorresponding component of y). Show that the class of all monotone layers hasinfinite vc dimension.

Problem 13.20. Let (X,Y ) be a pair of random variables in R2 × 0, 1 suchthat Y = IX∈A, where A is a convex set. Let Dn = ((X1, Y1), . . . , (Xn, Yn)) bean i.i.d. training sequence, and consider the classifier

gn(x) =

1 if x is in the convex hull of the Xi’s with Y = 10 otherwise.

Find a distribution for which gn is not consistent, and find conditions for consis-tency. Hint: Recall Theorem 13.10 and its proof.

Page 249: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 248Printer: Opaque this

14Lower Bounds for EmpiricalClassifier Selection

In Chapter 12 a classifier was selected by minimizing the empirical errorover a class of classifiers C. With the help of the Vapnik-Chervonenkis the-ory we have been able to obtain distribution-free performance guaranteesfor the selected rule. For example, it was shown that the difference betweenthe expected error probability of the selected rule and the best error prob-ability in the class behaves at least as well as O(

√VC log n/n), where VC is

the Vapnik-Chervonenkis dimension of C, and n is the size of the trainingdata Dn. (This upper bound is obtained from Theorem 12.5. Corollary 12.5may be used to replace the log n term with log VC .) Two questions arise im-mediately: Are these upper bounds (at least up to the order of magnitude)tight? Is there a much better way of selecting a classifier than minimizingthe empirical error? This chapter attempts to answer these questions. Asit turns out, the answer is essentially affirmative for the first question, andnegative for the second.

These questions were also asked in the learning theory setup, where itis usually assumed that the error probability of the best classifier in theclass is zero (see Blumer, Ehrenfeucht, Haussler, and Warmuth (1989),Haussler, Littlestone, and Warmuth (1988), and Ehrenfeucht, Haussler,Kearns, and Valiant (1989)). In this case, as the bound of Theorem 12.7implies, the error of the rule selected by minimizing the empirical error iswithin O(VC log n/n) of that of the best in the class (which equals zero, byassumption). We will see that essentially there is no way to beat this upperbound either.

Page 250: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.1 Minimax Lower Bounds 249

14.1 Minimax Lower Bounds

Let us formulate exactly what we are interested in. Let C be a class of deci-sion functions φ : Rd → 0, 1. The training sequence Dn = ((X1, Y1), . . .,(Xn, Yn)) is used to select the classifier gn(X) = gn(X,Dn) from C, wherethe selection is based on the data Dn. We emphasize here that gn can be anarbitrary function of the data, we do not restrict our attention to empiricalerror minimization, where gn is a classifier in C that minimizes the numbererrors committed on the data Dn.

As before, we measure the performance of the selected classifier by thedifference between the error probability L(gn) = Pgn(X) 6= Y |Dn of theselected classifier and that of the best in the class. To save space furtheron, denote this optimum by

LCdef= inf

φ∈CPφ(X) 6= Y .

In particular, we seek lower bounds for

supPL(gn)− LC > ε,

and

supEL(gn)− LC ,

where the supremum is taken over all possible distributions of the pair(X,Y ). A lower bound for one of these quantities means that no matterwhat our method of picking a rule from C is, we may face a distributionsuch that our method performs worse than the bound. This view may becriticized as too pessimistic. However, it is clearly a perfectly meaningfulquestion to pursue, as typically we have no other information available thanthe training data, so we have to be prepared for the worst situation.

Actually, we investigate a stronger problem, in that the supremum istaken over all distributions with LC kept at a fixed value between zero and1/2. We will see that the bounds depend on n, VC , and LC jointly. As itturns out, the situations for LC > 0 and LC = 0 are quite different. Becauseof its relative simplicity, we first treat the case LC = 0. All the proofs arebased on a technique called “the probabilistic method.” The basic idea hereis that the existence of a “bad” distribution is proved by considering a largeclass of distributions, and bounding the average behavior over the class.

Lower bounds on the probabilities PL(gn) − LC > ε may be trans-lated into lower bounds on the sample complexity N(ε, δ). We obtain lowerbounds for the size of the training sequence such that for any classifier,PL(gn) − LC > ε cannot be smaller than δ for all distributions if n issmaller than this bound.

Page 251: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.2 The Case LC = 0 250

14.2 The Case LC = 0

In this section we obtain lower bounds under the assumption that the bestclassifier in the class has zero error probability. In view of Theorem 12.7 wesee that the situation here is different from when LC > 0; there exist meth-ods of picking a classifier from C (e.g., minimization of the empirical error)such that the error probability decreases to zero at a rate of O(VC log n/n).We obtain minimax lower bounds close to the upper bounds obtained forempirical error minimization. For example, Theorem 14.1 shows that ifLC = 0, then the expected error probability cannot decrease faster than asequence proportional to VC/n for some distributions.

Theorem 14.1. (Vapnik and Chervonenkis (1974c); Haussler, Lit-tlestone, and Warmuth (1988)). Let C be a class of discriminationfunctions with vc dimension V . Let X be the set of all random variables(X,Y ) for which LC = 0. Then, for every discrimination rule gn basedupon X1, Y1, . . . , Xn, Yn, and n ≥ V − 1,

sup(X,Y )∈X

ELn ≥V − 12en

(1− 1

n

).

Proof. The idea is to construct a family F of 2V−1 distributions withinthe distributions with LC = 0 as follows: first find points x1, . . . , xV thatare shattered by C. Each distribution in F is concentrated on the set ofthese points. A member in F is described by V − 1 bits, b1, . . . , bV−1. Forconvenience, this is represented as a bit vector b. Assume V − 1 ≤ n. Fora particular bit vector, we let X = xi (i < V ) with probability 1/n each,while X = xV with probability 1− (V − 1)/n. Then set Y = fb(X), wherefb is defined as follows:

fb(x) =bi if x = xi, i < V0 if x = xV .

Note that since Y is a function of X, we must have L∗ = 0. Also, LC = 0,as the set x1, . . . , xV is shattered by C, i.e., there is a g ∈ C with g(xi) =fb(xi) for 1 ≤ i ≤ V . Clearly,

sup(X,Y ):LC=0

ELn − LC

≥ sup(X,Y )∈F

ELn − LC

= supb

ELn − LC

≥ ELn − LC(where b is replaced by B, uniformly distributed over 0, 1V−1)

Page 252: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.2 The Case LC = 0 251

= ELn ,= Pgn(X,X1, Y1, . . . , Xn, Yn) 6= fB(X) .

The last probability may be viewed as the error probability of the decisionfunction gn : Rd×(Rd×0, 1)n → 0, 1 in predicting the value of the ran-dom variable fB(X) based on the observation Zn = (X,X1, Y1, . . . , Xn, Yn).Naturally, this probability is bounded from below by the Bayes probabilityof error

L∗(Zn, fB(X)) = infgn

Pgn(Zn) 6= fB(X)

corresponding to the decision problem (Zn, fB(X)). By the results of Chap-ter 2,

L∗(Zn, fB(X)) = E min(η∗(Zn), 1− η∗(Zn)) ,where η∗(Zn) = PfB(X) = 1|Zn. Observe that

η∗(Zn) =

1/2 if X 6= X1, . . . , X 6= Xn, X 6= xV0 or 1 otherwise.

Thus, we see that

sup(X,Y ):LC=0

ELn − LC ≥ L∗(Zn, fB(X))

=12PX 6= X1, . . . , X 6= Xn, X 6= xV

=12

V−1∑i=1

PX = xi(1−PX = xi)n

=V − 1

2n(1− 1/n)n

≥ V − 12en

(1− 1

n

)(since (1− 1/n)n−1 ↓ 1/e).

This concludes the proof. 2

Minimax lower bounds on the probability PLn ≥ ε can also be ob-tained. These bounds have evolved through several papers: see Ehrenfeucht,Haussler, Kearns and Valiant (1989); and Blumer, Ehrenfeucht, Hausslerand Warmuth (1989). The tightest bounds we are aware of thus far aregiven by the next theorem.

Theorem 14.2. (Devroye and Lugosi (1995)). Let C be a class ofdiscrimination functions with vc dimension V ≥ 2. Let X be the set of allrandom variables (X,Y ) for which LC = 0. Assume ε ≤ 1/4. Assume n ≥V −1. Then for every discrimination rule gn based upon X1, Y1, . . . , Xn, Yn,

sup(X,Y )∈X

PLn ≥ ε ≥ 1e√πV

(2neεV − 1

)(V−1)/2

e−4nε/(1−4ε) .

Page 253: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.2 The Case LC = 0 252

If on the other hand n ≤ (V − 1)/(4ε), then

sup(X,Y )∈X

PLn ≥ ε ≥ 12.

Proof. We randomize as in the proof of Theorem 14.1. The difference nowis that we pick x1, . . . , xV−1 with probability p each. Thus, PX = xV =1− p(V − 1). We inherit the notation from the proof of Theorem 14.1. Fora fixed b, denote the error probability by

Ln(b) = Pgn(X,X1, fb(X1), . . . , Xn, fb(Xn)) 6= fb(X)|X1, . . . , Xn.

We now randomize and replace b by B. Clearly,

sup(X,Y ):LC=0

PLn ≥ ε ≥ supb

PLn(b) ≥ ε

≥ E PLn(B) ≥ ε|B= PLn(B) ≥ ε.

As in the proof of Theorem 14.1, observe that Ln(B) cannot be smallerthan the Bayes risk corresponding to the decision problem (Zn, fB(X)),where

Zn = (X,X1, fB(X1), . . . , Xn, fB(Xn)).

Thus,Ln(B) ≥ E min(η∗(Zn), 1− η∗(Zn))|X1, . . . , Xn .

As in the proof of Theorem 14.1, we see that

E min(η∗(Zn), 1− η∗(Zn))|X1, . . . , Xn

=12PX 6= X1, . . . , X 6= Xn|X1, . . . , Xn

≥ 12pV−1∑i=1

Ixi 6=X1,...,xi 6=Xn.

For fixed X1, . . . , Xn, we denote by J the collection j : 1 ≤ j ≤ V −1,∩ni=1Xi 6= xj. This is the collection of empty cells xi. We summarize:

sup(X,Y ):LC=0

PLn ≥ ε ≥ P

12pV−1∑i=1

Ixi 6=X1,...,xi 6=Xn ≥ ε

≥ P|J | ≥ 2ε/p .

We consider two choices for p:

Choice A. Take p = 1/n, and assume 4nε ≤ V − 1, ε < 1/2. Notethat E|J | = n(V − 1)p = V − 1. Also, since 0 ≤ |J | ≤ V − 1, we have

Page 254: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.2 The Case LC = 0 253

Var |J | ≤ (V − 1)2/4. By the Chebyshev-Cantelli inequality (TheoremA.17),

P|J | ≥ 2nε = (1−P|J | < 2nε)≥ (1−P|J | < (V − 1)/2)= (1−P|J | −E|J | ≤ −(V − 1)/2)

≥(

1− Var |J |Var |J |+ (V − 1)2/4

)≥

(1− (V − 1)2

(V − 1)2 + (V − 1)2

)=

12.

This proves the second inequality for supPLn ≥ ε.

Choice B. Assume that ε ≤ 1/4. By the pigeonhole principle, |J | ≥ 2ε/pif the number of points Xi, 1 ≤ i ≤ n, that are not equal to xV does notexceed V − 1− 2ε/p. Therefore, we have a further lower bound:

P|J | ≥ 2ε/p ≥ PBinomial(n, (V − 1)p) ≤ V − 1− 2ε/p .

Define v = d(V − 1)/2e. Take p = 2ε/v. By assumption, n ≥ 2v − 1. Thenthe lower bound is

PBinomial(n, 4ε) ≤ v

≥(n

v

)(4ε)v(1− 4ε)n−v

≥ 1e√

2πv

(4eε(n− v + 1)v(1− 4ε)

)v(1− 4ε)n

(since(nv

)≥(

(n−v+1)ev

)v1

e√

2πvby Stirling’s formula)

≥ 1e√

2πv

(4eε(n− v + 1)

v

)v(1− 4ε)n

≥ 1e√

2πv

(4enεv

)v(1− 4ε)n

(1− v − 1

n

)v≥ 1

e√

2πv

(2enεv

)ve−4nε/(1−4ε) (since n ≥ 2(v − 1))

(use 1− x ≥ exp (−x/(1− x))).

This concludes the proof. 2

Page 255: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 254

14.3 Classes with Infinite VC Dimension

The results presented in the previous section may also be applied to classeswith infinite vc dimension. For example, it is not hard to derive the fol-lowing result from Theorem 14.1:

Theorem 14.3. Assume that VC = ∞. For every n, δ > 0 and classifica-tion rule gn, there is a distribution with LC = 0 such that

EL(gn) >12e− δ.

For the proof, see Problem 14.2. Thus, when VC = ∞, distribution-free nontrivial performance guarantees for L(gn) − LC do not exist. Thisgeneralizes Theorem 7.1, where a similar result is shown if C is the classof all measurable discrimination functions. We have also seen in Theorem7.2, that if C is the class of all measurable classifiers, then no universal rateof convergence exists. However, we will see in Chapter 18 that for someclasses with infinite vc dimension, it is possible to find a classificationrule such that L(gn)− L∗ ≤ c

√log n/n for any distribution such that the

Bayes classifier is in C. The constant c, however, necessarily depends on thedistribution, as is apparent from Theorem 14.3.

Infinite vc dimension means that the class C shatters finite sets of anysize. On the other hand, if C shatters infinitely many points, then no uni-versal rate of convergence exists. This may be seen by an appropriate mod-ification of Theorem 7.2, as follows. See Problem 14.3 for the proof.

Theorem 14.4. Let an be a sequence of positive numbers converging tozero with 1/16 ≥ a1 ≥ a2 ≥ . . .. Let C be a class of classifiers with theproperty that there exists a set A ⊂ Rd of infinite cardinality such that forany subset B of A, there exists φ ∈ C such that φ(x) = 1 if x ∈ B andφ(x) = 0 if x ∈ A−B. Then for every sequence of classification rules, thereexists a distribution of (X,Y ) with LC = 0, such that

ELn ≥ an

for all n.

Note that the basic difference between this result and all others in thischapter is that in Theorem 14.4 the “bad” distribution does not vary withn. This theorem shows that selecting a classifier from a class shatteringinfinitely many points is essentially as hard as selecting one from the classof all classifiers.

14.4 The Case LC > 0

In the more general case, when the best decision in the class C has positiveerror probability, the upper bounds derived in Chapter 12 for the expected

Page 256: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 255

error probability of the classifier obtained by minimizing the empirical riskare much larger than when LC = 0. In this section we show that these upperbounds are necessarily large, and they may be tight for some distributions.Moreover, there is no other classifier that performs significantly better thanempirical error minimization.

Theorem 14.5 below gives a lower bound for sup(X,Y ):LC fixed EL(gn) −LC . As a function of n and VC , the bound decreases basically as in the upperbound obtained from Theorem 12.6. Interestingly, the lower bound becomessmaller as LC decreases, as should be expected. The bound is largest whenLC is close to 1/2. The constants in the bound may be tightened at theexpense of more complicated expressions. The theorem is essentially due toDevroye and Lugosi (1995), though the proof given here is different. Similarbounds—without making the dependence on LC explicit—have been provedby Vapnik and Chervonenkis (1974c) and Simon (1993).

Theorem 14.5. Let C be a class of discrimination functions with vc di-mension V ≥ 2. Let X be the set of all random variables (X,Y ) for whichfor fixed L ∈ (0, 1/2),

L = infg∈C

Pg(X) 6= Y .

Then, for every discrimination rule gn based upon X1, Y1, . . . , Xn, Yn,

sup(X,Y )∈X

E(Ln − L) ≥√L(V − 1)

24ne−8 if n ≥ V−1

2L max(9, 1/(1− 2L)2).

Proof. Again we consider the finite family F from the previous section.The notation b and B is also as above. X now puts mass p at xi, i < V ,and mass 1 − (V − 1)p at xV . This imposes the condition (V − 1)p ≤ 1,which will be satisfied. Next introduce the constant c ∈ (0, 1/2). We nolonger have Y as a function of X. Instead, we have a uniform [0, 1] randomvariable U independent of X and define

Y =

1 if U ≤ 12 − c+ 2cbi, X = xi, i < V

0 otherwise.

Thus, when X = xi, i < V , Y is 1 with probability 1/2 − c or 1/2 + c. Asimple argument shows that the best rule for b is the one which sets

fb(x) =

1 if x = xi , i < V , bi = 10 otherwise.

Also, observe thatL = (V − 1)p(1/2− c) .

Page 257: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 256

Noting that |2η(xi) − 1| = c for i < V , for fixed b, by the equality inTheorem 2.2, we may write

Ln − L ≥V−1∑i=1

2pcIgn(xi,X1,Y1,...,Xn,Yn)=1−fb(xi) .

It is sometimes convenient to make the dependence of gn upon b explicitby considering gn(xi) as a function of xi, X1, . . . , Xn, U1, . . . , Un (an i.i.d.sequence of uniform [0, 1] random variables), and bi. The proof given hereis based on the ideas used in the proofs of Theorems 14.1 and 14.2. Wereplace b by a uniformly distributed random B over 0, 1V−1. After thisrandomization, denote Zn = (X,X1, Y1, . . . , Xn, Yn). Thus,

sup(X,Y )∈F

ELn − L = supb

ELn − L

≥ ELn − L (with random B)

≥V−1∑i=1

2pcEIgn(xi,X1,...,Yn)=1−fB(xi)

= 2cPgn(Zn) 6= fB(X)≥ 2cL∗(Zn, fB(X)),

where, as before, L∗(Zn, fB(X)) denotes the Bayes probability of error ofpredicting the value of fB(X) based on observing Zn. All we have to do isto find a suitable lower bound for

L∗(Zn, fB(X)) = E min(η∗(Zn), 1− η∗(Zn)) ,

where η∗(Zn) = PfB(X) = 1|Zn. Observe that

η∗(Zn) =

1/2 if X 6= X1, . . . , X 6= Xn and X 6= xVPBi = 1|Yi1 , . . . , Yik if X = Xi1 = · · · = Xik = xi, i < V .

Next we compute PBi = 1|Yi1 = y1, . . . , Yik = yk for y1, . . . , yk ∈ 0, 1.Denoting the numbers of zeros and ones by k0 = |j ≤ k : yj = 0| andk1 = |j ≤ k : yj = 1|, we see that

PBi = 1|Yi1 = y1, . . . , Yik = yk

=(1− 2c)k1(1 + 2c)k0

(1− 2c)k1(1 + 2c)k0 + (1 + 2c)k1(1− 2c)k0.

Page 258: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 257

Therefore, if X = Xi1 = · · · = Xik = xi, i < V , then

min(η∗(Zn), 1− η∗(Zn))

=min

((1− 2c)k1(1 + 2c)k0 , (1 + 2c)k1(1− 2c)k0

)(1− 2c)k1(1 + 2c)k0 + (1 + 2c)k1(1− 2c)k0

=min

(1,(

1+2c1−2c

)k1−k0)1 +

(1+2c1−2c

)k1−k0=

1

1 +(

1+2c1−2c

)|k1−k0| .In summary, denoting a = (1 + 2c)/(1− 2c), we have

L∗(Zn, fB(X)) = E

1

1 + a

∣∣∣∑j:Xj=X

(2Yj−1)

∣∣∣

≥ E

1

2a

∣∣∣∑j:Xj=X(2Yj−1)

∣∣∣

≥ 12

V−1∑i=1

PX = xiE

a−∣∣∣∑

j:Xj=xi(2Yj−1)

∣∣∣

≥ 12(V − 1)pa

−E

∣∣∣∑j:Xj=xi

(2Yj−1)

∣∣∣(by Jensen’s inequality).

Next we bound E∣∣∣∑j:Xj=xi

(2Yj − 1)∣∣∣. Clearly, if B(k, q) denotes a bi-

nomial random variable with parameters k and q,

E

∣∣∣∣∣∣∑

j:Xj=xi

(2Yj − 1)

∣∣∣∣∣∣ =

n∑k=0

(n

k

)pk(1− p)n−kE |2B(k, 1/2− c)− k| .

However, by straightforward calculation we see that

E |2B(k, 1/2− c)− k| ≤√

E (2B(k, 1/2− c)− k)2

=√k(1− 4c2) + 4k2c2

≤ 2kc+√k.

Page 259: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 258

Therefore, applying Jensen’s inequality once again, we getn∑k=0

(n

k

)pk(1− p)n−kE |2B(k, 1/2− c)− k| ≤ 2npc+

√np.

Summarizing what we have obtained so far, we have

supb

ELn − L ≥ 2cL∗(Zn, fB(X))

≥ 2c12(V − 1)pa−2npc−√np

≥ c(V − 1)pe−2npc(a−1)−(a−1)√np

(by the inequality 1 + x ≤ ex)

= c(V − 1)pe−8npc2/(1−2c)−4c√np/(1−2c).

A rough asymptotic analysis shows that the best asymptotic choice for c isgiven by

c =1√4np

.

Then the constraint L = (V − 1)p(1/2 − c) leaves us with a quadraticequation in c. Instead of solving this equation, it is more convenient to takec =

√(V − 1)/(8nL). If 2nL/(V − 1) ≥ 9, then c ≤ 1/6. With this choice

for c, using L = (V − 1)p(1/2− c), straightforward calculation provides

sup(X,Y )∈F

E(Ln − L) ≥√

(V − 1)L24n

e−8.

The condition p(V − 1) ≤ 1 implies that we need to ask that n ≥ (V −1)/(2L(1− 2L)2). This concludes the proof of Theorem 14.5. 2

Next we obtain a probabilistic bound. Its proof below is based uponHellinger distances, and its methodology is essentially due to Assouad(1983b). For refinements and applications, we refer to Birge (1983; 1986)and Devroye (1987).

Theorem 14.6. (Devroye and Lugosi (1995)). Let C be a class ofdiscrimination functions with vc dimension V ≥ 2. Let X be the set of allrandom variables (X,Y ) for which for fixed L ∈ (0, 1/4],

L = infg∈C

Pg(X) 6= Y .

Then, for every discrimination rule gn based upon X1, Y1, . . . , Xn, Yn, andany ε ≤ L,

sup(X,Y )∈X

PLn − L ≥ ε ≥ 14e−4nε2/L .

Page 260: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.4 The Case LC > 0 259

Proof. The method of randomization here is similar to that in the proofof Theorem 14.5. Using the same notation as there, it is clear that

sup(X,Y )∈X

PLn − L ≥ ε

≥ EI∑V−1

i=12pcIgn(xi,X1,...,Yn)=1−fB(xi)≥ε

= 2−(V−1)∑

(x′1

,...,x′n,y1,...,yn)

∈(x1,...,xV ×0,1)n∑b

I∑V−1

i=12pcIgn(xi;x

′1

,y1,...,x′n,yn)=1−fb(xi)≥ε

n∏j=1

pb(x′j , yj) .

First observe that ifε/(2pc) ≤ (V − 1)/2, (14.1)

then

I∑V−1

i=12pcIgn(xi;x

′1

,y1,...,x′n,yn)=1−fb(xi)≥ε

+ I∑V−1

i=12pcIgn(xi;x

′1

,y1,...,x′n,yn)=1−fbc (xi)≥ε ≥ 1 ,

where bc denotes the binary vector (1 − b1, . . . , 1 − bV−1), that is, thecomplement of b. Therefore, for ε ≤ pc(V − 1), the last expression in thelower bound above is bounded from below by

12V−1

∑(x′

1,...,x′n,y1,...,yn)

∈(x1,...,xV ×0,1)n

∑b

12

min

n∏j=1

pb(x′j , yj) ,n∏j=1

pbc(x′j , yj)

≥ 12V+1

∑b

∑(x′

1,...,x′n,y1,...,yn)

∈(x1,...,xV ×0,1)n

√√√√ n∏j=1

pb(x′j , yj) ×n∏j=1

pbc(x′j , yj)

2

(by LeCam’s inequality, Lemma 3.1)

=1

2V+1

∑b

∑(x,y)

√pb(x, y)pbc(x, y)

2n

.

It is easy to see that for x = xV ,

pb(x, y) = pbc(x, y) =1− (V − 1)p

2,

and for x = xi, i < V ,

pb(x, y)pbc(x, y) = p2

(14− c2

).

Page 261: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.5 Sample Complexity 260

Thus, we have the equality

∑(x,y)

√pb(x, y)pbc(x, y) = 1− (V − 1)p+ 2(V − 1)p

√14− c2 .

Summarizing, since L = p(V − 1)(1/2− c), we have

sup(X,Y )∈X

PLn − L ≥ ε ≥ 14

(1− L

12 − c

(1−

√1− 4c2

))2n

≥ 14

(1− L

12 − c

4c2)2n

≥ 14

exp(−16nLc2

1− 2c

/(1− 8Lc2

1− 2c

)),

where we used the inequality 1− x ≥ e−x/(1−x) again. We may choose c asε

2L+2ε . It is easy to verify that condition (14.1) holds. Also, p(V − 1) ≤ 1.From the condition L ≥ ε we deduce that c ≤ 1/4. The exponent in theexpression above may be bounded as

16nLc2

1−2c

1− 8Lc2

1−2c

=16nLc2

1− 2c− 8Lc2

=4nε2

L+ε

1− 2ε2

L+ε

(by substituting c = ε/(2L+ 2ε))

≤ 4nε2

L.

Thus,

sup(X,Y )∈X

PLn − L ≥ ε ≥ 14

exp(−4nε2/L

),

as desired. 2

14.5 Sample Complexity

We may rephrase the probability bounds above in terms of the samplecomplexity of algorithms for selecting a classifier from a class. Recall thatfor given ε, δ > 0, the sample complexity of a selection rule gn is the smallestinteger N(ε, δ) such that

supPL(gn)− LC ≥ ε ≤ δ

Page 262: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.5 Sample Complexity 261

for all n ≥ N(ε, δ). The supremum is taken over a class of distributions of(X,Y ). Here we are interested in distributions such that LC is fixed.

We start with the case LC = 0, by checking the implications of The-orem 14.2 for the sample complexity N(ε, δ). First Blumer, Ehrenfeucht,Haussler, and Warmuth (1989) showed that for any algorithm,

N(ε, δ) ≥ C

(1ε

log(

)+ VC

),

where C is a universal constant. In Ehrenfeucht, Haussler, Kearns, andValiant (1989), the lower bound was partially improved to

N(ε, δ) ≥ VC − 132ε

when ε ≤ 1/8 and δ ≤ 1/100. It may be combined with the previous bound.Theorem 14.2 provides the following bounds:

Corollary 14.1. Let C be a class of discrimination functions with vcdimension V ≥ 2. Let X be the set of all random variables (X,Y ) forwhich LC = 0. Assume ε ≤ 1/4, and denote v = d(V − 1)/2e. Then forevery discrimination rule gn based upon X1, Y1, . . . , Xn, Yn, when ε ≤ 1/8and

log(

)≥(

4ve

)(e√

2πv)1/v

,

then

N(ε, δ) ≥ 18ε

log(

).

Finally, for δ ≤ 1/2, and ε < 1/2,

N(ε, δ) ≥ V − 14ε

.

Proof. The second bound follows trivially from the second inequality ofTheorem 14.2. By the first inequality there,

sup(X,Y )∈X

PLn ≥ ε ≥ 1e√

2πv

(2neεV − 1

)(V−1)/2

e−4nε/(1−4ε)

≥ 1e√

2πv

(2enεv

)ve−8nε (since ε ≤ 1/8)

≥ (8ε)v

logv(1/δ)nve−8nε

(since we assume log(

)≥(

4ve

) (e√

2πv)1/v

).

Page 263: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

14.5 Sample Complexity 262

The function nve−8nε varies unimodally in n, and achieves a peak at n =v/(8ε). For n below this threshold, by monotonicity, we apply the boundat n = v/(8ε). It is easy to verify that the value of the bound at v/(8ε) isalways at least δ. If on the other hand, (1/8ε) log(1/δ) ≥ n ≥ v/(8ε), thelower bound achieves its minimal value at (1/8ε) log(1/δ), and the valuethere is δ. This proves the first bound. 2

Corollary 14.1 shows that for any classifier, at least

max(

18ε

log(

),VC − 1

)training samples are necessary to achieve ε accuracy with δ confidence forall distributions. Apart from a log

(1ε

)factor, the order of magnitude of

this expression is the same as that of the upper bound for empirical errorminimization, obtained in Corollary 12.4. That the upper and lower boundsare very close, has two important messages. On the one hand it gives avery good estimate for the number of training samples needed to achieve acertain performance. On the other hand it shows that there is essentiallyno better method than minimizing the empirical error probability.

In the case LC > 0, we may derive lower bounds for N(ε, δ) from Theo-rems 14.5 and 14.6:

Corollary 14.2. Let G be a class of discrimination functions with vcdimension V ≥ 2. Let X be the set of all random variables (X,Y ) forwhich for fixed L ∈ (0, 1/2),

L = infg∈G

Pg(X) 6= Y .

Then, for every discrimination rule gn based upon X1, Y1, . . . , Xn, Yn,

N(ε, δ) ≥ L(V − 1)e−10

32×min

(1δ2,

1ε2

).

Also, and in particular, for ε ≤ L ≤ 1/4,

N(ε, δ) ≥ L

4ε2log

14δ

.

Proof. The first bound may be obtained easily from the expectation-bound of Theorem 14.5 (see Problem 14.1). Setting the bound of Theorem14.6 equal to δ provides the second bound on N(ε, δ). 2

These bounds may of course be combined. They show that N(ε, δ) isbounded from below by terms like (1/ε2) log(1/δ) (independent of VC) and(VC − 1)/ε2, as δ is typically much smaller than ε. By comparing these

Page 264: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 263

bounds with the upper bounds of Corollary 12.3, we see that the only dif-ference between the orders of magnitude is a log (1/ε)-factor, so all remarksmade for the case LC = 0 remain valid.

Interestingly, all bounds depend on the class C only through its vc di-mension. This fact suggests that when studying distribution-free propertiesof L(gn) − LC , the vc dimension is the most important characteristic ofthe class. Also, all bounds are linear in the vc dimension, which links itconveniently to sample size.

Remark. It is easy to see from the proofs that all results remain valid ifwe allow randomization in the rules gn. 2

Problems and Exercises

Problem 14.1. Show that Theorem 14.5 implies that for every discriminationrule gn based upon Dn,

N(ε, δ) ≥ L(V − 1)e−10

32×min

(1

δ2,

1

ε2

).

Hint: Assume that PLn − L > ε < δ. Then clearly, ELn − L ≤ ε+ δ.

Problem 14.2. Prove Theorem 14.3: First apply the proof method of Theorem14.1 to show that for every n, and gn, there is a distribution with LC = 0 suchthat EL(gn) ≥ (1− 1/n)/(2e). Use a monotonicity argument to finish the proof.

Problem 14.3. Prove Theorem 14.4 by modifying the proof of Theorem 7.2.

Page 265: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 264Printer: Opaque this

15The Maximum Likelihood Principle

In this chapter we explore the various uses of the maximum likelihoodprinciple in discrimination. In general, the principle is only applicable ifwe have some a priori knowledge of the problem at hand. We offer defini-tions, consistency results, and examples that highlight the advantages andshortcomings.

15.1 Maximum Likelihood: The Formats

Sometimes, advance information takes a very specific form (e.g., “if Y = 1,X is normal (µ, σ2)”). Often, it is rather vague (e.g., “we believe that Xhas a density,” or “η(x) is thought to be a monotone function of x ∈ R”).

If we have information in set format, the maximum likelihood principleis less appropriate. Here we know that the Bayes rule g∗(x) is of the formg(x) = Ix∈A, where A ∈ A and A is a class of sets of Rd. We refer to thechapters on empirical risk minimization (see Chapter 12 and also Chapter18) for this situation.

If we know that the true (unknown) η belongs to a class F of functionsthat map Rd to [0, 1], then we say that we are given information in regres-sion format. With each η′ ∈ F we associate a set A = x : η′(x) > 1/2and a discrimination rule g(x) = Ix∈A. The class of these rules is denotedby C. Assume that we somehow could estimate η by ηn. Then it makessense to use the associated rule gn(x) = Iηn(x)>1/2. The maximum likeli-hood method suggests a way of picking the ηn from F that in some sense is

Page 266: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.2 The Maximum Likelihood Method: Regression Format 265

most likely given the data. It is fully automatic—the user does not have topick any parameters—but it does require a serious implementation effortin many cases. In a sense, the regression format is more powerful than theset format, as there is more information in knowing a function η than inknowing the indicator function Iη>1/2. Still, no structure is assumed onthe part of X, and none is needed to obtain consistency results.

A third format, even more detailed, is that in which we know that the dis-tribution of (X,Y ) belongs to a class D of distributions onRd×0, 1. For agiven distribution, we know η, so we may once again deduce a rule g by set-ting g(x) = Iη(x)>1/2. This distribution format is even more powerful, asthe positions X1, . . . , Xn alone in some cases may determine the unknownparameters in the model. This situation fits in squarely with classical pa-rameter estimation in mathematical statistics. Once again, we may applythe maximum likelihood principle to select a distribution from D. Unfortu-nately, as we move to more restrictive and stronger formats, the number ofconditions under which the maximum likelihood principle is consistent in-creases as well. We will only superficially deal with the distribution format(see Chapter 16 for more detail).

15.2 The Maximum Likelihood Method:Regression Format

Given X1, . . . , Xn, the probability of observing Y1 = y1, . . . , Yn = yn is

n∏i=1

η(Xi)yi(1− η(Xi))1−yi .

If η is unknown but belongs to a family F of functions, we may wish toselect that η′ from F (if it exists) for which that likelihood product ismaximal. More formally, we select η′ so that the logarithm

Ln(η′) =1n

n∑i=1

(Yi log η′(Xi) + (1− Yi) log(1− η′(Xi)))

is maximal. If the family F is too rich, this will overfit, and consequently,the selected function ηn has a probability of error

L(ηn) = PIηn(X)>1/2 6= Y |Dn

that does not tend to L∗. For convenience, we assume here that there existsan element of F maximizing Ln.

We do not assume here that the class F is very small. Classes in whicheach η′ in F is known up to one or a few parameters are loosely calledparametric. Sometimes F is defined via a generic description such as: F is

Page 267: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.2 The Maximum Likelihood Method: Regression Format 266

the class of all η′ : Rd → [0, 1] that are Lipschitz with constant c. Suchclasses are called nonparametric. In certain cases, the boundary betweenparametric and nonparametric is unclear. We will be occupied with theconsistency question: does L(ηn) → infη′∈F L(η′) with probability one forall distribution of (X,Y )? (Here L(η′) = P

Iη′(X)>1/2 6= Y

is the prob-

ability of error of the natural rule that corresponds to η′.) If, additionally,F is rich enough or our prior information is good enough, we may haveinfη′∈F L(η′) = L∗, but that is not our concern here, as F is given to us.

We will not be concerned with the computational problems related to themaximization of Ln(η′) over F . Gradient methods or variations of them aresometimes used—refer to McLachlan (1992) for a bibliography. It sufficesto say that in simple cases, an explicit form for ηn may be available. Anexample follows.

Our first lemma illustrates that the maximum likelihood method shouldonly be used when the true (but unknown) η indeed belongs to F . Recallthat the same was not true for empirical risk minimization over vc classes(see Chapter 12).

Lemma 15.1. Consider the class F with two functions ηA ≡ 0.45, ηB ≡0.95. Let ηn be the function picked by the maximum likelihood method.There exists a distribution for (X,Y ) with η /∈ F such that with probabilityone, as n→∞,

L(ηn) → maxη′∈F

L(η′) > minη′∈F

L(η′).

Thus, maximum likelihood picks the wrong classifier.

Proof. Define the distribution of (X,Y ) on 0, 1 × 0, 1 by PX =0, Y = 0 = PX = 1, Y = 0 = 2/9, PX = 0, Y = 1 = 1/9, andPX = 1, Y = 1 = 4/9. Then one may quickly verify that

L(ηA) =59, L(ηB) =

49.

(Note that L∗ = 1/3, but this is irrelevant here.) Within F , ηB is the betterfor our distribution. By the strong law of large numbers, we have

Ln(ηA) → E Y log ηA(X) + (1− Y ) log(1− ηA(X))

with probability one (and similarly for ηB). If one works out the values, itis seen that with probability one, ηn ≡ ηA for all n large enough. Hence,L(ηn) → maxη′∈F L(η′) with probability one. 2

Remark. Besides the clear theoretical hazard of not capturing η in F ,the maximum likelihood method runs into a practical problem with “infin-ity.” For example, take F = ηA ≡ 0, ηB ≡ 1, and assume η ≡ 1/3. Forall n large enough, both classes are represented in the data sample with

Page 268: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.2 The Maximum Likelihood Method: Regression Format 267

probability one. This implies that Ln(ηA) = Ln(ηB) = −∞. The maxi-mum likelihood estimate ηn is ill-defined, while any reasonable rule shouldquickly be able to pick ηA over ηB . 2

The lemma shows that when η /∈ F , the maximum likelihood methodis not even capable of selecting one of two choices. The situation changesdramatically when η ∈ F . For finite classes F , nothing can go wrong.Noting that whenever η ∈ F , we have L∗ = infη′∈F L(η′), we may nowexpect that L(ηn) → L∗ in probability, or with probability one.

Theorem 15.1. If |F| = k <∞, and η ∈ F , then the maximum likelihoodmethod is consistent, that is,

L(ηn) → L∗ in probability.

Proof. For a fixed distribution of (X,Y ), we rank the members η(1), . . . , η(k)

of F by increasing values of L(η(i)). Put η(1) ≡ η. Let i0 be the largest in-dex for which L(η(i)) = L∗. Let ηn be the maximum likelihood choice fromF . For any α, we have

PL(ηn) 6= L∗

≤ PLn(η(1)) ≤ max

i>i0Ln(η(i))

≤ PLn(η(1)) ≤ α+

∑i>i0

PLn(η(i)) ≥ α.

Define the entropy of (X,Y ) by

E = E(η) = −E η(X) log η(X) + (1− η(X)) log(1− η(X))

(see Chapter 3). Recall that 0 ≤ E ≤ log 2. We also need the negativedivergences

Di = Eη(X) log

η(i)(X)η(X)

+ (1− η(X)) log1− η(i)(X)1− η(X)

,

which are easily seen to be nonpositive for all i (by Jensen’s inequality).Furthermore, Di = 0 if and only if η(i)(X) = η(X) with probability one.Observe that for i > i0, we cannot have this. Let θ = maxi>i0 Di. (If i0 = k,this set is empty, but then the theorem is trivially true.)

It is advantageous to take α = −E + θ/2. Observe that

ELn(η(i))

= −E +Di

= −E if i = 1≤ −E + θ if i > i0.

Page 269: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.3 Consistency 268

Thus,

PL(ηn) 6= L∗ ≤ PLn(η(1)) ≤ −E +

θ

2

+∑i>i0

PLn(η(i)) ≥ −E +

θ

2

.

By the law of large numbers, we see that both terms converge to zero. Note,in particular, that it is true even if for some i > i0, Di = −∞. 2

For infinite classes, many things can go wrong. Assume that F is theclass of all η′ : R2 → [0, 1] with η′ = IA and A is a convex set containingthe origin. Pick X uniformly on the perimeter of the unit circle. Then themaximum likelihood estimate ηn matches the data, as we may always finda closed polygon Pn with vertices at the Xi’s with Yi = 1. For ηn = IPn

,we have Ln(ηn) = 0 (its maximal value), yet L(ηn) = PY = 1 = p andL∗ = 0. The class F is plainly too rich.

For distributions of X on the positive integers, maximum likelihood doesnot behave as poorly, even though it must pick among infinitely manypossibilities. Assume F is the class of all η′, but we know that X putsall its mass on the positive integers. Then maximum likelihood tries tomaximize

∞∏i=1

(η′(i))N1,i(1− η′(i))N0,i

over η′ ∈ F , whereN1,i =∑nj=1 IXj=i,Yj=1 andN0,i =

∑nj=1 IXj=i,Yj=0.

The maximization is to be done over all (η′(1), η′(2), . . .) from⊗∞

i=1[0, 1].Fortunately, this is turned into a maximization for each individual i—usually we are not so lucky. Thus, if N0,i + N1,i = 0, we set ηn(i) = 0(arbitrarily), while if N0,i + N1,i > 0, we pick ηn(i) as the value u thatmaximizes

N1,i log u+N0,i log(1− u).

Setting the derivative with respect to u equal to zero shows that

ηn(i) =N1,i

N1,i +N0,i.

In other words, ηn is the familiar histogram estimate with bin width lessthan one half. It is known to be universally consistent (see Theorem 6.2).Thus, maximum likelihood may work for large F if we restrict the distri-bution of (X,Y ) a bit.

15.3 Consistency

Finally, we are ready for the main consistency result for ηn when F mayhave infinitely many elements. The conditions of the theorem involve the

Page 270: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.3 Consistency 269

bracketing metric entropy of F , defined as follows: for every distribution ofX, and ε > 0, let FX,ε be a set of functions such that for each η′ ∈ F , thereexist functions η′L, η

′U ∈ FX,ε such that for all x ∈ Rd

η′L(x) ≤ η′(x) ≤ η′U (x),

andEη′U (X)− η′L(X) ≤ ε.

That is, every η′ ∈ F is “bracketed” between two members of FX,ε whoseL1(µ) distance is not larger than ε. Let N(X, ε) denote the cardinality ofthe smallest such FX,ε. If N(X, ε) <∞,

logN(X, ε)

is called the bracketing ε-entropy of F , corresponding to X.

Theorem 15.2. Let F be a class of regression functions Rd → [0, 1]. As-sume that for every distribution of X and ε > 0, N(X, ε) < ∞. Then themaximum likelihood choice ηn satisfies

limn→∞

L(ηn) = L∗ in probability

for all distributions of (X,Y ) with η ∈ F .

Thus, consistency is guaranteed if F has a finite bracketing ε-entropy forevery X and ε. We provide examples in the next section. For the proof,first we need a simple corollary of Lemma 3.2:

Corollary 15.1. Let η, η′ : Rd → [0, 1], and let (X,Y ) be an Rd×0, 1-valued random variable pair with PY = 1|X = x = η(x). Define

L∗ = Emin(η(X), 1− η(X)), L(η′) = PIη′(X)>1/2 6= Y

.

Then

E√

η′(X)η(X) +√

(1− η′(X))(1− η(X))

≤ 1− (2E|η(X)− η′(X)|)2

8

≤ 1− 18(L(η′)− L∗)2.

Proof. The first inequality follows by Lemma 3.2, and the second byTheorem 2.2. 2

Now, we are ready to prove the theorem. Some of the ideas used hereappear in Wong and Shen (1992).

Proof of Theorem 15.2. Look again at the proof for the case |F| <∞(Theorem 15.1). Define E as there. Let F (ε) be the collection of those η′ ∈ F

Page 271: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.3 Consistency 270

with L(η′) > L∗+ε, recalling that L(η′) = PIη′(X)>1/2 6= Y

. For every

α,

PL(ηn)− L∗ > ε ≤ P

sup

η′∈F(ε)Ln(η′) ≥ Ln(η)

≤ P Ln(η) ≤ α+ P

sup

η′∈F(ε)Ln(η′) ≥ α

.

For reasons that will be obvious later, we take α = −E − ε2/16. Thus,

PL(ηn)− L∗ > ε

≤ PLn(η) ≤ −E − ε2

16

+ P

sup

η′∈F(ε)Ln(η′) ≥ −E − ε2

16

= PLn(η) ≤ −E − ε2

16

+P

(sup

η′∈F(ε)Ln(η′)− Ln(η)

)+ (Ln(η) + E) ≥ ε2

16− ε2

8

≤ P|Ln(η) + E| ≥ ε2

16

+ P

sup

η′∈F(ε)Ln(η′)− Ln(η) ≥ −ε

2

8

.

Noting that E Ln(η) = −E , the law of large numbers implies that thefirst term on the right-hand side converges to zero for every ε. (See Problem15.4 for more information.)

Next we bound the second term. For a fixed distribution, let FX,δ be thesmallest set of functions such that for each η′ ∈ F , there exist η′L, η

′U ∈ FX,δ

withη′L(x) ≤ η′(x) ≤ η′U (x), x ∈ Rd

andEη′U (X)− η′L(X) ≤ δ,

where δ > 0 will be specified later. By assumption, N(X, δ) = |FX,δ| <∞.We have,

P

sup

η′∈F(ε)Ln(η′)− Ln(η) ≥ −ε

2

8

= P

sup

η′∈F(ε)

n∏i=1

η′(Xi)Yi(1− η′(Xi))1−Yi

η(Xi)Yi(1− η(Xi))1−Yi≥ e−nε

2/8

≤ P

sup

η′∈F(ε)

n∏i=1

η′U (Xi)Yi(1− η′L(Xi))1−Yi

η(Xi)Yi(1− η(Xi))1−Yi≥ e−nε

2/8

(where η′L, η

′U ∈ FX,δ, η′L ≤ η′ ≤ η′U and Eη′U (X)− η′L(X) < δ)

Page 272: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.3 Consistency 271

≤ N2(X, δ) supη′∈F(ε)

P

n∏i=1

η′U (Xi)Yi(1− η′L(Xi))1−Yi

η(Xi)Yi(1− η(Xi))1−Yi≥ e−nε

2/8

(by the union bound)

= N2(X, δ) supη′∈F(ε)

P

n∏i=1

√η′U (Xi)Yi(1− η′L(Xi))1−Yi

η(Xi)Yi(1− η(Xi))1−Yi≥ e−nε

2/16

≤ N2(X, δ)

× supη′∈F(ε)

(E√

η′U (X)η(X) +√

(1− η′L(X))(1− η(X)))n

enε2/16,

where in the last step we used Markov’s inequality and independence. Wenow find a good bound for

E√

η′U (X)η(X) +√

(1− η′L(X))(1− η(X))

for each η′ ∈ F (ε) as follows:

E√

η′U (X)η(X) +√

(1− η′L(X))(1− η(X))

≤ E√

η′(X)η(X) +√

(1− η′(X))(1− η(X))

+√η(X) (η′U (X)− η′(X)) +

√(1− η(X)) (η′(X)− η′L(X))

(here we used

√a+ b ≤

√a+

√b)

≤ E√

η′(X)η(X) +√

(1− η′(X))(1− η(X))

+√

Eη(X)√

E η′U (X)− η′(X)

+√

E1− η(X)√

E η′(X)− η′L(X)

(by the Cauchy-Schwarz inequality)

≤ E√

η′(X)η(X) +√

(1− η′(X))(1− η(X))

+ 2√δ

≤ 1− ε2

8+ 2

√δ (by Corollary 15.1).

Summarizing, we obtain

P

sup

η′∈F(ε)Ln(η′)− Ln(η) ≥ −ε

2

8

≤ N2(X, δ)(

1− ε2

8+ 2

√δ

)nenε

2/16

Page 273: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.4 Examples 272

≤ N2(X, δ)e−nε2/16+2n

√δ (since 1 + x ≤ ex).

By taking δ <(ε2

32

)2

, we see that the probability converges to zero expo-nentially rapidly, which concludes the proof. 2

15.4 Examples

In this section we show by example how to apply the previous consistencyresults. In all cases, we assume that η ∈ F and we are concerned with theweak convergence of L(ηn) to L∗ for all such distributions of (X,Y ). Theclasses are as follows:

F1 =η = I[a,b],−∞ ≤ a ≤ b ≤ ∞

F ′1 =

η = cI[a,b], c ∈ [0, 1],−∞ ≤ a ≤ b ≤ ∞

F2 =

η = I[a1,b1]×···×[ad,bd],−∞ ≤ ai ≤ bi ≤ ∞, 1 ≤ i ≤ d

F3 = η : η(x) = cx/(1 + cx) : x ≥ 0, c ≥ 0F4 = η is monotone decreasing on [0, 1]

F5 =η : η(x) =

11 + ‖x−m‖2

,m ∈ Rd

F6 =

η : η(x) = sin2(θx), θ ∈ R

F7 =

η : η(x) =

eα0+αT x

1 + eα0+αT x, α0 ∈ R, x, α ∈ Rd

.

These are all rather simple, yet they will illustrate various points. Ofthese classes, F4 is nonparametric; yet, it behaves “better” than the one-parameter class F6, for example. We emphasize again that we are interestedin consistency for all distributions of X.

For F1, ηn will agree with the samples, that is, ηn(Xi) = Yi for all i, andtherefore, ηn ∈ F1 is any function of the form I[a,b] with

X(l−1) < a ≤ X(l), X(r) ≤ b < X(r+1),

where X(1) ≤ · · · ≤ X(n) are the order statistics for X1, . . . , Xn (withX(0) = −∞, X(n+1) = ∞), X(l) is the smallest data point with Y (l) = 1,and X(r) is the largest data point with Y (r) = 1. As L∗ = 0, we claim that

E L(ηn) ≤ PX(l−1) < X < X(l)

+ P

X(r) < X < X(r+1)

≤ 4

n+ 1.

Page 274: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.4 Examples 273

The rule is simply excellent, and has universal performance guarantees.

Remark. Note that in this case, maximum likelihood minimizes the em-pirical risk over the class of classifiers C = φ = I[a,b], a ≤ b. As C hasvc dimension VC = 2 (Theorem 13.7), and infφ∈C L(φ) = 0, Theorem 12.7implies that for all ε > 0,

PL(ηn) > ε ≤ 8n22−nε/2,

and thatEL(ηn) ≤

4 log n+ 4n log 2

(see Problem 12.8). With the analysis given here, we have gotten rid of thelog n factor. 2

The class F ′1 is much more interesting. Here you will observe a dramaticdifference with empirical risk minimization, as the parameter c plays a keyrole that will aid a lot in the selection. Note that the likelihood product iszero if N1 = 0 or if Yi = 1 for some i with Xi /∈ [a, b], and it is

cN1(1− c)N0 otherwise,

where N0 is the number of (Xi, Yi) pairs with a ≤ Xi ≤ b, Yi = 0, and N1

is the number of (Xi, Yi) pairs with a ≤ Xi ≤ b, Yi = 1. For fixed N0, N1,this is maximal when

c =N1

N0 +N1.

Resubstitution yields that the likelihood product is zero if N1 = 0 or ifYi = 1 for some i with Xi /∈ [a, b], and equals

expN1 log

N1

N0 +N1+N0 log

N0

N0 +N1

if otherwise.

Thus, we should pick ηn = cI[a,b] such that c = N1/(N0 + N1), and [a, b]maximizes the divergence

N1 logN1

N0 +N1+N0 log

N0

N0 +N1.

See Problem 15.5.

For F2, by a similar argument as for F1, let [A1, B1]× · · · × [Ad, Bd] bethe smallest rectangle of Rd that encloses all Xi’s for which Yi = 1. Then,we know that ηn = I[A1,B1]×···×[Ad,Bd] agrees with the data. Furthermore,L∗ = 0, and

EL(ηn) ≤4dn+ 1

Page 275: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.4 Examples 274

(see Problem 15.6).

The logarithm of the likelihood product for F3 is

∑i:Yi=1

log(cXi)−n∑i=1

log(1 + cXi).

Setting the derivative with respect to c equal to zero in the hope of ob-taining an equation that must be satisfied by the maximum, we see that cmust satisfy

N1 =n∑i=1

cXi

1 + cXi=

n∑i=1

Xi

1/c+Xi

unless X1 = · · · = Xn = 0, where N1 =∑ni=1 IYi=1. This equation

has a unique solution for c. The rule corresponding to ηn is of the formg(x) = Icx>1 = Ix>1/c. Equivalently,

g(x) =

1 if N1 >∑ni=1

Xi

x+Xi

0 otherwise.

This surprising example shows that we do not even have to know the pa-rameter c in order to describe the maximum likelihood rule. In quite a fewcases, this shortcut makes such rules very appealing indeed. Furthermore,as the condition of Theorem 15.2 is fulfilled, η ∈ F3 implies that the ruleis consistent as well.

For F4, it is convenient to order the Xi’s from small to large and toidentify k consecutive groups, each group consisting of any number of Xi’swith Yi = 0, and one Xi with Yi = 1. Thus, k =

∑ni=1 IYi=1. Then, a

moment’s thought shows that ηn must be piecewise constant, taking values1 ≥ a1 ≥ · · · ≥ ak ≥ 0 on the consecutive groups, and the value zero pastthe k-th group (which can only consist of Xi’s with Yi = 0. The likelihoodproduct thus is of the form

a1(1− a1)n1a2(1− a2)n2 · · · ak(1− ak)nk ,

where n1, . . . , nk are the cardinalities of the groups minus one (i.e., thenumber of Yj = 0-elements in the i-th group). Finally, we have

(a1, . . . , ak) = arg max1≥a1≥···≥ak≥0

k∏j=1

aj(1− aj)nj .

To check consistency, we see that for every X and ε > 0,

N(X, ε) ≤ 4d3ε e

Page 276: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

15.4 Examples 275

(see Problem 15.7). Thus, the condition of Theorem 15.2 is satisfied, andL(ηn) → L∗ in probability, whenever η ∈ F4.

In F5, the maximum likelihood method will attempt to place m at thecenter of the highest concentration in Rd of Xi’s with Yi = 1, while stayingaway from Xi’s with Yi = 0. Certainly, there are computational problems,but it takes little thought to verify that the conditions of Theorem 15.2 aresatisfied. Explicit description of the rule is not necessary for some theoret-ical analysis!

Class F6 is a simple one-parameter class that does not satisfy the con-dition of Theorem 15.2. In fact, maximum likelihood fails here for the fol-lowing reason: Let X be uniform on [0, 1]. Then the likelihood productis ∏

i:Yi=1

sin2(θXi)×∏i:Yi=0

cos2(θXi).

This product reaches a degenerate global maximum (1) as θ →∞, regard-less of the true (unknown) value of θ that gave rise to the data. See Problem15.9.

Class F7 is used in the popular logistic discrimination problem, reviewedand studied by Anderson (1982), see also McLachlan (1992, Chapter 8). Itis particularly important to observe that with this model,

η > 1/2 ≡eα0+α

T x > 1/2≡αTx > β

,

where β = −α0−log 2. Thus, F7 subsumes linear discriminants. It does alsoforce a bit more structure on the problem, making η approach zero or one aswe move away from the separating hyperplane. Day and Kerridge (1967)point out that model F7 is appropriate if the class-conditional densitiestake the form

cf(x) exp−1

2(x−m)TΣ−1(x−m)

,

where c is a normalizing constant, f is a density, m is a vector, and Σ is apositive definite matrix; f and Σ must be the same for the two densities,but c and m may be different. Unfortunately, obtaining the best values forα0 and α by maximum likelihood takes a serious computational effort. Hadwe tried to estimate f,m, and Σ in the last example, we would have donemore than what is needed, as both f and Σ drop out of the picture. In thisrespect, the regression format is both parsimonious and lightweight.

Page 277: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 276

15.5 Classical Maximum Likelihood: DistributionFormat

In a more classical approach, we assume that given Y = 1, X has a densityf1, and given Y = 0, X has a density f0, where both f0 and f1 belongto a given family F of densities. A similar setup may be used for atomicdistributions, but this will not add anything new here, and is rather routine.The likelihood product for the data (X1, Y1), . . . , (Xn, Yn) is

n∏i=1

(pf1(Xi))Yi((1− p)f0(Xi))1−Yi ,

where p = PY = 1 is assumed to be unknown as well. The maximumlikelihood choices for p, f0, f1 are given by

(p∗, f∗0 , f∗1 ) = arg max

p∈[0,1],f0,f1∈FLn(p, f0, f1),

where

Ln(p, f0, f1) =1n

n∑i=1

(Yi log(pf1(Xi)) + (1− Yi) log((1− p)f0(Xi))) .

Having determined p∗, f∗0 , f∗1 (recall that the solution is not necessarily

unique), the maximum likelihood rule is

gn(x) =

0 if (1− p∗)f∗0 (x) ≥ p∗f∗1 (x)1 otherwise.

Generally speaking, the distribution format is more sensitive than the re-gression format. It may work better under the correct circumstances. How-ever, we give up our universality, as the nature of the distribution ofX mustbe known. For example, if X is distributed on a lower dimensional nonlin-ear manifold of Rd, the distribution format is particularly inconvenient.Consistency results and examples are provided in a few exercises.

Problems and Exercises

Problem 15.1. Show that if |F| = k < ∞ and η ∈ F , then L(ηn) → L∗ withprobability one, where ηn is obtained by the maximum likelihood method. Also,prove that convergence with probability one holds in Theorem 15.2.

Problem 15.2. Prove that if η, η′ are [0, 1]-valued regression functions, then thedivergence

D(η′) = E

η(X) log

η′(X)

η(X)+ (1− η(X)) log

1− η′(X)

1− η(X)

Page 278: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 277

is nonpositive, and that D = 0 if and only if η(X) = η′(X) with probability one.In this sense, D measures the distance between η and η′.

Problem 15.3. Let

L∗ = Emin(η(X), 1− η(X)), L(η′) = PIη′(X)>1/2 6= Y

,

and

D(η′) = E

η(X) log

η′(X)

η(X)+ (1− η(X)) log

1− η′(X)

1− η(X)

,

where η(x) = PY = 1|X = x, and η′ : Rd → [0, 1] is arbitrary. Prove that

D(η′) ≤ −E(η(X)− η′(X))2

2

≤ −1

2(L(η′)− L∗)2

(see also Problem 3.22). Hint: First prove that for p, q ∈ [0, 1],

p logq

p+ (1− p) log

1− q

1− p≤ − (p− q)2

2

(use Taylor’s series with remainder term for H(·)).

Problem 15.4. Show that for each η there exists an ε0 > 0 such that for allε ∈ (0, ε0),

PLn(η) > ELn(η)+ ε ≤ e−nε2/16.

Hint: Proceed by Chernoff’s bounding technique.

Problem 15.5. Consider η ∈ F ′1 and let ηn be a maximum likelihood estimate

over F ′1. Let p = PY = 1. Show that

L∗ = pmin(1,

1− c

c

).

Derive an upper bound forE L(ηn)− L∗ ,

where in case of multiple choices for ηn, you take the smallest ηn in the equivalenceclass. This is an important problem, as F ′

1 picks a histogram cell in a data-basedmanner. F ′

1 may be generalized to the automatic selection of the best k-cellhistogram: let F be the collection of all η’s that are constant on the k intervalsdetermined by breakpoints −∞ < a1 < · · · < ak−1 <∞.

Problem 15.6. Show that for the class F2,

EL(ηn) ≤ 4d

n+ 1

when η ∈ F2 and ηn is the maximum likelihood estimate.

Problem 15.7. Show that for the class F4,

N(X, ε) ≤ 4d3/εe

holds for all X and ε > 0, that is, the bracketing ε-entropy of F4 is boundedby d3/εe. Hint: Cover the class F4 by a class of monotone decreasing, piecewiseconstant functions, whose values are multiples of ε/3, and whose breakpoints areat the kε/3-quantiles of the distribution of X (k = 1, 2, . . . , d3/εe).

Page 279: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 278

Problem 15.8. Discuss the maximum likelihood method for the class

F =

η : η(x) =

c if αTx > βc′ otherwise;

c, c′ ∈ [0, 1], α ∈ Rd, β ∈ R.

What do the discrimination rules look like? If η ∈ F , is the rule consistent?Can you guarantee a certain rate of convergence for EL(ηn)? If η /∈ F , canyou prove that L(ηn) does not converge in probability to infη′∈F L(η′) for somedistribution of (X,Y ) with η(x) = PY = 1|X = x? How would you obtain thevalues of c, c′, α, β for the maximum likelihood choice ηn?

Problem 15.9. Let X_1, . . . , X_n be i.i.d. uniform [0, 1] random variables. Let y_1, . . . , y_n be arbitrary {0, 1}-valued numbers. Show that with probability one,

lim sup_{θ→∞} ∏_{i: y_i=1} sin²(θX_i) × ∏_{i: y_i=0} cos²(θX_i) = 1,

while for any t < ∞, with probability one,

sup_{0≤θ≤t} ∏_{i: y_i=1} sin²(θX_i) × ∏_{i: y_i=0} cos²(θX_i) < 1.
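A quick numerical illustration of the phenomenon in Problem 15.9 (ours, not part of the exercise; the sample size, the labels, and the θ grids are arbitrary choices): over any fixed range [0, t] the likelihood product stays strictly below 1, but its supremum creeps toward 1 as t grows.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    X = rng.uniform(size=n)          # i.i.d. uniform [0, 1] points
    y = rng.integers(0, 2, size=n)   # arbitrary 0-1 labels

    def sup_likelihood(t, grid=200001):
        """Approximate sup over theta in [0, t] of
        prod_{y_i=1} sin^2(theta X_i) * prod_{y_i=0} cos^2(theta X_i)."""
        thetas = np.linspace(0.0, t, grid)
        S = np.sin(np.outer(thetas, X)) ** 2
        C = np.cos(np.outer(thetas, X)) ** 2
        terms = np.where(y == 1, S, C)   # sin^2 where y_i = 1, cos^2 where y_i = 0
        return np.prod(terms, axis=1).max()

    for t in (10, 100, 1000, 10000):
        print(f"sup over [0, {t}] is approximately {sup_likelihood(t):.6f}")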


16 Parametric Classification

What do you do if you believe (or someone tells you) that the conditional distributions of X given Y = 0 and Y = 1 are members of a given family of distributions, described by finitely many real-valued parameters? Of course, it does not make sense to say that there are, say, six parameters. By interleaving the bits of binary expansions, we can always make one parameter out of six, and by splitting binary expansions, we may make a countable number of parameters out of one parameter (by writing the bits down in triangular fashion as shown below).

b1  b2  b4  b7  · · ·
b3  b5  b8  · · ·
b6  b9  · · ·
b10 · · ·
...
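As a concrete illustration of the interleaving trick (a minimal sketch of our own; the two-parameter case and the finite precision are illustrative assumptions, since floating-point arithmetic can only carry about 52 interleaved bits), the following routine merges the binary expansions of two numbers in [0, 1) into one number and splits it back:

    def interleave(u, v, bits=26):
        """Merge two numbers in [0, 1) by alternating the bits of their binary expansions."""
        w = 0.0
        for k in range(1, bits + 1):
            bu = int(u * 2**k) % 2          # k-th bit of u
            bv = int(v * 2**k) % 2          # k-th bit of v
            w += bu / 2**(2*k - 1) + bv / 2**(2*k)
        return w

    def split(w, bits=26):
        """Recover the two numbers from the interleaved expansion."""
        u = sum((int(w * 2**(2*k - 1)) % 2) / 2**k for k in range(1, bits + 1))
        v = sum((int(w * 2**(2*k)) % 2) / 2**k for k in range(1, bits + 1))
        return u, v

    w = interleave(0.625, 0.3125)
    print(split(w))    # approximately (0.625, 0.3125)

Splitting one parameter into countably many proceeds in the same spirit, reading the bits off along the diagonals of the triangular array above.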

Thus, we must proceed with care. The number of parameters of a family really is measured more by the sheer size or vastness of the family than by mere representation of numbers. If the family is relatively small, we will call it parametric, but we will not give you a formal definition of "parametric." For now, we let Θ, the set of all possible values of the parameter θ, be a subset of a finite-dimensional Euclidean space. Formally, let

P_Θ = {P_θ : θ ∈ Θ}

be a class of probability distributions on the Borel sets of R^d. Typically, the family P_Θ is parametrized in a smooth way. That is, two distributions corresponding to two parameter vectors close to each other are in some sense close to each other as well. Assume that the class-conditional densities f_0 and f_1 exist, and that both belong to the class of densities

F_Θ = {f_θ : θ ∈ Θ}.

Discrete examples may be handled similarly. Take for example all gaussian distributions on R^d, in which

f_θ(x) = (1/√((2π)^d det(Σ))) e^{−(1/2)(x−m)^T Σ^{−1}(x−m)},

where m ∈ R^d is the vector of means, and Σ is the covariance matrix. Recall that x^T denotes the transposition of the column vector x, and det(Σ) is the determinant of Σ. This class is conveniently parametrized by θ = (m, Σ), that is, a vector of d + d(d+1)/2 real numbers.

Knowing that the class-conditional distributions are in P_Θ makes discrimination so much easier—rates of convergence to L∗ are excellent. Take F_Θ as the class of uniform densities on hyperrectangles of R^d: this has 2d natural parameters, the coordinates of the lower left and upper right vertices.

Figure 16.1. The class-conditional densities are uniform on hyperrectangles.

Given (X_1, Y_1), . . . , (X_n, Y_n), a child could not do things wrong—for class 1, estimate the upper right vertex by

( max_{1≤i≤n: Y_i=1} X_i^{(1)}, max_{1≤i≤n: Y_i=1} X_i^{(2)}, . . . , max_{1≤i≤n: Y_i=1} X_i^{(d)} ),

and similarly for the upper right vertex of the class 0 density. Lower left vertices are estimated by considering minima. If A_0, A_1 are the two unknown hyperrectangles and p = P{Y = 1}, the Bayes rule is simply

g∗(x) =
  1 if x ∈ A_1 − A_0,
  0 if x ∈ A_0 − A_1,
  1 if x ∈ A_1 ∩ A_0 and p/λ(A_1) > (1 − p)/λ(A_0),
  0 if x ∈ A_1 ∩ A_0 and p/λ(A_1) ≤ (1 − p)/λ(A_0).


In reality, replace A_0, A_1, and p by the sample estimates Â_0, Â_1 (described above) and p̂ = (1/n) ∑_{i=1}^n I_{Y_i=1}. This way of doing things works very well, and we will pick up the example a bit further on. However, it is a bit ad hoc. There are indeed a few main principles that may be used in the design of classifiers under the additional information given here. In no particular order, here are a few methodologies:

(A) As the Bayes classifiers belong to the class

C = { φ = I_{p f_{θ_1} > (1−p) f_{θ_0}} : p ∈ [0, 1], θ_0, θ_1 ∈ Θ },

it suffices to consider classifiers in C. For example, if F_Θ is the normal family, then C coincides with the indicators of the sets in

{ {x : x^T A x + b^T x + c > 0} : A is a d × d matrix, b ∈ R^d, c ∈ R },

that is, the family of quadratic decisions. In the hyperrectangular example above, every φ is of the form I_{A_1−A_2}, where A_1 and A_2 are hyperrectangles of R^d. Finding the best classifier of the form φ = I_A where A ∈ A is something we can do in a variety of ways: one such way, empirical risk minimization, is dealt with in Chapter 12, for example.

(B) Plug-in rules estimate (θ_0, θ_1) by (θ̂_0, θ̂_1) and p = P{Y = 1} by p̂ from the data, and form the rule

g_n(x) = 1 if p̂ f_{θ̂_1}(x) > (1 − p̂) f_{θ̂_0}(x), and 0 otherwise.

The rule here is within the class C described in the previous paragraph. We are hopeful that the performance with g_n is close to the performance with the Bayes rule g∗ when (p̂, θ̂_0, θ̂_1) is close to (p, θ_0, θ_1). For this strategy to work it is absolutely essential that L(p, θ_0, θ_1) (the probability of error when p, θ_0, θ_1 are the parameters) be continuous in (p, θ_0, θ_1). Robustness is a key ingredient. If the continuity can be captured in an inequality, then we may get performance guarantees for E{L_n} − L∗ in terms of the distance between (p̂, θ̂_0, θ̂_1) and (p, θ_0, θ_1). Methods of estimating the parameters include maximum likelihood. This methodology is dealt with in Chapter 15. This method is rather sensitive to incorrect hypotheses (what if we were wrong about our assumption that the class-conditional distributions were in P_Θ?). Another strategy, minimum distance estimation, picks that member from P_Θ that is closest in some sense to the raw empirical measure that puts mass 1/n at each of the n data points. See Section 16.3. This approach does not care about continuity of L(p, θ_0, θ_1), as it judges members of P_Θ by closeness in some space under a metric that is directly related to the probability of error. Robustness will drop out naturally.


A general approach should not have to know whether Θ can be described by a finite number of parameters. For example, it should equally well handle descriptions such as: P_Θ is the class of all distributions of (X, Y) in which η(x) = P{Y = 1|X = x} is monotonically increasing in all the components of x. Universal paradigms such as maximum likelihood, minimum distance estimation, and empirical error minimization are all applicable here. This particular P_Θ is dealt with in Chapter 15, just to show you that the description of the class does not invalidate the underlying principles.

16.1 Example: Exponential Families

A class P_Θ is exponential if every class-conditional density f_θ can be written in the form

f_θ(x) = c α(θ) β(x) e^{∑_{i=1}^k π_i(θ) ψ_i(x)},

where β, ψ_1, . . . , ψ_k : R^d → R, β ≥ 0, α, π_1, . . . , π_k : Θ → R are fixed functions, and c is a normalizing constant. Examples of exponential families include the gaussian, gamma, beta, Rayleigh, and Maxwell densities (see Problem 16.4). The Bayes rule can be rewritten as

g∗(x) = 1 if log( p f_{θ∗_1}(x) / ((1 − p) f_{θ∗_0}(x)) ) > 0, and 0 otherwise.

This is equivalent to

g∗(x) = 1 if ∑_{i=1}^k ψ_i(x) (π_i(θ∗_0) − π_i(θ∗_1)) < log( p α(θ∗_1) / ((1 − p) α(θ∗_0)) ), and 0 otherwise.

The Bayes rule is a so-called generalized linear rule with 1, ψ_1, . . . , ψ_k as basis functions. Such rules are easily dealt with by empirical risk minimization and related methods such as complexity regularization (Chapters 12, 17, 18, 22).
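To make the generalized linear form concrete, here is a small sketch (ours; the univariate gaussian family and the parameter values are assumptions made for the example) in which ψ_1(x) = x and ψ_2(x) = x², and the rule above is checked against the direct comparison of p f_{θ_1}(x) with (1 − p) f_{θ_0}(x):

    import numpy as np

    # Univariate gaussian family in exponential form:
    #   f_theta(x) = c * alpha(theta) * exp(pi1(theta)*x + pi2(theta)*x^2),
    # with beta(x) = 1, c = 1/sqrt(2*pi), psi_1(x) = x, psi_2(x) = x^2.
    def alpha(m, s): return np.exp(-m**2 / (2 * s**2)) / s
    def pi1(m, s):   return m / s**2
    def pi2(m, s):   return -1.0 / (2 * s**2)

    p = 0.4                      # assumed class probability
    theta0 = (0.0, 1.0)          # (mean, standard deviation) of class 0
    theta1 = (2.0, 1.5)          # (mean, standard deviation) of class 1

    def g_star(x):
        """Bayes rule in its generalized linear form."""
        lhs = x * (pi1(*theta0) - pi1(*theta1)) + x**2 * (pi2(*theta0) - pi2(*theta1))
        rhs = np.log(p * alpha(*theta1) / ((1 - p) * alpha(*theta0)))
        return 1 if lhs < rhs else 0

    def g_star_direct(x):
        """Same rule via direct comparison of the weighted densities."""
        def f(x, m, s):
            return np.exp(-(x - m)**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
        return 1 if p * f(x, *theta1) > (1 - p) * f(x, *theta0) else 0

    for x in (-1.0, 0.5, 1.2, 3.0):
        assert g_star(x) == g_star_direct(x)

Note that β(x) (here identically 1) never enters the decision, in line with the point made next.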

Another important point is that g∗ does not involve the function β. For all we know, β may be some esoteric ill-behaved function that would make estimating f_θ(x) all but impossible if β were unknown. Even if P_Θ is the huge family in which β ≥ 0 is left undetermined, but it is known to be identical for the two class-conditional densities (and α(θ) is just a normalization factor), we would still only have to look at the same small class of generalized linear discrimination rules! So, densities do not matter—ratios of densities do. Pattern recognition should be, and is, easier than density estimation. Those who first estimate (f_0, f_1) by (f̂_0, f̂_1) and then construct rules based on the sign of p̂ f̂_1(x) − (1 − p̂) f̂_0(x) do themselves a disservice.


16.2 Standard Plug-In Rules

In standard plug-in methods, we construct estimates θ̂_0, θ̂_1 and p̂ from the data and use them to form a classifier

g_n(x) = 1 if p̂ f_{θ̂_1}(x) > (1 − p̂) f_{θ̂_0}(x), and 0 otherwise.

It is generally not true that if p̂ → p, θ̂_0 → θ_0, and θ̂_1 → θ_1 in probability, then L(g_n) → L∗ in probability, where L(g_n) is the probability of error with g_n. Consider the following simple example:

Example. Let F_Θ be the class of all uniform densities on [−θ, 0] if θ ≠ 1 and on [0, θ] if θ = 1. Let θ_1 = 1, θ_0 = 2, p = 1/2. Then a reasonable estimate of θ_i would be θ̂_i = max_{j: Y_j = i} |X_j|. Clearly, θ̂_i → θ_i in probability. However, as θ̂_1 ≠ 1 with probability one, the estimated class-1 density puts no mass on (0, ∞), so that g_n(x) = 0 for x > 0. Thus, even though L∗ = 0 (as the supports of f_{θ_0} and f_{θ_1} are not overlapping), L(g_n) ≥ P{Y = 1} = p = 1/2. The problem with this is that there is no continuity with respect to θ in F_Θ. □

Basic consistency based upon continuity considerations is indeed easy to establish. As ratios of densities matter, it helps to introduce

η_θ(x) = p f_{θ_1}(x) / ((1 − p) f_{θ_0}(x) + p f_{θ_1}(x)) = P{Y = 1|X = x},

where θ = (p, θ_0, θ_1) or (θ_0, θ_1) as the case may be. We recall that if g_n(x) = I_{η_θ̂(x)>1/2}, where θ̂ is an estimate of θ, then

L(g_n) − L∗ ≤ 2E{ |η_θ(X) − η_θ̂(X)| | D_n },

where D_n is the data sequence. Thus,

E{L(g_n)} − L∗ ≤ 2E{ |η_θ(X) − η_θ̂(X)| }.

By the Lebesgue dominated convergence theorem, we have, without further ado:

Theorem 16.1. If η_θ is continuous in θ in the L1(µ) sense, where µ is the measure of X, and θ̂ → θ in probability, then E{L(g_n)} → L∗ for the standard plug-in rule.

In some cases, we can do better and derive rates of convergence by examining the local behavior of η_θ(x). For example, if η_θ(x) = e^{−θ|x|}, x ∈ R, then

|η_θ(x) − η_{θ′}(x)| ≤ |x| |θ − θ′|,


and

E{L(g_n)} − L∗ ≤ 2E{ |X| |θ̂ − θ| } ≤ 2 √(E{X²}) √(E{(θ̂ − θ)²}),

yielding an explicit bound. In general, θ is multivariate, consisting at least of the triple (p, θ_0, θ_1), but the above example shows the way to happy analysis. For the simple example given here, E{L(g_n)} → L∗ if E{(θ̂ − θ)²} → 0. In fact, this seems to suggest that for this family, θ̂ should be found to minimize E{(θ̂ − θ)²}. This is false. One is always best off minimizing the probability of error. Other criteria may be relevant via continuity, but should be considered with care.

How certain parameters are estimated for given families of distributions is what mathematical statistics is all about. The maximum likelihood principle looms large: θ_0 is estimated for densities f_θ by

θ̂_0 = arg max_θ ∏_{i: Y_i=0} f_θ(X_i),

for example. If you work out this (likelihood) product, you will often discover a simple form for the estimate in terms of the data. In discrimination, only η matters, not the class-conditional densities. Maximum likelihood in function of the η's was studied in Chapter 15. We saw there that this is often consistent, but that maximum likelihood behaves poorly when the true distribution is not in P_Θ. We will work out two simple examples.

As an example of the maximum likelihood method in discrimination, we assume that

F_Θ = { f_θ(x) = (x^{α−1} e^{−x/β} / (β^α Γ(α))) I_{x>0} : α, β > 0 }

is the class of all gamma densities with θ = (α, β). The likelihood product given (X_1, Y_1), . . . , (X_n, Y_n) is, if θ_0, θ_1 are the unknown parameters,

∏_{i=1}^n (p f_{θ_1}(X_i))^{Y_i} ((1 − p) f_{θ_0}(X_i))^{1−Y_i}.

This is the probability of observing the data sequence if the f_θ's were in fact discrete probabilities. This product is simply

( p / (β_1^{α_1} Γ(α_1)) )^{∑_{i=1}^n Y_i} · ( (1 − p) / (β_0^{α_0} Γ(α_0)) )^{n − ∑_{i=1}^n Y_i}
· exp( −(1/β_1) ∑ Y_i X_i − (1/β_0) ∑ (1 − Y_i) X_i + (α_1 − 1) ∑ Y_i log X_i + (α_0 − 1) ∑ (1 − Y_i) log X_i ),


where θ_0 = (α_0, β_0), θ_1 = (α_1, β_1). The first thing we notice is that this expression depends only on certain functions of the data, notably ∑ Y_i X_i, ∑ (1 − Y_i) X_i, ∑ Y_i log X_i, ∑ (1 − Y_i) log X_i, and ∑ Y_i. These are called the sufficient statistics for the problem at hand. We may in fact throw away the data and just store the sufficient statistics. The likelihood product has to be maximized. Even in this rather simple univariate example, this is a nontrivial task. Luckily, we immediately note that p occurs in the factor p^N (1 − p)^{n−N}, where N = ∑ Y_i. This is maximal at p̂ = N/n, a well-known result. For fixed α_0, α_1, we can also get β_0, β_1; but for variable α_0, α_1, β_0, β_1, the optimization is difficult. In d-dimensional cases, one has nearly always to resort to specialized algorithms.
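A short sketch of these computations (ours; the data-generating parameters are arbitrary, and the closed-form step for β given α comes from setting the derivative of the log-likelihood to zero, namely β = class mean / α):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    Y = rng.integers(0, 2, size=n)
    X = np.where(Y == 1, rng.gamma(shape=3.0, scale=1.5, size=n),
                         rng.gamma(shape=2.0, scale=1.0, size=n))

    # The sufficient statistics identified above:
    stats = {
        "sum_Y_X":       np.sum(Y * X),
        "sum_notY_X":    np.sum((1 - Y) * X),
        "sum_Y_logX":    np.sum(Y * np.log(X)),
        "sum_notY_logX": np.sum((1 - Y) * np.log(X)),
        "N":             int(np.sum(Y)),
    }
    p_hat = stats["N"] / n      # maximizer of p^N (1 - p)^(n - N)

    def beta_given_alpha(a, label):
        """For fixed alpha, the likelihood in beta is maximized at (class mean) / alpha."""
        if label == 1:
            return stats["sum_Y_X"] / (stats["N"] * a)
        return stats["sum_notY_X"] / ((n - stats["N"]) * a)

    print(p_hat, beta_given_alpha(3.0, 1), beta_given_alpha(2.0, 0))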

As a last example, we return to F_Θ = {uniform densities on rectangles of R^d}. Here the likelihood product once again has p^N (1 − p)^{n−N} as a factor, leading to p̂ = N/n. The other factor (if (a_1, b_1), (a_0, b_0) are the lower left, upper right vertices of the rectangles for θ_1, θ_0) is zero if for some i, Y_i = 1, X_i ∉ rectangle(a_1, b_1), or Y_i = 0, X_i ∉ rectangle(a_0, b_0). Otherwise, it is

(1/‖b_1 − a_1‖^N) · (1/‖b_0 − a_0‖^{n−N}),

where ‖b_1 − a_1‖ = ∏_{j=1}^d (b_1^{(j)} − a_1^{(j)}) denotes the volume of the rectangle (a_1, b_1), and similarly for ‖b_0 − a_0‖. This is maximal if ‖b_1 − a_1‖ and ‖b_0 − a_0‖ are minimal. Thus, the maximum likelihood estimates are

a_i^{(k)} = min_{j: Y_j = i} X_j^{(k)},   1 ≤ k ≤ d, i = 0, 1,

b_i^{(k)} = max_{j: Y_j = i} X_j^{(k)},   1 ≤ k ≤ d, i = 0, 1,

where a_i = (a_i^{(1)}, . . . , a_i^{(d)}), b_i = (b_i^{(1)}, . . . , b_i^{(d)}), and X_i = (X_i^{(1)}, . . . , X_i^{(d)}).
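Putting the pieces together, a minimal implementation of this rule (ours; the two-dimensional data, the handling of points outside both estimated rectangles, and the tie-breaking are arbitrary choices, and we assume both labels occur in the sample):

    import numpy as np

    def fit_rectangles(X, Y):
        """Maximum likelihood estimates: coordinatewise min/max per class, and p = N/n."""
        A = {}
        for label in (0, 1):
            pts = X[Y == label]
            A[label] = (pts.min(axis=0), pts.max(axis=0))   # (lower left, upper right)
        return A, Y.mean()

    def classify(x, A, p_hat):
        """Plug-in version of the Bayes rule for uniform densities on rectangles."""
        def inside(rect):
            lo, hi = rect
            return bool(np.all(x >= lo) and np.all(x <= hi))
        def volume(rect):
            lo, hi = rect
            return float(np.prod(hi - lo))
        in0, in1 = inside(A[0]), inside(A[1])
        if in1 and not in0:
            return 1
        if in0 and not in1:
            return 0
        if in0 and in1:
            return 1 if p_hat / volume(A[1]) > (1 - p_hat) / volume(A[0]) else 0
        return 0        # outside both estimated rectangles: an arbitrary convention

    rng = np.random.default_rng(1)
    n = 1000
    Y = rng.integers(0, 2, size=n)
    X = np.where(Y[:, None] == 1,
                 rng.uniform([0.5, 0.5], [2.0, 2.0], size=(n, 2)),
                 rng.uniform([0.0, 0.0], [1.0, 1.0], size=(n, 2)))
    A, p_hat = fit_rectangles(X, Y)
    print(classify(np.array([0.25, 0.25]), A, p_hat),   # inside the class-0 rectangle only
          classify(np.array([1.5, 1.5]), A, p_hat))     # inside the class-1 rectangle only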

Rates of convergence may be obtained via some of the (in)equalities of Chapter 6, such as

(1) E{L(g_n)} − L∗ ≤ 2E{ |η_n(X) − η(X)| } (where η_n = p̂ f_{θ̂_1} / (p̂ f_{θ̂_1} + (1 − p̂) f_{θ̂_0})),

(2) E{L(g_n)} − L∗ ≤ 2 √( E{(η_n(X) − η(X))²} ),

(3) E{L(g_n)} − L∗ = 2E{ |η(X) − 1/2| I_{g_n(X) ≠ g∗(X)} }.

The rate with which θ̂ approaches θ in Θ (measured with some metric) may be very different from that with which L(g_n) approaches L∗. As shown in Theorem 6.5, the inequality (2) is always loose, yet it is this inequality that is often used to derive rates of convergence by authors and researchers. Let us take a simple example to illustrate this point. Assume F_Θ is the


family of normal densities on the real line. If θ_0 = (m_0, σ_0), θ_1 = (m_1, σ_1) are the unknown means and standard deviations of the class-conditional densities, and p = P{Y = 1} is also unknown, then we may estimate p by p̂ = ∑_{j=1}^n Y_j / n, and m_i and σ_i by

m̂_i = ( ∑_{j=1}^n X_j I_{Y_j=i} ) / ( ∑_{j=1}^n I_{Y_j=i} )   and   σ̂_i² = ( ∑_{j=1}^n (X_j − m̂_i)² I_{Y_j=i} ) / ( ∑_{j=1}^n I_{Y_j=i} ),   i = 0, 1,

respectively, when denominators are positive. If a denominator is 0, set m̂_i = 0, σ̂_i² = 1. From Chebyshev's inequality, we can verify that for fixed p ∈ (0, 1), E{|p̂ − p|} = O(1/√n), E{|m̂_i − m_i|} = O(1/√n), E{|σ̂_i − σ_i|} = O(1/√n) (Problem 16.3).
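A compact sketch of this plug-in rule (ours; the true means, variances, and class probability below are arbitrary, and the fallback m̂_i = 0, σ̂_i² = 1 follows the convention above):

    import numpy as np

    def fit(X, Y):
        """Sample fraction, class means, and class standard deviations."""
        p_hat, params = Y.mean(), {}
        for i in (0, 1):
            Xi = X[Y == i]
            if len(Xi) == 0:
                params[i] = (0.0, 1.0)     # convention when a class is absent
            else:
                params[i] = (Xi.mean(), np.sqrt(np.mean((Xi - Xi.mean())**2)))
        return p_hat, params

    def g_n(x, p_hat, params):
        """Plug-in rule: 1 iff p_hat * f_hat_1(x) > (1 - p_hat) * f_hat_0(x)."""
        def f(x, m, s):
            return np.exp(-(x - m)**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
        (m0, s0), (m1, s1) = params[0], params[1]
        return int(p_hat * f(x, m1, s1) > (1 - p_hat) * f(x, m0, s0))

    rng = np.random.default_rng(3)
    n = 5000
    Y = (rng.uniform(size=n) < 0.3).astype(int)                    # p = 0.3
    X = np.where(Y == 1, rng.normal(2.0, 1.0, size=n),
                         rng.normal(0.0, 1.5, size=n))
    p_hat, params = fit(X, Y)
    print([g_n(x, p_hat, params) for x in (-1.0, 1.0, 3.0)])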

If we compute E{|η_n(X) − η(X)|}, we will discover that E{L(g_n)} − L∗ = O(1/√n). However, if we compute E{ |η(X) − 1/2| I_{g_n(X) ≠ g∗(X)} }, we will find that E{L(g_n)} − L∗ = O(1/n). Thus, while the parameters converge at a rate O(1/√n) dictated by the central limit theorem, and while η_n converges to η in L1(µ) with the same rate, the error rate in discrimination is much smaller. See Problems 16.7 to 16.9 for some practice in this respect.

Bibliographic remarks. McLachlan (1992) has a comprehensive treatment on parametric classification. Duda and Hart (1973) have many good introductory examples and a nice discussion on sufficient statistics, a topic we do not deal with in this text. For maximum likelihood estimation, see Hjort (1986a; 1986b). □

16.3 Minimum Distance Estimates

Here we describe a general parameter estimation principle that appears to be more suitable for plug-in classification rules than the maximum likelihood method. The estimated parameter is obtained by the projection of the empirical measure on the parametric family.

The principle of minimum distance estimation may be described as follows. Let P_Θ = {P_θ : θ ∈ Θ} be a parametric family of distributions, and assume that P_{θ∗} is the unknown distribution of the i.i.d. observations Z_1, . . . , Z_n. Denote by ν_n the empirical measure

ν_n(A) = (1/n) ∑_{i=1}^n I_{Z_i ∈ A},   A ⊂ R^d.

Let D(·, ·) be a metric on the set of all probability distributions on R^d. The minimum distance estimate of θ∗ is defined as

θ_n = arg min_{θ∈Θ} D(ν_n, P_θ),


if it exists and is unique. If it is not unique, select one candidate for which the minimum is attained.

Figure 16.2. The member of P_Θ closest to the empirical measure ν_n is chosen by the minimum distance estimate.

Consider for example the Kolmogorov-Smirnov distance

D_KS(P, Q) = sup_{z∈R^d} |F(z) − G(z)|,

where F and G are the distribution functions of the measures P and Q on R^d, respectively. It is easy to see that D_KS is a metric. Note that

sup_{z∈R^d} |F(z) − G(z)| = sup_{A∈A} |P(A) − Q(A)|,

where A is the class of sets of the form (−∞, z^{(1)}) × · · · × (−∞, z^{(d)}), for z = (z^{(1)}, . . . , z^{(d)}) ∈ R^d. For the Kolmogorov-Smirnov distance between the estimated and the true distributions, by the triangle inequality, we have

D_KS(P_{θ_n}, P_{θ∗}) ≤ D_KS(P_{θ_n}, ν_n) + D_KS(ν_n, P_{θ∗}) ≤ 2 D_KS(ν_n, P_{θ∗}),

where in the second inequality we used the definition of θ_n. Now notice that the upper bound is just twice the Kolmogorov-Smirnov distance between an empirical distribution and the true distribution. By a straightforward application of Theorem 12.5, for every n and ε > 0,

P{ D_KS(ν_n, P_{θ∗}) > ε } = P{ sup_{A∈A} |ν_n(A) − P_{θ∗}(A)| > ε } ≤ 8 (ne/d)^d e^{−nε²/32},

since the n-th shatter coefficient s(A, n) of the class of sets

A = { (−∞, z^{(1)}) × · · · × (−∞, z^{(d)}) : z = (z^{(1)}, . . . , z^{(d)}) ∈ R^d }

cannot exceed (ne/d)^d. This can easily be seen by an argument similar to that in the proof of Theorem 13.8. From the inequalities above, we see that the Kolmogorov-Smirnov distance between the estimated and the true distributions is always O(√(log n / n)). The only condition we require is that θ_n be well defined.
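A one-dimensional sketch of the principle (ours; the exponential model, the grid search over θ, and the sample size are arbitrary choices; on the real line the supremum over half-lines is attained at the sample points, which the code exploits):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    theta_star = 2.0
    Z = rng.exponential(scale=1.0 / theta_star, size=n)   # assumed model: P_theta = Exp(theta)

    def ks_distance(Z, theta):
        """Kolmogorov-Smirnov distance between the empirical measure of Z and Exp(theta)."""
        z = np.sort(Z)
        F = 1.0 - np.exp(-theta * z)                 # model distribution function at the data
        i = np.arange(1, len(z) + 1)
        return max(np.max(np.abs(i / len(z) - F)),
                   np.max(np.abs((i - 1) / len(z) - F)))

    # Minimum distance estimate by a crude grid search over theta:
    grid = np.linspace(0.1, 5.0, 491)
    theta_n = grid[np.argmin([ks_distance(Z, t) for t in grid])]
    print(theta_n)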

Of course, rather than the Kolmogorov-Smirnov distance, it is the error probability of the plug-in classification rule in which we are primarily interested. In order to make the connection, we adapt the notions of minimum distance estimation and Kolmogorov-Smirnov distance to better suit the classification problem we are after. Every parametric family of distributions defines a class of sets in R^d as the collection A of sets of the form {x : φ(x) = 1}, where the classifiers φ are the possible plug-in rules defined by the parametric family. The idea here is to perform minimum distance estimation with the generalized Kolmogorov-Smirnov distance with respect to a class of sets closely related to A.

Assume that both class-conditional distributions P_{θ_0}, P_{θ_1} belong to a parametric family P_Θ, and the class-conditional densities f_{θ_0}, f_{θ_1} exist. Then the Bayes rule may be written as

g∗(x) = 1 if α_θ(x) > 0, and 0 otherwise,

where α_θ(x) = p f_{θ_1}(x) − (1 − p) f_{θ_0}(x) and p = P{Y = 1}. We use the short notation θ = (p, θ_0, θ_1). The function α_θ may be thought of as the Radon-Nikodym derivative of the signed measure Q_θ = p P_{θ_1} − (1 − p) P_{θ_0}. In other words, to each Borel set A ⊂ R^d, Q_θ assigns the real number Q_θ(A) = p P_{θ_1}(A) − (1 − p) P_{θ_0}(A). Given the data D_n = ((X_1, Y_1), . . . , (X_n, Y_n)), we define the empirical counterpart of Q_θ by

ν_n(A) = (1/n) ∑_{i=1}^n (2Y_i − 1) I_{X_i ∈ A}.

The minimum distance classification rule we propose projects the empirical signed measure ν_n on the set of measures Q_θ. The metric we use is also specifically fitted to the given pattern recognition problem: define the class of sets

A = { {x ∈ R^d : α_θ(x) > 0} : p ∈ [0, 1], θ_0, θ_1 ∈ Θ }.

A is just the class of sets A ⊂ R^d such that I_{x∈A} is the Bayes rule for some θ = (p, θ_0, θ_1). Also introduce

B = { A ∩ B^c : A, B ∈ A }.

Given two signed measures Q, Q′, we define their generalized Kolmogorov-Smirnov distance by

D_B(Q, Q′) = sup_{A∈B} |Q(A) − Q′(A)|,


that is, instead of the class of half-infinite intervals as in the definition of the ordinary Kolmogorov-Smirnov distance, here we take the supremum over B, a class tailored to our discrimination problem. Now, we are ready to define our minimum distance estimate

θ̂ = arg min_θ D_B(Q_θ, ν_n),

where the minimum is taken over all triples θ = (p, θ_0, θ_1) with p ∈ [0, 1], θ_0, θ_1 ∈ Θ. The corresponding classification rule is

g_θ̂(x) = 1 if α_θ̂(x) > 0, and 0 otherwise.

The next theorem shows that if the parametric assumption is valid, then g_θ̂ performs extremely well. The theorem shows that if A has finite vc dimension V_A, then without any additional conditions imposed on the parametric class, the corresponding error probability L(g_θ̂) is not large:

L(g_θ̂) − L∗ = O( √(V_A log n / n) ).

Theorem 16.2. Assume that both conditional densities f_{θ_0}, f_{θ_1} are in the parametric class F_Θ. Then for the classification rule defined above, we have for every n and ε > 0 that

P{ L(g_θ̂) − L∗ > ε } ≤ 8 s(B, n) e^{−nε²/512},

where s(B, n) is the n-th shatter coefficient of B. Furthermore,

E{L(g_θ̂)} − L∗ ≤ 32 √( log(8e s(B, n)) / (2n) ).

Remark. Recalling from Chapter 13 that s(B, n) ≤ s²(A, n) and that s(A, n) ≤ (n + 1)^{V_A}, where V_A is the vc dimension of A, we obtain the bound

P{ L(g_θ̂) − L∗ > ε } ≤ 8 (n + 1)^{2V_A} e^{−nε²/512}. □

The proof of the theorem is based upon two key observations. The first lemma provides a bound on L(g_θ̂) − L∗ in terms of the generalized Kolmogorov-Smirnov distance between the estimated and the true parameters. The second lemma is a straightforward extension of the Vapnik-Chervonenkis inequality to signed measures.

Lemma 16.1.

L(g_θ̂) − L∗ ≤ 2 D_B(Q_θ̂, Q_θ).


Proof. Denote

A_θ = {x : α_θ(x) > 0}   and   A_θ̂ = {x : α_θ̂(x) > 0},

that is, g_θ̂(x) = I_{α_θ̂(x)>0}, g∗(x) = I_{α_θ(x)>0}, and A_θ, A_θ̂ ∈ A. At the first crucial step, we use the equality of Theorem 2.2:

L(g_θ̂) − L∗ = ∫ I_{g_θ̂(x) ≠ g∗(x)} |α_θ(x)| dx

= −∫_{A_θ̂ ∩ A_θ^c} α_θ(x) dx + ∫_{A_θ̂^c ∩ A_θ} α_θ(x) dx

≤ ∫_{A_θ̂ ∩ A_θ^c} α_θ̂(x) dx − ∫_{A_θ̂ ∩ A_θ^c} α_θ(x) dx + ∫_{A_θ̂^c ∩ A_θ} α_θ(x) dx − ∫_{A_θ̂^c ∩ A_θ} α_θ̂(x) dx

= Q_θ̂(A_θ̂ ∩ A_θ^c) − Q_θ(A_θ̂ ∩ A_θ^c) + Q_θ(A_θ̂^c ∩ A_θ) − Q_θ̂(A_θ̂^c ∩ A_θ)

≤ 2 D_B(Q_θ̂, Q_θ),

since both A_θ̂ ∩ A_θ^c and A_θ̂^c ∩ A_θ are members of B. □

The next lemma is an extension of the Vapnik-Chervonenkis inequality. The proof is left to the reader (Problem 16.11).

Lemma 16.2. For every n and ε > 0,

P{ D_B(Q_θ, ν_n) > ε } ≤ 8 s(B, n) e^{−nε²/32}.

The rest of the proof of Theorem 16.2 shows that the generalized Kolmogorov-Smirnov distance with respect to B between the estimated and true distributions is small with large probability. This can be done as we proved for the Kolmogorov-Smirnov distance at the beginning of the section.

Proof of Theorem 16.2. By Lemma 16.1,

L(g_θ̂) − L∗ ≤ 2 D_B(Q_θ̂, Q_θ)
≤ 2 D_B(Q_θ̂, ν_n) + 2 D_B(Q_θ, ν_n)   (by the triangle inequality)
≤ 4 D_B(Q_θ, ν_n)   (by the definition of θ̂).

The theorem now follows by Lemma 16.2. □

Finally, we examine robustness of the minimum distance rule against modeling errors, that is, what happens if the distributions are not in P_Θ. A good rule should still work reasonably well if the distributions are in some sense close to the modeled parametric class P_Θ. Observe that if for some θ = (p, θ_0, θ_1) the Bayes rule can be written as

g∗(x) = 1 if α_θ(x) > 0, and 0 otherwise,

then Lemma 16.1 remains valid even when the class-conditional distributions are not in P_Θ. Denote the true class-conditional distributions by P∗_0, P∗_1, let p∗ = P{Y = 1}, and introduce Q∗ = p∗ P∗_1 − (1 − p∗) P∗_0. Thus,

L(g_θ̂) − L∗
≤ 2 D_B(Q∗, Q_θ̂)   (by Lemma 16.1)
≤ 2 D_B(Q∗, ν_n) + 2 D_B(ν_n, Q_θ̂)   (by the triangle inequality)
≤ 2 D_B(Q∗, ν_n) + 2 D_B(ν_n, Q_θ)   (by the definition of θ̂)
≤ 4 D_B(Q∗, ν_n) + 2 D_B(Q∗, Q_θ)   (again by the triangle inequality).

Lemma 16.2 now applies to the first term on the right-hand side. Thus, we conclude that if the Bayes rule is in A, then for all ε ≥ 4 inf_{Q_θ} D_B(Q∗, Q_θ),

P{ L(g_θ̂) − L∗ > ε } ≤ 8 (n + 1)^{2V_A} e^{−nε²/2^{11}}.

The constant in the exponent may be improved significantly by more careful analysis. In other words, if the Bayes rule is in A and the true distribution is close to the parametric family in the generalized Kolmogorov-Smirnov distance specified above, then the minimum distance rule still performs close to the Bayes error. Unfortunately, we cannot say the same if A does not contain the Bayes rule. Empirical error minimization, discussed in the next section, is however very robust in all situations.

16.4 Empirical Error Minimization

In this section we explore the connection between parametric classification and rule selection by minimizing the empirical error, studied in Chapter 12.

Consider the class C of classifiers of the form

φ(x) = 0 if (1 − p) f_{θ_0}(x) ≥ p f_{θ_1}(x), and 1 otherwise,

where p ∈ [0, 1], and θ_0, θ_1 ∈ Θ. The parametric assumption means that the Bayes rule is contained in C. Then it is a very natural approach to minimize the empirical error probability

L_n(φ) = (1/n) ∑_{i=1}^n I_{φ(X_i) ≠ Y_i},


measured on the training data D_n, over classifiers φ in the class C. Denote the empirically selected rule (i.e., the one minimizing L_n(φ)) by φ∗_n. For most typical parametric classes Θ, the vc dimension V_C is finite. Therefore, as a straightforward consequence of Theorem 12.6, we have

Corollary 16.1. If both conditional distributions are contained in the parametric family P_Θ, then for the error probability L(φ∗_n) = P{φ∗_n(X) ≠ Y | D_n} of the empirically optimal rule φ∗_n, we have for every n and ε > 0,

P{ L(φ∗_n) − L∗ > ε } ≤ 8 S(C, n) e^{−nε²/128}.

The result above means that an O(√(log n / n)) rate of convergence to the Bayes rule is guaranteed for the empirically optimal rule, whenever the vc dimension V_C is finite. This is the case, for example, for exponential families. If P_Θ is an exponential family, with densities of the form

f_θ(x) = c α(θ) β(x) exp( ∑_{i=1}^k π_i(θ) ψ_i(x) ),

then by results from Chapter 13, S(C, n) ≤ (n + 1)^{k+1}. Observe that in this approach, nothing but properties of the class C are used to derive the O(√(log n / n)) rate of convergence.
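As a toy illustration (ours; the unit-variance univariate gaussian model, the crude parameter grid, and the data-generating distribution are assumptions made for the example), empirical error minimization over the class C can be carried out by brute force:

    import numpy as np
    from itertools import product

    def phi(x, p, m0, m1):
        """Classifier in C: compare p*f_{m1}(x) with (1-p)*f_{m0}(x), unit-variance gaussians."""
        f0 = np.exp(-(x - m0)**2 / 2)
        f1 = np.exp(-(x - m1)**2 / 2)
        return (p * f1 > (1 - p) * f0).astype(int)

    def empirical_error(X, Y, p, m0, m1):
        return np.mean(phi(X, p, m0, m1) != Y)

    rng = np.random.default_rng(5)
    n = 1000
    Y = (rng.uniform(size=n) < 0.5).astype(int)
    X = np.where(Y == 1, rng.normal(1.5, 1.0, size=n), rng.normal(0.0, 1.0, size=n))

    grid_p = np.linspace(0.05, 0.95, 19)
    grid_m = np.linspace(-3.0, 3.0, 31)
    best = min(product(grid_p, grid_m, grid_m),
               key=lambda t: empirical_error(X, Y, *t))
    print("selected (p, m0, m1):", best, " empirical error:", empirical_error(X, Y, *best))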

Remark. Robustness. The method of empirical minimization is clearly extremely robust against errors in the parametric model. Obviously (see Theorem 12.6), if the true conditional distributions are not contained in the class P_Θ, then L∗ can be replaced in Corollary 16.1 by inf_{φ∈C} P{φ(X) ≠ Y}. If the model is fairly good, then this number should be very close to the Bayes risk L∗. □

Remark. Lower bounds. The results of Chapter 14 imply that for some distributions the error probability of the selected rule is about O(1/√n) away from the Bayes risk. In the parametric case, however, since the class of distributions is restricted by the parametric model, this is not necessarily true. In some cases, a much faster rate of convergence is possible than the O(√(log n / n)) rate guaranteed by Corollary 16.1. See, for example, Problem 16.3, where an example is given in which the error rate is O(1/n). □

Problems and Exercises

Problem 16.1. Show that if both conditional distributions of X, given Y = 0 and Y = 1, are gaussian, then the Bayes decision is quadratic, that is, it can be written as

g∗(x) = 0 if ∑_{i=1}^{d+d(d+1)/2} a_i ψ_i(x) + a_0 ≥ 0, and 1 otherwise,

where the functions ψ_i(x) are either of the form x^{(i)} (the i-th component of the vector x), or x^{(i)} x^{(j)}, and a_0, . . . , a_{d+d(d+1)/2} ∈ R.

Problem 16.2. Let f be the normal density on the real line with mean m and variance σ², and we draw an i.i.d. sample X_1, . . . , X_n from f, and set

m̂ = (1/n) ∑_{i=1}^n X_i   and   σ̂² = (1/n) ∑_{i=1}^n (X_i − m̂)².

Show that E{|m − m̂|} = O(1/√n) and E{|σ − σ̂|} = O(1/√n) by using Chebyshev's inequality. Show that this rate is in fact tight. Prove also that the result remains true if n is replaced by N, a binomial (n, p) random variable independent of X_1, . . . , X_n, where p ∈ (0, 1). That is, m̂ becomes (1/N) ∑_{i=1}^N X_i if N > 0 and 0 otherwise, and similarly for σ̂.

Problem 16.3. Assume that p = 1/2, and that both class-conditional densities f_0 and f_1 are gaussian on R with unit variance, but different means. We use the maximum likelihood estimates m̂_0, m̂_1 of the conditional means m_0 = E{X|Y = 0} and m_1 = E{X|Y = 1} to obtain the plug-in classifier g_θ̂. Show that E{(m̂_0 − m_0)²} = O(1/n). Then go on to show that E{L(g_θ̂)} − L∗ ≤ O(1/n).

Problem 16.4. Show that the following classes of densities on R constitute exponential families:

(1) gaussian family: { e^{−(x−m)²/(2σ²)} / (√(2π) σ) : m ∈ R, σ > 0 };

(2) gamma family: { (x^{α−1} e^{−x/β} / (β^α Γ(α))) I_{x>0} : α, β > 0 };

(3) beta family: { (Γ(α+β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} I_{x∈[0,1]} : α, β > 0 };

(4) Rayleigh family: { θ^λ β^{1−λ} e^{−β²/(2θ)} x^λ e^{−θx²/2} ∑_{j=0}^∞ (1/(j! Γ(j+λ))) (x/2)^{2j+λ−1} : θ, β > 0, λ ≥ 0 }.

Problem 16.5. This exercise shows that one-parameter classes may be incredibly rich. Let C be the class of rules of the form

g_θ(x) = 1 if x ∈ A + θ, and 0 if x ∉ A + θ,


where x ∈ R, θ ∈ R is a parameter, and A is a union of intervals on the real line. Equivalently, g_θ = I_{A+θ}. Let

A = ⋃_{i: b_i=1} [i − 1/2, i + 1/2),

where b_1, b_2, . . . are bits, obtained as follows: first list all sequences of length 1, then those of length 2, et cetera, so that (b_1, b_2, . . .) is a concatenation of (1, 0), (1, 1, 1, 0, 0, 1, 0, 0), et cetera.

(1) Show that for all n, there exists a set {x_1, . . . , x_n} ⊂ R that can be shattered by sets of the form A + θ, θ ∈ R. Conclude that the vc dimension of C is infinite.

(2) If we use C for empirical error minimization and L∗ = 0, what can you say about E{L_n}, the probability of error of the selected rule?

Problem 16.6. Continuation. Let X be uniform on θ + [i − 1/2, i + 1/2) with probability 1/2^i, i ≥ 1. Set Y = b_i if X ∈ θ + [i − 1/2, i + 1/2), so that L∗ = 0.

(1) Derive the class of Bayes rules.
(2) Work out the details of the generalized Kolmogorov-Smirnov distance minimizer based on the class of (1).
(3) Provide the best upper bound on E{L_n} you can get with any method.
(4) Regardless of discrimination, how would you estimate θ? Derive the asymptotic behavior of E{L_n} for the plug-in rule based on your estimate of θ.

Problem 16.7. If F_Θ = {all uniform densities on rectangles of R^d} and if we use the maximum likelihood estimates of p, θ_0, θ_1 derived in the text, show that E{L_n} − L∗ = O(1/n).

Problem 16.8. Continued. Assume that the true class-conditional densities are f_0, f_1, and f_0, f_1 ∉ F_Θ. With the same maximum likelihood estimates given above, find f_0, f_1 for which E{L_n} → 1, yet L∗ = 0.

Problem 16.9. Continued. Show that the O(1/n) rate above can, in general, not be improved.

Problem 16.10. Show that

P{ D_KS(ν_n, P_{θ∗}) > ε } ≤ 4 n^d e^{−2(n−d)ε²}

by noting that the supremum in the definition of D_KS may be replaced by a maximum over n^d carefully selected points of R^d. Hint: Use the idea of fingering from Chapters 4 and 12.

Problem 16.11. Prove Lemma 16.2 by following the line of the proof of the Vapnik-Chervonenkis inequality (Theorem 12.5). Hint: In the second symmetrization step observe that

sup_{A∈B} | (1/n) ∑_{i=1}^n σ_i (2Y_i − 1) I_{X_i∈A} |


has the same distribution as that of

sup_{A∈B} | (1/n) ∑_{i=1}^n σ_i I_{X_i∈A} |.

Problem 16.12. Minimum distance density estimation. Let F_Θ = {f_θ : θ ∈ Θ} be a parametric class of densities on R^d, and assume that the i.i.d. sample X_1, . . . , X_n is drawn from the density f_θ ∈ F_Θ. Define the class of sets

A = { {x ∈ R^d : f_{θ_1}(x) > f_{θ_2}(x)} : θ_1, θ_2 ∈ Θ },

and define the minimum distance estimate of θ by

θ̂ = arg min_{θ′∈Θ} D_A(P_{θ′}, µ_n),

where P_{θ′} is the distribution belonging to the density f_{θ′}, µ_n is the empirical distribution defined by X_1, . . . , X_n, and D_A is the generalized Kolmogorov-Smirnov distance defined by

D_A(P, Q) = sup_{A∈A} |P(A) − Q(A)|.

Prove that if A has finite vc dimension, then

E{ ∫ |f_θ̂(x) − f_θ(x)| dx } = O(1/√n).

For more information on minimum distance density estimation, see Yatracos (1985) and Devroye (1987). Hint: Follow the steps listed below:

(1) ∫ |f_θ̂ − f_θ| ≤ 2 D_A(P_θ̂, P_θ) (use Scheffé's theorem).
(2) D_A(P_θ̂, P_θ) ≤ 2 D_A(P_θ, µ_n) (proceed as in the text).
(3) Finish the proof by applying Alexander's improvement of the Vapnik-Chervonenkis inequality (Theorem 12.10).


17 Generalized Linear Discrimination

The classifiers we study here have their roots in the Fourier series estimate or other series estimates of an unknown density, potential function methods (see Chapter 10), and generalized linear classifiers. All these estimators can be put into the following form: classify x as belonging to class 0 if

∑_{j=1}^k a_{n,j} ψ_j(x) ≤ 0,

where the ψ_j's are fixed functions, forming a base for the series estimate, a_{n,j} is a fixed function of the training data, and k controls the amount of smoothing. When the ψ_j's are the usual trigonometric basis, then this leads to the Fourier series classifier studied by Greblicki and Pawlak (1981; 1982). When the ψ_j's form an orthonormal system based upon Hermite polynomials, we obtain the classifiers studied by Greblicki (1981), and Greblicki and Pawlak (1983; 1985). When ψ_j(x) is the collection of all products of components of x (such as 1, (x^{(i)})^k, (x^{(i)})^k (x^{(j)})^l, et cetera), we obtain the polynomial method of Specht (1971).

We start with classification based on Fourier series expansion, which has its origins in Fourier series density estimation, which, in turn, was developed by Cencov (1962), Schwartz (1967), Kronmal and Tarter (1968), Tarter and Kronmal (1970), and Specht (1971). Its use in classification was considered by Greblicki (1981), Specht (1971), Greblicki and Pawlak (1981; 1982; 1983), and others.


17.1 Fourier Series Classification

Let f be the density of X, and assume that f ∈ L2(λ), that is, f ≥ 0, ∫ f(x) dx = 1 and ∫ f²(x) dx < ∞, and recall that λ denotes the Lebesgue measure on R^d. Let ψ_1, ψ_2, . . . be a complete orthonormal set of bounded functions in L2(λ) with sup_{j,x} |ψ_j(x)| ≤ B < ∞. An orthonormal set of functions ψ_j in L2(λ) is such that ∫ ψ_j ψ_i = I_{i=j} for all i, j. It is complete if ∫ αψ_i = 0 for all i for some function α ∈ L2(λ) implies that α ≡ 0 almost everywhere (with respect to λ). If f ∈ L2(λ), then the class-conditional densities f_0 and f_1 exist, and are in L2(λ). Then the function

α(x) = p f_1(x) − (1 − p) f_0(x) = (2η(x) − 1) f(x)

is in L2(λ) as well. Recall that the Bayes decision is given by

g∗(x) = 0 if α(x) ≤ 0, and 1 otherwise.

Let the bounded functions ψ_1, ψ_2, . . . form a complete orthonormal basis, and let the Fourier representation of α be given by

α = ∑_{j=1}^∞ α_j ψ_j.

The above convergence is understood in L2(λ), that is, the infinite sum means that

lim_{k→∞} ∫ ( α(x) − ∑_{j=1}^k α_j ψ_j(x) )² dx = 0.

The coefficients are given by

α_j = ∫ ψ_j(x) α(x) dx.

We approximate α(x) by a truncation of its Fourier representation to finitely many terms, and use the data D_n to estimate the coefficients appearing in this sum. Formally, consider the classification rule

g_n(x) = 0 if ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) ≤ 0, and 1 otherwise,     (17.1)

where the estimates α_{n,j} of α_j are very easy to compute:

α_{n,j} = (1/n) ∑_{i=1}^n (2Y_i − 1) ψ_j(X_i).


Before discussing the properties of these rules, we list a few examples of complete orthonormal systems on the real line. Some of these systems contain functions on a bounded interval. These, of course, can only be used if the distribution of X is concentrated on an interval. The completeness of these systems may typically be checked via the Stone-Weierstrass theorem (Theorem A.9). A general way of producing complete orthonormal systems on R^d is taking products of one-dimensional functions of the d components, as sketched in Problem 17.1. For more information on orthogonal series, we refer to Sansone (1969), Szegő (1959), and Zygmund (1959).

(1) The standard trigonometric basis on the interval [−π, π] is formed by the functions

ψ_0(x) = 1/√(2π),   ψ_{2i−1}(x) = cos(ix)/√π,   ψ_{2i}(x) = sin(ix)/√π,   i = 1, 2, . . . .

This is a complete orthonormal system in L2([−π, π]).

(2) The Legendre polynomials form a complete orthonormal basis over the interval [−1, 1]:

ψ_i(x) = √((2i + 1)/2) · (1/(2^i i!)) · (d^i/dx^i) ((x² − 1)^i),   i = 0, 1, 2, . . . .

(3) The set of Hermite functions is complete and orthonormal over the whole real line:

ψ_i(x) = ( e^{x²/2} / √(2^i i! √π) ) · (d^i/dx^i) (e^{−x²}),   i = 0, 1, 2, . . . .

(4) Functions of the Laguerre basis are defined on [0, ∞) by

ψ_i(x) = ( (Γ(i + 1)/Γ(i + α + 1)) x^{−α} e^x )^{1/2} · (1/i!) · (d^i/dx^i) (x^{i+α} e^{−x}),   i = 0, 1, 2, . . . ,

where α > −1 is a parameter of the system. A complete orthonormal system over the whole real line may be obtained by

ψ′_{2i}(x) = ψ_i(x) I_{x≥0},   ψ′_{2i+1}(x) = ψ_i(−x) I_{x<0},   i = 0, 1, 2, . . . .

(5) The Haar basis contains functions on [0, 1] that take three different values. It is orthonormal and complete. Functions with finitely many values have computational advantages in pattern recognition. Write the integer i ≥ 0 as i = 2^k + j, where k = ⌊log_2 i⌋ (i.e., 0 ≤ j < 2^k). Then

ψ_i(x) =
  2^{k/2} if x ∈ ((j − 1)/2^k, (j − 1/2)/2^k),
  −2^{k/2} if x ∈ ((j − 1/2)/2^k, j/2^k),
  0 otherwise.


(6) Functions on [0, 1] of the Rademacher basis take only two values, −1 and 1. It is an orthonormal system, but it is not complete:

ψ_0(x) ≡ 1,   ψ_i(x) = (−1)^{⌊2^i x⌋},   i = 1, 2, . . . .

The basis may be completed as follows: write i = ∑_{j=1}^k 2^{i_j} such that 0 ≤ i_1 < i_2 < · · · < i_k. This form is unique. Define

ψ′_i(x) = ψ_{i_1}(x) · · · ψ_{i_k}(x),

where the ψ_j's are the Rademacher functions. The resulting basis ψ′_0, ψ′_1, . . . is orthonormal and complete, and is called the Walsh basis.
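To illustrate the rule (17.1) with the trigonometric basis of item (1), here is a compact sketch (ours; the choice k_n = 11 and the data-generating distribution on [−π, π] are arbitrary assumptions):

    import numpy as np

    def trig_basis(x, k):
        """First k functions of the standard trigonometric basis on [-pi, pi]."""
        funcs = [np.full_like(x, 1.0 / np.sqrt(2 * np.pi))]
        i = 1
        while len(funcs) < k:
            funcs.append(np.cos(i * x) / np.sqrt(np.pi))
            if len(funcs) < k:
                funcs.append(np.sin(i * x) / np.sqrt(np.pi))
            i += 1
        return np.stack(funcs, axis=-1)                 # shape (..., k)

    def fit(X, Y, k):
        """Coefficient estimates alpha_{n,j} = (1/n) sum_i (2 Y_i - 1) psi_j(X_i)."""
        return np.mean((2 * Y - 1)[:, None] * trig_basis(X, k), axis=0)

    def g_n(x, alpha):
        """Rule (17.1): class 1 iff the truncated expansion is positive at x."""
        return int(np.dot(trig_basis(np.asarray(x, dtype=float), len(alpha)), alpha) > 0)

    rng = np.random.default_rng(6)
    n, k_n = 2000, 11
    X = rng.uniform(-np.pi, np.pi, size=n)
    Y = (rng.uniform(size=n) < 0.2 + 0.6 * (X > 0)).astype(int)   # eta larger on (0, pi]
    alpha = fit(X, Y, k_n)
    print([g_n(x, alpha) for x in (-2.0, -0.5, 0.5, 2.0)])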

There is no system of basis functions that is better than another for all distributions. In selecting the basis of a Fourier series rule, the designer must use a mixture of intuition, error estimation, and computational concerns. We have the following consistency theorem:

Theorem 17.1. Let ψ_0, ψ_1, . . . be a complete orthonormal basis on R^d such that for some B < ∞, |ψ_i(x)| ≤ B for each i and x. Assume that the class-conditional densities f_0 and f_1 exist and are in L2(λ). If

k_n → ∞ and k_n/n → 0 as n → ∞,

then the Fourier series classification rule defined in (17.1) is consistent:

lim_{n→∞} E{L(g_n)} = L∗.

If

k_n → ∞ and (k_n log n)/n → 0 as n → ∞,

then

lim_{n→∞} L(g_n) = L∗

with probability one, that is, the rule is strongly consistent.

Proof. First observe that the α_{n,j}'s are unbiased estimates of the α_j's:

E{α_{n,j}} = E{(2Y − 1) ψ_j(X)} = E{ E{(2Y − 1) ψ_j(X) | X} }
= E{ ψ_j(X) E{(2Y − 1) | X} } = E{(2η(X) − 1) ψ_j(X)}
= ∫ (2η(x) − 1) ψ_j(x) f(x) dx = ∫ ψ_j(x) α(x) dx
= α_j,


and that their variance can be bounded from above as follows:

Var{α_{n,j}} = Var{ψ_j(X_1)(2η(X_1) − 1)} / n
= ( ∫ ψ_j²(x)(2η(x) − 1)² f(x) dx − α_j² ) / n
≤ (B² − α_j²)/n ≤ B²/n,

where we used the boundedness of the ψ_j's. By Parseval's identity,

∫ α²(x) dx = ∑_{j=1}^∞ α_j².

Therefore, exploiting orthonormality of the ψ_j's, we have

∫ ( α(x) − ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) )² dx

= ∫ α²(x) dx + ∫ ( ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) )² dx − 2 ∫ α(x) ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) dx

= ∑_{j=1}^∞ α_j² + ∑_{j=1}^{k_n} α_{n,j}² − 2 ∑_{j=1}^{k_n} α_j α_{n,j}

= ∑_{j=1}^{k_n} (α_{n,j} − α_j)² + ∑_{j=k_n+1}^∞ α_j².

Thus, the expected L2-error is bounded as follows:

E{ ∫ ( α(x) − ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) )² dx }

= E{ ∑_{j=1}^{k_n} (α_{n,j} − α_j)² } + ∑_{j=k_n+1}^∞ α_j²

= ∑_{j=1}^{k_n} Var{α_{n,j}} + ∑_{j=k_n+1}^∞ α_j²

≤ k_n B²/n + ∑_{j=k_n+1}^∞ α_j².

Since ∑_{j=1}^∞ α_j² < ∞, the second term tends to zero if k_n → ∞. If at the same time k_n/n → 0, then the expected L2-error converges to zero, that is, the estimate is consistent in L2. Now, convergence of the error probability follows from Problem 2.11.

To prove strong consistency (i.e., convergence with probability one), fix ε > 0, and assume that n is so large that

∑_{j=k_n+1}^∞ α_j² < ε/2.

Then

P{ ∫ ( α(x) − ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) )² dx > ε }

= P{ ∑_{j=1}^{k_n} (α_{n,j} − α_j)² + ∑_{j=k_n+1}^∞ α_j² > ε }

≤ P{ ∑_{j=1}^{k_n} (α_{n,j} − α_j)² > ε/2 }

≤ ∑_{j=1}^{k_n} P{ (α_{n,j} − α_j)² > ε/(2k_n) }   (by the union bound)

= ∑_{j=1}^{k_n} P{ | (1/n) ∑_{i=1}^n ( ψ_j(X_i)(2Y_i − 1) − E{ψ_j(X)(2Y − 1)} ) | > √(ε/(2k_n)) }

≤ 2 k_n e^{−nε/(4 k_n B²)},

where we used Hoeffding’s inequality (Theorem 8.1). Because kn log n/n→0, the upper bound is eventually smaller than e−2 logn = n−2, which issummable. The Borel-Cantelli lemma yields strong L2 consistency. Strongconsistency of the classifier then follows from Problem 2.11. 2


Remark. It is clear from the inequality

E{ ∫ ( α(x) − ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) )² dx } ≤ k_n B²/n + ∑_{j=k_n+1}^∞ α_j²

that the rate of convergence is determined by the choice of k_n. If k_n is small, then the first term, which corresponds to the estimation error, is small, but the approximation error, expressed by the second term, is larger. For large k_n, the situation is reversed. The optimal choice of k_n depends on how fast the second term goes to zero as k_n grows. This depends on other properties of f, such as the size of its tail and its smoothness. □

Remark. Consistency in Theorem 17.1 is not universal, since we needed to assume the existence of square integrable conditional densities. This, however, is not a restrictive assumption in some practical situations. For example, if the observations are corrupted by additive gaussian noise, then the conditions of the theorem hold (Problem 17.2). However, if µ does not have a density, the method may be inconsistent (see Problem 17.4). □

Fourier series classifiers have two rather unattractive features in general:

(i) They are not invariant under translations of the coordinate space.

(ii) Most series classifiers are not local in nature—points at arbitrary distances from x affect the decision at x. In kernel and nearest neighbor rules, the locality is easily controlled.

17.2 Generalized Linear Classification

When X is purely atomic or singular continuous, Theorem 17.1 is not applicable. A theme of this book is that pattern recognition can be developed in a distribution-free manner since, after all, the distribution of (X, Y) is not known. Besides, even if we had an i.i.d. sample (X_1, Y_1), . . . , (X_n, Y_n) at our disposal, we do not know of any test for verifying whether X has a square integrable density. So, we proceed a bit differently to develop universally consistent rules. To generalize Fourier series classifiers, let ψ_1, ψ_2, . . . be bounded functions on R^d. These functions do not necessarily form an orthonormal basis of L2. Consider the classifier

g_n(x) = 0 if ∑_{j=1}^{k_n} α_{n,j} ψ_j(x) ≤ 0, and 1 otherwise,

where the coefficients α_{n,j} are some functions of the data D_n. This may be viewed as a generalization of Fourier series rules, where the coefficients were unbiased estimates of the Fourier coefficients of α(x) = (2η(x) − 1)f(x). Here we will consider some other choices of the coefficients. Observe that g_n is a generalized linear classifier, as defined in Chapter 13. An intuitively appealing way to determine the coefficients is to pick them to minimize the empirical error committed on D_n, as in empirical risk minimization. The critical parameter here is k_n, the number of basis functions used in the linear combination. If we keep k_n fixed, then as we saw in Chapter 13, the error probability of the selected rule converges quickly to that of the best classifier of the above form. However, for some distributions, this infimum may be far from the Bayes risk, so it is useful to let k_n grow as n becomes larger. However, choosing k_n too large may result in overfitting the data. Using the terminology introduced in Chapter 12, let C(k_n) be the class of classifiers of the form

φ(x) = 0 if ∑_{j=1}^{k_n} a_j ψ_j(x) ≤ 0, and 1 otherwise,

where the aj ’s range through R. Choose the coefficients to minimize theempirical error

Ln(φ) =1n

n∑i=1

Iφ(Xi) 6=Yi.

Let g_n = φ∗_n be the corresponding classifier. We recall from Chapter 13 that the vc dimension of C(k_n) is not more than k_n. Therefore, by Theorem 13.12, for every n > 2k_n + 1,

P{ L(g_n) − inf_{φ∈C(k_n)} L(φ) > ε } ≤ 8 e^{nH(k_n/n)} e^{−nε²/128},

where H is the binary entropy function. The right-hand side is e^{−nε²/128 + o(n)} if k_n = o(n). However, to obtain consistency, we need to know how close inf_{φ∈C(k_n)} L(φ) is to L∗. This obviously depends on the choice of the ψ_j's, as well as on the distribution. If k_n is not allowed to grow with n, and is bounded by a number K, then universal consistency eludes us, as for some distribution inf_{φ∈C(K)} L(φ) − L∗ > 0. It follows from Theorem 2.2 that for every B > 0,

inf_{φ∈C(k_n)} L(φ) − L∗ ≤ inf_{a_1,...,a_{k_n}} ∫_{‖x‖≤B} | (2η(x) − 1) − ∑_{j=1}^{k_n} a_j ψ_j(x) | µ(dx) + ∫_{‖x‖>B} µ(dx).

The limit supremum of the right-hand side can be arbitrarily close to zero for all distributions if k_n → ∞ as n → ∞ and the set of functions

⋃_{k=1}^∞ { ∑_{j=1}^k a_j ψ_j(x) : a_1, a_2, . . . ∈ R }


is dense in L1(µ) on balls of the form {x : ‖x‖ ≤ B} for all µ. This means that given any probability measure µ, and function f with ∫ |f| dµ < ∞, for every ε > 0, there exists an integer k and coefficients a_1, . . . , a_k such that

∫ | f(x) − ∑_{j=1}^k a_j ψ_j(x) | µ(dx) < ε.

Thus, we have obtained the following consistency result:

Theorem 17.2. Let ψ_1, ψ_2, . . . be a sequence of functions such that the set of all finite linear combinations of the ψ_j's is dense in L1(µ) on balls of the form {x : ‖x‖ ≤ B} for any probability measure µ. Then g_n is strongly universally consistent when

k_n → ∞ and k_n/n → 0 as n → ∞.

Remark. To see that the statement of Theorem 17.2 is not vacuous, note that denseness in L1(µ) on balls follows from denseness with respect to the supremum norm on balls. Denseness in the L∞ sense may be checked by the Stone-Weierstrass theorem (Theorem A.9). For example, the class of all polynomial classifiers satisfies the conditions of the theorem. This class can be obtained by choosing the functions ψ_1, ψ_2, . . . as the simple polynomials x^{(1)}, . . . , x^{(d)}, x^{(1)}x^{(2)}, . . .. Similarly, the functions ψ_j can be chosen as bases of a trigonometric series. □

Remark. The histogram rule. It is clear that Theorem 17.2 can be modified in the following way: let ψ_{j,k}, j = 1, . . . , k; k = 1, 2, . . ., be functions such that for every f ∈ L1(µ) (with µ concentrated on a bounded ball) and ε > 0 there is a k_0 such that for all k > k_0 there is a function of the form ∑_{j=1}^k a_j ψ_{j,k} whose distance from f in L1(µ) is smaller than ε. Let C(k_n) be the class of generalized linear classifiers based on the functions ψ_{1,k_n}, . . . , ψ_{k_n,k_n}. If k_n → ∞ and k_n/n → 0, then the classifier g_n that minimizes the empirical error over C(k_n) is strongly universally consistent. This modification has an interesting implication. Consider for example functions ψ_{1,k_n}, . . . , ψ_{k_n,k_n} that are indicators of sets of a partition of R^d. Then it is easy to see that the classifier that minimizes the empirical error is just the histogram classifier based on the same partition. The denseness assumption requires that the partition becomes infinitesimally fine as n → ∞. In fact, we have obtained an alternative proof for the strong universal consistency of the regular histogram rule (Theorem 9.4). The details are left as an exercise (Problem 17.3). □
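A tiny sketch of this observation (ours; the cubic partition of [0, 1)² and the cell width are arbitrary choices): when the ψ_{j,k} are indicators of the cells of a partition, the empirical-error minimizer is the majority vote within each cell, i.e., the histogram classifier.

    import numpy as np

    def histogram_classifier(X, Y, h):
        """Majority vote over the cells of a cubic partition with side h; this is the
        empirical-error minimizer over classifiers built on cell indicators."""
        cells = {}
        for x, y in zip(X, Y):
            key = tuple((x // h).astype(int))
            votes = cells.setdefault(key, [0, 0])
            votes[int(y)] += 1
        def predict(x):
            votes = cells.get(tuple((np.asarray(x) // h).astype(int)), [0, 0])
            return int(votes[1] > votes[0])
        return predict

    rng = np.random.default_rng(7)
    n = 3000
    X = rng.uniform(size=(n, 2))
    Y = (rng.uniform(size=n) < 0.15 + 0.7 * (X[:, 0] > 0.5)).astype(int)
    g = histogram_classifier(X, Y, h=0.1)
    print(g([0.25, 0.5]), g([0.75, 0.5]))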


Problems and Exercises

Problem 17.1. Let ψ_1, ψ_2, . . . be a sequence of real-valued functions on a bounded interval [a, b] such that ∫_a^b ψ_i(x) ψ_j(x) dx = I_{i=j}, and the set of finite linear combinations ∑_{i=1}^k a_i ψ_i(x) is dense in the class of continuous functions on [a, b] with respect to the supremum norm. Define the functions Ψ_{i_1,...,i_d} : [a, b]^d → R by

Ψ_{i_1,...,i_d}(x) = ψ_{i_1}(x^{(1)}) · · · ψ_{i_d}(x^{(d)}),   i_1, . . . , i_d ∈ {1, 2, . . .}.

Show that these functions form a complete orthonormal set of functions on [a, b]^d.

Problem 17.2. Let Z be an arbitrary random variable on R, and V be a real random variable, independent of Z, that has a density h ∈ L2(λ). Show that the density f of the random variable X = Z + V exists, and ∫ f²(x) dx < ∞. Hint: Use Jensen's inequality to prove that ∫ f²(x) dx ≤ ∫ h²(x) dx.

Problem 17.3. Derive the strong universal consistency of the regular histogram rule from Theorem 17.2, as indicated in the remark following it.

Problem 17.4. Let ψ_0, ψ_1, ψ_2, . . . be the standard trigonometric basis, and consider the classification rule defined in (17.1). Show that the rule is not consistent for the following distribution: given Y = 1, let X be 0, and given Y = 0, let X be uniformly distributed on [−π, π]. Assume furthermore that P{Y = 1} = 1/2.

Problem 17.5. The Haar basis is not bounded. Determine whether or not the Laguerre, Hermite, and Legendre bases are bounded.


18 Complexity Regularization

This chapter offers key theoretical results that confirm the existence of certain "good" rules. Although the proofs are constructive—we do tell you how you may design such rules—the computational requirements are often prohibitive. Many of these rules are thus not likely to filter down to the software packages and pattern recognition implementations. An attempt at reducing the computational complexity somewhat is described in the section entitled "Simple empirical covering." Nevertheless, we feel that much more serious work on discovering practical algorithms for empirical risk minimization is sorely needed.

In Chapter 12, the empirical error probability was minimized over a class C of decision rules φ : R^d → {0, 1}. We saw that if the vc dimension V_C of the class is finite, then the error probability of the selected rule is within a constant times √(V_C log n / n) of that of the best rule in C. Unfortunately, classes with finite vc dimension are nearly always too small, and thus the error probability of the best rule in C is typically far from the Bayes risk L∗ for some distribution (see Theorem 18.4). In Chapter 17 we investigated the special classes of generalized linear rules. Theorem 17.2, for example, shows that if we increase the size of the class in a controlled fashion as the sample size n increases, the error probability of the selected rule approaches L∗ for any distribution. Thus, V_C increases with n!

Theorem 17.2 may be generalized straightforwardly for other types of classifiers. Consider a sequence of classes C(1), C(2), . . ., containing classifiers of the form φ : R^d → {0, 1}. The training data D_n = ((X_1, Y_1), . . . , (X_n, Y_n)) are used to select a classifier φ∗_n by minimizing the empirical error probability L_n(φ) over the class C(k_n), where the integer k_n depends on n in a specified way. The following generalization is based on the proof of Theorem 17.2 (see Problem 18.1):

Theorem 18.1. Assume that C(1), C(2), . . . is a sequence of classes of decision rules such that for any distribution of (X, Y),

lim_{i→∞} inf_{φ∈C(i)} L(φ) = L∗,

and that the vc dimensions V_{C(1)}, V_{C(2)}, . . . are all finite. If

k_n → ∞ and (V_{C(k_n)} log n)/n → 0 as n → ∞,

then the classifier φ∗_n that minimizes the empirical error over the class C(k_n) is strongly universally consistent.

Theorem 18.1 is the missing link—we are now ready to apply the rich theory of Vapnik-Chervonenkis classes in the construction of universally consistent rules. The theorem does not help us, however, with the choice of the classes C(i), or the choice of the sequence k_n. Examples for sequences of classes satisfying the condition of Theorem 18.1 include generalized linear decisions with properly chosen basis functions (Chapter 17), or neural networks (Chapter 30).

A word of warning here. The universally consistent rules obtained via Theorem 18.1 may come at a tremendous computational price. As we will see further on, we will often have exponential complexities in n instead of the polynomial time that would be obtained if we just minimized the empirical risk over a fixed vc class. The computational complexity of these rules is often incomparably larger than that of some simple universally consistent rules such as the k-nearest neighbor, kernel, or partitioning rules.

18.1 Structural Risk Minimization

Let C(1), C(2), . . . be a sequence of classes of classifiers, from which we wish to select a sequence of classifiers with the help of the training data D_n. Previously, we picked φ∗_n from the k_n-th class C(k_n), where the integer k_n is some prespecified function of the sample size n only. The integer k_n basically determines the complexity of the class from which the decision rule is selected. Theorem 18.1 shows that under mild conditions on the sequence of classes, it is possible to find sequences k_n such that the rule is universally consistent. Typically, k_n should grow with n in order to assure convergence of the approximation error inf_{φ∈C(k_n)} L(φ) − L∗, but it cannot grow too rapidly, for otherwise the estimation error L(φ∗_n) − inf_{φ∈C(k_n)} L(φ) might fail to converge to zero. Ideally, to get best performance, the two types of error should be about the same order of magnitude. Clearly, a prespecified choice of the complexity k_n cannot balance the two sides of the trade-off for all distributions. Therefore, it is important to find methods such that the classifier is selected from a class whose index is automatically determined by the data D_n. This section deals with such methods.

The most obvious method would be based on selecting a candidate decision rule φn,j from every class C(j) (for example, by minimizing the empirical error over C(j)), and then minimizing the empirical error over these rules. However, typically, the vc dimension of

    C∗ = ⋃_{j=1}^∞ C(j)

equals infinity, which, in view of results in Chapter 14, implies that this approach will not work. In fact, in order to guarantee

    inf_{j≥1} inf_{φ∈C(j)} L(φ) = L∗,

it is necessary that VC∗ = ∞ (see Theorems 18.4 and 18.5).

A possible solution for the problem is a method proposed by Vapnik and Chervonenkis (1974c) and Vapnik (1982), called structural risk minimization. First we select a classifier φn,j from every class C(j) which minimizes the empirical error over the class. Then we know from Theorem 12.5 that for every j, with very large probability,

    L(φn,j) ≤ Ln(φn,j) + O(√(VC(j) log n / n)).

Now, pick a classifier that minimizes the upper bound over j ≥ 1. To make things more precise, for every n and j, we introduce the complexity penalty

    r(j, n) = √((32/n) VC(j) log(en)).

Let φn,1, φn,2, . . . be classifiers minimizing the empirical error Ln(φ) over the classes C(1), C(2), . . ., respectively. For φ ∈ C(j), define the complexity-penalized error estimate

    L̃n(φ) = Ln(φ) + r(j, n).

Finally, select the classifier φ∗n minimizing the complexity-penalized error estimate L̃n(φn,j) over j ≥ 1. The next theorem states that this method avoids overfitting the data. The only condition is that each class in the sequence has finite vc dimension.


Theorem 18.2. Let C(1), C(2), . . . be a sequence of classes of classifiers such that for any distribution of (X, Y),

    lim_{j→∞} inf_{φ∈C(j)} L(φ) = L∗.

Assume also that the vc dimensions VC(1), VC(2), . . . are finite and satisfy

    ∆ = Σ_{j=1}^∞ e^{−VC(j)} < ∞.

Then the classification rule φ∗n based on structural risk minimization, as defined above, is strongly universally consistent.

Remark. Note that the condition on the vc dimensions is satisfied if we insist that VC(j+1) > VC(j) for all j, a natural assumption. See also Problem 18.3. □

Instead of minimizing the empirical error Ln(φ) over the set of candidates C∗, the method of structural risk minimization minimizes L̃n(φ), the sum of the empirical error and a term √((32/n) VC(j) log(en)), which increases as the vc dimension of the class C(j) containing φ increases. Since classes with larger vc dimension can be considered as more complex than those with smaller vc dimension, the term added to the empirical error may be considered as a penalty for complexity. The idea of minimizing the sum of the empirical error and a term penalizing the complexity has been investigated in various statistical problems by, for example, Rissanen (1983), Akaike (1974), Barron (1985; 1991), Barron and Cover (1991), and Lugosi and Zeger (1994). Barron (1991) minimizes the penalized empirical risk over a suitably chosen countably infinite list of candidates. This approach is close in spirit to the skeleton estimates discussed in Chapter 28. He makes the connection between structural risk minimization and the minimum description length principle, and obtains results similar to those discussed in this section. The theorems presented here were essentially developed in Lugosi and Zeger (1994).

Proof of Theorem 18.2. We show that both terms on the right-hand side of the following decomposition converge to zero with probability one:

    L(φ∗n) − L∗ = (L(φ∗n) − inf_{j≥1} L̃n(φn,j)) + (inf_{j≥1} L̃n(φn,j) − L∗).


For the first term we have

    P{ L(φ∗n) − inf_{j≥1} L̃n(φn,j) > ε }
      = P{ L(φ∗n) − L̃n(φ∗n) > ε }
      ≤ P{ sup_{j≥1} (L(φn,j) − L̃n(φn,j)) > ε }
      = P{ sup_{j≥1} (L(φn,j) − Ln(φn,j) − r(j, n)) > ε }
      ≤ Σ_{j=1}^∞ P{ |L(φn,j) − Ln(φn,j)| > ε + r(j, n) }
      ≤ Σ_{j=1}^∞ 8 n^{VC(j)} e^{−n(ε+r(j,n))²/32}   (by Theorem 12.5)
      ≤ Σ_{j=1}^∞ 8 n^{VC(j)} e^{−n r²(j,n)/32} e^{−nε²/32}
      = 8 e^{−nε²/32} Σ_{j=1}^∞ e^{−VC(j)} = ∆ e^{−nε²/32},

using the defining expression for r(j, n), where ∆ = 8 Σ_{j=1}^∞ e^{−VC(j)} < ∞, by assumption. Thus, it follows from the Borel-Cantelli lemma that

    L(φ∗n) − inf_{j≥1} L̃n(φn,j) → 0

with probability one as n → ∞. Now, we can turn to investigating the second term inf_{j≥1} L̃n(φn,j) − L∗. By our assumptions, for every ε > 0, there exists an integer k such that

    inf_{φ∈C(k)} L(φ) − L∗ ≤ ε.

Fix such a k. Thus, it suffices to prove that

    lim sup_{n→∞} ( inf_{j≥1} L̃n(φn,j) − inf_{φ∈C(k)} L(φ) ) ≤ 0   with probability one.

Clearly, for any fixed k, if n is large enough, then

    r(k, n) = √((32/n) VC(k) log(en)) ≤ ε/2.


Thus, for such large n,

    P{ inf_{j≥1} L̃n(φn,j) − inf_{φ∈C(k)} L(φ) > ε }
      ≤ P{ L̃n(φn,k) − inf_{φ∈C(k)} L(φ) > ε }
      = P{ Ln(φn,k) + r(k, n) − inf_{φ∈C(k)} L(φ) > ε }
      ≤ P{ Ln(φn,k) − inf_{φ∈C(k)} L(φ) > ε/2 }
      ≤ P{ sup_{φ∈C(k)} |Ln(φ) − L(φ)| > ε/2 }
      ≤ 8 n^{VC(k)} e^{−nε²/128}

by Theorem 12.5. Therefore, since VC(k) < ∞, the proof is completed. □

Theorem 18.2 shows that the method of structural risk minimization is universally consistent under very mild conditions on the sequence of classes C(1), C(2), . . .. This property, however, is shared with the minimizer of the empirical error over the class C(kn), where kn is a properly chosen function of the sample size n (Theorem 18.1). The real strength, then, of structural risk minimization is seen from the next result.

Theorem 18.3. Let C(1), C(2), . . . be a sequence of classes of classifiers such that the vc dimensions VC(1), VC(2), . . . are all finite. Assume further that the Bayes rule

    g∗ ∈ C∗ = ⋃_{j=1}^∞ C(j),

that is, a Bayes rule is contained in one of the classes. Let k be the smallest integer such that g∗ ∈ C(k). Then for every n and ε > 0 satisfying

    VC(k) log(en) ≤ nε²/512,

the error probability of the classification rule based on structural risk minimization φ∗n satisfies

    P{ L(φ∗n) − L∗ > ε } ≤ ∆ e^{−nε²/128} + 8 n^{VC(k)} e^{−nε²/512},

where ∆ = Σ_{j=1}^∞ e^{−VC(j)}.

Proof. Theorem 18.3 follows by examining the proof of Theorem 18.2. The only difference is that by assumption, there is an integer k such that inf_{φ∈C(k)} L(φ) = L∗. Therefore, for this k,

    L(φ∗n) − L∗ = (L(φ∗n) − inf_{j≥1} L̃n(φn,j)) + (inf_{j≥1} L̃n(φn,j) − inf_{φ∈C(k)} L(φ)),

and the two terms on the right-hand side may be bounded as in the proof of Theorem 18.2. □

Theorem 18.3 implies that if g∗ ∈ C∗, there is a universal constant c0 and a finite k such that

    EL(φ∗n) − L∗ ≤ c0 √(VC(k) log n / n),

that is, the rate of convergence is always of the order of √(log n/n), and the constant factor VC(k) depends on the distribution. The number VC(k) may be viewed as the inherent complexity of the Bayes rule for the distribution. The intuition is that the simplest rules are contained in C(1), and more and more complex rules are added to the class as the index of the class increases. The size of the error is about the same as if we had known k beforehand, and minimized the empirical error over C(k). In view of the results of Chapter 14 it is clear that the classifier described in Theorem 18.1 does not share this property, since if L∗ > 0, then the error probability of the rule selected from C(kn) is larger than a constant times √(VC(kn) log n / n) for some distribution—even if g∗ ∈ C(k) for some fixed k. Just as in designs based upon minimum description length, automatic model selection, and other complexity regularization methods (Rissanen (1983), Akaike (1974), Barron (1985; 1991), and Barron and Cover (1991)), structural risk minimization automatically finds where to look for the optimal classifier. The constants appearing in Theorem 18.3 may be improved by using refined versions of the Vapnik-Chervonenkis inequality.

The condition g∗ ∈ C∗ in Theorem 18.3 is not very severe, as C∗ can be a large class with infinite vc dimension. The only requirement is that it should be written as a countable union of classes of finite vc dimension. Note however that the class of all decision rules cannot be decomposed as such (see Theorem 18.6). We also emphasize that in order to achieve the O(√(log n/n)) rate of convergence, we do not have to assume that the distribution is a member of a known finite-dimensional parametric family (see Chapter 16). The condition is imposed solely on the form of the Bayes classifier g∗.

By inspecting the proof of Theorem 18.2 we see that for every k,

    EL(φ∗n) − L∗ ≤ c0 √(VC(k) log n / n) + (inf_{φ∈C(k)} L(φ) − L∗).

In fact, Theorem 18.3 is the consequence of this inequality. The first term on the right-hand side, which may be called estimation error, usually increases with growing k, while the second, the approximation error, usually decreases with it. Importantly, the above inequality is true for every k, so that

    EL(φ∗n) − L∗ ≤ inf_k ( c0 √(VC(k) log n / n) + (inf_{φ∈C(k)} L(φ) − L∗) ).

Thus structural risk minimization finds a nearly optimal balance between the two terms. See also Problem 18.6.

Remark. It is worth mentioning that under the conditions of Theorem 18.3, an even faster, O(√(1/n)), rate of convergence is achievable at the expense of magnifying the constant factor. More precisely, it is possible to define the complexity penalties r(j, n) such that the resulting classifier satisfies

    EL(φ∗n) − L∗ ≤ c1/√n,

where the constant c1 depends on the distribution. These penalties may be defined by exploiting Alexander's inequality (Theorem 12.10), and the inequality above can be proved by using the bound in Problem 12.10; see Problem 18.2. □

Remark. We see from the proof of Theorem 18.2 that

    P{ L(φ∗n) > L̃n(φ∗n) + ε } ≤ ∆ e^{−nε²/32}.

This means that L̃n(φ∗n) does not underestimate the true error probability of φ∗n by much. This is a very attractive feature of L̃n as an error estimate, as the designer can be confident about the performance of the rule φ∗n. □

The initial excitement over a consistent rule with a guaranteed O(√(log n/n)) rate of convergence to the Bayes error is tempered by a few sobering facts:

(1) The user needs to choose the sequence C(i) and has to know VC(j) (see, however, Problems 18.3, 18.4, and the method of simple empirical covering below).

(2) It is difficult to find the structural risk minimizer φ∗n efficiently. After all, the minimization is done over an infinite sequence of infinite sets.

The second concern above deserves more attention. Consider the following simple example: let d = 1, and let C(j) be the class of classifiers φ for which

    {x : φ(x) = 1} = ⋃_{i=1}^{j} Ai,

where each Ai is an interval of R. Then, from Theorem 13.7, VC(j) = 2j, and we may take

    r(j, n) = √((64j/n) log(en)).

In structural risk minimization, we find those j (possibly empty) intervals that minimize

    Ln(φ) + r(j, n),

and call the corresponding classifier φn,j. For j = 1 we have r(j, n) = √((64/n) log(en)). As r(n, n) > 1 + r(1, n), and r(j, n) is monotone in j, it is easily seen that to pick the best j as well, we need only consider 1 ≤ j ≤ n. Still, this is a formidable exercise. For fixed j, the best j intervals may be found by considering all possible insertions of 2j interval boundaries among the n Xi's. This brute-force method takes computation time bounded from below by (n+2j choose 2j). If we let j run up to n, then we have as a lower bound

    Σ_{j=1}^{n} (n+2j choose 2j).

Just the last term alone, (3n choose 2n), grows as √(3/(4πn)) (27/4)^n, and is prohibitively large for n ≥ 20. Fortunately, in this particular case, there is an algorithm which finds a classifier minimizing Ln(φ) + r(j, n) over C∗ = ⋃_{j=1}^∞ C(j) in computational time polynomial in n; see Problem 18.5. Another example when the structural risk minimizer φ∗n is easy to find is described in Problem 18.6. However, such fast algorithms are not always available, and exponential running time prohibits the use of structural risk minimization even for relatively small values of n.
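To make the computational point concrete, the following Python sketch minimizes Ln(φ) + r(j, n) over unions of intervals in time polynomial in n by scanning the points in sorted order and doing dynamic programming over the number of intervals opened so far. It is offered only as an illustration that a polynomial-time search is possible; it is not necessarily the algorithm intended in Problem 18.5.

    import math

    def penalty(j, n):
        # r(j, n) = sqrt((64 j / n) log(en)), since V_{C(j)} = 2j here.
        return math.sqrt(64.0 * j * math.log(math.e * n) / n)

    def srm_over_interval_unions(xs, ys):
        """Minimize Ln(phi) + r(j, n) over unions of at most j intervals, 1 <= j <= n."""
        n = len(xs)
        labels = [y for _, y in sorted(zip(xs, ys))]
        INF = float("inf")
        # err[j][s]: fewest training errors so far with j intervals opened,
        # s = 1 iff the left-to-right scan currently sits inside an interval.
        err = [[INF, INF] for _ in range(n + 1)]
        err[0][0] = 0
        for y in labels:
            new = [[INF, INF] for _ in range(n + 1)]
            for j in range(n + 1):
                for s in (0, 1):
                    c = err[j][s]
                    if c == INF:
                        continue
                    out_cost = c + (1 if y == 1 else 0)      # point left outside: decide 0
                    if out_cost < new[j][0]:
                        new[j][0] = out_cost
                    jj = j + 1 if s == 0 else j              # open a new interval if needed
                    in_cost = c + (1 if y == 0 else 0)       # point covered: decide 1
                    if jj <= n and in_cost < new[jj][1]:
                        new[jj][1] = in_cost
            err = new
        # C(j) allows empty intervals, so up to j intervals may be used.
        best, fewest = None, INF
        for j in range(1, n + 1):
            fewest = min(fewest, err[j - 1][0], err[j - 1][1], err[j][0], err[j][1])
            score = fewest / n + penalty(j, n)
            if best is None or score < best[0]:
                best = (score, j)
        return best   # (penalized empirical error, selected number of intervals)

Run on the data X1, . . . , Xn with labels Y1, . . . , Yn, the function returns the optimal penalized score and the selected j; recovering the intervals themselves only requires storing the usual back-pointers.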

18.2 Poor Approximation Properties of VC Classes

We pause here for a moment to summarize some interesting by-products that readily follow from Theorem 18.3 and the slow convergence results of Chapter 7. Theorem 18.3 states that for a large class of distributions an O(√(log n/n)) rate of convergence to the Bayes error L∗ is achievable. On the other hand, Theorem 7.2 asserts that no universal rates of convergence to L∗ exist. Therefore, the class of distributions for which Theorem 18.3 is valid cannot be that large, after all. The combination of these facts results in the following three theorems, which say that vc classes—classes of subsets of R^d with finite vc dimension—have necessarily very poor approximation properties. The proofs are left to the reader as easy exercises (see Problems 18.7 to 18.9). For direct proofs, see, for example, Benedek and Itai (1994).


Theorem 18.4. Let C be any class of classifiers of the form φ : R^d → {0, 1}, with vc dimension VC < ∞. Then for every ε > 0 there exists a distribution such that

    inf_{φ∈C} L(φ) − L∗ > 1/2 − ε.

Theorem 18.5. Let C(1), C(2), . . . be a sequence of classes of classifiers such that the vc dimensions VC(1), VC(2), . . . are all finite. Then for any sequence ak of positive numbers converging to zero arbitrarily slowly, there exists a distribution such that for every k large enough,

    inf_{φ∈C(k)} L(φ) − L∗ > ak.

Theorem 18.6. The class C∗ of all (Borel measurable) decision rules of the form φ : R^d → {0, 1} cannot be written as

    C∗ = ⋃_{j=1}^∞ C(j),

where the vc dimension of each class C(j) is finite. In other words, the class B of all Borel subsets of R^d cannot be written as a union of countably many vc classes. In fact, the same is true for the class of all subsets of the set of positive integers.

18.3 Simple Empirical Covering

As Theorem 18.2 shows, the method of structural risk minimization provides automatic protection against the danger of overfitting the data, by penalizing complex candidate classifiers. One of the disadvantages of the method is that the penalty terms r(j, n) require knowledge of the vc dimensions of the classes C(j) or upper bounds of these dimensions. Next we discuss a method proposed by Buescher and Kumar (1994b) which is applicable even if the vc dimensions of the classes are completely unknown. The method, called simple empirical covering, is closely related to the method based on empirical covering studied in Problem 12.14. In simple empirical covering, we first split the data sequence Dn into two parts. The first part is Dm = ((X1, Y1), . . . , (Xm, Ym)) and the second part is Tl = ((Xm+1, Ym+1), . . . , (Xn, Yn)). The positive integers m and l = n − m will be specified later. The first part Dm is used to cover C∗ = ⋃_{j=1}^∞ C(j) as follows. For every φ ∈ C∗, define the binary m-vector bm(φ) by

    bm(φ) = (φ(X1), . . . , φ(Xm)).


There are N ≤ 2^m different values of bm(φ). Usually, as VC∗ = ∞, N = 2^m, that is, all possible values of bm(φ) occur as φ is varied over C∗, but of course, N depends on the values of X1, . . . , Xm. We call a classifier φ simpler than another classifier φ′, if the smallest index i such that φ ∈ C(i) is smaller than or equal to the smallest index j such that φ′ ∈ C(j). For every binary m-vector b ∈ {0, 1}^m that can be written as b = bm(φ) for some φ ∈ C∗, we pick a candidate classifier φm,k with k ∈ {1, . . . , N} such that bm(φm,k) = b, and it is the simplest such classifier, that is, there is no φ ∈ C∗ such that simultaneously bm(φ) = b and φ is simpler than φm,k. This yields N ≤ 2^m candidates φm,1, . . . , φm,N. Among these, we select one that minimizes the empirical error

    Ll(φm,k) = (1/l) Σ_{i=1}^{l} I{φm,k(Xm+i) ≠ Ym+i},

measured on the independent testing sequence Tl. Denote the selected classifier by φ∗n. The next theorem asserts that the method works under circumstances similar to structural risk minimization.
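In code, the procedure might look as follows. This Python sketch makes the simplifying assumption that each class can be enumerated as a finite list of classifiers (only so that the fragment runs); in general one merely needs, for every realizable binary m-vector, some simplest classifier producing it.

    def simple_empirical_covering(classes, xs, ys, m):
        """Sketch of the selection by simple empirical covering.

        classes: an iterable over C(1), C(2), ..., each given here as a finite
        list of classifiers (callables mapping a point to 0 or 1).  The labels
        Y1,...,Ym are not used: the first part of the data enters only through
        the binary vectors b_m(phi).
        """
        cover_x = xs[:m]                      # D_m (only the X-part matters)
        test_x, test_y = xs[m:], ys[m:]       # T_l, the independent testing part
        candidates = {}                       # binary m-vector -> simplest classifier
        for c in classes:                     # classes scanned from simplest upward
            for phi in c:
                b = tuple(phi(x) for x in cover_x)
                if b not in candidates:       # keep the first (= simplest) classifier
                    candidates[b] = phi       # realizing this binary m-vector
        def holdout_error(phi):
            return sum(phi(x) != y for x, y in zip(test_x, test_y)) / len(test_x)
        return min(candidates.values(), key=holdout_error)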

Theorem 18.7. (Buescher and Kumar (1994b)). Let C(1) ⊆ C(2) ⊆ . . . be a nested sequence of classes of classifiers such that for any distribution of (X, Y),

    lim_{j→∞} inf_{φ∈C(j)} L(φ) = L∗.

Assume also that the vc dimensions VC(1), VC(2), . . . are all finite. If m/log n → ∞ and m/n → 0, then the classification rule φ∗n based on simple empirical covering as defined above is strongly universally consistent.

Proof. We decompose the difference between the error probability of the selected rule φ∗n and the Bayes risk as follows:

    L(φ∗n) − L∗ = (L(φ∗n) − inf_{1≤k≤N} L(φm,k)) + (inf_{1≤k≤N} L(φm,k) − L∗).

The first term can be handled by Lemma 8.2 and Hoeffding's inequality:

    P{ L(φ∗n) − inf_{1≤k≤N} L(φm,k) > ε }
      ≤ P{ 2 sup_{1≤k≤N} |L(φm,k) − Ll(φm,k)| > ε }
      = E{ P{ 2 sup_{1≤k≤N} |L(φm,k) − Ll(φm,k)| > ε | Dm } }
      ≤ E{ 2N e^{−lε²/2} }
      ≤ 2^{m+1} e^{−lε²/2} = 2^{m+1} e^{−(n−m)ε²/2}.


Because m = o(n), the latter expression converges to zero exponentially rapidly. Thus, it remains to show that

    inf_{1≤k≤N} L(φm,k) − L∗ → 0

with probability one. By our assumptions, for every ε > 0, there is an integer k such that

    inf_{φ∈C(k)} L(φ) − L∗ < ε.

Then there exists a classifier φ(ε) ∈ C(k) with L(φ(ε)) − L∗ ≤ ε. Therefore, we are done if we prove that

    lim sup_{n→∞} ( inf_{1≤k≤N} L(φm,k) − L(φ(ε)) ) ≤ 0   with probability one.

Clearly, there is a classifier φm,j among the candidates φm,1, . . . , φm,N, such that

    bm(φm,j) = bm(φ(ε)).

Since by definition, φm,j is simpler than φ(ε), and the classes C(1), C(2), . . . are nested, it follows that φm,j ∈ C(k). Therefore,

    inf_{1≤k≤N} L(φm,k) − L(φ(ε)) ≤ L(φm,j) − L(φ(ε))
      ≤ sup_{φ,φ′∈C(k): bm(φ)=bm(φ′)} |L(φ) − L(φ′)|,

where the last supremum is taken over all pairs of classifiers such that their corresponding binary vectors bm(φ) and bm(φ′) are equal. But from Problem 12.14,

    P{ sup_{φ,φ′∈C(k): bm(φ)=bm(φ′)} |L(φ) − L(φ′)| > ε } ≤ 2S(C(k), 2m) e^{−mε log 2/4},

which is summable if m/log n → ∞. □

Remark. As in Theorem 18.3, we may assume again that there is an integer k such that inf_{φ∈C(k)} L(φ) = L∗. Then from the proof of the theorem above we see that the error probability of the classifier φ∗n, obtained by the method of simple empirical covering, satisfies

    P{ L(φ∗n) − L∗ > ε } ≤ 2^{m+1} e^{−(n−m)ε²/8} + 2S(C(k), 2m) e^{−mε log 2/8}.

Unfortunately, for any choice of m, this bound is much larger than the analogous bound obtained for structural risk minimization. In particular, it does not yield an O(√(log n/n)) rate of convergence. Thus, it appears that the price paid for the advantages of simple empirical covering—no knowledge of the vc dimensions is required, and the implementation may require significantly less computational time in general—is a slower rate of convergence. See Problem 18.10. □


Problems and Exercises

Problem 18.1. Prove Theorem 18.1.

Problem 18.2. Define the complexity penalties r(j, n) so that under the conditions of Theorem 18.3, the classification rule φ∗n based upon structural risk minimization satisfies

    EL(φ∗n) − L∗ ≤ c1/√n,

where the constant c1 depends on the distribution. Hint: Use Alexander's bound (Theorem 12.10), and the inequality of Problem 12.10.

Problem 18.3. Let C(1), C(2), . . . be a sequence of classes of decision rules with finite vc dimensions. Assume that only upper bounds αj ≥ VC(j) on the vc dimensions are known. Define the complexity penalties by

    r(j, n) = √((32/n) αj log(en)).

Show that if Σ_{j=1}^∞ e^{−αj} < ∞, then Theorems 18.2 and 18.3 carry over to the classifier based on structural risk minimization defined by these penalties. This points out that knowledge of relatively rough estimates of the vc dimensions suffices.

Problem 18.4. Let C(1), C(2), . . . be a sequence of classes of classifiers such that the vc dimensions VC(1), VC(2), . . . are all finite. Assume furthermore that the Bayes rule is contained in one of the classes and that L∗ = 0. Let φ∗n be the rule obtained by structural risk minimization using the positive penalties r(j, n), satisfying

(1) For each n, r(j, n) is strictly monotone increasing in j.
(2) For each j, lim_{n→∞} r(j, n) = 0.

Show that EL(φ∗n) = O(log n/n) (Lugosi and Zeger (1994)). For related work, see Benedek and Itai (1994).

Problem 18.5. Let C(j) be the class of classifiers φ : R^d → {0, 1} satisfying

    {x : φ(x) = 1} = ⋃_{i=1}^{j} Ai,

where the Ai's are bounded intervals in R. The purpose of this exercise is to point out that there is a fast algorithm to find the structural risk minimizer φ∗n over C∗ = ⋃_{j=1}^∞ C(j), that is, which minimizes L̃n = Ln + r(j, n) over C∗, where the penalties r(j, n) are defined as in Theorems 18.2 and 18.3. The property below was pointed out to us by Miklos Csuros and Reka Szabo.

(1) Let A∗1,j, . . . , A∗j,j be the j intervals defining the classifier φn,j minimizing Ln over C(j). Show that the optimal intervals for C(j+1), A∗1,j+1, . . . , A∗j+1,j+1, satisfy the following property: either j of the intervals coincide with A∗1,j, . . . , A∗j,j, or j − 1 of them are among A∗1,j, . . . , A∗j,j, and the remaining two intervals are subsets of the j-th interval.


(2) Use property (1) to define an algorithm that finds φ∗n in running time polynomial in the sample size n.

Problem 18.6. Assume that the distribution of X is concentrated on the unit cube, that is, P{X ∈ [0, 1]^d} = 1. Let Pj be a partition of [0, 1]^d into cubes of size 1/j, that is, Pj contains j^d cubic cells. Let C(j) be the class of all histogram classifiers φ : [0, 1]^d → {0, 1} based on Pj. In other words, C(j) contains all 2^{j^d} classifiers which assign the same label to points falling in the same cell of Pj. What is the vc dimension VC(j) of C(j)? Point out that the classifier minimizing Ln over C(j) is just the regular histogram rule based on Pj. (See Chapter 9.) Thus, we have another example in which the empirically optimal classifier is computationally inexpensive. The structural risk minimizer φ∗n based on C∗ = ⋃_{j=1}^∞ C(j) is also easy to find. Assume that the a posteriori probability η(x) is uniformly Lipschitz, that is, for any x, y ∈ [0, 1]^d,

    |η(x) − η(y)| ≤ c‖x − y‖,

where c is some constant. Find upper bounds for the rate of convergence of EL(φ∗n) to L∗.

Problem 18.7. Prove Theorem 18.4. Hint: Use Theorem 7.1.

Problem 18.8. Prove Theorem 18.5. Hint: Use Theorem 7.2.

Problem 18.9. Prove Theorem 18.6. Hint: Use Theorems 7.2 and 18.3.

Problem 18.10. Assume that the Bayes rule g∗ is contained in C∗ = ⋃_{j=1}^∞ C(j). Let φ∗n be the classifier obtained by simple empirical covering. Determine the value of the design parameter m that minimizes the bounds obtained in the proof of Theorem 18.7. Obtain a tight upper bound for EL(φ∗n) − L∗. Compare your results with Theorem 18.3.


19 Condensed and Edited Nearest Neighbor Rules

19.1 Condensed Nearest Neighbor Rules

Condensing is the process by which we eliminate data points, yet keep the same behavior. For example, in the nearest neighbor rule, by condensing we might mean the reduction of (X1, Y1), . . . , (Xn, Yn) to (X′1, Y′1), . . . , (X′m, Y′m) such that for all x ∈ R^d, the 1-nn rule is identical based on the two samples. This will be called pure condensing. This operation has no effect on Ln, and therefore is recommended whenever space is at a premium. The space savings should be substantial whenever the classes are separated. Unfortunately, pure condensing is computationally expensive, and offers no hope of improving upon the performance of the ordinary 1-nn rule.

Figure 19.1. Pure condensing: Eliminating the marked points does not change the decision.


Hart (1968) has the first simple algorithm for condensing. He picks a subset that correctly classifies the remaining data by the 1-nn rule. Finding a minimal such subset is computationally difficult, but heuristics may do the job. Hart's heuristic is also discussed in Devijver and Kittler (1982, p.120). For probabilistic analysis, we take a more abstract setting. Let (X′1, Y′1), . . . , (X′m, Y′m) be a sequence that depends in an arbitrary fashion on the data Dn, and let gn be the 1-nearest neighbor rule with (X′1, Y′1), . . . , (X′m, Y′m), where for simplicity, m is fixed beforehand. The data might, for example, be obtained by finding the subset of the data of size m for which the error with the 1-nn rule committed on the remaining n − m data is minimal (this will be called Hart's rule). Regardless, if L̂n = (1/n) Σ_{i=1}^n I{gn(Xi) ≠ Yi} and Ln = P{gn(X) ≠ Y | Dn}, then we have the following:

Theorem 19.1. (Devroye and Wagner (1979c)). For all ε > 0 and all distributions,

    P{ |L̂n − Ln| ≥ ε } ≤ 8 (ne/(d+1))^{(d+1)m(m−1)} e^{−nε²/32}.

Remark. The estimate L̂n is called the resubstitution estimate of the error probability. It is thoroughly studied in Chapter 23, where several results of the aforementioned kind are stated. □

Proof. Observe that

    L̂n = (1/n) Σ_{j=1}^n I{(Xj, Yj) ∉ ⋃_{i=1}^m Bi × {Y′i}},

where Bi is the Voronoi cell of X′i in the Voronoi partition corresponding to X′1, . . . , X′m, that is, Bi is the set of points of R^d closer to X′i than to any other X′j (with appropriate distance-tie breaking). Similarly,

    Ln = P{ (X, Y) ∉ ⋃_{i=1}^m Bi × {Y′i} | Dn }.

We use the simple upper bound

    |L̂n − Ln| ≤ sup_{A∈Am} |νn(A) − ν(A)|,

where ν denotes the measure of (X, Y), νn is the corresponding empirical measure, and Am is the family of all subsets of R^d × {0, 1} of the form ⋃_{i=1}^m Bi × {yi}, where B1, . . . , Bm are Voronoi cells corresponding to x1, . . . , xm, xi ∈ R^d, yi ∈ {0, 1}. We use the Vapnik-Chervonenkis inequality to bound the above supremum. By Theorem 13.5 (iv),

    s(Am, n) ≤ s(A, n)^m,

where A is the class of sets B1 × {y1}. But each set in A is an intersection of at most m − 1 hyperplanes. Therefore, by Theorems 13.5 (iii), 13.9, and 13.3,

    s(A, n) ≤ sup_{n0,n1: n0+n1=n} Π_{j=0}^{1} (nj e/(d+1))^{(d+1)(m−1)} ≤ (ne/(d+1))^{(d+1)(m−1)},

where nj denotes the number of points in R^d × {j}. The result now follows from Theorem 12.5. □

Remark. With Hart’s rule, at least m data points are correctly classifiedby the 1-nn rule (if we handle distance ties satisfactorily). Therefore, Ln ≤1−m/n. 2

The following is a special case: Let m < n be fixed and let Dm be an arbitrary (possibly random) subset of m pairs from (X1, Y1), . . . , (Xn, Yn), used in gn. Let the remaining n − m points be denoted by Tn−m. We write Ln(Dm) for the probability of error with the 1-nn rule based upon Dm, and we define

    L̂n,m(Dm, Tn−m) = (1/(n−m)) Σ_{(Xi,Yi)∈Tn−m} I{gn(Xi) ≠ Yi}.

In Hart's rule, L̂n,m would be zero, for example. Then we have:

Theorem 19.2. For all ε > 0,

    P{ |Ln − L̂n,m| > ε } ≤ 2 (n choose m) e^{−2(n−m)ε²} ≤ 2 n^m e^{−2(n−m)ε²},

where Ln is the probability of error with gn (note that Dm depends in an arbitrary fashion upon Dn), and L̂n,m is L̂n,m(Dm, Tn−m) with the data set Dm.

Proof. List the m-element subsets {i1, . . . , im} of {1, 2, . . . , n}, and define Dm^(i) as the sequence of m pairs from Dn indexed by i = {i1, . . . , im}, 1 ≤ i ≤ (n choose m). Accordingly, denote Tn−m^(i) = Dn − Dm^(i). Then

    P{ |Ln − L̂n,m| > ε }
      ≤ Σ_{i=1}^{(n choose m)} P{ |Ln(Dm^(i)) − L̂n,m(Dm^(i), Tn−m^(i))| > ε }
      = E{ Σ_{i=1}^{(n choose m)} P{ |Ln(Dm^(i)) − L̂n,m(Dm^(i), Tn−m^(i))| > ε | Dm^(i) } }
      ≤ E{ Σ_{i=1}^{(n choose m)} 2 e^{−2(n−m)ε²} }
        (by Hoeffding's inequality, because given Dm^(i),
        (n−m) L̂n,m(Dm^(i), Tn−m^(i)) is binomial ((n−m), Ln(Dm^(i))))
      = 2 (n choose m) e^{−2(n−m)ε²}. □

By checking the proof, we also see that if Dm is selected to minimize the error estimate L̂n,m(Dm, Tn−m), then the error probability Ln of the obtained rule satisfies

    P{ |Ln − inf_{Dm} Ln(Dm)| > ε }
      ≤ P{ 2 sup_{Dm} |Ln(Dm) − L̂n,m(Dm, Tn−m)| > ε }   (by Theorem 8.4)
      ≤ 2 (n choose m) e^{−(n−m)ε²/2}   (as in the proof of Theorem 19.2).     (19.1)

Thus, for the particular rule that mimics Hart's rule (with the exception that m is fixed), if m is not too large—it must be much smaller than n/log n—Ln is likely to be close to the best possible we can hope for with a 1-nn rule based upon a subsample of size m. With some work (see Problem 12.1), we see that

    E| Ln − inf_{Dm} Ln(Dm) | = O(√(m log m / n)).

By Theorem 5.1,

    E{ inf_{Dm} Ln(Dm) } ≤ ELm → LNN   as m → ∞,

where Lm is the probability of error with the 1-nn rule based upon a sample of m data pairs. Hence, if m = o(n/log n), m → ∞,

    lim sup_{n→∞} ELn ≤ LNN

for the 1-nn rule based on m data pairs Dm selected to minimize the error estimate L̂n,m(Dm, Tn−m). However, this is very pessimistic indeed. It reassures us that with only a small fraction of the original data, we obtain at least as good a performance as with the full data set—so, this method of condensing is worthwhile. This is not very surprising. Interestingly, however, the following much stronger result is true.

Theorem 19.3. If m = o(n/log n) and m → ∞, and if Ln is the probability of error for the condensed nearest neighbor rule in which L̂n,m(Dm, Tn−m) is minimized over all data sets Dm, then

    lim_{n→∞} ELn = L∗.

Proof. By (19.1), it suffices to establish that

    E{ inf_{Dm} Ln(Dm) } → L∗

as m → ∞ such that m = o(n), where Ln(Dm) is the probability of error of the 1-nn rule with Dm. As this is one of the fundamental properties of nearest neighbor rules not previously found in texts, we offer a thorough analysis and proof of this result in the remainder of this section, culminating in Theorem 19.4. □

The distribution-free result of Theorem 19.3 sets the stage for many consistency proofs for rules that use condensing (or editing, or prototyping, as defined in the next two sections). It states that inherently, partitions of the space by 1-nn rules are rich.

Historical remark. Other condensed nearest neighbor rules are presented by Gates (1972), Ullmann (1974), Ritter, Woodruff, Lowry, and Isenhour (1975), Tomek (1976b), Swonger (1972), Gowda and Krishna (1979), and Fukunaga and Mantock (1984). □

Define Z = I{η(X)>1/2}. Let (X′i, Yi, Zi), i = 1, 2, . . . , n, be i.i.d. tuples, independent of (X, Y, Z), where X′i may have a distribution different from X, but the support set of the distribution µ′ of X′i is identical to that of µ, the distribution of X. Furthermore, P{Yi = 1 | X′i = x} = η(x), and Zi = I{η(X′i)>1/2} is the Bayes decision at X′i.


Lemma 19.1. Let Ln = P{Z′(1)(X) ≠ Z | X′1, Z1, . . . , X′n, Zn} be the probability of error for the 1-nn rule based on (X′i, Zi), 1 ≤ i ≤ n, that is, Z′(1)(X) = Zi if X′i is the nearest neighbor of X among X′1, . . . , X′n. Then

    lim_{n→∞} ELn = 0.

Proof. Denote by X′(1)(X) the nearest neighbor of X among X′1, . . . , X′n. Notice that the proof of Lemma 5.1 may be extended in a straightforward way to show that ‖X′(1)(X) − X‖ → 0 with probability one. Since this is the only property of the nearest neighbor of X that we used in deriving the asymptotic formula for the ordinary 1-nn error, lim_{n→∞} ELn equals LNN corresponding to the pair (X, Z). But we have P{Z = 1 | X = x} = I{η(x)>1/2}. Thus, the Bayes probability of error L∗ for the pattern recognition problem with (X, Z) is zero. Hence, for this distribution, the 1-nn rule is consistent as LNN = L∗ = 0. □

Lemma 19.2. Let Z′(1)(X) be as in the previous lemma. Let

    Ln = P{Z′(1)(X) ≠ Y | X′1, Y1, . . . , X′n, Yn}

be the probability of error for the discrimination problem for (X, Y) (not (X, Z)). Then

    lim_{n→∞} ELn = L∗,

where L∗ is the Bayes error corresponding to (X, Y).

Proof.

    ELn = P{Z′(1)(X) ≠ Y} ≤ P{Z′(1)(X) ≠ Z} + P{Y ≠ Z} = o(1) + L∗

by Lemma 19.1. □

Theorem 19.4. Let Dm be a subset of size m drawn from Dn. If m → ∞ and m/n → 0 as n → ∞, then

    lim_{n→∞} P{ inf_{Dm} Ln(Dm) > L∗ + ε } = 0

for all ε > 0, where Ln(Dm) denotes the conditional probability of error of the nearest neighbor rule with Dm, and the infimum ranges over all (n choose m) subsets.


Proof. Let D be the subset of Dn consisting of those pairs (Xi, Yi) for which Yi = I{η(Xi)>1/2} = Zi. If |D| ≥ m, let D∗ be the first m pairs of D, and if |D| < m, let D∗ = ((X1, Y1), . . . , (Xm, Ym)). Then

    inf_{Dm} Ln(Dm) ≤ Ln(D∗).

If N = |D| ≥ m, then we know that the pairs in D∗ are i.i.d. and drawn from the distribution of (X′, Z), where X′ has the same support set as X; see Problem 19.2 for properties of X′. In particular,

    P{ inf_{Dm} Ln(Dm) > L∗ + ε }
      ≤ P{N < m} + P{N ≥ m, Ln(D∗) > L∗ + ε}
      ≤ P{N < m} + P{Ln(D∗) > L∗ + ε}
      ≤ P{Binomial(n, p) < m} + (E{Ln(D∗)} − L∗)/ε
        (where p = P{Y = Z} = 1 − L∗ ≥ 1/2, and by Markov's inequality)
      = o(1),

by the law of large numbers (here we use m/n → 0), and by Lemma 19.2 (here we use m → ∞). □

19.2 Edited Nearest Neighbor Rules

Edited nearest neighbor rules are 1-nn rules that are based upon carefully selected subsets (X′1, Y′1), . . . , (X′m, Y′m). This situation is partially dealt with in the previous section, as the frontier between condensed and edited nearest neighbor rules is ill-defined. The idea of editing based upon the k-nn rule was first suggested by Wilson (1972) and later studied by Wagner (1973) and Penrod and Wagner (1977). Wilson suggests the following scheme: compute (Xi, Yi, Zi), where Zi is the k-nn decision at Xi based on the full data set with (Xi, Yi) deleted. Then eliminate all data pairs for which Yi ≠ Zi. The remaining data pairs are used with the 1-nn rule (not the k-nn rule). Another rule, based upon data splitting, is dealt with by Devijver and Kittler (1982). A survey is given by Dasarathy (1991), Devijver (1980), and Devijver and Kittler (1980). Repeated editing was investigated by Tomek (1976a). Devijver and Kittler (1982) propose a modification of Wilson's leave-one-out method of editing based upon data splitting.
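Wilson's scheme is easy to state in code. The Python sketch below (ours, for illustration) deletes every pair whose label disagrees with the k-nn vote computed on the remaining n − 1 pairs, and classifies fresh points by the 1-nn rule on what is left.

    def knn_vote(x, pairs, k):
        # Majority vote among the k nearest neighbors of x in pairs = [(x_i, y_i), ...].
        nearest = sorted(pairs, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
        return 1 if 2 * sum(y for _, y in nearest) > k else 0

    def wilson_edit(data, k):
        """Keep (Xi, Yi) only if the k-nn rule on the other points agrees with Yi."""
        edited = []
        for i, (x, y) in enumerate(data):
            others = data[:i] + data[i + 1:]          # leave-one-out deletion
            if knn_vote(x, others, k) == y:
                edited.append((x, y))
        return edited

    def classify(x, edited):
        # Final decision: 1-nn rule (not k-nn) on the edited data.
        return min(edited, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]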


19.3 Sieves and Prototypes

Let gn be a rule that uses the 1-nn classification based upon prototype data pairs (X′1, Y′1), . . . , (X′m, Y′m) that depend in some fashion on the original data. If the pairs form a subset of the data pairs (and thus, m ≤ n), we have edited or condensed nearest neighbor rules. However, the (X′i, Y′i) pairs may be strategically picked outside the original data set. For example, in relabeling (see Section 11.7), m = n, X′i = Xi and Y′i = g′n(Xi), where g′n(Xi) is the k-nn decision at Xi. Under some conditions, the relabeling rule is consistent (see Theorem 11.2). The true objective of prototyping is to extract information from the data by insisting that m be much smaller than n.

Figure 19.2. A 1-nn rule based on 4 prototypes. In this example, all the data points are correctly classified based upon the prototype 1-nn rule.

Chang (1974) describes a rule in which we iterate the following step until a given stopping rule is satisfied: merge the two closest nearest neighbors of the same class and replace both pairs by a new average prototype pair. Kohonen (1988; 1990) recognizes the advantages of such prototyping in general as a device for partitioning R^d—he calls this learning vector quantization. This theme was picked up again by Geva and Sitte (1991), who pick X′1, . . . , X′m as a random subset of X1, . . . , Xn and allow Y′i to be different from Yi. Diverging a bit from Geva and Sitte, we might minimize the empirical error with the prototyped 1-nn rule over all Y′1, . . . , Y′m, where the empirical error is that committed on the remaining data. We show that this simple strategy leads to a Bayes-risk consistent rule whenever m → ∞ and m/n → 0. Note, in particular, that we may take (X′1, . . . , X′m) = (X1, . . . , Xm), and that we "throw away" Y1, . . . , Ym, as these are not used. We may, in fact, use additional data with missing Yi-values for this purpose—the unclassified data are thus efficiently used to partition the space.

Let

    L̂n(gn) = (1/(n−m)) Σ_{i=m+1}^{n} I{gn(Xi) ≠ Yi}

be the empirical risk on the remaining data, where gn is the 1-nn rule based upon (X1, Y′1), . . . , (Xm, Y′m). Let g∗n be the 1-nn rule with the choice of Y′1, . . . , Y′m that minimizes L̂n(gn). Let L(g∗n) denote its probability of error.

Theorem 19.5. L(g∗n) → L∗ in probability for all distributions whenever m → ∞ and m/n → 0.

Proof. There are 2^m different possible functions gn. Thus,

    P{ sup_{gn} |L̂n(gn) − L(gn)| > ε | X1, . . . , Xn }
      ≤ 2^m sup_{gn} P{ |L̂n(gn) − L(gn)| > ε | X1, . . . , Xm }
      ≤ 2^{m+1} e^{−2(n−m)ε²}

by Hoeffding's inequality. Also, writing ḡn for the candidate rule that minimizes the true error probability L(gn) over the 2^m choices of Y′1, . . . , Y′m,

    P{ L(g∗n) > L∗ + 3ε }
      ≤ P{ L(g∗n) − L̂n(g∗n) > ε }
        + P{ L̂n(ḡn) − L(ḡn) > ε }
        + P{ L(ḡn) > L∗ + ε }   (here we used L̂n(ḡn) ≥ L̂n(g∗n))
      ≤ 2 E{ P{ sup_{gn} |L̂n(gn) − L(gn)| > ε | X1, . . . , Xn } } + P{ L(g+n) > L∗ + ε }
        (where g+n is the 1-nn rule based on (X1, Z1), . . . , (Xm, Zm)
        with Zi = I{η(Xi)>1/2} as in Lemma 19.1, so that L(ḡn) ≤ L(g+n))
      ≤ 2^{m+2} e^{−(n−m)ε²} + o(1)   (if m → ∞, by Lemma 19.1)
      = o(1)

if m → ∞ and m/n → 0. □

If we let X′1, . . . , X′m have arbitrary values—not only among those taken by X1, . . . , Xn—then we get a much larger, more flexible class of classifiers. For example, every linear discriminant is nothing but a prototype 1-nn rule with m = 2—just take (X′1, 1), (X′2, 0) and place X′1 and X′2 in the right places. In this sense, prototype 1-nn rules generalize a vast class of rules. The most promising strategy of choosing prototypes is to minimize the empirical error committed on the training sequence Dn. Finding this optimum may be computationally very expensive. Nevertheless, the theoretical properties provided in the next result may provide useful guidance.


Theorem 19.6. Let C be the class of nearest neighbor rules based on prototype pairs (x1, y1), . . . , (xm, ym), m ≥ 3, where the (xi, yi)'s range through R^d × {0, 1}. Given the training data Dn = (X1, Y1), . . . , (Xn, Yn), let gn be the nearest neighbor rule from C minimizing the error estimate

    Ln(φ) = (1/n) Σ_{i=1}^{n} I{φ(Xi) ≠ Yi},   φ ∈ C.

Then for each ε > 0,

    P{ L(gn) − inf_{φ∈C} L(φ) > ε } ≤ 8 · 2^m n^{(d+1)m(m−1)/2} e^{−nε²/128}.

The rule is consistent if m → ∞ such that m² log m = o(n). For d = 1 and d = 2 the probability bound may be improved significantly. For d = 1,

    P{ L(gn) − inf_{φ∈C} L(φ) > ε } ≤ 8 · (2n)^m e^{−nε²/128},

and for d = 2,

    P{ L(gn) − inf_{φ∈C} L(φ) > ε } ≤ 8 · 2^m n^{9m−18} e^{−nε²/128}.

In both cases, the rule is consistent if m → ∞ and m log m = o(n).

Proof. It follows from Theorem 12.6 that

    P{ L(gn) − inf_{φ∈C} L(φ) > ε } ≤ 8 S(C, n) e^{−nε²/128},

where S(C, n) is the n-th shatter coefficient of the class of sets {x : φ(x) = 1}, φ ∈ C. All we need is to find suitable upper bounds for S(C, n). Each classifier φ is a partitioning rule based on the m Voronoi cells defined by x1, . . . , xm. Therefore, S(C, n) is not more than 2^m times the number of different ways n points in R^d can be partitioned by Voronoi partitions defined by m points. In each partition, there are at most m(m − 1)/2 cell boundaries that are subsets of (d − 1)-dimensional hyperplanes. Thus, the sought number is not greater than the number of different ways m(m − 1)/2 hyperplanes can partition n points. By results of Chapter 13, this is at most n^{(d+1)m(m−1)/2}, proving the first inequality.

The other two inequalities follow by sharper bounds on the number of cell boundaries. For d = 1, this is clearly at most m. To prove the third inequality, for each Voronoi partition construct a graph whose vertices are x1, . . . , xm, and two vertices are connected with an edge if and only if their corresponding Voronoi cells are connected. It is easy to see that this graph is planar. But the number of edges of a planar graph with m vertices cannot exceed 3m − 6 (see Nishizeki and Chiba (1988, p.10)), which proves the inequality.

The consistency results follow from the stated inequalities and from the fact that inf_{φ∈C} L(φ) tends to L∗ as m → ∞ (check the proof of Theorem 5.15 again). □

Problems and Exercises

Problem 19.1. Let N ≤ n be the size of the data set after pure condensing. Show that lim inf_{n→∞} EN/n > 0 whenever L∗ > 0. True or false: if L∗ = 0, then EN = o(n). Hint: Consider the real line, and note that all points whose right and left neighbors are of the same class are eliminated.

Problem 19.2. Let (X1, Y1), (X2, Y2), . . . be an i.i.d. sequence of pairs of random variables in R^d × {0, 1} with P{Y1 = 1 | X1 = x} = η(x). Let (X′, Z) be the first pair (Xi, Yi) in the sequence such that Yi = I{η(Xi)>1/2}. Show that the distribution µ′ of X′ is absolutely continuous with respect to the common distribution µ of the Xi's, with density (i.e., Radon-Nikodym derivative)

    (dµ′/dµ)(x) = (1 − min(η(x), 1 − η(x))) / (1 − L∗),

where L∗ is the Bayes error corresponding to (X1, Y1). Let Y be a {0, 1}-valued random variable with P{Y = 1 | X′ = x} = η(x). If L′ denotes the Bayes error corresponding to (X′, Y), then show that L′ ≤ L∗.

Problem 19.3. Consider the following edited nn rule. The pair (Xi, Yi) is eliminated from the training sequence if the k-nn rule (based on the remaining n − 1 pairs) incorrectly classifies Xi. The 1-nn rule is used with the edited data. Show that this rule is consistent whenever X has a density if k → ∞ and k/n → 0. Related papers: Wilson (1972), Wagner (1973), Penrod and Wagner (1977), and Devijver and Kittler (1980).


20 Tree Classifiers

Classification trees partition R^d into regions, often hyperrectangles parallel to the axes. Among these, the most important are the binary classification trees, since they have just two children per node and are thus easiest to manipulate and update. We recall the simple terminology of books on data structures. The top of a binary tree is called the root. Each node has either no child (in that case it is called a terminal node or leaf), a left child, a right child, or a left child and a right child. Each node is the root of a tree itself. The trees rooted at the children of a node are called the left and right subtrees of that node. The depth of a node is the length of the path from the node to the root. The height of a tree is the maximal depth of any node.

Trees with more than two children per node can be reduced to binary trees by a simple device—just associate a left child with each node by selecting the oldest child in the list of children. Call the right child of a node its next sibling (see Figures 20.2 and 20.3). The new binary tree is called the oldest-child/next-sibling binary tree (see, e.g., Cormen, Leiserson, and Rivest (1990) for a general introduction). We only mention this particular mapping because it enables us to only consider binary trees for simplicity.

Figure 20.1. A binary tree.

Figure 20.2. Ordered tree: The children are ordered from oldest to youngest.

Figure 20.3. The corresponding binary tree.

In a classification tree, each node represents a set in the space R^d. Also, each node has exactly two or zero children. If a node u represents the set A and its children u′, u′′ represent A′ and A′′, then we require that A = A′ ∪ A′′ and A′ ∩ A′′ = ∅. The root represents R^d, and the leaves, taken together, form a partition of R^d.

Assume that we know x ∈ A. Then the question "is x ∈ A′?" should be answered in a computationally simple manner so as to conserve time. Therefore, if x = (x(1), . . . , x(d)), we may just limit ourselves to questions of the following forms:

(i) Is x(i) ≤ α? This leads to ordinary binary classification trees with partitions into hyperrectangles (see the sketch after this list).

(ii) Is a1x(1) + · · · + adx(d) ≤ α? This leads to bsp trees (binary space partition trees). Each decision is more time consuming, but the space is more flexibly cut up into convex polyhedral cells.

(iii) Is ‖x − z‖ ≤ α? (Here z is a point of R^d, to be picked for each node.) This induces a partition into pieces of spheres. Such trees are called sphere trees.

(iv) Is ψ(x) ≥ 0? Here, ψ is a nonlinear function, different for each node. Every classifier can be thought of as being described in this format—decide class one if ψ(x) ≥ 0. However, this misses the point, as tree classifiers should really be built up from fundamental atomic operations and queries such as those listed in (i)–(iii). We will not consider such trees any further.
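As an illustration of queries of type (i), here is a minimal Python sketch (ours) of an ordinary binary classification tree: internal nodes store a coordinate index and a threshold, leaves store a label, and classification descends from the root.

    class TreeNode:
        """A node of an ordinary binary classification tree (perpendicular splits)."""
        def __init__(self, label=None, coord=None, threshold=None, left=None, right=None):
            self.label = label          # class label if this node is a leaf
            self.coord = coord          # which component x^{(i)} is queried
            self.threshold = threshold  # the cut value alpha
            self.left = left            # subtree for x^{(coord)} <= alpha
            self.right = right          # subtree for x^{(coord)} > alpha

        def classify(self, x):
            if self.label is not None:  # leaf: return its class
                return self.label
            child = self.left if x[self.coord] <= self.threshold else self.right
            return child.classify(x)

    # Example: split R^2 at x^{(1)} = 0, then the right cell at x^{(2)} = 1.
    root = TreeNode(coord=0, threshold=0.0,
                    left=TreeNode(label=0),
                    right=TreeNode(coord=1, threshold=1.0,
                                   left=TreeNode(label=1), right=TreeNode(label=0)))
    print(root.classify((0.5, 0.3)))   # -> 1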

Figure 20.4. Partition induced by an ordinary binary tree.

Figure 20.5. Corresponding tree.

Figure 20.6. Partition induced by a bsp tree.

Figure 20.7. Partition induced by a sphere tree.

We associate a class in some manner with each leaf in a classification tree. The tree structure is usually data dependent, as well, and indeed, it is in the construction itself where methods differ. If a leaf represents region A, then we say that the classifier gn is natural if

    gn(x) = 1 if Σ_{i: Xi∈A} Yi > Σ_{i: Xi∈A} (1 − Yi), x ∈ A,
    gn(x) = 0 otherwise.

That is, in every leaf region, we take a majority vote over all (Xi, Yi)'s with Xi in the same region. Ties are broken, as usual, in favor of class 0. In this set-up, natural tree classifiers are but special cases of data-dependent partitioning rules. The latter are further described in detail in Chapter 21.


Figure 20.8. A natural classifier based on an ordinary binary tree. The decision is 1 in regions where points with label 1 form a majority. These areas are shaded.

Regular histograms can also be thought of as natural binary tree classifiers—the construction and relationship is obvious. However, as n → ∞, histograms change size, and usually, histogram partitions are not nested as n grows. Trees offer the exciting perspective of fully dynamic classification—as data are added, we may update the tree slightly, say, by splitting a leaf or so, to obtain an updated classifier.

The most compelling reason for using binary tree classifiers is to explain complicated data and to have a classifier that is easy to analyze and understand. In fact, expert system design is based nearly exclusively upon decisions obtained by going down a binary classification tree. Some argue that binary classification trees are preferable over bsp trees for this simple reason. As argued in Breiman, Friedman, Olshen, and Stone (1984), trees allow mixing component variables that are heterogeneous—some components may be of a nonnumerical nature, others may represent integers, and still others may be real numbers.

20.1 Invariance

Nearly all rules in this chapter and in Chapters 21 and 30 show some sort of invariance with respect to certain transformations of the input. This is often a major asset in pattern recognition methods. We say a rule gn is invariant under transformation T if

    gn(T(x); T(X1), Y1, . . . , T(Xn), Yn) = gn(x; X1, Y1, . . . , Xn, Yn)

for all values of the arguments. In this sense, we may require translation invariance, rotation invariance, linear transformation invariance, and monotone transformation invariance (T(·) maps each coordinate separately by a strictly increasing but possibly nonlinear function).

Monotone transformation invariance frees us from worries about the kind of measuring unit. For example, it would not matter whether earthquakes were measured on a logarithmic (Richter) scale or a linear scale. Rotation invariance matters of course in situations in which input data have no natural coordinate axis system. In many cases, data are of the ordinal form—colors and names spring to mind—and ordinal values may be translated into numeric values by creating bit vectors. Here, distance loses its physical meaning, and any rule that uses ordinal data perhaps mixed in with numerical data should be monotone transformation invariant.

Tree methods that are based upon perpendicular splits are usually (but not always) monotone transformation invariant and translation invariant. Tree methods based upon linear hyperplane splits are sometimes linear transformation invariant.

The partitions of space cause some problems if the data points can line up along hyperplanes. This is just a matter of housekeeping, of course, but the fact that some projections of X to a line have atoms or some components of X have atoms will make the proofs heavier to digest. For this reason only, we assume throughout this chapter that X has a density f. As typically no conditions are put on f in our consistency theorems, it will be relatively easy to generalize them to all distributions. The density assumption affords us the luxury of being able to say that, with probability one, no d + 1 points fall in a hyperplane, no d points fall in a hyperplane perpendicular to one axis, no d − 1 points fall in a hyperplane perpendicular to two axes, etcetera. If a rule is monotone transformation invariant, we can without harm transform all the data as follows for the purpose of analysis only. Let f1, . . . , fd be the marginal densities of X (see Problem 20.1), with corresponding distribution functions F1, . . . , Fd. Then replace in the data each Xi by T(Xi), where

    T(x) = (F1(x(1)), . . . , Fd(x(d))).

Each component of T(Xi) is now uniformly distributed on [0, 1]. Of course, as we do not know T beforehand, this device could only be used in the analysis. The transformation T will be called the uniform marginal transformation. Observe that the original density is now transformed into another density.
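The analysis uses the true marginal distribution functions F1, . . . , Fd, which are unknown in practice. Purely as an illustration of the idea, the empirical analogue of T is easy to compute (a Python sketch, ours):

    from bisect import bisect_right

    def empirical_marginal_transform(data):
        """Map each coordinate through its empirical distribution function.

        data is a list of d-dimensional points (tuples).  The transformation T of
        the text uses the unknown marginals F_1, ..., F_d; this empirical version
        serves only to illustrate the device.
        """
        d = len(data[0])
        n = len(data)
        sorted_coords = [sorted(x[i] for x in data) for i in range(d)]
        def transform(x):
            # fraction of data points whose i-th coordinate is <= x^{(i)}
            return tuple(bisect_right(sorted_coords[i], x[i]) / n for i in range(d))
        return [transform(x) for x in data], transform

A monotone transformation invariant rule produces the same decisions whether it is fed the original or the transformed coordinates.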

20.2 Trees with the X-Property

It is possible to prove the convergence of many tree classifiers all at once. What is needed, clearly, is a partition into small regions, yet most majority votes should be over a sufficiently large sample. In many of the cases considered, the form of the tree is determined by the Xi's only, that is, the labels Yi do not play a role in constructing the partition, but they are used in voting. This is of course rather simplistic, but as a start, it is very convenient. We will say that the classification tree has the X-property, for lack of a better mnemonic. Let the leaf regions be A1, . . . , AN (with N possibly random). Define Nj as the number of Xi's falling in Aj. As the leaf regions form a partition, we have Σ_{j=1}^{N} Nj = n. By diam(Aj) we mean the diameter of the cell Aj, that is, the maximal distance between two points of Aj. Finally, decisions are taken by majority vote, so for x ∈ Aj, 1 ≤ j ≤ N, the rule is

    gn(x) = 1 if Σ_{i: Xi∈Aj} Yi > Σ_{i: Xi∈Aj} (1 − Yi), x ∈ Aj,
    gn(x) = 0 otherwise.

A(x) denotes the set of the partition A1, . . . , AN into which x falls, and N(x) is the number of data points falling in this set. Recall that the general consistency result given in Theorem 6.1 is applicable in such cases. Consider a natural classification tree as defined above and assume the X-property. Theorem 6.1 states that then ELn → L∗ if

(1) diam(A(X)) → 0 in probability,
(2) N(X) → ∞ in probability.

A more general, but also more complicated consistency theorem is proved in Chapter 21. Let us start with the simplest possible example. We verify the conditions of Theorem 6.1 for the k-spacing rule in one dimension. This rule partitions the real line by using the k-th, 2k-th, (and so on) order statistics (Mahalanobis (1961); see also Parthasarathy and Bhattacharya (1961)).

Figure 20.9. A 3-spacing classifier.

Formally, let k < n be a positive integer, and let X(1), X(2), . . . , X(n) bethe order statistics of the data points. Recall that X(1), X(2), . . . , X(n) areobtained by permuting X1, . . . , Xn in such a way that X(1) ≤ X(2) ≤ · · · ≤X(n). Note that this ordering is unique with probability one as X has adensity. We partition R into N intervals A1, . . . , AN , where N = dn/ke,such that for j = 1, . . . , N − 1, Aj satisfies

X(k(j−1)+1), . . . , X(kj) ∈ Aj ,

and the rightmost cell AN satisfies

X(k(N−1)+1), . . . , X(n) ∈ AN .

We have not specified the endpoints of each cell of the partition. For simplicity, let the borders between Aj and Aj+1 be put halfway between the rightmost data point in Aj and the leftmost data point in Aj+1, j = 1, . . . , N − 1.

The classification rule gn is defined in the usual way:

gn(x) =
  1 if ∑_{i=1}^{n} I{Xi∈A(x), Yi=1} > ∑_{i=1}^{n} I{Xi∈A(x), Yi=0},
  0 otherwise.
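A minimal Python sketch of the k-spacing rule may be helpful; it is our illustration (the function names are ours), assuming the data are real-valued numpy arrays and, as above, that ties occur with probability zero.

    import numpy as np

    def k_spacing_classifier(x_train, y_train, k):
        # Cells are consecutive blocks of k order statistics; borders are put
        # halfway between neighbouring blocks, and each cell votes by majority.
        order = np.argsort(x_train)
        xs, ys = x_train[order], y_train[order]
        n = len(xs)
        n_cells = int(np.ceil(n / k))
        borders = [(xs[j * k - 1] + xs[j * k]) / 2 for j in range(1, n_cells)]

        def predict(x):
            j = np.searchsorted(borders, x)        # cell containing x
            lo, hi = j * k, min((j + 1) * k, n)    # block of order statistics in that cell
            return int(ys[lo:hi].sum() > (hi - lo) / 2)   # ties go to class 0

        return predict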


Theorem 20.1. Let gn be the k-spacing classifier given above. Assume that the distribution of X has a density f on R. Then the classification rule gn is consistent, that is, lim_{n→∞} ELn = L∗, if k → ∞ and k/n → 0 as n tends to infinity.

Remark. We discuss various generalizations of this rule in Chapter 21. □

Proof. We check the conditions of Theorem 6.1, as the partition has the X-property. Condition (2) is obvious from k → ∞.

To establish condition (1), fix ε > 0. Note that by the invariance of the rule under monotone transformations, we may assume without loss of generality that f is the uniform density on [0, 1]. Among the intervals A1, . . . , AN, there are at most 1/ε disjoint intervals of length greater than ε in [0, 1]. Thus,

P{diam(A(X)) > ε} ≤ (1/ε) E{ max_{1≤j≤N} µ(Aj) }
  ≤ (1/ε) E{ max_{1≤j≤N} µn(Aj) + max_{1≤j≤N} |µ(Aj) − µn(Aj)| }
  ≤ (1/ε) ( k/n + E{ sup_A |µ(A) − µn(A)| } ),

where the supremum is taken over all intervals in R. The first term within the parentheses converges to zero by the second condition of the theorem, while the second term goes to zero by an obvious extension of the classical Glivenko-Cantelli theorem (Theorem 12.4). This completes the proof. □

We will encounter several trees in which the partition is determined by a small fraction of the data, such as binary k-d trees and quadtrees. In these cases, condition (2) of Theorem 6.1 may be verified with the help of the following lemma:

Lemma 20.1. Let p1, . . . , pk be a probability vector. Let N1, . . . , Nk be multinomially distributed random variables with parameters n and p1, . . . , pk. Then if the random variable X is independent of N1, . . . , Nk, and P{X = i} = pi, we have for any M,

P{NX ≤ M} ≤ (2M + 4)k/n.

(Note: this probability goes to 0 if k/n → 0, uniformly over all probability vectors with k components!)


Proof. Let Zi be binomial with parameters n and pi. Then

P{NX ≤ M} ≤ ∑_{i: npi≤2M} pi + ∑_{i: npi>2M} pi P{Zi ≤ M}
  ≤ 2Mk/n + ∑_{i: npi>2M} pi P{Zi − EZi ≤ M − npi}
  ≤ 2Mk/n + ∑_{i: npi>2M} pi P{Zi − EZi ≤ −EZi/2}
  ≤ 2Mk/n + ∑_{i: npi>2M} 4 pi Var{Zi}/(EZi)^2   (by Chebyshev's inequality)
  ≤ 2Mk/n + ∑_{i: npi>2M} 4 pi · 1/(npi)
  ≤ (2M + 4)k/n. □

The previous lemma implies that for any binary tree classifier constructed on the basis of X1, . . . , Xk with k + 1 regions, N(X) → ∞ in probability whenever k/(n − k) → 0 (i.e., k/n → 0). It suffices to note that we may take M arbitrarily large but fixed in Lemma 20.1. This remark saves us the trouble of having to verify just how large or small the probability mass of the region is. In fact, it also implies that we should not worry so much about regions with few data points. What matters more than anything else is the number of regions. Stopping rules based upon cardinalities of regions can effectively be dropped in many cases!
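The bound of Lemma 20.1 is also easy to probe numerically. The following sketch is ours (the probability vector is drawn at random purely for illustration): it estimates P{NX ≤ M} by simulation and compares it with (2M + 4)k/n.

    import numpy as np

    rng = np.random.default_rng(0)

    def check_lemma_20_1(n=10000, k=200, M=5, trials=2000):
        # Estimate P{N_X <= M} for a multinomial (n; p1, ..., pk) and an
        # independent X with P{X = i} = pi, and return it next to the bound.
        p = rng.dirichlet(np.ones(k))
        hits = 0
        for _ in range(trials):
            counts = rng.multinomial(n, p)
            X = rng.choice(k, p=p)       # X independent of the counts
            hits += counts[X] <= M
        return hits / trials, (2 * M + 4) * k / n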

20.3 Balanced Search Trees

Balanced multidimensional search trees are computationally attractive. Binary trees with n leaves have O(log n) height, for example, when at each node, the size of every subtree is at least α times the size of the other subtree rooted at the parent, for some constant α > 0. It is thus important to verify the consistency of balanced search trees used in classification. We again consider binary classification trees with the X-property and majority votes over the leaf regions. Take for example a tree in which we split every node perfectly, that is, if there are n points, we find the median according to one coordinate, and create two subtrees of sizes ⌊(n − 1)/2⌋ and ⌈(n − 1)/2⌉. The median itself stays behind and is not sent down to the subtrees. Repeat this for k levels of nodes, at each level cutting along the next coordinate axis in a rotational manner. This leads to 2^k leaf regions, each having at least n/2^k − k points and at most n/2^k points. Such a tree will be called a median tree.

Figure 20.10. Median tree with four leaf regions in R^2.

Setting up such a tree is very easy, and hence such trees may appeal to certain programmers. In hypothesis testing, median trees were studied by Anderson (1966).
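The following Python sketch (ours, not the book's code) grows the median-tree partition just described, assuming the data are given as an n × d numpy array; labels would simply be carried along for the majority votes in the leaf regions.

    import numpy as np

    def median_tree_partition(points, k, depth=0):
        # Split at the empirical median of one coordinate, rotating through
        # the axes, for k levels; the median point stays at the node and is
        # not sent down. Returns the list of leaf point sets.
        if k == 0 or len(points) == 0:
            return [points]
        axis = depth % points.shape[1]
        order = np.argsort(points[:, axis])
        m = (len(points) - 1) // 2            # index of the median point
        left = points[order[:m]]              # floor((n-1)/2) points
        right = points[order[m + 1:]]         # ceil((n-1)/2) points
        return (median_tree_partition(left, k - 1, depth + 1)
                + median_tree_partition(right, k - 1, depth + 1))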

Theorem 20.2. Natural classifiers based upon median trees with k levels (2^k leaf regions) are consistent (ELn → L∗) whenever X has a density, if

n/(k 2^k) → ∞ and k → ∞.

(Note: the conditions on k are fulfilled if k ≤ log2 n − 2 log2 log2 n and k → ∞.)

We may prove the theorem by checking the conditions of Theorem 6.1. Condition (2) follows trivially by the fact that each leaf region contains at least n/2^k − k points and the condition n/(k 2^k) → ∞. Thus, we need only verify the first condition of Theorem 6.1. To make the proof more transparent, we first analyze a closely related hypothetical tree, the theoretical median tree. Also, we restrict the analysis to d = 2. The multidimensional extension is straightforward. The theoretical median tree rotates through the coordinates and cuts each hyperrectangle precisely so that the two new hyperrectangles have equal µ-measure.

Figure 20.11. Theoretical median tree with three levels of cuts.

Observe that the rule is invariant under monotone transformations of the coordinate axes. Recall that in such cases there is no harm in assuming that the marginal distributions are all uniform on [0, 1]. We let Hi, Vi denote the horizontal and vertical sizes of the rectangles after k levels of cuts. Of course, we begin with H1 = V1 = 1 when k = 0. We now show that, for the theoretical median tree, diam(A(X)) → 0 in probability, as k → ∞. Note that diam(A(X)) ≤ H(X) + V(X), where H(X) and V(X) are the horizontal and vertical sizes of the rectangle A(X). We show that if k is even,

E{H(X) + V(X)} = 2/2^{k/2},

from which the claim follows. After the k-th round of splits, since all 2^k rectangles have equal probability measure, we have

E{H(X) + V(X)} = ∑_{i=1}^{2^k} (1/2^k)(Hi + Vi).

Apply another round of splits, all vertical. Then each term (1/2^k)(Hi + Vi) spawns, so to speak, two new rectangles with horizontal and vertical sizes (H′i, Vi) and (H′′′i, Vi) with Hi = H′i + H′′′i that contribute

(1/2^{k+1})(H′i + Vi) + (1/2^{k+1})(H′′′i + Vi) = (1/2^{k+1}) Hi + (1/2^k) Vi.

The next round yields horizontal splits, with total contribution now (see Figure 20.12)

(1/2^{k+2})(H′i + V′i + H′i + V′′i + H′′′i + V′′′i + H′′′i + V′′′′i) = (1/2^{k+2})(2Hi + 2Vi) = (1/2^{k+1})(Hi + Vi).

Thus, over two iterations of splits, we see that E{H(X) + V(X)} is halved, and the claim follows by simple induction.

We show now what happens in the real median tree when cuts are based upon a random sample. We deviate of course from the theoretical median tree, but consistency is preserved. The reason, seen intuitively, is that if the number of points in a cell is large, then the sample median will be close to the theoretical median, so that the shrinking-diameter property is preserved. The methodology followed here shows how one may approach the analysis in general by separating the theoretical model from the sample-based model.

Proof of Theorem 20.2. As we noted before, all we have to show is that diam(A(X)) → 0 in probability. Again, we assume without loss of generality that the marginals of X are uniform [0, 1], and that d = 2. Again, we show that E{H(X) + V(X)} → 0.


Figure 20.12. A rectangle after two rounds of splits.

If a rectangle of probability mass pi and sizes Hi, Vi is split into four rectangles as in Figure 20.12, with probability masses p′i, p′′i, p′′′i, p′′′′i, then the contribution pi(Hi + Vi) to E{H(X) + V(X)} becomes

p′i(H′i + V′i) + p′′i(H′′i + V′′i) + p′′′i(H′′′i + V′′′i) + p′′′′i(H′′′′i + V′′′′i)

after two levels of cuts. This does not exceed

(pi/2)(1 + ε)(Hi + Vi),

if

max(p′i + p′′i, p′′′i + p′′′′i) ≤ (1/2)√(1 + ε) pi,
max(p′i, p′′i) ≤ (1/2)√(1 + ε) (p′i + p′′i),
and
max(p′′′i, p′′′′i) ≤ (1/2)√(1 + ε) (p′′′i + p′′′′i),

that is, when all three cuts are within (1/2)√(1 + ε) of the true median. We call such "(1/2)√(1 + ε)" cuts good. If all cuts are good, we thus note that in two levels of cuts, E{H(X) + V(X)} is reduced by a factor of (1 + ε)/2. Also, all pi's decrease at a controlled rate. Let G be the event that all 1 + 2 + · · · + 2^{k−1} cuts in a median tree with k levels are good. Then, at level k, all pi's are at most (√(1 + ε)/2)^k. Thus,

∑_{i=1}^{2^k} pi(Hi + Vi) ≤ (√(1 + ε)/2)^k ∑_{i=1}^{2^k} (Hi + Vi)
  ≤ (3 + 2^{k/2+1}) (√(1 + ε)/2)^k,

since ∑_{i=1}^{2^k} (Hi + Vi) ≤ 4 + 2 + 4 + 8 + · · · + 2^{k/2} ≤ 2^{k/2+1} + 3 if k is even. Hence, after k levels of cuts,

E{H(X) + V(X)} ≤ (3 + 2^{k/2+1}) P{G^c} + (3 + 2^{k/2+1}) (√(1 + ε)/2)^k.


The last term tends to zero if ε is small enough. We bound P{G^c} by 2^k times the probability that one cut is bad. Let us cut a cell with N points and probability content p in a given direction. A quick check of the median tree shows that given the position and size of the cell, the N points inside the cell are distributed in an i.i.d. manner according to the restriction of µ to the cell. After the cut, we have ⌊(N − 1)/2⌋ and ⌈(N − 1)/2⌉ points in the new cells, and probability contents p′ and p′′. It is clear that we may assume without loss of generality that p = 1. Thus, if all points are projected down in the direction of the cut, and F and FN denote the distribution function and empirical distribution function of the obtained one-dimensional data, then

P{cut is not good | N}
  ≤ P{ p′ > √(1 + ε)/2 or p′′ > √(1 + ε)/2 | N }
  ≤ P{ for some x, F(x) > √(1 + ε)/2 and FN(x) ≤ 1/2 | N }
  ≤ P{ sup_x (F(x) − FN(x)) > (1/2)(√(1 + ε) − 1) | N }
  ≤ 2 exp(−(1/2) N (√(1 + ε) − 1)^2)   (by Theorem 12.9)
  ≤ 2 exp(−(n/2^{k+1} − k/2)(√(1 + ε) − 1)^2)   (as N ≥ n/2^k − k).

Hence, for n large enough,

P{G^c} ≤ 2^{k+1} e^{−(n/2^{k+2})(√(1+ε)−1)^2},

and 2^{k/2} P{G^c} → 0 if n/(k 2^k) → ∞. □

20.4 Binary Search Trees

The simplest trees to analyze are those whose structure depends in a straightforward way on the data. To make this point, we begin with the binary search tree and its multivariate extension, the k-d tree (see Cormen, Leiserson, and Rivest (1990) for the binary search tree; for multivariate binary trees, we refer to Samet (1990b)). A full-fledged k-d tree is defined as follows: we promote the first data point X1 to the root and partition X2, . . . , Xn into two sets: those whose first coordinate exceeds that of X1, and the remaining points. Within each set, points are ordered by original index. The former set is used to build the right subtree of X1 and the latter to construct the left subtree of X1. For each subtree, the same construction is applied recursively with only one variant: at depth l in the tree, the (l mod d + 1)-st coordinate is used to split the data. In this manner, we rotate through the coordinate axes periodically.
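A small Python sketch of this construction follows (our illustration, not the book's code; class and function names are ours).

    import numpy as np

    class KDNode:
        def __init__(self, point):
            self.point = point
            self.left = None
            self.right = None

    def kd_insert(root, x, depth=0, d=2):
        # Cut along coordinate (depth mod d); points with a larger coordinate
        # than the node go to the right subtree, the others to the left.
        if root is None:
            return KDNode(x)
        axis = depth % d
        if x[axis] > root.point[axis]:
            root.right = kd_insert(root.right, x, depth + 1, d)
        else:
            root.left = kd_insert(root.left, x, depth + 1, d)
        return root

    def build_kd_tree(points, d=2):
        # Full-fledged k-d tree: insert X1, ..., Xn in their original order.
        root = None
        for x in points:
            root = kd_insert(root, x, 0, d)
        return root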

Attach to each leaf two new nodes, and to each node with one child a second child. Call these new nodes external nodes. Each of the n + 1 external nodes corresponds to a region of R^d, and collectively the external nodes define a partition of R^d (if we define exactly what happens on the boundaries between regions).

Figure 20.13. A k-d tree of 15 random points on the plane and the induced partition.

Put differently, we may look at the external nodes as the leaves of a new tree with 2n + 1 nodes and declare this new tree to be our new binary classification tree. As there are n + 1 leaf regions and n data points, the natural binary tree classifier it induces is degenerate: indeed, all external regions contain very few points. Clearly, we must have a mechanism for trimming the tree to ensure better populated leaves. Let us look at just three naive strategies. For convenience, we assume that the data points determining the tree are not counted when taking a majority vote over the cells. As the number of these points is typically much smaller than n, this restriction does not make a significant difference.

(1) Fix k < n, and construct a k-d tree with k internal nodes and k + 1 external nodes, based on the first k data points X1, . . . , Xk. Classify by majority vote over all k + 1 regions as in natural classification trees (taking the data pairs (Xk+1, Yk+1), . . . , (Xn, Yn) into account). Call this the chronological k-d tree; a short code sketch of this strategy follows the list.

(2) Fix k and truncate the k-d tree to k levels. All nodes at level k are declared leaves and classification is again by majority vote over the leaf regions. Call this the deep k-d tree.

(3) Fix k and trim the tree so that each node represents at least k points in the original construction. Consider the maximal such tree. The number of regions here is random, with between 1 and n/k regions. Call this the well-populated k-d tree.
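For strategy (1), the resulting classifier can be sketched as follows; this reuses the KDNode/build_kd_tree helpers sketched earlier in this section and identifies an external region by the sequence of left/right moves that leads to it (our illustration, not the book's code).

    def kd_region(root, x, d=2):
        # External region of x in the tree built from X1, ..., Xk: the path
        # of left/right moves followed until an empty child is reached.
        path, node, depth = [], root, 0
        while node is not None:
            go_right = x[depth % d] > node.point[depth % d]
            path.append(go_right)
            node = node.right if go_right else node.left
            depth += 1
        return tuple(path)

    def chronological_kd_classify(points, labels, k, x, d=2):
        # The partition is built from X1, ..., Xk only; the pairs
        # (X_{k+1}, Y_{k+1}), ..., (X_n, Y_n) vote in the region of x
        # (ties go to class 0).
        root = build_kd_tree(points[:k], d)
        region = kd_region(root, x, d)
        votes = [y for xi, y in zip(points[k:], labels[k:])
                 if kd_region(root, xi, d) == region]
        return int(sum(votes) > len(votes) / 2)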


Let the leaf regions be A1, . . . , AN (with N possibly random), and denote the leaf nodes by u1, . . . , uN. The strict descendants of ui in the full k-d tree have indices that we will collect in an index set Ii. Define |Ii| = Ni. As the leaf regions form a partition, we have

∑_{i=1}^{N} Ni = n − N,

because the leaf nodes themselves are not counted in Ii. Voting is by majority vote, so the rule is

gn(x) =
  1 if ∑_{j∈Ii} Yj > ∑_{j∈Ii} (1 − Yj), x ∈ Ai,
  0 otherwise.

20.5 The Chronological k-d Tree

Here we have N = k + 1. Also, (µ(A1), . . . , µ(Ak+1)) are distributed as uniform spacings. That is, if U1, . . . , Uk are i.i.d. uniform [0, 1] random variables defining k + 1 spacings U(1), U(2) − U(1), . . . , U(k) − U(k−1), 1 − U(k) by their order statistics U(1) ≤ U(2) ≤ · · · ≤ U(k), then these spacings are jointly distributed as (µ(A1), . . . , µ(Ak+1)). This can be shown by induction. When Uk+1 is added, Uk+1 first picks a spacing with probability equal to the size of the spacing. Then it cuts that spacing in a uniform manner. As the same is true when the chronological k-d tree grows by one leaf, the property follows by induction on k.

Theorem 20.3. We have ELn → L∗ for all distributions of X with a density for the chronological k-d tree classifier whenever

k → ∞ and k/n → 0.

Proof. We verify the conditions of Theorem 6.1. As the number of regions is k + 1, and the partition is determined by the first k data points, condition (2) immediately follows from Lemma 20.1 and the remark following it.

Condition (1) of Theorem 6.1 requires significantly more work. We verify condition (1) for d = 2, leaving the straightforward extension to R^d, d > 2, to the reader. Throughout, we assume without loss of generality that the marginal distributions are uniform [0, 1]. We may do so by the invariance properties discussed earlier. Fix a point x ∈ R^d. We insert points X1, . . . , Xk into an initially empty k-d tree and let R1, . . . , Rk be the rectangles containing x just after these points were inserted. Note that R1 ⊇ R2 ⊇ · · · ⊇ Rk. Assume for simplicity that the integer k is a perfect cube and set l = k^{1/3}, m = k^{2/3}. Define the distances from x to the sides of Ri by Hi, H′i, Vi, V′i (see Figure 20.14).


Figure 20.14. A rectangle Ri containing x with its distances to the sides.

We construct a sequence of events that forces the diameter of Rk to be small with high probability. Let ε be a small positive number to be specified later.

Denote the four squares with opposite vertices x, x + (±ε, ±ε) by C1, C2, C3, C4. Then define the following events:

E1 = ∩_{i=1}^{4} {Ci ∩ {X1, . . . , Xl} ≠ ∅},
E2 = {max(Vl, V′l, Hl, H′l) ≤ ε},
E3 = {min(max(Vl, V′l), max(Hl, H′l)) ≤ ε < max(Vl, V′l, Hl, H′l)},
E4 = {at least three of Vm, V′m, Hm, H′m are ≤ ε},
E5 = {max(Vm, V′m, Hm, H′m) ≤ ε},
E6 = {max(Vk, V′k, Hk, H′k) ≤ ε}.

If E2, E5, or E6 holds, then diam(Rk) ≤ ε√8. Assume that we find a set B ⊂ R^d such that P{X ∈ B} = 1, and for all x ∈ B and sufficiently small ε > 0, P{E2 ∪ E5 ∪ E6} → 1 as k → ∞. Then, by the Lebesgue dominated convergence theorem, diam(A(X)) → 0 in probability, and condition (1) of Theorem 6.1 would follow. In the remaining part of the proof we define such a set B, and show that P{E2 ∪ E5 ∪ E6} → 1. This will require some work.

We define the set B in terms of the density f of X: x ∈ B if and only if

(1) min_{1≤i≤4} ∫_{Ci} f(z) dz > 0 for all ε > 0 small enough;
(2) the infimum, over all rectangles R containing x of diameter ≤ ε√8, of ∫_R f(z) dz / λ(R) is at least f(x)/2 for all ε > 0 small enough;
(3) f(x) > 0.

That P{X ∈ B} = 1 follows from a property of the support (Problem 20.6), a corollary of the Jessen-Marcinkiewicz-Zygmund theorem (Corollary A.2; this implies (2) for almost all x), and the fact that for µ-almost all x, f(x) > 0.


It is easy to verify the following:

P{E6^c} ≤ P{E1^c} + P{E1 ∩ E2^c ∩ E3^c} + P{E1 ∩ E2^c ∩ E3 ∩ E4^c} + P{E1 ∩ E2^c ∩ E3 ∩ E4 ∩ E5^c ∩ E6^c}.

We show that each term tends to 0 at x ∈ B.

Term 1. By the union bound,

P{E1^c} ≤ ∑_{i=1}^{4} P{X1 ∉ Ci, . . . , Xl ∉ Ci}
  ≤ ∑_{i=1}^{4} (1 − µ(Ci))^l
  ≤ 4 exp(−l min_{1≤i≤4} µ(Ci)) → 0,

by part (1) of the definition of B.

Term 2. On E1 ∩ E2^c, the event E3 must hold, by a simple geometric argument. Hence P{E1 ∩ E2^c ∩ E3^c} = 0.

Term 3. To show that P{E1 ∩ E2^c ∩ E3 ∩ E4^c} → 0, we assume without loss of generality that max(Vl, V′l) ≤ ε while max(Hl, H′l) = a > ε. Let X′i be a subset of {Xi, l < i ≤ m} consisting of those Xi's that fall in Rl. We introduce three notions in this sequence: first, Zi is the absolute value of the difference of the x^(2)-coordinates of x and X′i. Let Wi be the absolute value of the difference of the x^(1)-coordinates of x and X′i. We re-index the sequence X′i (and Wi and Zi) so that i runs from 1 to N, where

N = ∑_{i=l+1}^{m} I{Xi ∈ Rl}.

To avoid trivialities, assume that N ≥ 1 (this will be shown to happen with probability tending to one). Call X′i a record if Zi = min(Z1, . . . , Zi). Call X′i a good point if Wi ≤ ε. An X′i causes min(Hm, H′m) ≤ ε if that X′i is a good point and a record, and if it defines a vertical cut. The alternating nature of the cuts makes our analysis a bit heavier than needed. We show here what happens when all directions are picked independently of each other, leaving the rotating-cuts case to the reader (Problem 20.8). Thus, if we set

Si = I{X′i is a record, X′i defines a vertical cut},


we have

E1 ∩ E2^c ∩ E3 ∩ E4^c ⊆ {N = 0} ∪ ({∑_{i=1}^{N} Si I{Wi≤ε} = 0} ∩ {N > 0}).

Re-index again and let X′1, X′2, . . . all be records. Note that given X′i, X′_{i+1} is distributed according to f restricted to the rectangle R′ of

height min(Vl, Zi) above x,
height min(V′l, Zi) below x,
width Hl to the left of x,
width H′l to the right of x.

Call these four quantities v, v′, h, h′, respectively. Then

P{W_{i+1} ≤ ε | X′i} ≥ ((v + v′) ε f(x)/2) / (v + v′) = ε f(x)/2,

because the marginal distribution of an independent X1 is uniform and thus P{X1 ∈ R′ | R′} ≤ v + v′, while, by property (2) of B, P{X1 ∈ R′, W1 ≤ ε | R′} ≥ (v + v′) ε f(x)/2.

Recall the re-indexing. Let M be the number of records (thus, M is the length of our sequence X′i). Then

E1 ∩ E2^c ∩ E3 ∩ E4^c ∩ {N > 0} ⊆ {∑_{i=1}^{M} I{Wi≤ε} I{X′i is a vertical cut} = 0}.

But as cuts have independently picked directions, and since P{W_{i+1} ≤ ε | X′i} ≥ ε f(x)/2, we see that

P{E1 ∩ E2^c ∩ E3 ∩ E4^c, N > 0} ≤ E{ (1 − ε f(x)/4)^M I{N>0} }.

We rewrite M = ∑_{i=1}^{N} I{X′i is a record} and recall that the indicator variables in this sum are independent and are of mean 1/i (Problem 20.9). Hence, for c > 0,

E{c^M} = E{ ∏_{i=1}^{N} (c/i + 1 − 1/i) }
  ≤ E{ e^{−(1−c) ∑_{i=1}^{N} 1/i} }   (use 1 − x ≤ e^{−x})
  ≤ E{ 1/(N + 1)^{1−c} }   (use ∑_{i=1}^{N} 1/i ≥ log(N + 1)).


The latter formula remains valid even if N = 0. Thus, with c = 1 − ε f(x)/4,

P{E1 ∩ E2^c ∩ E3 ∩ E4^c} ≤ E{ (1 − ε f(x)/4)^M } ≤ E{ 1/(N + 1)^{1−c} }.

N is binomial with parameters (m − l, µ(Rl)). We know from the introduction of this section that µ(Rl) is distributed as the minimum of l i.i.d. uniform [0, 1] random variables. Thus, for δ > 0, E{µ(Rl)} = 1/(l + 1), and P{µ(Rl) < δ/l} ≤ δ. Hence,

E{ 1/(N + 1)^{1−c} } ≤ P{µ(Rl) < δ/l} + E{ 1/(Binomial(m − l, δ/l) + 1)^{1−c} }
  ≤ δ + (2l/((m − l)δ))^{1−c} + P{ Binomial(m − l, δ/l) < (m − l)δ/(2l) }.

The first term is small by choice of δ. The second one is O(k^{−(1−c)/3}). The third one is bounded from above, by Chebyshev's inequality, by

(m − l)(δ/l)(1 − δ/l) / ((m − l)δ/(2l))^2 ≤ 4l/((m − l)δ) → 0.

Term 4. This term is handled exactly as Term 3. Note, however, that l and m now become m and k, respectively. The convergence to 0 requires now m/(k − m) → 0, which is still the case.

This concludes the proof of Theorem 20.3. □

20.6 The Deep k-d Tree

Theorem 20.4. The deep k-d tree classifier is consistent (i.e., ELn → L∗) for all distributions such that X has a density, whenever

lim_{n→∞} k = ∞ and lim sup_{n→∞} k/log n < 2.

Proof. In Problem 20.10, you are asked to show that k → ∞ implies diam(A(X)) → 0 in probability. Theorem 20.3 may be invoked here. We now show that the assumption lim sup_{n→∞} k/log n < 2 implies that N(X) → ∞ in probability. Let D be the depth (distance from the root) of X when X is inserted into a k-d tree having n elements. Clearly, N(X) ≥ D − k, so it suffices to show that D − k → ∞ in probability. We know that D/(2 log n) → 1 in probability (see, e.g., Devroye (1988a) and Problem 20.10). This concludes the proof of the theorem. □

20.7 Quadtrees

Quadtrees or hyperquadtrees are unquestionably the most prominent trees in computer graphics. Easy to manipulate and compact to store, they have found their way into mainstream computer science. Discovered by Finkel and Bentley (1974) and surveyed by Samet (1984), (1990a), they take several forms. We are given d-dimensional data X1, . . . , Xn. The tree is constructed as the k-d tree is. In particular, X1 becomes the root of the tree. It partitions X2, . . . , Xn into 2^d (possibly empty) sets according to membership in one of the 2^d (hyper-) quadrants centered at X1 (see Figure 20.15).

Figure 20.15. Quadtree and the induced partition of R^2. The points on the right are shown in their position in space. The root is specially marked.

The partitioning process is repeated at the 2^d child nodes until a certain stopping rule is satisfied. In analogy with the k-d tree, we may define the chronological quadtree (only k splits are allowed, defined by the first k points X1, . . . , Xk), and the deep quadtree (k levels of splits are allowed). Other, more balanced versions may also be introduced. Classification is by majority vote over all (Xi, Yi)'s (with k < i ≤ n in the chronological quadtree) that fall in the same region as x. Ties are broken in favor of class 0. We will refer to this as the (chronological, deep) quadtree classifier.
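To make the 2^d-way branching concrete, here is a small Python sketch (ours; the names are illustrative) of how a chronological quadtree can be grown from the first k points.

    def quadrant_index(x, center):
        # Index in {0, ..., 2^d - 1}: bit j records whether x exceeds the
        # center in coordinate j.
        return sum(1 << j for j in range(len(center)) if x[j] > center[j])

    def build_chronological_quadtree(points, k):
        # X1 becomes the root; each of X2, ..., Xk is routed to a quadrant of
        # an existing node and opens 2^d new (possibly empty) child regions.
        root = {"center": points[0], "children": {}}
        for x in points[1:k]:
            node = root
            while True:
                q = quadrant_index(x, node["center"])
                if q not in node["children"]:
                    node["children"][q] = {"center": x, "children": {}}
                    break
                node = node["children"][q]
        return root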


Theorem 20.5. Whenever X has a density, the chronological quadtree classifier is consistent (ELn → L∗) provided that k → ∞ and k/n → 0.

Proof. Assume without loss of generality that all marginal distributions are uniform [0, 1]. As the X-property holds, Theorem 6.1 applies. By Lemma 20.1, since we have k(2^d − 1) + 1 external regions,

P{N(X) ≤ M} ≤ (2M + 4)(k(2^d − 1) + 1)/(n − k) → 0

for all M > 0, provided that k/n → 0. Hence, we need only verify the condition diam(A(X)) → 0 in probability. This is a bit easier than in the proof of Theorem 20.3 for the k-d tree and is thus left to the reader (Problem 20.18). □

Remark. Full-fledged random quadtrees with n nodes have expected height O(log n) whenever X has a density (see, e.g., Devroye and Laforest (1990)). With k nodes, every region is thus reached in only O(log k) steps on the average. Furthermore, quadtrees enjoy the same monotone transformation invariance that we observed for k-d trees and median trees. □

20.8 Best Possible Perpendicular Splits

For computational reasons, classification trees are most often produced by determining the splits recursively. At a given stage of the tree-growing algorithm, some criterion is used to determine which node of the tree should be split next, and where the split should be made. As these criteria typically use all the data, the resulting trees no longer have the X-property. In this section we examine perhaps the most natural criterion. In the following sections we introduce some alternative splitting criteria.

A binary classification tree can be obtained by associating with each node a splitting function φ(x) obtained in a top-down fashion from the data. For example, at the root, we may select the function

φ(x) = s x^(i) − α,

where i, the component cut, α ∈ R, the threshold, and s ∈ {−1, +1}, a polarization, are all dependent upon the data. The root then splits the data Dn = {(X1, Y1), . . . , (Xn, Yn)} into two sets, D′n, D′′n, with D′n ∪ D′′n = Dn, |D′n| + |D′′n| = n, such that

D′n = {(x, y) ∈ Dn : φ(x) ≥ 0},
D′′n = {(x, y) ∈ Dn : φ(x) < 0}.


A decision is made whether to split a node or not, and the procedure is applied recursively to the subtrees. Natural majority vote decisions are taken at the leaf level. All such trees will be called perpendicular splitting trees.

In Chapter 4, we introduced univariate Stoller splits, that is, splits that minimize the empirical error. This could be at the basis of a perpendicular splitting tree. One realizes immediately that the number of possibilities for stopping is endless. To name two, we could stop after k splitting nodes have been defined, or we could make a tree with k full levels of splits (so that all leaves are at distance k from the root). We first show that for d > 1, any such strategy is virtually doomed to fail. To make this case, we will argue on the basis of distribution functions only. For convenience, we consider a two-dimensional problem. Given a rectangle R now assigned to one class y ∈ {0, 1}, we see that the current probability of error in R, before splitting, is P{X ∈ R, Y ≠ y}. Let R′ range over all rectangles of the form R ∩ ((−∞, α] × R), R ∩ ([α, ∞) × R), R ∩ (R × (−∞, α]), or R ∩ (R × [α, ∞)), and let R′′ = R − R′. Then after a split based upon (R′, R′′), the probability of error over the rectangle R is

P{X ∈ R′, Y = 1} + P{X ∈ R′′, Y = 0},

as we assign class 0 to R′ and class 1 to R′′. Call ∆(R) the decrease in probability of error if we minimize over all R′. Compute ∆(R) for all leaf rectangles and then proceed to split that rectangle (or leaf) for which ∆(R) is maximal. The data-based rule based upon this would proceed similarly, if P{A} is replaced everywhere by the empirical estimate (1/n) ∑_{i=1}^{n} I{(Xi,Yi)∈A}, where A is of the form R′ × {1}, R′′ × {0}, or R × {1 − y}, as the case may be.

Let us denote by L0, L1, L2, . . . the sequence of the overall probabilities of error for the theoretically optimal sequence of cuts described above. Here we start with R = R^2 and y = 0, for example. For fixed ε > 0, we now construct a simple example in which

L∗ = 0,

and

Lk ↓ (1 − ε)/2 as k → ∞.

Thus, applying the best split incrementally, even if we use the true probability of error as our criterion for splitting, is not advisable.

The example is very simple: X has uniform distribution on [0, 1]^2 with probability ε and on [1, 2]^2 with probability 1 − ε. Also, Y is a deterministic function of X, so that L∗ = 0.


Figure 20.16. Repeated Stoller splits are not consistent in this two-dimensional example. Cuts will always be made in the leftmost square.

The way Y depends on X is shown in Figure 20.16: Y = 1 if

X ∈ [1, 3/2]^2 ∪ [3/2, 2]^2 ∪ (A2 ∪ A4 ∪ A6 ∪ · · ·) × [0, 1],

where

A1 = [0, 1/4),
A2 = [1/4, 1/4 + 3/8),
A3 = [1/4 + 3/8, 1/4 + 3/8 + 3/16),
...
Ak = [1/4 + 3/8 + · · · + 3/2^k, 1/4 + 3/8 + · · · + 3/2^{k+1}),

and so forth. We verify easily that P{Y = 1} = 1/2. Also, the error probability before any cut is made is L0 = 1/2. The best split has R′ = (−∞, 1/4) × R so that A1 is cut off. Therefore, L1 = ε/4 + (1 − ε)/2. We continue and split off A2 and so forth, leading to the tree of Figure 20.17.

Figure 20.17. The tree obtained by repeated Stoller splits.


Verify that L2 = ε/8 + (1 − ε)/2 and in general that

Lk = ε/2^{k+1} + (1 − ε)/2 ↓ (1 − ε)/2,

as claimed.

20.9 Splitting Criteria Based on Impurity Functions

In 1984, Breiman, Friedman, Olshen, and Stone presented their cart program for constructing classification trees with perpendicular splits. One of the key ideas in their approach is the notion that trees should be constructed from the bottom up, by combining small subtrees. The starting point is a tree with n + 1 leaf regions defined by a partition of the space based on the n data points. Such a tree is much too large and is pruned by some methods that will not be explored here. When constructing a starting tree, a certain splitting criterion is applied recursively. The criterion determines which rectangle should be split, and where the cut should be made. To keep the classifier invariant under monotone transformation of the coordinate axes, the criterion should only depend on the coordinatewise ranks of the points, and their labels. Typically the criterion is a function of the numbers of points labeled by 0 and 1 in the rectangles after the cut is made. One such class of criteria is described here.

Let α ∈ R, and let i be a given coordinate (1 ≤ i ≤ d). Let R be a hyperrectangle to be cut. Define the following quantities for a split at α, perpendicular to the i-th coordinate:

Xi(R, α) = {(Xj, Yj) : Xj ∈ R, Xj^(i) ≤ α},
Xi^c(R, α) = {(Xj, Yj) : Xj ∈ R, Xj^(i) > α}

are the sets of pairs falling to the left and to the right of the cut, respectively.

Ni(R, α) = |Xi(R, α)|,  N′i(R, α) = |Xi^c(R, α)|

are the numbers of such pairs. Finally, the numbers of points with label 0 and label 1 in these sets are denoted, respectively, by

Ni,0(R, α) = |Xi(R, α) ∩ {(Xj, Yj) : Yj = 0}|,
Ni,1(R, α) = |Xi(R, α) ∩ {(Xj, Yj) : Yj = 1}|,
N′i,0(R, α) = |Xi^c(R, α) ∩ {(Xj, Yj) : Yj = 0}|,
N′i,1(R, α) = |Xi^c(R, α) ∩ {(Xj, Yj) : Yj = 1}|.


Following Breiman, Friedman, Olshen, and Stone (1984), we define an impurity function for a possible split (i, α) by

Ii(R, α) = ψ( Ni0/(Ni0 + Ni1), Ni1/(Ni0 + Ni1) ) Ni + ψ( N′i0/(N′i0 + N′i1), N′i1/(N′i0 + N′i1) ) N′i,

where we dropped the argument (R, α) throughout. Here ψ is a nonnegative function with the following properties:

(1) ψ(1/2, 1/2) ≥ ψ(p, 1 − p) for any p ∈ [0, 1];

(2) ψ(0, 1) = ψ(1, 0) = 0;

(3) ψ(p, 1− p) increases in p on [0, 1/2] and decreases in p on [1/2, 1].

A rectangle R is split at α along the i-th coordinate if Ii(R, α) is minimal. Ii penalizes splits in which the subregions have about equal proportions from both classes. Examples of such functions ψ include the following (a short code sketch follows the list):

(1) The entropy function ψ(p, 1 − p) = −p log p − (1 − p) log(1 − p) (Breiman et al. (1984, pp. 25, 103)).

(2) The Gini function ψ(p, 1 − p) = 2p(1 − p), leading to the Gini index of diversity Ii (Breiman et al. (1984, pp. 38, 103)).

(3) The probability of misclassification ψ(p, 1 − p) = min(p, 1 − p). In this case the splitting criterion leads to the empirical Stoller splits studied in the previous section.
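In Python, these three choices of ψ and the criterion Ii(R, α) might be sketched as follows (our illustration; here ψ is written as a function of p alone and stands for ψ(p, 1 − p)).

    import numpy as np

    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * np.log(p) - (1 - p) * np.log(1 - p)

    def gini(p):
        return 2 * p * (1 - p)

    def misclassification(p):
        return min(p, 1 - p)

    def impurity_of_split(psi, n_left_0, n_left_1, n_right_0, n_right_1):
        # I_i(R, alpha) = psi(left proportions) * N_i + psi(right proportions) * N'_i,
        # expressed through the four counts defined above.
        def side(n0, n1):
            n = n0 + n1
            return 0.0 if n == 0 else psi(n0 / n) * n
        return side(n_left_0, n_left_1) + side(n_right_0, n_right_1)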

We have two kinds of splits:

(A) The forced split: force a split along the i-th coordinate, but minimize Ii(R, α) over all α and rectangles R.

(B) The free split: choose the most advantageous coordinate for splitting, that is, minimize Ii(R, α) over all α, i, and R.

Unfortunately, regardless of which kind of policy we choose, there are distributions for which no splitting based on an impurity function leads to a consistent classifier. To see this, note that the two-dimensional example of the previous section applies to all impurity functions. Assume that we had infinite sample size. Then Ii(R, α) would approach a ψ(p, 1 − p) + b ψ(q, 1 − q), where a is the probability content of R′ (one of the rectangles obtained after the cut is made), b is that of R′′ (the other rectangle), p = P{Y = 1|X ∈ R′}, and q = P{Y = 1|X ∈ R′′}. If X is uniformly distributed on the checkerboard shown in Figure 20.18, regardless of where we try to cut, p = q = 1/2, and every cut seems to be undesirable.


Figure 20.18. No cut decreases the value of the impurity function.

This simple example may be made more interesting by mixing it with a distribution with much less weight in which x^(1)- and x^(2)-direction splits are alternatingly encouraged all the time. Therefore, impurity functions should be avoided in their raw form for splitting. This may explain partially why in cart, the original tree is undesirable and must be pruned from the bottom up. See Problems 20.22 to 20.24 for more information. In the next section and in the last section of this chapter we propose other ways of growing trees with much more desirable properties.

The derivation shown above does not indicate that the empirical version will not work properly, but simple versions of it will certainly not. See Problem 20.22.

Remark. Malicious splitting. The impurity functions suggested above all avoid leaving the proportions of zeros and ones intact through splitting. They push towards more homogeneous regions. Assume now that we do the opposite. Through such splits, we can in fact create hyperplane classification trees that are globally poor, that is, that are such that every trimmed version of the tree is also a poor classifier. Such splitting methods must of course use the Yi's. Our example shows that any general consistency theorem for hyperplane classification trees must come with certain restrictions on the splitting process: the X-property is good; sometimes it is necessary to force cells to shrink; sometimes the position of the split is restricted by empirical error minimization or some other criterion.

The ham-sandwich theorem (see Edelsbrunner (1987)) states that given 2n class-0 points and 2m class-1 points in R^d, d > 1, there exists a hyperplane cut that leaves n class-0 points and m class-1 points in each halfspace. So, assume that X has a density and that p = P{Y = 1} = 1/2. In a sample of size n, let y be the majority class (ties are broken in favor of class 0).

Figure 20.19. Ham-sandwich cut: Each halfspace contains exactly half the points from each class.

Regardless of the sample make-up, if y = 0, we may construct a hyperplane-tree classifier in which, during the construction, every node represents a region in which the majority vote would be 0. This property has nothing to do with the distribution of (X, Y), and therefore, for any trimmed version of the tree classifier,

Ln ≥ p I{y=0},

and P{Ln ≥ p} ≥ P{y = 0} ≥ 1/2 if p = 1/2. Obviously, as we may take L∗ = 0, these classifiers are hopeless. □

Bibliographic remarks. Empirical Stoller splits without forced rotation were recommended by Payne and Meisel (1977) and Rounds (1980), but their failure to be universally good was noted by Gordon and Olshen (1978). The last two authors recommended a splitting scheme that combined several ideas, but roughly speaking, they perform empirical Stoller splits with forced rotation through the coordinate axes (Gordon and Olshen (1978), (1980)). Other splitting criteria include the aid criterion of Morgan and Sonquist (1963), which is a predecessor of the Gini index of diversity used in cart (Breiman, Friedman, Olshen, and Stone (1984); see also Gelfand and Delp (1991), Guo and Gelfand (1992), Gelfand, Ravishankar, and Delp (1989), (1991), and Ciampi (1991)). Michel-Briand and Milhaud (1994) also observed the failure of multivariate classification trees based on the aid criterion.

The Shannon entropy or modifications of it are recommended by Talmon (1986), Sethi and Sarvarayudu (1982), Wang and Suen (1984), Goodman and Smyth (1988), and Chou (1991). Permutation statistics are used in Li and Dubes (1986), still without forced rotations through the coordinate axes. Quinlan (1993) has a more involved splitting criterion. A general discussion on tree splitting may be found in the paper by Sethi (1991). A class of impurity functions is studied in Burshtein, Della Pietra, Kanevsky, and Nadas (1992). Among the pioneers of tree splitting (with perpendicular cuts) are Sebestyen (1962), and Henrichon and Fu (1969). For related work, we refer to Stoffel (1974), Sethi and Chatterjee (1977), Argentiero, Chin, and Beaudet (1982), You and Fu (1976), Anderson and Fu (1979), Qing-Yun and Fu (1983), Hartmann, Varshney, Mehrotra, and Gerberich (1982), and Casey and Nagy (1984). References on nonperpendicular splitting methods are given below in the section on bsp trees.

20.10 A Consistent Splitting Criterion

There is no reason for pessimism after the previous sections. Rest assured, there are several consistent splitting strategies that are fully automatic and depend only upon the populations of the regions. In this section we provide a solution for the simple case when X is univariate and nonatomic. It is possible to generalize the method for d > 1 if we force cuts to alternate directions. We omit here the detailed analysis for multidimensional cases for two reasons. First, it is significantly heavier than for d = 1. Secondly, in the last section of this chapter we introduce a fully automatic way of building up consistent trees, that is, without forcing the directions of the splits.

To a partition A1, . . . , AN of R, we assign the quantity

Q = ∑_{i=1}^{N} N0(Ai) N1(Ai),

where

N0(A) = ∑_{j=1}^{n} I{Xj∈A, Yj=0},  N1(A) = ∑_{j=1}^{n} I{Xj∈A, Yj=1}

are the respective numbers of points labeled with 0 and 1 falling in the region A. The tree-growing algorithm starts with the trivial partition {R}, and at each step it makes a cut that yields the minimal value of Q. It proceeds recursively until the improvement in the value of Q falls below a threshold ∆n.
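A Python sketch of this univariate tree-growing scheme is given below (ours, not the book's code); x and y are numpy arrays, and each cell is represented by the sorted points it contains.

    import numpy as np

    def grow_q_tree(x, y, delta_n):
        # Repeatedly make the single cut that decreases Q the most; stop once
        # the best possible improvement drops below the threshold delta_n.
        order = np.argsort(x)
        cells = [(x[order], y[order])]        # the trivial partition {R}
        while True:
            best = None
            for ci, (xs, ys) in enumerate(cells):
                if len(xs) < 2:
                    continue
                n0, n1 = np.cumsum(ys == 0), np.cumsum(ys == 1)
                N0, N1 = int(n0[-1]), int(n1[-1])
                for cut in range(1, len(xs)):          # cut between xs[cut-1] and xs[cut]
                    new_q = n0[cut - 1] * n1[cut - 1] + (N0 - n0[cut - 1]) * (N1 - n1[cut - 1])
                    gain = N0 * N1 - new_q             # decrease of Q for this cut
                    if best is None or gain > best[0]:
                        best = (gain, ci, cut)
            if best is None or best[0] < delta_n:
                return cells                           # leaf cells; each votes by majority
            _, ci, cut = best
            xs, ys = cells.pop(ci)
            cells += [(xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])]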

Remark. Notice that this criterion always splits a cell that has many points from both classes (see the proof of the theorem below). Thus, it avoids the anomalies of impurity-function criteria described in the previous section. On the other hand, it does not necessarily split large cells, if they are almost homogeneous. For a comparison, recall that the Gini criterion minimizes the quantity

Q′ = ∑_{i=1}^{N} N0(Ai) N1(Ai) / (N0(Ai) + N1(Ai)),

thus favoring cutting cells with very few points. We realize that the criterion Q introduced here is just one of many with similarly good properties, and albeit probably imperfect, it is certainly one of the simplest. □

Theorem 20.6. Let X have a nonatomic distribution on the real line, and consider the tree classifier obtained by the algorithm described above. If the threshold satisfies

∆n/n → ∞ and ∆n/n^2 → 0,

then the classification rule is strongly consistent.

Proof. There are two key properties of the algorithm that we exploit:

Property 1. If min(N0(A), N1(A)) > √(2∆n) for a cell A, then A gets cut by the algorithm. To see this, observe that (dropping the argument A from the notation) if N0 ≥ N1, and we cut A so that the numbers of 0-labeled points in the two child regions differ by at most one, then the contribution of these two new regions to Q is

⌈N0/2⌉ N′1 + ⌊N0/2⌋ N′′1 ≤ ⌈N0 N1/2⌉,

where N′1 and N′′1 are the numbers of class-1 points in the two child regions. Thus, the decrease of Q if A is split is at least ⌊N0 N1/2⌋. If min(N0, N1) > √(2∆n), then ∆n ≤ ⌊N0 N1/2⌋, and a cut is made.

Property 2. No leaf node has fewer than ∆n/n points. Assume that after a region is cut, in one of the child regions, the total number of points is N′0 + N′1 ≤ k. Then the improvement in Q caused by the split is bounded by

N0 N1 − (N′0 N′1 + N′′0 N′′1) ≤ N0 N1 − (N0 − k)(N1 − k) ≤ (N0 + N1)k ≤ nk.

Therefore, if k < ∆n/n, then the improvement is smaller than ∆n. Thus, no cut is made that leaves behind a child region with fewer than ∆n/n points.

It follows from Property 1 that if a leaf region has more than 4√(2∆n) points, then, since the class in minority has less than √(2∆n) points in it, it may be cut into intervals containing between 2√(2∆n) and 4√(2∆n) points without altering the decision, since the majority vote within each region remains the same.

Summarizing, we see that the tree classifier is equivalent to a classifier that partitions the real line into intervals, each containing at least ∆n/n and at most 4√(2∆n) data points. Thus, in this partition, each interval has a number of points growing to infinity, yet of order o(n). We emphasize that the number of points in a region of the studied tree classifier may be large, but such regions are almost homogeneous, and therefore the classifier is equivalent to another classifier which has o(n) points in each region. Consistency of such partitioning classifiers is proved in the next chapter; see Theorem 21.3. □

20.11 BSP Trees

Binary space partition trees (or bsp trees) partition R^d by hyperplanes. Trees of this nature have evolved in the computer graphics literature via the work of Fuchs, Kedem, and Naylor (1980) and Fuchs, Abram, and Grant (1983) (see also Samet (1990b), Kaplan (1985), and Sung and Shirley (1992)). In discrimination they are at the same time generalizations of linear discriminants, of histograms, and of binary tree classifiers. bsp trees were recommended for use in discrimination by Henrichon and Fu (1969), Mizoguchi, Kizawa, and Shimura (1977), and Friedman (1977). Further studies include Sklansky and Michelotti (1980), Argentiero, Chin, and Beaudet (1982), Qing-Yun and Fu (1983), Breiman, Friedman, Olshen, and Stone (1984), Loh and Vanichsetakul (1988), and Park and Sklansky (1990).

There are numerous ways of constructing bsp trees. Most methods of course use the Y-values to determine good splits. Nevertheless, we should mention first simple splits with the X-property, if only to better understand the bsp trees.

Figure 20.20. A raw bsp tree and its induced partition in the plane. Every region is split by a line determined by the two data points with smallest index in the region.

We call the raw bsp tree the tree obtained by letting X1, . . . , Xd determine the first hyperplane. The d data points remain associated with the root, and the others (Xd+1, . . . , Xk) are sent down to the subtrees, where the process is repeated as far as possible. The remaining points Xk+1, . . . , Xn are used in a majority vote in the external regions. Note that the number of regions is not more than k/d. Thus, by Lemma 20.1, if k/n → 0, we have N(X) → ∞ in probability. Combining this with Problem 20.25, we have our first result:

Theorem 20.7. The natural binary tree classifier based upon a raw bsp tree with k → ∞ and k/n → 0 is consistent whenever X has a density.
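As an aside, the splitting primitive used by the raw bsp tree, the hyperplane through d data points, can be sketched as follows (our illustration; it assumes the points are in general position and, for convenience, that the hyperplane misses the origin, which happens with probability one when X has a density).

    import numpy as np

    def hyperplane_through(points):
        # Normal vector w and offset b of the hyperplane through d points in R^d,
        # obtained by solving w . x = 1 for the d points.
        P = np.asarray(points, dtype=float)
        w = np.linalg.solve(P, np.ones(len(P)))
        return w, -1.0                        # the hyperplane {x : w . x + b = 0}

    def bsp_side(w, b, x):
        # Which of the two halfspaces x falls in.
        return float(np.dot(w, x) + b) >= 0.0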

Hyperplanes may also be selected by optimization of a criterion. Typically, this would involve separating the classes in some way. All that was said for perpendicular splitting remains valid here. It is worthwhile recalling therefore that there are distributions for which the best empirical Stoller split does not improve the probability of error. Take for example the uniform distribution in the unit ball of R^d in which

Y =
  1 if ‖X‖ ≥ 1/2,
  0 if ‖X‖ < 1/2.

Figure 20.21. No single split improves on the error probability for this distribution.

Here, no linear split would be helpful as the 1's would always hold a strong majority. Minimizing other impurity functions such as the Gini criterion may be helpful, however (Problem 20.27).

Bibliographic remarks. Hyperplane splits may be generalized to include quadratic splits (Henrichon and Fu (1969)). For example, Mui and Fu (1980) suggest taking d′ < d and forming quadratic classifiers as in normal discrimination (see Chapter 4) based upon vectors in R^{d′}. The cuts are thus perpendicular to d − d′ axes but quadratic in the subspace R^{d′}. Lin and Fu (1983) employ the Bhattacharyya distance or 2-means clustering for determining splits. As a novelty, within each leaf region, the decision is not by majority vote, but rather by a slightly more sophisticated rule such as the k-nn rule or linear discrimination. Here, no optimization is required along the way. Loh and Vanichsetakul (1988) allow linear splits but use F ratios to select desirable hyperplanes.

20.12 Primitive Selection

There are two reasons for optimizing a tree configuration. First of all, it just does not make sense to ignore the class labels when constructing a tree classifier, so the Yi's must be used to help in the selection of a best tree. Secondly, some trees may not be consistent (or provably consistent), yet when optimized over a family of trees, consistency drops out. We take the following example: let Gk be a class of binary tree classifiers with the X-property, with the space partitioned into k + 1 regions determined by X1, . . . , Xk only. Examples include the chronological k-d tree and some kinds of bsp trees. We estimate the error for g ∈ Gk by

L̂n(g) = (1/(n − k)) ∑_{i=k+1}^{n} I{g(Xi) ≠ Yi},


realizing the danger of using the same data that were used to obtain majority votes to estimate the error. An optimistic bias will be introduced. (For more on such error estimates, see Chapter 23.) Let g∗n be the classifier (or one of the classifiers) in Gk for which L̂n is minimum. Assume that |Gk| < ∞; for example, Gk could be all k! chronological k-d trees obtained by permuting X1, . . . , Xk. We call g∗n then the permutation-optimized chronological k-d classifier. When k = ⌊√(log n)⌋, one can verify that k! = o(n^ε) for any ε > 0, so that the computational burden (at least on paper) is not out of sight. We assume that Gk has a consistent classifier sequence, that is, as n → ∞, k usually grows unbounded, and ELn(gk) → L∗ for a sequence gk ∈ Gk.

Example. Among the chronological k-d trees modulo permutations, the first one (i.e., the one corresponding to the identity permutation) was shown to be consistent in Theorem 20.3, if X has a density, k → ∞, and k/n → 0. □

Example. Let Gk be the class of bsp trees in which we take as possible hyperplanes for splitting the root:

(1) the hyperplane through X1, . . . , Xd;
(2) the d hyperplanes through X1, . . . , Xd−1 that are parallel to one axis;
(3) the (d choose 2) hyperplanes through X1, . . . , Xd−2 that are parallel to two axes;
...
(d) the (d choose d−1) = d hyperplanes through X1 that are parallel to d − 1 axes.

Thus, conservatively estimated, |Gk| ≤ k! 2^d because there are at most 2^d possible choices at each node and there are k! permutations of X1, . . . , Xk. Granted, the number of external regions is very variable, but it remains bounded by k + 1 in any case. As Gk contains the chronological k-d tree, it has a consistent sequence of classifiers when k → ∞ and n/(k log k) → ∞. □

Theorem 20.8. Let g∗n = arg min_{g∈Gk} L̂n(g). Then, if Gk has a consistent sequence of classifiers, if the number of regions in the partition is at most k + 1 for all g ∈ Gk, and if k → ∞ and n/log |Gk| → ∞, then ELn(g∗n) → L∗, where Ln(g∗n) is the conditional probability of error of g∗n.

In the examples cited above, we must take k → ∞, n/k → ∞. Furthermore, log |Gk| = O(log k!) = O(k log k) in both cases. Thus, g∗n is consistent whenever X has a density, k → ∞, and k = o(n/log n). This is a simple way of constructing a basically universally consistent bsp tree.
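The selection step itself is simple to sketch in Python (our illustration; `candidates` stands for any finite class Gk of classifiers built from X1, . . . , Xk only).

    import numpy as np

    def empirical_error(g, X_eval, y_eval):
        # Estimate of the error of g on the pairs not used to build the partition.
        return np.mean([g(x) != y for x, y in zip(X_eval, y_eval)])

    def primitive_selection(candidates, X, y, k):
        # Pick g*_n minimizing the estimated error over the finite class G_k.
        errs = [empirical_error(g, X[k:], y[k:]) for g in candidates]
        return candidates[int(np.argmin(errs))]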


Proof. Let g+ = arg min_{g∈Gk} Ln(g). Then

Ln(g∗n) − L∗ = (Ln(g∗n) − L̂n(g∗n)) + (L̂n(g∗n) − L̂n(g+)) + (L̂n(g+) − Ln(g+)) + (Ln(g+) − L∗).

Clearly,

Ln(g∗n) − L∗ ≤ Ln(g+) − L∗ + 2 max_{g∈Gk} |L̂n(g) − Ln(g)| =: I + II.

Obviously, I → 0 in the mean by our assumption, and for ε > 0,

P{ 2 max_{g∈Gk} |L̂n(g) − Ln(g)| > ε } ≤ |Gk| max_{g∈Gk} P{ |L̂n(g) − Ln(g)| > ε/2 }.

Next we bound the probabilities on the right-hand side. Let pi0, pi1 denote P{X ∈ region i, Y = 0} and P{X ∈ region i, Y = 1}, respectively, with regions determined by g, 1 ≤ i ≤ k + 1. Let

Ni0 = ∑_{j=k+1}^{n} I{Xj∈region i, Yj=0} and Ni1 = ∑_{j=k+1}^{n} I{Xj∈region i, Yj=1}.

Then

Ln(g) = ∑_{i=1}^{k+1} pi1 I{Ni0≥Ni1} + ∑_{i=1}^{k+1} pi0 I{Ni0<Ni1},

and

L̂n(g) = ∑_{i=1}^{k+1} (Ni1/(n − k)) I{Ni0≥Ni1} + ∑_{i=1}^{k+1} (Ni0/(n − k)) I{Ni0<Ni1}.

Thus,

|L̂n(g) − Ln(g)| ≤ ∑_{i=1}^{k+1} |pi1 − Ni1/(n − k)| + ∑_{i=1}^{k+1} |pi0 − Ni0/(n − k)|.

Introduce the notation Z for the random variable on the right-hand side of the above inequality. By the Cauchy-Schwarz inequality,

E{Z | X1, . . . , Xk}
  ≤ ∑_{i=1}^{k+1} √( E{ (pi1 − Ni1/(n − k))^2 | X1, . . . , Xk } ) + ∑_{i=1}^{k+1} √( E{ (pi0 − Ni0/(n − k))^2 | X1, . . . , Xk } )
  (as, given X1, . . . , Xk, Ni1 is binomial (n − k, pi1))
  ≤ ∑_{i=1}^{k+1} ( √(pi1/(n − k)) + √(pi0/(n − k)) )
  ≤ ((2k + 2)/√(n − k)) √( (1/(2k + 2)) ∑_{i=1}^{k+1} (pi1 + pi0) )   (by another use of the Cauchy-Schwarz inequality)
  = √((2k + 2)/(n − k)).

Thus, E{Z | X1, . . . , Xk} → 0 as k/n → 0. Note that

P{Z > ε | X1, . . . , Xk}
  ≤ P{ Z − E{Z | X1, . . . , Xk} > ε/2 | X1, . . . , Xk }
  (if E{Z | X1, . . . , Xk} < ε/2, which happens when √((2k + 2)/(n − k)) ≤ ε/2)
  ≤ exp( −(ε/2)^2 / (2(n − k) · 4/(n − k)^2) )
  (by McDiarmid's inequality, as changing the value of an Xj, j ≥ k, changes Z by at most 2/(n − k); see Theorem 9.2)
  = exp( −(n − k)ε^2/32 ).

Thus, taking expected values, we see that

P{II > ε} ≤ |Gk| e^{−(n−k)ε^2/64}

if √((2k + 2)/(n − k)) ≤ ε/2. This tends to 0 for all ε > 0 if k/n → 0 and n/log |Gk| → ∞. □


20.13 Constructing Consistent Tree Classifiers

Thus far, we have taken you through a forest of beautiful trees and we have shown you a few tricks of the trade. When you read (or write) a research paper on tree classifiers, and try to directly apply a consistency theorem, you will get frustrated, however: most real-life tree classifiers use the data in intricate ways to suit a certain application. It really helps to have a few truly general results that have universal impact. In this section we will point you to three different places in the book where you may find useful results in this respect. First of all, there is a consistency theorem (Theorem 21.2) that applies to rules that partition the space and decide by majority vote. The partition is arbitrary and may thus be generated by using some or all of X1, Y1, . . . , Xn, Yn. If a rule satisfies the two (weak) conditions of Theorem 21.2, it must be universally consistent. To put it differently, even the worst rule within the boundaries of the theorem's conditions must perform well asymptotically.

Second, we will briefly discuss the design of tree classifiers obtained by minimizing the empirical error estimate (L̂n) over possibly infinite classes of classifiers. Such classifiers, however hard to find by an algorithm, have asymptotic properties that are related to the vc dimension of the class of rules. Consistency follows almost without work if one can calculate or bound the vc dimension appropriately. While Chapters 13 and 21 deal in more detail with the vc dimension, it is necessary to give a few examples here.

Third, we point the reader to Chapter 22 on data splitting, where the previous approach is applied to the minimization of the holdout estimate, obtained by trees based upon part of the sample, and using another part to select the best tree in the bunch. Here too the vc dimension plays a crucial role.

Theorem 21.2 allows space partitions that depend quite arbitrarily on all the data and extends earlier universally applicable results of Gordon and Olshen (1978), (1980), (1984), and Breiman, Friedman, Olshen, and Stone (1984). A particularly useful format is given in Theorem 21.8. If the partition is by recursive hyperplane splits as in bsp trees and the number of splits is at most mn, if mn log n/n → 0, and if

diam(A(X) ∩ SB) → 0

with probability one for all SB (where SB is the ball of radius B centered at the origin), then the classification rule is strongly consistent. The last condition forces a randomly picked region in the partition to be small. However, mn log n/n → 0 guarantees that no devilish partition can be inconsistent. The latter condition is certainly satisfied if each region contains at least kn points, where kn/log n → ∞.

Next we take a look at full-fledged minimization of Ln, the empiricalerror, over certain classes of tree classifiers. Here we are not concerned

Page 368: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 367

with the (often unacceptable) computational effort. For example, let Gkbe the class of all binary tree classifiers based upon a tree consisting of kinternal nodes, each representing a hyperplane cut (as in the bsp tree), andall possible 2k+1 labelings of the k + 1 leaf regions. Pick such a classifierfor which

Ln(g) =1n

n∑i=1

Ig(Xi) 6=Yi

is minimal. Observe that the chosen tree is always natural; that is, it takesmajority votes over the leaf regions. Thus, the minimization is equivalentto the minimization of the resubstitution error estimate (defined in Chapter23) over the corresponding class of natural tree classifiers. We say that asequence Gk of classes is rich if we can find a sequence gk ∈ Gk suchthat L(gk) → L∗. For hyperplanes, this is the case if k → ∞ as n → ∞—just make the hyperplane cuts form a regular histogram grid and recallTheorem 9.4. Let S(Gk, n) be the shatter coefficient of Gk (for a definition,see Chapter 12, Definition 12.3). For example, for the hyperplane family,

S(Gk, n) ≤ nk(d+1).

Then by Corollary 12.1 we have ELn(g∗n) → L∗ for the selected classifierg∗n when Gk is rich (k →∞ here) and logS(Gk, n) = o(n), that is,

k = o

(n

log n

).

Observe that no conditions are placed on the distribution of (X,Y ) here!Consistency follows from basic notions—one combinatorial to keep us fromoverfitting, and one approximation-theoretical (the richness).

The above result remains valid under the same conditions on k in thefollowing classes:

(1) All trees based upon k internal nodes each representing a perpendic-ular split. (Note: S(Gk, n) ≤ ((n+ 1)d)k.)

(2) All trees based upon k internal nodes, each representing a quadtreesplit. (Note: S(Gk, n) ≤ (nd + 1)k.)

20.14 A Greedy Classifier

In this section, we define simply a binary tree classifier that is grown viaoptimization of a simple criterion. It has the remarkable property that itdoes not require a forced rotation through the coordinate axes or specialsafeguards against small or large regions or the like. It remains entirelyparameter-free (nothing is picked by the user), is monotone transformation

Page 369: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 368

invariant, and fully automatic. We show that in Rd it is always consistent.It serves as a prototype for teaching about such rules and should not beconsidered as more than that. For fully practical methods, we beleive, onewill have to tinker with the approach.

The space is partitioned into rectangles as shown below:

Figure 20.22. A tree based on partitioning the plane into rectangles.The right subtree of each internal node belongs to the inside of a rectan-gle, and the left subtree belongs to the complement of the same rectangle(ic denotes the complement of i). Rectangles are not allowed to overlap.

A hyperrectangle defines a split in a natural way. The theory presentedhere applies for many other types of cuts. These will be discussed after themain consistency theorem is stated.

A partition is denoted by P, and a decision on a set A ∈ P is by majorityvote. We write gP for such a rule:

gP(x) =

1 if∑

i:Xi∈AYi >

∑i:Xi∈A

(1− Yi), x ∈ A

0 otherwise.

Given a partition P, a legal rectangle A is one for which A ∩ B = ∅ orA ⊆ B for all sets B ∈ P. If we refine P by adding a legal rectangle Tsomewhere, then we obtain the partition T . The decision gT agrees withgP except on the set B ∈ P that contains T .

We introduce the convenient notation

νj(A) = PX ∈ A, Y = j, j ∈ 0, 1,

νj,n(A) =1n

n∑i=1

IXi∈A,Yi=j, j ∈ 0, 1.

Page 370: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 369

An estimate of the quality of gP is

Ln(P) def=∑R∈P

Ln(R),

where

Ln(R) =1n

n∑i=1

IXi∈R,gP(Xi) 6=Yi

= min(ν0,n(R), ν1,n(R)).

Here we use two different arguments for Ln (R and P), but the distinctionshould be clear. We may similarly define Ln(T ). Given a partition P, thegreedy classifier selects that legal rectangle T for which Ln(T ) is minimal(with any appropriate policy for breaking ties). Let R be the set of Pcontaining T . Then the greedy classifier picks that T for which

Ln(T ) + Ln(R− T )− Ln(R)

is minimal. Starting with the trivial partition P0 = Rd, we repeat theprevious step k times, leading thus to k + 1 regions. The sequence of par-titions is denoted by P0,P1, . . . ,Pk.

We put no safeguards in place—the rectangles are not forced to shrink.And in fact, it is easy to construct examples in which most rectangles donot shrink. The main result of the section, and indeed of this chapter, isthat the obtained classifier is consistent:

Theorem 20.9. For the greedy classifier with k →∞ and k = o(

3√n/ log n

),

assuming that X has nonatomic marginals, we have Ln → L∗ with proba-bility one.

Remark. We note that with techniques presented in the next chapter, itis possible to improve the second condition on k to k = o

(√n/ log n

)(see

Problem 20.31). 2

Before proving the theorem, we mention that the same argument may beused to establish consistency of greedily grown trees with many other typesof cuts. We have seen in Section 20.8 that repeated Stoller splits do notresult in good classifiers. The reason is that optimization is over a collectionof sets (halfspaces) that is not guaranteed to improve matters—witness theexamples provided in previous sections. A good cutting method is one thatincludes somehow many (but not too many) small sets. For example, let ussplit at the root making d+ 1 hyperplane cuts at once, that is, by findingthe d + 1 cuts that together produce the largest decrease in the empiricalerror probability. Then repeat this step recursively in each region k times.

Page 371: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 370

The procedure is consistent under the same conditions on k as in Theorem6.1, whenever X has a density (see Problem 20.30). The d+ 1 hyperplanecuts may be considered as an elementary cut which is repeated in a greedymanner. In Figures 20.23 to 20.25 we show a few elementary cuts that maybe repeated greedily for a consistent classifier. The straightforward proofsof consistency are left to the reader in Problem 20.30.

Proof of Theorem 20.9. We restrict ourselves to R2, but the proofremains similar in Rd (Problem 20.29). The notation Ln(·) was introducedabove, where the argument is allowed to be a partition or a set in a parti-tion. We similarly define

L(R) = miny∈0,1

PX ∈ R, Y 6= y

= min(ν0(R), ν1(R)),

L(P) =∑R∈P

L(R).

Figure 20.23. An elementary cuthere is composed of d+1 hyperplanecuts. They are jointly optimized.

Figure 20.24. 2d rectangular cutsdetermine an elementary cut. All 2dcuts have arbitrary directions; thereare no forced directions.

Page 372: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 371

Figure 20.25. Simplex cuts. A cutis determined by a polyhedron withd+1 vertices. The simplices are notallowed to overlap, just as for rect-angular cuts in the greedy classifier.Three consecutive simplex cuts inR2 are shown here.

Figure 20.26. At every iterationwe refine the partition by selectingthat rectangle R in the partition andthat 3× 3× · · · × 3 rectangular gridcut of R for which the empiricalerror is minimal. Two consecutiverectangular grid cuts are shown.

For example, let Gl denote a partition of R2 into a rectangular l× l grid.

Figure 20.27. Gl: An l × l grid (with l = 7here).

It is clear that for all ε > 0, there exists an l = l(ε) and an l × l grid Glsuch that

L(Gl) ≤ L∗ + ε.

If Q is another finer partition into rectangles (i.e., each set of Q is a rect-angle and intersects at most one rectangle of Gl), then necessarily

L(Q) ≤ L(Gl) ≤ L∗ + ε.

We will call Q a refinement of Gl. The next lemma is a key property ofpartitioning classifiers. In our eyes, it is the main technical property of thisentire chapter. We say that the partition T is an extension of P by a setQ—where Q ⊆ R ∈ P—if T contains all cells of P other than R, plus Qand R−Q.

Page 373: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 372

Lemma 20.2. Let Gl be a finite partition with L(Gl) ≤ L∗ + ε. Let P be afinite partition of Rd, and let Q be a refinement of both P and Gl. Thenthere exists a set Q ∈ Q (note: Q is contained in one set of P only) andan extension of P by Q to TQ such that, if L(P) ≥ L∗ + ε,

L(TQ)− (L∗ + ε) ≤(

1− 1|Q|

)(L(P)− (L∗ + ε)) .

Proof. First fix R ∈ P and let Q1, . . . , QN be the sets of Q contained inR. Define

pi = ν0(Qi), qi = ν1(Qi), L(Qi) = min(pi, qi),

p =N∑i=1

pi, q =N∑i=1

qi, L(R) = min(p, q),

L(R−Qi) = min(p− pi, q − qi).

First we show that there exists an integer i such that

L(R)− L(Qi)− L(R−Qi) ≥L(R)−

∑Ni=1 L(Qi)

N,

or equivalently,

max1≤i≤N

∆i ≥min(p, q)−

∑Ni=1 min(pi, qi)N

,

where ∆i = min(p, q)−min(pi, qi)−min(p−pi, q− qi). To see this, assumewithout loss of generality that p ≤ q. If pi ≤ qi for all i, then

min(p, q)−N∑i=1

min(pi, qi) = p−N∑i=1

pi = 0,

so we are done. Assume therefore that pi > qi for i ∈ A, where A is a setof indices with |A| ≥ 1. For such i,

∆i = p− qi − (p− pi) = pi − qi,

and thus,

∑i∈A

∆i =∑i∈A

(pi − qi) = p−∑i/∈A

pi −∑i∈A

qi = p−N∑i=1

min(pi, qi).

But then,

max1≤i≤N

∆i ≥1|A|

∑i∈A

∆i ≥p−

∑Ni=1 min(pi, qi)|A|

≥p−

∑Ni=1 min(pi, qi)N

,

Page 374: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 373

and the claim follows. To prove the lemma, notice that since L(Q) ≤ L∗+ε,it suffices to show that

maxQ∈Q

(L(P)− L(TQ)) ≥ L(P)− L(Q)|Q|

,

or equivalently, that

maxQ∈Q

(L(RQ)− L(Q)− L(RQ −Q)) ≥ L(P)− L(Q)|Q|

,

where RQ is the unique cell of P containing Q. However,

maxQ∈Q

(L(RQ)− L(Q)− L(RQ −Q))

≥ maxR∈P,Q⊆R

(L(R)− L(Q)− L(R−Q))

≥ maxR∈P

L(R)−∑Q⊆R L(Q)

|R|(by the inequality shown above, where |R| denotes

the number of sets of Q contained in R)

∑R∈P

(L(R)−

∑Q⊆R L(Q)

)∑R∈P |R|

=

∑R∈P L(R)−

∑Q∈Q L(Q)

|Q|

=L(P)− L(Q)

|Q|,

and the lemma is proved. 2

Let us return to the proof of Theorem 20.9. The previous lemma appliedto our situation (P = Pi) shows that we may extend Pi by a rectangle Q ∈Q (as Q will be a collection of rectangles refining both Gl and Pi) such thatL(TQ)− (L∗+ ε) is smaller by a guaranteed amount than L(Pi)− (L∗+ ε).(TQ is the partition obtained by extending Pi.) To describe Q, we do thefollowing (note that Q must consist entirely of rectangles for otherwisethe lemma is useless): take the rectangles of Pi and extend all four sides(in the order of birth of the rectangles) until they hit a side of anotherrectangle or an extended border of a previous rectangle if they hit anythingat all. Figure 20.28 illustrates the partition into rectangles. Note that thispartition consists of 4i line segments, and at most 9i rectangles (this canbe seen by noting that we can write in each nonoriginal rectangle of Pithe number of an original neighboring rectangle, each number appearingat most 8 times).

Page 375: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 374

Figure 20.28. Extensions of rectan-gles to get a rectangular grid.

The rectangular grid partition thus obtained is intersected with Gl to yieldQ. To apply Lemma 20.2, we only need a bound on |Q|. To the rectangularpartition just created, we add each of the 2l− 2 lines of the grid Gl one byone. Start with the horizontal lines. Each time one is added, it creates atmost 9i new rectangles. Then, each vertical line adds at most 9i + l newrectangles for a total of not more than 9i+9i(l−1)+(l−1)(9i+l) ≤ l2+18li(assuming l ≥ 1). Hence,

|Q| ≤ l2 + 18li.

Apply the lemma, and observe that there is a rectangle Q ∈ Q such thatthe extension of Pi by Q to P ′i would yield

L(P ′i)− (L∗ + ε) ≤ (L(Pi)− (L∗ + ε))(

1− 1l2 + 18li

).

At this point in the proof, the reader can safely forget Q. It has done whatit was supposed to do.

Of course, we cannot hope to find P ′i because we do not know the dis-tribution. Let us denote the actual new partition by Pi+1 and let Ri+1 bethe selected rectangle by the empirical minimization. Observe that

L(Pi+1)− L(P ′i)

=(L(Pi+1)− Ln(Pi+1)

)+(Ln(Pi+1)− Ln(P ′i)

)+(Ln(P ′i)− L(P ′i)

)≤

(L(Pi+1)− Ln(Pi+1)

)+(Ln(P ′i)− L(P ′i)

).

As Pi+1 and R′i have most sets in common, many terms cancel in the last

double sum. We are left with just L(·) and Ln(·) terms for sets of the formR−Ri+1, Ri+1, R−Q,Q with R ∈ Pi. Thus,

L(Pi+1)− L(P ′i) ≤ 2 supR′|Ln(R′)− L(R′)|,

where the supremum is taken over all rectangles of the above describedform. These are sets of the form obtainable in Pi+1. Every such set can bewritten as the difference of a (possibly infinite) rectangle and at most i+1

Page 376: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 375

nonoverlapping contained rectangles. As i+ 1 ≤ k in our analysis, we callZk the class of all sets that may be described in this form: Z0−Z1−· · ·−Zk,where Z0, Z1, . . . , Zk are rectangles, each Zi ⊆ Z0, Z1, . . . , Zk are mutuallydisjoint. Rectangles may be infinite or half infinite. Hence, for i < k,

L(Pi+1)− L(P ′i) ≤ 2 supZ∈Zk

|Ln(Z)− L(Z)|.

For a fixed Z ∈ Zk,

|Ln(Z)− L(Z)| = |min(ν0,n(Z), ν1,n(Z))−min(ν0(Z), ν1(Z))|≤ |ν0,n(Z)− ν0(Z)|+ |ν1,n(Z)− ν1(Z)|.

Thus,

L(Pi+1)− L(P ′i) ≤ Vndef= 2 sup

Z∈Zk

(|ν0,n(Z)− ν0(Z)|+ |ν1,n(Z)− ν1(Z)|

).

Define the (good) eventG = Vn < δ,

where δ > 0 will be picked later. On G, we see that for all i < k,

L(Pi+1)− (L∗ + ε)

≤ L(P ′i)− (L∗ + ε) + δ

≤ (L(Pi)− (L∗ + ε))(

1− 1l2 + 18li

)+ δ (Lemma 20.2).

We now introduce a convenient lemma.

Lemma 20.3. Let an, bn be sequences of positive numbers with bn ↓ 0,b0 < 1, and let δ > 0 be fixed. If an+1 ≤ an(1− bn) + δ, then

an+1 ≤ a0e−∑n

j=0bj + δ(n+ 1).

Proof. We have

an+1 ≤ a0

n∏j=0

(1− bj) + δ

n−1∑m=1

n∏j=m

(1− bj)

+ δ

= I + II + III.

Clearly, I ≤ a0 exp(−∑nj=0 bj

). A trivial bound for II + III is δ(n+ 1).

2

Page 377: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 376

Conclude that on G,

L(Pk)− (L∗ + ε) ≤ (L(P0)− (L∗ + ε)) e−∑k−1

j=01/(l2+18lj) + δk

≤ e−∫ k−1

01/(l2+18lu)du + δk

≤ e−1

18l log((l2+18l(k−1))/(l2)) + δk

=1(

1 + 18(k−1)l

)1/(18l)+ δk.

Thus,

PL(Pk) > L∗ + 2ε ≤ PGc = PVn ≥ δ

when δk < ε/2 and1(

1 + 18(k−1)l

)1/(18l)<ε

2.

Finally, introduce the notation

Ln(R) = PX ∈ R, Y 6= gP(X),

Ln(P) =∑R∈P

Ln(R).

As L(Pk) takes the partition into account, but not the majority vote, andLn(Pk) does, we need a small additional argument:

Ln(Pk) =(Ln(Pk)− Ln(Pk)

)+(Ln(Pk)− L(Pk)

)+ L(Pk)

≤ 2∑R∈Pk

(|ν0(R)− ν0,n(R)|+ |ν1(R)− ν1,n(R)|) + L(Pk)

def= 2Wn + L(Pk).

Putting things together,

P Ln(Pk) > L∗ + 4ε≤ P Wn > ε+ P L(Pk) > L∗ + 2ε≤ P Wn > ε+ P Vn ≥ δ

under the given conditions on δ and k. Observe that if for a set Z,

Un(Z) def= |ν0(Z)− ν0,n(Z)|+ |ν1(Z)− ν1,n(Z)|,

Page 378: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

20.14 A Greedy Classifier 377

then

Vn ≤ suprectangles Z

Un(Z) + supdisjoint sets of k

rectangles Z1,...,Zk

k∑i=1

Un(Zi)

≤ (k + 1) suprectangles Z

Un(Z),

Wn ≤ (k + 1) suprectangles Z

Un(Z).

We may now use the Vapnik-Chervonenkis inequality (Theorems 12.5 and13.8) to conclude that for all ε > 0,

P

sup

rectangles ZUn(Z) > ε

≤ 16n2de−nε

2/32.

Armed with this, we have

PWn > ε ≤ 16n2de−n(

ε2(k+1)

)2/32,

PVn ≥ δ ≤ 16n2de−n(

δ2(k+1)

)2/32.

Both terms tend to 0 with n if n/(k2 log n) →∞ and nδ2/(k2 log n) →∞.Take δ = ε/(3k) to satisfy our earlier side condition, and note that we needn/(k3 log n) →∞. We, in fact, have strong convergence to 0 for Ln(Pk)−L∗by the Borel-Cantelli lemma. 2

The crucial element in the proof of Theorem 20.9 is the fact that thenumber of sets in the partition Q grows at most linearly in i, the iterationnumber. Had it grown quadratically, say, it would not have been goodenough—we would not have had guaranteed improvements of large enoughsizes to push the error probability towards L∗. In the multidimensionalversion and in extension to other types of cuts (Problems 20.29, 20.30) thisis virtually the only thing that must be verified. For hyperplane cuts, anadditional inequality of the vc type is needed, extending that for classes ofhyperplanes.

Our proof is entirely combinatorial and geometric and comes with explicitbounds. The only reference to the Bayes error we need is the quantity l(ε),which is the smallest value l for which

infall l×l grids Gl

L(Gl) ≤ L∗ + ε.

It depends heavily on the distribution of course. Call l(ε) the grid complexityof the distribution, for lack of a better term. For example, if X is discreteand takes values on the hypercube 0, 1d, then l(ε) ≤ 2d. If X takes values

Page 379: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 378

on a1, . . . , aMd, then l(ε) ≤ (M + 1)d. If X has an arbitrary distributionon the real line and η(x) is monotone, then l(ε) ≤ 2. However, if η isunimodal, l(ε) ≤ 3. In one dimension, l(ε) is sensitive to the number ofplaces where the Bayes decision changes. In two dimensions, however, l(ε)measures the complexity of the distribution, especially near regions withη(x) ≈ 1/2. When X is uniform on [0, 1]2 and Y = 1 if and only if thecomponents of X sum to less than one, then l(ε) = Θ(1/

√ε) as ε ↓ 0

(Problem 20.32), for example. The grid complexity is eminently suited tostudy rates of convergence, as it is explicitly featured in the inequalities ofour proof. There is no room in this book for following this thread, however.

Problems and Exercises

Problem 20.1. Let X be a random vector with density f on Rd. Show that eachcomponent has a density as well.

Problem 20.2. Show that both condition (1) and condition (2) of Theorem 6.1are necessary for consistency of trees with the X-property.

Problem 20.3. For a theoretical median tree in Rd with uniform marginals andk levels of splitting, show that EH(X)+V (X) = 1/(2k/d) when k is a multipleof d.

Problem 20.4. α-balanced trees. Consider the following generalization of themedian tree: at a node in the tree that represents n points, if a split occurs, bothsubtrees must be of size at least α(n − 1) for some fixed α > 0. Repeat thesplits for k levels, resulting in 2k leaf regions. The tree has the X-property andthe classifier is natural. However, the points at which cuts are made are pickedamong the data points in an arbitrary fashion. The splits rotate through thecoordinate axes. Generalize the consistency theorem for median trees.

Problem 20.5. Consider the following tree classifier. First we find the medianaccording to one coordinate, and create two subtrees of sizes b(n − 1)/2c andd(n − 1)/2e. Repeat this at each level, cutting the next coordinate axis in arotational manner. Do not cut a node any further if either all data points in thecorresponding region have identical Yi values or the region contains less than kpoints. Prove that the obtained natural classifier is consistent whenever X has adensity if k →∞ and logn/k → 0.

Problem 20.6. Let µ be a probability measure with a density f on Rd, anddefine

C =x = (x(1), . . . , x(d)) : µ

([x(1), x(1) + ε]× · · · × [x(d), x(d) + ε]

)> 0 all ε > 0

.

Show that µ(C) = 1. Hint: Proceed as in the proof of Lemma A.1 in the Ap-pendix.

Problem 20.7. Let U(1), . . . , U(n) be uniform order statistics defining spacingsS1, . . . , Sn+1 with Si = U(i) − U(i−1), if U(n+1) = 1 and U(0) = 0. Show that

Page 380: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 379

(1) S1, . . . , Sn+1 are identically distributed;(2) PS1 > x = (1− x)n, x ∈ [0, 1];(3) ES1 = 1/(n+ 1).

Problem 20.8. In the proof of Theorem 20.3, we assumed in the second partthat horizontal and vertical cuts were meted out independently. Return to theproof and see how you can modify it to take care of the forced alternating cuts.

Problem 20.9. Let X1, X2, . . . , Xn be i.i.d. with common density f on the realline. Call Xi a record if Xi = min(X1, . . . , Xi). Let Ri = IXi is a record. Showthat R1, . . . , Rn are independent and that PRi = 1 = 1/i. Conclude that theexpected number of records is ∼ logn.

Problem 20.10. The deep k-d tree. Assume that X has a density f and thatall marginal densities are uniform.

(1) Show that k →∞ implies that diam(A(X)) → 0 in probability. Do thisby arguing that diam(A(X)) → 0 in probability for the chronological k-dtree with the same parameter k.

(2) In a random k-d tree with n elements, show that the depth D of the lastinserted node satisfies D/(2 logn) → 1 in probability. Argue first thatyou may restrict yourself to d = 1. Then write D as a sum of indicators ofrecords. Conclude by computing ED and VarD or at least boundingthese quantities in an appropriate manner.

(3) Improve the second condition of Theorem 20.4 to (k−2 logn)/√

logn→−∞. (Note that it is possible that lim supn→∞ k/ logn = 2.) Hint: Showthat |D−2 logn| = O

(√logn

)in probability by referring to the previous

part of this exercise.

Problem 20.11. Consider a full-fledged k-d tree with n+1 external node regions.Let Ln be the probability of error if classification is based upon this tree and ifexternal node regions are assigned the labels (classes) of their immediate parents(see Figure 20.29).

Figure 20.29. The external nodes are

assigned the classes of their parents.

Show that whenever X has a density, ELn → LNN ≤ 2L∗ just as for the1-nearest neighbor rule.

Problem 20.12. Continuation. Show that Ln → LNN in probability.

Page 381: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 380

Problem 20.13. Meisel and Michalopoulos (1973) propose a binary tree classi-fier with perpendicular cuts in which all leaf regions are homogeneous, that is,all Yi’s for the Xi’s in the same region are identical.

Figure 20.30. Example of a tree partition into homoge-

neous regions.

(1) If L∗ = 0, give a stopping rule for constructing a consistent rule wheneverX has a density.

(2) If L∗ = 0, show that there exists a consistent homogeneous partitionclassifier with o(n) expected number of cells.

(3) If L∗ > 0, show a (more complicated) way of constructing a homogeneouspartition that yields a tree classifier with ELn → L∗ whenever X hasa density. Hint: First make a consistent nonhomogeneous binary treeclassifier, and refine it to make it homogeneous.

(4) If L∗ > 0, then show that the expected number of homogeneous regionsis at least cn for some c > 0. Such rules are therefore not practical, unlessL∗ = 0. Overfitting will occur.

Problem 20.14. Let X be uniform on [0, 1] and let Y be independent of X withPY = 1 = p. Find a tree classifier based upon simple interval splitting forwhich each region has one data point, and

lim infp→0

lim infn→∞ ELnp

≥ 3.

We know that for the nearest neighbor rule,

limp→0

lim supn→∞ ELnp

= 2,

so that the given interval splitting method is worse than the nearest neighbormethod. (This provides a counterexample to a conjecture of Breiman, Friedman,Olshen, and Stone (1984, pp.89–90).) Hint: Sort the points, and adjust the sizesof the intervals given to the class 0 and class 1 points in favor of the 1’s. Bydoing so, the asymptotic probability of error can be made as close as desired to(p2 + 3p(1− p))(1− p).

Problem 20.15. Continuation. Let X be uniform on [0, 1]2 and let Y be inde-pendent of X with PY = 1 = p. Find a data-based tree classifier based uponperpendicular cuts for which each region has one data point, diam(A(X)) → 0in probability (recall Theorems 6.1 and 21.2), and

lim infp→0

lim infn→∞ ELnp

= ∞.

Conclude that in Rd, we can construct tree classifiers with one point per cell thatare much worse than the nearest neighbor rule. Hint: The next two problemsmay help you with the construction and analysis.

Page 382: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 381

Problem 20.16. Continuation. Draw an i.i.d. sample X1, . . . , Xn from theuniform distribution on [0, 1]2, and let Y1, . . . , Yn be i.i.d. and independent of theXi’s with PY = 1 = p. Construct a binary tree partition with perpendicularcuts for Xi : Yi = 1 such that every leaf region has one and only one point anddiam(A(X)) → 0 in probability.

(1) How would you proceed, avoiding putting Xi’s on borders of regions?(2) Prove diam(A(X)) → 0 in probability.(3) Add the (Xi, Yi) pairs with Yi = 0 to the leaf regions, so that every

region has one class-1 observation and zero or more class-0 observations.Give the class-1 observation the largest area containing no class-0 points,as shown in Figure 20.31. Show that this can always be done by addingperpendicular cuts and keeping at least one observation per region.

Figure 20.31. Cutting a rectangle by

giving a large area to the single class-1

point.

(4) Partition all rectangles with more than one point further to finally obtaina one-point-per-leaf-region partition. If there are N points in a regionof the tree before the class-1 and class-0 points are separated (thus,N − 1 class-0 points and one class-1 point), then show that the expectedproportion of the region’s area given to class 1, given N , times N tendsto ∞. (An explicit lower bound will be helpful.) Hint: Use the nextproblem.

(5) Write the probability of error for the rule in terms of areas of rectangles,and use part (4) to get an asymptotic lower bound.

(6) Now let p tend to zero and get an asymptotic expression for your lowerbound in terms of p. Compare this with 2p(1− p), the asymptotic errorprobability of the 1-nearest neighbor rule.

Problem 20.17. Continuation. Draw a sample X1, . . . , Xn of n i.i.d. observa-tions uniformly distributed on [0, 1]d. The rectangles defined by the origin andX1, . . . , Xn as opposite vertices are denoted by R1, . . . , Rn, respectively. Theprobability content of Ri is clearly µ(Ri) =

∏d

i=1X

(j)i . Study

Mn = maxµ(Ri) : 1 ≤ i ≤ n,Ri does not contain Rj for any j 6= i,

the probability content of the largest empty rectangle with a vertex at the origin.

For d = 1, Mn is just the minimum of the Xi’s, and thus nMnL→ E , where E is

the exponential distribution. Also EMn = 1/(n+ 1). For d > 1, Mn is larger.Show that nMn →∞ in probability and try obtaining the first term in the rateof increase.

Problem 20.18. Show that diam(A(X)) → 0 in probability for the chronologicalquadtree whenever k → 0 and X has a density. Hint: Mimic the proof for thechronological quadtree.

Page 383: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 382

Problem 20.19. Show that the deep quadtree is consistent if X has a densityand k levels of splits are applied, where k →∞ and k/ logn→ 0.

Problem 20.20. Consider a full-fledged quadtree with n nodes (and thus n(2d−1) + 1 leaf regions). Assign to each region the Y -label (class) of its parent inthe quadtree. With this simple classifier, show that whenever X has a density,ELn → 2Eη(X)(1− η(X)) = LNN. In particular, lim supn→∞ ELn ≤ 2L∗

and the classifier is consistent when L∗ = 0.

Problem 20.21. InR2, partition the space as follows: X1, X2 define nine regionsby vertical and horizontal lines through them. X3, . . . , Xk are sent down to theappropriate subtrees in the 9-ary tree, and within each subtree with at least twopoints, the process is repeated recursively. A decision at x is by a majority vote(among (Xk+1, Yk+1), . . . , (Xn, Yn)) among those Xi’s in the same rectangle ofthe partition as x Show that if k →∞, k/n→ 0, the rule is consistent wheneverX has a density.

Problem 20.22. On the two-dimensional counterexample shown in the text formultivariate Stoller splits, prove that if splits are performed based upon a sampledrawn from the distribution, and if we stop after k splits with k depending onn in such a way that k/ logn→ 0, then Ln, the conditional probability of error,satisfies lim infn→∞ ELn ≥ (1 − ε)/2. Hint: Bound the probability of eversplitting [1, 2]2 anywhere by noting that the maximal difference between theempirical distribution functions for the first coordinate of X given Y = 0 andY = 1 is O

(1/√n)

when restricted to [1, 2]2.

Problem 20.23. Let X have the uniform distribution on [0, 5] and let Y =I2<X<3, so that L∗ = 0. Construct a binary classification tree by selecting ateach iteration the split that minimizes the impurity function I, where ψ is theGini criterion. Consider just the first three splits made in this manner. Let Lnbe the probability of error with the given rule (use a majority vote over the leafregions). Show that Ln → 0 in probability. Analyze the algorithm when the Ginicriterion is replaced by the probability-of-error criterion.

Problem 20.24. Let X be uniform on [0, 1] and let Y be independent of X,with PY = 1 = 2/3. Draw a sample of size n from this distribution. Investigatewhere the first cut might take place, based upon minimizing the impurity functionwith ψ(p, 1 − p) (the Gini criterion). (Once this is established, you will havediscovered the nature of the classification tree, roughly speaking.)

Problem 20.25. Complete the consistency proof of Theorem 20.7 for the rawbsp tree for R2 by showing that diam(A(X)) → 0 in probability.

Problem 20.26. Balanced bsp trees. We generalize median trees to allowsplitting the space along hyperplanes.

Figure 20.32. A balanced bsp tree: Each hyper-

plane cut splits a region into two cells of the same

cardinality.

Page 384: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 383

Assume that X has a density and that the tree possesses the X-property. Keepsplitting until there are 2k leaf regions, as with median trees. Call the treesbalanced bsp trees.

(1) Show that there are ways of splitting in R2 that lead to nonconsistentrules, regardless how k varies with n.

(2) If every splitting hyperplane is forced to contain d data points (in Rd)and these data points stay with the splitting node (they are not sent downto subtrees), then show that once again, there exists a splitting methodthat leads to nonconsistent rules, regardless of how k varies with n.

Problem 20.27. Let X have a uniform distribution on the unit ball of Rd.Let Y = I‖X‖≥1/2, so that L∗ = 0. Assume that we split the space by ahyperplane by minimizing an impurity function based upon the Gini criterion. Ifn is very large, where approximately would the cut take place (modulo a rotation,of course)?

Problem 20.28. There exist singular continuous distributions that admit uni-form [0, 1] marginals in Rd. Show, for example, that if X is uniformly distributedon the surface of the unit sphere of R3, then its three components are all uni-formly distributed on [−1, 1].

Problem 20.29. Verify that Theorem 20.9 remains valid in Rd.

Problem 20.30. Prove that Theorem 20.9 remains valid if rectangular cuts arereplaced by any of the elementary cuts shown on Figures 20.23 to 20.25, and suchcuts are performed recursively k times, always by maximizing the decrease of theempirical error.

Problem 20.31. Show that Theorem 20.11 remains valid if k → ∞ and k =

o(√

n/ logn). Hint: In the proof of the theorem, the bound on Vn and Wn is

loose. You may get more efficient bounds by applying Theorem 21.1 from thenext chapter.

Problem 20.32. Study the behavior of the grid complexity as ε ↓ 0 for thefollowing cases:

(1) X is uniform on the perimeter of the unit circle of R2 with probability1/2 and X is uniform on [0, 1]2 otherwise. Let Y = 1 if and only if X ison the perimeter of that circle (so that L∗ = 0).

(2) X =(X(1), X(2)

)is uniform on [0, 1]2 and Y = 1 if and only if X(1) +

X(2) ≤ 1.

Page 385: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 384Printer: Opaque this

21Data-Dependent Partitioning

21.1 Introduction

In Chapter 9 we investigated properties of the regular histogram rule. His-togram classifiers partition the observation space Rd and classify the inputvector X according to a majority vote among the Yi’s whose correspondingXi’s fall in the same cell of the partition asX. Partitions discussed in Chap-ter 9 could depend on the sample size n, but were not allowed to dependon the data Dn itself. We dealt mostly with grid partitions, but will nowallow other partitions as well. Just consider clustered training observationsX1, . . . , Xn. Near the cluster’s center finer partitions are called for. Simi-larly, when the components have different physical dimensions, the scale ofone coordinate axis is not related at all to the other scales, and some data-adaptive stretching is necessary. Sometimes the data are concentrated onor around a hyperplane. In all these cases, although consistent, the regularhistogram method behaves rather poorly, especially if the dimension of thespace is large. Therefore, it is useful to allow data-dependent partitions,while keeping a majority voting scheme within each cell.

The simplest data-dependent partitioning methods are based on statis-tically equivalent blocks in which each cell contains the same number ofpoints. In one-dimensional problems statistically equivalent blocks reduceto k-spacing estimates where the k-th, 2k-th, etc. order statistics determinethe partition of the real line.

Sometimes, it makes sense to cluster the data points into groups suchthat points in a group are close to each other, and define the partition so

Page 386: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.2 A Vapnik-Chervonenkis Inequality for Partitions 385

that each group is in a different cell.Many other data-dependent partitioning schemes have been introduced

in the literature. In most of these algorithms the cells of the partition cor-respond to the leaves of a binary tree, which makes computation of thecorresponding classification rule fast and convenient. Tree classifiers weredealt with in Chapter 20. Analysis of universal consistency for these al-gorithms and corresponding density estimates was begun by Abou-Jaoude(1976b) and Gordon and Olshen (1984), (1978), (1980) in a general frame-work, and was extended, for example, by Breiman, Friedman, Olshen, andStone (1984), Chen and Zhao (1987), Zhao, Krishnaiah, and Chen (1990),Lugosi and Nobel (1993), and Nobel (1994).

This chapter is more general than the chapter on tree classifiers, as everypartition induced by a tree classifier is a valid partition of space, but notvice versa. The example below shows a rectangular partition of the planethat cannot be obtained by consecutive perpendicular cuts in a binaryclassification tree.

Figure 21.1. A rectangular partition of[0, 1]2 that cannot be achieved by a tree ofconsecutive cuts.

In this chapter we first establish general sufficient conditions for theconsistency of data-dependent histogram classifiers. Because of the com-plicated dependence of the partition on the data, methods useful for han-dling regular histograms have to be significantly refined. The main tool isa large deviation inequality for families of partitions that is related to theVapnik-Chervonenkis inequality for families of sets. The reader is askedfor forgiveness—we want to present a very generally applicable theoremand have to sacrifice (temporarily) by increasing the density of the text.However, as you will discover, the rewards will be sweet.

21.2 A Vapnik-Chervonenkis Inequality forPartitions

This is a technical section. We will use its results in the next section toestablish a general consistency theorem for histogram rules with data-dependent partitions. The main goal of this section is to extend the basicVapnik-Chervonenkis inequality (Theorem 12.5) to families of partitions

Page 387: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.2 A Vapnik-Chervonenkis Inequality for Partitions 386

from families of sets. The line of thought followed here essentially appearsin Zhao, Krishnaiah, and Chen (1990) for rectangular partitions, and moregenerally in Lugosi and Nobel (1993). A substantial simplification in theproof was pointed out to us by Andrew Barron.

By a partition of Rd we mean a countable collection P = A1, A2, . . .of subsets of Rd such that ∪∞j=1Aj = Rd and Ai ∩ Aj = ∅ if i 6= j. Eachset Aj is called a cell of the partition P.

Let M be a positive number. For each partition P, we define P(M) asthe restriction of P to the ball SM (recall that SM ⊂ Rd denotes the closedball of radius M centered at the origin). In other words, P(M) is a partitionof SM , whose cells are obtained by intersecting the cells of P with SM . Weassume throughout that P is such that |P(M)| < ∞ for each M < ∞. Wedenote by B(P(M)) the collection of all 2|P

(M)| sets obtained by unions ofcells of P(M).

Just as we dealt with classes of sets in Chapter 12, here we introducefamilies of partitions. Let F be a (possibly infinite) collection of partitionsof Rd. F (M) = P(M) : P ∈ F denotes the family of partitions of SMobtained by restricting members of F to SM . For each M , we will measurethe complexity of a family of partitions F (M) by the shatter coefficientsof the class of sets obtained as unions of cells of partitions taken from thefamily F (M). Formally, we define the combinatorial quantity ∆n(F (M)) asfollows: introduce the class A(M) of subsets of Rd by

A(M) =A ∈ B(P(M)) for some P(M) ∈ F (M)

,

and define∆n(F (M)) = s(A(M), n),

the shatter coefficient of A(M). A(M) is thus the class of all sets that canbe obtained as unions of cells of some partition of SM in the collectionF (M). For example, if all members of F (M) partition SM into two sets,then ∆n(F (M)) is just the shatter coefficient of all sets in these partitions(with ∅ and SM included in the collection of sets).

Let µ be a probability measure on Rd and let X1, X2, . . . be i.i.d. ran-dom vectors in Rd with distribution µ. For n = 1, 2, . . . let µn denotethe empirical distribution of X1, . . . , Xn, which places mass 1/n at each ofX1, . . . , Xn. To establish the consistency of data-driven histogram meth-ods, we require information about the large deviations of random variablesof the form

supP(M)∈F(M)

∑A∈P(M)

|µn(A)− µ(A)|,

where F is an appropriate family of partitions.

Remark. Just as in Chapter 12, the supremum above is not guaranteed tobe measurable. In order to insure measurability, it is necessary to impose

Page 388: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.2 A Vapnik-Chervonenkis Inequality for Partitions 387

regularity conditions on uncountable collections of partitions. It suffices tomention that in all our applications, the measurability can be verified bychecking conditions given, e.g., in Dudley (1978), or Pollard (1984). 2

The following theorem is a consequence of the Vapnik-Chervonenkis in-equality:

Theorem 21.1. (Lugosi and Nobel (1993)). Let X1, . . . , Xn be i.i.d.random vectors in Rd with measure µ and empirical measure µn. Let F bea collection of partitions of Rd. Then for each M <∞ and ε > 0,

P

supP(M)∈F(M)

∑A∈P(M)

|µn(A)− µ(A)| > ε

≤ 8∆n(F (M))e−nε2/512+e−nε

2/2.

Proof. For a fixed partition P, define A′ as the set

A′ =⋃

A∈P(M):µn(A)≥µ(A)

A.

Then ∑A∈P(M)

|µn(A)− µ(A)|

=∑

A∈P(M):µn(A)≥µ(A)

(µn(A)− µ(A))

+∑

A∈P(M):µn(A)<µ(A)

(µ(A)− µn(A))

= 2 (µn(A′)− µ(A′)) + µ(SM )− µn(SM )

≤ 2 supA∈B(P(M))

|µn(A)− µ(A)|+ µ(SM )− µn(SM ).

Recall that the class of sets B(P(M)) contains all 2|P(M)| sets obtained by

unions of cells of P(M). Therefore,

supP(M)∈F(M)

∑A∈P(M)

|µn(A)− µ(A)|

≤ 2 supP(M)∈F(M)

supA∈B(P(M))∪SM

|µn(A)− µ(A)|+ µ(SM )− µn(SM ).

Observe that the first term on the right-hand side of the inequality isa uniform deviation of the empirical measure µn from µ over a specificclass of sets. The class contains all sets that can be written as unions ofcells of a partition P(M) in the class of partitions F (M). This class of sets

Page 389: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.2 A Vapnik-Chervonenkis Inequality for Partitions 388

is just A(M), defined above. The theorem now follows from the Vapnik-Chervonenkis inequality (Theorem 12.5), the definition of ∆n(F (M)), andHoeffding’s inequality (Theorem 8.1). 2

We will use a special application of Theorem 21.1, summarized in thefollowing corollary:

Corollary 21.1. Let (X1, Y1), (X2, Y2) . . . be a sequence of i.i.d. randompairs in Rd×0, 1 and let F1,F2, . . . be a sequence of families of partitionsof Rd. If for M <∞

limn→∞

log(∆n(F (M)

n ))

n= 0,

then

supP(M)∈F(M)

n

∑A∈P(M)

∣∣∣∣∣ 1nn∑i=1

YiIXi∈A −EY IX∈A

∣∣∣∣∣→ 0

with probability one as n tends to infinity.

Proof. Let ν be the measure of (X,Y ) on Rd × 0, 1, and let νn be theempirical measure corresponding to the sequence (X1, Y1), . . . , (Xn, Yn).Using F (M)

n we define a family G(M)n of partitions of Rd × 0, 1 by

G(M)n = P(M) × 0 : P(M) ∈ F (M)

n ∪ P(M) × 1 : P(M) ∈ F (M)n ,

where P×0 = A1, A2, . . .×0def= A1×0, A2×0, . . .. We apply

Theorem 21.1 for these families of partitions.

supP(M)∈F(M)

n

∑A∈P(M)

∣∣∣∣∣ 1nn∑i=1

YiIXi∈A −EY IX∈A

∣∣∣∣∣= sup

P(M)∈F(M)n

∑A∈P(M)

|νn(A× 1)− ν(A× 1)|

≤ supP(M)∈G(M)

n

∑A∈P(M)

∣∣νn(A)− ν(A)∣∣ .

It is easy to see that ∆n(G(M)n ) = ∆n(F (M)

n ). Therefore the stated conver-gence follows from Theorem 21.1 and the Borel-Cantelli lemma. 2

Lemma 21.1. Assume that the family F (M) is such that the number ofcells of the partitions in the family are uniformly bounded, that is, there isa constant N such that |P(M)| ≤ N for each P(M) ∈ F (M). Then

∆n(F (M)) ≤ 2N∆∗n(F (M)),

Page 390: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.3 Consistency 389

where ∆∗n(F (M)) is maximal number of different ways n points can be par-

titioned by members of F (M).

Example. Flexible grids. As a first simple example, let us take in Fnall partitions into d-dimensional grids (called flexible grids as they may bevisualized as chicken-wire fences with unequally spaced vertical and hori-zontal wires) in which cells are made up as Cartesian products of d intervals,and each coordinate axis contributes one of mn intervals to these products.Clearly, if P is a member partition of Fn, then |P| = md

n. Nevertheless,there are uncountably many P’s, as there are uncountably many intervalsof the real line. This is why the finite quantity ∆n comes in so handy. Wewill verify later that for each M ,

∆n(F (M)) ≤ 2mdn

(n+mn

n

)d,

so that the condition of Corollary 21.1 is fulfilled when

limn→∞

mdn

n= 0. 2

Figure 21.2. A flexible grid partition.

21.3 Consistency

In this section we establish a general consistency theorem for a large classof data-based partitioning rules. Using the training data Dn we producea partition Pn = πn(X1, Y1, . . . , Xn, Yn) according to a prescribed ruleπn. We then use the partition Pn in conjunction with Dn to produce aclassification rule based on a majority vote within the cells of the (random)partition. That is, the training set is used twice and it is this feature ofdata-dependent histogram methods that distinguishes them from regularhistogram methods.

Formally, an n-sample partitioning rule for Rd is a function πn thatassociates every n-tuple of pairs (x1, y1), . . . , (xn, yn) ∈ Rd × 0, 1 witha measurable partition of Rd. Associated with every partitioning rule πnthere is a fixed, non-random family of partitions

Fn = πn(x1, y1, . . . , xn, yn) : (x1, y1) . . . , (xn, yn) ∈ Rd × 0, 1.

Page 391: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.3 Consistency 390

Fn is the family of partitions produced by the partitioning rule πn for allpossible realizations of the training sequence Dn. When a partitioning ruleπn is applied to the sequence Dn = (X1, Y1), . . . , (Xn, Yn), it produces arandom partition Pn = πn(Dn) ∈ Fn. In what follows we suppress thedependence of Pn on Dn for notational simplicity. For every x ∈ Rd letAn(x) be the unique cell of Pn that contains the point x.

Now let π1, π2, . . . be a fixed sequence of partitioning rules. The clas-sification rule gn(·) = gn(·, Dn) is defined by taking a majority vote amongthe classes appearing in a given cell of Pn, that is,

gn(x) =

0 ifn∑i=1

IXi∈An(x),Yi=1 ≤n∑i=1

IXi∈An(x),Yi=0

1 otherwise.

We emphasize here that the partition Pn can depend on the vectors Xi

and the labels Yi. First we establish the strong consistency of the rulesgn for a wide class of partitioning rules. As always, diam(A) denotes thediameter of a set A, that is, the maximum Euclidean distance between anytwo points of A:

diam(A) = supx,y∈A

‖x− y‖.

Theorem 21.2. (Lugosi and Nobel (1993)). Let π1, π2, . . . be afixed sequence of partitioning rules, and for each n let Fn be the collectionof partitions associated with the n-sample partitioning rule πn. If

(i) for each M <∞

limn→∞

log(∆n(F (M)

n ))

n= 0,

and(ii) for all balls SB and all γ > 0,

limn→∞

µ (x : diam(An(x) ∩ SM ) > γ) = 0

with probability one,then the classification rule gn corresponding to πn satisfies

L(gn) → L∗

with probability one. In other words, the rule gn is strongly consistent.

Remark. In some applications we need a weaker condition to replace (ii) inthe theorem. The following condition will do: for every γ > 0 and δ ∈ (0, 1)

limn→∞

infT⊂Rd:µ(T )≥1−δ

µ(x : diam(An(x)∩T ) > γ) = 0 with probability one.

Page 392: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.3 Consistency 391

The verification of this is left to the reader (Problem 21.2). 2

The proof, given below, requires quite some effort. The utility of thetheorem is not immediately apparent. The length of the proof is indicativeof the generality of the conditions in the theorem. Given a data-dependentpartitioning rule, we must verify two things: condition (i) merely relates tothe richness of the class of partitions of Rd that may possibly occur, suchas flexible grids. Condition (ii) tells us that the rule should eventually makelocal decisions. From examples in earlier chapters, it should be obvious that(ii) is not necessary. Finite partitions ofRd necessarily have component setswith infinite diameter, hence we need a condition that states that such setshave small µ-measure. Condition (ii) requires that a randomly chosen cellhave a small diameter. Thus, it may be viewed as the “with-probability-one” version of condition (1) of Theorem 6.1. However, the weaker versionof condition (ii) stated in the above remark is more subtle. By consideringexamples in which µ has bounded support, more than just balls SM areneeded, as the sets of the partition near the boundary of the support mayall have infinite diameter as well. Hence we introduced an infimum withrespect to sets T over all T with µ(T ) ≥ 1− δ.

It suffices to mention that (ii) is satisfied for all distributions in someof the examples that follow. Theorem 21.2 then allows us to conclude thatsuch rules are strongly universally consistent. The theorem has done mostof the digestion of such proofs, so we are left with virtually no work at all.Consistency results will follow like dominos falling.

Proof of Theorem 21.2. Observe that the partitioning classifier gncan be rewritten in the form

gn(x) =

0 if

∑n

i=1IYi=1IXi∈An(x)

µ(An(x)) ≤∑n

i=1IYi=0IXi∈An(x)

µ(An(x))

1 otherwise.

Introduce the notation

ηn(x) =∑ni=1 YiIXi∈An(x)

nµ(An(x)).

For any ε > 0, there is an M ∈ (0,∞) such that PX /∈ SM < ε. Thus,by an application of Theorem 2.3 we see that

L(gn)− L∗

≤ Pgn(X) 6= Y,X ∈ SM |Dn −Pg∗(X) 6= Y,X ∈ SM+ ε

≤∫SM

|η(x)− ηn(x)|µ(dx) +∫SM

|(1− η(x))− η(0)n (x)|µ(dx) + ε,

where

η(0)n (x) =

∑ni=1(1− Yi)IXi∈An(x)

nµ(An(x)).

Page 393: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.3 Consistency 392

By symmetry, since ε > 0 is arbitrary, it suffices to show that for eachM <∞, ∫

SM

|η(x)− ηn(x)|µ(dx) → 0 with probability one.

Fix ε > 0 and let r : Rd → R be a continuous function with boundedsupport such that

∫SM

|η(x) − r(x)|µ(dx) < ε. Such r exists by TheoremA.8. Now define the auxiliary functions

ηn(x) =EY IX∈An(x)

∣∣Dn

µ(An(x))

and

rn(x) =Er(X)IX∈An(x)

∣∣Dn

µ(An(x))

.

Note that both ηn and rn are piecewise constant on the cells of the randompartition Pn. We may decompose the error as follows:

|η(x)− ηn(x)| (21.1)

≤ |η(x)− r(x)|+ |r(x)− rn(x)|+ |rn(x)− ηn(x)|+ |ηn(x)− ηn(x)|.

The integral of the first term on the right-hand side is smaller than ε bythe definition of r. For the third term we have∫

SM

|rn(x)− ηn(x)|µ(dx)

=∑

A∈P(M)n

∣∣EY IX∈A∣∣Dn

−E

r(X)IX∈A

∣∣Dn

∣∣=

∑A∈P(M)

n

∣∣∣∣∫A

η(x)µ(dx)−∫A

r(x)µ(dx)∣∣∣∣

≤∫SM

|η(x)− r(x)|µ(dx) < ε.

Consider the fourth term on the right-hand side of (21.1). Clearly,∫SM

|ηn(x)− ηn(x)|µ(dx)

=∑

A∈P(M)n

∣∣∣∣∣ 1nn∑i=1

YiIXi∈A −EY IX∈A

∣∣Dn

∣∣∣∣∣≤ sup

P(M)∈F(M)n

∑A∈P(M)

∣∣∣∣∣ 1nn∑i=1

YiIXi∈A −EY IX∈A

∣∣∣∣∣ .

Page 394: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.3 Consistency 393

It follows from the first condition of the theorem and Corollary 21.1 ofTheorem 21.1 that∫

SM

|ηn(x)− ηn(x)|µ(dx) → 0 with probability one.

Finally, we consider the second term on the right-hand side of (21.1).Using Fubini’s theorem we have∫

SM

|r(x)− rn(x)|µ(dx)

=∑

A∈P(M)n :µ(A) 6=0

∫A

∣∣∣∣∣r(x)− Er(X)IX∈A

∣∣Dn

µ(A)

∣∣∣∣∣µ(dx)

=∑

A∈P(M)n :µ(A) 6=0

1µ(A)

∫A

∣∣r(x)µ(A)−Er(X)IX∈A

∣∣Dn

∣∣µ(dx)

=∑

A∈P(M)n :µ(A) 6=0

1µ(A)

∫A

∣∣∣∣r(x)∫A

µ(dy)−∫A

r(y)µ(dy)∣∣∣∣µ(dx)

≤∑

A∈P(M)n :µ(A) 6=0

1µ(A)

∫A

∫A

|r(x)− r(y)|µ(dx)µ(dy).

Fix δ ∈ (0, 1). As r is uniformly continuous, there exists a number γ > 0such that if diam(A) < γ, then |r(x) − r(y)| < δ for every x, y ∈ A. Inaddition, there is a constant M < ∞ such that |r(x)| ≤ M for everyx ∈ Rd. Fix n now and consider the integrals

1µ(A)

∫A

∫A

|r(x)− r(y)|µ(dx)µ(dy)

appearing in the sum above. We always have the upper bound

1µ(A)

∫A

∫A

|r(x)− r(y)|µ(dx)µ(dy) ≤ 2Mµ(A).

Assume now that A ∈ P(M)n has diam(A) < γ. Then we can write

1µ(A)

∫A

∫A

|r(x)− r(y)|µ(dx)µ(dy) ≤ δµ(A).

Summing over the cells A ∈ P(M)n with µ(A) > 0, these bounds give∫

SM

|r(x)− rn(x)|µ(dx)

≤∑

A∈P(M)n :diam(A)≥γ

2Mµ(A) +∑

A∈P(M)n :diam(A)<γ

δµ(A)

≤ 2Mµ(x : diam(An(x)) ≥ γ) + δ.

Page 395: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.4 Statistically Equivalent Blocks 394

Letting n tend to infinity gives

lim supn→∞

∫SM

|r(x)− rn(x)|µ(dx) ≤ δ with probability one

by the second condition of the theorem. Summarizing,

lim supn→∞

∫SM

|η(x)− ηn(x)|µ(dx) ≤ 2ε+ δ with probability one.

Since ε and δ are arbitrary, the theorem is proved. 2

21.4 Statistically Equivalent Blocks

In this section we apply Theorem 21.2 both to classifiers based on uniformspacings in one dimension, and to their extension to multidimensional prob-lems. We refer to these as rules based on statistically equivalent blocks. Theorder statistics of the components of the training data are used to constructa partition into rectangles. All such classifiers are invariant with respect toall strictly monotone transformations of the coordinate axes. The simplestsuch rule is the k-spacing rule studied in Chapter 20 (see Theorem 20.1).Generalizations are possible in several ways. Theorem 21.2 allows us tohandle partitions depending on the whole data sequence—and not only onthe Xi’s. The next simple result is sometimes useful.

Theorem 21.3. Consider a data-dependent partitioning classifier on thereal line that partitions R into intervals in such a way that each intervalcontains at least an and at most bn points. Assume that X has a nonatomicdistribution. Then the classifier is strongly consistent whenever an → ∞and bn/n→ 0 as n→∞.

Proof. We check the conditions of Theorem 21.2. Let Fn contain all par-titions of R into m = dn/ane intervals. Since for each M , all partitionsin F (M)

n have at most m cells, we can bound ∆n(F (M)n ) according to the

Lemma 21.1. By the lemma, ∆n(F (M)n ) does not exceed 2m times the num-

ber of different ways n points can be partitioned into m intervals. A littlethought confirms that this number is(

n+m

n

),

and therefore,

∆n(F (M)n ) ≤ 2m

(n+m

n

).

Page 396: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.4 Statistically Equivalent Blocks 395

LetH denote the binary entropy function,H(x) = −x log(x)−(1−x) log(1−x) for x ∈ (0, 1). Note that H is symmetric about 1/2 and that H isincreasing for 0 < x ≤ 1/2. By the inequality of Theorem 13.4, log

(st

)≤

sH(t/s). Therefore, it is easy to see that

log ∆n(F (M)n ) ≤ m+ (n+m)H

(m

n+m

)≤ n/an + 2nH(1/an) + 1.

As H(x) → 0 when x→ 0, the condition an →∞ implies that

1n

log ∆n(F (M)n ) → 0,

which establishes condition (i).To establish condition (ii) of Theorem 21.2, we proceed similarly to the

proof of Theorem 20.1. Fix γ, ε > 0. There exists an interval [−M,M ] suchthat µ([−M,M ]c) < ε, and consequently

µ(x : diam(An(x)) > γ) ≤ ε+ µ(x : diam(An(x)) > γ ∩ [−M,M ]),

where An(x) denotes the cell of the partition Pn containing x. Among theintervals of Pn, there can be at most 2 + 2M/γ disjoint intervals of lengthgreater than γ in [−M,M ]. Thus we may bound the second term on theright-hand side above by

µ(x : diam(An(x)) > γ ∩ [−M,M ])

≤(

2 +2Mγ

)maxA∈Pn

µ(A)

≤(

2 +2Mγ

)(maxA∈Pn

µn(A) + maxA∈Pn

|µ(A)− µn(A)|)

≤(

2 +2Mγ

)(bnn

+ supA∈A

|µ(A)− µn(A)|),

where A is the set of all intervals in R. The first term in the parenthesisconverges to zero by the second condition of the theorem, while the secondterm goes to zero with probability one by an obvious extension of theclassical Glivenko-Cantelli theorem (Theorem 12.4). Summarizing, we haveshown that for any γ, ε > 0

lim supn→∞

µ(x : diam(An(x)) > γ) ≤ ε with probability one.

Page 397: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.4 Statistically Equivalent Blocks 396

This completes the proof. 2

The d-dimensional generalizations of k-spacing rules include rules basedupon statistically equivalent blocks, that is, partitions with sets that con-tain k points each. It is obvious that one can proceed in many ways, see,for example, Anderson (1966), Patrick (1966), Patrick and Fisher (1967),Quesenberry and Gessaman (1968) and Gessaman and Gessaman (1972).

As a first example, consider the following algorithm: the k-th smallestx(1)-coordinate among the training data defines the first cut. The (infinite)rectangle with n − k points is cut according to the x(2)-axis, isolating an-other k points. This can be repeated on a rotational basis for all coordinateaxes. Unfortunately, the classifier obtained this way is not consistent. Tosee this, observe that if k is much smaller than n—a clearly necessary re-quirement for consistency—then almost all cells produced by the cuts arelong and thin. We sketched a distribution in Figure 21.3 for which the errorprobability of this classifier fails to converge to L∗. The details are left tothe reader (Problem 21.3). This example highlights the importance of con-dition (ii) of Theorem 21.2, that is, that the diameters of the cells shouldshrink in some sense as n→∞.

Page 398: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.4 Statistically Equivalent Blocks 397

Figure 21.3. A non-consistent k-blockalgorithm (with k = 2 in the picture).η(x) = 1 in the shaded area and η(x) = 0in the white area.

Rules have been developed in which the rectangular partition dependsnot only upon the Xi’s in the training sequence, but also upon the Yi’s (see,e.g., Henrichon and Fu (1969), Meisel and Michalopoulos (1973) and Fried-man (1977)). For example, Friedman cuts the axes at the places where theabsolute differences between the marginal empirical distribution functionsare largest, to insure minimal empirical error after the cut. His procedureis based upon Stoller splits (see Chapters 4 and 20).

Rules depending on the coordinatewise ranks of data points are inter-esting because they are invariant under monotone transformations of thecoordinate axes. This is particularly important in practice when the com-ponents are not physically comparable. “Distribution-free” is an adjectiveoften used to point out a property that is universally valid. For such meth-ods in statistics, see the survey of Das Gupta (1964). Statistically equivalentsets in partitions are called “distribution-free” because the measure µ(A)of a set in the partition does not depend upon the distribution of X. Wealready noted a similar distribution-free behavior for k-d trees and mediantrees (Chapter 20). There is no reason to stay with rectangular-shaped sets(Anderson and Benning (1970), Beakley and Tuteur (1972)) but doing sogreatly simplifies the interpretation of a classifier. In this book, to avoidconfusion, we reserve the term “distribution-free” for consistency results orother theoretical properties that hold for all distributions of (X,Y ).

It is possible to define consistent partitions that have statistically equiva-lent sets. To fix the ideas, we take Gessaman’s rule (1970) as our prototyperule for further study (note: for hypothesis testing, this partition was al-ready noted by Anderson (1966)). For each n, let m =

⌈(n/kn)

1/d⌉. Project

the vectors X1, . . . , Xn onto the first coordinate axis, and then partitionthe data into m sets using hyperplanes

Page 399: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.4 Statistically Equivalent Blocks 398

Figure 21.4. Gessaman’s parti-tion with m = 3.

perpendicular to that axis, in such a way that each set contains an equalnumber of points (except, possibly, the rightmost set, where fewer pointsmay fall if n is not a multiple of m). We obtain m cylindrical sets. In thesame fashion, cut each of these cylindrical sets, along the second axis, intom boxes such that each box contains the same number of data points. Con-tinuing in the same way along the remaining coordinate axes, we obtain md

rectangular cells, each of which (with the exception of those on the bound-ary) contains about kn points. The classification rule gn uses a majorityvote among those Yi’s for which Xi lies within a given cell. Consistency ofthis classification rule can be established by an argument similar to thatused for the kn-spacing rule above. One needs to check that the conditionsof Theorem 21.2 are satisfied. The only minor difference appears in the com-putation of ∆n, which in this case is bounded from above by 2m

d(n+mn

)d.

The following theorem summarizes the result:

Theorem 21.4. Assume that the marginal distributions of X in Rd arenonatomic. Then the partitioning classification rule based on Gessaman’srule is strongly consistent if kn →∞ and kn/n→ 0 as n tends to infinity.

To consider distributions with possibly atomic marginals, the partition-ing algorithm must be modified, since every atom has more than kn pointsfalling on it for large n. With a proper modification, a strongly universallyconsistent rule can be obtained. We leave the details to the reader (Problem21.4).

Remark. Consistency of Gessaman’s classification scheme can also be de-rived from the results of Gordon and Olshen (1978) under the additionalcondition kn/

√n→∞. Results of Breiman, Friedman, Olshen, and Stone

(1984) can be used

Page 400: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.5 Partitioning Rules Based on Clustering 399

to improve this condition to kn/ log n → ∞. Theorem 21.4 guaranteesconsistency under the weakest possible condition kn →∞. 2

21.5 Partitioning Rules Based on Clustering

Clustering is one of the most widely used methods in statistical data anal-ysis. Typical clustering schemes divide the data into a finite number ofdisjoint groups by minimizing some empirical error measure, such as theaverage squared distance from cluster centers (see Hartigan (1975)). In thissection we outline the application of our results to classification rules basedon k-means clustering of unlabeled observations.

As a first step, we divide X1, . . . , Xn into kn disjoint groups having clus-ter centers a1, . . . , akn ∈ Rd. The vectors a1, . . . , akn are chosen to minimizethe empirical squared Euclidean distance error,

en(b1, . . . , bkn) =

1n

n∑i=1

min1≤j≤kn

‖Xi − bj‖2

over all the nearest-neighbor clustering rules having kn representativesb1, . . . , bkn

∈ Rd, where ‖ · ‖ denotes the usual Euclidean norm. Note thatthe choice of cluster centers depends only on the vectors Xi, not on theirlabels. For the behavior of en(a1, . . . , akn

), see Problem 29.4.The vectors a1, . . . , akn

give rise to a Voronoi partition Pn = A1, . . . , Akn

in a natural way: for each j ∈ 1, . . . , kn, let

Aj = x : ‖x− aj‖ ≤ ‖x− ai‖ for all i = 1, . . . , kn.

Ties are broken by assigning points on the boundaries to the vector thathas the smallest index.

The classification rule gn is defined in the usual way: gn(x) is a majorityvote among those Yj ’s such that Xj falls in An(x). If the measure µ ofX has a bounded support, Theorem 21.2 shows that the classification rulegn is strongly consistent if kn grows with n at an appropriate rate. Notethat this rule is just another of the prototype nearest neighbor rules thatwe discussed in Chapter 19.

Page 401: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.5 Partitioning Rules Based on Clustering 400

Figure 21.5. Example of parti-tioning based on clustering withkn = 3. The criterion we mini-mize is the sum of the squares ofthe distances of the Xi’s to theaj’s.

Theorem 21.5. (Lugosi and Nobel (1993)). Assume that there is abounded set A ⊂ Rd such that PX ∈ A = 1. Let kn be a sequence ofintegers for which

kn →∞ andk2n log nn

→ 0 as n→∞.

Let gn(·, Dn) be the histogram classification rule based on the Voronoi par-tition of kn cluster centers minimizing the empirical squared Euclidean dis-tance error. Then

L(gn) → L∗

with probability one as n tends to infinity. If d = 1 or 2, then the secondcondition on kn can be relaxed to

kn log nn

→ 0.

Proof. Again, we check the conditions of Theorem 21.2. Let Fn consistof all Voronoi partitions of kn points in Rd. As each partition consists ofkn cells, we may use Lemma 21.1 to bound ∆n(F (M)

n ). Clearly, boundariesbetween cells are subsets of hyperplanes. Since there are at most kn(kn −1)/2 boundaries between the kn Voronoi cells, each cell of a partition inFn is a polytope with at most kn(kn − 1)/2 < k2

n faces. By Theorem 13.9,n fixed points in Rd, d ≥ 2, can be split by hyperplanes in at most nd+1

different ways. It follows that for each

Page 402: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.5 Partitioning Rules Based on Clustering 401

M <∞, ∆n(F (M)n ) ≤ 2knn(d+1)k2

n , and consequently

1n

log ∆n(F (M)n ) ≤ kn

n+

(d+ 1)k2n log n

n→ 0

by the second condition on the sequence kn. Thus condition (i) of The-orem 21.2 is satisfied.

It remains to establish condition (ii) of Theorem 21.2. This time we needthe weaker condition mentioned in the remark after the theorem, that is,that for every γ > 0 and δ ∈ (0, 1)

infT :µ(T )≥1−δ

µ(x : diam(An(x) ∩ T ) > γ) → 0 with probability one.

Clearly, we are done if we can prove that there is a sequence of subsetsTn of Rd (possibly depending on the data Dn) such that µ(Tn) → 1 withprobability one, and

µ (x : diam(An(x) ∩ Tn) > γ) = 0.

To this end, let a1, . . . , akn denote the optimal cluster centers correspondingto Dn, and define

Tn =kn⋃j=1

Saj ,γ/2 ∩Aj ,

where Sx,r is the ball of radius r around the vector x. Clearly, x ∈ Tn impliesthat ‖x − a(x)‖ < γ/2, where a(x) denotes the closest cluster center to xamong a1, . . . , akn

. But since

An(x) ∩ Tn = An(x) ∩ Sa(x),γ/2,

it follows that

diam(An(x) ∩ Tn) ≤ diam(Sa(x),γ/2) = γ,

andµ (x : diam(An(x) ∩ Tn) > γ) = 0.

It remains to show that µ(Tn) → 1 with probability one as n→∞. UsingMarkov’s inequality, we may write

1− µ(Tn) = PX /∈ Tn|Dn

≤ P

min1≤j≤kn

‖X − aj‖2 >(γ

2

)2∣∣∣∣Dn

E

min1≤j≤kn ‖X − aj‖2∣∣Dn

(γ/2)2

.

Page 403: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.5 Partitioning Rules Based on Clustering 402

Using a large-deviation inequality for the empirical squared error of nearest-neighbor clustering schemes, it can be shown (see Problem 29.4) that if Xhas bounded support, then

E

min1≤j≤kn

‖X − aj‖2∣∣∣∣Dn

− minb1,...,bkn∈Rd

E

min1≤j≤kn

‖X − bj‖2→ 0

with probability one if kn log n/n → 0 as n → ∞. Moreover, it is easy tosee that

minb1,...,bkn∈Rd

E

min1≤j≤kn

‖X − bj‖2→ 0

as kn →∞. It follows that µ(Tn) → 1, as desired.If d = 1, then the cells of the Voronoi partition are intervals on the

real line, and therefore, ∆n(F (M)n ) ≤ 2knnkn . Similarly, if d = 2, then

the number of hyperplanes defining the Voronoi partition increases linearlywith kn. To see this, observe that if we connect centers of neighboringclusters by edges, then we obtain a planar graph. From Euler’s theorem(see, e.g., Edelsbrunner (1987, p.242)), the number of edges in a planargraph is bounded by 3N − 6, N ≥ 3, where N is the number of vertices.Thus, in order to satisfy condition (i), it suffices that kn log n/n → 0 inboth cases. 2

Remark. In the theorem above we assume that the cluster centers a1, . . . , akare empirically optimal in the sense that they are chosen to minimize theempirical squared Euclidean distance error

1n

n∑i=1

min1≤j≤kn

‖Xi − aj‖2.

Practically speaking, it is hard to determine minimum. To get around thisdifficulty, several fast algorithms have been proposed that approximate theoptimum (see Hartigan (1975) for a survey). Perhaps the most popularalgorithm is the so-called k-means clustering method, also known in thetheory of quantization as the Lloyd-Max algorithm (Lloyd (1982), Max(1960), Linde, Buzo, and Gray (1980)). The iterative method works asfollows:

Step 1. Take k initial centers a(0)1 , . . . , a

(0)k ∈ Rd, and set i = 0.

Step 2. Cluster the data pointsX1, . . . , Xn around the centers a(i)1 , . . .,

a(i)k into k sets such that the m-th set C(i)

m contains the Xj ’sthat are closer to a(i)

m than to any other center. Ties are brokenin favor of smaller indices.

Page 404: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.6 Data-Based Scaling 403

Step 3. Determine the new centers a(i+1)1 , . . . , a

(i+1)k as the averages

of the data points within the clusters:

a(i+1)m =

∑j:Xj∈C(i)

mXj

|C(i)m |

.

Step 4. Increase i by one, and repeat Steps 1 and 2 until there are nochanges in the cluster centers.

It is easy to see that each step of the algorithm decreases the empiricalsquared Euclidean distance error. On the other hand, the empirical squarederror can take finitely many different values during the execution of thealgorithm. Therefore the algorithm halts in finite time. By inspecting theproof of Theorem 21.5, it is not hard to see that consistency can alsobe proved for partitions given by the suboptimal cluster centers obtainedby the k-means method, provided that it is initialized appropriately (seeProblem 21.6). 2

21.6 Data-Based Scaling

We now choose the grid size h in a cubic histogram rule in a data-dependentmanner and denote its value by Hn. Theorem 21.2 implies the followinggeneral result:

Theorem 21.6. Let gn be the cubic histogram classifier based on a parti-tion into cubes of size Hn. If

(a) limn→∞

Hn = 0 and

(b) limn→∞

nHdn = ∞ with probability one,

then the partitioning rule is strongly universally consistent.

To prove the theorem, we need the following auxiliary result:

Lemma 21.2. Let Z1, Z2, . . . be a sequence of nonnegative random vari-ables. If limn→∞ Zn = 0 with probability one, then there exists a sequencean ↓ 0 of positive numbers such that limn→∞ IZn≥an = 0 with probabilityone.

Proof. Define Vn = supm≥n Zm. Then clearly, Vn ↓ 0 with probabilityone. We can find a subsequence n1, n2, . . . of positive integers such that foreach k,

PVnk≥ 1/k ≤ 2−k.

Page 405: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.6 Data-Based Scaling 404

Then the Borel-Cantelli lemma implies that

limk→∞

IVnk≥1/k = 0 with probability one. (21.2)

Define an = 1/k for nk ≤ n < nk+1. Then, for nk ≤ n < nk+1,

IVn≥an = IVn≥1/k ≤ IVnk≥1/k.

The fact that Vn ≥ Zn and (21.2) imply the statement. 2

Proof of Theorem 21.6. Let an and bn be sequences of positivenumbers with an < bn. Then

limn→∞

L(gn) = L∗⊆

limn→∞

L(gn) = L∗,Hn ∈ [an, bn]⋃

Hn /∈ [an, bn] .

It follows from Lemma 21.2 that there exist sequences of positive numbersan and bn satisfying an < bn, bn → 0, and nadn → ∞ as n → ∞ suchthat

limn→∞

IHn /∈[an,bn] = 0 with probability one.

Therefore we may assume that for each n, PHn ∈ [an, bn] = 1. Sincebn → 0, condition (ii) of Theorem 21.2 holds trivially, as all diameters ofall cells are inferior to bn

√d. It remains to check condition (i). Clearly, for

each M < ∞, each partition in F (M)n contains less than c0(1/an)d cells,

where the constant c0 depends on M and d only. On the other hand, itis easy to see that n points can not be partitioned more than c1n(1/an)d

different ways by cubic-grid partitions with cube size h ≥ an for some otherconstant c1. Therefore, for each M <∞,

∆n(F (M)) ≤ 2c0(1/an)d c1n

adn,

and condition (i) is satisfied. 2

In many applications, different components of the feature vector X cor-respond to different physical measurements. For example, in a medical ap-plication, the first coordinate could represent blood pressure, the secondcholesterol level, and the third the weight of the patient. In such cases thereis no reason to use cubic histograms, because the resolution of the partitionhn along the coordinate axes depends on the apparent scaling of the mea-surements, which is rather arbitrary. Then one can use scale-independentpartitions such as methods based on order statistics described earlier. Al-ternatively, one might use rectangular partitions instead of cubic ones, andlet the data decide the scaling along the different coordinate axes. Again,Theorem 21.2 can be used to establish conditions of universal consistency ofthe classification rule corresponding to data-based rectangular partitions:

Page 406: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

21.7 Classification Trees 405

Theorem 21.7. Consider a data-dependent histogram rule when the cellsof the partition are all rectangles of the form

[k1Hn1, (k1 + 1)Hn1)× . . .× [kdHnd, (kd + 1)Hnd),

where k1, . . . , kd run through the set of integers, and the edge sizes of therectangles Hn1, . . . ,Hnd are determined from the data Dn. If as n→∞

Hni → 0 for each 1 ≤ i ≤ d, and nHn1 · · ·Hnd →∞ with probability one,

then the data-dependent rectangular partitioning rule is strongly universallyconsistent.

To prove this, just check the conditions of Theorem 21.2 (Problem 21.7).We may pick, for example, Hn1, . . . ,Hnd to minimize the resubstitutionestimate

n∑i=1

Ign(Xi) 6=Yi

subject of course to certain conditions, so that∑di=1Hni → 0 with proba-

bility one, yet n∏di=1Hni →∞ with probability one. See Problem 21.8.

21.7 Classification Trees

Consider a partition of the space obtained by a binary classification treein which each node dichotomizes its set by a hyperplane (see Chapter 20for more on classification trees). The construction of the tree is stoppedaccording to some unspecified rule, and classification is by majority voteover the convex polytopes of the partition.

The following corollary of Theorem 21.2 generalizes (somewhat) the con-sistency results in the book by Breiman, Friedman, Olshen, and Stone(1984):

Theorem 21.8. (Lugosi and Nobel (1993)). Let gn be a binary treeclassifier based upon at most mn−1 hyperplane splits, where mn = o(n/ log n).If, in addition, condition (ii) of Theorem 21.2 is satisfied, then gn is stronglyconsistent. In particular, the rule is strongly consistent if condition (ii) ofTheorem 21.2 holds and every cell of the partition contains at least knpoints, where kn/ log n→∞.

Proof. To check condition (i) of Theorem 21.2, recall Theorem 13.9 whichimplies that n ≥ 2 points in a d-dimensional Euclidean space can be di-chotomized by hyperplanes in at most nd+1 different ways. From this, wesee that the number of different ways n points of Rd can be partitioned bythe rule gn can be bounded by(

nd+1)mn = 2(d+1)mn logn,

Page 407: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 406

as there are not more than mn cells in the partition. Thus,

∆n(Fn) ≤ 2mn2(d+1)mn logn.

By the assumption that mn log n/n→ 0 we have

1n

log ∆n(Fn) ≤mn

n+mn(d+ 1) log n

n→ 0,

so that condition (i) of Theorem 21.2 is satisfied.For the second part of the statement, observe that there are no more

than n/kn cells in any partition, and that the tree-structured nature of thepartitions assures that gn is based on at most n/kn hyperplane splits. Thiscompletes the proof. 2

Problems and Exercises

Problem 21.1. Let P be a partition of Rd. Prove that∑A∈P

|µn(A)− µ(A)| = 2 supB∈B(P)

|µn(B)− µ(B)| ,

where the class of sets B(P) contains all sets obtained by unions of cells of P.This is Scheffe’s (1947) theorem for partitions. See also Problem 12.13.

Problem 21.2. Show that condition (ii) of Theorem 21.2 may be replaced bythe following: for every γ > 0 and δ ∈ (0, 1)

limn→∞

infT⊂Rd:µ(T )≥1−δ

µ(x : diam(An(x) ∩ T ) > γ) = 0 with probability one.

Problem 21.3. Let X be uniformly distributed on the unit square [0, 1]2. Letη(x) = 1 if x(1) ≤ 2/3, and η(x) = 0 otherwise (see Figure 21.3). Consider thealgorithm when first the x(1)-coordinate is cut at the k-th smallest x(1)-valueamong the training data. Next the rectangle with n − k points is cut accordingto the x(2)-axis, isolating another k points. This is repeated on a rotational basisfor the two coordinate axes. Show that the error probability of the obtainedpartitioning classifier does not tend to L∗ = 0. Can you determine the asymptoticerror probability?

Problem 21.4. Modify Gessaman’s rule based on statistically equivalent blocksso that the rule is strongly universally consistent.

Problem 21.5. Cut each axis independently into intervals containing exactly kof the (projected) data points. The i-th axis has intervals A1,i, A2,i, . . .. Form ahistogram rule that takes a majority vote over the product sets Ai1,1×· · ·×Aid,d.

Page 408: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 407

Figure 21.6. A partition based upon

the method obtained above with k = 6.

This rule does not guarantee a minimal number of points in every cell. Never-theless, if kd = o(n), k →∞, show that this decision rule is consistent, i.e., thatELn → L∗ in probability.

Problem 21.6. Show that each step of the k-means clustering algorithm de-creases the empirical squared error. Conclude that Theorem 21.5 is also true ifthe clusters are given by the k-means algorithm. Hint: Observe that the onlyproperty of the clusters used in the proof of Theorem 21.5 is that

E

min

1≤j≤kn

‖X − aj‖2∣∣∣∣Dn→ 0

with probability one. This can be proven for clusters given by the k-means algo-rithm if it is appropriately initialized. To this end, use the techniques of Problem29.4.

Problem 21.7. Prove Theorem 21.7.

Problem 21.8. Consider the Hni’s in Theorem 21.7, the interval sizes for cubichistograms. Let the Hni’s be found by minimizing

n∑i=1

Ign(Xi) 6=Yi

subject to the condition that each marginal interval contain at least k data points(that is, at least k data points have that one coordinate in the interval). Underwhich condition on k is the rule consistent?

Problem 21.9. Take a histogram rule with data-dependent sizes Hn1, . . . , Hndas in Theorem 21.7, defined as follows:

(1) Hni =

(max

1≤j≤nX

(i)j − min

1≤j≤nX

(i)j

)/√n, 1 ≤ i ≤ d (whereXj = (X

(1)j , . . .,

X(d)j );

(2) Hni = (Wni − Vni)/n1/(2d), 1 ≤ i ≤ d, where Vni and Wni are 25 and 75

percentiles of X(i)1 , . . . , X

(i)n .

Assume for convenience that X has nonatomic marginals. Show that (1) leadssometimes to an inconsistent rule, even if d = 1. Show that (2) always yields ascale-invariant consistent rule.

Page 409: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 408Printer: Opaque this

22Splitting the Data

22.1 The Holdout Estimate

Universal consistency gives us a partial satisfaction—without knowing theunderlying distribution, taking more samples is guaranteed to push us closeto the Bayes rule in the long run. Unfortunately, we will never know justhow close we are to the Bayes rule unless we are given more informationabout the unknown distribution (see Chapter 7). A more modest goal is todo as well as possible within a given class of rules. To fix the ideas, considerall nearest neighbor rules based upon metrics of the form

‖x‖2 =d∑i=1

aix(i)2,

where ai ≥ 0 for all i and x =(x(1), . . . , x(d)

). Here the ai’s are variable

scale factors. Let φn be a particular nearest neighbor rule for a given choiceof (a1, . . . , ad), and let gn be a data-based rule chosen from this class. Thebest we can hope for now is something like

L(gn)infφn

L(φn)→ 1 in probability

for all distributions, where L(gn) = Pgn(X) 6= Y |Dn is the conditionalprobability of error for gn. This sort of optimality-within-a-class is definitelyachievable. However, proving such optimality is generally not easy as gndepends on the data. In this chapter we present one possible methodology

Page 410: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.1 The Holdout Estimate 409

for selecting provably good rules from restricted classes. This is achievedby splitting the data into a training sequence and a testing sequence. Thisidea was explored and analyzed in depth in Devroye (1988b) and is nowformalized.

The data sequenceDn is split into a training sequenceDm = (X1, Y1), . . .,(Xm, Ym) and a testing sequence Tl = (Xm+1, Ym+1), . . . , (Xm+l, Ym+l),where l +m = n. The sequence Dm defines a class of classifiers Cm whosemembers are denoted by φm(·) = φm(·, Dm). The testing sequence is usedto select a classifier from Cm that minimizes the error count

Lm,l(φm) =1l

l∑i=1

Iφm(Xm+i) 6=Ym+i.

This error estimate is called the holdout estimate, as the testing sequenceis “held out” of the design of φm. Thus, the selected classifier gn ∈ Cmsatisfies

Lm,l(gn) ≤ Lm,l(φm)

for all φm ∈ Cm. The subscript n in gn may be a little confusing, sincegn is in Cm, a class of classifiers depending on the first m pairs Dm only.However, gn depends on the entire data Dn, as the rest of the data is usedfor testing the classifiers in Cm. We are interested in the difference betweenthe error probability

L(gn) = Pgn(X) 6= Y |Dn,

and that of the best classifier in Cm, infφm∈CmL(φm). Note that L(φm) =

Pφm(X) 6= Y |Dm denotes the error probability conditioned on Dm. Theconditional probability

PL(gn)− inf

φm∈Cm

L(φm) > ε

∣∣∣∣Dm

is small when most testing sequences Tl pick a rule gn whose error prob-ability is within ε of the best classifier in Cm. We have already addressedsimilar questions in Chapters 8 and 12. There we have seen (Lemma 8.2),that

L(gn)− infφm∈Cm

L(φm) ≤ 2 supφm∈Cm

|Lm,l(φm)− L(φm)|.

If Cm contains finitely many rules, then the bound of Theorem 8.3 may beuseful:

P

sup

φm∈Cm

|Lm,l(φm)− L(φm)| > ε

∣∣∣∣∣Dm

≤ 2|Cm|e−2lε2 .

If we take m = l = n/2, then Theorem 8.3 shows (see Problem 12.1) thaton the average we are within

√log(2e|Cm|)/n of the best possible error

rate, whatever it is.

Page 411: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.2 Consistency and Asymptotic Optimality 410

If Cm is of infinite cardinality, then we can use the Vapnik-Chervonenkistheory to get similar inequalities. For example, from Theorem 12.8 we get

P

sup

φm∈Cm

∣∣∣Lm,l(φm)− L(φm)∣∣∣ > ε

∣∣∣∣∣Dm

≤ 4e8S(Cm, l2)e−2lε2 , (22.1)

and consequently,

PL(gn)− inf

φm∈Cm

L(φm) > ε

∣∣∣∣Dm

≤ 4e8S(Cm, l2)e−lε

2/2, (22.2)

where S(Cm, l) is the l-th shatter coefficient corresponding to the class ofclassifiers Cm (see Theorem 12.6 for the definition). Since Cm depends onthe training data Dm, the shatter coefficients S(Cm, l) may depend on Dm,too. However, usually it is possible to find upper bounds on the randomvariable S(Cm, l) that depend on m and l only, but not on the actual valuesof the random variables X1, Y1, . . . , Xm, Ym. Both upper bounds above aredistribution-free, and the problem now is purely combinatorial: count |Cm|(this is usually trivial), or compute S(Cm, l).

Remark. With much more effort it is possible to obtain performancebounds of the holdout estimate in the form of bounds for

PLm,l(φm)− L(φn) > ε

for some special rules where φm and φn are carefully defined. For example,Devroye and Wagner (1979a) give upper bounds when both φm and φn arek-nearest neighbor classifiers with the same k (but working with differentsample size). 2

Remark. Minimizing the holdout estimate is not the only possibility. Othererror estimates that do not split the data may be used in classifier selectionas well. Such estimates are discussed in Chapters 23, 24, and 31. However,these estimates are usually tailored to work well for specific discriminationrules. The most general and robust method is certainly the data splittingdescribed here. 2

22.2 Consistency and Asymptotic Optimality

Typically, Cm becomes richer as m grows, and it is natural to ask whetherthe empirically best classifier in Cm is consistent.

Theorem 22.1. Assume that from each Cm we can pick one φm such thatthe sequence of φm’s is consistent for a certain class of distributions. Then

Page 412: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.2 Consistency and Asymptotic Optimality 411

the automatic rule gn defined above is consistent for the same class ofdistributions (i.e., EL(gn) → L∗ as n→∞) if

limn→∞

log(ES(Cm, l))l

= 0.

Proof. Decompose the difference between the actual error probability andthe Bayes error as

L(gn)− L∗ =(L(gn)− inf

φm∈Cm

L(φm))

+(

infφm∈Cm

L(φm)− L∗).

The convergence of the first term on the right-hand side is a direct corollaryof (22.2). The second term converges to zero by assumption. 2

Theorem 22.1 shows that a consistent rule is picked if the sequence of Cm’scontains a consistent rule, even if we do not know which functions from Cmlead to consistency. If we are just worried about consistency, Theorem 22.1reassures us that nothing is lost as long as we take l much larger thanlog(ES(Cm, l)). Often, this reduces to a very weak condition on the sizel of the testing set.

Let us now introduce the notion of asymptotic optimality. A sequenceof rules gn is said to be asymptotically optimal for a given distribution of(X,Y ) when

limn→∞

EL(gn) − L∗

E infφm∈CmL(φm) − L∗

= 1.

Our definition is not entirely fair, because gn uses n observations, whereasthe family of rules in the denominator is restricted to using m observations.If gn is not taken from the same Cm, then it is possible to have a ratio whichis smaller than one. But if gn ∈ Cm, then the ratio always is at least one.That is why the definition makes sense in our setup.

When our selected rule is asymptotically optimal, we have achieved some-thing very strong: we have in effect picked a rule (or better, a sequence ofrules) which has a probability of error converging at the optimal rate at-tainable within the sequence of Cm’s. And we do not even have to knowwhat the optimal rate of convergence is. This is especially important in non-parametric rules, where some researchers choose smoothing factors basedon theoretical results about the optimal attainable rate of convergence forcertain classes of distributions.

We are constantly faced with the problem of choosing between paramet-ric and nonparametric discriminators. Parametric discriminators are basedupon an underlying model in which a finite number of unknown parame-ters is estimated from the data. A case in point is the multivariate normaldistribution, which leads to linear or quadratic discriminators. If the model

Page 413: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.3 Nearest Neighbor Rules with Automatic Scaling 412

is wrong, parametric methods can perform very poorly; when the modelis right, their performance is difficult to beat. The method based on split-ting the data chooses among the best discriminator depending upon whichhappens to be best for the given data. We can throw in Cm a variety ofrules, including nearest neighbor rules, a few linear discriminators, a cou-ple of tree classifiers and perhaps a kernel rule. The probability boundsabove can be used when the complexity of Cm (measured by its shattercoefficients) does not get out of hand.

The notion of asymptotic optimality can be too strong in many cases. Thereason for this is that in some rare lucky situations E infφm∈Cm L(φm)−L∗ may be very small. In these cases it is impossible to achieve asymp-totic optimality. We can fix this problem by introducing the notion of εm-optimality, where εm is a positive sequence decreasing to 0 with m. A ruleis said to be εm-optimal when

limn→∞

EL(gn) − L∗

E infφm∈CmL(φm) − L∗

= 1.

for all distributions of (X,Y ) for which

limm→∞

E infφm∈CmL(φm) − L∗

εm= ∞.

In what follows we apply the idea of data splitting to scaled nearestneighbor rules and to rules closely related to the data-dependent partition-ing classifiers studied in Chapter 21. Many more examples are presented inChapters 25 and 26.

22.3 Nearest Neighbor Rules with AutomaticScaling

Let us work out the simple example introduced above. The (a1, . . . , ad)-nnrule is the nearest neighbor rule for the metric

‖x‖ =

√√√√ d∑i=1

a2ix

(i)2,

where x =(x(1), . . . , x(d)

). The class Cm is the class of all (a1, . . . , ad)-nn

rules for Dm = ((X1, Y1), . . . , (Xm, Ym)). The testing sequence Tl is used tochoose (a1, . . . , ad) so as to minimize the holdout error estimate. In orderto have

L(gn)− infφm∈Cm

L(φm) → 0 in probability,

it suffices that

limn→∞

log E S(Cm, l)l

= 0.

Page 414: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.3 Nearest Neighbor Rules with Automatic Scaling 413

This puts a lower bound on l. To get this lower bound, one must computeS(Cm, l). Clearly, S(Cm, l) is bounded by the number of ways of classifyingXm+1, . . . , Xm+l using rules picked from Cm, that is, the total number ofdifferent values for

(φm(Xm+1), . . . , φm(Xm+l)) , φm ∈ Cm .

We now show that regardless of Dm and Xm+1, . . . , Xm+l, for n ≥ 4 wehave

S(Cm, l) ≤(l

(m

2

))d.

This sort of result is typical—the bound does not depend upon Dm or Tl.When plugged back into the condition of convergence, it yields the simplecondition

logml

→ 0.

In fact, we may thus take l slightly larger than logm. It would plainly besilly to take l = n/2, as we would thereby in fact throw away most of thedata.

SetAm = (φm(Xm+1), . . . , φm(Xm+l)) , φm ∈ Cm .

To count the number of values in Am, note that all squared distances canbe written as

‖Xi −Xj‖2 =d∑k=1

a2kρk,i,j ,

where ρk,i,j is a nonnegative number only depending upon Xi and Xj . Notethat each squared distance is linear in (a2

1, . . . , a2d). Now consider the space

of (a21, . . . , a

2d). Observe that, in this space, φm(Xm+1) is constant within

each cell of the partition determined by the(m2

)hyperplanes

d∑k=1

a2kρk,i,m+1 =

d∑k=1

a2kρk,i′,m+1, 1 ≤ i < i′ ≤ m.

To see this, note that within each set in the partition, φm(Xm+1) keeps thesame nearest neighbor among X1, . . . , Xm. It is known that k > 2 hyper-planes in Rd create partitions of cardinality not exceeding kd (see Problem22.1). Now, overlay the l partitions obtained for φm(Xm+1), . . . , φm(Xm+l)respectively. This yields at most(

l

(m

2

))d

Page 415: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.4 Classification Based on Clustering 414

sets, as the overlays are determined by l(m2

)hyperplanes. But clearly, on

each of these sets, (φm(Xm+1), . . . , φm(Xm+l)) is constant. Therefore,

S(Cm, l) ≤ |Am| ≤(l

(m

2

))d.

22.4 Classification Based on Clustering

Recall the classification rule based on clustering that was introduced inChapter 21. The data points X1, . . . , Xn are grouped into k clusters, wherek is a predetermined integer, and a majority vote decides within the kclusters. If k is chosen such that k → ∞ and k2 log n/n → 0, then therule is consistent. For a given finite n, however, these conditions give littleguidance. Also, the choice of k could dramatically affect the performanceof the rule, as there may be a mismatch between k and some unknownnatural number of clusters. For example, one may construct distributionsin which the optimal number of clusters does not increase with n.

Let us split the data and let the testing sequence decide the value of k.In the framework of this chapter, Cm contains the classifiers based on thefirst m pairs Dm of the data with all possible values of k. Clearly, Cm is afinite family with |Cm| = m. In this case, by Problem 12.1, we have

EL(gn)− inf

φm∈Cm

L(φm)∣∣∣∣Dm

≤√

2log(2m)l

.

The consistency result Theorem 21.7 implies that

E

infφm∈Cm

L(φm)→ L∗

for all distributions, if the Xi’s have bounded support. Thus, we see thatour strategy leads to a universally consistent rule whenever (logm)/l→ 0.This is a very mild condition, since we can take l equal to a small fractionof n, without sacrificing consistency. If l is small compared to n, then mis close to n, so infφm∈Cm L(φm) is likely to be close to infφn∈Cn L(φn).Thus, we do not lose much by sacrificing a part of the data for testingpurposes, but the gain can be tremendous, as we are guaranteed to bewithin

√(logm)/l of the optimum in the class Cm.

That we cannot take l = 1 and hope to obtain consistency should beobvious. It should also be noted that for l = m, we are roughly within√

log(m)/m of the best possible probability of error within the family.Also the empirical selection rule is

√log(m)/l-optimal.

Page 416: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

22.5 Statistically Equivalent Blocks 415

22.5 Statistically Equivalent Blocks

Recall from Chapter 21 that classifiers based on statistically equivalentblocks typically partition the feature space Rd into rectangles such thateach rectangle contains k points, where k is a certain positive integer, theparameter of the rule. This may be done in several different ways. One ofmany such rules—the rule introduced by Gessaman (1970)— is consistentif k → ∞ and k/n → 0 (Theorem 21.4). Again, we can let the data pickthe value of k by minimizing the holdout estimate. Just as in the previoussection, |Cm| = m, and every remark mentioned there about consistencyand asymptotic optimality remains true for this case as well.

We can enlarge the family Cm by allowing partitions without restrictionson cardinalities of cells. This leads very quickly to oversized families ofrules, and we have to impose reasonable restrictions. Consider cuts into atmost k rectangles, where k is a number picked beforehand. Recall that fora fixed partition, the class assigned to every rectangle is decided upon bya majority vote among the training points. On the real line, choosing apartition into at most k sets is equivalent to choosing k − 1 cut positionsfrom m+ l+1 = n+1 spacings between all test and training points. Hence,

S(Cm, l) ≤(n+ 1k − 1

)≤ (n+ 1)k−1.

For consistency, k has to grow as n grows. It is easily seen that

E

infφm∈Cm

L(φm)→ L∗

if k →∞ and m→∞. To achieve consistency of the selected rule, however,we also need

logS(Cm, l)l

≤ (k − 1) log(n+ 1)l

→ 0.

Now consider d-dimensional partitions defined by at most k − 1 consec-utive orthogonal cuts, that is, the first cut divides Rd into two halfspacesalong a hyperplane perpendicular to one of the coordinate axes. The secondcut splits one of the halfspaces into two parts along another orthogonal hy-perplane, and so forth. This procedure yields a partition of the space intok rectangles. We see that for the first cut, there are at most 1+dn possiblecombinations to choose from. This yields the loose upper bound

S(Cm, l) ≤ (1 + dn)k−1.

This bound is also valid for all grids defined by at most k − 1 cuts. Themain difference here is that every cut defines two halfspaces of Rd, and nottwo hyperrectangles of a cells, so that we usually end up with 2k rectanglesin the partition.

Page 417: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 416

Assume that Cm contains all histograms with partitions into at mostk (possibly infinite) rectangles. Then, considering that a rectangle in Rd

requires choosing 2d spacings between all test and training points, two percoordinate axis,

S(Cm, l) ≤ (n+ 1)2d(k−1).

See Feinholz (1979) for more work on such partitions.

22.6 Binary Tree Classifiers

We can analyze binary tree classifiers from the same viewpoint. Recall thatsuch classifiers are represented by binary trees, where each internal nodecorresponds to a split of a cell by a hyperplane, and the terminal nodesrepresent the cells of the partition.

Assume that there are k cells (and therefore k − 1 splits leading to thepartition). If every split is perpendicular to one of the axes, then, the sit-uation is the same as in the previous section,

S(Cm, l) ≤ (1 + dn)k−1.

For smaller families of rules whose cuts depend upon the training sequenceonly, the bound is pessimistic. Others have proposed generalizing orthogo-nal cuts by using general hyperplane cuts. Recall that there are at most

d+1∑j=0

(n

j

)≤ nd+1 + 1

ways of dichotomizing n points in Rd by hyperplanes (see Theorem 13.9).Thus, if we allow up to k − 1 internal nodes (or hyperplane cuts),

S(Cm, l) ≤(1 + nd+1

)k−1.

The number of internal nodes has to be restricted in order to obtain con-sistency from this bound. Refer to Chapter 20 for more details.

Problems and Exercises

Problem 22.1. Prove that n hyperplanes partition Rd into at most∑d

i=0

(ni

)contiguous regions when d ≤ n (Schlaffli (1950), Cover (1965)). Hint: Proceedby induction.

Problem 22.2. Assume that gn is selected so as to minimize the holdout errorestimate Ll,m(φm) over Cm, the class of rules based upon the first m data points.

Page 418: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 417

Assume furthermore that we vary l over [log2 n, n/2], and that we pick the bestl (and m = n− l) by minimizing the holdout error estimate again. Show that ifS(Cm, l) = O(nγ) for some γ > 0, then the obtained rule is strongly universallyconsistent.

Problem 22.3. Let Cm be the class of (a1, . . . , ad)-nn rules based upon Dm.Show that if m/n→ 1, then

infφm∈Cm

L(φm)− infφn∈Cn

L(φn) → 0 in probability

for all distributions of (X,Y ) for whichX has a density. Conclude that if l/ logn→∞, m = n− l, l = o(n), then

L(gn)− infφn∈Cn

L(φn) → 0 in probability.

Problem 22.4. Finding the best split. This exercise is concerned with theautomatic selection of m and l = n−m. If gn is the selected rule minimizing theholdout estimate, then

EL(gn)|Dm−L∗ ≤

√2 log (4S(Cm, l2)) + 16

l+

(inf

φm∈Cm

L(φm)− L∗). (22.3)

Since in most interesting cases S(Cm, l) is bounded from above as a polyno-mial of m and l, the estimation error typically decreases as l increases. On theother hand, the approximation error infφm∈Cm L(φm) − L∗ typically decreasesas m increases, as the class Cm gets richer. Some kind of balance between thetwo terms is required to get optimum performance. We may use the empiricalestimates Lm,l(φm) again to decide which value of m we wish to choose (Prob-lem 22.2). However, as m gets large—and therefore l small—the class Cm willtend to overfit the data Tl, providing strongly optimistically biased estimates forinfφm∈Cm L(φm). To prevent overfitting, we may apply the method of complexityregularization (Chapter 18). By (22.1), we may define the penalty term by

r(m, l) =

√2 log (4e8ES(Cm, l2)) + logn

l,

and minimize the penalized error estimate Lm,l(φ) = Lm,l(φ) + r(m, l) over allφ ∈ ∪nm=1Cm. Denote the selected rule by φ∗n. Prove that for every n and alldistributions of (X,Y ),

EL(φ∗n)− L∗ ≤ minm,l

(3

√2 log (4ES(Cm, l2)) + logn+ 16

l

+

(E

inf

φm∈Cm

L(φm)− L∗))

+2√n.

Hint: Proceed similarly to the proof of Theorem 18.2.

Page 419: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 418Printer: Opaque this

23The Resubstitution Estimate

Estimating the error probability is of primordial importance for classifierselection. The method explored in the previous chapter attempts to solvethis problem by using a testing sequence to obtain a reliable holdout esti-mate. The independence of testing and training sequences leads to a ratherstraightforward analysis. For a good performance, the testing sequence hasto be sufficiently large (although we often get away with testing sequencesas small as about log n). When data are expensive, this constitutes a waste.Assume that we do not split the data and use the same sequence for testingand training. Often dangerous, this strategy nevertheless works if the classof rules from which we select is sufficiently restricted. The error estimate inthis case is appropriately called the resubstitution estimate and it will bedenoted by L(R)

n . This chapter explores its virtues and pitfalls. A third errorestimate, the deleted estimate, is discussed in the next chapter. Estimatesbased upon other paradigms are treated briefly in Chapter 31.

23.1 The Resubstitution Estimate

The resubstitution estimate L(R)n counts the number of errors committed

on the training sequence by the classification rule. Expressed formally,

L(R)n =

1n

n∑i=1

Ign(Xi) 6=Yi.

Page 420: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.1 The Resubstitution Estimate 419

Sometimes L(R)n is called the apparent error rate. It is usually strongly

optimistically biased. Since the classifier gn is tuned by Dn, it is intuitivelyclear that gn may behave better on Dn than on independent data.

The best way to demonstrate this biasedness is to consider the 1-nearestneighbor rule. If X has a density, then the nearest neighbor of Xi amongX1, . . . , Xn is Xi itself with probability one. Therefore, L(R)

n ≡ 0, regardlessof the value of Ln = L(gn). In this case the resubstitution estimate isuseless. For k-nearest neighbor rules with large k, L(R)

n is close to Ln. Thiswas demonstrated by Devroye and Wagner (1979a), who obtained upperbounds on the performance of the resubstitution estimate for the k-nearestneighbor rule without posing any assumption on the distribution.

Also, if the classifier whose error probability is to be estimated is a mem-ber of a class of classifiers with finite Vapnik-Chervonenkis dimension (seethe definitions in Chapter 12), then we can get good performance boundsfor the resubstitution estimate. To see this, consider any generalized linearclassification rule, that is, any rule that can be put into the following form:

gn(x) =

0 if a0,n +∑d∗

i=1 ai,nψi(x) ≤ 01 otherwise,

where the ψi’s are fixed functions, and the coefficients ai,n depend on thedataDn in an arbitrary but measurable way. We have the following estimatefor the performance of the resubstitution estimate L(R)

n .

Theorem 23.1. (Devroye and Wagner (1976a)). For all n and ε > 0,the resubstitution estimate L(R)

n of the error probability Ln of a generalizedlinear rule satisfies

P|L(R)n − Ln| ≥ ε

≤ 8nd

∗e−nε

2/32.

Proof. Define the set An ⊂ Rd × 0, 1 as the set of all pairs (x, y) ∈Rd × 0, 1, on which gn errs:

An = (x, y) : gn(x) 6= y .

Observe thatLn = P (X,Y ) ∈ An|Dn

and

L(R)n =

1n

n∑j=1

I(Xj ,Yj)∈An,

or denoting the measure of (X,Y ) by ν, and the corresponding empiricalmeasure by νn,

Ln = ν(An),

Page 421: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.1 The Resubstitution Estimate 420

andL(R)n = νn(An).

The set An depends on the data Dn, so that, for example, Eνn(An) 6=Eν(An). Fortunately, the powerful Vapnik-Chervonenkis theory comesto the rescue via the inequality

|Ln − L(R)n | ≤ sup

C∈C|νn(C)− ν(C)|,

where C is the family of all sets of the form (x, y) : φ(x) 6= y, whereφ : Rd → 0, 1 is a generalized linear classifier based on the functionsψ1, . . . , ψd∗ . By Theorems 12.6, 13.1, and 13.7 we have

P

supC∈C

|νn(C)− ν(C)| ≥ ε

≤ 8nd

∗e−nε

2/32. 2

Similar inequalities can be obtained for other classifiers. For example, forpartitioning rules with fixed partitions we have the following:

Theorem 23.2. Let gn be a classifier whose value is constant over cells ofa fixed partition of Rd into k cells. Then

P|L(R)n − Ln| ≥ ε

≤ 8 · 2ke−nε

2/32.

The proof is left as an exercise (Problem 23.1). From Theorems 23.1 and23.2 we get bounds for the expected difference between the resubstitutionestimate and the actual error probability Ln. For example, Theorem 23.1implies (via Problem 12.1) that

E|L(R)n − Ln|

= O

(√d∗ log n

n

).

In some special cases the expected behavior of the resubstitution estimatecan be analyzed in more detail. For example, McLachlan (1976) provedthat if the conditional distributions of X given Y = 0 and Y = 1 are bothnormal with the same covariance matrices, and the rule is linear and basedon the estimated parameters, then the bias of the estimate is of the order1/n: ∣∣∣EL(R)

n

−ELn

∣∣∣ = O

(1n

).

McLachlan also showed for this case that for large n the expected value ofthe resubstitution estimate is smaller than that of Ln, that is, the estimateis optimistically biased, as expected.

Page 422: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.2 Histogram Rules 421

23.2 Histogram Rules

In this section, we explore the properties of the resubstitution estimate forhistogram rules. Let P = A1, A2, . . . be a fixed partition of Rd, and let gnbe the corresponding histogram classifier (see Chapters 6 and 9). Introducethe notation

ν0,n(A) =1n

n∑j=1

IYj=0IXj∈A, and ν1,n(A) =1n

n∑j=1

IYj=1IXj∈A.

The analysis is simplified if we rewrite the estimate L(R)n in the following

form (see Problem 23.2):

L(R)n =

∑i

minν0,n(Ai), ν1,n(Ai). (23.1)

It is also interesting to observe that L(R)n can be put in the following form:

L(R)n =

∫ (Iη1,n(x)≤η0,n(x)η1,n(x) + Iη1,n(x)>η0,n(x)η0,n(x)

)µ(dx),

where

η0,n(x) =1n

∑nj=1 IYj=0,Xj∈A(x)

µ(A(x))and η1,n(x) =

1n

∑nj=1 IYj=1,Xj∈A(x)

µ(A(x)).

We can compare this form with the following expression of Ln:

Ln =∫ (

Iη1,n(x)≤η0,n(x)η(x) + Iη1,n(x)>η0,n(x)(1− η(x)))µ(dx)

(see Problem 23.3). We begin the analysis of performance of the estimateby showing that its mean squared error is not larger than a constant timesthe number of cells in the partition over n.

Theorem 23.3. For any distribution of (X,Y ) and for all n, the estimateL

(R)n of the error probability of any histogram rule satisfies

VarL(R)n ≤ 1

n.

Also, the estimate is optimistically biased, that is,

EL(R)n ≤ ELn.

If, in addition, the histogram rule is based on a partition P of Rd into atmost k cells, then

E(

L(R)n − Ln

)2≤ 6k

n.

Page 423: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.2 Histogram Rules 422

Proof. The first inequality is an immediate consequence of Theorem 9.3and (23.1). Introduce the auxiliary quantity

R∗ =∑i

minν0(Ai), ν1(Ai),

where ν0(A) = PY = 0, X ∈ A, and ν1(A) = PY = 1, X ∈ A. We usethe decomposition

E(

L(R)n − Ln

)2≤ 2

(E(

L(R)n −R∗

)2

+ E

(R∗ − Ln)2)

.

(23.2)First we bound the second term on the right-hand side of (23.2). ObservethatR∗ is just the Bayes error corresponding to the pair of random variables(T (X), Y ), where the function T transforms X according to T (x) = i ifx ∈ Ai. Since the histogram classification rule gn(x) can be written asa function of T (x), its error probability Ln cannot be smaller than R∗.Furthermore, by the first identity of Theorem 2.2, we have

0 ≤ Ln −R∗

=∑i

Isign(ν1,n(Ai)−ν0,n(Ai)) 6=sign(ν1(Ai)−ν0(Ai))|ν1(Ai)− ν0(Ai)|

≤∑i

|ν1(Ai)− ν0(Ai)− (ν1,n(Ai)− ν0,n(Ai)) |.

If the partition has at most k cells, then by the Cauchy-Schwarz inequality,

E(Ln −R∗)2

≤ E

(∑

i

|ν1(Ai)− ν0(Ai)− (ν1,n(Ai)− ν0,n(Ai)) |

)2

≤ k∑i

Var ν1,n(Ai)− ν0,n(Ai)

≤ k∑i

µ(Ai)n

=k

n.

We bound the first term on the right-hand side of (23.2):

E(

L(R)n −R∗

)2≤ VarL(R)

n +(R∗ −E

L(R)n

)2

.

Page 424: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.2 Histogram Rules 423

As we have seen earlier, VarL(R)n ≤ 1/n, so it suffices to bound |R∗ −

EL(R)n |. By Jensen’s inequality,

EL(R)n =

∑i

E minν0,n(Ai), ν1,n(Ai)

≤∑i

min(Eν0,n(Ai),Eν1,n(Ai)

)=

∑i

min(ν0(Ai), ν1(Ai))

= R∗.

So we bound R∗ −EL(R)n from above. By the inequality

|min(a, b)−min(c, d)| ≤ |a− c|+ |b− d|,

we have

R∗ −EL(R)n ≤

∑i

E |ν0(Ai)− ν0,n(Ai)|+ |ν1(Ai)− ν1,n(Ai)|

≤∑i

(√Varν0,n(Ai)+

√Varν1,n(Ai)

)(by the Cauchy-Schwarz inequality)

=∑i

(√ν0(Ai)(1− ν0(Ai))

n+

√ν1(Ai)(1− ν1(Ai))

n

)

≤∑i

(√2(ν0(Ai)(1− ν0(Ai)) + ν1(Ai)(1− ν1(Ai)))

n

)

≤∑i

√2µ(Ai)n

,

where we used the elementary inequality√a+

√b ≤

√2(a+ b). Therefore,

(R∗ −EL(R)

n )2

(∑i

√2µ(Ai)n

)2

.

Page 425: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.3 Data-Based Histograms and Rule Selection 424

To complete the proof of the third inequality, observe that if there are atmost k cells, then

R∗ −EL(R)n ≤

∑i

√2µ(Ai)n

≤ k

√1k

∑i

2µ(Ai)n

(by Jensen’s inequality)

=

√2kn.

Therefore, (R∗ −EL(R)n )2 ≤ 2k/n. Finally,

ELn −EL(R)n = (ELn −R∗) +

(R∗ −E

L(R)n

)≥ 0. 2

We see that if the partition contains a small number of cells, then theresubstitution estimate performs very nicely. However, if the partition hasa large number of cells, then the resubstitution estimate of Ln can be verymisleading, as the next result indicates:

Theorem 23.4. (Devroye and Wagner (1976b)). For every n thereexists a partitioning rule and a distribution such that

E(

L(R)n − Ln

)2≥ 1/4.

Proof. Let A1, . . . , A2n be 2n cells of the partition such that µ(Ai) =1/(2n), i = 1, . . . , 2n. Assume further that η(x) = 1 for every x ∈ Rd,that is, Y = 1 with probability one. Then clearly L(R)

n = 0. On the otherhand, if a cell does not contain any of the data points X1, . . . , Xn, thengn(x) = 0 in that cell. But since the number of points is only half of thenumber of cells, at least half of the cells are empty. Therefore Ln ≥ 1/2,and |L(R)

n − Ln| ≥ 1/2. This concludes the proof. 2

23.3 Data-Based Histograms and Rule Selection

Theorem 23.3 demonstrates the usefulness of L(R)n for histogram rules with

a fixed partition, provided that the number of cells in the partition is nottoo large. If we want to use L(R)

n to select a good classifier, the estimate

Page 426: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.3 Data-Based Histograms and Rule Selection 425

should work uniformly well over the class from which we select a rule. Inthis section we explore such data-based histogram rules.

Let F be a class of partitions of Rd. We will assume that each memberof F partitions Rd into at most k cells. For each partition P ∈ F , definethe corresponding histogram rule by

g(P)n (x) =

0 if

∑ni=1 IYi=1,Xi∈A(x) ≤

∑ni=1 IYi=0,Xi∈A(x)

1 otherwise,

where A(x) is the cell of P that contains x. Denote the error probability ofg(P)n (x) by

Ln(P) = Pg(P)n (X) 6= Y |Dn

.

The corresponding error estimate is denoted by

L(R)n (P) =

∑A∈P

minν0,n(A), ν1,n(A).

By analogy with Theorems 23.1 and 23.2, we can derive the following result,which gives a useful bound for the largest difference between the estimateand the error probability within the class of histogram classifiers definedby F . The combinatorial coefficient ∆∗

n(F) defined in Chapter 21 appearsas a coefficient in the upper bound. The computation of ∆∗

n(F) for severaldifferent classes of partitions is illustrated in Chapter 21.

Theorem 23.5. (Feinholz (1979)). Assume that each member of Fpartitions Rd into at most k cells. For every n and ε > 0,

P

supP∈F

|L(R)n (P)− Ln(P)| > ε

≤ 8 · 2k∆∗

n(F)e−nε2/32.

Proof. We can proceed as in the proof of Theorem 23.1. The shattercoefficient S(C, n) corresponding to the class of histogram classifiers definedby partitions in F can clearly be bounded from above by the number ofdifferent ways in which n points can be partitioned by members of F ,times 2k, as there are at most 2k different ways to assign labels to cells ofa partition of at most k cells. 2

The theorem has two interesting implications. The error estimate L(R)n

can also be used to estimate the performance of histogram rules based ondata-dependent partitions (see Chapter 21). The argument of the proof ofTheorem 23.3 is not valid for these rules. However, Theorem 23.5 providesperformance guarantees for these rules in the following corollaries:

Corollary 23.1. Let gn(x) be a histogram classifier based on a randompartition Pn into at most k cells, which is determined by the data Dn. As-sume that for any possible realization ot the training data Dn, the partition

Page 427: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

23.3 Data-Based Histograms and Rule Selection 426

Pn is a member of a class of partitions F . If Ln is the error probability ofgn, then

P|L(R)n − Ln| > ε

≤ 8 · 2k∆∗

n(F)e−nε2/32,

and

E(

L(R)n − Ln

)2≤ 32k + 32 log(∆∗

n(F)) + 128n

.

Proof. The first inequality follows from Theorem 23.5 by the obviousinequality

P|L(R)n − Ln| > ε

≤ P

supP∈F

|L(R)n (P)− Ln(P)| > ε

.

The second inequality follows from the first one via Problem 12.1. 2

Perhaps the most important application of Theorem 23.5 is in classifierselection. Let Cn be a class of (possibly data-dependent) histogram rules.We may use the error estimate L(R)

n to select a classifier that minimizesthe estimated error probability. Denote the selected histogram rule by φ∗n,that is,

L(R)n (φ∗n) ≤ L(R)

n (φn) for all φn ∈ Cn.

Here L(R)n (φn) denotes the estimated error probability of the classification

rule φn. The question is how well the selection method works, in otherwords, how close the error probability of the selected classifier L(φ∗n) is tothe error probability of the best rule in the class, infφn∈Cn L(φn). It turnsout that if the possible partitions are not too complex, then the methodworks very well:

Corollary 23.2. Assume that for any realization of the data Dn, thepossible partitions that define the histogram classifiers in Cn belong to aclass of partitions F , whose members partition Rd into at most k cells.Then

PL(φ∗n)− inf

φn∈Cn

L(φn) > ε

≤ 8 · 2k∆∗

n(F)e−nε2/128.

Proof. By Lemma 8.2,

L(φ∗n)− infφn∈Cn

L(φn) ≤ 2 supφn∈Cn

|L(R)n (φn)−L(φn)| ≤ 2 sup

P∈F|L(R)n (P)−L(P)|.

Therefore, the statement follows from Theorem 23.5. 2

Page 428: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 427

Problems and Exercises

Problem 23.1. Prove Theorem 23.2. Hint: Proceed as in the proof of Theorem23.1.

Problem 23.2. Show that for histogram rules, the resubstitution estimate maybe written as

L(R)n =

∑i

minν0,n(Ai), ν1,n(Ai).

Problem 23.3. Consider the resubstitution estimate L(R)n of the error probabil-

ity Ln of a histogram rule based on a fixed sequence of partitions Pn. Show thatif the regression function estimate

ηn(x) =

1n

∑n

j=1IYj=1,Xj∈A(x)

µ(A(x))

is consistent, that is, it satisfies E∫

|η(x)− ηn(x)|µ(dx)→ 0 as n→∞, then

limn→∞

E∣∣L(R)

n − Ln∣∣ = 0,

and also EL

(R)n

→ L∗.

Problem 23.4. Let (X ′1, Y

′1 ), . . . , (X ′

m, Y′m) be a sequence that depends in an

arbitrary fashion on the data Dn, and let gn be the nearest neighbor rule with(X ′

1, Y′1 ), . . . , (X ′

m, Y′m), where m is fixed. Let L

(R)n denote the resubstitution es-

timate of Ln = L(gn). Show that for all ε > 0 and all distributions,

P|Ln − L(R)

n | ≥ ε≤ 8ndm(m−1)e−nε

2/32.

Problem 23.5. Find a rule for (X,Y ) ∈ R× 0, 1 such that for all nonatomic

distributions with L∗ = 0 we have ELn → 0, yet EL

(R)n

≥ ELn. (Thus,

L(R)n may be pessimistically biased even for a consistent rule.)

Problem 23.6. For histogram rules on fixed partitions (partitions that do not

change with n and are independent of the data), show that EL

(R)n

is mono-

tonically nonincreasing.

Problem 23.7. Assume that X has a density, and investigate the resubstitution

estimate of the 3-nn rule. What is the limit of EL

(R)n

as n→∞?

Page 429: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 428Printer: Opaque this

24Deleted Estimates ofthe Error Probability

The deleted estimate (also called cross-validation, leave-one-out, or U-meth-od) attempts to avoid the bias present in the resubstitution estimate. Pro-posed and developed by Lunts and Brailovsky (1967), Lachenbruch (1967),Cover (1969), and Stone (1974), the method deletes the first pair (X1, Y1)from the training data and makes a decision gn−1 using the remaining n−1pairs. It tests for an an error on (X1, Y1), and repeats this procedure forall n pairs of the training data Dn. The estimate L(D)

n is the average thenumber of errors.

We formally denote the training set with (Xi, Yi) deleted by

Dn,i = ((X1, Y1), . . . , (Xi−1, Yi−1), (Xi+1, Yi+1), . . . , (Xn, Yn)).

Then we define

L(D)n =

1n

n∑i=1

Ign−1(Xi,Dn,i) 6=Yi.

Clearly, the deleted estimate is almost unbiased in the sense that

EL(D)n = ELn−1.

Thus, L(D)n should be viewed as an estimator of Ln−1, rather than of Ln.

In most of the interesting cases Ln converges with probability one so thatthe difference between Ln−1 and Ln becomes negligible for large n.

The designer has the luxury of being able to pick the most convenientgn−1. In some cases the choice is very natural, in other cases it is not.For example, if gn is the 1-nearest neighbor rule, then letting gn−1 be the

Page 430: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.1 A General Lower Bound 429

1-nearest neighbor rule based on n − 1 data pairs seems to be an obviouschoice. We will see later that this indeed yields an extremely good estima-tor. But what should gn−1 be if gn is, for example, a k-nn rule? Shouldit be the k-nearest neighbor classifier, the k − 1-nearest neighbor classi-fier, or maybe something else? Well, the choice is typically nontrivial andneeds careful attention if the designer wants a distribution-free performanceguarantee for the resulting estimate. Because of the variety of choices forgn−1, we should not speak of the deleted estimate, but rather of a deletedestimate.

In this chapter we analyze the performance of deleted estimates for a fewprototype classifiers, such as the kernel, nearest neighbor, and histogramrules. In most cases studied here, deleted estimates have good distribution-free properties.

24.1 A General Lower Bound

We begin by exploring general limitations of error estimates for some im-portant nonparametric rules. An error estimate Ln is merely a function(Rd × 0, 1

)n → [0, 1], which is applied to the data Dn = ((X1, Y1), . . .,(Xn, Yn)).

Theorem 24.1. Let gn be one of the following classifiers:

(a) The kernel rule

gn(x) =

0 if∑ni=1 IYi=0K

(x−Xi

h

)≥∑ni=1 IYi=1K

(x−Xi

h

)1 otherwise

with a nonnegative kernel K of compact support and smoothingfactor h > 0.

(b) The histogram rule

gn(x) =

0 if∑ni=1 IYi=1IXi∈A(x) ≤

∑ni=1 IYi=0IXi∈A(x)

1 otherwise

based on a fixed partition P = A1, A2, . . . containing at least ncells.

(c) The lazy histogram rule

gn(x) = Yj , x ∈ Ai,

where Xj is the minimum-index point among X1, . . . , Xn for whichXj ∈ Ai, where P = A1, A2, . . . is a fixed partition containing atleast n cells.

Page 431: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.1 A General Lower Bound 430

Denote the probability of error for gn by Ln = Pgn(X,Dn) 6= Y |Dn.Then for every n ≥ 10, and for every error estimator Ln, there exists adistribution of (X,Y ) with L∗ = 0 such that

E∣∣∣Ln(Dn)− Ln

∣∣∣ ≥ 1√32n

.

The theorem says that for any estimate Ln, there exists a distributionwith the property that E

∣∣∣Ln(Dn)− Ln

∣∣∣ ≥ 1/√

32n. For the rules gngiven in the theorem, no error estimate can possibly give better distribution-free performance guarantees. Error estimates are necessarily going to per-form at least this poorly for some distributions.

Proof of Theorem 24.1. Let Ln be an arbitrary fixed error estimate.The proof relies on randomization ideas similar to those of the proofs ofTheorems 7.1 and 7.2. We construct a family of distributions for (X,Y ).For b ∈ [0, 1) let b = 0.b1b2b3 . . . be its binary expansion. In all cases thedistribution of X is uniform on n points x1, . . . , xn. For the histogramand lazy histogram rules choose x1, . . . , xn such that they fall into differentcells. For the kernel rule choose x1, . . . , xn so that they are isolated fromeach other, that is, K(xi−xj

h ) = 0 for all i, j ∈ 1, . . . , n, i 6= j. To simplifythe notation we will refer to these points by their indices, that is, we willwrite X = i instead of X = xi. For a fixed b define

Y = bX .

We may create infinitely many samples (one for each b ∈ [0, 1)) drawn fromthe distribution of (X,Y ) as follows. Let X1, . . . , Xn be i.i.d. and uniformlydistributed on 1, . . . , n. All the samples share the same X1, . . . , Xn, butdiffer in their Yi’s. For given Xi, define Yi = bXi

. Write Zn = (X1, . . . , Xn)and Ni =

∑nj=1 IXj=i. Observe that Dn is a function of Zn and b. It is

clear that for all classifiers covered by our assumptions, for a fixed b,

Ln =1n

∑i∈S

Ibi=1,

where S = i : 1 ≤ i ≤ n,Ni = 0 is the set of empty bins. We randomizethe distribution as follows. Let B = 0.B1B2 . . . be a uniform [0, 1] randomvariable independent of (X,Y ). Then clearly B1, B2, . . . are independentuniform binary random variables. Note that

supb

E∣∣∣Ln(Dn)− Ln

∣∣∣ ≥ E∣∣∣Ln(Dn)− Ln

∣∣∣(where b is replaced by B)

= EE∣∣∣Ln(Dn)− Ln

∣∣∣∣∣∣Zn .

Page 432: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.1 A General Lower Bound 431

In what follows we bound the conditional expectation within the brackets.To make the dependence upon B explicit we write Ln(Zn, B) = Ln(Dn)and Ln(B) = Ln. Thus,

E∣∣∣Ln(Zn, B)− Ln(B)

∣∣∣∣∣∣Zn=

12E∣∣∣Ln(Zn, B)− Ln(B)

∣∣∣+ ∣∣∣Ln(Zn, B∗)− Ln(B∗)∣∣∣∣∣∣Zn

(where B∗i = Bi for i with Ni > 0 and B∗i = 1−Bi for i with Ni = 0)

≥ 12E |Ln(B)− Ln(B∗)||Zn

(since Ln(Zn, B∗) = Ln(Zn, B))

=12E

1n

∣∣∣∣∣∑i∈S

(IBi=1 − IBi=0

)∣∣∣∣∣∣∣∣∣∣Zn

(recall the expression for Ln given above)

=12n

E |2B(|S|, 1/2)− 1|| |S|

(where B(|S|, 1/2) is a binomial (|S|, 1/2) random variable)

≥ 12n

√|S|2

(by Khintchine’s inequality, see Lemma A.5).

In summary, we have

supb

E∣∣∣Ln(Dn)− Ln

∣∣∣ ≥ 12n

E

√|S|2

.

We have only to bound the right-hand side. We apply Lemma A.4 to therandom variable

√|S| =

√∑ni=1 INi=0. Clearly, E|S| = n(1−1/n)n, and

E|S|2

= E

n∑i=1

INi=0 +∑i 6=j

INi=0,Nj=0

= E|S|+ n(n− 1)

(1− 2

n

)n,

so that

E√

|S|

≥(n(1− 1

n

)n)3/2√n(1− 1

n

)n + n(n− 1)(1− 2

n

)n≥

√n

(1− 1

n

)n√1n + (1−2/n)n

(1−1/n)n

Page 433: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.2 A General Upper Bound for Deleted Estimates 432

≥√n

(1− 1

n

)n√1− 1

n −1n2 − . . .

≥√n

(1− 1

n

)n√e (use

(1− 1

n

)n ≤ 1e )

≥ 12√n (for n ≥ 10).

The proof is now complete. 2

24.2 A General Upper Bound for DeletedEstimates

The following inequality is a general tool for obtaining distribution-freeupper bounds for the difference between the deleted estimate and the trueerror probability Ln:

Theorem 24.2. (Rogers and Wagner (1978); Devroye and Wag-ner (1976b)). Assume that gn is a symmetric classifier, that is, gn(x,Dn) =gn(x,D′

n), where D′n is obtained by permuting the pairs of Dn arbitrarily.

Then

E(

L(D)n − Ln

)2≤ 1n

+ 6P gn(X,Dn) 6= gn−1(X,Dn−1) .

Proof. First we express the three terms on the right-hand side of

E(

L(D)n − Ln

)2

= EL(D)n

2− 2E

L(D)n Ln

+ E

L2n

.

The first term can be bounded, by using symmetry of gn, by

EL(D)n

2

= E

(

1n

n∑i=1

Ign−1(Xi,Dn,i) 6=Yi

)2

= E

1n2

n∑i=1

Ign−1(Xi,Dn,i) 6=Yi

+E

1n2

∑i 6=j

Ign−1(Xi,Dn,i) 6=YiIgn−1(Xj ,Dn,j) 6=Yj

Page 434: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.2 A General Upper Bound for Deleted Estimates 433

=ELn−1

n+n− 1n

Pgn−1(X1, Dn,1) 6= Y1, gn−1(X2, Dn,2) 6= Y2

≤ 1n

+ Pgn−1(X1, Dn,1) 6= Y1, gn−1(X2, Dn,2) 6= Y2.

The second term is written as

EL(D)n Ln

= E

Ln

1n

n∑i=1

Ign−1(Xi,Dn,i) 6=Yi

=1n

n∑i=1

ELnIgn−1(Xi,Dn,i) 6=Yi

=

1n

n∑i=1

E Pgn(X,Dn) 6= Y |DnPgn−1(Xi, Dn,i) 6= Yi|Dn

=1n

n∑i=1

E Pgn(X,Dn) 6= Y, gn−1(Xi, Dn,i) 6= Yi|Dn

= Pgn(X,Dn) 6= Y, gn−1(X1, Dn,1) 6= Y1.

For the third term, we introduce the pair (X ′, Y ′), independent of X,Y ,and Dn, having the same distribution as (X,Y ). Then

EL2n

= E

Pgn(X,Dn) 6= Y |Dn2

= E Pgn(X,Dn) 6= Y |DnPgn(X ′, Dn) 6= Y ′|Dn= E Pgn(X,Dn) 6= Y, gn(X ′, Dn) 6= Y ′|Dn= Pgn(X,Dn) 6= Y, gn(X ′, Dn) 6= Y ′,

where we used independence of (X ′, Y ′).We introduce the notation

D′n = (X3, Y3), . . . , (Xn, Yn),

Ak,i,j = gn(Xk; (Xi, Yi), (Xj , Yj), D′n) 6= Yk ,

Bk,i = gn−1(Xk; (Xi, Yi), D′n) 6= Yk ,

and we formally replace (X,Y ) and (X ′, Y ′) by (Xα, Yα) and (Xβ , Yβ) sothat we may work with the indices α and β. With this notation, we haveshown thus far the following:

E(

L(D)n − Ln

)2≤ 1n

+ PAα,1,2, Aβ,1,2 −PAα,1,2, B1,2

+ PB1,2, B2,1 −PAα,1,2, B2,1.

Page 435: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.3 Nearest Neighbor Rules 434

Note that

PAα,1,2, Aβ,1,2 −PAα,1,2, B1,2= PAα,1,2, Aβ,1,2 −PAα,β,2, Bβ,2

(by symmetry)

= PAα,1,2, Aβ,1,2 −PAα,β,2, Aβ,1,2+PAα,β,2, Aβ,1,2 −PAα,β,2, Bβ,2

= I + II.

Also,

PB1,2, B2,1 −PAα,1,2, B1,2= PBα,β , Bβ,α −PAα,β,2, Bβ,2

(by symmetry)

= PBα,β , Bβ,α −PAα,β,2, Bβ,α+PAα,β,2, Bβ,α −PAα,β,2, Bβ,2

= III + IV.

Using the fact that for events Ci, |PCi, Cj−PCi, Ck| ≤ PCj4Ck,we bound

I ≤ PAα,1,24Aα,β,2,

II ≤ PAβ,1,24Bβ,2def= V,

III ≤ PBα,β4Aα,β,2 = V,

IV ≤ PBβ,α4Bβ,2.

The upper bounds for II and III are identical by symmetry. Also,

I ≤ PAα,1,24Bα,2+ PBα,24Aα,β,2 = 2V

andIV ≤ PBβ,α4Aβ,α,2+ PAβ,α,24Bβ,2 = 2V,

for a grand total of 6V . This concludes the proof. 2

24.3 Nearest Neighbor Rules

Theorem 24.2 can be used to obtain distribution-free upper bounds forspecific rules. Here is the most important example.

Page 436: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.3 Nearest Neighbor Rules 435

Theorem 24.3. (Rogers and Wagner (1978)). Let gn be the k-nearestneighbor rule with randomized tie-breaking. If L(D)

n is the deleted estimatewith gn−1 chosen as the k-nn rule (with the same k and with the samerandomizing random variables), then

E(

L(D)n − Ln

)2≤ 6k + 1

n.

Proof. Because of the randomized tie-breaking, the k-nn rule is symmet-ric, and Theorem 24.2 is applicable. We only have to show that

P gn(X,Dn) 6= gn−1(X,Dn−1) ≤k

n.

Clearly, gn(X,Dn) 6= gn−1(X,Dn−1) can happen only if Xn is among thek nearest neighbors of X. But the probability of this event is just k/n,since by symmetry, all points are equally likely to be among the k nearestneighbors. 2

Remark. If gn is the k-nn rule such that distance ties are broken bycomparing indices, then gn is not symmetric, and Theorem 24.2 is no longerapplicable (unless, e.g., X has a density). Another nonsymmetric classifieris the lazy histogram rule. 2

Remark. Applying Theorem 24.3 to the 1-nn rule, the Cauchy-Schwarzinequality implies E

∣∣∣L(D)n − Ln

∣∣∣ ≤√7/n for all distributions. 2

Remark. Clearly, the inequality of Theorem 24.3 holds for any rule that issome function of the k nearest points. For the k-nearest neighbor rule, witha more careful analysis, Devroye and Wagner (1979a) improved Theorem24.3 to

E(

L(D)n − Ln

)2≤ 1n

+24√k

n√

(see Problem 24.8). 2

Probability inequalities for |L(D)n −Ln| can also be obtained with further

work. By Chebyshev’s inequality we immediately get

P|L(D)n − Ln| > ε ≤

E(

L(D)n − Ln

)2

ε2,

Page 437: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.3 Nearest Neighbor Rules 436

so that the above bounds on the expected squared error can be used.Sharper distribution-free inequalities were obtained by Devroye and Wag-ner (1979a; 1979b) for several nonparametric rules. Here we present a resultthat follows immediately from what we have already seen:

Theorem 24.4. Consider the k-nearest neighbor rule with randomized tie-breaking. If L(D)

n is the deleted estimate with gn−1 chosen as the k-nn rulewith the same tie-breaking, then

P|L(D)n −ELn−1| > ε ≤ 2e−nε

2/(k2γ2d).

Proof. The result follows immediately from McDiarmid’s inequality by thefollowing argument: from Lemma 11.1, given n points in Rd, a particularpoint can be among the k nearest neighbors of at most kγd points. To seethis, just set µ equal to the empirical measure of the n points in Lemma11.1. Therefore, changing the value of one pair from the training data canchange the value of the estimate by at most 2kγd. Now, since EL(D)

n =ELn−1, Theorem 9.1 yields the result. 2

Exponential upper bounds for the probability P|L(D)n − Ln| > ε are

typically much harder to obtain. We mention one result without proof.

Theorem 24.5. (Devroye and Wagner (1979a)). For the k-nearestneighbor rule,

P|L(D)n − Ln| > ε ≤ 2e−nε

2/18 + 6e−nε3/(108k(γd+2)).

One of the drawbacks of the deleted estimate is that it requires muchmore computation than the resubstitution estimate. If conditional on Y = 0and Y = 1,X is gaussian, and the classification rule is the appropriate para-metric rule, then the estimate can be computed quickly. See Lachenbruchand Mickey (1968), Fukunaga and Kessel (1971), and McLachlan (1992)for further references.

Another, and probably more serious, disadvantage of the deleted estimateis its large variance. This fact can be illustrated by the following examplefrom Devroye and Wagner (1979b): let n be even, and let the distributionof (X,Y ) be such that Y is independent of X with PY = 0 = PY =1 = 1/2. Consider the k-nearest neighbor rule with k = n − 1. Thenobviously, Ln = 1/2. Clearly, if the number of zeros and ones among thelabels Y1, . . . , Yn are equal, then L(D)

n = 1. Thus, for 0 < ε < 1/2,

P|L(D)n − Ln| > ε ≥ P

n∑i=1

IYi=0 =n

2

=

12n

(n

n/2

).

By Stirling’s formula (Lemma A.3), we have

P|L(D)n − Ln| > ε ≥ 1√

2πn1

e1/12.

Page 438: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.4 Kernel Rules 437

Therefore, for this simple rule and certain distributions, the probabilityabove can not decrease to zero faster than 1/

√n. Note that in the example

above, EL(D)n = ELn−1 = 1/2, so the lower bound holds for P|L(D)

n −EL(D)

n | > ε as well. Also, in this example, we have

E(

L(D)n −EL(D)

n

)2≥ 1

4P

n∑i=1

IYi=0 =n

2

≥ 1

4√

2πn1

e1/12.

In Chapter 31 we describe other estimates with much smaller variances.

24.4 Kernel Rules

Theorem 24.2 may also be used to obtain tight distribution-free upperbounds for the performance of the deleted estimate of the error probabilityof kernel rules. We have the following bound:

Theorem 24.6. Assume that K ≥ 0 is a regular kernel of bounded sup-port, that is, it is a function satisfying

(i) K(x) ≥ β, ‖x‖ ≤ ρ,

(ii) K(x) ≤ B,

(iii) K(x) = 0, ‖x‖ > R,

for some positive finite constants β, ρ,B and R. Let the kernel rule bedefined by

gn(x) =

0 if∑ni=1 IYi=0K (x−Xi) ≥

∑ni=1 IYi=1K (x−Xi)

1 otherwise,

and define gn−1 similarly. Then there exist constants C1(d) depending upond only and C2(K) depending upon K only such that for all n,

E(

L(D)n − Ln

)2≤ 1n

+C1(d)C2(K)√

n.

One may take C2(K) = 6(1 +R/ρ)d/2 min(2, B/β).

Remark. Since C2(K) is a scale-invariant factor, the theorem applies tothe rule with K(u) replaced by Kh(u) = 1

hK(uh

)for any smoothing factor.

As it is, C2(K) is minimal and equal to 12 if we let K be the uniform kernelon the unit ball (R = ρ,B = β). The assumptions of the theorem requirethat gn−1 is defined with the same kernel and smoothing factor as gn. 2

Remark. The theorem applies to virtually any kernel of compact supportthat is of interest to the practitioners. Note, however, that the gaussian

Page 439: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.4 Kernel Rules 438

kernel is not covered by the result. The theorem generalizes an earlier resultof Devroye and Wagner (1979b), in which a more restricted class of kernelswas considered. They showed that if K is the uniform kernel then

E(

L(D)n − Ln

)2≤ 1

2n+

24√n.

See Problem 24.4. 2

We need the following auxiliary inequality, which we quote without proof:

Lemma 24.1. (Petrov (1975), p.44). Let Z1, . . . , Zn be real-valuedi.i.d. random variables. For ε > 0,

supx

P

n∑i=1

Zi ∈ [x, x+ ε]

≤ Cε√

n

1√sup0<λ≤ε λ

2P|Z1| ≥ λ/2,

where C is a universal constant.

Corollary 24.1. Let Z1, . . . , Zn be real-valued i.i.d. random variables.For ε ≥ λ > 0,

P

∣∣∣∣∣n∑i=1

Zi

∣∣∣∣∣ < ε

2

≤ C√

n

ε

λ

1√P|Z1| ≥ λ/2

,

where C is a universal constant.

Proof of Theorem 24.6. We apply Theorem 24.2, by finding an upperbound for

P gn(X,Dn) 6= gn−1(X,Dn−1) .

For the kernel rule with kernel K ≥ 0, in which h is absorbed,

P gn(X,Dn) 6= gn−1(X,Dn−1)

≤ P

∣∣∣∣∣n−1∑i=1

(2Yi − 1)K(X −Xi)

∣∣∣∣∣ ≤ K(X −Xn),K(X −Xn) > 0

.

Define B′ = max(2β,B). We have

P

∣∣∣∣∣n−1∑i=1

(2Yi − 1)K(X −Xi)

∣∣∣∣∣ ≤ K(X −Xn),K(X −Xn) > 0

≤ P

∣∣∣∣∣n−1∑i=1

(2Yi − 1)K(X −Xi)

∣∣∣∣∣ ≤ B′, Xn ∈ SX,R

(where SX,R is the ball of radius R centered at X)

Page 440: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.5 Histogram Rules 439

= E

IXn∈SX,R

2CB′

2β√n√

P|(2Y1 − 1)K(X −X1)| ≥ β|X

(by Corollary 24.1, since 2β ≤ B′)

≤ E

IXn∈SX,R

CB′

β√nµ(SX,ρ)

(recall that K(u) ≥ β for ‖u‖ ≤ ρ,

so that P|(2Y1 − 1)K(x−X1)| ≥ β ≥ PX1 ≥ Sx,ρ)

=CB′

β√n

∫ ∫Sx,R

µ(dy)√µ(Sx,ρ)

µ(dx)

≤ CcdB′

β√n

(1 +

R

ρ

)d/2,

where we used Lemma 10.2. The constant cd depends upon the dimensiononly. 2

24.5 Histogram Rules

In this section we discuss properties of the deleted estimate of the errorprobability of histogram rules. Let P = A1, A2, . . . be a partition of Rd,and let gn be the corresponding histogram classifier (see Chapters 6 and9). To get a performance bound for the deleted estimate, we can simplyapply Theorem 24.2.

Theorem 24.7. For the histogram rule gn corresponding to any partitionP, and for all n,

E(

L(D)n − Ln

)2≤ 1n

+ 6∑i

µ(Ai)3/2√π(n− 1)

+ 6∑i

µ2(Ai)e−nµ(Ai),

and in particular,

E(

L(D)n − Ln

)2≤ 1 + 6/e

n+

6√π(n− 1)

.

Proof. The first inequality follows from Theorem 24.2 if we can find anupper bound for Pgn(X) 6= gn−1(X). We introduce the notation

ν0,n(A) =1n

n∑j=1

IYj=0IXj∈A.

Page 441: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

24.5 Histogram Rules 440

Clearly, gn−1(X) can differ from gn(X) only if both Xn and X fall in thesame cell of the partition, and if the number of zeros in the cell is eitherequal, or less by one than the number of ones. Therefore, by independence,we have

Pgn(X) 6= gn−1(X)

=∑i

P gn(X) 6= gn−1(X)|X ∈ Ai, Xn ∈ Aiµ(Ai)2

≤∑i

Pν0,n−1(Ai) =

⌊µn−1(Ai)

2

⌋∣∣∣∣X ∈ Ai, Xn ∈ Aiµ(Ai)2.

The terms in the sum above may be bounded as follows:

Pν0,n−1(A) =

⌊µn−1(A)

2

⌋∣∣∣∣X ∈ A,Xn ∈ A

= Pν0,n−1(A) =

⌊µn−1(A)

2

⌋(by independence)

≤ Pµn−1(A) = 0

+EPν0,n−1(A) =

⌊µn−1(A)

2

⌋∣∣∣∣X1, . . . , Xn

Iµn−1(A)>0

≤ (1− µ(A))n + E

1√

2π(n− 1)µn−1(A)Iµn−1(A)>0

(by Lemma A.3)

≤ (1− µ(A))n +

√E

Iµn−1(A)>0

2π(n− 1)µn−1(A)

(by Jensen’s inequality)

≤ e−nµ(A) +

√1

π(n− 1)µ(A),

where in the last step we use Lemma A.2. This concludes the proof of thefirst inequality. The second one follows trivially by noting that xe−x ≤ 1/efor all x. 2

Remark. It is easy to see that the inequalities of Theorem 24.7 are tight,up to a constant factor, in the sense that for any partition P, there existsa distribution such that

E(

L(D)n − Ln

)2≥ 1

4√

2πn1

e1/12

(see Problem 24.5). 2

Page 442: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 441

Remark. The second inequality in Theorem 23.3 points out an importantdifference between the behavior of the resubstitution and the deleted esti-mates for histogram rules. As mentioned above, for some distributions thevariance of L(D)

n can be of the order 1/√n. This should be contrasted with

the much smaller variance of the resubstitution estimate. The small vari-ance of L(R)

n comes often with a larger bias. Other types of error estimateswith small variance are discussed in Chapter 31. 2

Remark. Theorem 24.3 shows that for any partition,

sup(X,Y )

E(

L(D)n − Ln

)2

= O

(1√n

).

On the other hand, if k = o(√n), where k is the number of cells in the

partition, then for the resubstitution estimate we have a better guaranteeddistribution-free performance:

sup(X,Y )

E(

L(R)n − Ln

)2

= o

(1√n

).

At first sight, the resubstitution estimate seems preferable to the deletedestimate. However, if the partition has a large number of cells, L(R)

n maybe off the mark; see Theorem 23.4. 2

Problems and Exercises

Problem 24.1. Show the following variant of Theorem 24.2: For all symmetricclassifiers

E(L(D)n − Ln

)2≤ 1

n+ 2Pgn(X,Dn) 6= gn−1(X,Dn−1)

+ Pgn(X,Dn) 6= gn(X,D∗n)

+ Pgn−1(X,Dn−1) 6= gn−1(X,D∗n−1),

where D∗n and D∗

n−1 are just Dn and Dn−1 with (X1, Y1) replaced by an inde-pendent copy (X0, Y0).

Problem 24.2. Let gn be the relabeling nn rule with the k-nn classifier as an-cestral rule, as defined in Chapter 11. Provide an upper bound for the squared

error E

(L

(D)n − Ln

)2

of the deleted estimate.

Problem 24.3. Let gn be the rule obtained by choosing the best k ≤ k0 in thek-nn rule (k0 is a constant) by minimizing the standard deleted estimate L

(D)n

with respect to k. How would you estimate the probability of error for this rule?Give the best possible distribution-free performance guarantees you can find.

Page 443: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 442

Problem 24.4. Consider the kernel rule with the window kernel K = S0,1. Showthat

E(L(D)

n − Ln)2≤ 1 + 6/e

n+

6√π(n− 1)

.

Hint: Follow the line of the proof of Theorem 24.7.

Problem 24.5. Show that for any partition P, there exists a distribution suchthat for the deleted estimate of the error probability of the corresponding his-togram rule,

E(L(D)n − Ln

)2≥ 1

4√

2πn

1

e1/12.

Hint: Proceed as in the proof of the similar inequality for k-nearest neighborrules.

Problem 24.6. Consider the k-spacings method (see Chapter 21). We estimate

the probability of error (Ln) by a modified deleted estimate Ln as follows:

Ln =1

n

n∑i=1

IYi 6=gn,i(Xi,Dn),

where gn,i is a histogram rule based upon the same k-spacings partition used forgn—that is, the partition determined by k-spacings of the data pointsX1, . . . , Xn—but in which a majority vote is based upon the Yj ’s in the same cell of the partitionwith Yi deleted. Show that

E(

Ln − Ln

)≤ 6k + 1

n.

Hint: Condition on the Xi’s, and verify that the inequality of Theorem 24.2remains valid.

Problem 24.7. Consider a rule in which we rank the real-valued observationsX1, . . . , Xn from small to large, to obtain X(1), . . . , X(n). Assume that X1 has a

density. Derive an inequality for the error |L(D)n − Ln| for some deleted estimate

L(D)n (of your choice), when the rule is defined by a majority vote over the data-

dependent partition

(−∞, X(1)], (X(1), X(2)], (X(3), X(5)], (X(6), X(9)], . . .

defined by 1, 2, 3, 4, . . . points, respectively.

Problem 24.8. Prove that for the k-nearest neighbor rule,

E(L(D)n − Ln

)2≤ 1

n+

24√k

n√

(Devroye and Wagner (1979a)). Hint: Obtain a refined upper bound for

P gn(X,Dn) 6= gn−1(X,Dn−1) ,

using techniques not unlike those of the proof of Theorem 24.7.

Problem 24.9. Open-ended problem. Investigate if Theorem 24.6 can be ex-tended to kernels with unbounded support such as the gaussian kernel.

Page 444: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 443Printer: Opaque this

25Automatic Kernel Rules

We saw in Chapter 10 that for a large class of kernels, if the smoothingparameter h converges to zero such that nhd goes to infinity as n → ∞,then the kernel classification rule is universally consistent. For a particularn, asymptotic results provide little guidance in the selection of h. On theother hand, selecting the wrong value of h may lead to catastrophic errorrates—in fact, the crux of every nonparametric estimation problem is thechoice of an appropriate smoothing factor. It tells us how far we generalizeeach data point Xi in the space. Purely atomic distributions require littlesmoothing (h = 0 will generally be fine), while distributions with densitiesrequire a lot of smoothing. As there are no simple tests for verifying whetherthe data are drawn from an absolutely continuous distribution—let alonea distribution with a Lipschitz density—it is important to let the data Dn

determine h. A data-dependent smoothing factor is merely a mathematicalfunction Hn :

(Rd × 0, 1

)n → [0,∞). For brevity, we will simply writeHn to denote the random variable Hn(Dn). This chapter develops resultsregarding such functions Hn.

This chapter is not a luxury but a necessity. Anybody developing softwarefor pattern recognition must necessarily let the data do the talking—infact, good universally applicable programs can have only data-dependentparameters.

Consider the family of kernel decision rules gn and let the smoothing fac-tor h play the role of parameter. The best parameter (HOPT) is the one thatminimizes Ln. Unfortunately, it is unknown, as is LOPT, the correspondingminimal probability of error. The first goal of any data-dependent smooth-ing factor Hn should be to approach the performance of HOPT. We are

Page 445: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.1 Consistency 444

careful here to avoid saying that Hn should be close to HOPT, as closenessof smoothing factors does not necessarily imply closeness of error probabil-ities and vice versa. Guarantees one might want in this respect are

ELn − LOPT ≤ αn

for some suitable sequence αn → 0, or better still,

ELn − L∗ ≤ (1 + βn)ELOPT − L∗,

for another sequence βn → 0. But before one even attempts to developsuch data-dependent smoothing factors, one’s first concern should be withconsistency: is it true that with the given Hn, Ln → L∗ in probability orwith probability one? This question is dealt with in the next section. Insubsequent sections, we give various examples of data-dependent smoothingfactors.

25.1 Consistency

We start with consistency results that generalize Theorem 10.1. The firstresult assumes that the value of the smoothing parameter is picked from adiscrete set.

Theorem 25.1. Assume that the random variable Hn takes its values fromthe set of real numbers of the form 1

(1+δn)k , where k is a nonnegative integerand δn > 0. Let K be a regular kernel function. (Recall Definition 10.1.)Define the kernel classification rule corresponding to the random smoothingparameter Hn by

gn(x) =

0 if

∑ni=1 IYi=0K

(x−Xi

Hn

)≥∑ni=1 IYi=1K

(x−Xi

Hn

)1 otherwise.

If1δn

= eo(n)

and

Hn → 0 and nHdn →∞ with probability one as n→∞,

then L(gn) → L∗ with probability one, that is, gn is strongly universallyconsistent.

Proof. The theorem is a straightforward extension of Theorem 10.1. Clearly,L(gn) → L∗ with probability one if and only if for every ε > 0, IL(gn)−L∗>ε →0 with probability one. Now, for any β > 0,

IL(gn)−L∗>ε ≤ I1/Hn>β,nHdn>β,L(gn)−L∗>ε + I1/Hn≤β + InHd

n≤β.

Page 446: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.1 Consistency 445

We have to show that the random variables on the right-hand side convergeto zero with probability one. The convergence of the second and third termsfollows from the conditions on Hn. The convergence of the first term followsfrom Theorem 10.1, since it states that for any ε > 0, there exist β > 0 andn0 such that for the error probability Ln,k of the kernel rule with smoothingparameter h = 1

(1+δn)k ,

PLn,k − L∗ > ε ≤ 4e−cnε2

for some constant c depending on the dimension only, provided that n > n0,h < 1/β and nhd > β. Now clearly,

PLn − L∗ > ε, 1/Hn > β, nHd

n > β

≤ P

sup

k:(1+δn)k>β,n/(1+δn)kd>β

Ln,k − L∗ > ε

≤ Cn sup

k:(1+δ)k>β,n/(1+δ)kd>β

PLn,k − L∗ > ε,

by the union bound, where Cn is the number of possible values of Hn inthe given range. As

β < (1 + δn)k <(n

β

)1/d

,

we note that

Cn ≤ 2 +1d log

(nβ

)− log β

log(1 + δn)= O

(log n/δ2n

)= eo(n)

by the condition on the sequence δn. Combining this with Theorem 10.1,for n > n0, we get

PLn − L∗ > ε, 1/Hn > β, nHd

n > β≤ 4Cne−cnε

2,

which is summable in n. The Borel-Cantelli lemma implies that

I1/Hn>β,nHdn>β,L(gn)−L∗>ε → 0

with probability one, and the theorem is proved. 2

For weak consistency, it suffices to require convergence of Hn and nHdn

in probability (Problem 25.1):

Theorem 25.2. Assume that the random variable Hn takes its values fromthe set of real numbers of the form 1

(1+δn)k , where k is a nonnegative integerand δn > 0. Let K be a regular kernel. If

1δn

= eo(n)

Page 447: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.1 Consistency 446

andHn → 0 and nHd

n →∞ in probability as n→∞,

then the kernel classification rule corresponding to the random smoothingparameter Hn is universally consistent, that is, L(gn) → L∗ in probability.

We are now prepared to prove a result similar to Theorem 25.1 withoutrestricting the possible values of the random smoothing parameter Hn. Fortechnical reasons, we need to assume some additional regularity conditionson the kernel function: K must be decreasing along rays starting from theorigin, but it should not decrease too rapidly. Rapidly decreasing functionssuch as the Gaussian kernel, or functions of bounded support, such as thewindow kernel are excluded.

Theorem 25.3. Let K be a regular kernel that is monotone decreasingalong rays, that is, for any x ∈ Rd and a > 1, K(ax) ≤ K(x). Assume inaddition that there exists a constant c > 0 such that for every sufficientlysmall δ > 0, and x ∈ Rd, K((1 + δ)x) ≥ (1 − cδ)K(x). Let Hn be asequence of random variables satisfying

Hn → 0 and nHdn →∞ with probability one, as n→∞.

Then the error probability L(gn) of the kernel classification rule with kernelK and smoothing parameter Hn converges to L∗ with probability one, thatis, the rule is strongly universally consistent.

Remark. The technical condition on K is needed to ensure that smallchanges in h do not cause dramatic changes in L(gn). We expect somesmooth behavior of L(gn) as a function of h. The conditions are ratherrestrictive, as the kernels must have infinite support and decrease slowerthan at a polynomial rate. An example satisfying the conditions is

K(x) =

1 if ‖x‖ ≤ 11/‖x‖r otherwise,

where r > 0 (see Problem 25.2). The conditions on Hn are by no meansnecessary. We have already seen that consistency occurs for atomic distri-butions if K(0) > 0 and Hn ≡ 0, or for distributions with L∗ = 1/2 whenHn takes any value. However, Theorem 25.3 provides us with a simplecollection of sufficient conditions. 2

Proof of Theorem 25.3. First we discretize Hn. Define a sequenceδn → 0 satisfying the condition in Theorem 25.1, and introduce the ran-dom variables Hn and Hn as follows: Hn = 1

(1+δn)Kn, where Kn is the

smallest integer such that Hn >1

(1+δn)Kn, and let Hn = (1+δn)Hn. Thus,

Hn < Hn ≤ Hn. Note that both Hn and Hn satisfy the conditions of

Page 448: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.1 Consistency 447

Theorem 25.1. As usual, the consistency proof is based on Theorem 2.3.Here, however, we need a somewhat tricky choice of the denominator of thefunctions that approximate η(x). Introduce

ηn,Hn(x) =

1n

∑ni=1 IYi=1K

(x−Xi

Hn

)∫K(x−zHn

)µ(dz)

.

Clearly, the value of the classification rule gn(x) equals one if and only ifηn,Hn

(x) is greater than the function defined similarly, with the IYi=1’sreplaced with IYi=0. Then by Theorem 2.3 it suffices to show that∫

|ηn,Hn(x)− η(x)|µ(dx) → 0

with probability one. We use the following decomposition:∫|ηn,Hn

(x)− η(x)|µ(dx)

≤∫|ηn,Hn

(x)− ηn,Hn(x)|µ(dx) +

∫|ηn,Hn

(x)− η(x)|µ(dx).(25.1)

The second term on the right-hand side converges to zero with probabilityone, which can be seen by repeating the argument of the proof of Theorem25.1, using the observation that in the proof of Theorem 10.1 we provedconsistency via an exponential probability inequality for∫

|ηn,hn(x)− η(x)|µ(dx).

The first term may be bounded as the following simple chain of inequalitiesindicates:∫|ηn,Hn

(x)− ηn,Hn(x)|µ(dx)

=∫ ∣∣∣∣∣∣

1n

∑ni=1 YiK

(x−Xi

Hn

)− 1

n

∑ni=1 YiK

(x−Xi

Hn

)∫K(x−zHn

)µ(dz)

∣∣∣∣∣∣µ(dx)

≤∫ 1

n

∑ni=1

(K(x−Xi

Hn

)−K

(x−Xi

Hn

))∫K(x−zHn

)µ(dz)

µ(dx)

(by the monotonicity of K)

Page 449: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.2 Data Splitting 448

≤∫ 1

n

∑ni=1(1− (1− cδn))K

(x−Xi

Hn

)∫K(x−zHn

)µ(dz)

µ(dx)

(from Hn = (1 + δn)Hn, and the condition on K, if n is large enough)

= cδn

∫ 1n

∑ni=1K

(x−Xi

Hn

)∫K(x−zHn

)µ(dz)

µ(dx).

Since Hn satisfies the conditions of Theorem 25.1, the integral on the right-hand side converges to one with probability one, just as we argued for thesecond term on the right-hand side of (25.1). But δn converges to zero.Therefore, the first term on the right-hand side of (25.1) tends to zero withprobability one. 2

Remark. A quick inspection of the proof above shows that if an andbn are deterministic sequences with the property that an < bn, bn → 0,and nadn →∞, then for the kernel estimate with kernel as in Theorem 25.3,we have

supan≤h≤bn

Ln(h) → L∗ with probability one

for all distributions. One would never use the worst smoothing factor overthe range [an, bn], but this corollary points out just how powerful Theorem25.3 is. 2

25.2 Data Splitting

Our first example of a data-dependent Hn is based upon the minimizationof a suitable error estimate. You should have read Chapter 22 on datasplitting if you want to understand the remainder of this section.

The data sequence Dn = (X1, Y1), . . . , (Xn, Yn) is divided into two parts.The first part Dm = (X1, Y1), . . . , (Xm, Ym) is used for training, while theremaining l = n−m pairs constitute the testing sequence:

Tl = (Xm+1, Ym+1), . . . , (Xm+l, Ym+l).

The training sequence Dm is used to design a class of classifiers Cm, which,in our case is the class of kernel rules based on Dm, with all possible valuesof h > 0, for fixed kernel K. Note that the value of the kernel rule gm(x)with smoothing parameter h is zero if and only if

fm(x) =m∑i=1

(Yi − 1/2)K(x−Xi

h

)≤ 0.

Page 450: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.2 Data Splitting 449

Classifiers in Cm are denoted by φm. A classifier is selected from Cm thatminimizes the holdout estimate of the error probability:

Lm,l(φm) =1l

l∑i=1

Iφm(Xm+i) 6=Ym+i.

The particular rule selected in this manner is called gn. The question ishow far the error probability L(gn) of the obtained rule is from that of theoptimal rule in Cm.

Finite collections. It is computationally attractive to restrict thepossible values of h to a finite set of real numbers. For example, Cm couldconsist of all kernel rules with h ∈ 2−km , 2−km+1, . . . , 1/2, 1, 2, . . . , 2km,for some positive integer km. The advantage of this choice of Cm is that thebest h in this class is within a factor of two of the best h among all possiblereal smoothing factors, unless the best h is smaller than 2−km−1 or largerthan 2km+1. Clearly, |Cm| = 2km+1, and as pointed out in Chapter 22, forthe selected rule gn Hoeffding’s inequality and the union bound imply that

PL(gn)− inf

φm∈Cm

L(φm) > ε

∣∣∣∣Dm

≤ (4km + 2)e−lε

2/2.

If km = eo(l), then the upper bound decreases exponentially in l, and, infact,

EL(gn)− inf

φm∈Cm

L(φm)

= O

(√log(l)l

).

By Theorem 10.1, Cm contains a subsequence of consistent rules if m→∞,and km → ∞ as n → ∞. To make sure that gn is strongly universallyconsistent as well, we only need that limn→∞ l = ∞, and km = eo(l) (seeTheorem 22.1). Under these conditions, the rule is

√log(km)/l-optimal (see

Chapter 22).The discussion above does little to help us with the selection of m, l, and

km. Safe, but possibly suboptimal, choices might be l = n/10, m = n − l,km = 2 log2 n. Note that the argument above is valid for any regular kernelK.

Infinite collections. If we do not want to exclude any value of thesmoothing parameter, and pick h from [0,∞), then Cm is of infinite car-dinality. Here, we need something stronger, like the Vapnik-Chervonenkistheory. For example, from Chapter 22, we have

PL(gn)− inf

φm∈Cm

L(φm) > ε

∣∣∣∣Dm

≤ 4e8S(Cm, l2)e−lε

2/2,

where S(Cm, l) is the l-th shatter coefficient corresponding to the class ofclassifiers Cm. We now obtain upper bounds for S(Cm, l) for different choicesof K.

Page 451: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.2 Data Splitting 450

Define the function

fm(x) = fm(x,Dm) =m∑i=1

(Yi −

12

)K

(x−Xi

h

).

Recall that for the kernel rule based on Dm,

gm(x) =

0 if fm(x,Dm) ≤ 01 otherwise.

We introduce the kernel complexity κm:

κm = supx,(x1,y1),...,(xm,ym)

Number of sign changes of

fm(x, (x1, y1), . . . , (xm, ym)) as h varies from 0 to infinity.

Suppose we have a kernel with kernel complexity κm. Then, as h variesfrom 0 to infinity, the binary l-vector

(sign(fm(Xm+1)), . . . , sign(fm(Xm+l))

changes at most lκm times. It can thus take at most lκm+1 different values.Therefore,

S(Cm, l) ≤ lκm + 1.

We postpone the issue of computing kernel complexities until the nextsection. It suffices to note that if gn is obtained by minimizing the holdouterror estimate Lm,l(φm) by varying h, then

EL(gn)− inf

φm∈Cm

L(φm)∣∣∣∣Dm

(25.2)

≤ 16

√log(8eS(Cm, l))

2l(Corollary 12.1)

≤ 16

√log(8e(lκm + 1))

2l. (25.3)

Various probability bounds may also be derived from the results of Chapter12. For example, we have

PL(gn)− inf

φm∈Cm

L(φm) > ε

∣∣∣∣Dm

≤ 4e8S(Cm, l2)e−lε

2/2

≤ 4e8(l2κm + 1)e−lε2/2.(25.4)

Theorem 25.4. Assume that gn minimizes the holdout estimate Lm,l(φm)over all kernel rules with fixed kernel K of kernel complexity κm, and overall (unrestricted) smoothing factors h > 0. Then gn is strongly universallyconsistent if

Page 452: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.2 Data Splitting 451

(i) limn→∞m = ∞;

(ii) limn→∞

log κml

= 0;

(iii) limn→∞

l

log n= ∞;

(iv) K is a regular kernel.

For weak universal consistency, (iii) may be replaced by (v): limn→∞ l = ∞.

Proof. Note that Cm contains a strongly universally consistent subsequence—take h = m−1/(2d) for example, and apply Theorem 10.1, noting that h→ 0,yet mhd →∞. Thus,

limn→∞

infφm∈Cm

L(φm) = L∗ with probability one.

It suffices to apply Theorem 22.1 and to note that the bound in (25.4) issummable in n when l/ log n → ∞ and log κm = o(l). For weak universalconsistency, a simple application of (25.2) suffices to note that we only needl→∞ instead of l/ log n→∞. 2

Approximation errors decrease with m. For example, if class densitiesexist, we may combine the inequality of Problem 2.10 with bounds fromDevroye and Gyorfi (1985) and Holmstrom and Klemela (1992) to concludethat E infφm∈Cm

L(φm)−L∗ is of the order of m−2α/(4+d), with α ∈ [1, 2],under suitable conditions on K and the densities. By (25.2), the estimationerror is O

(√log(lκm)/l

), requiring instead large values for l. Clearly, some

sort of balance is called for. Ignoring the logarithmic term for now, we seethat l should be roughly m4α/(4+d) if we are to balance errors of bothkinds. Unfortunately, all of this is ad hoc and based upon unverifiabledistributional conditions. Ideally, one should let the data select l and m.See Problem 25.3, and Problem 22.4 on optimal data splitting.

For some distributions, the estimation error is just too large to obtainasymptotic optimality as defined in Chapter 22. For example, the bestbound on the estimation error is O

(√log n/n

), attained when κm = 1,

l = n. If the distribution of X is atomic with finitely many atoms, then theexpected approximation error is O(1/

√m). Hence the error introduced by

the selection process smothers the approximation error when m is linear inn. Similar conclusions may even be drawn when X has a density: considerthe uniform distribution on [0, 1] ∪ [3, 4] with η(x) = 1 if x ∈ [0, 1] andη(x) = 0 if x ∈ [3, 4]. For the kernel rule with h = 1, K = I[−1,1], theexpected approximation error tends to L∗ = 0 exponentially quickly in m,and this is always far better than the estimation error which at best isO(√

log n/n).

Page 453: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.3 Kernel Complexity 452

25.3 Kernel Complexity

We now turn to the kernel complexity κm. The following lemmas are usefulin our computations. If l/ log n → ∞, we note that for strong consistencyit suffices that κm = O (mγ) for some finite γ (just verify the proof again).This, as it turns out, is satisfied for nearly all practical kernels.

Lemma 25.1. Let 0 ≤ b1 < · · · < bm be fixed numbers, and let ai ∈ R,1 ≤ i ≤ m, be fixed, with the restriction that am 6= 0. Then the functionf(x) =

∑mi=1 aix

bi has at most m− 1 nonzero positive roots.

Proof. Note first that f cannot be identically zero on any interval ofnonzero length. Let Z(g) denote the number of nonzero positive roots of afunction g. We have

Z

(m∑i=1

aixbi

)

= Z

(m∑i=1

aixci

)(where ci = bi − b1, all i; thus, c1 = 0)

≤ Z

(m∑i=2

aicixci−1

)+ 1

(for a continuously differentiable g, we have Z(g) ≤ 1 + Z(g′))

= Z

(m∑i=2

a′ixb′i

)+ 1

(where a′i = aici, b′i = ci − c2, all i ≥ 2).

Note that the b′i are increasing, b′2 = 0, and a′i 6= 0. As Z(amx

bm)

= 0 foram 6= 0, we derive our claim by simple induction on m. 2

Lemma 25.2. Let a1, . . . , am be fixed real numbers, and let b1, . . . , bm bedifferent nonnegative reals. Then if α 6= 0, the function

f(x) =m∑i=1

aie−bix

α

, x ≥ 0,

is either identically zero, or takes the value 0 at most m times.

Proof. Define y = e−xα

. If α 6= 0, y ranges from 0 to 1. By Lemma 25.1,g(y) =

∑mi=1 aiy

bi takes the value 0 at at most m − 1 positive y-values,unless it is identically zero everywhere. This concludes the proof of thelemma. 2

Page 454: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.3 Kernel Complexity 453

A star-shaped kernel is one of the form K(x) = Ix∈A, where A is astar-shaped set of unit Lebesgue measure, that is, x /∈ A implies cx /∈ A forall c ≥ 1. It is clear that κm = m − 1 by a simple thresholding argument.On the real line, the kernel K(x) =

∑∞i=−∞ aiIx∈[2i,2i+1] for ai > 0

oscillates infinitely often, and has κm = ∞ for all values of m ≥ 2. Wemust therefore disallow such kernels. For the same reason, kernels such asK(x) = (sinx/x)r on the real line are not good (see Problem 25.4).

If K =∑ki=1 aiIAi

for some finite k, some numbers ai and some star-shaped sets Ai, then κm ≤ k(m− 1).

Consider next kernels of the form

K(x) = ‖x‖−rIx∈A,

where A is star-shaped, and r ≥ 0 is a constant (see Sebestyen (1962)). Wesee that

fm(x) =m∑i=1

(Yi −12)K(x−Xi

h

)

= hrm∑i=1

(Yi −12)‖x−Xi‖−rI(x−Xi)/h∈A,

which changes sign at most as often as fm(x)/hr. From our earlier remarks,it is easy to see that κm = m−1, as κm is the same as for the kernelK = IA.If A is replaced byRd, then the kernel is not integrable, but clearly, κm = 0.Assume next that we have

K(x) =

1 if ‖x‖ ≤ 11/‖x‖r otherwise,

where r > 0. For r > d, these kernels are integrable and thus regular. Notethat

K(xh

)= I‖x‖≤h + I‖x‖>h

hr

‖x‖r.

As h increases, fm(x), which is of the form

∑i:‖x−Xi‖≤h

(Yi −

12

)+ hr

∑i:‖x−Xi‖>h

(Yi −

12

)1

‖x−Xi‖r,

transfers an Xi from one sum to the other at most m times. On an intervalon which no such transfer occurs, fm varies as α + βhr and has at mostone sign change. Therefore, κm cannot be more than m + 1 (one for eachh-interval) plus m (one for each transfer), so that κm ≤ 2m+ 1. For morepractice with such computations, we refer to the exercises. We now continuewith a few important classes of kernels.

Page 455: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.3 Kernel Complexity 454

Consider next exponential kernels such as

K(x) = e−‖x‖α

for some α > 0, where ‖ · ‖ is any norm on Rd. These kernels include thepopular gaussian kernels. As the decision rule based on Dm is of the form

gn(x) =

1 if

∑mi=1(2Yi − 1)e−

‖x−Xi‖α

hα > 00 otherwise,

a simple application of Lemma 25.2 shows that κm ≤ m. The entire classof kernels behaves nicely.

Among compact support kernels, kernels of the form

K(x) =

(k∑i=1

ai‖x‖bi

)I‖x‖≤1

for real numbers ai, and bi ≥ 1, are important. A particularly popularkernel in d-dimensional density estimation is Deheuvels’ (1977) kernel

K(x) =(1− ‖x‖2

)I‖x‖≤1.

If the kernel was K(x) =(∑k

i=1 ai‖x‖bi

), without the indicator function,

then Lemma 25.1 would immediately yield the estimate κm ≤ k, uniformlyover all m. Such kernels would be particularly interesting. With the indica-tor function multiplied in, we have κm ≤ km, simply because fm(x) at eachh is based upon a subset of the Xj ’s, 1 ≤ j ≤ m, with the subset growingmonotonically with h. For each subset, the function fm(x) is a polynomialin ‖x‖ with powers b1, . . . , bk, and changes sign at most k times. Therefore,polynomial kernels of compact support also have small complexities. Ob-serve that the “k” in the bound κm ≤ km refers to the number of terms inthe polynomial and not the maximal power.

A large class of kernels of finite complexity may be obtained by apply-ing the rich theory of total positivity. See Karlin (1968) for a thoroughtreatment. A real-valued function L of two real variables is said to be to-tally positive on A × B ⊂ R2 if for all n, and all s1 < · · · < sn, si ∈ A,t1 < · · · < tn, ti ∈ B, the determinant of the matrix with elements L(si, tj)is nonnegative. A key property of such functions is the following result,which we cite without proof:

Theorem 25.5. (Schoenberg (1950)). Let L be a totally positive func-tion on A×B ∈ R2, and let α : B → R be a bounded function. Define thefunction

β(s) =∫B

L(s, t)α(t)σ(dt),

Page 456: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.4 Multiparameter Kernel Rules 455

on A, where σ is a finite measure on B. Then β(s) changes sign at mostas many times as α(s) does. (The number of sign changes of a functionβ is defined as the supremum of sign changes of sequences of the formβ(s1), . . . , β(sn), where n is arbitrary, and s1 < · · · < sn.)

Corollary 25.1. Assume that the kernel K is such that the functionL(s, t) = K(st) is totally positive for s > 0 and t ∈ Rd. Then the ker-nel complexity of K satisfies κm ≤ m− 1.

Proof. We apply Theorem 25.5. We are interested in the number of signchanges of the function

β(s) =m∑i=1

(2Yi − 1)K((Xi − x)s)

on s ∈ (0,∞) (s plays the role of 1/h). But β(s) may be written as

β(s) =∫B

L(s, t)α(t)σ(dt),

where L(s, t) = K(st), the measure σ puts mass 1 on each t = Xi − x,i = 1, . . . ,m, and α(t) is defined at these points as

α(t) = 2Yi − 1 if t = Xi − x.

Other values of α(t) are irrelevant for the integral above. Clearly, α(t) canbe defined such that it changes sign at most m − 1 times. Then Theorem25.5 implies that β(s) changes sign at most m− 1 times, as desired. 2

This corollary equips us with a whole army of kernels with small com-plexity. For examples, refer to the monograph of Karlin (1968).

25.4 Multiparameter Kernel Rules

Assume that in the kernel rules considered in Cm, we perform an optimiza-tion with respect to more than one parameter. Collect these parameters inθ, and write the discrimination function as

fm(x) =m∑i=1

(Yi −

12

)Kθ(x−Xi).

Examples:

(i) Product kernels. Take

Kθ(x) =d∏j=1

K( x

h(j)

),

Page 457: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

25.5 Kernels of Infinite Complexity 456

where θ = (h(1), . . . , h(d)) is a vector of smoothing factors—one perdimension—and K is a fixed one-dimensional kernel.

(ii) Kernels of variable form. Define

Kθ(x) = e−‖x‖α/hα

,

where α > 0 is a shape parameter, and h > 0 is the standard smooth-ing parameter. Here θ = (α, h) is two-dimensional.

(iii) Define

Kθ(x) =1

1 + xTRRTx=

11 + ‖Rx‖2

,

where xT is the transpose of x, and R is an orthogonal transformationmatrix, all of whose free components taken together are collected in θ.Kernels of this kind may be used to adjust automatically to a certainvariance-covariance structure in the data.

We will not spend a lot of time on these cases. Clearly, one route is toproperly generalize the definition of kernel complexity. In some cases, it ismore convenient to directly find upper bounds for S(Cm, l). In the productkernel case, with one-dimensional kernel I[−1,1], we claim for example that

S(Cm, l) ≤ (lm)^d + 1.

The corresponding rule takes a majority vote over centered rectangles with sides equal to 2h(1), 2h(2), . . . , 2h(d). To see why the inequality is true, consider the d-dimensional quadrant of lm points obtained by taking the absolute values of the vectors Xj − Xi, m < j ≤ m + l = n, 1 ≤ i ≤ m, where the absolute value of a vector is a vector whose components are the absolute values of the components of the vector. To compute S(Cm, l), it suffices to count how many different subsets can be obtained from these lm points by considering all possible rectangles with one vertex at the origin, and the diagonally opposite vertex in the quadrant. This is 1 + (lm)^d. The strong universal consistency of the latter family Cm is insured when m → ∞ and l/log n → ∞.
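A minimal sketch of the centered-rectangle rule just described, with the vector of smoothing factors chosen by holdout error over a small list of candidates (an added Python illustration; the data, the candidate list, and all names are hypothetical):

```python
import numpy as np

def rectangle_kernel_rule(X_train, Y_train, h_vec):
    """Product-kernel rule with K = I_[-1,1] in every coordinate: a majority
    vote over the centered rectangle with half-sides h_vec[0..d-1].
    Votes are (2Y_i - 1); ties (vote <= 0) go to class 0, as in the text."""
    def classify(x):
        inside = np.all(np.abs(x - X_train) <= h_vec, axis=1)
        return int(np.sum(2 * Y_train[inside] - 1) > 0)
    return classify

rng = np.random.default_rng(1)
d, m, l = 2, 60, 40
X = rng.normal(size=(m + l, d))
Y = (X[:, 0] + 0.3 * rng.normal(size=m + l) > 0).astype(int)
Xm, Ym, Xl, Yl = X[:m], Y[:m], X[m:], Y[m:]

candidates = [np.array([h1, h2]) for h1 in (0.3, 0.6, 1.2) for h2 in (0.3, 0.6, 1.2)]
errors = [np.mean([rectangle_kernel_rule(Xm, Ym, h)(x) != y for x, y in zip(Xl, Yl)])
          for h in candidates]
best_h = candidates[int(np.argmin(errors))]
```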

25.5 Kernels of Infinite Complexity

In this section we demonstrate that not every kernel function supports smoothing factor selection based on data splitting. These kernels have infinite kernel complexity, or even worse, infinite vc dimension. Some of the examples may appear rather artificial, but some “nice” kernels will surprise us by misbehaving.


We begin with a kernel K having the property that the class of sets

A = { {x : K((x − X1)/h) > 0} : h > 0 }

has vc dimension VA = ∞ for any fixed value of X1 (i.e., using the notation of the previous sections, VC1 = ∞). Hence, with one sample point, all hope is lost to use the Vapnik-Chervonenkis inequality in any meaningful way. Unfortunately, the kernel K takes alternately positive and negative values. In the second part of this section, a kernel is constructed that is unimodal and symmetric and has VCm = ∞ for m = 4 when D4 = ((X1, Y1), (X2, Y2), (X3, Y3), (X4, Y4)) takes certain values. Finally, in the last part, we construct a positive kernel with the property that, for any nondegenerate nonatomic distribution, lim_{m→∞} P{VCm = ∞} = 1.

We return to A. Our function is picked as follows:

K(x) = α(x)g(i),  2^i ≤ x < 2^(i+1),  i ≥ 1,

where g(i) ∈ {−1, 1} for all i, α(x) > 0 for all x, α(x) ↓ 0 as x ↑ ∞ for x > 0, and α(x) ↓ 0 as x ↓ −∞ for x < 0 (the monotonicity is not essential and may be dropped, but the resulting class of kernels will be even less interesting). We enumerate all binary strings in lexicographical order, replace all 0’s by −1’s, and map the bit sequence to g(1), g(2), . . .. Hence, the bit sequence 0, 1, 00, 01, 10, 11, 000, 001, 010, . . . becomes

−1, 1, −1, −1, −1, 1, 1, −1, 1, 1, −1, −1, −1, −1, −1, 1, −1, 1, −1, . . . .

Call this sequence S. For every n, we can find a set {x1, . . . , xn} that can be shattered by sets from A. Take {x1, . . . , xn} = X1 + {2^1, . . . , 2^n}. A subset of this may be characterized by a string sn from {−1, 1}^n, 1’s denoting membership, and −1’s denoting absence. We find the first occurrence of sn in S and let the starting point be g(k). Take h = 2^(1−k). Observe that

( K((x1 − X1)/h), . . . , K((xn − X1)/h) )
= ( K(2^k), . . . , K(2^(k+n−1)) )
= ( α(2^k)g(k), α(2^(k+1))g(k + 1), . . . , α(2^(k+n−1))g(k + n − 1) ),

which agrees in sign with sn as desired. Hence, the vc dimension is infinite.

The next kernel is symmetric, unimodal, and piecewise quadratic. The intervals into which [0, ∞) is divided are denoted by A0, A1, A2, . . ., from left to right, where

Ai = [ (3/4)2^i, (3/2)2^i )  for i ≥ 1,  and  A0 = [0, 3/2).


On each Ai, K is of the form ax² + bx + c. Observe that K′′ = 2a takes the sign of a. Also, any finite symmetric difference of order 2 has the sign of a, as

K(x + δ) + K(x − δ) − 2K(x) = 2aδ²,  x − δ, x, x + δ ∈ Ai.

We take four points and construct the class

A4 = { {x : ∑_{i=1}^{4} K((x − Xi)/h)(2Yi − 1) > 0} : h > 0 },

where X1 = −δ, X2 = δ, X3 = X4 = 0, Y1 = Y2 = 0, Y3 = Y4 = 1 are fixed for now. On Ai, we let the “a” coefficient have the same sign as g(i), known from the previous example. All three quadratic coefficients are picked so that K ≥ 0 and K is unimodal. For each n, we show that the set {2^1, . . . , 2^n} can be shattered by intersecting with sets from A4. A subset is again identified by a {−1, 1}^n-valued string sn, and its first match in S is g(k), . . . , g(k + n − 1). We take h = 2^(1−k). Note that for 1 ≤ i ≤ n,

sign( K((2^i − δ)/h) + K((2^i + δ)/h) − 2K(2^i/h) )
= sign( K(2^(k+i−1) − δ2^(k−1)) + K(2^(k+i−1) + δ2^(k−1)) − 2K(2^(k+i−1)) )
= sign( K′′(2^(k+i−1)) )   (if δ ≤ 1/4, by the finite-difference property of quadratics)
= g(k + i − 1),

as desired. Hence, any subset of {2^1, . . . , 2^n} can be picked out.

The previous example works whenever δ ≤ 1/4. It takes just a little thought to see that if (X1, Y1), . . . , (X4, Y4) are i.i.d. and drawn from the distribution of (X, Y), then

P{VA4 = ∞} ≥ P{X1 = −X2, X3 = X4 = 0, Y1 = Y2 = 0, Y3 = Y4 = 1},

and this is positive if, given Y = 0, X has atoms at δ and −δ for some δ > 0, and, given Y = 1, X has an atom at 0. However, with some work, we may even remove these restrictions (see Problem 25.12).

We also draw the reader’s attention to an 8-point example in which Kis symmetric, unimodal, and convex on [0,∞), yet VC8 = ∞. (See Problem25.11.)

Next we turn to our general m-point example. Let the class of rules be given by

gm,h(x) = 1 if ∑_{i=1}^{m} (2Yi − 1)K((x − Xi)/h) > 0, and gm,h(x) = 0 otherwise,

h > 0, where the data (X1, Y1), . . . , (Xm, Ym) are fixed for now. We exhibit a nonnegative kernel such that if the Xi’s are different and not all the Yi’s are the same, VCm = ∞. This situation occurs with probability tending to one whenever X is nonatomic and P{Y = 1} ∈ (0, 1). It is stressed here that the same kernel K is used regardless of the data.

The kernel is of the form K(x) = K0(x)I_A(x), where K0(x) = e^{−x²}, and A ⊂ R will be specially picked. Order X1, . . . , Xm into X(1) < · · · < X(m) and find the first pair X(i), X(i+1) with opposite Y-values. Without loss of generality, assume that X(i) = −1, X(i+1) = 1, 2Y(i) − 1 = −1, and 2Y(i+1) − 1 = 1 (Y(i) is the Y-value that corresponds to X(i)). Let the smallest value of |X(j)|, j ≠ i, j ≠ i + 1, be denoted by δ > 1. The contribution of all such X(j)’s for x ∈ [−δ/3, δ/3] is not more than (m − 2)K0((1 + 2δ/3)/h) ≤ mK0((1 + 2δ/3)/h). The contribution of either X(i) or X(i+1) is at least K0((1 + δ/3)/h). For fixed δ and m (after all, they are given), we first find h* such that h ≤ h* implies

K0((1 + δ/3)/h) > mK0((1 + 2δ/3)/h).

For h ≤ h*, the rule, for x ∈ [−δ/3, δ/3], is equivalent to a 2-point rule based on

gm,h(x) = 1 if −K((x + 1)/h) + K((x − 1)/h) > 0, and gm,h(x) = 0 otherwise.

We define the set A by

A = ⋃_{k=1}^{∞} ⋃_{l=0}^{2^k − 1} 3^(2^k + l) Ak,l,

where the sets Ak,l are specified later. (For c ≠ 0 and B ⊂ R, cB = {x ∈ R : x/c ∈ B}.) Here k represents the length of a bit string, and l cycles through all 2^k bit strings of length k. For a particular such bit string b1, . . . , bk, represented by l, we define Ak,l as follows. First of all, Ak,l ⊆ [1/2, 3/2], so that all sets 3^(2^k + l) Ak,l are nonoverlapping. Ak,l consists of the sets

( ⋃_{i: bi=0} [ 1 − 1/((i+1)2^k), 1 − 1/((i+2)2^k) ) )  ∪  ( ⋃_{i: bi=1} ( 1 + 1/((i+2)2^k), 1 + 1/((i+1)2^k) ] ).

Ak,l is completed by symmetry.

We now exhibit for each integer k a set of that size that can be shattered. These sets will be positioned in [−δ/3, 0] at coordinate values −1/(2·2^ρ), −1/(3·2^ρ), . . . , −1/((k+1)·2^ρ), where ρ is a suitably large integer such that 1/2^(ρ+1) < δ/3. Assume that we wish to extract the set indexed by the bit vector (b1, . . . , bk) (bi = 1 means that −1/((i+1)2^ρ) must be extracted). To do so, we are only allowed to vary h. First, we find a pair (k′, l), where k′ is at least k and 1/2^(k′+1) < δ/3. We also need 3^(−2^k′) ≤ h*. Also, l is such that it matches b1, . . . , bk in its first k ≤ k′ bits. Take h = 3^(−2^k′ − l) and ρ = k′, and observe that

gm,h(−1/((i+1)2^k′))

= 1 if K( (−1/((i+1)2^k′) − 1) / 3^(−2^k′ − l) ) > K( (−1/((i+1)2^k′) + 1) / 3^(−2^k′ − l) ), and 0 otherwise

= 1 if 1 + 1/((i+1)2^k′) ∈ Ak′,l and 1 − 1/((i+1)2^k′) ∉ Ak′,l; 0 if 1 + 1/((i+1)2^k′) ∉ Ak′,l and 1 − 1/((i+1)2^k′) ∈ Ak′,l; unimportant otherwise

= 1 if bi = 1, and 0 if bi = 0,

for 1 ≤ i ≤ k. Thus, we pick the desired set. This construction may be repeated for all values of l, of course. We have shown the following:

Theorem 25.6. If X is nonatomic, P{Y = 1} ∈ (0, 1), and if VCm denotes the vc dimension of the class of kernel rules based on m i.i.d. data drawn from the distribution of (X, Y), and with the kernel specified above, then

lim_{m→∞} P{VCm = ∞} = 1.

25.6 On Minimizing the Apparent Error Rate

In this section, we look more closely at kernel rules that are picked by minimizing the resubstitution estimate Ln^(R) over kernel rules with smoothing factor h > 0. We make two remarks in this respect:

(i) The procedure is generally inconsistent if X is nonatomic.
(ii) The method is consistent if X is purely atomic.

To see (i), take K = I_{S0,1} (the indicator of the unit ball S0,1). We note that Ln^(R) = 0 if h < min_{i≠j} ‖Xi − Xj‖, as gn(Xi) = Yi for such h. In fact, if Hn is the minimizing h, it may take any value on

[ 0, min_{i,j: Yi=1, Yj=0} ‖Xi − Xj‖ ].


If X is independent of Y, P{Y = 1} = 2/3, and X has a density, then

lim_{c→∞} lim inf_{n→∞} P{ min_{i,j: Yi=1, Yj=0} ‖Xi − Xj‖^d ≤ c/n² } = 1

(Problem 25.13). Thus, nHn^d → 0 in probability, and therefore, in this case, E{L(gn)} → 2/3 (since ties are broken by favoring the “0” class). Hence the inconsistency.
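Remark (i) is easy to reproduce numerically. The sketch below (our own Python illustration; the distribution is hypothetical but mimics the situation above) shows that for the window kernel the resubstitution estimate is driven to zero by any h smaller than the smallest interpoint distance, even though the rule so obtained behaves like the h → 0 rule:

```python
import numpy as np

def resub_error(X, Y, h):
    """Resubstitution estimate L_n^(R) for the window-kernel rule K = I_{S_0,1}:
    each X_i is classified by the vote of the (2Y_j - 1) over the ball of
    radius h around X_i (including X_i itself); ties go to class 0."""
    n = len(X)
    err = 0
    for i in range(n):
        inside = np.linalg.norm(X - X[i], axis=1) <= h
        vote = int(np.sum(2 * Y[inside] - 1) > 0)
        err += (vote != Y[i])
    return err / n

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(size=(n, 1))
Y = (rng.uniform(size=n) < 2 / 3).astype(int)   # Y independent of X, P{Y=1}=2/3, L* = 1/3
tiny_h = 0.5 * np.min([np.linalg.norm(X[i] - X[j])
                       for i in range(n) for j in range(n) if i != j])
print(resub_error(X, Y, tiny_h))                # 0.0: resubstitution favors h -> 0
```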

Consider next X purely atomic. Fix the data (X1, Y1), . . . , (Xn, Yn), and consider the class Cn of kernel rules in which h > 0 is a free parameter. If a typical rule is gn,h, let An be the class of sets {x : gn,h(x) = 1}, h > 0. If gn is the rule that minimizes Ln^(R)(gn,h), we have

L(gn) − inf_{gn,h∈Cn} L(gn,h)
  ≤ 2 sup_{gn,h∈Cn} |Ln^(R)(gn,h) − L(gn,h)|   (Theorem 8.4)
  ≤ 2 sup_{A∈An} |µn(A) − µ(A)| + 2 sup_{A∈An^c} |µn(A) − µ(A)|
    (An^c denotes the collection of sets {x : gn,h(x) = 0})
  ≤ 4 sup_{B∈B} |µn(B) − µ(B)|,

where B is the collection of all Borel sets. However, the latter quantity tends to zero with probability one—denoting the set of atoms of X by T, we have

sup_{B∈B} |µn(B) − µ(B)|
  = (1/2) ∑_{x∈T} |µn({x}) − µ({x})|
  ≤ (1/2) ( ∑_{x∈A} |µn({x}) − µ({x})| + µn(A^c) + µ(A^c) )
    (where A is an arbitrary finite subset of T)
  ≤ 2µ(A^c) + (1/2) ∑_{x∈A} |µn({x}) − µ({x})| + |µn(A^c) − µ(A^c)|.

The first term is small by choice of A and the last two terms are smallby applying Hoeffding’s inequality to each of the |A| + 1 terms. For yet adifferent proof, the reader is referred to Problems 25.16 and 25.17.

Note next that

L(gn) − L* ≤ ( L(gn) − inf_{gn,h∈Cn} L(gn,h) ) + ( inf_{gn,h∈Cn} L(gn,h) − L* ).


But if K(0) > 0 and K(x) → 0 as ‖x‖ → ∞ along any ray, then we have inf_{gn,h∈Cn} L(gn,h) ≤ L(g′n), where g′n is the fundamental rule discussed in Chapter 27 (in which h = 0). Theorem 27.1 shows that L(g′n) → L* with probability one, and therefore, L(gn) → L* with probability one, proving

Theorem 25.7. Take any kernel K with K(0) > 0 and K(x) → 0 as ‖x‖ → ∞ along any ray, and let gn be selected by minimizing Ln^(R) over all h ≥ 0. Then L(gn) → L* almost surely whenever the distribution of X is purely atomic.

Remark. The above theorem is all the more surprising since we may take just about any kernel, such as K(x) = e^{−‖x‖²}, K(x) = sin(‖x‖)/‖x‖, or K(x) = (1 + ‖x‖)^{−1/3}. Also, if X puts its mass on a dense subset of Rd, the data will look and feel like data from an absolutely continuous distribution (well, very roughly speaking); yet there is a dramatic difference with rules that minimize Ln^(R) when the Xi’s are indeed drawn from a distribution with a density! □

25.7 Minimizing the Deleted Estimate

We have seen in Chapter 24 that the deleted estimate of the error probability of a kernel rule is generally very reliable. This suggests using the estimate as a basis for selecting a smoothing factor for the kernel rule. Let Ln^(D)(gn,h) be the deleted estimate of L(gn,h), obtained by using in gn−1,h the same kernel K and smoothing factor h. Define the set of h’s for which

Ln^(D)(gn,h) = inf_{h′∈[0,∞)} Ln^(D)(gn,h′)

by A. Set Hn = inf{h : h ∈ A}.
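In code, the selection just described looks as follows for the window kernel (a sketch we add for illustration; the finite grid of h values stands in for [0, ∞), and all names are hypothetical):

```python
import numpy as np

def deleted_estimate(X, Y, h):
    """Leave-one-out (deleted) estimate L_n^(D) for the window kernel
    K = I_{S_0,1}: X_i is classified by the vote of the other points that
    fall within distance h of it; ties (vote <= 0) go to class 0."""
    n = len(X)
    err = 0
    for i in range(n):
        inside = np.linalg.norm(X - X[i], axis=1) <= h
        inside[i] = False                     # delete the i-th pair
        vote = int(np.sum(2 * Y[inside] - 1) > 0)
        err += (vote != Y[i])
    return err / n

def select_h(X, Y, h_grid):
    """Return the smallest minimizer of the deleted estimate over h_grid
    (h_grid is assumed sorted in increasing order; argmin picks the first)."""
    est = np.array([deleted_estimate(X, Y, h) for h in h_grid])
    return h_grid[int(np.argmin(est))]
```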

Two fundamental questions regarding Hn must be asked:

(a) If we use Hn in the kernel estimate, is the rule universally consistent?Note in this respect that Theorem 25.3 cannot be used because it isnot true that Hn → 0 in probability in all cases.

(b) If Hn is used as smoothing factor, how does E{L(gn)} − L* compare to E{inf_h L(gn,h)} − L*?

To our knowledge, both questions have been unanswered so far. We believe that this way of selecting h is very effective. Below, generalizing a result by Tutz (1986) (Theorem 27.6), we show that the kernel rule obtained by minimizing the deleted estimate is consistent when the distribution of X is atomic with finitely many atoms.

Remark. If η ≡ 1 everywhere, so that Yi ≡ 1 for all i, and if K has support equal to the unit ball in Rd, and K > 0 on this ball, then Ln^(D)(gn,h) is minimal and zero if

h ≥ max_{1≤j≤n} ‖Xj − Xj^NN‖,

where Xj^NN denotes the nearest neighbor of Xj among the data points X1, . . . , Xj−1, Xj+1, . . . , Xn. But this shows that

Hn = max_{1≤j≤n} ‖Xj − Xj^NN‖.

We leave it as an exercise to show that for any nonatomic distribution of X, nHn^d → ∞ with probability one. However, it is also true that for some distributions, Hn → ∞ with probability one (Problem 25.19). Nevertheless, for η ≡ 1, the rule is consistent!

If η ≡ 0 everywhere, then Yi = 0 for all i. Under the same condition on K as above, with tie breaking in favor of class “0” (as in the entire book), it is clear that Hn ≡ 0. Interestingly, here too the rule is consistent despite the strange value of Hn. □

We state the following theorem for window kernels, though it can be extended to include kernels taking finitely many different values (see Problem 25.18):

Theorem 25.8. Let gn be the kernel rule with kernel K = IA for some bounded set A ⊂ Rd containing the origin, and with smoothing factor chosen by minimizing the deleted estimate as described above. Then E{L(gn)} → L* whenever X has a discrete distribution with finitely many atoms.

Proof. Let U1, . . . , Un be independent uniform [0, 1] random variables, independent of the data Dn, and consider the augmented sample

D′n = ((X′1, Y1), . . . , (X′n, Yn)),

where X′i = (Xi, Ui). For each h ≥ 0, introduce the rule

g′n,h(x′) = 1 if ∑_{i=1}^{n} K((x − Xi)/h)(2Yi − 1) − ∑_{i=1}^{n} K(0)(2Yi − 1)I{u=Ui} > 0, and g′n,h(x′) = 0 otherwise,

where x′ = (x, u) ∈ Rd × [0, 1]. Observe that minimizing the resubstitution estimate Ln^(R)(g′n,h) over these rules yields the same h as minimizing Ln^(D)(gn,h) over the original kernel rules. Furthermore, E{L(gn)} = E{L(g′n)} if g′n is obtained by minimizing Ln^(R)(g′n,h). It follows from the universal consistency of kernel rules (Theorem 10.1) that

inf_h L(g′n,h) → L* with probability one as n → ∞.

Thus, it suffices to investigate L(g′n) − inf_h L(g′n,h). For a given D′n, define the subsets A′h ⊂ Rd × [0, 1] by

A′h = {x′ : g′n,h(x′) = 1},  h ∈ [0, ∞).

Arguing as in the previous section, we see that

L(g′n) − inf_h L(g′n,h) ≤ 4 sup_h |νn(A′h) − ν(A′h)|,

where ν is the measure of X′ on Rd × [0, 1], and νn is the empirical measure determined by D′n. Observe that each A′h can be written as

A′h = ( Ah × ([0, 1] − {U1, . . . , Un}) ) ∪ B′h,

where

Ah = { x ∈ Rd : ∑_{i=1}^{n} K((x − Xi)/h)(2Yi − 1) > 0 },

and

B′h = { (x, u) ∈ Rd × {U1, . . . , Un} : ∑_{i=1}^{n} K((x − Xi)/h)(2Yi − 1) − K(0) ∑_{i=1}^{n} (2Yi − 1)I{u=Ui} > 0 }.

Clearly, as

νn(A′h) = νn(B′h) and ν(A′h) = µ(Ah),

it suffices to prove that sup_h |νn(B′h) − µ(Ah)| → 0 in probability as n → ∞. Then

sup_h |νn(B′h) − µ(Ah)|
  ≤ sup_h |νn(B′h) − ν(B′h)| + sup_h |ν(B′h) − µ(Ah)|
  ≤ sup_{B′⊂Rd×{U1,...,Un}} |νn(B′) − ν(B′)| + sup_h |ν(B′h) − µ(Ah)|.

The first term on the right-hand side tends to zero in probability, which may be seen by Problem 25.17 and the independence of the Ui’s of Dn. To bound the second term, observe that for each h,

|ν(B′h) − µ(Ah)| ≤ µ( { x : |∑_{i=1}^{n} K((x − Xi)/h)(2Yi − 1)| ≤ K(0) } )
  = ∑_{j=1}^{N} P{X = xj} I{ |∑_{i=1}^{n} K((xj − Xi)/h)(2Yi − 1)| ≤ K(0) },

where x1, . . . , xN are the atoms of X. Thus, the proof is finished if we show that for each xj,

E{ sup_h I{ |∑_{i=1}^{n} K((xj − Xi)/h)(2Yi − 1)| ≤ K(0) } } → 0.

Now, we exploit the special form K = IA of the kernel. Observe that

sup_h I{ |∑_{i=1}^{n} K((xj − Xi)/h)(2Yi − 1)| ≤ K(0) }
  = sup_h I{ |∑_{i=1}^{n} I_A((xj − Xi)/h)(2Yi − 1)| ≤ 1 }
  ≤ ∑_{k=1}^{N} I{ |V1 + · · · + Vk| ≤ 1 },

where

Vk = ∑_{i: Xi = x(k)} (2Yi − 1),

and x(1), x(2), . . . , x(N) is an ordering of the atoms of X such that x(1) = xj and (xj − x(k))/h ∈ A implies (xj − x(k−1))/h ∈ A for 1 < k ≤ N. By properties of the binomial distribution, as n → ∞,

E{ ∑_{k=1}^{N} I{ |V1 + · · · + Vk| ≤ 1 } } = O(1/√n).

This completes the proof. □

Historical remark. In an early paper by Habbema, Hermans, and van den Broek (1974), the deleted estimate is used to select an appropriate subspace for the kernel rule. The kernel rules in turn have smoothing factors that are selected by maximum likelihood. □


25.8 Sieve Methods

Sieve methods pick a best estimate or rule from a limited class of rules. For example, our sieve Ck might consist of rules of the form

φ(x) = 1 if ∑_{i=1}^{k} ai K((x − xi)/hi) > 0, and φ(x) = 0 otherwise,

where the ai’s are real numbers, the xi’s are points from Rd, and the hi’s are positive numbers. There are formally (d + 2)k free scalar parameters. If we pick a rule φ*n that minimizes the empirical error on (X1, Y1), . . . , (Xn, Yn), we are governed by the theorems of Chapters 12 and 18, and we will need to find the vc dimension of Ck. For this, conditions on K will be needed. We will return to this question in Chapter 30 on neural networks, as they are closely related to the sieves described here.
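For concreteness, here is the decision function of one member of such a sieve, written with a radial kernel (an added Python sketch; the parameters below are hypothetical, and the empirical error minimization over all (d + 2)k parameters, a nonconvex search, is not attempted here):

```python
import numpy as np

def sieve_classifier(a, centers, widths, kernel=lambda u: np.exp(-u ** 2)):
    """One member of the sieve C_k: phi(x) = 1 iff sum_i a_i K(||x - x_i|| / h_i) > 0.

    a       : shape (k,)   real coefficients
    centers : shape (k, d) the points x_i
    widths  : shape (k,)   positive smoothing factors h_i
    """
    def phi(x):
        u = np.linalg.norm(x - centers, axis=1) / widths
        return int(np.sum(a * kernel(u)) > 0)
    return phi

# k = 3 radial bumps in R^2 with hand-picked (hypothetical) parameters.
phi = sieve_classifier(a=np.array([1.0, -0.7, 0.4]),
                       centers=np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]),
                       widths=np.array([0.5, 0.8, 0.3]))
print(phi(np.array([0.1, -0.2])))
```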

25.9 Squared Error Minimization

Many researchers have considered the problem of selecting h in order to minimize the L2 error (integrated squared error) of the kernel estimate

ηn,h(x) = [ (1/n) ∑_{i=1}^{n} Yi K((Xi − x)/h) ] / [ (1/n) ∑_{i=1}^{n} K((Xi − x)/h) ]

of the regression function η(x) = P{Y = 1|X = x},

∫ (ηn,h(x) − η(x))² µ(dx).

For example, Hardle and Marron (1985) proposed and studied a cross-validation method for choosing the optimal h for the kernel regression estimate. They obtain asymptotic optimality for the integrated squared error. Although their method gives us a choice for h if we consider η(x) = P{Y = 1|X = x} as the regression function, it is not clear that the h thus obtained is optimal for the probability of error. In fact, as the following theorem illustrates, for some distributions, the smoothing parameter that minimizes the L2 error yields a rather poor error probability compared to that corresponding to the optimal h.

Theorem 25.9. Let d = 1. Consider the kernel classification rule with the window kernel K = I[−1,1], and smoothing parameter h > 0. Denote its error probability by Ln(h). Let h* be the smoothing parameter that minimizes the mean integrated squared error

E{ ∫ (ηn,h(x) − η(x))² µ(dx) }.


Then for some distributions,

lim_{n→∞} E{Ln(h*)} / inf_h E{Ln(h)} = ∞,

and the convergence is exponentially fast.

We leave the details of the proof to the reader (Problem 25.20). Only a rough sketch is given here. Consider X uniform on [0, 1] ∪ [3, 4], and define

η(x) = 1 − x/2 if x ∈ [0, 1], and η(x) = 2 − x/2 otherwise.

The optimal value of h (i.e., the value minimizing the error probability) is one. It is constant, independent of n. This shows that we should not a priori exclude any values of h, as is commonly done in studies on regression and density estimation. The minimal error probability can be bounded from above (using Hoeffding’s inequality) by e^{−c1 n} for some constant c1 > 0. On the other hand, straightforward calculations show that the smoothing factor h* that minimizes the mean integrated squared error goes to zero as n → ∞ as O(n^{−1/4}). The corresponding error probability is larger than e^{−c2 h*n} for some constant c2. The order-of-magnitude difference between the exponents explains the exponential speed of convergence to infinity.
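The example can be explored numerically. The sketch below (ours) estimates the error probability of the window-kernel rule at h = 1 and at h of the order n^{−1/4} by Monte Carlo on the distribution above; for a fixed moderate n both estimates are close to L* = 1/4, and the theorem is a statement about the exponential rate at which the excess errors vanish, not about their size at any single n.

```python
import numpy as np

rng = np.random.default_rng(3)

def eta(x):
    # eta(x) = 1 - x/2 on [0,1], 2 - x/2 on [3,4]; L* = 1/4 for this distribution
    return np.where(x <= 1.0, 1.0 - x / 2.0, 2.0 - x / 2.0)

def draw(n):
    x = np.where(rng.random(n) < 0.5, rng.uniform(0, 1, n), rng.uniform(3, 4, n))
    y = (rng.random(n) < eta(x)).astype(int)
    return x, y

def error_estimate(h, n=500, n_test=5000):
    X, Y = draw(n)
    Xt, Yt = draw(n_test)
    # window-kernel rule, d = 1: vote of the (2Y_i - 1) with |x - X_i| <= h
    votes = ((np.abs(Xt[:, None] - X[None, :]) <= h) * (2 * Y - 1)).sum(axis=1)
    return np.mean((votes > 0).astype(int) != Yt)

n = 500
print("h = 1        :", error_estimate(1.0, n))
print("h ~ n^(-1/4) :", error_estimate(n ** -0.25, n))
```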

Problems and Exercises

Problem 25.1. Prove Theorem 25.2.

Problem 25.2. Show that for any r > 0, the kernel

K(x) = 1 if ‖x‖ ≤ 1, and K(x) = 1/‖x‖^r otherwise,

satisfies the conditions of Theorem 25.3.

Problem 25.3. Assume that K is a regular kernel with kernel complexity κm ≤ m^γ for some constant γ > 0. Let gn be selected so as to minimize the holdout error estimate Ll,m(φm) over Cm, the class of kernel rules based upon the first m data points, with smoothing factor h > 0. Assume furthermore that we vary l over [log2 n, n/2], and that we pick the best l (and m = n − l) by minimizing the holdout error estimate again. Show that the obtained rule is strongly universally consistent.

Problem 25.4. Prove that the kernel complexity κm of the de la Vallee-Poussin kernel K(x) = (sin x/x)², x ∈ R, is infinite when m ≥ 2.

Problem 25.5. Show that κm = ∞ for m ≥ 2 when K(x) = cos²(x), x ∈ R.


Problem 25.6. Compute an upper bound for the kernel complexity κm for the following kernels, where x = (x(1), . . . , x(d)) ∈ Rd:

A. K(x) = ∏_{i=1}^{d} (1 − x(i)²) I{|x(i)|≤1}.

B. K(x) = 1/(1 + ‖x‖²)^α, α > 0.

C. K(x) = ∏_{i=1}^{d} 1/(1 + x(i)²).

D. K(x) = ∏_{i=1}^{d} cos(x(i)) I{|x(i)|≤π/2}.

Problem 25.7. Can you construct a kernel on R with the property that itscomplexity satisfies 2m ≤ κm <∞ for all m? Prove your claim.

Problem 25.8. Show that for kernel classes Cm with kernel rules having a fixedtraining sequence Dm but variable h > 0, we have VCm ≤ log2 κm.

Problem 25.9. Calculate upper bounds for S(Cm, l) when Cm is the class ofkernel rules based on fixed training data but with variable parameter θ in thefollowing cases (for definitions, see Section 25.4):

(1) Kθ is a product kernel of Rd, where the unidimensional kernel K haskernel complexity κm.

(2) Kθ(x) = 1/(1 + ‖x‖^α/h^α), where θ = (h, α), h > 0, α > 0 are two parameters.

(3) Kθ(x) = I{x∈A}, where A is any ellipsoid of Rd centered at the origin.

Problem 25.10. Prove or disprove: if Dm is fixed and Cm is the class of all kernelrules based on Dm with K = IA, A being any convex set of Rd containing theorigin, is it possible that VCm = ∞, or is VCm <∞ for all possible configurationsof Dm?

Problem 25.11. Let

A8 = { {x : ∑_{i=1}^{8} K((x − Xi)/h)(2Yi − 1) > 0} : h > 0 }

with X1 = −3δ, X2 = X3 = X4 = −δ, X5 = X6 = X7 = δ, X8 = 3δ, Y1 = 1, Y2 = Y3 = Y4 = −1, Y5 = Y6 = Y7 = 1, Y8 = −1, and let K be piecewise cubic. Extending the quadratic example in the text, show that the vc dimension of A8 is infinite for some K in this class that is symmetric, unimodal, positive, and convex on [0, ∞).

Problem 25.12. Draw (X1, Y1), . . . , (X4, Y4) from the distribution of (X, Y) on Rd × {0, 1}, where X has a density and η(x) ∈ (0, 1) at all x. Find a symmetric unimodal K such that

A4 = { {x : ∑_{i=1}^{4} K((x − Xi)/h)(2Yi − 1) > 0} : h > 0 }

has vc dimension satisfying P{VA4 = ∞} > 0. Can you find such a kernel K with a bounded support?

Problem 25.13. Let X have a density f on Rd and let X1, . . . , Xn be i.i.d., drawn from f. Show that

lim_{c→∞} lim inf_{n→∞} P{ min_{i≠j} ‖Xi − Xj‖^d ≤ c/n² } = 1.

Apply this result in the following situation: define Hn = min_{i,j: Yi=1, Yj=0} ‖Xi − Xj‖, where (X1, Y1), . . . , (Xn, Yn) are i.i.d. Rd × {0, 1}-valued random variables distributed as (X, Y), with X absolutely continuous, Y independent of X, and P{Y = 1} ∈ (0, 1). Show that

lim_{c→∞} lim inf_{n→∞} P{ Hn^d ≤ c/n² } = 1.

Conclude that nHn^d → 0 in probability. If you have a kernel rule with kernel S0,1 on Rd, and if the smoothing factor Hn is random but satisfies nHn^d → 0 in probability, then

lim_{n→∞} E{Ln} = P{Y = 1}

whenever X has a density. Show this.

Problem 25.14. Consider the variable kernel rule based upon the variable kernel density estimate of Breiman, Meisel, and Purcell (1977) and studied by Krzyzak (1983):

gn(x) = 1 if ∑_{i: Yi=1} K_{Hi}(x − Xi) > ∑_{i: Yi=0} K_{Hi}(x − Xi), and gn(x) = 0 otherwise.

Here K is a positive-valued kernel, Ku(x) = (1/u^d)K(x/u) for u > 0, and Hi is the distance between Xi and the k-th nearest neighbor of Xi among Xj, j ≠ i, 1 ≤ j ≤ n. Investigate the consistency of this rule when k/n → 0, k → ∞, and X1 has a density.

Problem 25.15. Continuation. Fix k = 1, and let K be the normal density inRd. If X1 has a density, what can you say about the asymptotic probability oferror of the variable kernel rule? Is the inequality of Cover and Hart still valid?Repeat the exercise for the uniform kernel on the unit ball of Rd.

Problem 25.16. If X1, . . . , Xn are discrete i.i.d. random variables and N denotes the number of different values taken by X1, . . . , Xn, then

lim_{n→∞} E{N/n} = 0,

and N/n → 0 with probability one. Hint: For the weak convergence, assume without loss of generality that the probabilities are monotone. For the strong convergence, use McDiarmid’s inequality.


Problem 25.17. Let B be the class of all Borel subsets of Rd. Using the previous exercise, show that for any discrete distribution,

sup_{A∈B} |µn(A) − µ(A)| → 0 with probability one.

Hint: Recall the necessary and sufficient condition E{log NA(X1, . . . , Xn)}/n → 0 from Chapter 12.

Problem 25.18. Prove Theorem 25.8 allowing kernel functions taking finitelymany different values.

Problem 25.19. Let X1, . . . , Xn be an i.i.d. sample drawn from the distribution of X. Let Xj^NN denote the nearest neighbor of Xj among X1, . . . , Xj−1, Xj+1, . . . , Xn. Define

Bn = max_j ‖Xj − Xj^NN‖.

(1) Show that for all nonatomic distributions of X, nBn^d → ∞ with probability one.
(2) Is it true that for every X with a density, there exists a constant c > 0 such that with probability one, nBn^d ≥ c log n for all n large enough?
(3) Exhibit distributions on the real line for which Bn → ∞ with probability one.

Hint: Look at the difference between the first and second order statistics.

Problem 25.20. Prove Theorem 25.9. Hint: Use the example given in the text. Get an upper bound for the error probability corresponding to h = 1 by Hoeffding’s inequality. The mean integrated squared error can be computed for every h in a straightforward way by observing that

E{ ∫ (ηn,h(x) − η(x))² µ(dx) } = ∫ ( Var{ηn,h(x)} + (E{ηn,h(x)} − η(x))² ) µ(dx).

Split the integral between 0 and 1 into three parts: 0 to h, h to 1 − h, and 1 − h to 1. Setting the derivative of the obtained expression with respect to h equal to zero leads to a third-order equation in h, whose roots are O(n^{−1/4}). To get a lower bound for the corresponding error probability, use the crude bound

E{Ln(h)} ≥ (1/2) ∫_{1−h}^{1} P{gn(X) ≠ Y | X = x} dx.

Now, estimate the tail of a binomial distribution from below, and use Stirling’s formula to show that, modulo polynomial factors, the error probability is larger than 2^{−n^{3/4}}.


26 Automatic Nearest Neighbor Rules

The error probability of the k-nearest neighbor rule converges to the Bayesrisk for all distributions when k → ∞, and k/n → 0 as n → ∞. Theconvergence result is extended here to include data-dependent choices ofk. We also look at the data-based selection of a metric and of weights inweighted nearest neighbor rules.

To keep the notation consistent with that of earlier chapters, random(data-based) values of k are denoted by Kn. In most instances, Kn ismerely a function of Dn, the data sequence (X1, Y1), . . . , (Xn, Yn). Thereader should not confuse Kn with the kernel K in other chapters.

26.1 Consistency

We start with a general theorem assessing strong consistency of the k-nearest neighbor rule with data-dependent choices of k. For the sake ofsimplicity, we assume the existence of the density of X. The general casecan be taken care of by introducing an appropriate tie-breaking method asin Chapter 11.

Theorem 26.1. Let K1,K2, . . . be integer valued random variables, andlet gn be the Kn-nearest neighbor rule. If X has a density, and

Kn →∞ and Kn/n→ 0 with probability one as n→∞,

then L(gn) → L∗ with probability one.


Proof. L(gn) → L* with probability one if and only if for every ε > 0, I{L(gn)−L*>ε} → 0 with probability one. Clearly, for any β > 0,

I{L(gn)−L*>ε} ≤ I{1/Kn+Kn/n<2β, L(gn)−L*>ε} + I{1/Kn+Kn/n≥2β}.

We are done if both random variables on the right-hand side converge to zero with probability one. The convergence of the second term for all β > 0 follows trivially from the conditions of the theorem. The convergence of the first term follows from the remark following Theorem 11.1, which states that for any ε > 0, there exist β > 0 and n0 such that for the error probability Ln,k of the k-nearest neighbor rule,

P{Ln,k − L* > ε} ≤ 4e^{−cnε²}

for some constant c depending on the dimension only, provided that n > n0, k > 1/β, and k/n < β. Now clearly,

P{L(gn) − L* > ε, 1/Kn + Kn/n < 2β} ≤ P{ sup_{1/β≤k≤nβ} (Ln,k − L*) > ε }
  ≤ n sup_{1/β≤k≤nβ} P{Ln,k − L* > ε},

by the union bound. Combining this with Theorem 11.1, we get

P{L(gn) − L* > ε, 1/Kn + Kn/n < 2β} ≤ 4ne^{−cnε²},  n ≥ n0.

The Borel-Cantelli lemma implies that

I{1/Kn+Kn/n<2β, L(gn)−L*>ε} → 0

with probability one, and the theorem is proved. □

Sometimes we only know that Kn →∞ and Kn/n→ 0 in probability. Insuch cases weak consistency is guaranteed. The proof is left as an exercise(Problem 26.1).

Theorem 26.2. Let K1,K2, . . . be integer valued random variables, andlet gn be the Kn-nearest neighbor rule. If X has a density, and

Kn →∞ and Kn/n→ 0 in probability as n→∞,

then lim_{n→∞} E{L(gn)} = L*, that is, gn is weakly consistent.

26.2 Data Splitting

Consistency by itself may be obtained by choosing k = ⌊√n⌋, but few—if any—users will want to blindly use such recipes. Instead, a healthy dose of feedback from the data is preferable. If we proceed as in Chapter 22, we may split the data sequence Dn = (X1, Y1), . . . , (Xn, Yn) into a training sequence Dm = (X1, Y1), . . . , (Xm, Ym) and a testing sequence Tl = (Xm+1, Ym+1), . . . , (Xn, Yn), where m + l = n. The training sequence Dm is used to design a class of classifiers Cm. The testing sequence is used to select a classifier from Cm that minimizes the holdout estimate of the error probability,

Lm,l(φm) = (1/l) ∑_{i=1}^{l} I{φm(Xm+i) ≠ Ym+i}.

If Cm contains all k-nearest neighbor rules with 1 ≤ k ≤ m, then |Cm| = m. Therefore, we have

P{ L(gn) − inf_{φm∈Cm} L(φm) > ε | Dm } ≤ 2me^{−lε²/2},

where gn is the selected k-nearest neighbor rule. By combining Theorems 11.1 and 22.1, we immediately deduce that gn is universally consistent if

lim_{n→∞} m = ∞,  lim_{n→∞} l/log m = ∞.

It is strongly universally consistent if lim_{n→∞} l/log n = ∞ also. Note too that

E{ |L(gn) − inf_{φm∈Cm} L(φm)| } ≤ 2 √( (log(2m) + 1) / (2l) )

(by Problem 12.1), so that it is indeed important to pick l much larger than log m.
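A direct implementation of this data-splitting selection of k (an added Python sketch with hypothetical names; it recomputes the neighbors for every k, which is wasteful but keeps the code short):

```python
import numpy as np

def knn_predict(Xtr, Ytr, x, k):
    """k-NN vote with ties broken in favor of class 0, as in the text."""
    idx = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
    return int(np.sum(2 * Ytr[idx] - 1) > 0)

def select_k_by_holdout(X, Y, m):
    """Split D_n into a training part of size m and a testing part of size
    l = n - m, and return the k in {1, ..., m} with smallest holdout error."""
    Xm, Ym, Xl, Yl = X[:m], Y[:m], X[m:], Y[m:]
    errors = []
    for k in range(1, m + 1):
        err = np.mean([knn_predict(Xm, Ym, x, k) != y for x, y in zip(Xl, Yl)])
        errors.append(err)
    return 1 + int(np.argmin(errors))
```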

26.3 Data Splitting for Weighted NN Rules

Royall (1966) introduced the weighted nn rule in which the i-th nearestneighbor receives weight wi, where w1 ≥ w2 ≥ . . . ≥ wk ≥ 0 and thewi’s sum to one. We assume that wk+1 = · · · = wn = 0 if there are ndata points. Besides the natural appeal of attaching more weight to nearerneighbors, there is also a practical by-product: if the wi’s are all of the form1/zi, where zi is a prime integer, then no two subsums of wi’s are equal,and therefore voting ties are avoided altogether.
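In code, the weighted nn decision for a fixed weight vector might look as follows (an added sketch; the weight vector is assumed to be given):

```python
import numpy as np

def weighted_nn_predict(X, Y, w, x):
    """Weighted NN rule: the i-th nearest neighbor of x receives weight w[i]
    (w nonincreasing and summing to one); decide 1 iff the weighted vote of
    the (2Y_i - 1) is positive, ties going to class 0."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    k = len(w)
    return int(np.sum(w * (2 * Y[order[:k]] - 1)) > 0)
```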

Consider now data splitting in which Cm consists of all weighted k-nn rules as described above—clearly, k ≤ m now. As |Cm| = ∞, we compute the shatter coefficients S(Cm, l). We claim that

S(Cm, l) ≤ ∑_{i=1}^{k} (l choose i) ≤ l^k if l ≥ k, and S(Cm, l) ≤ 2^l if l < k.    (26.1)


This result is true even if we do not insist that w1 ≥ w2 ≥ . . . ≥ wk ≥ 0.

Proof of (26.1). Each Xj in the testing sequence is classified based upon the sign of ∑_{i=1}^{k} aij wi, where aij ∈ {−1, 1} depends upon the class of the i-th nearest neighbor of Xj among X1, . . . , Xm (and does not depend upon the wi’s). Consider the l-vector of signs of ∑_{i=1}^{k} aij wi, m < j ≤ n. In the computation of S(Cm, l), we consider the aij’s as fixed numbers, and vary the wi’s subject to the condition laid out above. Here is the crucial step in the argument: the collection of all vectors (w1, . . . , wk) for which Xj is assigned to class 1 is a linear halfspace of Rk. Therefore, S(Cm, l) is bounded from above by the number of cells in the partition of Rk defined by l linear halfspaces. This is bounded by ∑_{i=1}^{k} (l choose i) (see Problem 22.1) if k ≤ l. □

Let gn be the rule in Cm that minimizes the empirical error committed on the test sequence (Xm+1, Ym+1), . . . , (Xn, Yn). Then by (22.1), if l = n − m ≥ k, we have

P{ L(gn) − inf_{φm∈Cm} L(φm) > ε | Dm } ≤ 4e^8 l^{2k} e^{−lε²/2}.

If k → ∞ (which implies m → ∞), we have universal consistency when k log(l)/l → 0. The estimation error is of the order of √(k log l / l)—in the terminology of Chapter 22, the rule is √(k log l / l)-optimal. This error must be weighed against the unknown approximation error. Let us present a quick heuristic argument. On the real line, there is compelling evidence to suggest that when X has a smooth density, k = cm^{4/5} is nearly optimal. With this choice, if both l and m grow linearly in n, the estimation error is of the order of √(log n)/n^{1/10}. This is painfully large—to reduce this error by a factor of two, sample sizes must rise by a factor of about 1000. The reason for this disappointing result is that Cm is just too rich for the values of k that interest us. Automatic selection may lead to rules that overfit the data.

If we restrict Cm by making m very small, the following rough argument may be used to glean information about the size of m and l = n − m. We will take k = m ≪ n. For a smooth regression function η, the approximation error may be anywhere between m^{−2/5} and m^{−4/5} on the real line. As the estimation error is of the order of √(m log n / n), equating the errors leads to the rough recipe that m ≈ (n/log n)^{5/9} and m ≈ (n/log n)^{5/13}, respectively. Both errors are then about (n/log n)^{−2/9} and (n/log n)^{−4/9}, respectively. This is better than with the previous example with m linear in n. Unfortunately, it is difficult to test whether the conditions on η that guarantee certain errors are satisfied. The above procedure is thus doomed to remain heuristic.


26.4 Reference Data and Data Splitting

Split the data into Dm and Tl as is done in the previous section. Let Cm contain all 1-nn rules that are based upon the data (x1, y1), . . . , (xk, yk), where k ≤ m is to be picked, {x1, . . . , xk} ⊂ {X1, . . . , Xm}, and (y1, . . . , yk) ∈ {0, 1}^k. Note that because the yi’s are free parameters, (x1, y1), . . . , (xk, yk) is not necessarily a subset of (X1, Y1), . . . , (Xm, Ym)—this allows us to flip certain y-values at some data points. Trivially, |Cm| = (m choose k) 2^k. Hence,

P{ L(gn) − inf_{φm∈Cm} L(φm) > ε | Dm } ≤ 2^{k+1} (m choose k) e^{−lε²/2},

where l = n − m, and gn ∈ Cm minimizes the empirical error on the test sequence Tl. The best rule in Cm is universally consistent when k → ∞ (see Theorem 19.4). Therefore, gn is universally consistent when the above bound converges to zero. Sufficient conditions are

(i) lim_{n→∞} l = ∞;
(ii) lim_{n→∞} (k log m)/l = 0.

As the estimation error is O(√(k log m / l)), it is important to make l large, while keeping k small.

The selected sequence (x1, y1), . . . , (xk, yk) may be called reference data, as it captures the information in the larger data set. If k is sufficiently small, the computation of gn(x) is extremely fast. The idea of selecting reference data or throwing out useless or “bad” data points has been proposed and studied by many researchers under names such as condensed nn rules, edited nn rules, and selective nn rules. See Hart (1968), Gates (1972), Wilson (1972), Wagner (1973), Ullmann (1974), Ritter et al. (1975), Tomek (1976b), and Devijver and Kittler (1980). See also Section 19.1.
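A sketch of the resulting classifier once a reference set has been chosen (our own illustration; the search over reference sets itself, a combinatorial problem, is only hinted at in the comments):

```python
import numpy as np

def one_nn_rule(reference_x, reference_y):
    """1-NN rule based on a small reference set (x_1, y_1), ..., (x_k, y_k)."""
    def g(x):
        j = int(np.argmin(np.linalg.norm(reference_x - x, axis=1)))
        return int(reference_y[j])
    return g

def holdout_error(g, X_test, Y_test):
    return np.mean([g(x) != y for x, y in zip(X_test, Y_test)])

# In the scheme of this section one would enumerate (or search heuristically
# over) subsets {x_1,...,x_k} of {X_1,...,X_m} together with arbitrary labels
# y_i in {0,1}, keeping the pair that minimizes holdout_error on T_l; the full
# enumeration over binom(m, k) * 2^k candidates is feasible only for tiny k.
```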

26.5 Variable Metric NN Rules

The data may also be used to select a suitable metric for use with the k-nn rule. The metric adapts itself for certain scale information gleaned from the data. For example, we may compute the distance between x1 and x2 by the formula

‖Aᵀ(x1 − x2)‖ = ( (x1 − x2)ᵀAAᵀ(x1 − x2) )^{1/2} = ( (x1 − x2)ᵀΣ(x1 − x2) )^{1/2},

where x1 − x2 is a column vector, (·)ᵀ denotes transpose, A is a d × d transformation matrix, and Σ = AAᵀ is a positive definite matrix. The elements of A or Σ may be determined from the data according to some heuristic formulas. We refer to Fukunaga and Hostetler (1973), Short and Fukunaga (1981), Fukunaga and Flick (1984), and Myles and Hand (1990) for more information.

For example, the object of principal component analysis is to find a transformation matrix A such that the components of the vector AᵀX have unit variance and are uncorrelated. These methods are typically based on estimating the eigenvalues of the covariance matrix of X. For such situations, we prove the following general consistency result:

Theorem 26.3. Let the random metric ρn be of the form

ρn(x, y) = ‖Anᵀ(x − y)‖,

where the matrix An is a function of X1, . . . , Xn. Assume that distance ties occur with zero probability, and that there are two sequences of nonnegative random variables mn and Mn such that for any n and x, y ∈ Rd,

mn‖x − y‖ ≤ ρn(x, y) ≤ Mn‖x − y‖

and

P{ lim inf_{n→∞} mn/Mn = 0 } = 0.

If

lim_{n→∞} kn = ∞ and lim_{n→∞} kn/n = 0,

then the kn-nearest neighbor rule based on the metric ρn is consistent.

Proof. We verify the three conditions of Theorem 6.3. In this case, Wni(X) = 1/kn if Xi is one of the kn nearest neighbors of X (according to ρn), and zero otherwise. Condition (iii) holds trivially. Just as in the proof of consistency of the ordinary kn-nearest neighbor rule, for condition (i) we need the property that the number of data points that can be among the kn nearest neighbors of a particular point is at most knγd, where the constant γd depends on the dimension only. This is a deterministic property, and it can be proven exactly the same way as for the standard nearest neighbor rule. The only condition of Theorem 6.3 whose justification needs extra work is condition (ii): we need to show that for any a > 0,

lim_{n→∞} E{ ∑_{i=1}^{n} Wni(X) I{‖X−Xi‖>a} } = 0.

Denote the k-th nearest neighbor of X according to ρn by X̃(k), and the k-th nearest neighbor of X according to the Euclidean metric by X(k).


Then

E{ ∑_{i=1}^{n} Wni(X) I{‖X−Xi‖>a} }
  = (1/kn) E{ ∑_{i=1}^{n} I{ρn(X,Xi)≤ρn(X,X̃(kn)), ‖X−Xi‖>a} }
  ≤ (1/kn) E{ ∑_{i=1}^{n} I{ρn(X,Xi)≤ρn(X,X̃(kn)), ρn(X,Xi)>mn a} }
    (since mn‖x − y‖ ≤ ρn(x, y))
  = (1/kn) E{ ∑_{i=1}^{n} I{mn a ≤ ρn(X,Xi) ≤ ρn(X,X̃(kn))} }
  = (1/kn) E{ ∑_{j=1}^{kn} I{mn a ≤ ρn(X,X̃(j))} }
  ≤ P{ mn a ≤ ρn(X, X̃(kn)) }
  = P{ ∑_{i=1}^{n} I{ρn(X,Xi) ≤ mn a} < kn }
  ≤ P{ ∑_{i=1}^{n} I{Mn‖X−Xi‖ ≤ mn a} < kn }
    (since Mn‖x − y‖ ≥ ρn(x, y))
  = P{ ‖X − X(kn)‖ > a mn/Mn }.

But we know from the consistency proof of the ordinary kn-nearest neighbor rule in Chapter 11 that for each a > 0,

lim_{n→∞} P{ ‖X − X(kn)‖ > a } = 0.

It follows from the condition on mn/Mn that the probability above converges to zero as well. □

The conditions of the theorem hold if, for example, the elements of the matrix An converge to the elements of an invertible matrix A in probability. In that case, we may take mn as the smallest, and Mn as the largest eigenvalues of AnᵀAn. Then mn/Mn converges to the ratio of the smallest and largest eigenvalues of AᵀA, a positive number.
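As an added illustration, here is one such choice in code: An is taken from the eigendecomposition of the sample covariance matrix, so that AnᵀX has empirically uncorrelated, unit-variance components (names are hypothetical; the small ridge term is a numerical safeguard, not part of the theory):

```python
import numpy as np

def whitened_knn(X, Y, k, x):
    """k-NN rule in the metric rho_n(u, v) = ||A_n^T (u - v)||, with
    A_n^T = D^{-1/2} U^T taken from the eigendecomposition U D U^T of the
    sample covariance of X.  Ties (vote <= 0) go to class 0."""
    cov = np.cov(X, rowvar=False) + 1e-9 * np.eye(X.shape[1])  # keep it invertible
    eigval, U = np.linalg.eigh(cov)
    An_T = np.diag(eigval ** -0.5) @ U.T
    dist = np.linalg.norm((X - x) @ An_T.T, axis=1)            # ||A_n^T (X_i - x)||
    idx = np.argsort(dist)[:k]
    return int(np.sum(2 * Y[idx] - 1) > 0)
```

When the sample covariance converges to an invertible matrix, this choice of An satisfies the hypotheses of Theorem 26.3 with mn and Mn the square roots of its extreme eigenvalues.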

If we pick the elements of A by minimizing the empirical error of a test sequence Tl over Cm, where Cm contains all k-nn rules based upon a training sequence Dm (thus, the elements of A are the free parameters), the value of S(Cm, l) is too large to be useful—see Problem 26.3. Furthermore, such minimization is not computationally feasible.

26.6 Selection of k Based on the Deleted Estimate

If you wish to use all the available data in the training sequence, without splitting, then empirical selection based on minimization of other estimates of the error probability may be your solution. Unfortunately, performance guarantees for the selected rule are rarely available. If the class of rules is finite, as when k is selected from {1, . . . , n}, there are useful inequalities. We will show you how this works.

Let Cn be the class of all k-nearest neighbor rules based on a fixed training sequence Dn, but with k variable. Clearly, |Cn| = n. Assume that the deleted estimate is used to pick a classifier gn from Cn:

Ln^(D)(gn) ≤ Ln^(D)(φn) for all φn ∈ Cn.

We can derive performance bounds for gn from Theorem 24.5. Since theresult gives poor bounds for large values of k, the range of k’s has to berestricted—see the discussion following Theorem 24.5. Let k0 denote thevalue of the largest k allowed, that is, Cn now contains all k-nearest neighborrules with k ranging from 1 to k0.

Theorem 26.4. Let gn be the classifier minimizing the deleted estimate of the error probability over Cn, the class of k-nearest neighbor rules with 1 ≤ k ≤ k0. Then

P{ L(gn) − inf_{φn∈Cn} L(φn) > ε } ≤ k0 e^{−cnε³/k0},

where c is a constant depending on the dimension only. If k0 → ∞ andk0 log n/n→ 0, then gn is strongly universally consistent.

Proof. From Theorem 8.4 we recall that

L(gn) − inf_{φn∈Cn} L(φn) ≤ 2 sup_{φn∈Cn} |L(φn) − Ln^(D)(φn)|.

The inequality now follows from Theorem 24.5 via the union bound. Universal consistency follows from the previous inequality and the fact that the k0-nn rule is strongly universally consistent (see Theorem 11.1). □
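A direct leave-one-out implementation of this selection (an added Python sketch with hypothetical names; distances are sorted once per data point and reused for every k ≤ k0):

```python
import numpy as np

def select_k_by_deleted_estimate(X, Y, k0):
    """Pick k in {1, ..., k0} minimizing the deleted (leave-one-out) estimate
    of the k-NN error probability; ties among the votes go to class 0."""
    n = len(X)
    order = [np.argsort(np.linalg.norm(np.delete(X, i, axis=0) - X[i], axis=1))
             for i in range(n)]
    labels = [np.delete(Y, i) for i in range(n)]
    best_k, best_err = 1, np.inf
    for k in range(1, k0 + 1):
        votes = np.array([np.sum(2 * labels[i][order[i][:k]] - 1) > 0
                          for i in range(n)], dtype=int)
        err = np.mean(votes != Y)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```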


Problems and Exercises

Problem 26.1. Let K1,K2, . . . be integer valued random variables, and let gnbe the Kn-nearest neighbor rule. Show that if X has a density, and Kn →∞ andKn/n→ 0 in probability as n→∞, then EL(gn) → L∗.

Problem 26.2. Let C be the class of all 1-nn rules based upon pairs (x1, y1), . . . , (xk, yk), where k is a fixed parameter (possibly varying with n), and the (xi, yi)’s are variable pairs from Rd × {0, 1}. Let gn be the rule that minimizes the empirical error over C, or, equivalently, let gn be the rule that minimizes the resubstitution estimate Ln^(R) over C.

(1) Compute a suitable upper bound for S(C, n).
(2) Compute a good upper bound for VC as a function of k and d.
(3) If k → ∞ with n, show that the sequence of classes C contains a strongly universally consistent subsequence of rules (you may assume for convenience that X has a density to avoid distance ties).
(4) Under what condition on k can you guarantee strong universal consistency of gn?
(5) Give an upper bound for

E{ L(gn) − inf_{φ∈C} L(φ) }.

Problem 26.3. Let Cm contain all k-nn rules based upon data pairs (X1, Y1), . . . , (Xm, Ym). The metric used in computing the neighbors is derived from the norm

‖x‖Σ = ∑_{i=1}^{d} ∑_{j=1}^{d} σij x(i) x(j),  x = (x(1), . . . , x(d)),

where (σij) forms a positive definite matrix Σ. The elements of Σ are the free parameters in Cm. Compute upper and lower bounds for S(Cm, l) as a function of m, l, k, and d.

Problem 26.4. Let gn be the rule obtained by minimizing Ln^(D) over all k-nn rules with 1 ≤ k ≤ n. Prove or disprove: gn is strongly universally consistent. Note that in view of Theorem 26.4, it suffices to consider εn/log n ≤ k ≤ n − 1 for all ε > 0.


27 Hypercubes and Discrete Spaces

In many situations, the pair (X, Y) is purely binary, taking values in {0, 1}^d × {0, 1}. Examples include boolean settings (each component of X represents “on” or “off”), representations of continuous variables through quantization (continuous variables are always represented by bit strings in computers), and ordinal data (a component of X is 1 if and only if a certain item is present). The components of X are denoted by X(1), . . . , X(d). In this chapter, we review pattern recognition briefly in this setup.

Without any particular structure in the distribution of (X, Y) or the function η(x) = P{Y = 1|X = x}, x ∈ {0, 1}^d, the pattern recognition problem might as well be cast in the space of the first 2^d positive integers: (X, Y) ∈ {1, . . . , 2^d} × {0, 1}. This is dealt with in the first section. However, things become more interesting under certain structural assumptions, such as the assumption that the components of X be independent. This is dealt with in the third section. General discrimination rules on hypercubes are treated in the rest of the chapter.

27.1 Multinomial Discrimination

At first sight, discrimination on a finite set {1, . . . , k}—called multinomial discrimination—may seem utterly trivial. Let us call the following rule the fundamental rule, as it captures what most of us would do in the absence of any additional information.

g*n(x) = 1 if ∑_{i=1}^{n} I{Xi=x, Yi=1} > ∑_{i=1}^{n} I{Xi=x, Yi=0}, and g*n(x) = 0 otherwise.
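In code, the fundamental rule is nothing more than a table of majority votes (an added sketch; the data below are hypothetical):

```python
import numpy as np

def fundamental_rule(X, Y, k):
    """Majority vote within each cell x in {1, ..., k}; ties go to class 0."""
    counts = np.zeros((k, 2), dtype=int)
    for xi, yi in zip(X, Y):
        counts[xi - 1, yi] += 1
    decision = (counts[:, 1] > counts[:, 0]).astype(int)   # g*_n(x) for x = 1..k
    return lambda x: int(decision[x - 1])

# Hypothetical usage on k = 8 cells.
rng = np.random.default_rng(4)
k, n = 8, 200
X = rng.integers(1, k + 1, size=n)
Y = rng.integers(0, 2, size=n)
g = fundamental_rule(X, Y, k)
print([g(x) for x in range(1, k + 1)])
```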

The fundamental rule coincides with the standard kernel and histogram rules if the smoothing factor or bin width are taken small enough. If pi = P{X = i} and η is as usual, then it takes just a second to see that

Ln ≥ ∑_{x: ∑_{i=1}^{n} I{Xi=x} = 0} η(x) px

and

E{Ln} ≥ ∑_x η(x) px (1 − px)^n,

where Ln = L(g*n). Assume η(x) = 1 at all x. Then L* = 0 and

E{Ln} ≥ ∑_x px (1 − px)^n.

If px = 1/k for all x, we have E{Ln} ≥ (1 − 1/k)^n ≥ 1/2 if k ≥ 2n. This simple calculation shows that we cannot say anything useful about fundamental rules unless k ≤ 2n at the very least. On the positive side, the following universal bound is useful.

Theorem 27.1. For the fundamental rule, we have Ln → L* with probability one as n → ∞, and, in fact, for all distributions,

E{Ln} ≤ L* + √( k/(2(n + 1)) ) + k/(en)

and

E{Ln} ≤ L* + 1.075 √(k/n).

Proof. The first statement follows trivially from the strong universal consistency of histogram rules (see Theorem 9.4). It is the universal inequality that is of interest here. If B(n, p) denotes a binomial random variable with parameters n and p, then we have

E{Ln} = ∑_{x=1}^{k} px ( η(x) + (1 − 2η(x)) P{B(N(x), η(x)) > N(x)/2} )
  (here N(x) is a binomial (n, px) random variable)
  = ∑_{x=1}^{k} px ( 1 − η(x) + (2η(x) − 1) P{B(N(x), η(x)) ≤ N(x)/2} ).


From this, if ξ(x) = min(η(x), 1 − η(x)),

E{Ln}
  ≤ ∑_{x=1}^{k} px ( ξ(x) + (1 − 2ξ(x)) P{B(N(x), ξ(x)) ≥ N(x)/2} )
  ≤ L* + ∑_{x=1}^{k} px (1 − 2ξ(x)) E{ N(x)ξ(x)(1 − ξ(x)) / ( N(x)ξ(x)(1 − ξ(x)) + (1/2 − ξ(x))² N(x)² ) }
    (by the Chebyshev-Cantelli inequality—Theorem A.17)
  ≤ L* + ∑_{x=1}^{k} px (1 − 2ξ(x)) E{ 1 / (1 + (1 − 2ξ(x))² N(x)) }
    (since ξ(x)(1 − ξ(x)) ≤ 1/4)
  ≤ L* + ∑_{x=1}^{k} px E{ (1/(2√N(x))) I{N(x)>0} + (1 − 2ξ(x)) I{N(x)=0} }
    (since u/(1 + Nu²) ≤ 1/(2√N) for 0 ≤ u ≤ 1)
  ≤ L* + ∑_{x=1}^{k} px (1 − px)^n + (1/2) ∑_{x=1}^{k} px √( E{ (1/N(x)) I{N(x)>0} } )
    (by Jensen’s inequality)
  ≤ L* + (1 − 1/k)^n + (1/2) ∑_{x=1}^{k} px √( 2/((n + 1)px) )
    (by Lemma A.2 and the fact that the worst distribution has px = 1/k for all x)
  ≤ L* + e^{−n/k} + (1/√(2(n + 1))) ∑_{x=1}^{k} √px
  ≤ L* + k/(en) + √( k/(2(n + 1)) )
    (since e^{−u} ≤ 1/(eu) for u ≥ 0, and by the Cauchy-Schwarz inequality).

This concludes the proof, as √(1/2) + 1/e ≤ 1.075. □

For several key properties of the fundamental rule, the reader is referredto Glick (1973) and to Problem 27.1. Other references include Krzanowski(1987), Goldstein (1977), and Goldstein and Dillon (1978). Note also thefollowing extension of Theorem 27.1, which shows that the fundamentalrule can handle all discrete distributions.


Theorem 27.2. If X is purely atomic (with possibly infinitely many atoms),then Ln → L∗ with probability one for the fundamental rule.

Proof. Number the atoms 1, 2, . . .. Define X′ = min(X, k) and replace (Xi, Yi) by (X′i, Yi), where X′i = min(Xi, k). Apply the fundamental rule to the new problem and note that by Theorem 27.1, if k is fixed, Ln → L* with probability one for the new rule (and new distribution). However, the difference in Ln’s and in L*’s between the truncated and nontruncated versions cannot be more than P{X ≥ k}, which may be made as small as desired by choice of k. □

27.2 Quantization

Consider a fixed partition of space Rd into k sets A1, . . . , Ak, and let gnbe the standard partitioning rule based upon majority votes in the Ai’s(ties are broken by favoring the response “0,” as elsewhere in the book).We consider two rules:

(1) The rule gn considered above; the data are (X1, Y1), . . . , (Xn, Yn), with (X, Y) ∈ Rd × {0, 1}. The probability of error is denoted by Ln, and the Bayes probability of error is L* = E{min(η(X), 1 − η(X))} with η(x) = P{Y = 1|X = x}, x ∈ Rd.

(2) The fundamental rule g′n operating on the quantized data (X′1, Y1), . . . , (X′n, Yn), with (X′, Y) ∈ {1, . . . , k} × {0, 1}, X′i = j if Xi ∈ Aj. The Bayes probability of error is L′* = E{min(η′(X′), 1 − η′(X′))} with η′(x′) = P{Y = 1|X′ = x′}, x′ ∈ {1, . . . , k}. The probability of error is denoted by L′n.

Clearly, g′n is nothing but the fundamental rule for the quantized data. As gn(x) = g′n(x′), where x′ = j if x ∈ Aj, we see that

Ln = P{Y ≠ gn(X) | X1, Y1, . . . , Xn, Yn} = P{Y ≠ g′n(X′) | X′1, Y1, . . . , X′n, Yn} = L′n.

However, the Bayes error probabilities are different in the two situations. We claim the following:

Theorem 27.3. For the standard partitioning rule,

E{Ln} ≤ L* + 1.075 √(k/n) + δ,


where δ = E{|η(X) − η′(X′)|}. Furthermore,

L* ≤ L′* ≤ L* + δ.

Proof. Clearly, L* ≤ L′* (see Problem 2.1). Also,

L′* = E{min(η′(X′), 1 − η′(X′))}
  ≤ E{min(η(X) + |η(X) − η′(X′)|, 1 − η(X) + |η(X) − η′(X′)|)}
  ≤ δ + E{min(η(X), 1 − η(X))}
  = δ + L*.

Furthermore,

E{Ln} ≤ L′* + 1.075 √(k/n) ≤ L* + δ + 1.075 √(k/n)

by Theorem 27.1. □

As an immediate corollary, we show how to use the last bound to getuseful distribution-free performance bounds.

Theorem 27.4. Let F be a class of distributions of (X,Y ) for which X ∈[0, 1]d with probability one, and for some constants c > 0, 1 ≥ α > 0,

|η(x) − η(z)| ≤ c‖x − z‖^α,  x, z ∈ Rd.

Then, if we consider all cubic histogram rules gn (see Chapter 6 for a definition), we have

inf_{cubic histogram rules gn} E{L(gn)} − L* ≤ a/√n + b/n^{α/(d+2α)},

where a and b are constants depending upon c, α, and d only (see the prooffor explicit expressions).

Theorem 27.4 establishes the existence of rules that perform uniformly at rate O(n^{−α/(d+2α)}) over F. Results like this have an impact on the number of data points required to guarantee a given performance for any (X, Y) ∈ F.

Proof. Consider a cubic grid with cells of volume h^d. As the number of cells covering [0, 1]^d does not exceed (1/h + 2)^d, we apply Theorem 27.3 to obtain

E{L(gn)} ≤ L* + 1.075 (2 + 1/h)^{d/2} / √n + δ,


where

δ ≤ sup_x |η(x) − η′(x′)| ≤ c sup_{Ai} sup_{x,z∈Ai} ‖z − x‖^α ≤ c(h√d)^α.

The right-hand side is approximately minimal when

h = ( 1.075 d / (2α (c√d)^α √n) )^{2/(d+2α)} def= c′ / n^{1/(d+2α)}.

Resubstitution yields the result. □
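A cubic histogram rule is equally short in code (an added sketch with hypothetical names; cells are indexed by their integer coordinates, and empty cells vote 0 as in the text):

```python
import numpy as np

def cubic_histogram_rule(X, Y, h):
    """Cubic histogram rule on [0,1]^d: partition space into cubes of side h
    and take a majority vote in each cell, ties going to class 0."""
    votes = {}
    for x, yi in zip(X, Y):
        key = tuple(np.floor(x / h).astype(int))
        votes[key] = votes.get(key, 0) + (2 * yi - 1)
    def g(x):
        return int(votes.get(tuple(np.floor(x / h).astype(int)), 0) > 0)
    return g

# The proof suggests h of the order n^{-1/(d + 2*alpha)} (constants omitted);
# for Lipschitz eta (alpha = 1) in d = 2 this is roughly h ~ n^{-1/4}.
```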

27.3 Independent Components

Let X = (X(1), . . . , X(d)) have components that are conditionally independent given Y = 1, and also given Y = 0. Introduce the notation

p(i) = P{X(i) = 1|Y = 1},  q(i) = P{X(i) = 1|Y = 0},  p = P{Y = 1}.

With x = (x(1), . . . , x(d)) ∈ {0, 1}^d, we see that

η(x) = P{Y = 1, X = x} / P{X = x} = p P{X = x|Y = 1} / ( p P{X = x|Y = 1} + (1 − p) P{X = x|Y = 0} ),

and

P{X = x|Y = 1} = ∏_{i=1}^{d} p(i)^{x(i)} (1 − p(i))^{1−x(i)},
P{X = x|Y = 0} = ∏_{i=1}^{d} q(i)^{x(i)} (1 − q(i))^{1−x(i)}.

Simple consideration shows that the Bayes rule is given by

g*(x) = 1 if p ∏_{i=1}^{d} p(i)^{x(i)}(1 − p(i))^{1−x(i)} > (1 − p) ∏_{i=1}^{d} q(i)^{x(i)}(1 − q(i))^{1−x(i)}, and g*(x) = 0 otherwise.

Taking logarithms, it is easy to see that this is equivalent to the following rule:

g*(x) = 1 if α0 + ∑_{i=1}^{d} αi x(i) > 0, and g*(x) = 0 otherwise,


where

α0 = log(

p

1− p

)+

d∑i=1

log(

1− p(i)1− q(i)

),

αi = log(p(i)q(i)

· 1− q(i)1− p(i)

)i = 1, . . . , d.

In other words, the Bayes classifier is linear! This beautiful fact was notedby Minsky (1961), Winder (1963), and Chow (1965).

Having identified the Bayes rule as a linear discrimination rule, we mayapply the full force of the Vapnik-Chervonenkis theory. Let gn be the rulethat minimizes the empirical error over the class of all linear discriminationrules. As the class of linear halfspaces of Rd has vc dimension d + 1 (seeCorollary 13.1), we recall from Theorem 13.11 that for gn,

PL(gn)− L∗ > ε ≤ 8nd+1e−nε2/128,

EL(gn)− L∗ ≤ 16

√(d+ 1) log n+ 4

2n(by Corollary 12.1),

PL(gn)− L∗ > ε ≤ 16(√nε)4096(d+1)

e−nε2/2, nε2 ≥ 64,

andEL(gn)− L∗ ≤ c√

n,

where

c = 16 +√

1013(d+ 1) log(1012(d+ 1)) (Problem 12.10).

These bounds are useful (i.e., they tend to zero with n) if d = o(n/ log n). Incontrast, without the independence assumption, we have pointed out thatno nontrivial guarantee can be given about EL(gn) unless 2d < n. Theindependence assumption has led us out of the high-dimensional quagmire.

One may wish to attempt to estimate the p(i)’s and q(i)’s by p(i) andq(i) and to use these in the plug-in rule gn(x) that decides 1 if

pd∏i=1

p(i)x(i)

(1− p(i))1−x(i)> (1− p)

d∏i=1

q(i)x(i)

(1− q(i))1−x(i),

where p is the standard sample-based estimate of p. The maximum-likelihoodestimate of p(i) is given by

p(i) =

∑nj=1 IX(i)

j=1,Yj=1∑n

j=1 IYj=1,

while

q(i) =

∑nj=1 IX(i)

j=1,Yj=0∑n

j=1 IYj=0,

Page 488: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.3 Independent Components 487

with 0/0 equal to 0 (see Problem 27.6). Note that the plug-in rule too islinear, with

gn(x) =

1 if α0 +∑di=1 αix

(i) > 00 otherwise,

where

α0 = log(

p

1− p

)+

d∑i=1

log(

N01(i)N01(i) +N11(i)

· N00(i) +N10(i)N00(i)

),

αi = log(N11(i)N01(i)

· N00(i)N10(i)

)i = 1, . . . , d,

and

N00(i) =n∑j=1

IX(i)j

=0,Yj=0, N01(i) =n∑j=1

IX(i)j

=0,Yj=1,

N10(i) =n∑j=1

IX(i)j

=1,Yj=0, N11(i) =n∑j=1

IX(i)j

=1,Yj=1,

p =d∑i=1

(N01(i) +N11(i)) .

For all this, we refer to Warner, Toronto, Veasey, and Stephenson (1961)(see also McLachlan (1992)).

Use the inequality

EL(gn)− L∗ ≤ 2E |ηn(X)− η(X)| ,

where η is as given in the text, and ηn is as η but with p, p(i), q(i) replacedby p, p(i), q(i), to establish consistency:

Theorem 27.5. For the plug-in rule, L(gn) → L∗ with probability one andEL(gn) − L∗ → 0, whenever the components are independent.

We refer to the Problem section for an evaluation of an upper bound forEL(gn)− L∗ (see Problem 27.7).

Linear discrimination on the hypercube has of course limited value. Theworld is full of examples that are not linearly separable. For example, on0, 12, if Y = X(1)(1−X(2)) +X(2)(1−X(1)) (so that Y implements theboolean “xor” or “exclusive or” function), the problem is not linearly sep-arable if all four possible values of X have positive probability. However,the exclusive or function may be dealt with very nicely if one considersquadratic discriminants dealt with in a later section (27.5) on series meth-ods.

Page 489: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.4 Boolean Classifiers 488

27.4 Boolean Classifiers

By a boolean classification problem, we mean a pattern recognition problemon the hypercube for which L∗ = 0. This setup relates to the fact that ifwe consider Y = 1 as a circuit failure and Y = 0 as an operable circuit,then Y is a deterministic function of the X(i)’s, which may be considered asgates or switches. In that case, Y may be written as a boolean function ofX(1), . . . , X(d). We may limit boolean classifiers in various ways by partiallyspecifying this function. For example, following Natarajan (1991), we firstconsider all monomials, that is, all functions g : 0, 1d → 0, 1 of theform

x(i1)x(i2) · · ·x(ik)

for some k ≤ d and some indices 1 ≤ i1 < · · · < ik ≤ d (note that algebraicmultiplication corresponds to a boolean “and”). In such situations, onemight attempt to minimize the empirical error. As we know that g∗ is alsoa monomial, it is clear that the minimal empirical error is zero. One suchminimizing monomial is given by

gn(x(1), . . . , x(d)) = x(i1) · · ·x(ik),

where

min1≤i≤n

X(i1)i = · · · = min

1≤i≤nX

(ik)i = 1, and min

1≤i≤nX

(j)i = 0, j /∈ i1, . . . , ik.

Thus, gn picks those components for which every data point has a “1”.Clearly, the empirical error is zero. The number of possible functions is 2d.Therefore, by Theorem 12.1,

PL(gn) > ε ≤ 2de−nε,

andEL(gn) ≤

d+ 1n

.

Here again, we have avoided the curse of dimensionality. For good perfor-mance, it suffices that n be a bit larger than d, regardless of the distributionof the data!

Assume that we limit the complexity of a boolean classifier g by requir-ing that g must be written as an expression having at most k operations“not,” “and,” or “or,” with the x(i)’s as inputs. To avoid problems withprecedence rules, we assume that any number of parentheses is allowed inthe expression. One may visualize each expression as an expression tree,that is, a tree in which internal nodes represent operations and leaves rep-resent operands (inputs). The number of such binary trees with k internalnodes (and thus k + 1 leaves) is given by the Catalan number

1k + 1

(2kk

)

Page 490: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.5 Series Methods for the Hypercube 489

(see, e.g., Kemp (1984)). Furthermore, we may associate any of the x(i)’swith the leaves (possibly preceded by “not”) and “and” or “or” with eachbinary internal node, thus obtaining a total of not more than

2k(2d)k+1 1k + 1

(2kk

)possible boolean functions of this kind. As k →∞, this bound is not morethan

(4d)k+1 4k√πk3

≤ (16d)k

for k large enough. Again, for k large enough,

PL(gn) > ε ≤ (16d)ke−nε,

and

EL(gn) ≤k log(16d) + 1

n.

Note that k is much more important than d in determining the samplesize. For historic reasons, we mention that if g is any boolean expressionconsisting of at most k “not,” “and,” or “or” operations, then the numberof such functions was shown by Pippenger (1977) not to exceed(

16(d+ k)2

k

)k.

Pearl (1979) used Pippenger’s estimate to obtain performance bounds suchas the ones given above.

27.5 Series Methods for the Hypercube

It is interesting to note that we may write any function η on the hypercubeas a linear combination of Rademacher-Walsh polynomials

ψi(x) =

1 i = 02x(1) − 1 i = 1

......

2x(d) − 1 i = d(2x(1) − 1)(2x(2) − 1) i = d+ 1

......

(2x(d−1) − 1)(2x(d) − 1) i = d+ 1 +(d2

)(2x(1) − 1)(2x(2) − 1)(2x(3) − 1) i = d+ 2 +

(d2

)...

...(2x(1) − 1) · · · (2x(d) − 1) i = 2d − 1.

Page 491: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.5 Series Methods for the Hypercube 490

We verify easily that∑x

ψi(x)ψj(x) =

2d if i = j0 if i 6= j,

so that the ψi’s form an orthogonal system. Therefore, we may write

µ(x)η(x) =2d−1∑i=0

aiψi(x),

where

ai =12d∑x

ψi(x)η(x)µ(x) = Eψi(X)

2dIY=1

,

and µ(x) = PX = x. Also,

µ(x)(1− η(x)) =2d−1∑i=0

biψi(x),

with

bi =12d∑x

ψi(x)(1− η(x))µ(x) = Eψi(X)

2dIY=0

.

Sample-based estimates of ai and bi are

ai =1n

n∑j=1

ψi(Xj)2d

IYj=1,

bi =1n

n∑j=1

ψi(Xj)2d

IYj=0.

The Bayes rule is given by

g∗(x) =

1 if

∑2d−1i=0 (ai − bi)ψi(x) > 0

0 otherwise.

Replacing ai − bi formally by ai − bi yields the plug-in rule. Observe thatthis is just a discrete version of the Fourier series rules discussed in Chapter17. This rule requires the estimation of 2d differences ai− bi. Therefore, wemight as well have used the fundamental rule.

When our hand is forced by the dimension, we may wish to consider onlyrules in the class C given by

g(x) =

1 if

∑(d0)+(d

1)+···+(dk)

i=0 (a′i − b′i)ψi(x) > 00 otherwise,

Page 492: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.5 Series Methods for the Hypercube 491

where k ≤ d is a positive integer, and a′0, b′0, a

′1, b

′1, . . . are arbitrary con-

stants. We have seen that the vc dimension of this class is not more than(d0

)+(d1

)+ · · · +

(dk

)≤ dk + 1 (Theorem 13.9). Within this class, estima-

tion errors of the order of O(√

dk log n/n)

are thus possible if we mini-mize the empirical error (Theorem 13.12). This, in effect, forces us to takek logd n. For larger k, pattern recognition is all but impossible.

As an interesting side note, observe that for a given parameter k, eachmember of C is a k-th degree polynomial in x.

Remark. Performance of the plug-in rule. Define the plug-in ruleby

gn(x) =

1 if

∑(d0)+(d

1)+···+(dk)

i=0 (ai − bi)ψi(x) > 00 otherwise.

As n→∞, by the law of large numbers, gn approaches g∞:

g∞(x) =

1 if

∑(d0)+(d

1)+···+(dk)

i=0 (ai − bi)ψi(x) > 00 otherwise,

where ai, bi are as defined above. Interestingly, g∞ may be much worse thanthe best rule in C! Consider the simple example for d = 2, k = 1:

Table of η(x):x(1) = 0 x(1) = 1

x(2) = 1 1/8 6/8x(2) = 0 5/8 1/8

Table of µ(x) = PX = x:x(1) = 0 x(1) = 1

x(2) = 1 p 5−10p9

x(2) = 0 4−8p9 p

A simple calculation shows that for any choice p ∈ [0, 1/2], either g∞ ≡ 1or g∞ ≡ 0. We have g∞ ≡ 1 if p < 10/47, and g∞ ≡ 0 otherwise. However,in obvious notation, L(g ≡ 1) = (41p+ 11)/36, L(g ≡ 0) = (50− 82p)/72.The best constant rule is g ≡ 1 when p < 7/41 and g ≡ 0 otherwise. For7/41 < p < 10/47, the plug-in rule does not even pick the best constantrule, let alone the best rule in C with k = 1, which it was intended to pick.

This example highlights the danger of parametric rules or plug-in ruleswhen applied to incorrect or incomplete models. 2

Remark. Historical notes. The Rademacher-Walsh expansion occursfrequently in switching theory, and was given in Duda and Hart (1973).The Bahadur-Lazarsfeld expansion (Bahadur (1961)) is similar in nature.Ito (1969) presents error bounds for discrimination based upon a k-termtruncation of the series. Ott and Kronmal (1976) provide further statistical

Page 493: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.6 Maximum Likelihood 492

properties. The rules described here with k defining the number of interac-tions are also obtained if we model PX = x|Y = 1 and PX = x|Y = 1by functions of the form

exp

(d0)+(d

1)+···+(dk)∑

i=0

aiψi(x)

and exp

(d0)+(d

1)+···+(dk)∑

i=0

biψi(x)

.

The latter model is called the log-linear model (see McLachlan (1992, sec-tion 7.3)). 2

27.6 Maximum Likelihood

The maximum likelihood method (see Chapter 15) should not be usedfor picking the best rule from a class C that is not guaranteed to includethe Bayes rule. Perhaps a simple example will suffice to make the point.Consider the following hypercube setting with d = 2:

η(x) :x(1) = 0 x(1) = 1

x(2) = 1 0 1x(2) = 0 1 0

µ(x) :x(1) = 0 x(1) = 1

x(2) = 1 p q

x(2) = 0 r s

In this example, L∗ = 0. Apply maximum likelihood to a class F with twomembers ηA, ηB, where

ηA(x) =

4/5 if x(1) = 01/5 if x(1) = 1,

ηB(x) =

0 if x(1) = 01 if x(1) = 1.

Then maximum likelihood won’t even pick the best member from F ! Toverify this, with gA(x) = IηA(x)>1/2, gB(x) = IηB(x)>1/2, we see thatL(gA) = p+ q, L(gB) = r + s. However, if ηML is given by

ηML = arg maxη′∈ηA,ηB

( ∏i:Yi=1

η′(Xi) ·∏i:Yi=0

(1− η′(Xi))

),

and if we write Nij for the number of data pairs (Xk, Yk) having Xk = i,Yk = j, then ηML = ηA if(

45

)N01+N10(

15

)N11+N00

>

0 if N01 +N10 > 01 if N01 +N10 = 0

Page 494: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

27.7 Kernel Methods 493

and ηML = ηB otherwise. Equivalently, ηML = ηA if and only if N01 +N10 > 0. Apply the strong law of large numbers to note that N01/n → r,N10/n → s, N00/n → p, and N11/n → q with probability one, as n → ∞.Thus,

limn→∞

PηML = ηA =

1 if r + s > 00 otherwise.

Take r+ s = ε very small, p+ q = 1− ε. Then, for the maximum likelihoodrule,

ELn → L(gA) = 1− ε > ε = min(L(gA), L(gB)).

However, when F contains the Bayes rule, maximum likelihood is consistent(see Theorem 15.1).

27.7 Kernel Methods

Sometimes, d is so large with respect to n that the atoms in the hypercubeare sparsely populated. Some amount of smoothing may help under somecircumstances. Consider a kernel K, and define the kernel rule

gn(x) =

1 if 1

n

∑ni=1(2Yi − 1)K

(‖x−Xi‖

h

)> 0

0 otherwise,

where ‖x− z‖ is just the Hamming distance (i.e., the number of disagree-ments between components of x and z). With K(u) = e−u

2, the rule above

reduces to a rule given in Aitchison and Aitken (1976). In their paper,different h’s are considered for the two classes, but we won’t consider thatdistinction here. Observe that at h = 0, we obtain the fundamental rule. Ash→∞, we obtain a (degenerate) majority rule over the entire sample. Theweight given to an observation Xi decreases exponentially in ‖x−Xi‖2. Forconsistency, we merely need h → 0 (the condition nhd → ∞ of Theorem10.1 is no longer needed). And in fact, we even have consistency with h ≡ 0as this yields the fundamental rule.

The data-based choice of h has been the object of several papers, includ-ing Hall (1981) and Hall and Wand (1988). In the latter paper, a meansquared error criterion is minimized. We only mention the work of Tutz(1986; 1988; 1989), who picks h so as to minimize the deleted estimateL

(D)n .

Theorem 27.6. (Tutz (1986)). Let Hn be the smoothing factor in theAitchison-Aitken rule that minimizes L(D)

n . Then the rule is weakly consis-tent for all distributions of (X,Y ) on the hypercube.

Proof. See Theorem 25.8. 2

Page 495: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 494

Problems and Exercises

Problem 27.1. The fundamental rule. Let g∗n be the fundamental rule on afinite set 1, . . . , k, and define Ln = L(g∗n). Let g∗ be the Bayes rule (with errorprobability L∗), and let

∆ = infx:η(x) 6=1/2

(1

2−min(η(x), 1− η(x))

).

Let L(R)n be the resubstitution error estimate (or apparent error rate). Show the

following:

(1) EL

(R)n

≤ ELn (the apparent error rate is always optimistic; Hills

(1966)).

(2) EL

(R)n

≤ L∗ (the apparent error rate is in fact very optimistic; Glick

(1973)).

(3) ELn ≤ L∗ + e−2n∆2(for a similar inequality related to Hellinger dis-

tances, see Glick (1973)). This is an exponential but distribution-dependenterror rate.

Problem 27.2. Discrete Lipschitz classes. Consider the class of regressionfunctions η ∈ [0, 1] with |η(x) − η(z)| ≤ cρ(x, z)α, where x, z ∈ 0, 1d, ρ(·, ·)denotes the Hamming distance, and c > 0 and α > 0 are constants (note thatα is not bounded from above). The purpose is to design a discrimination rulefor which uniformly over all distributions of (X,Y ) on 0, 1d × 0, 1 with suchη(x) = PY = 1|X = x, we have

ELn − L∗ ≤ Ψ(c, α, d)√n

,

where the function Ψ(c, α, d) is as small as possible. Note: for c = 1, the classcontains all regression functions on the hypercube, and thus Ψ(1, α, d) = 1.075 ·2d/2 (Theorem 27.1). How small should c be to make Ψ polynomial in d?

Problem 27.3. With a cubic histogram partition of [0, 1]d into kd cells (of vol-ume 1/kd each), we have, for the Lipschitz (c, α) class F of Theorem 27.4,

sup(X,Y )∈F

δ = c

(√d

k

)α.

This grows as dα/2. Can you define a partition into kd cells for which sup(X,Y )∈F δis smaller? Hint: Consult Conway and Sloane (1993).

Problem 27.4. Consider the following randomized histogram rule: X1, . . . , Xkpartition [0, 1]d into polyhedra based on the nearest neighbor rule. Within eachcell, we employ a majority rule based upon Xk+1, . . . , Xn. If X is uniform on[0, 1]d and η is Lipschitz (c, α) (as in Theorem 27.4), then can you derive anupper bound for ELn − L∗ as a function of k, n, c, α, and d? How does yourbound compare with the cubic histogram rule that uses the same number (k) ofcells?

Page 496: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 495

Problem 27.5. Let F be the class of all Lipschitz (c, α) functions η′ ∈ Rd →[0, 1]. Let (X,Y ) ∈ F denote the fact that (X,Y ) has regression function η(x) =PY = 1|X = x in F . Then, for any cubic histogram rule, show that

sup(X,Y )∈F

ELn − L∗ ≥ 1

2− L∗.

Thus, the compactness condition on the space is essential for the distribution-freeerror bound given in Theorem 27.4.

Problem 27.6. Independent model. Show that in the independent model, themaximum likelihood estimate p(i) of p(i) is given by∑n

j=1IX(i)

j=1,Yj=1∑n

j=1IYj=1

.

Problem 27.7. Independent model. For the plug-in rule in the independentmodel, is it true that ELn − L∗ = O(1/

√n) uniformly over all pairs (X,Y )

on 0, 1d × 0, 1? If so, find a constant c depending upon d only, such thatELn − L∗ ≤ c/

√n. If not, provide a counterexample.

Problem 27.8. Consider a hypercube problem in which X = (X(1), . . . , X(d))and each X(i) ∈ −1, 0, 1 (a ternary generalization). Assume that the X(i)’s areindependent but not necessarily identically distributed. Show that there exists aquadratic Bayes rule, i.e., g∗(x) is 1 on the set

α0 +

d∑i=1

αix(i) +

d∑i,j=1

αijx(i)x(j) > 0,

where α0, αi, and αij are some weights (Kazmierczak and Steinbuch (1963)).

Problem 27.9. Let A be the class of all sets on the hypercube 0, 1d of the formx(i1) · · ·x(ik) = 1, where (x(1), . . . , x(d)) ∈ 0, 1d, 1 ≤ i1 < · · · < ik ≤ d. (Thus,A is the class of all sets carved out by the monomials.) Show that the vc dimen-sion of C is d. Hint: The set (0, 1, 1, . . . , 1), (1, 0, 1, . . . , 1), . . . , (1, 1, 1, . . . , 0) isshattered by A. No set of size d + 1 can be shattered by A by the pigeonholeprinciple.

Problem 27.10. Show that the Catalan number

1

n+ 1

(2n

n

)∼ 4n√

πn3.

(See, e.g., Kemp (1984).)

Problem 27.11. Provide an argument to show that the number of boolean func-tions with at most k operations “and” or “or” and d operands of the form x(i)

or 1− x(i), x(i) ∈ 0, 1, is not more than

2(2d)k+1 1

k + 1

(2k

k

)(this is 2k times less than the bound given in the text).

Page 497: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 496

Problem 27.12. Provide upper and lower bounds on the vc dimension of theclass of sets A on the hypercube 0, 1d that can be described by a booleanexpression with the x(i)’s or 1−x(i)’s as operands and with at most k operations“and” or “or.”

Problem 27.13. Linear discrimination on the hypercube. Let gn be therule that minimizes the empirical error Ln(φ) over all linear rules φ, when thedata are drawn from any distribution on 0, 1d×0, 1. Let Ln be its probabilityof error, and let L be the minimum error probability over all linear rules. Showthat for ε > 0,

PLn − L > ε ≤ 4

(2d

d

)e−nε

2/2.

Deduce that

ELn − L ≤ 2

√1 + log 4

(2d

d

)√

2n≤ 2

d+ 1√2n

.

Compare this result with the general Vapnik-Chervonenkis bound for linear rules(Theorem 13.11) and deduce when the bound given above is better. Hint: Countthe number of possible linear rules.

Problem 27.14. On the hypercube 0, 1d, show that the kernel rule of Aitchi-son and Aitken (1976) is strongly consistent when limn→∞ h = 0.

Problem 27.15. Pick h in the kernel estimate by minimizing the resubstitutionestimate L

(R)n , and call it H

(R)n . For L

(D)n , we call it H

(D)n . Assume that the kernel

function is of the form K(‖ · ‖/h) with K ≥ 0, K(u) ↓ 0 as u ↑ ∞. Let Ln be theerror estimate for the kernel rule with one of these two choices. Is it possible tofind a constant c, depending upon d and K only, such that

E

Ln − inf

h≥0Ln(h)

≤ c√

n?

If so, give a proof. If not, provide a counterexample. Note: if the answer is positive,a minor corollary of this result is Tutz’s theorem. However, an explicit constant cmay aid in determining appropriate sample sizes. It may also be minimized withrespect to K.

Problem 27.16. (Simon (1991).) Construct a partition of the hypercube 0, 1din the following manner, based upon a binary classification tree with perpendicu-lar splits: Every node at level i splits the subset according to x(i) = 0 or x(i) = 1,so that there are at most d levels of nodes. (In practice, the most important com-ponent should be x(1).) For example, all possible partitions of 0, 12 obtainablewith 2 cuts are

Page 498: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 497

Assign to each internal node (each region) a class. Define the Horton-Strahlernumber ξ of a tree as follows: if a tree has one node, then ξ = 0. If the root of thetree has left and right subtrees with Horton-Strahler numbers ξ1 and ξ2, then set

ξ = max(ξ1, ξ2) + Iξ1=ξ2.

Let C be the class of classifiers g described above with Horton-Strahler number≤ ξ.

(1) Let S = x ∈ 0, 1d : ‖x‖ ≤ ξ, where ‖ · ‖ denotes Hamming distancefrom the all-zero vector. Show that S is shattered by the class of setsg = 1 : g ∈ C.

(2) Show that |S| =∑ξ

i=0

(di

).

(3) Conclude that the vc dimension of C is at least∑ξ

i=0

(di

). (Simon has

shown that the vc dimension of C is exactly this, but that proof is moreinvolved.)

(4) Assuming L∗ = 0, obtain an upper bound for ELn as a function ofξ and d, where Ln is the probability of error for the rule picked byminimizing the empirical error over C.

(5) Interpret ξ as the height of the largest complete binary tree that can beembedded in the classification tree.

Page 499: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 498Printer: Opaque this

28Epsilon Entropy and Totally BoundedSets

28.1 Definitions

This chapter deals with discrimination rules that are picked from a certainclass of classifiers by minimizing the empirical probability of error over afinite set of carefully selected rules. We begin with a class F of regressionfunctions (i.e., a posteriori probability functions) η : Rd → [0, 1] from whichηn will be picked by the data. The massiveness of F can be measured inmany ways—the route followed here is suggested in the work of Kolmogorovand Tikhomirov (1961). We will depart from their work only in details. Wesuggest comparing the results here with those from Chapters 12 and 15.

Let Fε = η(1), . . . , η(N) be a finite collection of functions Rd → [0, 1]such that

F ⊂⋃

η′∈Fε

Sη′,ε,

where Sη′,ε is the ball of all functions ξ : Rd → [0, 1] with

‖ξ − η′‖∞ = supx|ξ(x)− η′(x)| < ε.

In other words, for each η′ ∈ F , there exists an η(i) ∈ Fε with supx |η′(x)−η(i)(x)| < ε. The fewer η(i)’s needed to cover F , the smaller F is, in acertain sense. Fε is called an ε-cover of F . The minimal value of |Fε| overall ε-covers is called the ε-covering number (Nε). Following Kolmogorov andTikhomirov (1961),

log2Nε

Page 500: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.2 Examples: Totally Bounded Classes 499

is called the ε-entropy of F . We will also call it the metric entropy. Acollection F is totally bounded if Nε < ∞ for all ε > 0. It is with suchclasses that we are concerned in this chapter. The next section gives a fewexamples. In the following section, we define the skeleton estimate basedupon picking the empirically best member from Fε.

Figure 28.1. An ε-cover ofthe unit square.

28.2 Examples: Totally Bounded Classes

The simple scalar parametric class F =e−θ|x|, x ∈ R; θ > 0

is not totally

bounded (Problem 28.1). This is due simply to the presence of an unre-stricted scale factor. It would still fail to be totally bounded if we restrictedθ to [1,∞) or [0, 1]. However, if we force θ ∈ [0, 1] and change the class Fto have functions

η′(x) =e−θ|x| if |x| ≤ 10 otherwise,

then the class is totally bounded. While it is usually difficult to computeNε exactly, it is often simple to obtain matching upper and lower bounds.Here is a simple argument. Take θi = 2εi, 0 ≤ i ≤ b1/(2ε)c and defineθ∗ = 1. Each of these θ-values defines a function η′. Collect these in Fε.Note that with η′ ∈ F , with parameter θ, if θ is the nearest value amongθi, 0 ≤ i ≤ b1/(2ε)c ∪ θ∗, then |θ − θ| ≤ ε. But then

sup|x|≤1

∣∣∣e−θ|x| − e−θ|x|∣∣∣ ≤ |θ − θ| ≤ ε.

Hence Fε is an ε-cover of F of cardinality b1/(2ε)c + 2. We conclude thatF is totally bounded, and that

Nε ≤ 2 + b1/(2ε)c.

(See Problem 28.2 for a d-dimensional generalization.)

Page 501: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.2 Examples: Totally Bounded Classes 500

For a lower bound, we use once again an idea from Kolmogorov andTikhomirov (1961). Let Gε =

η(1), . . . , η(M)

⊂ F be a subset with the

property that for every i 6= j, supx∣∣η(i)(x)− η(j)(x)

∣∣ ≥ ε. The set Gε is thusε-separated. The maximal cardinality ε-separated set is called the ε-packingnumber (or ε-separation number) Mε. It is easy to see that

M2ε ≤ Nε ≤Mε

(Problem 28.3). With this in hand, we see that Gε may be constructed asfollows for our example: Begin with θ0 = 0. Then define θ1 by e−θ1 = 1− ε,e−θ2 = 1−2ε, etcetera, until θi > 1. It is clear that this way an ε-separatedset Gε may be constructed with |Gε| = b(1− 1/e)/εc+ 1. Thus,

Nε ≥M2ε ≥⌊

1− 1/e2ε

⌋+ 1.

The ε-entropy of F grows as log2(1/ε) as ε ↓ 0.Consider next a larger class, not of a parametric nature: let F be a class

of functions η on [0,∆] satisfying the Lipschitz condition

|η(x)− η(x′)| ≤ c|x− x′|

and taking values on [0, 1]. Kolmogorov and Tikhomirov (1961) showedthat if

ε ≤ min(

14,

116∆c

),

then

∆cε

+ log2

1ε− 3 ≤ log2M2ε

≤ log2Nε

≤ ∆cε

+ log2

+ 3

(see Problem 28.5). Observe that the metric entropy is exponentially largerthan for the parametric class considered above. This has a major impacton sample sizes needed for similar performances (see the next sections).

Another class of functions of interest is that containing functions η :[0, 1] → [0, 1] that are s-times differentiable and for which the s-th deriva-tive η(s) satisfies a Holder condition of order α,∣∣∣η(s)(x)− η(s)(x′)

∣∣∣ ≤ c|x− x′|α, x, x′ ∈ [0, 1],

where c is a constant. In that case, log2Mε and log2Nε are both Θ(ε−1/(s+α)

)as ε ↓ 0, where an = Θ(bn) means that an = O(bn) and bn = O(an).This result, also due to Kolmogorov and Tikhomirov (1961), establishes

Page 502: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.3 Skeleton Estimates 501

a continuum of rates of increase of the ε-entropy. In Rd, with functionsη : [0, 1]d → [0, 1], if the Holder condition holds for all derivatives of orders, then log2Nε = Θ

(ε−d/(s+α)

). Here we have an interesting interpretation

of dimension: doubling the dimension roughly offsets the effect of doublingthe number of derivatives (or the degree of smoothness) of the η’s. Workingwith Lipschitz functions on R1 is roughly equivalent to working with func-tions on R25 for which all 24-th order derivatives are Lipschitz! As thereare 2524 such derivatives, we note immediately how much we must pay forcertain performances in high dimensions.

Let F be the class of all entire analytic functions η : [0, 2π] → [0, 1] whoseperiodic continuation satisfies

|η(z)| ≤ ceσ|=(z)|

for some constants c, σ (z is a complex variable, =(z) is its imaginary part).For this class, we know that

log2Nε ∼ (4bσc+ 2) log1ε

as ε ↓ 0

(Kolmogorov and Tikhomirov (1961)). The class appears to be as small asour parametric class. See also Vitushkin (1961).

28.3 Skeleton Estimates

The members of Fε form a representative skeleton of F . We assume thatFε ⊂ F (this condition was not imposed in the definition of an ε-cover).For each η′ ∈ F , we define its discrimination rule as

g(x) =

1 if η′(x) > 1/20 otherwise.

Thus, we will take the liberty of referring to η′ as a rule. For each such η′,we define the probability of error as usual:

L(η′) = Pg(X) 6= Y .

The empirical probability of error of η′ is denoted by

Ln(η′) =1n

n∑i=1

Ig(Xi) 6=Yi.

We define the skeleton estimate ηn by

ηn = arg minη′∈Fε

Ln(η′).

Page 503: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.3 Skeleton Estimates 502

One of the best rules in F is denoted by η∗:

L(η∗) ≤ L(η′), η′ ∈ F .

The first objective, as in standard empirical risk minimization, is to ensurethat L(ηn) is close to L(η∗). If the true a posteriori probability functionη is in F (recall that η(x) = PY = 1|X = x), then it is clear thatL∗ = L(η∗). It will be seen from the theorem below that under this con-dition, the skeleton estimate has nice consistency and rate-of-convergenceproperties. The result is distribution-free in the sense that no structure onthe distribution of X is assumed. Problems 6.9 and 28.9 show that conver-gence of L(ηn) to L(η∗) for all η’s—that is, not only for those in F—holdsif X has a density. In any case, because the skeleton estimate is selectedfrom a finite deterministic set (that may be constructed before data arecollected!), probability bounding is trivial: for all ε > 0, δ > 0, we have

PL(ηn)− inf

η′∈Fε

L(η′) > δ

≤ |Fε| sup

η′∈Fε

P|Ln(η′)− L(η′)| > δ/2

(by Lemma 8.2)

≤ 2|Fε|e−nδ2/2 (Hoeffding’s inequality).

Theorem 28.1. Let F be a totally bounded class of functions η′ : Rd →[0, 1]. There is a sequence εn > 0 and a sequence of skeletons Fεn ⊂ Fsuch that if ηn is the skeleton estimate drawn from Fεn , then

L(ηn) → L∗ with probability one,

whenever the true regression function η(x) = PY = 1|X = x is in F .It suffices to take Fε as an ε-cover of F (note that |Fε| need not equal

the ε-covering number Nε), and to define εn as the smallest positive numberfor which

2n ≥ log |Fεn |ε2n

.

Finally, with εn picked in this manner,

E L(ηn)− L∗ ≤ (2 +√

8)εn +√π

n.

Proof. We note that infη′∈Fε L(η′) ≤ L∗ + 2ε, because if η′ ∈ Sη,ε, then

E|η′(X)− η(X)| ≤ supx|η′(x)− η(x)| ≤ ε

Page 504: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.3 Skeleton Estimates 503

and thus, by Theorem 2.2, L(η′)− L∗ ≤ 2ε. Then for any δ ≥ 2εn,

PL(ηn)− L∗ > δ ≤ PL(ηn)− inf

η′∈Fεn

L(η′) > δ − 2εn

≤ 2|Fεn |e−n(δ−2εn)2/2 (see above)

≤ 2e2nε2n−n(δ−2εn)2/2,

which is summable in n, as εn = o(1). This shows the first part of thetheorem. For the second part, we have

EL(ηn) − L∗

≤ 2εn + EL(ηn) − infη′∈Fεn

L(η′)

≤ 2εn +∫ ∞

0

min(2e2nε

2n−nt

2/2, 1)dt

≤ (2 +√

8)εn + 2∫ ∞

√8εn

e2nε2n−nt

2/2dt

≤ (2 +√

8)εn +∫ ∞

0

e−nt2/4dt

(since ε2n − t2/4 ≤ −t2/8 for t ≥√

8εn)

= (2 +√

8)εn +√π

n.

The proof is completed. 2

Observe that the estimate ηn is picked from a deterministic class. This,of course, requires quite a bit of preparation and knowledge on behalf of theuser. Knowledge of the ε-entropy (or at least an upper bound) is absolutelyessential. Furthermore, one must be able to construct Fε. This is certainlynot computationally simple. Skeleton estimates should therefore be mainlyof theoretical importance. They may be used, for example, to establish theexistence of estimates with a guaranteed error bound as given in Theorem28.1. A similar idea in nonparametric density estimation was proposed andworked out in Devroye (1987).

Remark. In the first step of the proof, we used the inequality

E|η′(X)− η(X)| ≤ supx|η′(x)− η(x)| < ε.

It is clear from this that what we really need is not an ε-cover of F withrespect to the supremum norm, but rather, with respect to the L1(µ) norm.In other words, the skeleton estimate works equally well if the skeleton isan ε-cover of F with respect to the L1(µ) norm, that is, it is a list of finitely

Page 505: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

28.4 Rate of Convergence 504

many candidates with the property that for each η′ ∈ F there exists an η(i)

in the list such that E|η′(X)− η(i)(X)| < ε. It follows from the inequal-ity above that the smallest such covering has fewer elements than that ofany ε-cover of F with respect to the supremum norm. Therefore, estimatesbased on such skeletons perform better. In fact, the difference may be es-sential. As an example, consider the class F of all regression functions on[0, 1] that are monotone increasing. For ε < 1/2, this class does not have afinite ε-cover with respect to the supremum norm. However, for any µ it ispossible to find an ε-cover of F , with respect to L1(µ), with not more than4d1/εe elements (Problem 28.6). Unfortunately, an ε-cover with respect toL1(µ) depends on µ, the distribution of X. Since µ is not known in advance,we cannot construct this better skeleton. However, in some cases, we mayhave some a priori information about µ. For example, if we know that µis a member of a known class of distributions, then we may be able toconstruct a skeleton that is an ε-cover for all measures in the class. In theabove example, if we know that µ has a density bounded by a known num-ber, then there is a finite skeleton with this property (see Problem 28.7).We note here that the basic idea behind the empirical covering method ofBuescher and Kumar (1994a), described in Problem 12.14, is to find a goodskeleton based on a fraction of the data. Investigating this question further,we notice that even covering in L1(µ) is more than what we really need.From the proof of Theorem 28.1, we see that all we need is a skeleton Fεsuch that infη′∈Fε

L(η′) ≤ L∗ + ε for all η ∈ F . Staying with the exampleof the class of monotonically increasing η’s, we see that we may take in Fεthe functions η(i)(x) = Ix≥qi, where qi is the i-th ε-quantile of µ, that is,qi is the smallest number z such that PX ≤ z ≥ i/ε. This collection offunctions forms a skeleton in the required sense with about (1/ε) elements,instead of the 4d1/εe obtained by covering in L1(µ), a significant improve-ment. Problem 28.8 illustrates another application of this idea. For morework on this we refer to Vapnik (1982), Benedek and Itai (1988), Kulkarni(1991), Dudley, Kulkarni, Richardson, and Zeitouni (1994), and Buescherand Kumar (1994a). 2

28.4 Rate of Convergence

In this section, we take a closer look at the distribution-free upper bound

E L(ηn)− L∗ ≤ (2 +√

8)εn +√π

n.

For typical parametric classes (such as the one discussed in a Section 28.2),we have

Nε = Θ(

).

Page 506: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 505

If we take |Fε| close enough to Nε, then εn is the solution of

logNεε2

∼= n,

or εn = Θ(log n/√n), and we achieve a guaranteed rate of O(log n/

√n).

The same is true for the example of the class of analytic functions discussedearlier.

The situation is different for massive classes such as the Lipschitz func-tions on [0, 1]d. Recalling that logNε = Θ(1/εd) as ε ↓ 0, we note thatεn = Θ(n−1/(2+d)). For this class, we have

E L(ηn)− L∗ = O(n−

12+d

).

Here, once again, we encounter the phenomenon called the “curse of di-mensionality.” In order to achieve the performance E L(ηn)− L∗ ≤ ε, weneed a sample of size n ≥ (1/ε)2+d, exponentially large in d. Note that theclass of classifiers defined by this class of functions has infinite vc dimen-sion. The skeleton estimates thus provide a vehicle for dealing with verylarge classes.

Finally, if we take all η on [0, 1] for which s derivatives exist, and η(s)

is Lipschitz with a given constant c, similar considerations show that therate of convergence is O

(n−(s+1)/(2s+3)

), which ranges from O

(n−1/3

)(at

s = 0) to O(n−1/2

)(as s → ∞). As the class becomes smaller, we can

guarantee better rates of convergence. Of course, this requires more a prioriknowledge about the true η.

We also note that if logNε/ε2 = 2n in Theorem 28.1, then the bound is√π

n+ (2 +

√8)εn =

√π

n+ (2 +

√8)

√logNεn

2n.

The error grows only sub-logarithmically in the size of the skeleton set. Itgrows as the square root of the ε-entropy. Roughly speaking (and ignoringthe dependence of εn upon n), we may say that for the same performanceguarantees, doubling the ε-entropy implies that we should double the sam-ple size (to keep logNε/n constant). When referring to ε-entropy, it isimportant to keep this sample size interpretation in mind.

Problems and Exercises

Problem 28.1. Show that the class of functions e−θ|x| on the real line, withparameter θ > 0, is not totally bounded.

Problem 28.2. Compute a good upper bound for Nε as a function of d and εfor the class F of all functions on Rd given by

η′(x) =

e−θ‖x‖ if ‖x‖ ≤ 10 otherwise,

Page 507: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 506

where θ ∈ [0, 1] is a parameter. Repeat this question if θ1, . . . , θd are in [0, 1] and

η′(x) =

e−

√∑d

i=1θix

2i

if max1≤i≤d |xi| ≤ 10 otherwise.

Hint: Both answers are polynomial in 1/ε as ε ↓ 0.

Problem 28.3. Show that M2ε ≤ Nε ≤ Mε for any totally bounded set F(Kolmogorov and Tikhomirov (1961)).

Problem 28.4. Find a class F of functions η′ : Rd → [0, 1] such that(a) for every ε > 0, it has a finite ε-cover;(b) the vc dimension of A = x : η′(x) > 1/2; η′ ∈ F is infinite.

Problem 28.5. Show that if F = η′ : [0,∆] → [0, 1], η′ is Lipschitz with con-stant c, then for ε small enough, log2Nε ≤ A/ε, where A is a constant dependingupon ∆ and c. (This is not as precise as the statement in the text obtained byKolmogorov and Tikhomirov, but it will give you excellent practice.)

Problem 28.6. Obtain an estimate for the cardinality of the smallest ε-coverwith respect to the L1(µ) norm for the class of η’s on [0, 1] that are increasing.In particular, show that for any µ it is possible to find an ε-cover with 4d1/εe

elements. Can you do something similar for η’s on [0, 1]d that are increasing ineach coordinate?

Problem 28.7. Consider the class of η’s on [0, 1] that are increasing. Show thatfor every ε > 0, there is a finite list η(1), . . . η(N) such that for all η in the class,

infi≤N

E|η(X)− η(i)(X)| < ε,

whenever X has a density bounded by B <∞. Estimate the smallest such N .

Problem 28.8. Assume that X has a bounded density on [0, 1]2, and that η ismonotonically increasing in both coordinates. (This is a reasonable assumption inmany applications.) Then the set x : g∗(x) = 0 is a monotone layer. Considerthe following classification rule: take a k × k grid in [0, 1]2, and minimize theempirical error over all classifiers φ such that x : φ(x) = 0 is a monotone layer,and it is a union of cells in the k × k grid. What is the optimal choice of k?Obtain an upper bound for L(gn)−L∗. Compare your result with that obtainedfor empirical error minimization over the class of all monotone layers in Section13.4. Hint: Count the number of different classifiers in the class. Use Hoeffding’sinequality and the union-of-events bound for the estimation error. Bound theapproximation error using the bounded density assumption.

Problem 28.9. Apply Problem 6.9 to extend the consistency result in Theorem28.1 as follows. Let F be a totally bounded class of functions η′ : Rd → [0, 1]such that λ(x : η′(x) = 1/2) = 0 for each η′ ∈ F (λ is the Lebesgue measure onRd). Show that there is a sequence εn > 0 and a sequence of skeletons Fεn suchthat if ηn is the skeleton estimate drawn from Fεn , then L(ηn) → infη′∈F L(η′)with probability one, whenever X has a density. In particular, L(ηn) → L∗ withprobability one if the Bayes rule takes the form g∗(x) = Iη′(x)>1/2 for someη′ ∈ F . Note: the true regression function η is not required to be in F .

Page 508: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 507Printer: Opaque this

29Uniform Laws of Large Numbers

29.1 Minimizing the Empirical Squared Error

In Chapter 28 the data Dn were used to select a function ηn from a classF of candidate regression functions η′ : Rd → [0, 1]. The correspondingclassification rule gn is Iηn>1/2. Selecting ηn was done in two steps: askeleton—an ε-covering—of F was formed, and the empirical error countwas minimized over the skeleton. This method is computationally cum-bersome. It is tempting to use some other empirical quantity to select aclassifier. Perhaps the most popular among these measures is the empiricalsquared error:

en(η′) =1n

n∑i=1

(η′(Xi)− Yi)2.

Assume now that the function ηn is selected by minimizing the empiricalsquared error over F , that is,

en(ηn) ≤ en(η′), η′ ∈ F .

As always, we are interested in the error probability,

L(ηn) = PIηn(x)>1/2 6= Y |Dn

,

of the resulting classifier. If the true regression function η(x) = PY =1|X = x is not in the class F , then it is easy to see that empirical squarederror minimization may fail miserably (see Problem 29.1). However, if η ∈

Page 509: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.2 Uniform Deviations of Averages from Expectations 508

F , then for every η′ ∈ F we have

L(η′)− L∗ ≤ 2√

E (η′(X)− η(X))2 (by Corollary 6.2)

= 2√

E (η′(X)− Y )2 −E (η(X)− Y )2

= 2√

E (η′(X)− Y )2 − infη∈F

E (η(X)− Y )2,

where the two equalities follow from the fact that η(X) = EY |X. Thus,we have

L(ηn)− L∗

≤ 2√

E (ηn(X)− Y )2|Dn − infη′∈F

E (η′(X)− Y )2

≤ 2

√√√√2 supη′∈F

∣∣∣∣∣ 1nn∑i=1

(η′(Xi)− Yi)2 −E (η′(X)− Y )2

∣∣∣∣∣, (29.1)

by an argument as in the proof of Lemma 8.2. Thus, the method is con-sistent if the supremum above converges to zero. If we define Zi = (Xi, Yi)and f(Zi) = (η′(Xi)− Yi)2, then we see that we need only to bound

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ ,where F is a class of bounded functions. In the next four sections we developupper bounds for such uniform deviations of averages from their expecta-tions. Then we apply these techniques to establish consistency of gener-alized linear classifiers obtained by minimization of the empirical squarederror.

29.2 Uniform Deviations of Averages fromExpectations

Let F be a class of real-valued functions defined on Rd, and let Z1, . . . , Znbe i.i.d. Rd-valued random variables. We assume that for each f ∈ F ,0 ≤ f(x) ≤M for all x ∈ Rd and some M <∞. By Hoeffding’s inequality,

P

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ > ε

≤ 2e−2nε2/M2

for any f ∈ F . However, it is much less trivial to obtain information aboutthe probabilities

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ > ε

.

Page 510: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.2 Uniform Deviations of Averages from Expectations 509

Vapnik and Chervonenkis (1981) were the first to obtain bounds for theprobability above. For example, the following simple observation makesTheorems 12.5, 12.8, and 12.10 easy to apply in the new situation:

Lemma 29.1.

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z)

∣∣∣∣∣ ≤M supf∈F,t>0

∣∣∣∣∣ 1nn∑i=1

If(Zi)>t −Pf(Z) > t

∣∣∣∣∣ .Proof. Exploiting the identity

∫∞0

PX > tdt = EX for nonnegativerandom variables, we have

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z)

∣∣∣∣∣= sup

f∈F

∣∣∣∣∣∫ ∞

0

(1n

n∑i=1

If(Zi)>t −Pf(Z) > t

)dt

∣∣∣∣∣≤ M sup

f∈F,t>0

∣∣∣∣∣ 1nn∑i=1

If(Zi)>t −Pf(Z) > t

∣∣∣∣∣ . 2

For example, from Theorem 12.5 and Lemma 29.1 we get

Corollary 29.1. Define the collection of sets

F = Af,t : f ∈ F , t ∈ [0,M ] ,

where for every f ∈ F and t ∈ [0,M ] the set Af,t ∈ Rd is defined as

Af,t = z : f(z) > t.

Then

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ > ε

≤ 8s(F , n)e−nε

2/(32M2).

Example. Consider the empirical squared error minimization problemsketched in the previous section. Let F be the class of monotone increasingfunctions η′ : R → [0, 1], and let ηn be the function selected by minimizingthe empirical squared error. By (29.1), if η(x) = PY = 1|X = x is alsomonotone increasing, then

PL(ηn)− L∗ > ε

≤ P

supη′∈F

∣∣∣∣∣ 1nn∑i=1

(η′(Xi)− Yi)2 −E(η′(X)− Y )2

∣∣∣∣∣ > ε2

8

.

Page 511: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.2 Uniform Deviations of Averages from Expectations 510

If F contains all subsets of R× 0, 1 of the form

Aη′,t =(x, y) : (η′(x)− y)2 > t

, η′ ∈ F , t ∈ [0, 1],

then it is easy to see that its n-th shatter coefficient satisfies s(F , n) ≤(n/2+1)2. Thus, Corollary 29.1 can be applied, and the empirical squarederror minimization is consistent. 2

In many cases, Corollary 29.1 does not provide the best possible bound.To state a similar, but sometimes more useful result, we introduce l1-covering numbers. The notion is very similar to that of covering numbersdiscussed in Chapter 28. The main difference is that here the balls aredefined in terms of an l1-distance, rather than the supremum norm.

Definition 29.1. Let A be a bounded subset of Rd. For every ε > 0, thel1-covering number, denoted by N (ε, A), is defined as the cardinality of thesmallest finite set in Rd such that for every z ∈ A there is a point t ∈ Rd

in the finite set such that (1/d)‖z−t‖1 < ε. (‖x‖1 =∑di=1 |x(i)| denotes the

l1-norm of the vector x = (x(1), . . . , x(d)) in Rd.) In other words, N (ε, A)is the smallest number of l1-balls of radius εd, whose union contains A.logN (ε, A) is often called the metric entropy of A.

We will mainly be interested in covering numbers of special sets. Letzn1 = (z1, . . . , zn) be n fixed points in Rd, and define the following set:

F(zn1 ) = (f(z1), . . . , f(zn)); f ∈ F ⊂ Rn.

The l1-covering number of F(zn1 ) is N (ε,F(zn1 )).If Zn1 = (Z1, . . . , Zn) is a sequence of i.i.d. random variables, thenN (ε,F(Zn1 ))

is a random variable, whose expected value plays a central role in our prob-lem:

Theorem 29.1. (Pollard (1984)). For any n and ε > 0,

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ > ε

≤ 8E N (ε/8,F(Zn1 )) e−nε

2/(128M2).

The proof of the theorem is given in Section 29.4.

Remark. Theorem 29.1 is a generalization of the basic Vapnik-Chervonenkisinequality. To see this, define l∞-covering numbers based on the maximumnorm (Vapnik and Chervonenkis (1981)): N∞(ε, A) is the cardinality of thesmallest finite set in Rd such that for every z ∈ A there is a point t ∈ Rd

in the set such that max1≤i≤d |z(i) − t(i)| < ε. If the functions in F areindicators of sets from a class A of subsets of Rd, then it is easy to see thatfor every ε ∈ (0, 1/2),

N∞(ε,F(zn1 )) = NA(z1, . . . , zn),

Page 512: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.3 Empirical Squared Error Minimization 511

where NA(z1, . . . , zn) is the combinatorial quantity that was used in Defi-nition 12.1 of shatter coefficients. Since

N (ε,F(zn1 )) ≤ N∞(ε,F(zn1 )),

Theorem 29.1 remains true with l∞-covering numbers, therefore, it is a gen-eralization of Theorem 12.5. To see this, notice that if F contains indicatorsof sets of the class A, then

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ = supA∈A

∣∣∣∣∣ 1nn∑i=1

IZi∈A −PZ1 ∈ A

∣∣∣∣∣ . 2

For inequalities sharper and more general than Theorem 29.1 we referto Vapnik (1982), Pollard (1984; 1986), Haussler (1992), and Anthony andShawe-Taylor (1990).

29.3 Empirical Squared Error Minimization

We return to the minimization of the empirical squared error. Let F be aclass of functions η′ : Rd → [0, 1], containing the true a posteriori functionη. The empirical squared error

1n

n∑i=1

(η′(Xi)− Yi)2

is minimized over η′ ∈ F , to obtain the estimate ηn. The next result showsthat empirical squared error minimization is consistent under general con-ditions. Observe that these are the same conditions that we assumed inTheorem 28.1 to prove consistency of skeleton estimates.

Theorem 29.2. Assume that F is a totally bounded class of functions.(For the definition see Chapter 28.) If η ∈ F , then the classification ruleobtained by minimizing the empirical squared error over F is strongly con-sistent, that is,

limn→∞

L(ηn) = L∗ with probability one.

Proof. Recall that by (29.1),

PL(ηn)− L∗ > ε

≤ P

supη′∈F

∣∣∣∣∣ 1nn∑i=1

(η′(Xi)− Yi)2 −E(η′(X)− Y )2

∣∣∣∣∣ > ε2

8

.

We apply Theorem 29.1 to show that for every ε > 0, the probability on theright-hand side converges to zero exponentially as n→∞. To this end, we

Page 513: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.4 Proof of Theorem 29.1 512

need to find a suitable upper bound on EN (ε,J (Zn1 )), where J is theclass of functions f ′(x, y) = (η′(x) − y)2 from Rd × 0, 1 to [0, 1], whereη′ ∈ F , and Zi = (Xi, Yi). Observe that for any f ′, f ∈ J ,

1n

n∑i=1

|f ′(Zi)− f(Zi)| =1n

n∑i=1

∣∣(η′(Xi)− Yi)2 − (η(Xi)− Yi)2∣∣

≤ 2n

n∑i=1

|η′(Xi)− η(Xi)|

≤ 2 supx|η′(x)− η(x)|.

This inequality implies that for every ε > 0,

EN (ε,J (Zn1 )) ≤ Nε/2,

where Nε is the covering number of F defined in Chapter 28. By the as-sumption of total boundedness, for every ε > 0, Nε/2 <∞. Since Nε/2 doesnot depend on n, the theorem is proved. 2

Remark. The nonasymptotic exponential nature of the inequality in The-orem 29.1 makes it possible to obtain upper bounds for the rate of conver-gence of L(ηn) to L∗ in terms of the covering numbers Nε of the class F .However, since we started our analysis by the loose inequality L(η′)−L∗ ≤2√

E (η′(X)− η(X))2, the resulting rates are likely to be suboptimal(see Theorem 6.5). Also, the inequality of Theorem 29.1 may be loose inthis case. In a somewhat different setup, Barron (1991) developed a proofmethod based on Bernstein’s inequality that is useful for obtaining tighterupper bounds for L(ηn)− L∗ in certain cases. 2

29.4 Proof of Theorem 29.1

The main tricks in the proof resemble those of Theorem 12.5. We can showthat for nε2 ≥ 2M2,

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

f(Zi)−Ef(Z1)

∣∣∣∣∣ > ε

≤ 4P

supf∈F

∣∣∣∣∣ 1nn∑i=1

σif(Zi)

∣∣∣∣∣ > ε

4

,

where σ1, . . . , σn are i.i.d. −1, 1-valued random variables, independent ofthe Zi’s, with Pσi = 1 = Pσi = −1 = 1/2. The only minor differencewith Theorem 12.5 appears when Chebyshev’s inequality is applied. Weuse the fact that by boundedness, Var(f(Z1)) ≤M2/4 for every f ∈ F .

Page 514: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.4 Proof of Theorem 29.1 513

Now, take a minimal ε/8-covering of F(Zn1 ), that is, M = N (ε/8,F(Zn1 ))functions g1, . . . , gM such that for every f ∈ F there is a g∗ ∈ g1, . . . , gMwith

1n

n∑i=1

|f(Zi)− g∗(Zi)| ≤ε

8.

For any function f , we have∣∣∣∣∣ 1nn∑i=1

σif(Zi)

∣∣∣∣∣ ≤ 1n

n∑i=1

|σif(Zi)| =1n

n∑i=1

|f(Zi)| ,

and thus∣∣∣∣∣ 1nn∑i=1

σif(Zi)

∣∣∣∣∣ ≤

∣∣∣∣∣ 1nn∑i=1

σig∗(Zi)

∣∣∣∣∣+∣∣∣∣∣ 1n

n∑i=1

σi(f(Zi)− g∗(Zi))

∣∣∣∣∣≤

∣∣∣∣∣ 1nn∑i=1

σig∗(Zi)

∣∣∣∣∣+ 1n

n∑i=1

|f(Zi)− g∗(Zi)| .

As (1/n)∑ni=1 |g∗(Zi)− f(Zi)| ≤ ε/8,

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

σif(Zi)

∣∣∣∣∣ > ε

4

∣∣∣∣∣Z1, . . . , Zn

≤ P

supf∈F

∣∣∣∣∣ 1nn∑i=1

σig∗(Zi)

∣∣∣∣∣+ 1n

n∑i=1

|f(Zi)− g∗(Zi)| >ε

4

∣∣∣∣∣Z1, . . . , Zn

≤ P

maxgj

∣∣∣∣∣ 1nn∑i=1

σigj(Zi)

∣∣∣∣∣ > ε

8

∣∣∣∣∣Z1, . . . , Zn

.

Now that we have been able to convert the “sup” into a “max,” we can usethe union bound:

P

maxgj

∣∣∣∣∣ 1nn∑i=1

σigj(Zi)

∣∣∣∣∣ > ε

8

∣∣∣∣∣Z1, . . . , Zn

≤ N( ε

8,F(Zn1 )

)maxgj

P

∣∣∣∣∣ 1nn∑i=1

σigj(Zi)

∣∣∣∣∣ > ε

8

∣∣∣∣∣Z1, . . . , Zn

.

We need only find a uniform bound for the probability following the “max.”This, however, is easy, since after conditioning on Z1, . . . , Zn,

∑ni=1 σigj(Zi)

is the sum of independent bounded random variables whose expected valueis zero. Therefore, Hoeffding’s inequality gives

P

∣∣∣∣∣ 1nn∑i=1

σigj(Zi)

∣∣∣∣∣ > ε

8

∣∣∣∣∣Z1, . . . , Zn

≤ 2e−nε

2/(128M2).

Page 515: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.5 Covering Numbers and Shatter Coefficients 514

In summary,

P

supf∈F

∣∣∣∣∣ 1nn∑i=1

σif(Zi)

∣∣∣∣∣ > ε

4

≤ E

P

maxgj

∣∣∣∣∣ 1nn∑i=1

σigj(Zi)

∣∣∣∣∣ > ε

8

∣∣∣∣∣Z1, . . . , Zn

≤ 2EN( ε

8,F(Zn1 )

)e−nε

2/(128M2).

The theorem is proved. 2

Properties of N (ε,F(Zn1 )) will be studied in the next section.

29.5 Covering Numbers and Shatter Coefficients

In this section we study covering numbers, and relate them to shattercoefficients of certain classes of sets. As in Chapter 28, we introduce l1-packing numbers. Let F be a class of functions on Rd, taking their valuesin [0,M ]. Let µ be an arbitrary probability measure on Rd. Let g1, . . . , gmbe a finite collection of functions from F with the property that for anytwo of them ∫

Rd

|gi(x)− gj(x)|µ(dx) ≥ ε.

The largestm for which such a collection exists is called the packing numberof F (relative to µ), and is denoted by M(ε,F). If µ places probability 1/non each of z1, . . . , zn, then by definition M(ε,F) = M(ε,F(zn1 )), and it iseasy to see (Problem 28.3) that

M(2ε,F(zn1 )) ≤ N (ε,F(zn1 )) ≤M(ε,F(zn1 )).

An important feature of a class of functions F is the vc dimension VF+

ofF+ = (x, t) : t ≤ f(x) ; f ∈ F .

This is clarified by the following theorem, which is a slight refinement ofa result by Pollard (1984), which is based on Dudley’s (1978) work. Itconnects the packing number of F with the shatter coefficients of F+. Seealso Haussler (1992) for a somewhat different argument.

Theorem 29.3. Let F be a class of [0,M ]-valued functions on Rd. Forevery ε > 0 and probability measure µ,

M(ε,F) ≤ s(F+, k

),

where k =⌈M

εlog

eεM2(ε,F)2M

⌉.

Page 516: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

29.5 Covering Numbers and Shatter Coefficients 515

Proof. Let $g_1,g_2,\dots,g_m$ be an arbitrary $\varepsilon$-packing of $\mathcal{F}$ of size $m \le \mathcal{M}(\varepsilon,\mathcal{F})$. The proof is in the spirit of the probabilistic method of combinatorics (see, e.g., Spencer (1987)). To prove the inequality, we create $k$ random points on $\mathcal{R}^d \times [0,M]$ in the following way, where $k$ is a positive integer specified later. We generate $k$ independent random variables $S_1,\dots,S_k$ on $\mathcal{R}^d$ with common distribution $\mu$, and independently of this, we generate another $k$ independent random variables $T_1,\dots,T_k$, uniformly distributed on $[0,M]$. This yields $k$ random pairs $(S_1,T_1),\dots,(S_k,T_k)$. For any two functions $g_i$ and $g_j$ in an $\varepsilon$-packing, the probability that the sets $G_i = \{(x,t): t\le g_i(x)\}$ and $G_j = \{(x,t): t\le g_j(x)\}$ pick the same points from our random set of $k$ points is bounded as follows:
$$\begin{aligned}
\mathbf{P}\{G_i \text{ and } G_j \text{ pick the same points}\}
&= \prod_{l=1}^k \left(1-\mathbf{P}\{(S_l,T_l)\in G_i\triangle G_j\}\right)\\
&= \left(1-\mathbf{E}\{\mathbf{P}\{(S_1,T_1)\in G_i\triangle G_j\,|\,S_1\}\}\right)^k\\
&= \left(1-\frac{1}{M}\mathbf{E}|g_i(S_1)-g_j(S_1)|\right)^k\\
&\le (1-\varepsilon/M)^k \le e^{-k\varepsilon/M},
\end{aligned}$$
where we used the definition of the functions $g_1,\dots,g_m$. Observe that the expected number of pairs $(g_i,g_j)$ of these functions, such that the corresponding sets $G_i = \{(x,t): t\le g_i(x)\}$ and $G_j = \{(x,t): t\le g_j(x)\}$ pick the same points, is bounded by
$$\mathbf{E}\left|\{(g_i,g_j);\ G_i \text{ and } G_j \text{ pick the same points}\}\right| = \binom{m}{2}\mathbf{P}\{G_i \text{ and } G_j \text{ pick the same points}\} \le \binom{m}{2}e^{-k\varepsilon/M}.$$

Since for $k$ randomly chosen points the average number of pairs that pick the same points is bounded by $\binom{m}{2}e^{-k\varepsilon/M}$, there exist $k$ points in $\mathcal{R}^d\times[0,M]$ such that the number of pairs $(g_i,g_j)$ that pick the same points is actually bounded by $\binom{m}{2}e^{-k\varepsilon/M}$. For each such pair we can add one more point in $\mathcal{R}^d\times[0,M]$ such that the point is contained in $G_i\triangle G_j$. Thus, we have obtained a set of no more than $k+\binom{m}{2}e^{-k\varepsilon/M}$ points such that the sets $G_1,\dots,G_m$ pick different subsets of it. Since $k$ was arbitrary, we can choose it to minimize this expression. This yields
$$\left\lfloor \frac{M}{\varepsilon}\log\left(e\varepsilon\binom{m}{2}\Big/M\right)\right\rfloor$$
points, so the shatter coefficient of $\mathcal{F}^+$ corresponding to this number of points must be at least $m$, which proves the statement. □

The meaning of Theorem 29.3 is best seen from the following simple corollary:


Corollary 29.2. Let $\mathcal{F}$ be a class of $[0,M]$-valued functions on $\mathcal{R}^d$. For every $\varepsilon > 0$ and probability measure $\mu$,
$$\mathcal{M}(\varepsilon,\mathcal{F}) \le \left(\frac{4eM}{\varepsilon}\log\frac{2eM}{\varepsilon}\right)^{V_{\mathcal{F}^+}}.$$

Proof. Recall that Theorem 13.2 implies
$$s(\mathcal{F}^+,k) \le \left(\frac{ek}{V_{\mathcal{F}^+}}\right)^{V_{\mathcal{F}^+}}.$$
The inequality follows from Theorem 29.3 by straightforward calculation. The details are left as an exercise (Problem 29.2). □

Recently Haussler (1991) was able to get rid of the “log” factor in the above upper bound. He proved that if $\varepsilon = k/n$ for an integer $k$, then
$$\mathcal{M}(\varepsilon,\mathcal{F}) \le e(d+1)\left(\frac{2e}{\varepsilon}\right)^{V_{\mathcal{F}^+}}.$$
The quantity $V_{\mathcal{F}^+}$ is sometimes called the pseudo dimension of $\mathcal{F}$ (see Problem 29.3). It follows immediately from Theorem 13.9 that if $\mathcal{F}$ is a linear space of functions of dimension $r$, then its pseudo dimension is at most $r+1$. A few more properties are worth mentioning:

Theorem 29.4. (Wenocur and Dudley (1981)). Let $g: \mathcal{R}^d \to \mathcal{R}$ be an arbitrary function, and consider the class of functions $\mathcal{G} = \{g+f;\ f\in\mathcal{F}\}$. Then
$$V_{\mathcal{G}^+} = V_{\mathcal{F}^+}.$$

Proof. If the points $(s_1,t_1),\dots,(s_k,t_k) \in \mathcal{R}^d\times\mathcal{R}$ are shattered by $\mathcal{F}^+$, then the points $(s_1,t_1+g(s_1)),\dots,(s_k,t_k+g(s_k))$ are shattered by $\mathcal{G}^+$. This proves
$$V_{\mathcal{G}^+} \ge V_{\mathcal{F}^+}.$$
The proof of the other inequality is similar. □

Theorem 29.5. (Nolan and Pollard (1987); Dudley (1987)). Let $g: [0,M]\to\mathcal{R}$ be a fixed nondecreasing function, and define the class $\mathcal{G} = \{g\circ f;\ f\in\mathcal{F}\}$. Then
$$V_{\mathcal{G}^+} \le V_{\mathcal{F}^+}.$$

Proof. Assume that $n \le V_{\mathcal{G}^+}$, and let the functions $f_1,\dots,f_{2^n}\in\mathcal{F}$ be such that the binary vector
$$\left(I_{\{g(f_j(s_1))\ge t_1\}},\dots,I_{\{g(f_j(s_n))\ge t_n\}}\right)$$
takes all $2^n$ values as $j$ ranges over $1,\dots,2^n$. For all $1\le i\le n$ define the numbers
$$u_i = \min_{1\le j\le 2^n}\{f_j(s_i):\ g(f_j(s_i)) \ge t_i\} \quad\text{and}\quad l_i = \max_{1\le j\le 2^n}\{f_j(s_i):\ g(f_j(s_i)) < t_i\}.$$
By the monotonicity of $g$, $u_i > l_i$. Then the binary vector
$$\left(I_{\{f_j(s_1)\ge (u_1+l_1)/2\}},\dots,I_{\{f_j(s_n)\ge (u_n+l_n)/2\}}\right)$$
takes the same value as
$$\left(I_{\{g(f_j(s_1))\ge t_1\}},\dots,I_{\{g(f_j(s_n))\ge t_n\}}\right)$$
for every $j \le 2^n$. Therefore, the pairs
$$\left(s_1,\frac{u_1+l_1}{2}\right),\dots,\left(s_n,\frac{u_n+l_n}{2}\right)$$
are shattered by $\mathcal{F}^+$, which proves the theorem. □

Next we present a few results about covering numbers of classes of functions whose members are sums or products of functions from other classes. Similar results can be found in Nobel (1992), Nolan and Pollard (1987), and Pollard (1990).

Theorem 29.6. Let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be classes of real functions on $\mathcal{R}^d$. For $n$ arbitrary, fixed points $z_1^n = (z_1,\dots,z_n)$ in $\mathcal{R}^d$, define the sets $\mathcal{F}_1(z_1^n),\dots,\mathcal{F}_k(z_1^n)$ in $\mathcal{R}^n$ by
$$\mathcal{F}_j(z_1^n) = \{(f_j(z_1),\dots,f_j(z_n));\ f_j\in\mathcal{F}_j\},\quad j=1,\dots,k.$$
Also, introduce
$$\mathcal{F}(z_1^n) = \{(f(z_1),\dots,f(z_n));\ f\in\mathcal{F}\}$$
for the class of functions
$$\mathcal{F} = \{f_1+\dots+f_k;\ f_j\in\mathcal{F}_j,\ j=1,\dots,k\}.$$
Then for every $\varepsilon>0$ and $z_1^n$,
$$\mathcal{N}(\varepsilon,\mathcal{F}(z_1^n)) \le \prod_{j=1}^k \mathcal{N}(\varepsilon/k,\mathcal{F}_j(z_1^n)).$$


Proof. Let $S_1,\dots,S_k\subset\mathcal{R}^n$ be minimal $\varepsilon/k$-coverings of $\mathcal{F}_1(z_1^n),\dots,\mathcal{F}_k(z_1^n)$, respectively. This implies that for any $f_j\in\mathcal{F}_j$ there is a vector $s_j = (s_j^{(1)},\dots,s_j^{(n)})\in S_j$ such that
$$\frac{1}{n}\sum_{i=1}^n\left|f_j(z_i)-s_j^{(i)}\right| < \varepsilon/k$$
for every $j=1,\dots,k$. Moreover, $|S_j| = \mathcal{N}(\varepsilon/k,\mathcal{F}_j(z_1^n))$. We show that
$$S = \{s_1+\dots+s_k;\ s_j\in S_j,\ j=1,\dots,k\}$$
is an $\varepsilon$-covering of $\mathcal{F}(z_1^n)$. This follows immediately from the triangle inequality, since for any $f_1,\dots,f_k$ there are $s_1,\dots,s_k$ such that
$$\frac{1}{n}\sum_{i=1}^n\left|f_1(z_i)+\dots+f_k(z_i)-(s_1^{(i)}+\dots+s_k^{(i)})\right| \le \frac{1}{n}\sum_{i=1}^n\left|f_1(z_i)-s_1^{(i)}\right|+\dots+\frac{1}{n}\sum_{i=1}^n\left|f_k(z_i)-s_k^{(i)}\right| < k\frac{\varepsilon}{k} = \varepsilon.\quad\Box$$

Theorem 29.7. (Pollard (1990)). Let $\mathcal{F}$ and $\mathcal{G}$ be classes of real functions on $\mathcal{R}^d$, bounded by $M_1$ and $M_2$, respectively. (That is, e.g., $|f(x)|\le M_1$ for every $x\in\mathcal{R}^d$ and $f\in\mathcal{F}$.) For arbitrary fixed points $z_1^n = (z_1,\dots,z_n)$ in $\mathcal{R}^d$ define the sets $\mathcal{F}(z_1^n)$ and $\mathcal{G}(z_1^n)$ in $\mathcal{R}^n$ as in Theorem 29.6. Introduce
$$\mathcal{J}(z_1^n) = \{(h(z_1),\dots,h(z_n));\ h\in\mathcal{J}\}$$
for the class of functions
$$\mathcal{J} = \{fg;\ f\in\mathcal{F},\ g\in\mathcal{G}\}.$$
Then for every $\varepsilon>0$ and $z_1^n$,
$$\mathcal{N}(\varepsilon,\mathcal{J}(z_1^n)) \le \mathcal{N}\!\left(\frac{\varepsilon}{2M_2},\mathcal{F}(z_1^n)\right)\cdot\mathcal{N}\!\left(\frac{\varepsilon}{2M_1},\mathcal{G}(z_1^n)\right).$$

Proof. Let $S\subset[-M_1,M_1]^n$ be an $\varepsilon/(2M_2)$-covering of $\mathcal{F}(z_1^n)$, that is, for any $f\in\mathcal{F}$ there is a vector $s=(s^{(1)},\dots,s^{(n)})\in S$ such that
$$\frac{1}{n}\sum_{i=1}^n\left|f(z_i)-s^{(i)}\right| < \frac{\varepsilon}{2M_2}.$$
It is easy to see that $S$ can be chosen such that $|S| = \mathcal{N}(\varepsilon/(2M_2),\mathcal{F}(z_1^n))$. Similarly, let $T\subset[-M_2,M_2]^n$ be an $\varepsilon/(2M_1)$-covering of $\mathcal{G}(z_1^n)$ with $|T| = \mathcal{N}(\varepsilon/(2M_1),\mathcal{G}(z_1^n))$ such that for any $g\in\mathcal{G}$ there is a $t=(t^{(1)},\dots,t^{(n)})\in T$ with
$$\frac{1}{n}\sum_{i=1}^n\left|g(z_i)-t^{(i)}\right| < \frac{\varepsilon}{2M_1}.$$
We show that the set
$$U = \{st;\ s\in S,\ t\in T\}$$
is an $\varepsilon$-covering of $\mathcal{J}(z_1^n)$. Let $f\in\mathcal{F}$ and $g\in\mathcal{G}$ be arbitrary and $s\in S$ and $t\in T$ the corresponding vectors such that
$$\frac{1}{n}\sum_{i=1}^n\left|f(z_i)-s^{(i)}\right| < \frac{\varepsilon}{2M_2}\quad\text{and}\quad\frac{1}{n}\sum_{i=1}^n\left|g(z_i)-t^{(i)}\right| < \frac{\varepsilon}{2M_1}.$$
Then
$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^n\left|f(z_i)g(z_i)-s^{(i)}t^{(i)}\right|
&= \frac{1}{n}\sum_{i=1}^n\left|f(z_i)(t^{(i)}+g(z_i)-t^{(i)})-s^{(i)}t^{(i)}\right|\\
&= \frac{1}{n}\sum_{i=1}^n\left|t^{(i)}(f(z_i)-s^{(i)})+f(z_i)(g(z_i)-t^{(i)})\right|\\
&\le \frac{M_2}{n}\sum_{i=1}^n\left|f(z_i)-s^{(i)}\right|+\frac{M_1}{n}\sum_{i=1}^n\left|g(z_i)-t^{(i)}\right| < \varepsilon.\quad\Box
\end{aligned}$$

29.6 Generalized Linear Classification

In this section we use the uniform laws of large numbers discussed in this chapter to prove that squared error minimization over an appropriately chosen class of generalized linear classifiers yields a universally consistent rule. Consider the class $\mathcal{C}^{(k_n)}$ of generalized linear classifiers, whose members are functions $\phi:\mathcal{R}^d\to\{0,1\}$ of the form
$$\phi(x) = \begin{cases}0 & \text{if } \sum_{j=1}^{k_n}a_j\psi_j(x)\le 0\\ 1 & \text{otherwise,}\end{cases}$$
where $\psi_1,\dots,\psi_{k_n}$ are fixed basis functions, and the coefficients $a_1,\dots,a_{k_n}$ are arbitrary real numbers. The training sequence $D_n$ is used to determine the coefficients $a_i$. In Chapter 17 we studied the behavior of the classifier whose coefficients are picked to minimize the empirical error probability
$$L_n = \frac{1}{n}\sum_{i=1}^n I_{\{\phi(X_i)\ne Y_i\}}.$$


Instead of minimizing the empirical error probability $L_n(\phi)$, several authors suggested minimizing the empirical squared error
$$\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^{k_n}a_j\psi_j(X_i)-(2Y_i-1)\right)^2$$
(see, e.g., Duda and Hart (1973), Vapnik (1982)). This is rather dangerous. For example, for $k=1$ and $d=1$ it is easy to find a distribution such that the error probability of the linear classifier that minimizes the empirical squared error converges to $1-\varepsilon$, while the error probability of the best linear classifier is $\varepsilon$, where $\varepsilon$ is an arbitrarily small positive number (Theorem 4.7). Clearly, similar examples can be found for any fixed $k$. This demonstrates powerfully the danger of minimizing squared error instead of error count. Minimizing the latter yields a classifier whose average error probability is always within $O\left(\sqrt{\log n/n}\right)$ of the optimum in the class, for fixed $k$. We note here that in some special cases minimization of the two types of error are equivalent (see Problem 29.5). Interestingly though, if $k_n\to\infty$ as $n$ increases, we can obtain universal consistency by minimizing the empirical squared error.

Theorem 29.8. Let $\psi_1,\psi_2,\dots$ be a sequence of bounded functions with $|\psi_j(x)|\le 1$ such that the set of all finite linear combinations of the $\psi_j$'s,
$$\bigcup_{k=1}^\infty\left\{\sum_{j=1}^k a_j\psi_j(x);\ a_1,a_2,\dots\in\mathcal{R}\right\},$$
is dense in $L_2(\mu)$ on all balls of the form $\{x:\|x\|\le M\}$ for any probability measure $\mu$. Let the coefficients $a_1^*,\dots,a_{k_n}^*$ minimize the empirical squared error
$$\frac{1}{n}\sum_{i=1}^n\left(\sum_{j=1}^{k_n}a_j\psi_j(X_i)-(2Y_i-1)\right)^2$$
under the constraint $\sum_{j=1}^{k_n}|a_j|\le b_n$, $b_n\ge 1$. Define the generalized linear classifier $g_n$ by
$$g_n(x) = \begin{cases}0 & \text{if } \sum_{j=1}^{k_n}a_j^*\psi_j(x)\le 0\\ 1 & \text{otherwise.}\end{cases}$$
If $k_n$ and $b_n$ satisfy
$$k_n\to\infty,\quad b_n\to\infty,\quad\text{and}\quad\frac{k_nb_n^4\log(b_n)}{n}\to 0,$$
then $\mathbf{E}L(g_n)\to L^*$ for all distributions of $(X,Y)$, that is, the rule $g_n$ is universally consistent. If we assume additionally that $b_n^4\log n = o(n)$, then $g_n$ is strongly universally consistent.


Proof. Let $\delta>0$ be arbitrary. Then there exists a constant $M$ such that $\mathbf{P}\{\|X\|>M\}<\delta$. Thus,
$$L(g_n)-L^* \le \delta+\mathbf{P}\{g_n(X)\ne Y,\ \|X\|\le M|D_n\}-\mathbf{P}\{g^*(X)\ne Y,\ \|X\|\le M\}.$$
It suffices to show that $\mathbf{P}\{g_n(X)\ne Y,\|X\|\le M|D_n\}-\mathbf{P}\{g^*(X)\ne Y,\|X\|\le M\}\to 0$ in the required sense for every $M>0$. Introduce the notation $f_n^*(x) = \sum_{j=1}^{k_n}a_j^*\psi_j(x)$. By Corollary 6.2, we see that
$$\mathbf{P}\{g_n(X)\ne Y,\|X\|\le M|D_n\}-\mathbf{P}\{g^*(X)\ne Y,\|X\|\le M\} \le \sqrt{\int_{\|x\|\le M}\left(f_n^*(x)-(2\eta(x)-1)\right)^2\mu(dx)}.$$
We prove that the right-hand side converges to zero in probability. Observe that since $\mathbf{E}\{2Y-1|X=x\} = 2\eta(x)-1$, for any function $h(x)$,
$$(h(x)-(2\eta(x)-1))^2 = \mathbf{E}\{(h(X)-(2Y-1))^2|X=x\}-\mathbf{E}\{((2Y-1)-(2\eta(X)-1))^2|X=x\}$$
(see Chapter 2). Therefore, denoting the class of functions over which we minimize by
$$\mathcal{F}_n = \left\{\sum_{j=1}^{k_n}a_j\psi_j;\ \sum_{j=1}^{k_n}|a_j|\le b_n\right\},$$
we have
$$\begin{aligned}
&\int_{\|x\|\le M}\left(f_n^*(x)-(2\eta(x)-1)\right)^2\mu(dx)\\
&= \mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\Big|D_n\right\}-\mathbf{E}\left\{((2Y-1)-(2\eta(X)-1))^2I_{\{\|X\|\le M\}}\right\}\\
&= \left(\mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\Big|D_n\right\}-\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\right\}\right)\\
&\quad+\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\right\}-\mathbf{E}\left\{((2Y-1)-(2\eta(X)-1))^2I_{\{\|X\|\le M\}}\right\}.
\end{aligned}$$
The last two terms may be combined to yield
$$\inf_{f\in\mathcal{F}_n}\int_{\|x\|\le M}\left(f(x)-(2\eta(x)-1)\right)^2\mu(dx),$$


which converges to zero by the denseness assumption. To prove that the first term converges to zero in probability, observe that we may assume without loss of generality that $\mathbf{P}\{\|X\|>M\} = 0$. As in the proof of Lemma 8.2, it is easy to show that
$$\begin{aligned}
&\mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\Big|D_n\right\}-\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2I_{\{\|X\|\le M\}}\right\}\\
&= \mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2\Big|D_n\right\}-\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2\right\}\\
&\le 2\sup_{f\in\mathcal{F}_n}\left|\frac{1}{n}\sum_{i=1}^n(f(X_i)-(2Y_i-1))^2-\mathbf{E}\left\{(f(X)-(2Y-1))^2\right\}\right|\\
&= 2\sup_{h\in\mathcal{J}}\left|\frac{1}{n}\sum_{i=1}^nh(X_i,Y_i)-\mathbf{E}h(X,Y)\right|,
\end{aligned}$$
where the class of functions $\mathcal{J}$ is defined by
$$\mathcal{J} = \left\{h(x,y) = (f(x)-(2y-1))^2;\ f\in\mathcal{F}_n\right\}.$$
Observe that since $|2y-1| = 1$ and $|\psi_j(x)|\le 1$, we have
$$0\le h(x,y)\le\left(\sum_{j=1}^{k_n}|a_j|+1\right)^2\le 2(b_n^2+1)\le 4b_n^2.$$

Therefore, Theorem 29.1 asserts that
$$\begin{aligned}
&\mathbf{P}\left\{\mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2\Big|D_n\right\}-\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2\right\}>\varepsilon\right\}\\
&\le \mathbf{P}\left\{\sup_{h\in\mathcal{J}}\left|\frac{1}{n}\sum_{i=1}^nh(X_i,Y_i)-\mathbf{E}h(X,Y)\right|>\varepsilon/2\right\}\\
&\le 8\,\mathbf{E}\,\mathcal{N}\!\left(\frac{\varepsilon}{16},\mathcal{J}(Z_1^n)\right)e^{-n\varepsilon^2/(512(4b_n^2)^2)},
\end{aligned}$$
where $Z_1^n = ((X_1,Y_1),\dots,(X_n,Y_n))$. Next, for fixed $z_1^n$, we estimate the covering number $\mathcal{N}(\varepsilon/16,\mathcal{J}(z_1^n))$. For arbitrary $f_1,f_2\in\mathcal{F}_n$, consider the functions $h_1(x,y) = (f_1(x)-(2y-1))^2$ and $h_2(x,y) = (f_2(x)-(2y-1))^2$. Then for any probability measure $\nu$ on $\mathcal{R}^d\times\{0,1\}$,
$$\begin{aligned}
\int|h_1(x,y)-h_2(x,y)|\,\nu(d(x,y))
&= \int\left|(f_1(x)-(2y-1))^2-(f_2(x)-(2y-1))^2\right|\nu(d(x,y))\\
&\le \int 2|f_1(x)-f_2(x)|(b_n+1)\,\nu(d(x,y))\\
&\le 4b_n\int|f_1(x)-f_2(x)|\,\mu(dx),
\end{aligned}$$
where $\mu$ is the marginal measure for $\nu$ on $\mathcal{R}^d$. Thus, for any $z_1^n = ((x_1,y_1),\dots,(x_n,y_n))$ and $\varepsilon$,
$$\mathcal{N}(\varepsilon,\mathcal{J}(z_1^n)) \le \mathcal{N}\!\left(\frac{\varepsilon}{4b_n},\mathcal{F}_n(x_1^n)\right).$$
Therefore, it suffices to estimate the covering number corresponding to $\mathcal{F}_n$. Since $\mathcal{F}_n$ is a subset of a linear space of functions, we have $V_{\mathcal{F}_n^+}\le k_n+1$ (Theorem 13.9). By Corollary 29.2,
$$\mathcal{N}\!\left(\frac{\varepsilon}{4b_n},\mathcal{F}_n(x_1^n)\right) \le \left(\frac{8eb_n}{\varepsilon/(4b_n)}\log\frac{4eb_n}{\varepsilon/(4b_n)}\right)^{k_n+1} \le \left(\frac{32eb_n^2}{\varepsilon}\right)^{2(k_n+1)}.$$

Summarizing, we have
$$\mathbf{P}\left\{\mathbf{E}\left\{(f_n^*(X)-(2Y-1))^2\Big|D_n\right\}-\inf_{f\in\mathcal{F}_n}\mathbf{E}\left\{(f(X)-(2Y-1))^2\right\}>\varepsilon\right\} \le 8\left(\frac{2^{10}eb_n^2}{\varepsilon}\right)^{2(k_n+1)}e^{-n\varepsilon^2/(2^{13}b_n^4)},$$
which goes to zero if $k_nb_n^4\log(b_n)/n\to 0$. The proof of the theorem is completed.

It is easy to see that if we assume additionally that $b_n^4\log n/n\to 0$, then strong universal consistency follows by applying the Borel-Cantelli lemma to the last probability. □
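Since the empirical squared error is a convex function of the coefficients and the constraint $\sum_j|a_j|\le b_n$ defines a convex set, the minimization in Theorem 29.8 can be carried out, for instance, by projected gradient descent. The following Python sketch is our own illustration (the function names and the choice of optimizer are assumptions, not the book's algorithm); it computes approximate coefficients $a_1^*,\dots,a_{k_n}^*$ from the matrix of basis-function values:

```python
import numpy as np

def project_l1_ball(a, radius):
    """Euclidean projection of a onto the l1 ball {v : sum_j |v_j| <= radius}."""
    if np.sum(np.abs(a)) <= radius:
        return a
    u = np.sort(np.abs(a))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(a) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def fit_constrained_squared_error(psi, y, b_n, steps=2000):
    """Minimize (1/n) sum_i (psi_i^T a - (2 y_i - 1))^2 subject to sum_j |a_j| <= b_n
    by projected gradient descent.  psi is the n x k_n matrix with entries
    psi_j(X_i); y contains labels in {0, 1}."""
    n, k = psi.shape
    target = 2.0 * y - 1.0
    lr = n / (2.0 * np.linalg.norm(psi, 2) ** 2)   # step size 1 / Lipschitz constant
    a = np.zeros(k)
    for _ in range(steps):
        grad = (2.0 / n) * psi.T @ (psi @ a - target)
        a = project_l1_ball(a - lr * grad, b_n)
    return a

def classify(psi_x, a):
    """Generalized linear rule: decide 1 iff sum_j a_j psi_j(x) > 0."""
    return (psi_x @ a > 0).astype(int)
```

The projection step uses the standard sorting construction for the $l_1$ ball; any other convex solver would serve equally well for this sketch.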

Remark. Minimization of the squared error is attractive because there are efficient algorithms to find the minimizing coefficients, while minimizing the number of errors committed on the training sequence is computationally more difficult. If the dimension $k$ of the generalized linear classifier is fixed, then stochastic approximation asymptotically provides the minimizing coefficients. For more information about this we refer to Robbins and Monro (1951), Kiefer and Wolfowitz (1952), Dvoretzky (1956), Fabian (1971), Tsypkin (1971), Nevelson and Khasminskii (1973), Kushner (1984), Ruppert (1991), and Ljung, Pflug, and Walk (1992). For example, Gyorfi (1984) proved that if $(U_1,V_1),(U_2,V_2),\dots$ form a stationary and ergodic sequence, in which each pair is distributed as the bounded random variable pair $(U,V)\in\mathcal{R}^k\times\mathcal{R}$, and the vector of coefficients $a = (a_1,\dots,a_k)$ minimizes
$$R(a) = \mathbf{E}\left(a^TU-V\right)^2,$$
and $a^{(0)}\in\mathcal{R}^k$ is arbitrary, then the sequence of coefficient vectors defined by
$$a^{(n+1)} = a^{(n)}-\frac{1}{n+1}\left(a^{(n)T}U_{n+1}-V_{n+1}\right)U_{n+1}$$
satisfies
$$\lim_{n\to\infty}R(a^{(n)}) = R(a)\quad\text{a.s.}\quad\Box$$
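The recursion in the remark is straightforward to run on a stream of pairs. The short sketch below is our own illustration of that update rule (the function name and the synthetic data are assumptions); it simply iterates $a^{(n+1)} = a^{(n)}-\frac{1}{n+1}(a^{(n)T}U_{n+1}-V_{n+1})U_{n+1}$:

```python
import numpy as np

def stochastic_approximation(pairs, k):
    """Run a^(n+1) = a^(n) - (1/(n+1)) * (a^(n)^T U_{n+1} - V_{n+1}) * U_{n+1}
    over a sequence of pairs (U, V) with U in R^k and V in R."""
    a = np.zeros(k)                  # a^(0) is arbitrary; zero is as good as any
    for n, (u, v) in enumerate(pairs):
        a = a - (1.0 / (n + 1)) * (a @ u - v) * u
    return a

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a_true = np.array([1.0, -2.0, 0.5])
    # i.i.d. pairs are a special case of a stationary ergodic sequence
    us = rng.uniform(-1.0, 1.0, size=(20000, 3))
    vs = us @ a_true + 0.1 * rng.standard_normal(20000)
    print(stochastic_approximation(zip(us, vs), k=3))
```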

Problems and Exercises

Problem 29.1. Find a class $\mathcal{F}$ containing two functions $\eta_1,\eta_2:\mathcal{R}\to[0,1]$ and a distribution of $(X,Y)$ such that $\min(L(\eta_1),L(\eta_2)) = L^*$, but as $n\to\infty$, the probability
$$\mathbf{P}\{L(\eta_n) = \max(L(\eta_1),L(\eta_2))\}$$
converges to one, where $\eta_n$ is selected from $\mathcal{F}$ by minimizing the empirical squared error.

Problem 29.2. Prove Corollary 29.2.

Problem 29.3. Let $\mathcal{F}$ be a class of functions on $\mathcal{R}^d$, taking their values in $[0,M]$. Haussler (1992) defines the pseudo dimension of $\mathcal{F}$ as the largest integer $n$ for which there exist $n$ points $z_1,\dots,z_n$ in $\mathcal{R}^d$ and a vector $v = (v^{(1)},\dots,v^{(n)})\in\mathcal{R}^n$ such that the binary $n$-vector
$$\left(I_{\{v^{(1)}+f(z_1)>0\}},\dots,I_{\{v^{(n)}+f(z_n)>0\}}\right)$$
takes all $2^n$ possible values as $f$ ranges through $\mathcal{F}$. Prove that the pseudo dimension of $\mathcal{F}$ equals the quantity $V_{\mathcal{F}^+}$ defined in the text.

Problem 29.4. Consistency of clustering. Let $X,X_1,\dots,X_n$ be i.i.d. random variables in $\mathcal{R}^d$, and assume that there is a number $0<M<\infty$ such that $\mathbf{P}\{X\in[-M,M]^d\} = 1$. Take the empirically optimal clustering of $X_1,\dots,X_n$, that is, find the points $a_1,\dots,a_k$ that minimize the empirical squared error
$$e_n(a_1,\dots,a_k) = \frac{1}{n}\sum_{i=1}^n\min_{1\le j\le k}\|a_j-X_i\|^2.$$
The error of the clustering is defined by the mean squared error
$$e(a_1,\dots,a_k) = \mathbf{E}\left\{\min_{1\le j\le k}\|a_j-X\|^2\ \Big|\ X_1,\dots,X_n\right\}.$$
Prove that if $a_1,\dots,a_k$ denote the empirically optimal cluster centers, then
$$e(a_1,\dots,a_k)-\inf_{b_1,\dots,b_k\in\mathcal{R}^d}e(b_1,\dots,b_k) \le 2\sup_{b_1,\dots,b_k\in\mathcal{R}^d}|e_n(b_1,\dots,b_k)-e(b_1,\dots,b_k)|,$$
and that for every $\varepsilon>0$,
$$\mathbf{P}\left\{\sup_{b_1,\dots,b_k\in\mathcal{R}^d}|e_n(b_1,\dots,b_k)-e(b_1,\dots,b_k)|>\varepsilon\right\} \le 4e^8n^{2k(d+1)}e^{-n\varepsilon^2/(32M^4)}.$$
Conclude that the error of the empirically optimal clustering converges to that of the truly optimal one as $n\to\infty$ (Pollard (1981; 1982), Linder, Lugosi, and Zeger (1994)). Hint: For the first part proceed as in the proof of Lemma 8.2. For the second part use the technique shown in Corollary 29.1. To compute the vc dimension, exploit Corollary 13.2.

Problem 29.5. Let $\psi_1,\dots,\psi_k$ be indicator functions of cells of a $k$-way partition of $\mathcal{R}^d$. Consider generalized linear classifiers based on these functions. Show that the classifier obtained by minimizing the number of errors made on the training sequence is the same as the classifier obtained by minimizing the empirical squared error. Point out that this is just the histogram classifier based on the partition defined by the $\psi_i$'s (Csibi (1975)).


30 Neural Networks

30.1 Multilayer Perceptrons

The linear discriminant or perceptron (see Chapter 4) makes a decision
$$\phi(x) = \begin{cases}0 & \text{if } \psi(x)\le 1/2\\ 1 & \text{otherwise,}\end{cases}$$
based upon a linear combination $\psi(x)$ of the inputs,
$$\psi(x) = c_0+\sum_{i=1}^d c_ix^{(i)} = c_0+c^Tx, \tag{30.1}$$
where the $c_i$'s are weights, $x = (x^{(1)},\dots,x^{(d)})^T$, and $c = (c_1,\dots,c_d)^T$. This is called a neural network without hidden layers (see Figure 4.1).

In a (feed-forward) neural network with one hidden layer, one takes
$$\psi(x) = c_0+\sum_{i=1}^k c_i\sigma(\psi_i(x)), \tag{30.2}$$
where the $c_i$'s are as before, and each $\psi_i$ is of the form given in (30.1): $\psi_i(x) = b_i+\sum_{j=1}^d a_{ij}x^{(j)}$ for some constants $b_i$ and $a_{ij}$. The function $\sigma$ is called a sigmoid. We define sigmoids to be nondecreasing functions with $\sigma(x)\to -1$ as $x\to -\infty$ and $\sigma(x)\to 1$ as $x\to\infty$. Examples include:

(1) the threshold sigmoid
$$\sigma(x) = \begin{cases}-1 & \text{if } x\le 0\\ 1 & \text{if } x>0;\end{cases}$$

(2) the standard, or logistic, sigmoid
$$\sigma(x) = \frac{1-e^{-x}}{1+e^{-x}};$$

(3) the arctan sigmoid
$$\sigma(x) = \frac{2}{\pi}\arctan(x);$$

(4) the gaussian sigmoid
$$\sigma(x) = 2\int_{-\infty}^x\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du-1.$$

Figure 30.1. A neural network with one hidden layer. The hidden neurons are those within the frame.

Figure 30.2. The threshold, standard, arctan, and gaussian sigmoids.
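As an illustration of (30.1) and (30.2), the sketch below (our own code, with hypothetical function names; not part of the book) evaluates a one-hidden-layer network with the logistic sigmoid and thresholds it at $1/2$:

```python
import numpy as np

def logistic_sigmoid(x):
    """The standard (logistic) sigmoid (1 - e^{-x}) / (1 + e^{-x}), values in (-1, 1)."""
    return np.tanh(x / 2.0)      # algebraically identical, numerically safer

def psi(x, a, b, c, c0):
    """One-hidden-layer network (30.2): psi(x) = c0 + sum_i c_i * sigma(a_i^T x + b_i).
    a is a k x d matrix of input weights; b and c are vectors of length k."""
    hidden = logistic_sigmoid(a @ x + b)   # outputs u_1, ..., u_k of the hidden neurons
    return c0 + c @ hidden

def classify(x, a, b, c, c0):
    """Decide 0 if psi(x) <= 1/2 and 1 otherwise."""
    return 0 if psi(x, a, b, c, c0) <= 0.5 else 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 2, 5
    a = rng.standard_normal((k, d))
    b = rng.standard_normal(k)
    c = rng.standard_normal(k)
    print(classify(np.array([0.3, -1.2]), a, b, c, c0=0.1))
```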


For early discussion of multilayer perceptrons, see Rosenblatt (1962), Barron (1975), Nilsson (1965), and Minsky and Papert (1969). Surveys may be found in Barron and Barron (1988), Ripley (1993; 1994), Hertz, Krogh, and Palmer (1991), and Weiss and Kulikowski (1991).

In the perceptron with one hidden layer, we say that there are $k$ hidden neurons; the output of the $i$-th hidden neuron is $u_i = \sigma(\psi_i(x))$. Thus, (30.2) may be rewritten as
$$\psi(x) = c_0+\sum_{i=1}^k c_iu_i,$$
which is similar in form to (30.1). We may continue this process and create multilayer feed-forward neural networks. For example, a two-hidden-layer perceptron uses
$$\psi(x) = c_0+\sum_{i=1}^l c_iz_i,$$
where
$$z_i = \sigma\left(d_{i0}+\sum_{j=1}^k d_{ij}u_j\right),\quad 1\le i\le l,$$
and
$$u_j = \sigma\left(b_j+\sum_{i=1}^d a_{ji}x^{(i)}\right),\quad 1\le j\le k,$$
and the $d_{ij}$'s, $b_j$'s, and $a_{ji}$'s are constants. The first hidden layer has $k$ hidden neurons, while the second hidden layer has $l$ hidden neurons.

Figure 30.3. A feed-forward neural network with two hidden layers.


The step from perceptron to a one-hidden-layer neural network is nontrivial. We know that linear discriminants cannot possibly lead to universally consistent rules. Fortunately, one-hidden-layer neural networks yield universally consistent discriminants provided that we allow $k$, the number of hidden neurons, to grow unboundedly with $n$. The interest in neural networks is undoubtedly due to the possibility of implementing them directly via processors and circuits. As the hardware is fixed beforehand, one does not have the luxury to let $k$ become a function of $n$, and thus, the claimed universal consistency is a moot point. We will deal with both fixed architectures and variable-sized neural networks. Because of the universal consistency of one-hidden-layer neural networks, there is little theoretical gain in considering neural networks with more than one hidden layer. There may, however, be an information-theoretic gain as the number of hidden neurons needed to achieve the same performance may be substantially reduced. In fact, we will make a case for two hidden layers, and show that after two hidden layers, little is gained for classification.

For theoretical analysis, the neural networks are rooted in a classical theorem by Kolmogorov (1957) and Lorentz (1976) which states that every continuous function $f$ on $[0,1]^d$ can be written as
$$f(x) = \sum_{i=1}^{2d+1}F_i\left(\sum_{j=1}^d G_{ij}(x^{(j)})\right),$$
where the $G_{ij}$'s and the $F_i$'s are continuous functions whose form depends on $f$. We will see that neural networks approximate any measurable function with arbitrary precision, despite the fact that the form of the sigmoids is fixed beforehand.

As an example, consider $d=2$. The function $x^{(1)}x^{(2)}$ is rewritten as
$$\frac{1}{4}\left(\left(x^{(1)}+x^{(2)}\right)^2-\left(x^{(1)}-x^{(2)}\right)^2\right),$$
which is in the desired form. However, it is much less obvious how one would rewrite more general continuous functions. In fact, in neural networks, we approximate the $G_{ij}$'s and $F_i$'s by functions of the form $\sigma(b+a^Tx)$ and allow the number of tunable coefficients to be high enough such that any continuous function may be represented, though no longer rewritten exactly in the form of Kolmogorov and Lorentz. We discuss other examples of approximations based upon such representations in a later section.


Figure 30.4. The general Kolmogorov-Lorentz representation of a continuous function.

30.2 Arrangements

A finite set $\mathcal{A}$ of hyperplanes in $\mathcal{R}^d$ partitions the space into connected convex polyhedral pieces of various dimensions. Such a partition $\mathcal{P} = \mathcal{P}(\mathcal{A})$ is called an arrangement. An arrangement is called simple if any $d$ hyperplanes of $\mathcal{A}$ have a unique point in common and if $d+1$ hyperplanes have no point in common.

Figure 30.5. An arrangement of five lines in the plane.

Figure 30.6. An arrangement classifier.

A simple arrangement creates polyhedral cells. Interestingly, the number of these cells is independent of the actual configuration of the hyperplanes.


In particular, the number of cells is exactly $2^k$ if $d\ge k$, and
$$\sum_{i=0}^d\binom{k}{i}$$
if $d<k$, where $|\mathcal{A}| = k$. For a proof of this, see Problem 22.1, or Lemma 1.2 of Edelsbrunner (1987). For general arrangements, this is merely an upper bound.
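For instance, the count just stated is easy to evaluate; the tiny sketch below (our own illustration, with a hypothetical function name) does so:

```python
from math import comb

def cells_of_simple_arrangement(k, d):
    """Number of cells of a simple arrangement of k hyperplanes in R^d:
    2^k if d >= k, and sum_{i=0}^{d} C(k, i) otherwise."""
    if d >= k:
        return 2 ** k
    return sum(comb(k, i) for i in range(d + 1))

# e.g., five lines in general position in the plane create 16 cells
print(cells_of_simple_arrangement(5, 2))
```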

We may of course use arrangements for designing classifiers. We let $g_\mathcal{A}$ be the natural classifier obtained by taking majority votes over all $Y_i$'s for which $X_i$ is in the same cell of the arrangement $\mathcal{P} = \mathcal{P}(\mathcal{A})$ as $x$.

All classifiers discussed in this section possess the property that they are invariant under linear transformations and universally consistent (in some cases, we assume that $X$ has a density, but that is only done to avoid messy technicalities).

If we fix $k$ and find that $\mathcal{A}$ with $|\mathcal{A}| = k$ for which the empirical error
$$L_n^{(R)} = \frac{1}{n}\sum_{i=1}^n I_{\{g_\mathcal{A}(X_i)\ne Y_i\}}$$
is minimal, we obtain (perhaps at great computational expense) the empirical risk optimized classifier. There is a general theorem for such classifiers (see, for example, Corollary 23.2), the conditions of which are as follows:

(1) It must be possible to select a given sequence of $\mathcal{A}$'s for which $L_n(g_\mathcal{A})$ (the conditional probability of error with $g_\mathcal{A}$) tends to $L^*$ in probability. But if $k\to\infty$, we may align the hyperplanes with the axes, and create a cubic histogram, for which, by Theorem 6.2, we have consistency if the grid expands to $\infty$ and the cell sizes in the grid shrink to $0$. Thus, as $k\to\infty$, this condition holds trivially.

(2) The collection $\mathcal{G} = \{g_\mathcal{A}\}$ is not too rich, in the sense that $n/\log S(\mathcal{G},n)\to\infty$, where $S(\mathcal{G},n)$ denotes the shatter coefficient of $\mathcal{G}$, that is, the maximal number of ways $(x_1,y_1),\dots,(x_n,y_n)$ can be split by sets of the form
$$\left(\bigcup_{A\in\mathcal{P}(\mathcal{A})}A\times\{0\}\right)\bigcup\left(\bigcup_{A\in\mathcal{P}(\mathcal{A})}A\times\{1\}\right).$$
If $|\mathcal{A}| = 1$, we know that $S(\mathcal{G},n)\le 2(n^d+1)$ (see Chapter 13). For $|\mathcal{A}| = k$, a trivial upper bound is
$$\left(2(n^d+1)\right)^k.$$
The consistency condition is fulfilled if $k = o(n/\log n)$.


We have

Theorem 30.1. The empirical-risk-optimized arrangement classifier based upon arrangements with $|\mathcal{A}|\le k$ has $\mathbf{E}L_n\to L^*$ for all distributions if $k\to\infty$ and $k = o(n/\log n)$.

Arrangements can also be made from the data at hand in a simpler way. Fix $k$ points $X_1,\dots,X_k$ in general position and look at all possible $\binom{k}{d}$ hyperplanes you can form with these points. These form your collection $\mathcal{A}$, which defines your arrangement. No optimization of any kind is performed. We take the natural classifier obtained by a majority vote within the cells of the partition over $(X_{k+1},Y_{k+1}),\dots,(X_n,Y_n)$.

Figure 30.7. Arrangement determined by $k=4$ data points on the plane.

Here we cannot apply the powerful consistency theorem mentioned above. Also, the arrangement is no longer simple. Nevertheless, the partition of space depends on the $X_i$'s only, and thus Theorem 6.1 (together with Lemma 20.1) is useful. The rule thus obtained is consistent if $\mathrm{diam}(A(X))\to 0$ in probability and the number of cells is $o(n)$, where $A(X)$ is the cell to which $X$ belongs in the arrangement. As the number of cells is certainly not more than
$$\sum_{i=0}^d\binom{k'}{i},$$
where $k' = \binom{k}{d}$, we see that the number of cells divided by $n$ tends to zero if
$$k^{d^2}/n\to 0.$$
This puts a severe restriction on the growth of $k$. However, it is easy to prove the following:

Lemma 30.1. If $k\to\infty$, then $\mathrm{diam}(A(X))\to 0$ in probability whenever $X$ has a density.

Proof. As noted in Chapter 20 (see Problem 20.6), the set of all $x$ for which, for all $\varepsilon>0$, we have $\mu(x+\varepsilon Q_i)>0$ for all quadrants $Q_1,\dots,Q_{2^d}$ having one vertex at $(0,0,\dots,0)$ and sides of length one, has $\mu$-measure one. For such $x$, if at least one of the $X_i$'s ($i\le k$) falls in each of the $2^d$ quadrants $x+\varepsilon Q_i$, then $\mathrm{diam}(A(x))\le 2d\varepsilon$ (see Figure 30.8).


Figure 30.8. The diameter of the cell containing $x$ is less than $4\varepsilon$ if there is a data point in each of the four quadrants of size $\varepsilon$ around $x$.

Therefore, for arbitrary $\varepsilon>0$,
$$\mathbf{P}\{\mathrm{diam}(A(x))>2d\varepsilon\} \le 2^d\left(1-\min_{1\le i\le 2^d}\mu(x+\varepsilon Q_i)\right)^k\to 0.$$
Thus, by the Lebesgue dominated convergence theorem,
$$\mathbf{P}\{\mathrm{diam}(A(X))>2d\varepsilon\}\to 0.\quad\Box$$

Theorem 30.2. The arrangement classifier defined above is consistent whenever $X$ has a density and
$$k\to\infty,\quad k^{d^2} = o(n).$$

The theorem points out that empirical error minimization over a finite set of arrangements can also be consistent. Such a set may be formed as the collection of arrangements consisting of hyperplanes through $d$ points of $X_1,\dots,X_k$. As nothing new is added here to the discussion, we refer the reader to Problem 30.1.

So how do we deal with arrangements in a computer? Clearly, to reach a cell, we find for each hyperplane $A\in\mathcal{A}$ the side to which $x$ belongs. If $f(x) = a^Tx+a_0$, then $f(x)>0$ in one halfplane, $f(x)=0$ on the hyperplane, and $f(x)<0$ in the other halfplane. If $\mathcal{A} = \{A_1,\dots,A_k\}$, the vector $(I_{\{H_1(x)>0\}},\dots,I_{\{H_k(x)>0\}})$ describes the cell to which $x$ belongs, where $H_i(x)$ is a linear function that is positive if $x$ is on one side of the hyperplane $A_i$, negative if $x$ is on the other side of $A_i$, and $0$ if $x\in A_i$. A decision is thus reached in time $O(kd)$. More importantly, the whole process is easily parallelizable and can be pictured as a battery of perceptrons. It is easy to see that the classifier depicted in Figure 30.9 is identical to the arrangement classifier. In neural network terminology, the first hidden layer of neurons corresponds to just $k$ perceptrons (and has $k(d+1)$ weights or parameters, if you wish). The first layer outputs a $k$-vector of bits that pinpoints the precise location of $x$ in the cells of the arrangement. The second layer only assigns a class (decision) to each cell of the arrangement by firing up one neuron. It has $2^k$ neurons (for class assignments), but of course, in natural classifiers, these neurons do not require training or learning; the majority vote takes care of that.
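A minimal sketch of this computation (our own illustration; the function names and the dictionary-based cell lookup are assumptions, not the book's implementation): the bit vector $(I_{\{H_1(x)>0\}},\dots,I_{\{H_k(x)>0\}})$ identifies the cell, and a majority vote over the training labels falling in that cell gives the decision.

```python
import numpy as np

def cell_signature(x, hyperplanes):
    """Bit vector (I_{H_1(x)>0}, ..., I_{H_k(x)>0}) identifying the cell of x.
    hyperplanes is a list of pairs (a, a0) representing H(x) = a^T x + a0."""
    return tuple(int(a @ x + a0 > 0) for a, a0 in hyperplanes)

def fit_arrangement_classifier(hyperplanes, xs, ys):
    """Majority vote of the training labels within each cell of the arrangement."""
    votes = {}
    for x, y in zip(xs, ys):
        key = cell_signature(x, hyperplanes)
        ones, total = votes.get(key, (0, 0))
        votes[key] = (ones + y, total + 1)

    def classify(x):
        ones, total = votes.get(cell_signature(x, hyperplanes), (0, 0))
        return 1 if 2 * ones > total else 0   # empty cells default to class 0

    return classify
```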


Figure 30.9. Arrangement classifier realized by a two-hidden-layer neural network. Each of the $2^k$ cells in the second hidden layer performs an “and” operation: the output of node “101” is 1 if its three inputs are 1, 0, and 1, respectively. Otherwise its output is 0. Thus, one and only one of the $2^k$ outputs is 1.

If a more classical second layer is needed (without boolean operations), let $b = (b_1,\dots,b_k)$ be the $k$-vector of bits seen at the output of the first layer. Assign a perceptron in the second layer to each region of the arrangement and define the output $z\in\{-1,1\}$ to be $\sigma\left(\sum_{j=1}^k c_j(2b_j-1)-k+\frac12\right)$, where $c_j\in\{-1,1\}$ are weights. For each region of the arrangement, we have a description in terms of $c = (c_1,\dots,c_k)$. The argument of the sigmoid function is $1/2$ if $2b_j-1 = c_j$ for all $j$ and is negative otherwise. Hence $z = 1$ if and only if $2b-1 = c$. Assume we now take a decision based upon the sign of
$$\sum_l w_lz_l+w_0,$$
where the $w_l$'s are weights and the $z_l$'s are the outputs of the second hidden layer. Assume that we wish to assign class 1 to $s$ regions in the arrangement and class 0 to $t$ other regions. For a class 1 region $l$, set $w_l = 1$, and for a class 0 region, set $w_l = -1$. Define $w_0 = 1+s-t$. Then, if $z_j = 1$, $z_i = -1$, $i\ne j$,
$$\sum_l w_lz_l+w_0 = w_j+w_0-\sum_{i\ne j}w_i = \begin{cases}1 & \text{if } w_j = 1\\ -1 & \text{if } w_j = -1.\end{cases}$$


Figure 30.10. The second hidden layer of a two-hidden-layer neural network with threshold sigmoids in the first layer. For each $k$-vector of bits $b = (b_1,\dots,b_k)$ at the output of the first layer, we may find a decision $g(b)\in\{0,1\}$. Now return to the two-hidden-layer network of Figure 30.9 and assign the values $g(b)$ to the neurons in the second hidden layer to obtain an equivalent network.

Thus, every arrangement classifier corresponds to a neural network with two hidden layers, and threshold units. The correspondence is also reciprocal. Assume someone shows a two-hidden-layer neural network with the first hidden layer as above (thus, outputs consist of a vector of $k$ bits) and with the second hidden layer consisting once again of a battery of perceptrons (see Figure 30.10). Whatever happens in the second hidden layer, the decision is just a function of the configuration of $k$ input bits. The output of the first hidden layer is constant over each region of the arrangement defined by the hyperplanes given by the input weights of the units of the first layer. Thus, the neural network classifier with threshold units in the first hidden layer is equivalent to an arrangement classifier with the same number of hyperplanes as units in the first hidden layer. The equivalence with tree classifiers is described in Problem 30.2.

Of course, equivalence is only valid up to a certain point. If the number of neurons in the second layer is small, then neural networks are more restricted. This could be an advantage in training. However, the majority vote in an arrangement classifier avoids training of the second layer's neurons altogether, and offers at the same time an easier interpretation of the classifier. Conditions on consistency of general two-hidden-layer neural networks will be given in Section 30.4.


30.3 Approximation by Neural Networks

Consider first the class $\mathcal{C}^{(k)}$ of classifiers (30.2) that contains all neural network classifiers with the threshold sigmoid and $k$ hidden nodes in two hidden layers. The training data $D_n$ are used to select a classifier from $\mathcal{C}^{(k)}$. For good performance of the selected rule, it is necessary that the best rule in $\mathcal{C}^{(k)}$ has probability of error close to $L^*$, that is, that
$$\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^*$$
is small. We call this quantity the approximation error. Naturally, for fixed $k$, the approximation error is positive for most distributions. However, for large $k$, it is expected to be small. The question is whether the last statement is true for all distributions of $(X,Y)$. We showed in the section on arrangements that the class of two-hidden-layer neural network classifiers with $m$ nodes in the first layer and $2^m$ nodes in the second layer contains all arrangement classifiers with $m$ hyperplanes. Therefore, for $k = m+2^m$, the class of all arrangement classifiers with $m$ hyperplanes is a subclass of $\mathcal{C}^{(k)}$. From this, we easily obtain the following approximation result:

Theorem 30.3. If $\mathcal{C}^{(k)}$ is the class of all neural network classifiers with the threshold sigmoid and $k$ neurons in two hidden layers, then
$$\lim_{k\to\infty}\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^* = 0$$
for all distributions of $(X,Y)$.

It is more surprising that the same property holds if $\mathcal{C}^{(k)}$ is the class of one-hidden-layer neural networks with $k$ hidden neurons, and an arbitrary sigmoid. More precisely, $\mathcal{C}^{(k)}$ is the class of classifiers
$$\phi(x) = \begin{cases}0 & \text{if } \psi(x)\le 1/2\\ 1 & \text{otherwise,}\end{cases}$$
where $\psi$ is as in (30.2). By Theorem 2.2, we have
$$L(\phi)-L^* \le 2\mathbf{E}|\psi(X)-\eta(X)|,$$
where $\eta(x) = \mathbf{P}\{Y=1|X=x\}$. Thus, $\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^*\to 0$ as $k\to\infty$ if
$$\mathbf{E}|\psi_k(X)-\eta(X)|\to 0$$
for some sequence $\psi_k$ with $\phi_k\in\mathcal{C}^{(k)}$ for $\phi_k(x) = I_{\{\psi_k(x)>1/2\}}$. For universal consistency, we need only assure that the family of $\psi$'s can approximate any $\eta$ in the $L_1(\mu)$ sense. In other words, the approximation error $\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^*$ converges to zero if the class of functions $\psi$ is dense in $L_1(\mu)$ for every $\mu$. Another sufficient condition for this (but of course much too severe) is that the class $\mathcal{F}$ of functions $\psi$ becomes dense in the $L_\infty$ (supremum) norm in the space of continuous functions $C[a,b]^d$ on $[a,b]^d$, where $[a,b]^d$ denotes the hyperrectangle of $\mathcal{R}^d$ defined by its opposite vertices $a$ and $b$, for any $a$ and $b$.

Lemma 30.2. Assume that a sequence of classes of functions $\mathcal{F}_k$ becomes dense in the $L_\infty$ norm in the space of continuous functions $C[a,b]^d$ (where $[a,b]^d$ is the hyperrectangle of $\mathcal{R}^d$ defined by $a,b$). In other words, assume that for every $a,b\in\mathcal{R}^d$, and every continuous function $g$,
$$\lim_{k\to\infty}\inf_{f\in\mathcal{F}_k}\sup_{x\in[a,b]^d}|f(x)-g(x)| = 0.$$
Then for any distribution of $(X,Y)$,
$$\lim_{k\to\infty}\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^* = 0,$$
where $\mathcal{C}^{(k)}$ is the class of classifiers $\phi(x) = I_{\{\psi(x)>1/2\}}$ for $\psi\in\mathcal{F}_k$.

Proof. For fixed $\varepsilon>0$, find $a,b$ such that $\mu([a,b]^d)\ge 1-\varepsilon/3$, where $\mu$ is the probability measure of $X$. Choose a continuous function $\tilde\eta$ vanishing off $[a,b]^d$ such that
$$\mathbf{E}|\tilde\eta(X)-\eta(X)| \le \frac{\varepsilon}{6}.$$
Find $k$ and $f\in\mathcal{F}_k$ such that
$$\sup_{x\in[a,b]^d}|f(x)-\tilde\eta(x)| \le \frac{\varepsilon}{6}.$$
For $\phi(x) = I_{\{f(x)>1/2\}}$, we have, by Theorem 2.2,
$$\begin{aligned}
L(\phi)-L^* &\le 2\mathbf{E}\left\{|f(X)-\eta(X)|I_{\{X\in[a,b]^d\}}\right\}+\frac{\varepsilon}{3}\\
&\le 2\mathbf{E}\left\{|f(X)-\tilde\eta(X)|I_{\{X\in[a,b]^d\}}\right\}+2\mathbf{E}|\tilde\eta(X)-\eta(X)|+\frac{\varepsilon}{3}\\
&\le 2\sup_{x\in[a,b]^d}|f(x)-\tilde\eta(x)|+2\mathbf{E}|\tilde\eta(X)-\eta(X)|+\frac{\varepsilon}{3}\\
&\le \varepsilon.\quad\Box
\end{aligned}$$

This text is basically about all such good families, such as families that are obtainable by summing kernel functions, and histogram families. The first results for approximation with neural networks with one hidden layer appeared in 1989, when Cybenko (1989), Hornik, Stinchcombe, and White (1989), and Funahashi (1989) proved independently that feedforward neural networks with one hidden layer are dense with respect to the supremum norm on bounded sets in the set of continuous functions. In other words, by taking $k$ large enough, every continuous function on $\mathcal{R}^d$ can be approximated arbitrarily closely, uniformly over any bounded set, by functions realized by neural networks with one hidden layer. For a survey of various denseness results we refer to Barron (1989) and Hornik (1993). The proof given here uses ideas of Chen, Chen, and Liu (1990). It uses the denseness of the class of trigonometric polynomials in the $L_\infty$ sense for $C[0,1]^d$ (this is a special case of the Stone-Weierstrass theorem; see Theorem A.9 in the Appendix), that is, by functions of the form
$$\sum_i\left(\alpha_i\cos(2\pi a_i^Tx)+\beta_i\sin(2\pi b_i^Tx)\right),$$
where $a_i,b_i$ are integer-valued vectors of $\mathcal{R}^d$.

Theorem 30.4. For every continuous function $f: [a,b]^d\to\mathcal{R}$ and for every $\varepsilon>0$, there exists a neural network with one hidden layer and function $\psi(x)$ as in (30.2) such that
$$\sup_{x\in[a,b]^d}|f(x)-\psi(x)| < \varepsilon.$$

Proof. We prove the theorem for the threshold sigmoid
$$\sigma(x) = \begin{cases}-1 & \text{if } x\le 0\\ 1 & \text{if } x>0.\end{cases}$$
The extension to general nondecreasing sigmoids is left as an exercise (Problem 30.3). Fix $\varepsilon>0$. We take the Fourier series approximation of $f(x)$. By the Stone-Weierstrass theorem (Theorem A.9), there exists a large positive integer $M$, nonzero real coefficients $\alpha_1,\dots,\alpha_M,\beta_1,\dots,\beta_M$, and integers $m_{i,j}$ for $i=1,\dots,M$, $j=1,\dots,d$, such that
$$\sup_{x\in[a,b]^d}\left|\sum_{i=1}^M\left(\alpha_i\cos\left(\pi_am_i^Tx\right)+\beta_i\sin\left(\pi_am_i^Tx\right)\right)-f(x)\right| < \frac{\varepsilon}{2},$$
where $m_i = (m_{i,1},\dots,m_{i,d})$, $i=1,\dots,M$. It is clear that every continuous function on the real line that is zero outside some bounded interval can be arbitrarily closely approximated uniformly on the interval by one-dimensional neural networks, that is, by functions of the form
$$\sum_{i=1}^k c_i\sigma(a_ix+b_i)+c_0.$$
Just observe that the indicator function of an interval $[b,c]$ may be written as $\sigma(x-b)+\sigma(-x+c)$. This implies that bounded functions such as $\sin$


and $\cos$ can be approximated arbitrarily closely by neural networks. In particular, there exist neural networks $u_i(x),v_i(x)$, $i=1,\dots,M$ (i.e., mappings from $\mathcal{R}^d$ to $\mathcal{R}$), such that
$$\sup_{x\in[a,b]^d}\left|u_i(x)-\cos\left(\pi_am_i^Tx\right)\right| < \frac{\varepsilon}{4M|\alpha_i|}$$
and
$$\sup_{x\in[a,b]^d}\left|v_i(x)-\sin\left(\pi_am_i^Tx\right)\right| < \frac{\varepsilon}{4M|\beta_i|}.$$
Therefore, applying the triangle inequality we get
$$\sup_{x\in[a,b]^d}\left|\sum_{i=1}^M\left(\alpha_i\cos\left(\pi_am_i^Tx\right)+\beta_i\sin\left(\pi_am_i^Tx\right)\right)-\sum_{i=1}^M\left(\alpha_iu_i(x)+\beta_iv_i(x)\right)\right| < \frac{\varepsilon}{2}.$$
Since the $u_i$'s and $v_i$'s are neural networks, their linear combination
$$\psi(x) = \sum_{i=1}^M\left(\alpha_iu_i(x)+\beta_iv_i(x)\right)$$
is a neural network too and, in fact,
$$\begin{aligned}
\sup_{x\in[a,b]^d}|f(x)-\psi(x)|
&\le \sup_{x\in[a,b]^d}\left|f(x)-\sum_{i=1}^M\left(\alpha_i\cos\left(\pi_am_i^Tx\right)+\beta_i\sin\left(\pi_am_i^Tx\right)\right)\right|\\
&\quad+\sup_{x\in[a,b]^d}\left|\sum_{i=1}^M\left(\alpha_i\cos\left(\pi_am_i^Tx\right)+\beta_i\sin\left(\pi_am_i^Tx\right)\right)-\psi(x)\right|\\
&< \frac{2\varepsilon}{2} = \varepsilon.\quad\Box
\end{aligned}$$
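The one-dimensional approximation step used in the proof is easy to check numerically. The sketch below is our own illustration (the grid construction and function names are assumptions): it builds a staircase of the form $c_0+\sum_i c_i\sigma(x-b_i)$ with the threshold sigmoid and measures how well it tracks a continuous function such as $\cos$ on $[0,2\pi]$.

```python
import numpy as np

def threshold_sigmoid(x):
    return np.where(x > 0, 1.0, -1.0)

def approximate_on_grid(f, lo, hi, k):
    """Approximate a continuous f on [lo, hi] by
    psi(x) = c0 + sum_i c_i * sigma(x - b_i) with the threshold sigmoid,
    using k equally spaced pieces (a staircase approximation)."""
    b = np.linspace(lo, hi, k + 1)[1:-1]      # jump locations b_1, ..., b_{k-1}
    levels = f(np.linspace(lo, hi, k))        # target value on each of the k pieces
    c = np.diff(levels) / 2.0                 # each step is (h/2) * (sigma(x - b) + 1)
    c0 = levels[0] + np.sum(c)                # absorbs the constant parts of the steps
    return lambda x: c0 + threshold_sigmoid(np.subtract.outer(x, b)) @ c

if __name__ == "__main__":
    psi = approximate_on_grid(np.cos, 0.0, 2.0 * np.pi, k=200)
    xs = np.linspace(0.0, 2.0 * np.pi, 1000)
    print(np.max(np.abs(psi(xs) - np.cos(xs))))   # sup error shrinks as k grows
```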

The convergence may be arbitrarily slow for some $f$. By restricting the class of functions, it is possible to obtain upper bounds for the rate of convergence. For an example, see Barron (1993). The following corollary of Theorem 30.4 is obtained via Lemma 30.2:

Corollary 30.1. Let $\mathcal{C}^{(k)}$ contain all neural network classifiers defined by networks of one hidden layer with $k$ hidden nodes, and an arbitrary sigmoid $\sigma$. Then for any distribution of $(X,Y)$,
$$\lim_{k\to\infty}\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^* = 0.$$


The above convergence also holds if the range of the parameters $a_{ij},b_i,c_i$ is restricted to an interval $[-\beta_k,\beta_k]$, where $\lim_{k\to\infty}\beta_k = \infty$.

Remark. It is also true that the class of one-hidden-layer neural networks with $k$ hidden neurons becomes dense in $L_1(\mu)$ for every probability measure $\mu$ on $\mathcal{R}^d$ as $k\to\infty$ (see Problem 30.4). Then Theorem 2.2 may be used directly to prove Corollary 30.1. □

In practice, the network architecture (i.e., $k$ in our case) is given to the designer, who can only adjust the parameters $a_{ij}$, $b_i$, and $c_i$, depending on the data $D_n$. In this respect, the above results are only of theoretical interest. It is more interesting to find out how far the error probability of the chosen rule is from $\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)$. We discuss this problem in the next few sections.

30.4 VC Dimension

Assume now that the data $D_n = ((X_1,Y_1),\dots,(X_n,Y_n))$ are used to tune the parameters of the network. To choose a classifier from $\mathcal{C}^{(k)}$, we focus on the difference between the probability of error of the selected rule and that of the best classifier in $\mathcal{C}^{(k)}$. Recall from Chapters 12 and 14 that the vc dimension $V_{\mathcal{C}^{(k)}}$ of the class $\mathcal{C}^{(k)}$ determines the performance of some learning algorithms. Theorem 14.5 tells us that no method of picking a classifier from $\mathcal{C}^{(k)}$ can guarantee better than $\Omega\left(\sqrt{V_{\mathcal{C}^{(k)}}/n}\right)$ performance uniformly for all distributions. Thus, for meaningful distribution-free performance guarantees, the sample size $n$ has to be significantly larger than the vc dimension. On the other hand, by Corollary 12.1, there exists a way of choosing the parameters of the network, namely, by minimization of the empirical error probability, such that the obtained classifier $\phi_n^*$ satisfies
$$\mathbf{E}\{L(\phi_n^*)\}-\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi) \le 16\sqrt{\frac{V_{\mathcal{C}^{(k)}}\log n+4}{2n}}$$
for all distributions. On the other hand, if $V_{\mathcal{C}^{(k)}} = \infty$, then for any $n$ and any rule, some bad distributions exist that induce very large error probabilities (see Theorem 14.3).

We start with a universal lower bound on the vc dimension of networks with one hidden layer.

Theorem 30.5. (Baum (1988)). Let $\sigma$ be an arbitrary sigmoid and consider the class $\mathcal{C}^{(k)}$ of neural net classifiers with $k$ nodes in one hidden layer. Then
$$V_{\mathcal{C}^{(k)}} \ge 2\left\lfloor\frac{k}{2}\right\rfloor d.$$


Proof. We prove the statement for the threshold sigmoid, and leave the extension as an exercise (Problem 30.7). We need to show that there is a set of $n = 2\lfloor k/2\rfloor d$ points in $\mathcal{R}^d$ that can be shattered by sets of the form $\{x:\psi(x)>1/2\}$, where $\psi$ is a one-hidden-layer neural network with $k$ hidden nodes. Clearly, it suffices to prove this for even $k$. In fact, we prove more: if $k$ is even, any set of $n = kd$ points in general position can be shattered (points are in general position if no $d+1$ points fall on the same $(d-1)$-dimensional hyperplane). Let $\{x_1,\dots,x_n\}$ be a set of $n = kd$ such points. For each subset of this set, we construct a neural network $\psi$ with $k$ hidden nodes such that $\psi(x_i)>1/2$ if and only if $x_i$ is a member of this subset. We may assume without loss of generality that the cardinality of the subset to be picked out is at most $n/2$, since otherwise we can use $1/2-\psi(x)$, where $\psi$ picks out the complement of the subset. Partition the subset into at most $n/(2d) = k/2$ groups, each containing at most $d$ points. For each such group, there exists a hyperplane $\{a^Tx+b = 0\}$ that contains these points, but no other point from $\{x_1,\dots,x_n\}$. Moreover, there exists a small positive number $h$ such that $a^Tx_i+b\in[-h,h]$ if and only if $x_i$ is among this group of at most $d$ points. Therefore, the simple network
$$\sigma(a^Tx+b+h)+\sigma(-a^Tx-b+h)$$
is larger than 0 on $x_i$ for exactly these $x_i$'s. Denote the vectors $a$ and parameters $b,h$ obtained for the $k/2$ groups by $a_1,\dots,a_{k/2}$, $b_1,\dots,b_{k/2}$, and $h_1,\dots,h_{k/2}$. Let $h = \min_{j\le k/2}h_j$. It is easy to see that
$$\psi(x) = \sum_{j=1}^{k/2}\left(\sigma(a_j^Tx+b_j+h)+\sigma(-a_j^Tx-b_j+h)\right)+\frac{1}{2}$$
is larger than $1/2$ for exactly the desired $x_i$'s. This network has $k$ hidden nodes. □

Theorem 30.5 implies that there is no hope for good performance guarantees unless the sample size is much larger than $kd$. Recall Chapter 14, where we showed that $n\gg V_{\mathcal{C}^{(k)}}$ is necessary for a guaranteed small error probability, regardless of the method of tuning the parameters. Bartlett (1993) improved Theorem 30.5 in several ways. For example, he proved that
$$V_{\mathcal{C}^{(k)}} \ge d\min\left(k,\frac{2^d}{d^2/2+d+1}\right)+1.$$
Bartlett also obtained similar lower bounds for not fully connected networks (see Problem 30.9) and for two-hidden-layer networks.

Next we show that for the threshold sigmoid, the bound of Theorem 30.5 is tight up to a logarithmic factor, that is, the vc dimension is at most of the order of $kd\log k$.


Theorem 30.6. (Baum and Haussler (1989)). Let $\sigma$ be the threshold sigmoid and let $\mathcal{C}^{(k)}$ be the class of neural net classifiers with $k$ nodes in the hidden layer. Then the shatter coefficients satisfy
$$S(\mathcal{C}^{(k)},n) \le \left(\sum_{i=0}^{d+1}\binom{n}{i}\right)^k\left(\sum_{i=0}^{k+1}\binom{n}{i}\right) \le \left(\frac{ne}{d+1}\right)^{k(d+1)}\left(\frac{ne}{k+1}\right)^{k+1} \le (ne)^{kd+2k+1},$$
which implies that for all $k,d\ge 1$,
$$V_{\mathcal{C}^{(k)}} \le (2kd+4k+2)\log_2(e(kd+2k+1)).$$

Proof. Fix $n$ points $x_1,\dots,x_n\in\mathcal{R}^d$. We bound the number of different values of the vector $(\phi(x_1),\dots,\phi(x_n))$ as $\phi$ ranges through $\mathcal{C}^{(k)}$. A node $j$ in the hidden layer realizes a dichotomy of the $n$ points by a hyperplane split. By Theorems 13.9 and 13.3, this can be done in at most $\sum_{i=0}^{d+1}\binom{n}{i}\le(ne/(d+1))^{d+1}$ different ways. The different splittings obtained at the $k$ nodes determine the $k$-dimensional input of the output node. Different choices of the parameters $c_0,c_1,\dots,c_k$ of the output node determine different $k$-dimensional linear splits of the $n$ input vectors. This cannot be done in more than $\sum_{i=0}^{k+1}\binom{n}{i}\le(ne/(k+1))^{k+1}$ different ways for a fixed setting of the $a_{ij}$ and $b_i$ parameters. This altogether yields at most
$$\left(\sum_{i=0}^{d+1}\binom{n}{i}\right)^k\left(\sum_{i=0}^{k+1}\binom{n}{i}\right)$$
different dichotomies of the $n$ points $x_1,\dots,x_n$, as desired. The bound on the vc dimension follows from the fact that $V_{\mathcal{C}^{(k)}}\le n$ whenever $S(\mathcal{C}^{(k)},n)<2^n$. □
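For a rough feel of the sizes involved, the tiny sketch below (our own illustration, with hypothetical function names) evaluates the upper bound of Theorem 30.6 on $V_{\mathcal{C}^{(k)}}$ and plugs it into the estimation error bound $16\sqrt{(V\log n+4)/(2n)}$ quoted earlier, for a few sample sizes:

```python
from math import log2, log, sqrt, e

def vc_upper_bound(k, d):
    """Upper bound of Theorem 30.6: (2kd + 4k + 2) * log2(e * (kd + 2k + 1))."""
    return (2 * k * d + 4 * k + 2) * log2(e * (k * d + 2 * k + 1))

def estimation_error_bound(v, n):
    """Distribution-free bound 16 * sqrt((v * log n + 4) / (2 n))."""
    return 16.0 * sqrt((v * log(n) + 4.0) / (2.0 * n))

for n in (10**4, 10**5, 10**6):
    v = vc_upper_bound(k=10, d=2)
    print(n, round(v, 1), round(estimation_error_bound(v, n), 3))
```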

For threshold sigmoids, the gap between the lower and upper bounds above is logarithmic in $kd$. Notice that the vc dimension is about the number of weights (or tunable parameters) $w = kd+2k+1$ of the network. Surprisingly, Maass (1994) proved that for networks with at least two hidden layers, the upper bound has the right order of magnitude, that is, the vc dimension is $\Omega(w\log w)$. A simple application of Theorems 30.4 and 30.6 provides the next consistency result that was pointed out in Farago and Lugosi (1993):

Theorem 30.7. Let $\sigma$ be the threshold sigmoid. Let $g_n$ be a classifier from $\mathcal{C}^{(k)}$ that minimizes the empirical error
$$L_n(\phi) = \frac{1}{n}\sum_{i=1}^n I_{\{\phi(X_i)\ne Y_i\}}$$
over $\phi\in\mathcal{C}^{(k)}$. If $k\to\infty$ such that $k\log n/n\to 0$ as $n\to\infty$, then $g_n$ is strongly universally consistent, that is,
$$\lim_{n\to\infty}L(g_n) = L^*$$
with probability one for all distributions of $(X,Y)$.

Proof. By the usual decomposition into approximation and estimation errors,
$$L(g_n)-L^* = \left(L(g_n)-\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)\right)+\left(\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)-L^*\right).$$
The second term on the right-hand side tends to zero by Corollary 30.1. For the estimation error, by Theorems 12.6 and 30.6,
$$\mathbf{P}\left\{L(g_n)-\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi)>\varepsilon\right\} \le 8S(\mathcal{C}^{(k)},n)e^{-n\varepsilon^2/128} \le 8(ne)^{kd+2k+1}e^{-n\varepsilon^2/128},$$
which is summable if $k = o(n/\log n)$. □

The theorem assures us that if $\sigma$ is the threshold sigmoid, then a sequence of properly sized networks may be trained to asymptotically achieve the optimum probability of error, regardless of what the distribution of $(X,Y)$ is. For example, $k\approx\sqrt{n}$ will do the job. However, this is clearly not the optimal choice in the majority of cases. Since Theorem 30.6 provides suitable upper bounds on the vc dimension of each class $\mathcal{C}^{(k)}$, one may use complexity regularization as described in Chapter 18 to find a near-optimum size network.

Unfortunately, the situation is much less clear for more general, continuous sigmoids. The vc dimension then depends on the specific sigmoid. It is not hard to see that the vc dimension of $\mathcal{C}^{(k)}$ with an arbitrary nondecreasing sigmoid is always larger than or equal to that with the threshold sigmoid (Problem 30.8). Typically, the vc dimension of a class of such networks is significantly larger than that for the threshold sigmoid. In fact, it can even be infinite! Macintyre and Sontag (1993) demonstrated the existence of continuous, infinitely many times differentiable, monotone increasing sigmoids such that the vc dimension of $\mathcal{C}^{(k)}$ is infinite if $k\ge 2$. Their sigmoids have little squiggles, creating the large variability. It is even more surprising that infinite vc dimension may occur for even smoother sigmoids, whose second derivative is negative for $x>0$ and positive for $x<0$. In Chapter 25 (see Problem 25.11) we basically proved the following result. The details are left to the reader (Problem 30.13).


Theorem 30.8. There exists a sigmoid $\sigma$ that is monotone increasing, continuous, concave on $(0,\infty)$, and convex on $(-\infty,0)$, such that $V_{\mathcal{C}^{(k)}} = \infty$ for each $k\ge 8$.

We recall once again that infinite vc dimension implies that there is no hope of obtaining nontrivial distribution-free upper bounds on
$$L(g_n)-\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi),$$
no matter how the training sequence $D_n$ is used to select the parameters of the neural network. However, as we will see later, it is still possible to obtain universal consistency. Finiteness of the vc dimension has been proved for many types of sigmoids. Maass (1993) and Goldberg and Jerrum (1993) obtain upper bounds for piecewise polynomial sigmoids. The results of Goldberg and Jerrum (1993) apply for general classes parametrized by real numbers, e.g., for classes of neural networks with the sigmoid
$$\sigma(x) = \begin{cases}1-1/(2x+2) & \text{if } x\ge 0\\ 1/(2-2x) & \text{if } x<0.\end{cases}$$
Macintyre and Sontag (1993) prove $V_{\mathcal{C}^{(k)}}<\infty$ for a large class of sigmoids, which includes the standard, arctan, and gaussian sigmoids. While finiteness is useful, the lack of an explicit tight upper bound on $V_{\mathcal{C}^{(k)}}$ prevents us from getting meaningful upper bounds on the performance of $g_n$, and also from applying the structural risk minimization of Chapter 18. For the standard sigmoid, and for networks with $k$ hidden nodes and $w$ tunable weights, Karpinski and Macintyre (1994) recently reported the upper bound
$$V \le \frac{kw(kw-1)}{2}+w(1+2k)+w(1+3k)\log(3w+6kw+3).$$
See also Shawe-Taylor (1994).

Unfortunately, the consistency result of Theorem 30.7 is only of theoretical interest, as there is no efficient algorithm to find a classifier that minimizes the empirical error probability. Relatively little effort has been made to solve this important problem. Farago and Lugosi (1993) exhibit an algorithm that finds the empirically optimal network. However, their method takes time exponential in $kd$, which is intractable even for the smallest toy problems. Much more effort has been invested in the tuning of networks by minimizing the empirical squared error, or the empirical $L_1$ error. These problems are also computationally demanding, but numerous suboptimal hill-climbing algorithms have been used with some success. Most famous among these is the back propagation algorithm of Rumelhart, Hinton, and Williams (1986). Nearly all known algorithms that run in reasonable time may get stuck at local optima, which results in classifiers whose probability of error is hard to predict. In the next section we study the error probability of neural net classifiers obtained by minimizing empirical $L_p$ errors.


We end this section with a very simple kind of one-layer network. The committee machine (see, e.g., Nilsson (1965) and Schmidt (1994)) is a special case of a one-hidden-layer neural network of the form (30.2) with $c_0 = 0$, $c_1 = \dots = c_k = 1$, and the threshold sigmoid.

Figure 30.11. The committee machine has fixed weights at the output of the hidden layer.

Committee machines thus use a majority vote over the outcomes of the hidden neurons. It is interesting that the lower bound of Theorem 30.5 remains valid when $\mathcal{C}^{(k)}$ is the class of all committee machines with $k$ neurons in the hidden layer. It is less obvious, however, that the class of committee machines is large enough for the asymptotic property
$$\lim_{k\to\infty}\inf_{\phi\in\mathcal{C}^{(k)}}L(\phi) = L^*$$
for all distributions of $(X,Y)$ (see Problem 30.6).

Figure 30.12. A partition of the plane determined by a committee machine with 5 hidden neurons. The total vote is shown in each region. The region in which we decide “0” is shaded.
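Since a committee machine is nothing more than a majority vote over $k$ threshold perceptrons, its decision is a one-liner; the sketch below is our own illustration (the function name is an assumption):

```python
import numpy as np

def committee_machine(x, a, b):
    """Committee machine: (30.2) with c0 = 0, c1 = ... = ck = 1 and the threshold
    sigmoid, i.e., a majority vote of k perceptrons.  a is a k x d matrix of
    weight vectors and b a vector of k biases."""
    votes = np.where(a @ x + b > 0, 1.0, -1.0)   # threshold sigmoid outputs
    return 0 if np.sum(votes) <= 0.5 else 1      # decide 0 iff psi(x) <= 1/2
```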


30.5 L1 Error Minimization

In the previous section, we obtained consistency for the standard threshold sigmoid networks by empirical risk minimization. We could not apply the same methodology for general sigmoids simply because the vc dimension for general sigmoidal networks is not bounded. It is bounded for certain classes of sigmoids, and for those, empirical risk minimization yields universally consistent classifiers. Even if the vc dimension is infinite, we may get consistency, but this must then be proved by other methods, such as methods based upon metric entropy and covering numbers (see Chapters 28 and 29, as well as the survey by Haussler (1992)). One could also train the classifier by minimizing another empirical criterion, which is exactly what we will do in this section. We will be rewarded with a general consistency theorem for all sigmoids.

For $1\le p<\infty$, the empirical $L_p$ error of a neural network $\psi$ is defined by
$$J_n^{(p)}(\psi) = \left(\frac{1}{n}\sum_{i=1}^n|\psi(X_i)-Y_i|^p\right)^{1/p}.$$
The most interesting cases are $p=1$ and $p=2$. For $p=2$ this is just the empirical squared error, while $p=1$ yields the empirical absolute error. Often it makes sense to attempt to choose the parameters of the network $\psi$ such that $J_n^{(p)}(\psi)$ is minimized. In situations where one is not only interested in the number of errors, but also in how robust the decision is, such error measures may be meaningful. In other words, these error measures penalize even good decisions if $\psi$ is close to the threshold value. Minimizing $J_n^{(p)}$ is like finding a good regression function estimate. Our concern is primarily with the error probability. In Chapter 4 we already highlighted the dangers of squared error minimization and $L_p$ errors in general. Here we will concentrate on the consistency properties. We minimize the empirical error over a class of functions, which should not be too large to avoid overfitting. However, the class should be large enough to contain a good approximation of the target function. Thus, we let the class of candidate functions grow with the sample size $n$, as in Grenander's “method of sieves” (Grenander (1981)). Its consistency and rates of convergence have been widely studied primarily for least squares regression function estimation and nonparametric maximum likelihood density estimation; see Geman and Hwang (1982), Gallant (1987), and Wong and Shen (1992).

Remark. Regression function estimation. In the regression functionestimation setup, White (1990) proved consistency of neural network es-timates based on squared error minimization. Barron (1991; 1994) used acomplexity-regularized modification of these error measures to obtain thefastest possible rate of convergence for nonparametric neural network es-timates. Haussler (1992) provides a general framework for empirical error

Page 548: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.5 L1 Error Minimization 547

minimization, and provides useful tools for handling neural networks. Vari-ous consistency properties of nonparametric neural network estimates havebeen proved by White (1991), Mielniczuk and Tyrcha (1993), and Lugosiand Zeger (1995). 2

We only consider the p = 1 case, as the generalization to other values ofp is straightforward. Define the L1 error of a function ψ : Rd → R by

J(ψ) = E|ψ(X)− Y |.

We pointed out in Problem 2.12 that one of the functions minimizing J(ψ)is the Bayes rule g∗ whose error is denoted by

J∗ = infψJ(ψ) = J(g∗).

Then clearly, J∗ = L∗. We have also seen that if we define a decision by

g(x) =

0 if ψ(x) ≤ 1/21 otherwise,

then its error probability L(g) = Pg(X) 6= Y satisfies the inequality

L(g)− L∗ ≤ J(ψ)− J∗.

Our approach is to select a neural network from a suitably chosen class ofnetworks by minimizing the empirical error

Jn(ψ) =1n

n∑i=1

|ψ(Xi)− Yi|.

Denoting this function by ψn, according to the inequality above, the clas-sifier

gn(x) =

0 if ψn(x) ≤ 1/21 otherwise,

is consistent if the L1 error

J(ψn) = E |ψn(X)− Y ||Dn

converges to J∗ in probability. Convergence with probability one providesstrong consistency. For universal convergence, the class over which the min-imization is performed has to be defined carefully. The following theoremshows that this may be achieved by neural networks with k nodes, in whichthe range of the output weights c0, c1, . . . , ck is restricted.

Theorem 30.9. (Lugosi and Zeger (1995)). Let σ be an arbitrarysigmoid. Define the class Fn of neural networks by

Fn =

kn∑i=1

ciσ(aTi x+ bi) + c0 : ai ∈ Rd, bi ∈ R,kn∑i=0

|ci| ≤ βn

,

Page 549: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.5 L1 Error Minimization 548

and let ψn be a function that minimizes the empirical L1 error

Jn(ψ) =1n

n∑i=1

|ψ(Xi)− Yi|

over ψ ∈ Fn. If kn and βn satisfy

limn→∞

kn = ∞, limn→∞

βn = ∞, and limn→∞

knβ2n log(knβn)

n= 0,

then the classification rule

gn(x) =

0 if ψn(x) ≤ 1/21 otherwise

is universally consistent.

Remark. Strong universal consistency may also be shown by imposingslightly more restrictive conditions on kn and βn (see Problem 30.16). 2

Proof. By the argument preceding the theorem, it suffices to prove thatJ(ψn)− J∗ → 0 in probability. Write

J(ψn)− J∗ =(J(ψn)− inf

ψ∈Fn

J(ψ))

+(

infψ∈Fn

J(ψ)− J∗).

To handle the approximation error—the second term on the right-handside—let ψ′ ∈ Fn be a function such that

E|ψ′(X)− g∗(X)| ≤ E|ψ(X)− g∗(X)|

for each ψ ∈ Fn. The existence of such a function may be seen by notingthat E|ψ(X)−g∗(X)| is a continuous function of the parameters ai, bi, ciof the neural network ψ. Clearly,

infψ∈Fn

J(ψ)− J∗ ≤ J(ψ′)− J∗

= E|ψ′(X)− Y | −E|g∗(X)− Y |≤ E|ψ′(X)− g∗(X)|,

which converges to zero as n→∞, by Problem 30.4. We start the analysisof the estimation error by noting that as in Lemma 8.2, we have

J(ψn)− infψ∈Fn

J(ψ) ≤ 2 supψ∈Fn

|J(ψ)− Jn(ψ)|

= 2 supψ∈Fn

∣∣∣∣∣E|ψ(X)− Y | − 1n

n∑i=1

|ψ(Xi)− Yi|

∣∣∣∣∣ .

Page 550: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.5 L1 Error Minimization 549

Define the class Mn of functions on Rd × 0, 1 by

Mn

=

m(x, y) =

∣∣∣∣∣kn∑i=1

ciσ(aTi x+ bi) + c0 − y

∣∣∣∣∣ : ai ∈ Rd, bi ∈ R,kn∑i=0

|ci| ≤ βn

.

Then the previous bound becomes

2 supm∈Mn

∣∣∣∣∣Em(X,Y ) − 1n

n∑i=1

m(Xi, Yi)

∣∣∣∣∣ .Such quantities may be handled by the uniform law of large numbers ofTheorem 29.1, which applies to classes of uniformly bounded functions.Indeed, for each m ∈Mn

m(x, y) =

∣∣∣∣∣kn∑i=1

ciσ(aTi x+ bi) + c0 − y

∣∣∣∣∣≤ 2 max

(kn∑i=1

ciσ(aTi x+ bi) + c0, 1

)

≤ 2 max

(kn∑i=1

|ci|, 1

)≤ 2βn,

if n is so large that βn ≥ 1. For such n’s, Theorem 29.1 states that

P

sup

m∈Mn

∣∣∣∣∣Em(X,Y ) − 1n

n∑i=1

m(Xi, Yi)

∣∣∣∣∣ > ε

≤ 8E N (ε/8,Mn(Dn)) e−nε

2/(512β2n) ,

where N (ε,Mn(Dn)) denotes the l1-covering number of the random set

Mn(Dn) = (m(X1, Y1), . . . ,m(Xn, Yn)) : m ∈Mn ⊂ Rn,

defined in Chapter 29. All we need now is to estimate these covering num-bers. Observe that for m1,m2 ∈ Mn with m1(x, y) = |ψ1(x) − y| andm2(x, y) = |ψ2(x)− y|, for any probability measure ν on Rd × 0, 1,∫

|m1(x, y)−m2(x, y)|ν(d(x, y)) ≤∫|ψ1(x)− ψ2(x)|µ(dx),

where µ is the marginal of ν onRd. Therefore, it follows thatN (ε,Mn(Dn)) ≤N (ε,Fn(Xn

1 )), where Xn1 = (X1, . . . , Xn). It means that an upper bound

Page 551: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.5 L1 Error Minimization 550

on the covering number of the class of neural networks Fn is also an up-per bound on the quantity that interests us. This bounding may be doneby applying lemmas from Chapters 13 and 30. Define the following threecollections of functions:

G1 =aTx+ b; a ∈ Rd, b ∈ R

,

G2 =σ(aTx+ b); a ∈ Rd, b ∈ R

,

G3 =cσ(aTx+ b); a ∈ Rd, b ∈ R, c ∈ [−βn, βn]

.

By Theorem 13.9, the vc dimension of the class of sets

G+1 = (x, t) : t ≤ ψ(x), ψ ∈ G1

is VG+1≤ d + 2. This implies by Lemma 29.5 that VG+

2≤ d + 2, so by

Corollary 29.2, for any xn1 = (x1, . . . , xn),

N (ε,G2(xn1 )) ≤ 2(

4eε

)2(d+2)

,

where G2(xn1 ) = z ∈ Rn : z = (g(x1), . . . , g(xn)), g ∈ G2. Now, usingsimilar notations, Theorem 29.7 allows us to estimate covering numbers ofG3(xn1 ):

N (ε,G3(xn1 )) ≤ 4εN (ε/(2βn),G2(xn1 )) ≤

(8eβnε

)2d+5

if βn > 2/e. Finally, we can apply Lemma 29.6 to obtain

N (ε,Fn(xn1 )) ≤ 2βn(kn + 1)ε

N (ε/(kn + 1),G3(xn1 ))kn

≤(

8e(kn + 1)βnε

)kn(2d+5)+1

.

Thus, substituting this bound into the probability inequality above, we getfor n large enough,

P

supψ∈Fn

∣∣∣∣∣∣E|ψ(X)− Y | − 1n

n∑j=1

|ψ(Xj)− Yj |

∣∣∣∣∣∣ > ε

≤ 8

(64e(kn + 1)βn

ε

)kn(2d+5)+1

e−nε2/(512β2

n),

which tends to zero ifknβ

2n log(knβn)

n→ 0,

Page 552: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.6 The Adaline and Padaline 551

concluding the proof. 2

There are yet other ways to obtain consistency for general sigmoidalnetworks. We may restrict the network by discretizing the values of thecoefficients in some way—thus creating a sieve with a number of membersthat is easy to enumerate—and applying complexity regularization (Chap-ter 18). This is the method followed by Barron (1988; 1991).

30.6 The Adaline and Padaline

Widrow (1959) and Widrow and Hoff (1960) introduced the Adaline, andSpecht (1967; 1990) studied polynomial discriminant functions such as thePadaline. Looked at formally, the discriminant function ψ used in the deci-sion g(x) = Iψ(x)>0 is of a polynomial nature, with ψ consisting of sumsof monomials like

a(x(1)

)i1· · ·(x(d)

)id,

where i1, . . . , id ≥ 0 are integers, and a is a coefficient. The order of amonomial is i1 + · · · + id. Usually, all terms up to, and including those oforder r are included. Widrow’s Adaline (1959) has r = 1. The total numberof monomials of order r or less does not exceed (r+1)d. The motivation fordeveloping these discriminants is that only up to (r+ 1)d coefficients needto be trained and stored. In applications in which data continuously arrive,the coefficients may be updated on-line and the data can be discarded. Thisproperty is, of course, shared with standard neural networks. In most cases,order r polynomial discriminants are not translation invariant. Minimizinga given criterion on-line is a phenomenal task, so Specht noted that trainingis not necessary if the a’s are chosen so as to give decisions that are closeto those of the kernel method with normal kernels.

For example, if K(u) = e−u2/2, the kernel method picks a smoothing

factor h (such that h→ 0 and nhd →∞; see Chapter 10) and uses

ψ(x) =1n

n∑i=1

(2Yi − 1)K(‖x−Xi‖

h

)

=1n

n∑i=1

(2Yi − 1)e−xT x/(2h2)ex

TXi/h2e−X

Ti Xi/(2h

2) .

The same decision is obtained if we use

ψ(x) =1n

n∑i=1

(2Yi − 1)exTXi/h

2−XTi Xi/(2h

2) .

Now, approximate this by using Taylor’s series expansion and truncatingto the order r terms. For example, the coefficient of

(x(1)

)i1 · · · (x(d))id in

Page 553: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.7 Polynomial Networks 552

the expansion of ψ(x) would be, if∑dj=1 ij = i,

1i1! · · · id!

· 1h2i

·n∑j=1

(2Yj − 1)(X

(1)j

)i1· · ·(X

(d)j

)ide−X

Tj Xj/(2h

2) .

These sums are easy to update on-line, and decisions are based on the signof the order r truncation ψr of ψ. The classifier is called the Padaline.Specht notes that overfitting in ψr does not occur due to the fact thatoverfitting does not occur for the kernel method based on ψ. His methodinterpolates between the latter method, generalized linear discrimination,and generalizations of the perceptron.

For fixed r, the Padaline defined above is not universally consistent (forthe same reason linear discriminants are not universally consistent), but ifr is allowed to grow with n, the decision based on the sign of ψr becomesuniversally consistent (Problem 30.14). Recall, however, that Padaline wasnot designed with a variable r in mind.

30.7 Polynomial Networks

Besides Adaline and Padaline, there are several ways of constructing poly-nomial many-layered networks in which basic units are of the form

(x(1)

)i1 · · ·(x(k)

)ik for inputs x(1), . . . , x(k) to that level. Pioneers in this respect areGabor (1961), Ivakhnenko (1968; 1971) who invented the gmdh method—the group method of data handling—and Barron (1975). See also Ivakhnenko,Konovalenko, Tulupchuk, Tymchenko (1968) and Ivakhnenko, Petrache,and Krasyts’kyy (1968). These networks can be visualized in the followingway:

Figure 30.13. Simple polynomial network: Each gi repre-sents a simple polynomial function of its input. In Bar-ron’s work (Barron (1975), Barron and Barron (1988)), thegi’s are sometimes 2-input elements of the form gi(x, y) =ai + bix+ ciy + dixy.

Page 554: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.7 Polynomial Networks 553

If the gi’s are Barron’s quadratic elements, then the network shown inFigure 30.13 represents a particular polynomial of order 8 in which thelargest degree of any x(i) is at most 4. The number of unknown coefficientsis 36 in the example shown in the figure, while in a full-fledged order-8polynomial network, it would be much larger.

For training these networks, many strategies have been proposed by Bar-ron and his associates. Ivakhnenko (1971), for example, trains one layer ata time and lets only the best neurons in each layer survive for use as inputin the next layer.

It is easy to see that polynomial networks, even with only two inputs pernode, and with degree in each cell restricted to two, but with an unrestrictednumber of layers, can implement any polynomial in d variables. As thepolynomials are dense in the L∞ sense on C[a, b]d for all a, b ∈ Rd, wenote by Lemma 30.2 that such networks include a sequence of classifiersapproaching the Bayes error for any distribution. Consider several classesof polynomials:

G1 = ψ = a1ψ1 + · · ·+ akψk ,

where ψ1, . . . , ψk are fixed monomials, but the ai’s are free coefficients.

G2 = ψ ∈ G1 : ψ1, . . . , ψk are monomials of order ≤ r.

(The order of a monomial(x(1)

)i1 · · · (x(d))id is i1 + · · ·+ id.)

G3 = ψ ∈ G1 : ψ1, . . . , ψk are monomials of order ≤ r, but k is not fixed.

As k →∞, G1 does not generally become dense in C[a, b]d in the L∞ senseunless we add ψk+1, ψk+2, . . . in a special way. However, G3 becomes denseas r → ∞ by Theorem A.9 and G2 becomes dense as both k → ∞ andr → ∞ (as G2 contains a subclass of G3 for a smaller r depending upon kobtained by including in ψ1, . . . , ψk all monomials in increasing order).

The vc dimension of the class of classifiers based on G1 is not more thank (see Theorem 13.9). The vc dimension of classifiers based upon G2 doesnot exceed those of G3, which in turn is nothing but the number of possiblemonomials of order ≤ r. A simple counting argument shows that this isbounded by (r + 1)d. See also Anthony and Holden (1993).

These simple bounds may be used to study the consistency of polynomialnetworks.

Let us take a fixed structure network in which all nodes are fixed—theyhave at most k inputs with k ≥ 2 fixed and represent polynomials of order≤ r with r ≥ 2 fixed. For example, with r = 2, each cell with inputz1, . . . , zk computes

∑i aiψi(z1, . . . , zk), where the ai’s are coefficients and

the ψi’s are fixed monomials of order r or less, and all such monomials areincluded. Assume that the layout is fixed and is such that it can realize allpolynomials of order ≤ s on the input x(1), . . . , x(d). One way of doing thisis to realize all polynomials of order ≤ r by taking all

(dk

)possible input

Page 555: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.8 Kolmogorov-Lorentz Networks and Additive Models 554

combinations, and to repeat at the second level with((d

k)k

)cells of neurons,

and so forth for a total of s/r layers of cells. This construction is obviouslyredundant but it will do for now. Then note that the vc dimension is notmore than (s+1)d, as noted above. If we choose the best coefficients in thecells by empirical risk minimization, then the method is consistent:

Theorem 30.10. In the fixed-structure polynomial network described above,if Ln is the probability of error of the empirical risk minimizer, then ELn →L∗ if s→∞ and s = o(n1/d).

Proof. Apply Lemma 30.2 and Theorem 12.6. 2

Assume a fixed-structure network as above such that all polynomials oforder ≤ s are realized plus some other ones, while the number of layers ofcells is not more than l. Then the vc dimension is not more than (rl+ 1)d

because the maximal order is not more than lr. Hence, we have consistencyunder the same conditions as above, that is, s → ∞, and l = o(n1/d).Similar considerations can now be used in a variety of situations.

30.8 Kolmogorov-Lorentz Networksand Additive Models

Answering one of Hilbert’s famous questions, Kolmogorov (1957) and Lorentz(1976) (see also Sprecher (1965) and Hecht-Nielsen (1987)) obtained thefollowing interesting representation of any continuous function on [0, 1]d.

Theorem 30.11. (Kolmogorov (1957); Lorentz (1976)). Let f becontinuous on [0, 1]d. Then f can be rewritten as follows: let δ > 0 be anarbitrary constant, and choose 0 < ε ≤ δ rational. Then

f =2d+1∑k=1

g(zk),

where g : R → R is a continuous function (depending upon f and ε), andeach zk is rewritten as

zk =d∑j=1

λkψ(x(j) + εk

)+ k.

Here λ is real and ψ is monotonic and increasing in its argument. Also,both λ and ψ are universal (independent of f), and ψ is Lipschitz: |ψ(x)−ψ(y)| ≤ c|x− y| for some c > 0.

Page 556: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.8 Kolmogorov-Lorentz Networks and Additive Models 555

The Kolmogorov-Lorentz theorem states that f may be represented bya very simple network that we will call the Kolmogorov-Lorentz network.What is amazing is that the first layer is fixed and known beforehand.Only the mapping g depends on f . This representation immediately opensup new revenues of pursuit—we need not mix the input variables. Simpleadditive functions of the input variables suffice to represent all continuousfunctions.

Figure 30.14. The Kolmogorov-Lorentz network of Theo-rem 30.11.

To explain what is happening here, we look at the interleaving of bits tomake one-dimensional numbers out of d-dimensional vectors. For the sakeof simplicity, let f : [0, 1]2 → R. Let

x = 0.x1x2x3 . . . and y = 0.y1y2y3 . . .

be the binary expansions of x and y, and consider a representation for thefunction f(x, y). The bit-interleaved number z ∈ [0, 1] has binary expansion

z = 0.x1y1x2y2x3y3 . . .

and may thus be written as z = φ(x) + (1/2)φ(y), where

φ(x) = 0.x10x20x30 . . .

and thus12φ(y) = 0.0y10y20y3 . . . .

We can also retrieve x and y from z by noting that

x = ψ1(z), ψ1(z) = 0.z1z3z5z7 . . . ,

andy = ψ2(z), ψ2(z) = 0.z2z4z6z8 . . . .

Page 557: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.8 Kolmogorov-Lorentz Networks and Additive Models 556

Therefore,

f(x, y) = f(ψ1(z), ψ2(z))def= g(z) (a one-dimensional function of z)

= g

(φ(x) +

12φ(y)

).

The function φ is strictly monotone increasing. Unfortunately, φ, ψ1, andψ2 are not continuous. Kolmogorov’s theorem for this special case is asfollows:

Theorem 30.12. (Kolmogorov (1957)). There exist five monotone func-tionsφ1, . . . , φ5 : [0, 1] → R satisfying |φi(x1)− φi(x2)| ≤ |x1− x2|, with the fol-lowing property: for every f ∈ C[0, 1]2 (the continuous functions on [0, 1]2),there exists a continuous function g such that for all (x1, x2) ∈ [0, 1]2,

f(x1, x2) =5∑i=1

g

(φi(x1) +

12φi(x2)

).

The difference with pure bit-interleaving is that now the φi’s are contin-uous and g is continuous whenever f is. Also, just as in our simle example,Kolmogorov gives an explicit construction for φ1, . . . , φ5.

Kolmogorov’s theorem may be used to show the denseness of certainclasses of functions that may be described by networks. There is one pitfallhowever: any such result must involve at least one neuron or cell that hasa general function in it, and we are back at square one, because a generalfunction, even on only one input, may be arbitrarily complicated and wild.

Additive models include, for example, models such as

α+d∑i=1

ψi(x(i)),

where the ψi’s are unspecified univariate functions (Friedman and Silver-man (1989), Hastie and Tibshirani (1990)). These are not powerful enoughto approximate all functions. A generalized additive model is

σ

(α+

d∑i=1

ψi(x(i))

)

(Hastie and Tibshirani (1990)), where σ is now a given or unspecified func-tion. From Kolmogorov’s theorem, we know that the model

2d+1∑k=1

ψ

(αk +

d∑i=1

ψi,k(x(i))

)

Page 558: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.9 Projection Pursuit 557

with ψi,k, ψ unspecified functions, includes all continuous functions on com-pact sets and is thus ideally suited for constructing networks. In fact, wemay take αk = k and take all ψi,k’s as specified in Kolmogorov’s theorem.This leaves only ψ as the unknown.

Now, consider the following: any continuous univariate function f canbe approximated on bounded sets to within ε by simple combinations ofthreshold sigmoids σ of the form

k∑i=1

aiσ(x− ci),

where ai, ci, k are variable. This leads to a two-hidden-layer neural networkrepresentation related to that of Kurkova (1992), where only the last layerhas unknown coefficients for a total of 2k.

Theorem 30.13. Consider a network classifier of the form described abovein which σ(·) is the threshold sigmoid, and the ai’s and ci’s are found byempirical error minimization. Then ELn → L∗ for all distributions of(X,Y ), if k →∞ and k log n/n→ 0.

Proof. We will only outline the proof. First observe that we may approx-imate all functions on C[a, b]d by selecting k large enough. By Lemma 30.2and Theorem 12.6, it suffices to show that the vc dimension of our classof classifiers is o(n/ log n). Considering αk +

∑di=1 ψi,k(x

(i)) as new inputelements, called yk, 1 ≤ k ≤ 2d+ 1, we note that the vc dimension is notmore than that of the classifiers based on

2d+1∑k=1

k∑i=1

aiσ(yj − ci),

which in turn is not more than that of the classifiers given by

k(2d+1)∑i=1

blσ(zl − dl),

where bl, dl are parameters, and zl is an input sequence. By Theorem13.9, the vc dimension is not more than k(2d+1). This concludes the proof.2

For more results along these lines, we refer to Kurkova (1992).

30.9 Projection Pursuit

In projection pursuit (Friedman and Tukey (1974), Friedman and Stuet-zle (1981), Friedman, Stuetzle, and Schroeder (1984), Huber (1985), Hall

Page 559: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.9 Projection Pursuit 558

(1989), Flick, Jones, Priest, and Herman (1990)), one considers functionsof the form

ψ(x) =k∑j=1

ψj(bj + aTj x), (30.3)

where bj ∈ R, aj ∈ Rd are constants and ψ1, . . . , ψk are fixed functions.This is related to, but not a special case of, one-hidden-layer neural net-works. Based upon the Kolmogorov-Lorentz representation theorem, wemay also consider

ψ(x) =d∑j=1

ψj(x(j)), (30.4)

for fixed functions ψj (Friedman and Silverman (1989), Hastie and Tibshi-rani (1990)). In (30.4) and (30.3), the ψj ’s may be approximated in turn byspline functions or other nonparametric constructs. This approach is cov-ered in the literature on generalized additive models (Stone (1985), Hastieand Tibshirani (1990)).

The class of functions eaT x, a ∈ Rd, satisfies the conditions of the Stone-

Weierstrass theorem (Theorem A.9) and is therefore dense in the L∞ normon C[a, b]d for any a, b ∈ Rd (see, e.g., Diaconis and Shahshahani (1984)).

As a corollary, we note that the same denseness result applies to thefamily

k∑i=1

ψi(aTi x), (30.5)

where k ≥ 1 is arbitrary and ψ1, ψ2, . . . are general functions. The latterresult is at the basis of projection pursuit methods for approximating func-tions, where one tries to find vectors ai and functions ψi that approximatea given function very well.

Remark. In some cases, approximations by functions as in (30.5) may beexact. For example,

x(1)x(2) =14(x(1) + x(2))2 − 1

4(x(1) − x(2))2

andmax(x(1), x(2)) =

12|x(1) + x(2)| − 1

2|x(1) − x(2)|. 2

Theorem 30.14. (Diaconis and Shahshahani (1984)). Let m be apositive integer. There are

(m+d−1m

)distinct vectors aj ∈ Rd such that

any homogeneous polynomial f of order m can be written as

f(x) =(m+d−1

m )∑j=1

cj(aTj x)m

for some real numbers cj.

Page 560: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.9 Projection Pursuit 559

Every polynomial of order m over Rd is a homogeneous polynomial oforder m over Rd+1 by replacing the constant 1 by a component x(d+1)

raised to an appropriate power. Thus, any polynomial of order m over Rd

may be decomposed exactly by

f(x) =(m+d

m )∑j=1

cj(aTj x+ bj

)mfor some real numbers bj , cj and vectors aj ∈ Rd.

Polynomials may thus be represented exactly in the form∑ki=1 φi(a

Ti x)

with k =(m+dm

). As the polynomials are dense in C[a, b]d, we have yet

another proof that∑k

i=1 φi(aTi x)

is dense in C[a, b]d. See Logan and

Shepp (1975) or Logan (1975) for other proofs.The previous discussion suggests at least two families from which to select

a discriminant function. As usual, we let g(x) = Iψ(x)>0 for a discriminantfunction ψ. Here ψ could be picked from

Fm =

m∑j=1

cj(aTj x+ bj)m; bj , cj ∈ R, aj ∈ Rd

or

F ′m =

m∑j=1

cjeaT

j x; cj ∈ R, aj ∈ Rd

,

where m is sufficiently large. If we draw ψ by minimizing the empiricalerror (admittedly at a tremendous computational cost), then convergencemay result if m is not too large. We need to know the vc dimension of theclasses of classifiers corresponding to Fm and F ′m. Note that Fm coincideswith all polynomials of order m and each such polynomial is the sum ofat most (m+ 1)d monomials. If we invoke Lemma 30.2 and Theorem 12.6,then we get

Theorem 30.15. Empirical risk minimization to determine aj , bj , cj inFm leads to a universally consistent classifier provided that m → ∞ andm = o(n1/d/ log n).

Projection pursuit is very powerful and not at all confined to our limiteddiscussion above. In particular, there are many other ways of constructinggood consistent classifiers that do not require extensive computations suchas empirical error minimization.

Page 561: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.10 Radial Basis Function Networks 560

30.10 Radial Basis Function Networks

We may perform discrimination based upon networks with functions of theform

ψ(x) =k∑i=1

aiK

(x− xihi

)(and decision g(x) = Iψ(x)>0), where k is an integer, a1, . . . , ak h1, . . . , hkare constants, x1, . . . , xk ∈ Rd, and K is a kernel function (such as K(u) =e−‖u‖

2or K(u) = 1/(1 + ‖u‖2)). In this form, ψ covers several well-known

methodologies:

(1) The kernel rule (Chapter 10): Take k = n, ai = 2Yi− 1, hi = h, xi =Xi. With this choice, for a large class of kernels, we are guaranteedconvergence if h → 0 and nhd → ∞. This approach is attractive asno difficult optimization problem needs to be solved.

(2) The potential function method. In Bashkirov, Braverman, and Much-nik (1964), the parameters are k = n, hi = h, xi = Xi. The weights aiare picked to minimize the empirical error on the data, and h is heldfixed. The original kernel suggested there is K(u) = 1/(1 + ‖u‖2).

(3) Linear discrimination. For k = 2, K(u) = e−‖u‖2, hi ≡ h, a1 = 1,

a2 = −1, the set x : ψ(x) > 0 is a linear halfspace. This, of course,is not universally consistent. Observe that the separating hyperplaneis the collection of all points x at equal distance from x1 and x2. Byvarying x1 and x2, all hyperplanes may be obtained.

(4) Radial basis function (rbf) neural networks (e.g., Powell (1987), Broom-head and Lowe (1988), Moody and Darken (1989), Poggio and Girosi(1990), Xu, Krzyzak, and Oja (1993), Xu, Krzyzak, and Yuille (1994),and Krzyzak, Linder, and Lugosi (1993)). An even more general func-tion ψ is usually employed here:

ψ(x) =k∑i=1

ciK((x− xi)TAi(x− xi)

)+ c0,

where the Ai’s are tunable d× d matrices.

(5) Sieve methods. Grenander (1981) and Geman and Hwang (1982) ad-vocate the use of maximum likelihood methods to find suitable valuesfor the tunable parameters in ψ (for k,K fixed beforehand) subjectto certain compactness constraints on these parameters to control theabundance of choices one may have. If we were to use empirical errorminimization, we would find, if k ≥ n, that all data points can be cor-rectly classified (take the hi’s small enough, set ai = 2Yi−1, xi = Xi,

Page 562: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

30.10 Radial Basis Function Networks 561

k = n), causing overfitting. Hence, k must be smaller than n if pa-rameters are picked in this manner. Practical ways of choosing theparameters are discussed by Kraaijveld and Duin (1991), and Chouand Chen (1992). In both (4) and (5), the xi’s may be thought ofas representative prototypes, the ai’s as weights, and the hi’s as theradii of influence. As a rule, k is much smaller than n as x1, . . . , xksummarizes the information present at the data.

To design a consistent rbf neural network classifier, we may proceed asin (1). We may also take k → ∞ but k = o(n). Just let (x1, . . . , xk) ≡(X1, . . . , Xk), (a1, . . . , ak ≡ 2Y1 − 1, . . . , 2Yk − 1), and choose Ai or hi tominimize a given error criterion based upon Xk+1, . . . , Xn, such as

Ln(g) =1

n− k

n∑i=k+1

Ig(Xi) 6=Yi,

where g(x) = Iψ(x)>0, and ψ is as in (1). This is nothing but data splitting(Chapter 22). Convergence conditions are described in Theorem 22.1.

A more ambitious person might try empirical risk minimization to findthe best x1, . . . , xk, a1, . . . , ak, A1, . . . , Ak (d× d matrices as in (4)) basedupon

1n

n∑i=1

Ig(Xi) 6=Yi.

If k → ∞, the class of rules contains a consistent subsequence, and there-fore, it suffices only to show that the vc dimension is o(n/ log n). This isa difficult task and some kernels yield infinite vc dimension, even if d = 1and k is very small (see Chapter 25). However, there is a simple argumentif K = IR for a simple set R. Let

A =x : x = a+Ay, y ∈ R : a ∈ Rd, A a d× d matrix

.

If R is a sphere, then A is the class of all ellipsoids. The number of ways ofshattering a set x1, . . . , xn by intersecting with members from A is notmore than

nd(d+1)/2+1

(see Theorem 13.9, Problem 13.10). The number of ways of shattering aset x1, . . . , xn by intersecting with sets of the form

a1IR1 + · · ·+ akIRk> 0 : R1, . . . , Rk ∈ A, a1, . . . , ak ∈ R

is not more than the product of all ways of shattering by intersections withR1, with R2, and so forth, that is,

nk(d(d+1)/2+1).

The logarithm of the shatter coefficient is o(n) if k = o(n/ log n). Thus, byCorollary 12.1, we have

Page 563: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 562

Theorem 30.16. (Krzyzak, Linder, and Lugosi (1993)). If we takek →∞, k = o(n/ log n) in the rbf classifier (4) in which K = IR, R beingthe unit ball of Rd, and in which all the parameters are chosen by empir-ical risk minimization, then ELn → L∗ for all distributions of (X,Y ).Furthermore, if Gk is the class of all rbf classifiers with k prototypes,

ELn − infg∈Gk

L(g) ≤ 16

√k(d(d+ 1)/2 + 1) log n+ log(8e)

2n.

The theorem remains valid (with modified constants in the error esti-mate) when R is a hyperrectangle or polytope with a bounded number offaces. However, for more general K, the vc dimension is more difficult toevaluate.

For general kernels, consistent rbf classifiers can be obtained by em-pirical L1 or L2 error minimization (Problem 30.32). However, no efficientpractical algorithms are known to compute the minima.

Finally, as suggested by Chou and Chen (1992) and Kraaijveld and Duin(1991), it is a good idea to place x1, . . . , xk by k-means clustering or anotherclustering method and to build an rbf classifier with those values or byoptimization started at the given cluster centers.

Problems and Exercises

Problem 30.1. Let k, l be integers, with d ≤ k < n, and l ≤(kd

). Assume X

has a density. Let A be a collection of hyperplanes drawn from the(kd

)possible

hyperplanes through d points of X1, . . . , Xk, and let gA be the correspondingnatural classifier based upon the arrangement P(A). Take l such collections A at

random and with replacement, and pick the best A by minimizing Ln(gA), where

Ln(gA) =1

n− k

n∑i=k+1

IYi 6=gA(Xi).

Show that the selected classifier is consistent if l→∞, ld = o(n), n/(l log k) →∞.(Note: this is applicable with k = bn/2c, that is, half the sample is used to defineA, and the other half is used to pick a classifier empirically.)

Problem 30.2. We are given a tree classifier with k internal linear splits andk+1 leaf regions (a bsp tree). Show how to combine the neurons in a two-hidden-layer perceptron with k and k + 1 hidden neurons in the two hidden layers so asto obtain a decision that is identical to the tree-based classifier (Brent (1991),Sethi (1990; 1991)). For more on the equivalence of decision trees and neuralnetworks, see Meisel (1990), Koutsougeras and Papachristou (1989), or Goleaand Marchand (1990). Hint: Mimic the argument for arrangements in text.

Problem 30.3. Extend the proof of Theorem 30.4 so that it includes any non-decreasing sigmoid with limx→−∞ σ(x) = −1 and limx→∞ σ(x) = 1. Hint: If t islarge, σ(tx) approximates the threshold sigmoid.

Page 564: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 563

Problem 30.4. This exercise states denseness in L1(µ) for any probability mea-sure µ. Show that for every probability measure µ on Rd, every measurablefunction f : Rd → R with

∫|f(x)|µ(dx) < ∞, and every ε > 0, there exists a

neural network with one hidden layer and function ψ(x) as in (30.2) such that∫|f(x)− ψ(x)|µ(dx) < ε

(Hornik (1991)). Hint: Proceed as in Theorem 30.4, considering the following:(1) Approximate f in L1(µ) by a continuous function g(x) that is zero out-

side some bounded set B ⊂ Rd. Since g(x) is bounded, its maximumβ = maxx∈Rd |g(x)| is finite.

(2) Now, choose a to be a positive number large enough so that both B ⊂[−a, a]d and µ

([−a, a]d

)is large.

(3) Extend the restriction of g(x) to [−a, a]d periodically by tiling over allof Rd. The obtained function, g(x) is a good approximation of g(x) inL1(µ).

(4) Take the Fourier series approximation of g(x), and use the Stone-Weier-strass theorem as in Theorem 30.4.

(5) Observe that every continuous function on the real line that is zero out-side some bounded interval can be arbitrarily closely approximated uni-formly over the whole real line by one-dimensional neural networks. Thus,bounded functions such as the sine and cosine functions can be approxi-mated arbitrarily closely by neural networks in L1(ν) for any probabilitymeasure ν on R.

(6) Apply the triangle inequality to finish the proof.

Problem 30.5. Generalize the previous exercise for denseness in Lp(µ). Moreprecisely, let 1 ≤ p <∞. Show that for every probability measure µ on Rd, everymeasurable function f : Rd → R with

∫|f(x)|pµ(dx) < ∞, and every ε > 0,

there exists a neural network with one hidden layer h(x) such that(∫|f(x)− h(x)|pµ(dx)

)1/p

< ε

(Hornik (1991)).

Problem 30.6. Committee machines. Let C(k) be the class of all committeemachines. Prove that for all distributions of (X,Y ),

limk→∞

infφ∈C(k)

L(φ) = L∗.

Hint: For a one-hidden-layer neural network with coefficients ci in (30.2), ap-proximate the ci’s by discretization (truncation to a grid of values), and notethat ciψi(x) may thus be approximated in a committee machine by a sufficientnumber of identical copies of ψi(x). This only forces the number of neurons tobe a bit larger.

Problem 30.7. Prove Theorem 30.5 for arbitrary sigmoids. Hint: Approximatethe threshold sigmoid by σ(tx) for a sufficiently large t.

Page 565: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 564

Problem 30.8. Let σ be a nondecreasing sigmoid with σ(x) → −1 if x → −∞and σ(x) → 1 if x→∞. Denote by C(k)

σ the class of corresponding neural networkclassifiers with k hidden layers. Show that VC(k)

σ≥ VC(k) , where C(k) is the class

corresponding to the threshold sigmoid.

Problem 30.9. This exercise generalizes Theorem 30.5 for not fully connectedneural networks with one hidden layer. Consider the class C(k) of one-hidden-layer neural networks with the threshold sigmoid such that each of the k nodesin the hidden layer are connected to d1, d2, . . . , dk inputs, where 1 ≤ di ≤ d. Moreprecisely, C(k) contains all classifiers based on functions of the form

ψ(x) = c0 +

k∑i=1

ciσ(ψi(x)), where ψi(x) =

di∑j=1

ajx(mi,j),

where for each i, (mi,1, . . . ,mi,di) is a vector of distinct positive integers notexceeding d. Show that

VC(k) ≥k∑i=1

di + 1

if the aj ’s, ci’s and mi,j ’s are the tunable parameters (Bartlett (1993)).

Problem 30.10. Let σ be a sigmoid that takes m different values. Find upperbounds on the vc dimension of the class C(k)

σ .

Problem 30.11. Consider a one-hidden-layer neural network ψ. If (X1, Y1), . . .,(Xn, Yn) are fixed and all Xi’s are different, show that with n hidden neurons, weare always able to tune the weights such that Yi = ψ(Xi) for all i. (This remainstrue if Y ∈ R instead of Y ∈ 0, 1.) The property above describes a situation ofoverfitting that occurs when the neural network becomes too “rich”—recall alsothat the vc dimension, which is at least d times the number of hidden neurons,must remain smaller than n for any meaningful training.

Problem 30.12. The Bernstein perceptron. Consider the following percep-tron for one-dimensional data:

φ(x) =

1 if

∑k

i=1aix

i(1− x)k−i > 1/20 otherwise.

Let us call this the Bernstein perceptron since it involves Bernstein polynomials.If n data points are collected, how would you choose k (as a function of n) andhow would you adjust the weights (the ai’s) to make sure that the Bernsteinperceptron is consistent for all distributions of (X,Y ) with PX ∈ [0, 1] = 1?Can you make the Bernstein perceptron consistent for all distributions of (X,Y )on R× 0, 1?

Page 566: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 565

Figure 30.15. The Bernstein

perceptron.

Problem 30.13. Use the ideas of Section 25.5 and Problem 25.11 to prove The-orem 30.8.

Problem 30.14. Consider Specht’s Padaline with r = rn ↑ ∞. Let h = hn ↓ 0,and nhd →∞. Show that for any distribution of (X,Y ), ELn → L∗.

Problem 30.15. Bounded first layers. Consider a feed-forward neural net-work with any number of layers, with only one restriction, that is, the first layerhas at most k < d outputs z1, . . . , zk, where

zj = σ

(bj +

d∑i=1

ajix(i)

),

x =(x(1), . . . , x(d)

)is the input, and σ is an arbitrary function R→ R (not just

a sigmoid). The integer k remains fixed. Let A denote the k × (d+ 1) matrix ofweights aji, bj .

(1) If L∗(A) is the Bayes error for a recognition problem based upon (Z, Y ),with Z = (Z1, . . . , Zk) and

Zj = σ

(bj +

d∑i=1

ajiX(i)

), X =

(X(1), . . . , X(d)

),

then show that for some distribution of (X,Y ), infA L∗(A) > L∗, where

L∗ is the Bayes probability of error for (X,Y ).(2) If k ≥ d however, show that for any strictly monotonically increasing

sigmoid σ, infA L∗(A) = L∗.

(3) Use (1) to conclude that any neural network based upon a first layerwith k < d outputs is not consistent for some distribution, regardless ofhow many layers it has (note however, that the inputs of each layer arerestricted to be the outputs of the previous layer).

Page 567: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 566

Figure 30.16. The first layer

is restricted to have k outputs. It

has k(d+1) tunable parameters.

Problem 30.16. Find conditions on kn and βn that guarantee strong universalconsistency in Theorem 30.9.

Problem 30.17. Barron networks. Call a Barron network a network of anynumber of layers (as in Figure 30.13) with 2 inputs per cell and cells that performthe operation α+βx+γy+δxy on inputs x, y ∈ R, with trainable weights α, β, γ, δ.If ψ is the output of a network with d inputs and k cells (arranged in any way),compute an upper bound for the vc dimension of the classifier g(x) = Iψ(x)>0,as a function of d and k. Note that the structure of the network (i.e., the positionsof the cells and the connections) is variable.

Problem 30.18. Continued. Restrict the Barron network to l layers and k cellsper layer (for kl cells total), and repeat the previous exercise.

Problem 30.19. Continued. Find conditions on k and l in the previous exer-cises that would guarantee the universal consistency of the Barron network, if wetrain to minimize the empirical error.

Problem 30.20. Consider the family of functions Fk,l of the form

k∑i=1

l∑j=1

wij (ψi(x))j , ψi(x) =

d∑i′=1

l∑j′=1

w′ii′j′(x(i′)

)j′, 1 ≤ i ≤ k,

where all the wij ’s and w′ii′j′ ’s are tunable parameters. Show that for every

f ∈ C[0, 1]d and ε > 0, there exist k, l large enough so that for some g ∈ Fk,l,supx∈[0,1]d |f(x)− g(x)| ≤ ε.

Problem 30.21. Continued. Obtain an upper bound on the vc dimension ofthe above two-hidden-layer network. Hint: The vc dimension is usually aboutequal to the number of degrees of freedom (which is (1 + d)lk here).

Page 568: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 567

Problem 30.22. Continued. If gn is the rule obtained by empirical error min-imization over Fk,l, then show that L(gn) → L∗ in probability if k →∞, l→∞,and kl = o(n/ logn).

Problem 30.23. How many different monomials of order r in x(1), . . . , x(d) arethere? How does this grow with r when d is held fixed?

Problem 30.24. Show that ψ1, ψ2, and φ in the bit-interleaving example in thesection on Kolmogorov-Lorentz representations are not continuous. Which arethe points of discontinuity?

Problem 30.25. Let

F ′m =

m∑j=1

cjeaT

j x; cj ∈ R, aj ∈ Rd

.

Pick aj , cj by empirical risk minimization for the classifier g(x) = Iψ(x)>0,ψ ∈ F ′

m. Show that ELn → L∗ for all distributions of (X,Y ) when m → ∞and m = o(n1/d/ logn).

Problem 30.26. Write(x(1)x(2)

)2as∑4

i=1

(ai1x

(1) + ai2x(2))2

and identify thecoefficients ai1, ai2. Show that there is an entire subspace of solutions (Diaconisand Shahshahani (1984)).

Problem 30.27. Show that ex(1)x(2)

and sin(x(1)x(2)) cannot be written in the

form∑k

i=1ψi(a

Ti x) for any finite k, where x = (x(1), x(2)) and ai ∈ R2, 1 ≤ i ≤ k

(Diaconis and Shahshahani (1984)). Thus, the projection pursuit representationof functions can only at best approximate all continuous functions on boundedsets.

Problem 30.28. Let F be the class of classifiers of the form g = Iψ>0, where

ψ(x) = a1f1(x(1)) + a2f2(x

(2)), for arbitrary functions f1, f2, and coefficientsa1, a2 ∈ R. Show that for some distribution of (X,Y ) onR2×0, 1, infg∈F L(g) >L∗, so that there is no hope of meaningful distribution-free classification basedon additive functions only.

Problem 30.29. Continued. Repeat the previous exercise for functions of theform ψ(x) = a1f1(x

(2), x(3)) + a2f2(x(1), x(3)) + a3f3(x

(1), x(2)) and distributionsof (X,Y ) on R3 × 0, 1. Thus, additive functions of pairs do not suffice either.

Problem 30.30. LetK : R→ R be a nonnegative bounded kernel with∫K(x)dx <

∞. Show that for any ε > 0, any measurable function f : Rd →R and probabilitymeasure µ such that

∫|f(x)|µ(dx) < ∞, there exists a function ψ of the form

ψ(x) =∑k

i=1ciK

((x− bi)

TAi(x− bi))

+ c0 such that∫|f(x) − ψ(x)µ(dx) < ε

(see Poggio and Girosi (1990), Park and Sandberg (1991; 1993), Darken, Don-ahue, Gurvits, and Sontag (1993), Krzyzak, Linder, and Lugosi (1993) for suchdenseness results). Hint: Relate the problem to a similar result for kernel esti-mates.

Page 569: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

Problems and Exercises 568

Problem 30.31. Let C(k) be the class of classifiers defined by the functions

k∑i=1

ciK((x− bi)

TAi(x− bi))

+ c0.

Find upper bounds on its vc dimension when K is an indicator of an intervalcontaining the origin.

Problem 30.32. Consider the class of radial basis function networks

Fn =

kn∑i=1

ciK((x− bi)

TAi(x− bi))

+ c0 :

kn∑i=0

|ci| ≤ βn

,

where K is nonnegative, unimodal, bounded, and continuous. Let ψn be a func-tion that minimizes Jn(ψ) = n−1

∑n

i=1|ψ(Xi) − Yi| over ψ ∈ Fn, and de-

fine gn as the corresponding classifier. Prove that if kn → ∞, βn → ∞, andknβ

2n log(knβn) = o(n) as n → ∞, then gn is universally consistent (Krzyzak,

Linder, and Lugosi (1993)). Hint: Proceed as in the proof of Theorem 30.9.Use Problem 30.30 to handle the approximation error. Bounding the coveringnumbers needs a little additional work.

Page 570: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

This is page 569Printer: Opaque this

31Other Error Estimates

In this chapter we discuss some alternative error estimates that have beenintroduced to improve on the performance of the standard estimates—holdout, resubstitution, and deleted—we have encountered so far. The firstgroup of these estimates—smoothed and posterior probability estimates—are used for their small variance. However, we will give examples thatshow that classifier selection based on the minimization of these estimatesmay fail even in the simplest situations. Among other alternatives, we dealbriefly with the rich class of bootstrap estimates.

31.1 Smoothing the Error Count

The resubstitution, deleted, and holdout estimates of the error probability(see Chapters 22 and 23) are all based on counting the number of errorscommitted by the classifier to be tested. This is the reason for the relativelylarge variance inherently present in these estimates. This intuition is basedon the following. Most classification rules can be written into the form

gn(x) =

0 if ηn(x,Dn) ≥ 1/21 otherwise,

where ηn(x,Dn) is either an estimate of the a posteriori probability η(x),as in the case of histogram, kernel, or nearest neighbor rules; or somethingelse, as for generalized linear, or neural network classifiers. In any case,if ηn(x,Dn) is close to 1/2, then we feel that the decision is less robust

Page 571: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

31.1 Smoothing the Error Count 570

compared to when the value of ηn(x,Dn) is far from 1/2. In other words,intuitively, inverting the value of the decision gn(x) at a point x, whereηn(x,Dn) is close to 1/2 makes less difference in the error probability,than if |ηn(x,Dn) − 1/2| is large. The error estimators based on countingthe number of errors do not take the value of ηn(x) into account: they“penalize” errors with the same amount, no matter what the value of ηnis. For example, in the case of the resubstitution estimate

L(R)n =

1n

n∑i=1

Ign(Xi) 6=Yi,

each error contributes with 1/n to the overall count. Now, if ηn(Xi, Dn)is close to 1/2, a small perturbation of Xi can flip the decision gn(Xi),and therefore change the value of the estimate by 1/n, although the errorprobability of the rule gn probably does not change by this much. Thisphenomenon is what causes the relatively large variance of error countingestimators.

Glick (1978) proposed a modification of the counting error estimates.The general form of his “smoothed” estimate is

L(S)n =

1n

n∑i=1

(IYi=0r(ηn(Xi, Dn)) + IYi=1(1− r(ηn(Xi, Dn)))

),

where r is a monotone increasing function satisfying r(1/2 − u) = 1 −r(1/2 + u). Possible choices of the smoothing function r(u) are r(u) = u,or r(u) = 1/(1 + e1/2−cu), where the parameter c > 0 may be adjusted toimprove the behavior of the estimate (see also Glick (1978), Knoke (1986),or Tutz (1985)). Both of these estimates give less penalty to errors close tothe decision boundary, that is, to errors where ηn is close to 1/2. Note thattaking r(u) = Iu≥1/2 corresponds to the resubstitution estimate. We willsee in Theorem 31.2 below that if r is smooth, then L(S)

n indeed has a verysmall variance in many situations.

Just like the resubstitution estimate, the estimate L(S)n may be strongly

optimistically biased. Just consider the 1-nearest neighbor rule, when L(S)n =

0 with probability one, whenever X has a density. To combat this defect,one may define the deleted version of the smoothed estimate,

L(SD)n =

1n

n∑i=1

(IYi=0r(ηn−1(Xi, Dn,i)) + IYi=1(1− r(ηn−1(Xi, Dn,i)))

),

where Dn,i is the training sequence with the i-th pair (Xi, Yi) deleted. Thefirst thing we notice is that this estimate is still biased, even asymptotically.

Page 572: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

31.1 Smoothing the Error Count 571

To illustrate this point, consider r(u) = u. In this case,

EL(SD)n

= E

IY=0ηn−1(X,Dn−1) + IY=1(1− ηn−1(X,Dn−1))

= E (1− η(X))ηn−1(X,Dn−1) + η(X)(1− ηn−1(X,Dn−1)) .

If the estimate ηn−1(x,Dn−1) was perfect, that is, equal to η(x) for everyx, then the expected value above would be 2Eη(X)(1− η(X)), which isthe asymptotic error probability LNN of the 1-nn rule. In fact,∣∣∣EL(SD)

n

− LNN

∣∣∣=

∣∣EIY=0ηn−1(X,Dn−1) + IY=1(1− ηn−1(X,Dn−1))

−(IY=0η(X) + IY=1(1− η(X))

)∣∣≤ 2 |E ηn−1(X,Dn−1)− η(X)| .

This means that when the estimate ηn(x) of the a posteriori probability ofη(x) is consistent in L1(µ), then L(SD)

n converges to LNN, and not to L∗!Biasedness of an error estimate is not necessarily a bad property. In

most applications, all we care about is how the classification rule selectedby minimizing the error estimate works. Unfortunately, in this respect,smoothed estimates perform poorly, even compared to other strongly biasederror estimates such as the empirical squared error (see Problem 31.4). Thenext example illustrates our point.

Theorem 31.1. Let the distribution of X be concentrated on two valuessuch that PX = a = PX = b = 1/2, and let η(a) = 3/8 and η(b) =5/8. Assume that the smoothed error estimate

L(S)n (η′) =

1n

n∑i=1

(IYi=0η

′(Xi) + IYi=1(1− η′(Xi)))

is minimized over η′ ∈ F , to select a classifier from F , where the class Fcontains two functions, the true a posteriori probability function η, and η,where

η(a) = 0, η(b) =38.

Then the probability that η is selected converges to one as n→∞.

Proof. Straightforward calculation shows that

EL(S)n (η)

=

1532

and EL(S)n (η)

=

632.

The statement follows from the law of large numbers. 2

Page 573: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

31.1 Smoothing the Error Count 572

Remark. The theorem shows that even if the true a posteriori probabilityfunction is contained in a finite class of candidates, the smoothed estimatewith r(u) = u is unable to select a good discrimination rule. The resultmay be extended to general smooth r’s. As Theorems 15.1 and 29.2 show,empirical squared error minimization or maximum likelihood never fail inthis situation. 2

Finally we demonstrate that if r is smooth, then the variance of L(S)n

is indeed small. Our analysis is based on the work of Lugosi and Pawlak(1994). The bounds for the variance of L(S)

n remain valid for L(SD)n . Con-

sider classification rules of the form

gn(x) =

0 if ηn(x,Dn) ≥ 1/21 otherwise,

where ηn(x,Dn) is an estimate of the a posteriori probability η(x). Exam-ples include the histogram rule, where

ηn(x,Dn) =∑ni=1 YiIXi∈An(x)∑ni=1 IXi∈An(x)

(see Chapters 6 and 9), the k-nearest neighbor rule, where

ηn(x,Dn) =1k

k∑i=1

Y(i)(x)

(Chapters 5 and 11), or the kernel rule, where

ηn(x,Dn) =∑ni=1 YiKh(x−Xi)nEKh(x−X)

(Chapter 10). In the sequel, we concentrate on the performance of thesmoothed estimate of the error probability of these nonparametric rules.The next theorem shows that for these rules, the variance of the smoothederror estimate is O(1/n), no matter what the distribution is. This is asignificant improvement over the variance of the deleted estimate, which,as pointed out in Chapter 23, can be larger than 1/

√2πn.

Theorem 31.2. Assume that the smoothing function r(u) satisfies 0 ≤r(u) ≤ 1 for u ∈ [0, 1], and is uniformly Lipschitz continuous, that is,

|r(u)− r(v)| ≤ c|u− v|

for all u, v, and for some constant c. Then the smoothed estimate L(S)n of

the histogram, k-nearest neighbor, and moving window rules (with kernelK = IS0,1) satisfies

P∣∣∣L(S)

n −EL(S)n

∣∣∣ ≥ ε≤ 2e−2nε2/C ,

Page 574: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

31.1 Smoothing the Error Count 573

andVarL(S)

n ≤ C

4n,

where C is a constant depending on the rule only. In the case of the his-togram rule the value of C is C = (1+4c)2, for the k-nearest neighbor ruleC = (1+2cγd)2, and for the moving window rule C = (1+2cβd)2. Here c isthe constant in the Lipschitz condition, γd is the minimal number of conescentered at the origin of angle π/6 that cover Rd, and βd is the minimalnumber of balls of radius 1/2 that cover the unit ball in Rd.

Remark. Notice that the inequalities of Theorem 31.2 are valid for alln, ε, and h > 0 for the histogram and moving window rules, and k for thenearest neighbor rules. Interestingly, the constant C does not change withthe dimension in the histogram case, but grows exponentially with d forthe k-nearest neighbor and moving window rules. 2

Proof. The probability inequalities follow from appropriate applications ofMcDiarmid’s inequality. The upper bound on the variance follows similarlyfrom Theorem 9.3. We consider each of the three rules in turn.

The histogram rule. Let (x1, y1), . . . , (xn, yn) be a fixed training se-quence. If we can show that by replacing the value of a pair (xi, yi) in thetraining sequence by some (x′i, y

′i) the value of the estimate L(S)

n can changeby at most (1+2c)/n, then the inequality follows by applying Theorem 9.2with ci = 5/n.

The i-th term of the sum in L(S)n can change by one, causing 1/n change

in the average. Obviously, all the other terms in the sum that can changeare the ones corresponding to the xj ’s that are in the same set of thepartition as either xi or x′i. Denoting the number of points in the same setwith xi and x′i by k and k′ respectively, it is easy to see that the estimateof the a posteriori probabilities in these points can change by at most 2/kand 2/k′, respectively. It means that the overall change in the value of thesum can not exceed (1 + k 2c

k + k′ 2ck′ )/n = (1 + 4c)/n.

The k-nearest neighbor rule. To avoid difficulties caused by breakingdistance ties, assume that X has a density. Then recall that the applicationof Lemma 11.1 for the empirical distribution implies that no Xj can be oneof the k nearest neighbors of more than kγd points fromDn. Thus, changingthe value of one pair in the training sequence can change at most 1 + 2kγdterms in the expression of L(D)

n , one of them by at most 1, and all theothers by at most c/k. Theorem 9.2 yields the result.

The moving window rule. Again, we only have to check the con-dition of Theorem 9.2 with ci = (1 + 2cβd)/n. Fix a training sequence(x1, y1), . . . , (xn, yn) and replace the pair (xi, yi) by (x′i, y

′i). Then the i-

th term in the sum of the expression of L(S)n can change by at most one.

Page 575: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

31.2 Posterior Probability Estimates 574

Clearly, the j-th term, for which xj 6∈ Sxi,h and xj 6∈ Sx′i,h, keeps its

value. It is easy to see that all the other terms can change by at mostc ·max1/kj , 1/k′j, where kj and k′j are the numbers of points xk, k 6= i, j,from the training sequence that fall in Sxi,h and Sx′

i,h, respectively. Thus,

the overall change in the sum does not exceed

1 +∑

xj∈Sxi,h

c

kj+

∑xj∈Sx′

i,h

c

k′j.

It suffices to show by symmetry that∑xj∈Sxi,h

1/kj ≤ βd. Let nj =|xk, k 6= i, j : xk ∈ Sxi,h ∩ Sxj ,h|. Then clearly,∑

xj∈Sxi,h

1kj≤

∑xj∈Sxi,h

1nj.

To bound the right-hand side from above, cover Sxi,h by βd balls S1, . . . , Sβd

of radius h/2. Denote the number of points falling in them by lm (m =1, . . . , βd):

lm = |xk, k 6= i : xk ∈ Sxi,h ∩ Sm|.Then ∑

xj∈Sxi,h

1nj

≤∑

xj∈Sxi,h

∑m:xj∈Sm

1lm

= βd,

and the theorem is proved. 2

31.2 Posterior Probability Estimates

The error estimates discussed in this section improve on the bias of smoothed estimates, while preserving their small variance. Still, these estimates are of questionable utility in classifier selection. Considering the formula
\[
L(g_n) = \mathbf{E}\bigl\{ I_{\{\eta_n(X,D_n)\le 1/2\}}\eta(X) + I_{\{\eta_n(X,D_n)>1/2\}}(1-\eta(X)) \,\big|\, D_n \bigr\}
\]
for the error probability of a classification rule g_n(x) = I_{\{\eta_n(x,D_n)>1/2\}}, it is plausible to introduce the estimate
\[
L_n^{(P)} = \frac{1}{n}\sum_{i=1}^{n}\Bigl( I_{\{\eta_n(X_i,D_n)\le 1/2\}}\eta_n(X_i,D_n) + I_{\{\eta_n(X_i,D_n)>1/2\}}\bigl(1-\eta_n(X_i,D_n)\bigr)\Bigr),
\]
that is, the expected value is estimated by a sample average, and instead of the (unknown) a posteriori probability η(x), its estimate η_n(x, D_n) is plugged into the formula of L_n. The estimate L_n^{(P)} is usually called the posterior probability error estimate. In the case of nonparametric rules such as the histogram, kernel, and k-nearest neighbor rules it is natural to plug the corresponding nonparametric estimates of the a posteriori probabilities into the expression of the error probability. This and similar estimates of L_n have been introduced and studied by Fukunaga and Kessel (1973), Hora and Wilcox (1982), Fitzmaurice and Hand (1987), Ganesalingam and McLachlan (1980), Kittler and Devijver (1981), Matloff and Pruitt (1984), Moore, Whitsitt, and Landgrebe (1976), Pawlak (1988), Schwemer and Dunn (1980), and Lugosi and Pawlak (1994). It is interesting to notice the similarity between the estimates L_n^{(S)} and L_n^{(P)}, although they were developed from different scenarios.

To reduce the bias, we can use the leave-one-out (or deleted) version of the estimate,

\[
L_n^{(PD)} = \frac{1}{n}\sum_{i=1}^{n}\Bigl( I_{\{\eta_{n-1}(X_i,D_{n,i})\le 1/2\}}\eta_{n-1}(X_i,D_{n,i}) + I_{\{\eta_{n-1}(X_i,D_{n,i})>1/2\}}\bigl(1-\eta_{n-1}(X_i,D_{n,i})\bigr)\Bigr).
\]

The deleted version L_n^{(PD)} has a much better bias than L_n^{(SD)}. We have the bound
\[
\begin{aligned}
\Bigl|\mathbf{E}\{L_n^{(PD)}\} - \mathbf{E}\{L_{n-1}\}\Bigr|
&= \Bigl|\mathbf{E}\bigl\{ I_{\{g_{n-1}(X)=0\}}\eta_{n-1}(X,D_{n-1}) + I_{\{g_{n-1}(X)=1\}}(1-\eta_{n-1}(X,D_{n-1}))\bigr\}\\
&\qquad - \mathbf{E}\bigl\{ I_{\{g_{n-1}(X)=0\}}\eta(X) + I_{\{g_{n-1}(X)=1\}}(1-\eta(X))\bigr\}\Bigr|\\
&\le 2\,\mathbf{E}\bigl\{\bigl|\eta_{n-1}(X,D_{n-1}) - \eta(X)\bigr|\bigr\}.
\end{aligned}
\]

This means that if the estimate η_n(x) of the a posteriori probability η(x) is consistent in L_1(µ), then \mathbf{E}\{L_n^{(PD)}\} converges to L^*. This is the case for all distributions for the histogram and moving window rules if h → 0 and nh^d → ∞, and for the k-nearest neighbor rule if k → ∞ and k/n → 0, as is seen in Chapters 9, 10, and 11. For specific cases it is possible to obtain sharper bounds on the bias of L_n^{(PD)}. For the histogram rule, Lugosi and Pawlak (1994) carried out such an analysis. They showed, for example, that the estimate L_n^{(PD)} is optimistically biased (see Problem 31.2).

Posterior probability estimates of L_n share the good stability properties of smoothed estimates (Problem 31.1).

Finally, let us select a function η' : R^d → [0, 1], and a corresponding rule g(x) = I_{\{η'(x)>1/2\}}, from a class F based on the minimization of the posterior probability error estimate

\[
L_n^{(P)}(\eta') = \frac{1}{n}\sum_{i=1}^{n}\Bigl( I_{\{\eta'(X_i)\le 1/2\}}\eta'(X_i) + I_{\{\eta'(X_i)>1/2\}}\bigl(1-\eta'(X_i)\bigr)\Bigr).
\]


Observe that L_n^{(P)}(η') = 0 whenever η'(x) ∈ {0, 1} for all x, that is, rule selection based on this estimate just does not make sense. The reason is that L_n^{(P)} ignores the Y_i's of the data sequence!

Fukunaga and Kessel (1973) argued that efficient posterior probability estimators can be obtained if additional unclassified observations are available. Very often in practice, in addition to the training sequence D_n, further feature vectors X_{n+1}, ..., X_{n+l} are given without their labels Y_{n+1}, ..., Y_{n+l}, where the X_{n+i} are i.i.d., independent of X and D_n, and have the same distribution as X. This situation is typical in medical applications, where large sets of medical records are available, but it is usually very expensive to obtain the correct diagnoses. These unclassified samples can be efficiently used for testing the performance of a classifier designed from D_n by using the estimate

\[
L_{n,l}^{(U)} = \frac{1}{l}\sum_{i=1}^{l}\Bigl( I_{\{\eta_n(X_{n+i},D_n)\le 1/2\}}\eta_n(X_{n+i},D_n) + I_{\{\eta_n(X_{n+i},D_n)>1/2\}}\bigl(1-\eta_n(X_{n+i},D_n)\bigr)\Bigr).
\]

Again, using L_{n,l}^{(U)} for rule selection is meaningless.
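To make the plug-in construction concrete, here is a minimal computational sketch of the posterior probability error estimate; the same function also computes the unlabeled-sample version when fed posterior estimates at X_{n+1}, ..., X_{n+l}. The code and all names in it (`moving_window_posterior`, `posterior_error_estimate`, the synthetic data) are our own illustration, not taken from the text; any estimate η_n(x, D_n) with values in [0, 1] could be substituted.

```python
import numpy as np

def moving_window_posterior(x, X_train, Y_train, h):
    """eta_n(x, D_n) for the moving window rule: fraction of 1-labels among
    training points within distance h of x (1/2 if the window is empty)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    inside = dist <= h
    return Y_train[inside].mean() if inside.any() else 0.5

def posterior_error_estimate(etas):
    """L_n^(P) (or L_{n,l}^(U)): average of the plug-in terms
    I{eta<=1/2} eta + I{eta>1/2} (1-eta), i.e., of min(eta, 1-eta)."""
    etas = np.asarray(etas, dtype=float)
    return float(np.mean(np.minimum(etas, 1.0 - etas)))

# Illustration on synthetic data (arbitrary choices, for demonstration only).
rng = np.random.default_rng(0)
n, h = 500, 0.4
Y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2)) + Y[:, None]      # class-conditionally shifted Gaussians
etas = np.array([moving_window_posterior(x, X, Y, h) for x in X])
print("posterior probability error estimate:", posterior_error_estimate(etas))
```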

31.3 Rotation Estimate

This method, suggested by Toussaint and Donaldson (1970), is a combination of the holdout and deleted estimates. It is sometimes called the Π-estimate. Let s < n be a positive integer (typically much smaller than n), and assume, for the sake of simplicity, that q = n/s is an integer. The rotation method forms the holdout estimate by holding out the first s pairs of the training data, then the second s pairs, and so forth. The estimate is defined by averaging the q numbers obtained this way. To formalize, denote by D_{n,j}^{(s)} the training data with the j-th s-block held out (j = 1, ..., q):
\[
D_{n,j}^{(s)} = \bigl( (X_1,Y_1),\ldots,(X_{s(j-1)},Y_{s(j-1)}), (X_{sj+1},Y_{sj+1}),\ldots,(X_n,Y_n) \bigr).
\]

The estimate is defined by

\[
L_{n,s}^{(\Pi)} = \frac{1}{q}\sum_{j=1}^{q} \frac{1}{s} \sum_{i=s(j-1)+1}^{sj} I_{\{ g_{n-s}(X_i, D_{n,j}^{(s)}) \ne Y_i \}}.
\]
Taking s = 1 yields the deleted estimate. If s > 1, then the estimate is usually more biased than the deleted estimate, as
\[
\mathbf{E}\bigl\{L_{n,s}^{(\Pi)}\bigr\} = \mathbf{E}\{L_{n-s}\},
\]

but usually exhibits smaller variance.
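A minimal sketch of the rotation (Π) estimate follows. It assumes the classification rule is available as a function `train_and_classify(X_train, Y_train, X_test)` returning 0-1 predictions; this interface, the function names, and the 1-nearest neighbor example rule are our own choices for illustration.

```python
import numpy as np

def rotation_estimate(X, Y, s, train_and_classify):
    """Pi-estimate: hold out consecutive blocks of s pairs, train on the rest,
    count errors on the held-out block, and average over the q = n/s blocks."""
    n = len(Y)
    assert n % s == 0, "for simplicity, q = n/s must be an integer"
    q = n // s
    block_errors = []
    for j in range(q):
        held = np.arange(j * s, (j + 1) * s)        # indices of the j-th s-block
        kept = np.setdiff1d(np.arange(n), held)     # training data D_{n,j}^{(s)}
        pred = train_and_classify(X[kept], Y[kept], X[held])
        block_errors.append(np.mean(pred != Y[held]))
    return float(np.mean(block_errors))

def one_nn(X_train, Y_train, X_test):
    """Example rule: 1-nearest neighbor (an arbitrary choice)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return Y_train[np.argmin(d, axis=1)]

rng = np.random.default_rng(1)
Y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 2)) + Y[:, None]
print(rotation_estimate(X, Y, s=20, train_and_classify=one_nn))
```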


31.4 Bootstrap

Bootstrap methods for estimating the misclassification error became popular following the revolutionary work of Efron (1979; 1983). All bootstrap estimates introduce artificial randomization. The bootstrap sample
\[
D_m^{(b)} = \bigl( (X_1^{(b)}, Y_1^{(b)}), \ldots, (X_m^{(b)}, Y_m^{(b)}) \bigr)
\]
is a sequence of random variable pairs drawn randomly with replacement from the set {(X_1, Y_1), ..., (X_n, Y_n)}. In other words, conditionally on the training sample D_n = ((X_1, Y_1), ..., (X_n, Y_n)), the pairs (X_i^{(b)}, Y_i^{(b)}) are drawn independently from ν_n, the empirical distribution based on D_n in R^d × {0, 1}.

One of the standard bootstrap estimates aims to compensate for the (usually optimistic) bias
\[
B(g_n) = \mathbf{E}\bigl\{ L(g_n) - L_n^{(R)}(g_n) \bigr\}
\]
of the resubstitution estimate L_n^{(R)}. To estimate B(g_n), a bootstrap sample of size m = n may be used:
\[
B_n^{(b)}(g_n) = \sum_{i=1}^{n} \left( \frac{1}{n} - \frac{1}{n}\sum_{j=1}^{n} I_{\{X_j^{(b)} = X_i\}} \right) I_{\{ g_n(X_i, D_n^{(b)}) \ne Y_i \}},
\]
that is, the error of the bootstrap-trained rule on the original sample minus its resubstitution error on the bootstrap sample. Often, bootstrap sampling is repeated several times to average out the effects of the additional randomization. In our case,
\[
\bar B_n(g_n) = \frac{1}{B}\sum_{b=1}^{B} B_n^{(b)}(g_n)
\]
yields the estimator
\[
L_n^{(B)}(g_n) = L_n^{(R)}(g_n) + \bar B_n(g_n).
\]
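The following sketch (our own code, written for the bias-corrected bootstrap as reconstructed above, i.e., resubstitution plus the averaged bootstrap estimate of the optimism) again assumes a `train_and_classify(X_train, Y_train, X_test)` interface.

```python
import numpy as np

def bootstrap_bias_corrected(X, Y, train_and_classify, B=50, seed=0):
    """L_n^(B): resubstitution estimate plus the average of B_n^(b) over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    resub = np.mean(train_and_classify(X, Y, X) != Y)       # L_n^(R)(g_n)
    optimism = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                         # bootstrap sample D_n^(b)
        pred = train_and_classify(X[idx], Y[idx], X)        # bootstrap-trained rule on the original pairs
        err_orig = np.mean(pred != Y)
        counts = np.bincount(idx, minlength=n)              # multiplicity of each original pair
        err_boot = np.sum(counts * (pred != Y)) / n         # resubstitution error on D_n^(b)
        optimism.append(err_orig - err_boot)                # B_n^(b)(g_n)
    return float(resub + np.mean(optimism))
```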

Another estimator based on a bootstrap sample of size n, the so-called E0 estimator, uses resubstitution on the training pairs not appearing in the bootstrap sample. The estimator is defined by
\[
\frac{ \sum_{j=1}^{n} I_{\{X_j \notin \{X_1^{(b)},\ldots,X_n^{(b)}\}\}}\, I_{\{ g_n(X_j, D_n^{(b)}) \ne Y_j \}} }{ \sum_{j=1}^{n} I_{\{X_j \notin \{X_1^{(b)},\ldots,X_n^{(b)}\}\}} }.
\]
Here, too, averages may be taken after generating the bootstrap sample several times.
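A sketch of the E0 estimator under the same assumed `train_and_classify` interface (our own code): the pairs not drawn into the bootstrap sample play the role of a test set, and the resulting errors are averaged over B bootstrap samples.

```python
import numpy as np

def e0_estimate(X, Y, train_and_classify, B=50, seed=0):
    """E0 bootstrap estimate: error of the bootstrap-trained rule on the
    training pairs that do not appear in the bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    vals = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # bootstrap sample indices
        out = np.setdiff1d(np.arange(n), idx)        # pairs left out of D_n^(b)
        if out.size == 0:
            continue                                 # extremely unlikely for moderate n
        pred = train_and_classify(X[idx], Y[idx], X[out])
        vals.append(np.mean(pred != Y[out]))
    return float(np.mean(vals))
```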

Many other versions of bootstrap estimates have been reported, such as the "0.632 estimate," the "double bootstrap," and the "randomized bootstrap" (see Hand (1986), Jain, Dubes, and Chen (1987), and McLachlan (1992) for surveys and additional references). Clearly, none of these estimates provides a universal remedy, but for several specific classification rules, bootstrap estimates have been experimentally found to outperform the deleted and resubstitution estimates. However, one point has to be made clear: we always lose information with the additional randomization. We summarize this in the following simple general result:

Theorem 31.3. Let X_1, ..., X_n be drawn from an unknown distribution µ, and let α(µ) be a functional to be estimated. Let r(·) be a convex risk function (such as r(u) = u² or r(u) = |u|). Let X_1^{(b)}, ..., X_m^{(b)} be a bootstrap sample drawn from X_1, ..., X_n. Then
\[
\inf_{\psi} \mathbf{E}\bigl\{ r\bigl( \psi(X_1^{(b)},\ldots,X_m^{(b)}) - \alpha(\mu) \bigr)\bigr\} \;\ge\; \inf_{\psi} \mathbf{E}\bigl\{ r\bigl( \psi(X_1,\ldots,X_n) - \alpha(\mu) \bigr)\bigr\}.
\]

The theorem states that no matter how large m is, the class of estimators that are functions of the original sample is always at least as good as the class of all estimators that are based upon bootstrap samples. In our case, α(µ) plays the role of the expected error probability \mathbf{E}\{L_n\} = \mathbf{P}\{g_n(X) \ne Y\}. If we take r(u) = u², then it follows from the theorem that there is no estimator L_m based on the bootstrap sample D_m^{(b)} whose squared error \mathbf{E}\{(\mathbf{E}\{L_n\} - L_m)^2\} is smaller than that of the best nonbootstrap estimate. In the proof of the theorem we construct such a nonbootstrap estimator. It is clear, however, that, in general, the latter estimator is too complex to have any practical value. The randomization of bootstrap methods may provide a useful tool to overcome the computational difficulties in finding good estimators.

Proof. Let ψ be any mapping taking m arguments. Then
\[
\begin{aligned}
\mathbf{E}\Bigl\{ r\bigl( \psi(X_1^{(b)},\ldots,X_m^{(b)}) - \alpha(\mu) \bigr) \,\Big|\, X_1,\ldots,X_n \Bigr\}
&= \frac{1}{n^m} \sum_{(i_1,\ldots,i_m)\in\{1,\ldots,n\}^m} r\bigl( \psi(X_{i_1},\ldots,X_{i_m}) - \alpha(\mu) \bigr)\\
&\ge r\Bigl( \frac{1}{n^m} \sum_{(i_1,\ldots,i_m)\in\{1,\ldots,n\}^m} \bigl( \psi(X_{i_1},\ldots,X_{i_m}) - \alpha(\mu) \bigr) \Bigr)\\
&\qquad\text{(by Jensen's inequality and the convexity of } r)\\
&= r\Bigl( \frac{1}{n^m} \sum_{(i_1,\ldots,i_m)\in\{1,\ldots,n\}^m} \psi(X_{i_1},\ldots,X_{i_m}) - \alpha(\mu) \Bigr)\\
&= r\bigl( \psi^*(X_1,\ldots,X_n) - \alpha(\mu) \bigr),
\end{aligned}
\]
where
\[
\psi^*(x_1,\ldots,x_n) = \frac{1}{n^m} \sum_{(i_1,\ldots,i_m)\in\{1,\ldots,n\}^m} \psi(x_{i_1},\ldots,x_{i_m}).
\]
Now, after taking expectations with respect to X_1, ..., X_n, we see that for every ψ we start out with, there is a ψ* that is at least as good. □
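The averaging step in the proof is essentially a Rao-Blackwell argument. A small simulation (our own, with the illustrative choices α(µ) = E{X} and r(u) = u²) compares the squared error of the naive bootstrap-sample mean with that of its average ψ*, which in this special case reduces to the original sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials, alpha = 30, 30, 20000, 0.0     # alpha(mu) = E{X} = 0 for N(0,1) data
se_boot, se_star = 0.0, 0.0
for _ in range(trials):
    x = rng.normal(size=n)
    xb = x[rng.integers(0, n, m)]            # bootstrap sample
    se_boot += (xb.mean() - alpha) ** 2      # psi: mean of the bootstrap sample
    se_star += (x.mean() - alpha) ** 2       # psi*: its average over all bootstrap draws
print("bootstrap-based estimator MSE:", se_boot / trials)
print("averaged (original-sample) estimator MSE:", se_star / trials)
```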

If m = O(n), then the bootstrap has an additional problem related to the coupon collector problem. Let N be the number of different pairs in the bootstrap sample (X_1^{(b)}, Y_1^{(b)}), ..., (X_m^{(b)}, Y_m^{(b)}). If m ∼ cn for some constant c, then N/n → 1 − e^{−c} with probability one. To see this, note that
\[
\mathbf{E}\{n - N\} = n\left(1 - \frac{1}{n}\right)^{cn} \sim n e^{-c},
\]
so \mathbf{E}\{N\}/n → 1 − e^{−c}. Furthermore, if one of the m drawings is varied, N changes by at most one. Hence, by McDiarmid's inequality, for ε > 0,
\[
\mathbf{P}\bigl\{ |N - \mathbf{E}\{N\}| > n\varepsilon \bigr\} \le 2 e^{-2n\varepsilon^2/c},
\]
from which we conclude (via the Borel-Cantelli lemma) that N/n → 1 − e^{−c} with probability one. As roughly ne^{−c} of the original data pairs do not appear in the bootstrap sample, a considerable loss of information takes place that will be reflected in the performance. This phenomenon is well known, and it has motivated several modifications of the simplest bootstrap estimate. For more information, see the surveys by Hand (1986), Jain, Dubes, and Chen (1987), and McLachlan (1992).
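A quick simulation (ours) of the coupon-collector effect: with m = cn bootstrap draws, the fraction of distinct training pairs concentrates around 1 − e^{−c}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 10000, 1.0
m = int(c * n)
idx = rng.integers(0, n, m)                  # indices of the bootstrap draws
frac_distinct = np.unique(idx).size / n      # N / n
print(frac_distinct, "vs", 1 - np.exp(-c))   # both close to 0.632... for c = 1
```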


Problems and Exercises

Problem 31.1. Show that the posterior probability estimate L_n^{(P)} of the histogram, k-nearest neighbor, and moving window rules satisfies
\[
\mathbf{P}\bigl\{ |L_n^{(P)} - \mathbf{E}\{L_n^{(P)}\}| \ge \varepsilon \bigr\} \le 2 e^{-2n\varepsilon^2/C}
\]
and
\[
\mathbf{Var}\bigl\{L_n^{(P)}\bigr\} \le \frac{C}{4n},
\]
where C is a constant depending on the rule. In the case of the histogram rule the value of C is C = 25, for the k-nearest neighbor rule C = (1 + 2γ_d)², and for the moving window rule C = (1 + 2β_d)². Also show that the deleted version L_n^{(PD)} of the estimate satisfies the same inequalities (Lugosi and Pawlak (1994)). Hint: Proceed as in the proof of Theorem 31.2.

Problem 31.2. Show that the deleted posterior probability estimate of the error probability of a histogram rule is always optimistically biased, that is, for all n and all distributions,
\[
\mathbf{E}\bigl\{L_n^{(PD)}\bigr\} \le \mathbf{E}\{L_{n-1}\}.
\]

Problem 31.3. Show that for any classification rule and any estimate 0 ≤ η_n(x, D_n) ≤ 1 of the a posteriori probabilities, for all distributions of (X, Y), and for all l, n, and ε > 0,
\[
\mathbf{P}\bigl\{ |L_{n,l}^{(U)} - \mathbf{E}\{L_{n,l}^{(U)}\}| \ge \varepsilon \,\big|\, D_n \bigr\} \le 2 e^{-2l\varepsilon^2},
\]
and \mathbf{Var}\{L_{n,l}^{(U)} \mid D_n\} \le 1/l. Further, show that for all l,
\[
\mathbf{E}\bigl\{L_{n,l}^{(U)}\bigr\} = \mathbf{E}\bigl\{L_{n+1}^{(PD)}\bigr\}.
\]

Problem 31.4. Empirical squared error. Consider the deleted empirical squared error
\[
e_n = \frac{1}{n}\sum_{i=1}^{n} \bigl( Y_i - \eta_{n-1}(X_i, D_{n,i}) \bigr)^2.
\]
Show that
\[
\mathbf{E}\{e_n\} = \frac{L_{NN}}{2} + \mathbf{E}\bigl\{ \bigl( \eta(X) - \eta_{n-1}(X, D_{n-1}) \bigr)^2 \bigr\},
\]
where L_{NN} is the asymptotic error probability of the 1-nearest neighbor rule. Show that if η_{n−1} is the histogram, kernel, or k-nn estimate, then \mathbf{Var}\{e_n\} \le c/n for some constant c depending only on the dimension. We see that e_n is an asymptotically optimistically biased estimate of L(g_n) when η_{n−1} is an L_2(µ)-consistent estimate of η. Still, this estimate is useful in classifier selection (see Theorem 29.2).


32 Feature Extraction

32.1 Dimensionality Reduction

So far, we have not addressed the question of how the components of the feature vector X are obtained. In general, these components are based on d measurements of the object to be classified. How many measurements should be made? What should these measurements be? We study these questions in this chapter. General recipes are hard to give as the answers depend on the specific problem. However, there are some rules of thumb that should be followed. One such rule is that noisy measurements, that is, components that are independent of Y, should be avoided. Also, adding a component that is a function of other components is useless. A necessary and sufficient condition for a measurement providing additional information is given in Problem 32.1.

Our goal, of course, is to make the error probability L(g_n) as small as possible. This depends on many things, such as the joint distribution of the selected components and the label Y, the sample size, and the classification rule g_n. To make things a bit simpler, we first investigate the Bayes errors corresponding to the selected components. This approach makes sense, since the Bayes error is the theoretical limit of the performance of any classifier. As Problem 2.1 indicates, collecting more measurements cannot increase the Bayes error. On the other hand, having too many components is not desirable. Just recall the curse of dimensionality that we often faced: to get good error rates, the number of training samples should be exponentially large in the number of components. Also, computational and storage limitations may prohibit us from working with many components.

We may formulate the feature selection problem as follows: let X^{(1)}, ..., X^{(d)} be random variables representing d measurements. For a set A ⊆ {1, ..., d} of indices, let X_A denote the |A|-dimensional random vector whose components are the X^{(i)}'s with i ∈ A (in the order of increasing indices). Define
\[
L^*(A) = \inf_{g : \mathbb{R}^{|A|} \to \{0,1\}} \mathbf{P}\{ g(X_A) \ne Y \}
\]
as the Bayes error corresponding to the pair (X_A, Y).

Figure 32.1. A possible arrangement of L*(A)'s for d = 3.

Obviously, L*(A) ≤ L*(B) whenever B ⊂ A, and L*(∅) = min(P{Y = 0}, P{Y = 1}). The problem is to find an efficient way of selecting an index set A with |A| = k whose corresponding Bayes error is the smallest. Here k < d is a fixed integer. Exhaustive search through the \binom{d}{k} possibilities is often undesirable because of the imposed computational burden. Many attempts have been made to find fast algorithms to obtain the best subset of features. See Fu, Min, and Li (1970), Kanal (1974), Ben-Bassat (1982), and Devijver and Kittler (1982) for surveys. It is easy to see that the best k individual features, that is, the components corresponding to the k smallest values of L*({i}), do not necessarily constitute the best k-dimensional vector: just consider a case in which X^{(1)} ≡ X^{(2)} ≡ ··· ≡ X^{(k)}. Cover and Van Campenhout (1977) showed that any ordering of the 2^d subsets of {1, ..., d} consistent with the obvious requirement L*(A) ≤ L*(B) if B ⊆ A is possible. More precisely, they proved the following surprising result:

Theorem 32.1. (Cover and Van Campenhout (1977)). Let A_1, A_2, ..., A_{2^d} be an ordering of the 2^d subsets of {1, ..., d} satisfying the consistency property i < j if A_i ⊂ A_j. (Therefore, A_1 = ∅ and A_{2^d} = {1, ..., d}.) Then there exists a distribution of (X, Y) = (X^{(1)}, ..., X^{(d)}, Y) such that
\[
L^*(A_1) > L^*(A_2) > \cdots > L^*(A_{2^d}).
\]


The theorem shows that every feature selection algorithm that finds the best k-element subset has to search exhaustively through all k-element subsets for some distributions. Any other method is doomed to failure for some distribution. Many suboptimal, heuristic algorithms have been introduced trying to avoid the computational demand of exhaustive search (see, e.g., Sebestyen (1962), Meisel (1972), Chang (1973), Vilmansen (1973), and Devijver and Kittler (1982)). Narendra and Fukunaga (1977) introduced an efficient branch-and-bound method that finds the optimal set of features. Their method avoids searching through all subsets in many cases by making use of the monotonicity of the Bayes error with respect to the partial ordering of the subsets. The key to our proof of the theorem is the following simple lemma:

Lemma 32.1. Let A_1, ..., A_{2^d} be as in Theorem 32.1. Let 1 < i < 2^d. Assume that the distribution of (X, Y) on R^d × {0, 1} is such that the distribution of X is concentrated on a finite set, L*(A_{2^d}) = L*({1, ..., d}) = 0, and L*(A_j) < 1/2 for each i < j ≤ 2^d. Then there exists another finite distribution such that L*(A_j) remains unchanged for each j > i, and
\[
L^*(A_j) < L^*(A_i) < \tfrac{1}{2} \quad \text{for each } j > i.
\]

Proof. We denote the original distribution of X by µ and the a posteriori probability function by η. We may assume without loss of generality that every atom of the distribution of X is in [0, M)^d for some M > 0. Since L*({1, ..., d}) = 0, the value of η(x) is either zero or one at each atom. We construct the new distribution by duplicating each atom in a special way. We describe the new distribution by defining a measure µ' on R^d and an a posteriori function η' : R^d → {0, 1}.

Define the vector v_{A_i} ∈ R^d such that its m-th component equals M if m ∉ A_i, and zero if m ∈ A_i. The new measure µ' has twice as many atoms as µ. For each atom x ∈ R^d of µ, the new measure µ' has two atoms, namely, x_1 = x and x_2 = x + v_{A_i}. The new distribution is specified by
\[
\mu'(x_1) = q\,\mu(x), \quad \mu'(x_2) = (1-q)\,\mu(x), \quad \eta'(x_1) = \eta(x), \quad \text{and} \quad \eta'(x_2) = 1 - \eta(x),
\]
where q ∈ (0, 1/2) is specified later. It remains for us to verify that this distribution satisfies the requirements of the lemma for some q. First observe that the values L*(A_j) remain unchanged for all j > i. This follows from the fact that there is at least one component in A_j along which the new set of atoms is strictly separated from the old one, leaving the corresponding contribution to the Bayes error unchanged. On the other hand, as we vary q from zero to 1/2, the new value of L*(A_i) grows continuously from the old value of L*(A_i) to 1/2. Therefore, since by assumption max_{j>i} L*(A_j) < 1/2, there exists a value of q such that the new L*(A_i) satisfies max_{j>i} L*(A_j) < L*(A_i) < 1/2, as desired. □


Proof of Theorem 32.1. We construct the desired distribution in 2^d − 2 steps, applying Lemma 32.1 in each step. The procedure for d = 3 is illustrated in Figure 32.2.

Figure 32.2. Construction of a prespecified ordering. In this three-dimensional example, the first four steps of the procedure are shown, when the desired ordering is L*({1,2,3}) ≤ L*({2,3}) ≤ L*({1,2}) ≤ L*({2}) ≤ L*({1,3}) ≤ ···. Black circles represent atoms with η = 1 and white circles are those with η = 0.

We start with a monoatomic distribution concentrated at the origin, with η(0) = 1. Then clearly, L*(A_{2^d}) = 0. By Lemma 32.1, we construct a distribution such that L*(A_{2^d}) = 0 and 0 = L*(A_{2^d}) < L*(A_{2^d−1}) < 1/2. By applying the lemma again, we can construct a distribution with 0 = L*(A_{2^d}) < L*(A_{2^d−1}) < L*(A_{2^d−2}) < 1/2. After i steps, we have a distribution satisfying the last i inequalities of the desired ordering. The construction is finished after 2^d − 2 steps. □

Remark. The original example of Cover and Van Campenhout (1977) uses the multidimensional Gaussian distribution. Van Campenhout (1980) developed the idea further by showing that not only all possible orderings, but all possible values of the L*(A_i)'s can be achieved by some distribution. The distribution constructed in the above proof is discrete. It has 2^{2^d − 2} atoms. □


One may suspect that feature extraction is much easier if, given Y, the components X^{(1)}, ..., X^{(d)} are conditionally independent. However, three- and four-dimensional examples given by Elashoff, Elashoff, and Goldman (1967), Toussaint (1971), and Cover (1974) show that even the individually best two independent components need not be the best pair of components. We do not know if Theorem 32.1 generalizes to the case when the components are conditionally independent. In the next example, the pair of components consisting of the two worst single features is the best pair, and vice versa.

Theorem 32.2. (Toussaint (1971)). There exist binary-valued random variables X_1, X_2, X_3, Y ∈ {0, 1} such that X_1, X_2, and X_3 are conditionally independent (given Y),
\[
L^*(\{1\}) < L^*(\{2\}) < L^*(\{3\}),
\]
but
\[
L^*(\{1,2\}) > L^*(\{1,3\}) > L^*(\{2,3\}).
\]

Proof. Let P{Y = 1} = 1/2. Then the joint distribution of X_1, X_2, X_3, Y is specified by the conditional probabilities P{X_i = 1 | Y = 0} and P{X_i = 1 | Y = 1}, i = 1, 2, 3. A straightforward calculation shows that the values
\[
\begin{aligned}
&\mathbf{P}\{X_1 = 1 \mid Y = 0\} = 0.1, && \mathbf{P}\{X_1 = 1 \mid Y = 1\} = 0.9,\\
&\mathbf{P}\{X_2 = 1 \mid Y = 0\} = 0.05, && \mathbf{P}\{X_2 = 1 \mid Y = 1\} = 0.8,\\
&\mathbf{P}\{X_3 = 1 \mid Y = 0\} = 0.01, && \mathbf{P}\{X_3 = 1 \mid Y = 1\} = 0.71
\end{aligned}
\]
satisfy the stated inequalities. □

As our ultimate goal is to minimize the error probability, finding the feature set minimizing the Bayes error is not the best we can do. For example, if we know that we will use the 3-nearest neighbor rule, then it makes more sense to select the set of features that minimizes the asymptotic error probability L_{3NN} of the 3-nearest neighbor rule. Recall from Chapter 5 that
\[
L_{3NN} = \mathbf{E}\bigl\{ \eta(X)(1-\eta(X))\bigl(1 + 4\eta(X)(1-\eta(X))\bigr) \bigr\}.
\]
The situation here is even messier than for the Bayes error. As the next example shows, it is not even true that A ⊂ B implies L_{3NN}(A) ≥ L_{3NN}(B), where A, B ⊆ {1, ..., d} are two subsets of components, and L_{3NN}(A) denotes the asymptotic error probability of the 3-nearest neighbor rule for (X_A, Y). In other words, adding components may increase L_{3NN}! This can never happen to the Bayes error, and in fact, to any F-error (Theorem 3.3). The anomaly is due to the fact that the function x(1−x)(1 + 4x(1−x)) is convex near zero and one.


Example. Let the joint distribution of X = (X_1, X_2) be uniform on [0, 1]². The joint distribution of (X, Y) is defined by the a posteriori probabilities
\[
\eta(x) =
\begin{cases}
0.1 & \text{if } x \in [0, 1/2) \times [0, 1/2)\\
0 & \text{if } x \in [1/2, 1] \times [0, 1/2)\\
1 & \text{if } x \in [0, 1/2) \times [1/2, 1]\\
0.9 & \text{if } x \in [1/2, 1] \times [1/2, 1]
\end{cases}
\]
(Figure 32.3). Straightforward calculations show that L_{3NN}({1, 2}) = 0.0612, while L_{3NN}({2}) = 0.056525, a smaller value! □

Figure 32.3. An example in which an additional feature increases the asymptotic error probability of the 3-nn rule.
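Both asymptotic error probabilities can be computed in closed form from the constant pieces of η. The sketch below (ours, written for the reconstruction of the example given above) evaluates the formula for the pair of features and for the second feature alone, whose marginal a posteriori probability is the average of η over the first coordinate.

```python
def l3nn(values_and_probs):
    """L_{3NN} = E{ eta(1-eta)(1 + 4 eta(1-eta)) } for a piecewise-constant eta."""
    total = 0.0
    for eta, p in values_and_probs:
        a = eta * (1 - eta)
        total += p * a * (1 + 4 * a)
    return total

# eta over the four quarters of the unit square, each carrying probability 1/4
pieces_12 = [(0.1, 0.25), (0.0, 0.25), (1.0, 0.25), (0.9, 0.25)]
# projecting on the second coordinate averages eta over the first one
pieces_2 = [(0.05, 0.5), (0.95, 0.5)]
print("L3NN({1,2}) =", l3nn(pieces_12))   # 0.0612
print("L3NN({2})   =", l3nn(pieces_2))    # 0.056525, smaller despite fewer features
```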

Of course, the real measure of the goodness of the selected feature set is the error probability L(g_n) of the classifier designed by using the training data D_n. If the classification rule g_n is not known at the stage of feature selection, then the best one can do is to estimate the Bayes errors L*(A) for each set A of features, and select a feature set by minimizing the estimate. Unfortunately, as Theorem 8.5 shows, no method of estimating the Bayes errors can guarantee good performance. If we know what classifier will be used after feature selection, then the best strategy is to select a set of measurements based on comparing estimates of the error probabilities. We do not pursue this question further.

For special cases, we do not need to mount a big search for the best features. Here is a simple example: given Y = i, let X = (X^{(1)}, ..., X^{(d)}) have d independent components, where, given Y = i, X^{(j)} is normal(m_{ji}, σ_j²). It is easy to verify (Problem 32.2) that if P{Y = 1} = 1/2, then
\[
L^* = \mathbf{P}\left\{ N > \frac{r}{2} \right\},
\]
where N is a standard normal random variable, and
\[
r^2 = \sum_{j=1}^{d} \left( \frac{m_{j1} - m_{j0}}{\sigma_j} \right)^2
\]
is the square of the Mahalanobis distance (see also Duda and Hart (1973, pp. 66–67)). For this case, the quality of the j-th feature is measured by
\[
\left( \frac{m_{j1} - m_{j0}}{\sigma_j} \right)^2.
\]
We may as well rank these values, and given that we need only d' < d features, we are best off taking the d' features with the highest quality indices.

It is possible to come up with analytic solutions in other special cases as well. For example, Raudys (1976) and Raudys and Pikelis (1980) investigated the dependence of \mathbf{E}\{L(g_n)\} on the dimension of the feature space in the case of certain linear classifiers and normal distributions. They point out that for a fixed n, by increasing the number of features, the expected error probability \mathbf{E}\{L(g_n)\} first decreases, and then, after reaching an optimum, grows again.

32.2 Transformations with Small Distortion

One may view the problem of feature extraction in general as the problem of finding a transformation (i.e., a function) T : R^d → R^d so that the Bayes error L*_T corresponding to the pair (T(X), Y) is close to the Bayes error L* corresponding to the pair (X, Y). One typical example of such transformations is fine quantization (i.e., discretization) of X, when T maps the observed values into a set of finitely, or countably infinitely, many values. Reduction of the dimensionality of the observations can be put in this framework as well. In the following result we show that a small distortion of the observation cannot cause a large increase in the Bayes error probability.

Theorem 32.3. (Faragó and Győrfi (1975)). Assume that for a sequence of transformations T_n, n = 1, 2, ...,
\[
\| T_n(X) - X \| \to 0
\]
in probability, where ‖·‖ denotes the Euclidean norm in R^d. Then, if L* is the Bayes error for (X, Y) and L*_{T_n} is the Bayes error for (T_n(X), Y),
\[
L^*_{T_n} \to L^*.
\]

Proof. For arbitrarily small ε > 0 we can choose a uniformly continuous function 0 ≤ η̃(x) ≤ 1 such that
\[
2\,\mathbf{E}|\eta(X) - \tilde\eta(X)| < \varepsilon.
\]
For any transformation T, consider now the decision problem for (T(X), Ỹ), where the random variable Ỹ satisfies P{Ỹ = 1 | X = x} = η̃(x) for every x ∈ R^d. Denote the corresponding Bayes error by \tilde L^*_T, and the Bayes error corresponding to the pair (X, Ỹ) by \tilde L^*. Obviously,
\[
0 \le L^*_{T_n} - L^* \le |L^*_{T_n} - \tilde L^*_{T_n}| + |\tilde L^*_{T_n} - \tilde L^*| + |\tilde L^* - L^*|. \tag{32.1}
\]
To bound the first and third terms on the right-hand side, observe that for any transformation T,
\[
\mathbf{P}\{Y = 1 \mid T(X)\} = \mathbf{E}\{ \mathbf{P}\{Y=1\mid X\} \mid T(X) \} = \mathbf{E}\{ \eta(X) \mid T(X) \}.
\]
Therefore, the Bayes error corresponding to the pair (T(X), Y) equals
\[
L^*_T = \mathbf{E}\bigl\{ \min\bigl( \mathbf{E}\{\eta(X)\mid T(X)\},\; 1 - \mathbf{E}\{\eta(X)\mid T(X)\} \bigr) \bigr\}.
\]
Thus,
\[
\begin{aligned}
|L^*_T - \tilde L^*_T|
&= \bigl| \mathbf{E}\bigl\{ \min\bigl( \mathbf{E}\{\eta(X)\mid T(X)\}, 1-\mathbf{E}\{\eta(X)\mid T(X)\} \bigr)\bigr\} - \mathbf{E}\bigl\{ \min\bigl( \mathbf{E}\{\tilde\eta(X)\mid T(X)\}, 1-\mathbf{E}\{\tilde\eta(X)\mid T(X)\} \bigr)\bigr\} \bigr|\\
&\le \mathbf{E}\bigl| \mathbf{E}\{\eta(X)\mid T(X)\} - \mathbf{E}\{\tilde\eta(X)\mid T(X)\} \bigr|\\
&\le \mathbf{E}|\eta(X) - \tilde\eta(X)| \qquad \text{(by Jensen's inequality)}\\
&< \varepsilon,
\end{aligned}
\]
so the first and third terms of (32.1) are less than ε. For the second term, define the decision function
\[
\tilde g_n(x) = \begin{cases} 0 & \text{if } \tilde\eta(x) \le 1/2\\ 1 & \text{otherwise,}\end{cases}
\]
which has error probability L(\tilde g_n) = \mathbf{P}\{\tilde g_n(T_n(X)) \ne \tilde Y\}. Then L(\tilde g_n) \ge \tilde L^*_{T_n}, and we have
\[
0 \le \tilde L^*_{T_n} - \tilde L^* \le L(\tilde g_n) - \tilde L^* \le 2\,\mathbf{E}|\tilde\eta(T_n(X)) - \tilde\eta(X)|,
\]
by Theorem 2.2. All we have to show now is that the limit supremum of the above quantity does not exceed ε. Let δ(ε) be the inverse modulus of continuity of η̃, that is, a positive number such that ‖x − y‖ ≤ δ(ε) implies 2|η̃(x) − η̃(y)| ≤ ε; such a δ(ε) > 0 exists by the uniform continuity of η̃. Now, we have
\[
2\,\mathbf{E}|\tilde\eta(T_n(X)) - \tilde\eta(X)| \le 2\,\mathbf{E}\bigl\{ I_{\{\|X - T_n(X)\| > \delta(\varepsilon)\}} \bigr\} + 2\,\mathbf{E}\bigl\{ I_{\{\|X - T_n(X)\| \le \delta(\varepsilon)\}} |\tilde\eta(T_n(X)) - \tilde\eta(X)| \bigr\}.
\]
Clearly, the first term on the right-hand side converges to zero by assumption, while the second term does not exceed ε by the definition of δ(ε). □

Remark. It is clear from the proof of Theorem 32.3 that everything remains true if the observation X takes its values in a separable metric space with metric ρ, and the condition of the theorem is modified to ρ(T_n(X), X) → 0 in probability. This generalization has significance in curve recognition, when X is a stochastic process. Then Theorem 32.3 asserts that one does not lose much information by using the usual discretization methods, such as, for example, the Karhunen-Loève series expansion (see Problems 32.3 to 32.5). □

32.3 Admissible and Sufficient Transformations

Sometimes the cost of guessing zero while the true value of Y is one differs from the cost of guessing one while Y = 0. These situations may be handled as follows. Define the costs
\[
C(m, l), \quad m, l = 0, 1.
\]
Here C(Y, g(X)) is the cost of deciding on g(X) when the true label is Y. The risk of a decision function g is defined as the expected value of the cost:
\[
R_g = \mathbf{E}\{ C(Y, g(X)) \}.
\]
Note that if
\[
C(m, l) = \begin{cases} 1 & \text{if } m \ne l\\ 0 & \text{otherwise,}\end{cases}
\]
then the risk is just the probability of error. Introduce the notation
\[
Q_m(x) = \eta(x)\,C(1,m) + (1 - \eta(x))\,C(0,m), \quad m = 0, 1.
\]

Then we have the following extension of Theorem 2.1:

Theorem 32.4. Define
\[
g^*(x) = \begin{cases} 1 & \text{if } Q_1(x) < Q_0(x)\\ 0 & \text{otherwise.}\end{cases}
\]
Then for all decision functions g we have
\[
R_{g^*} \le R_g.
\]
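In code, the cost-sensitive Bayes decision is a one-line comparison of the two conditional expected costs. The sketch below (ours) follows the notation of the theorem, with C[m][l] denoting the cost of deciding l when the true label is m.

```python
def bayes_decision(eta_x, C):
    """Return the decision minimizing Q_m(x) = eta(x) C(1,m) + (1-eta(x)) C(0,m)."""
    q0 = eta_x * C[1][0] + (1 - eta_x) * C[0][0]
    q1 = eta_x * C[1][1] + (1 - eta_x) * C[0][1]
    return 1 if q1 < q0 else 0

# With the 0-1 cost this reduces to the usual rule I{eta(x) > 1/2}:
C01 = [[0, 1], [1, 0]]
print([bayes_decision(e, C01) for e in (0.2, 0.5, 0.8)])   # [0, 0, 1]
```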


R_{g^*} is called the Bayes risk. The proof is left as an exercise (see Problem 32.7). Which transformations preserve all the necessary information in the sense that the Bayes error probability corresponding to the pair (T(X), Y) equals that of (X, Y)? Clearly, every invertible mapping T has this property. However, the practically interesting transformations are the ones that provide some compression of the data. The most efficient of such transformations is the Bayes decision g* itself: g* is specifically designed to minimize the error probability. If the goal is to minimize the Bayes risk with respect to some cost function other than the error probability, then g* generally fails to preserve the Bayes risk. It is natural to ask what transformations preserve the Bayes risk for all possible cost functions. This question has practical significance when the collection of the data and the construction of the decision are separated in space or in time. In such cases the data must be transmitted via a communication channel (or stored). In both cases there is a need for an efficient data compression rule. In this problem formulation, when collecting the data, one may not know the final cost function. Therefore a desirable data compression (transformation) does not increase the Bayes risk for any cost function C(·, ·). Here T is a measurable function mapping from R^d to R^{d'} for some positive integer d'.

Definition 32.1. Let R*_{C,T} denote the Bayes risk for the cost function C and the transformed observation T(X). A transformation T is called admissible if for any cost function C,
\[
R^*_{C,T} = R^*_C,
\]
where R*_C denotes the Bayes risk for the original observation.

Obviously each invertible transformation T is admissible. A nontrivial example of an admissible transformation is
\[
T^*(X) = \eta(X),
\]
since, according to Theorem 32.4, the Bayes decision for any cost function can be constructed from the a posteriori probability η(x) and the cost function. Surprisingly, this is basically the only such transformation, in the following sense:

Theorem 32.5. A transformation T is admissible if and only if there is a mapping G such that
\[
G(T(X)) = \eta(X) \quad \text{with probability one.}
\]

Proof. The converse is easy, since for such a G,
\[
R^*_C \le R^*_{C,T(X)} \le R^*_{C,G(T(X))} = R^*_C.
\]


Assume now that T is admissible but no such function G exists. Then there is a set A ⊂ R^d such that µ(A) > 0, T(x) is constant on A, while η(x) is not constant on A. Then there are real numbers 0 < c < 1 and ε > 0, and sets B, D ⊂ A such that µ(B), µ(D) > 0, and
\[
1 - \eta(x) > c + \varepsilon \ \text{ if } x \in B, \qquad 1 - \eta(x) < c - \varepsilon \ \text{ if } x \in D.
\]
Now, choose a cost function with the following values:
\[
c_{0,0} = c_{1,1} = 0, \quad c_{1,0} = 1, \quad \text{and} \quad c_{0,1} = \frac{1-c}{c}.
\]
Then
\[
Q_0(x) = \eta(x), \qquad Q_1(x) = c_{0,1}(1 - \eta(x)),
\]
and the Bayes decision on B ∪ D is given by
\[
g^*(x) = \begin{cases} 0 & \text{if } c < 1 - \eta(x)\\ 1 & \text{otherwise.}\end{cases}
\]
Now, let g(T(x)) be an arbitrary decision. Without loss of generality we can assume that g(T(x)) = 0 if x ∈ A. Then the difference between the risk of g(T(x)) and the Bayes risk is
\[
\begin{aligned}
R_{g(T(X))} - R_{g^*(X)}
&= \int_{\mathbb{R}^d} Q_{g(T(x))}(x)\,\mu(dx) - \int_{\mathbb{R}^d} Q_{g^*(x)}(x)\,\mu(dx)\\
&= \int_{\mathbb{R}^d} \bigl( Q_{g(T(x))}(x) - Q_{g^*(x)}(x) \bigr)\,\mu(dx)\\
&\ge \int_{D} \bigl( Q_0(x) - Q_1(x) \bigr)\,\mu(dx)\\
&= \int_{D} \bigl( \eta(x) - c_{0,1}(1-\eta(x)) \bigr)\,\mu(dx)\\
&= \int_{D} \left( 1 - \frac{1-\eta(x)}{c} \right)\mu(dx) \;\ge\; \frac{\varepsilon}{c}\,\mu(D) > 0. \qquad \square
\end{aligned}
\]

We can give another characterization of admissible transformations by virtue of a well-known concept of mathematical statistics:

Definition 32.2. T(X) is called a sufficient statistic if the random variables X, T(X), and Y form a Markov chain, in this order. That is, for any set A, P{Y ∈ A | T(X), X} = P{Y ∈ A | T(X)}.


Theorem 32.6. A transformation T is admissible if and only if T(X) is a sufficient statistic.

Proof. Assume that T is admissible. Then according to Theorem 32.5 there is a mapping G such that
\[
G(T(X)) = \eta(X) \quad \text{with probability one.}
\]
Then
\[
\mathbf{P}\{Y = 1 \mid T(X), X\} = \mathbf{P}\{Y = 1 \mid X\} = \eta(X) = G(T(X)),
\]
and
\[
\begin{aligned}
\mathbf{P}\{Y = 1 \mid T(X)\} &= \mathbf{E}\{ \mathbf{P}\{Y = 1 \mid T(X), X\} \mid T(X) \}\\
&= \mathbf{E}\{ G(T(X)) \mid T(X) \}\\
&= G(T(X)),
\end{aligned}
\]
thus P{Y = 1 | T(X), X} = P{Y = 1 | T(X)}, and therefore T(X) is sufficient. On the other hand, if T(X) is sufficient, then
\[
\mathbf{P}\{Y = 1 \mid X\} = \mathbf{P}\{Y = 1 \mid T(X), X\} = \mathbf{P}\{Y = 1 \mid T(X)\},
\]
so for the choice
\[
G(T(X)) = \mathbf{P}\{Y = 1 \mid T(X)\}
\]
we have the desired function G(·), and therefore T is admissible. □

Theorem 32.6 states that we may replace X by any sufficient statistic T(X) without altering the Bayes error. The problem with this, in practice, is that we do not know the sufficient statistics because we do not know the distribution. If the distribution of (X, Y) is known to some extent, then Theorem 32.6 may be useful.

Example. Assume that it is known that η(x) = e^{−c‖x‖} for some unknown c > 0. Then ‖X‖ is a sufficient statistic. Thus, for discrimination, we may replace the d-dimensional vector X by the 1-dimensional random variable ‖X‖ without loss of discrimination power. □

Example. If η(x^{(1)}, x^{(2)}, x^{(3)}) = η_0(x^{(1)}x^{(2)}, x^{(2)}x^{(3)}) for some function η_0, then (X^{(1)}X^{(2)}, X^{(2)}X^{(3)}) is a 2-dimensional sufficient statistic. For discrimination, there is no need to deal with X^{(1)}, X^{(2)}, X^{(3)}. It suffices to extract the features X^{(1)}X^{(2)} and X^{(2)}X^{(3)}. □

Example. If, given Y = i, X is normal with unknown mean m_i and diagonal covariance matrix σ²I (for unknown σ), then η(x) is a function of ‖x − m_1‖² − ‖x − m_0‖² only, for unknown m_0, m_1. Here we have no obvious sufficient statistic. However, if m_0 and m_1 are both known, then a quick inspection shows that x^T(m_1 − m_0) is a 1-dimensional sufficient statistic. Again, it suffices to look for the simplest possible argument of η. If m_0 = m_1 = 0 but the covariance matrices are σ_0²I and σ_1²I given that Y = 0 or Y = 1, then ‖X‖² is a sufficient statistic. □

In summary, the results of this section are useful for picking out features when some theoretical information is available regarding the distribution of (X, Y).

Problems and Exercises

Problem 32.1. Consider the pair (X, Y) ∈ R^d × {0, 1} of random variables, and let X^{(d+1)} ∈ R be an additional component. Define the augmented random vector X' = (X, X^{(d+1)}). Denote the Bayes errors corresponding to the pairs (X, Y) and (X', Y) by L*_X and L*_{X'}, respectively. Clearly, L*_X ≥ L*_{X'}. Prove that equality holds if and only if
\[
\mathbf{P}\bigl\{ I_{\{\eta'(X') > 1/2\}} \ne I_{\{\eta(X) > 1/2\}} \bigr\} = 0,
\]
where the a posteriori probability functions η : R^d → [0, 1] and η' : R^{d+1} → [0, 1] are defined as η(x) = P{Y = 1 | X = x} and η'(x') = P{Y = 1 | X' = x'}. Hint: Consult the proof of Theorem 32.5.

Problem 32.2. Let P{Y = 1} = 1/2, and given Y = i, i = 0, 1, let X = (X^{(1)}, ..., X^{(d)}) have d independent components, where X^{(j)} is normal(m_{ji}, σ_j²). Prove that L* = P{N > r/2}, where N is a standard normal random variable, and r² is the square of the Mahalanobis distance:
\[
r^2 = \sum_{j=1}^{d} \left( \frac{m_{j1} - m_{j0}}{\sigma_j} \right)^2
\]
(Duda and Hart (1973, pp. 66–67)).

Problem 32.3. Sampling of a stochastic process. Let X(t), t ∈ [0, 1], be a stochastic process (i.e., a collection of real-valued random variables indexed by t), and let Y be a binary random variable. For an integer N > 0, define X_N^{(i)} = X(i/N), i = 1, ..., N. Find sufficient conditions on the function m(t) = E{X(t)} and on the covariance function K(t, s) = E{(X(t) − E{X(t)})(X(s) − E{X(s)})} such that
\[
\lim_{N\to\infty} L^*\bigl(X_N^{(\cdot)}\bigr) = L^*(X(\cdot)),
\]
where L*(X_N^{(·)}) is the Bayes error corresponding to (X_N^{(1)}, ..., X_N^{(N)}), and L*(X(·)) is the infimum of the error probabilities of decision functions that map measurable functions into {0, 1}. Hint: Introduce the stochastic process X_N(t) as the linear interpolation of X_N^{(1)}, ..., X_N^{(N)}. Find conditions under which
\[
\lim_{N\to\infty} \mathbf{E}\left\{ \int_0^1 (X_N(t) - X(t))^2\,dt \right\} = 0,
\]
and use Theorem 32.3 and the remark following it.


Problem 32.4. Expansion of a stochastic process. Let X(t), m(t), and K(t, s) be as in the previous problem. Let ψ_1, ψ_2, ... be a complete orthonormal system of functions on [0, 1]. Define
\[
X^{(i)} = \int_0^1 X(t)\,\psi_i(t)\,dt.
\]
Find conditions under which
\[
\lim_{N\to\infty} L^*\bigl(X^{(1)}, \ldots, X^{(N)}\bigr) = L^*(X(\cdot)).
\]

Problem 32.5. Extend Theorem 32.3 such that the transformations T_n(X, D_n) are allowed to depend on the training data. This extension has significance, because feature extraction algorithms use the training data.

Problem 32.6. For discrete X prove that T(X) is sufficient iff Y, T(X), X form a Markov chain (in this order).

Problem 32.7. Prove Theorem 32.4.

Problem 32.8. Recall the definition of F-errors from Chapter 3. Let F be a strictly concave function. Show that d_F(X, Y) = d_F(T(X), Y) if and only if the transformation T is admissible. Conclude that L_{NN}(X, Y) = L_{NN}(T(X), Y) if and only if T is admissible. Construct a T and a distribution of (X, Y) such that L*(X, Y) = L*(T(X), Y), but T is not admissible.

Problem 32.9. Find sufficient statistics of minimal dimension for the following discrimination problems:

(1) It is known that for two given sets A and B with A ∩ B = ∅, if Y = 1, we have X ∈ A and if Y = 0, then X ∈ B, or vice versa.

(2) Given Y, X is a vector of independent gamma random variables with common unknown shape parameter a and common unknown scale parameter b (i.e., the marginal density of each component of X is of the form x^{a−1}e^{−x/b}/(Γ(a)b^a), x > 0). The parameters a and b depend upon Y.

Problem 32.10. Let X have support on the surface of a ball of R^d centered at the origin of unknown radius. Find a sufficient statistic for discrimination of dimension smaller than d.

Problem 32.11. Assume that the distribution of X = (X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}) is such that X^{(2)}X^{(4)} = 1 and X^{(1)} + X^{(2)} + X^{(3)} = 0 with probability one. Find a simple sufficient statistic.


Appendix

In this appendix we summarize some basic definitions and results from the theory of probability. Most proofs are omitted as they may be found in standard textbooks on probability, such as Ash (1972), Shiryayev (1984), Chow and Teicher (1978), Durrett (1991), Grimmett and Stirzaker (1992), and Zygmund (1959). We also give a list of useful inequalities that are used in the text.

A.1 Basics of Measure Theory

Definition A.1. Let S be a set, and let F be a family of subsets of S. F is called a σ-algebra if

(i) ∅ ∈ F,
(ii) A ∈ F implies A^c ∈ F,
(iii) A_1, A_2, ... ∈ F implies ∪_{i=1}^∞ A_i ∈ F.

A σ-algebra is closed under complementation and unions of countably many sets. Conditions (i) and (ii) imply that S ∈ F. Moreover, (ii) and (iii) imply that a σ-algebra is closed under countably infinite intersections.

Definition A.2. Let S be a set, and let F be a σ-algebra of subsets of S. Then (S, F) is called a measurable space. The elements of F are called measurable sets.

Definition A.3. If S = R^d and B is the smallest σ-algebra containing all rectangles, then B is called the Borel σ-algebra. The elements of B are called Borel sets.

Definition A.4. Let (S, F) be a measurable space and let f : S → R be a function. f is called measurable if for all B ∈ B,
\[
f^{-1}(B) = \{ s : f(s) \in B \} \in \mathcal{F},
\]
that is, the inverse image of any Borel set B is in F.

Obviously, if A is a measurable set, then the indicator variable I_A is a measurable function. Moreover, finite linear combinations of indicators of measurable sets (called simple functions) are also measurable functions. It can be shown that the set of measurable functions is closed under addition, subtraction, multiplication, and division. Moreover, the supremum and infimum of a sequence of measurable functions, as well as its pointwise limit supremum and limit infimum, are measurable.

Definition A.5. Let (S, F) be a measurable space and let ν : F → [0, ∞) be a function. ν is a measure on F if

(i) ν(∅) = 0,
(ii) ν is σ-additive, that is, A_1, A_2, ... ∈ F and A_i ∩ A_j = ∅, i ≠ j, imply that ν(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ ν(A_i).

In other words, a measure is a nonnegative, σ-additive set function.

Definition A.6. ν is a finite measure if ν(S) < ∞. ν is a σ-finite measure if there are countably many measurable sets A_1, A_2, ... such that ∪_{i=1}^∞ A_i = S and ν(A_i) < ∞, i = 1, 2, ....

Definition A.7. The triple (S, F, ν) is a measure space if (S, F) is a measurable space and ν is a measure on F.

Definition A.8. The Lebesgue measure λ on R^d is a measure on the Borel σ-algebra of R^d such that the λ-measure of each rectangle equals its volume.

A.2 The Lebesgue Integral

Definition A.9. Let (S, F, ν) be a measure space and let f = Σ_{i=1}^n x_i I_{A_i} be a simple function such that the measurable sets A_1, ..., A_n are disjoint. Then the (Lebesgue) integral of f with respect to ν is defined by
\[
\int f\,d\nu = \int_S f(s)\,\nu(ds) = \sum_{i=1}^{n} x_i\,\nu(A_i).
\]
If f is a nonnegative-valued measurable function, then introduce a sequence of simple functions as follows:
\[
f_n(s) = \begin{cases} (k-1)/2^n & \text{if } (k-1)/2^n \le f(s) < k/2^n, \ k = 1, 2, \ldots, n2^n\\ n & \text{if } f(s) \ge n.\end{cases}
\]
Then the f_n's are simple functions, and f_n(s) → f(s) in a monotone nondecreasing fashion. Therefore, the sequence of integrals ∫f_n dν is monotone nondecreasing, with a limit. The integral of f is then defined by
\[
\int f\,d\nu = \int_S f(s)\,\nu(ds) = \lim_{n\to\infty} \int f_n\,d\nu.
\]
If f is an arbitrary measurable function, then decompose it as the difference of its positive and negative parts,
\[
f(s) = f(s)^+ - f(s)^- = f(s)^+ - (-f(s))^+,
\]
where f^+ and f^- are both nonnegative functions. Define the integral of f by
\[
\int f\,d\nu = \int f^+\,d\nu - \int f^-\,d\nu,
\]
if at least one term on the right-hand side is finite. Then we say that the integral exists. If the integral is finite, then f is integrable.

Definition A.10. If ∫f dν exists and A is a measurable set, then ∫_A f dν is defined by
\[
\int_A f\,d\nu = \int f I_A\,d\nu.
\]

Definition A.11. We say that f_n → f (mod ν) if
\[
\nu\Bigl( \Bigl\{ s : \lim_{n\to\infty} f_n(s) \ne f(s) \Bigr\} \Bigr) = 0.
\]

Theorem A.1. (Beppo Levi theorem). If f_n(s) → f(s) in a monotone increasing way for some nonnegative integrable function f, then
\[
\int \lim_{n\to\infty} f_n\,d\nu = \lim_{n\to\infty} \int f_n\,d\nu.
\]

Theorem A.2. (Lebesgue's dominated convergence theorem). Assume that f_n → f (mod ν) and |f_n(s)| ≤ g(s) for s ∈ S, n = 1, 2, ..., where ∫g dν < ∞. Then
\[
\int \lim_{n\to\infty} f_n\,d\nu = \lim_{n\to\infty} \int f_n\,d\nu.
\]

Theorem A.3. (Fatou's lemma). Let f_1, f_2, ... be measurable functions.

(i) If there exists a measurable function g with ∫g dν > −∞ such that for every n, f_n(s) ≥ g(s), then
\[
\liminf_{n\to\infty} \int f_n\,d\nu \ge \int \liminf_{n\to\infty} f_n\,d\nu.
\]

(ii) If there is a measurable function g with ∫g dν < ∞ such that f_n(s) ≤ g(s) for every n, then
\[
\limsup_{n\to\infty} \int f_n\,d\nu \le \int \limsup_{n\to\infty} f_n\,d\nu.
\]

Definition A.12. Let ν_1 and ν_2 be measures on a measurable space (S, F). We say that ν_1 is absolutely continuous with respect to ν_2 if and only if ν_2(A) = 0 implies ν_1(A) = 0 (A ∈ F). We denote this relation by ν_1 ≪ ν_2.

Theorem A.4. (Radon-Nikodym theorem). Let ν_1 and ν_2 be measures on the measurable space (S, F) such that ν_1 ≪ ν_2 and ν_2 is σ-finite. Then there exists a measurable function f such that for all A ∈ F,
\[
\nu_1(A) = \int_A f\,d\nu_2.
\]
f is unique (mod ν_2). If ν_1 is finite, then f has a finite integral.

Definition A.13. f is called the density, or Radon-Nikodym derivative, of ν_1 with respect to ν_2. We use the notation f = dν_1/dν_2.

Definition A.14. Let ν_1 and ν_2 be measures on a measurable space (S, F). If there exists a set A ∈ F such that ν_1(A^c) = 0 and ν_2(A) = 0, then ν_1 is singular with respect to ν_2 (and vice versa).

Theorem A.5. (Lebesgue decomposition theorem). Let ν and µ be σ-finite measures on a measurable space (S, F). Then there exist two unique measures ν_1, ν_2 such that ν = ν_1 + ν_2, where ν_1 ≪ µ and ν_2 is singular with respect to µ.

Definition A.15. Let (S, F, ν) be a measure space and let f be a measurable function. Then f induces a measure µ on the Borel σ-algebra as follows:
\[
\mu(B) = \nu\bigl(f^{-1}(B)\bigr), \quad B \in \mathcal{B}.
\]

Theorem A.6. Let ν be a measure on the Borel σ-algebra B of R, and let f and g be measurable functions. Then for all B ∈ B,
\[
\int_B g(x)\,\mu(dx) = \int_{f^{-1}(B)} g(f(s))\,\nu(ds),
\]
where µ is induced by f.

Definition A.16. Let ν_1 and ν_2 be measures on the measurable spaces (S_1, F_1) and (S_2, F_2), respectively. Let (S, F) be a measurable space such that S = S_1 × S_2, and F_1 × F_2 ∈ F whenever F_1 ∈ F_1 and F_2 ∈ F_2. ν is called the product measure of ν_1 and ν_2 on F if for F_1 ∈ F_1 and F_2 ∈ F_2, ν(F_1 × F_2) = ν_1(F_1)ν_2(F_2). The product of more than two measures can be defined similarly.

Theorem A.7. (Fubini's theorem). Let h be a measurable function on the product space (S, F). Then
\[
\int_S h(u,v)\,\nu(d(u,v)) = \int_{S_1}\left( \int_{S_2} h(u,v)\,\nu_2(dv) \right)\nu_1(du) = \int_{S_2}\left( \int_{S_1} h(u,v)\,\nu_1(du) \right)\nu_2(dv),
\]
assuming that one of the three integrals is finite.

A.3 Denseness Results

Lemma A.1. (Cover and Hart (1967)). Let µ be a probability measure on R^d, and define its support set by
\[
A = \operatorname{support}(\mu) = \{ x : \text{for all } r > 0,\ \mu(S_{x,r}) > 0 \}.
\]
Then µ(A) = 1.

Proof. By the definition of A,
\[
A^c = \{ x : \mu(S_{x,r_x}) = 0 \text{ for some } r_x > 0 \}.
\]
Let Q denote the set of vectors in R^d with rational components (or any countable dense set). Then for each x ∈ A^c, there is a y_x ∈ Q with ‖x − y_x‖ ≤ r_x/3. This implies S_{y_x, r_x/2} ⊂ S_{x, r_x}. Therefore, µ(S_{y_x, r_x/2}) = 0, and
\[
A^c \subset \bigcup_{x \in A^c} S_{y_x, r_x/2}.
\]
The right-hand side is a union of countably many sets of zero measure, and therefore µ(A^c) = 0. □

Definition A.17. Let (S, F, ν) be a measure space. For a fixed number p ≥ 1, L_p(ν) denotes the set of all measurable functions f satisfying ∫|f|^p dν < ∞.


Theorem A.8. For every probability measure ν on R^d, the set of continuous functions with bounded support is dense in L_p(ν). In other words, for every ε > 0 and f ∈ L_p(ν) there is a continuous function g ∈ L_p(ν) with bounded support such that
\[
\int |f - g|^p\,d\nu < \varepsilon.
\]

The following theorem is a rich source of denseness results:

Theorem A.9. (Stone-Weierstrass theorem). Let F be a family of real-valued continuous functions on a closed bounded subset B of R^d. Assume that F is an algebra, that is, for any f_1, f_2 ∈ F and a, b ∈ R, we have af_1 + bf_2 ∈ F and f_1 f_2 ∈ F. Assume furthermore that if x ≠ y, then there is an f ∈ F such that f(x) ≠ f(y), and that for each x ∈ B there exists an f ∈ F with f(x) ≠ 0. Then for every ε > 0 and every continuous function g : B → R, there exists an f ∈ F such that
\[
\sup_{x \in B} |g(x) - f(x)| < \varepsilon.
\]

The following two theorems concern differentiation of integrals. Good general references are Wheeden and Zygmund (1977) and de Guzman (1975):

Theorem A.10. (The Lebesgue density theorem). Let f be a density on R^d. Let Q_k(x) be a sequence of closed cubes centered at x and contracting to x. Then
\[
\lim_{k\to\infty} \frac{ \int_{Q_k(x)} |f(x) - f(y)|\,dy }{ \lambda(Q_k(x)) } = 0
\]
at almost all x, where λ denotes the Lebesgue measure. Note that this implies
\[
\lim_{k\to\infty} \frac{ \int_{Q_k(x)} f(y)\,dy }{ \lambda(Q_k(x)) } = f(x)
\]
at almost all x.

Corollary A.1. Let A be a collection of subsets of S_{0,1} with the property that for all A ∈ A, λ(A) ≥ cλ(S_{0,1}) for some fixed c > 0. Then for almost all x, with x + rA = {y : (y − x)/r ∈ A},
\[
\lim_{r\to 0} \sup_{A \in \mathcal{A}} \left| \frac{ \int_{x+rA} f(y)\,dy }{ \lambda(x+rA) } - f(x) \right| = 0.
\]

The Lebesgue density theorem also holds if Q_k(x) is replaced by a sequence of contracting balls centered at x, or indeed by any sequence of sets that satisfy
\[
x + b r_k S \subseteq Q_k(x) \subseteq x + a r_k S,
\]
where S is the unit ball of R^d, r_k ↓ 0, and 0 < b ≤ a < ∞ are fixed constants. This follows from the Lebesgue density theorem. It does not hold in general when Q_k(x) is a sequence of hyperrectangles containing x and contracting to x. For that, an additional restriction is needed:

Theorem A.11. (The Jessen-Marcinkiewicz-Zygmund theorem). Let f be a density on R^d with ∫ f log^{d−1}(1 + f)\,dx < ∞. Let Q_k(x) be a sequence of hyperrectangles containing x and for which diam(Q_k(x)) → 0. Then
\[
\lim_{k\to\infty} \frac{ \int_{Q_k(x)} |f(x) - f(y)|\,dy }{ \lambda(Q_k(x)) } = 0
\]
at almost all x.

Corollary A.2. If f is a density and Q_k(x) is as in Theorem A.11, then
\[
\liminf_{k\to\infty} \frac{1}{\lambda(Q_k(x))} \int_{Q_k(x)} f(y)\,dy \ge f(x)
\]
at almost all x.

To see this, take g = min(f, M) for a large fixed M. As ∫ g log^{d−1}(1 + g) < ∞, by the Jessen-Marcinkiewicz-Zygmund theorem,
\[
\liminf_{k\to\infty} \frac{1}{\lambda(Q_k(x))} \int_{Q_k(x)} g(y)\,dy \ge g(x)
\]
at almost all x. Conclude by letting M → ∞ along the integers.

A.4 Probability

Definition A.18. A measure space (Ω, F, P) is called a probability space if P{Ω} = 1. Ω is the sample space or sure event, the measurable sets are called events, and the measurable functions are called random variables. If X_1, ..., X_n are random variables, then X = (X_1, ..., X_n) is a vector-valued random variable.

Definition A.19. Let X be a random variable; then X induces the measure µ on the Borel σ-algebra of R by
\[
\mu(B) = \mathbf{P}\{ \omega : X(\omega) \in B \} = \mathbf{P}\{ X \in B \}, \quad B \in \mathcal{B}.
\]
The probability measure µ is called the distribution of the random variable X.

Definition A.20. Let X be a random variable. The expectation of X is the integral of x with respect to the distribution µ of X:
\[
\mathbf{E}\{X\} = \int_{\mathbb{R}} x\,\mu(dx)
\]
if it exists.

Definition A.21. Let X be a random variable. The variance of X is
\[
\mathbf{Var}\{X\} = \mathbf{E}\bigl\{ (X - \mathbf{E}\{X\})^2 \bigr\}
\]
if E{X} is finite, and ∞ if E{X} is not finite or does not exist.

Definition A.22. Let X_1, ..., X_n be random variables. They induce the measure µ^{(n)} on the Borel σ-algebra of R^n with the property
\[
\mu^{(n)}(B_1 \times \cdots \times B_n) = \mathbf{P}\{ X_1 \in B_1, \ldots, X_n \in B_n \}, \quad B_1, \ldots, B_n \in \mathcal{B}.
\]
µ^{(n)} is called the joint distribution of the random variables X_1, ..., X_n. Let µ_i be the distribution of X_i (i = 1, ..., n). The random variables X_1, ..., X_n are independent if their joint distribution µ^{(n)} is the product measure of µ_1, ..., µ_n. The events A_1, ..., A_n ∈ F are independent if the random variables I_{A_1}, ..., I_{A_n} are independent.

Fubini's theorem implies the following:

Theorem A.12. If the random variables X_1, ..., X_n are independent and have finite expectations, then
\[
\mathbf{E}\{X_1 X_2 \cdots X_n\} = \mathbf{E}\{X_1\}\,\mathbf{E}\{X_2\} \cdots \mathbf{E}\{X_n\}.
\]

A.5 Inequalities

Theorem A.13. (Cauchy-Schwarz inequality). If the random variables X and Y have finite second moments (E{X²} < ∞ and E{Y²} < ∞), then
\[
|\mathbf{E}\{XY\}| \le \sqrt{ \mathbf{E}\{X^2\}\,\mathbf{E}\{Y^2\} }.
\]

Theorem A.14. (Hölder's inequality). Let p, q ∈ (1, ∞) be such that (1/p) + (1/q) = 1. Let X and Y be random variables such that (E{|X|^p})^{1/p} < ∞ and (E{|Y|^q})^{1/q} < ∞. Then
\[
\mathbf{E}\{|XY|\} \le \bigl(\mathbf{E}\{|X|^p\}\bigr)^{1/p} \bigl(\mathbf{E}\{|Y|^q\}\bigr)^{1/q}.
\]

Theorem A.15. (Markov's inequality). Let X be a nonnegative-valued random variable. Then for each t > 0,
\[
\mathbf{P}\{X \ge t\} \le \frac{\mathbf{E}\{X\}}{t}.
\]

Theorem A.16. (Chebyshev's inequality). Let X be a random variable. Then for each t > 0,
\[
\mathbf{P}\{|X - \mathbf{E}\{X\}| \ge t\} \le \frac{\mathbf{Var}\{X\}}{t^2}.
\]


Theorem A.17. (Chebyshev-Cantelli inequality). Let t ≥ 0. Then
\[
\mathbf{P}\{X - \mathbf{E}\{X\} > t\} \le \frac{\mathbf{Var}\{X\}}{\mathbf{Var}\{X\} + t^2}.
\]

Proof. We may assume without loss of generality that E{X} = 0. Then for all t,
\[
t = \mathbf{E}\{t - X\} \le \mathbf{E}\bigl\{(t - X) I_{\{X \le t\}}\bigr\}.
\]
Thus, for t ≥ 0, the Cauchy-Schwarz inequality gives
\[
t^2 \le \mathbf{E}\{(t-X)^2\}\,\mathbf{E}\bigl\{I^2_{\{X \le t\}}\bigr\} = \mathbf{E}\{(t-X)^2\}\,\mathbf{P}\{X \le t\} = (\mathbf{Var}\{X\} + t^2)\,\mathbf{P}\{X \le t\},
\]
that is,
\[
\mathbf{P}\{X \le t\} \ge \frac{t^2}{\mathbf{Var}\{X\} + t^2},
\]
and the claim follows. □

Theorem A.18. (Jensen's inequality). If f is a real-valued convex function on a finite or infinite interval of R, and X is a random variable with finite expectation, taking its values in this interval, then
\[
f(\mathbf{E}\{X\}) \le \mathbf{E}\{f(X)\}.
\]

Theorem A.19. (Association inequalities). Let X be a real-valued random variable and let f(x) and g(x) be monotone nondecreasing real-valued functions. Then
\[
\mathbf{E}\{f(X)g(X)\} \ge \mathbf{E}\{f(X)\}\,\mathbf{E}\{g(X)\},
\]
provided that all expectations exist and are finite. If f is monotone increasing and g is monotone decreasing, then
\[
\mathbf{E}\{f(X)g(X)\} \le \mathbf{E}\{f(X)\}\,\mathbf{E}\{g(X)\}.
\]

Proof. We prove the first inequality. The second follows by symmetry. LetX have distribution µ. Then we write

Ef(X)g(X) −Ef(X)Eg(X)

=∫f(x)g(x)µ(dx)−

∫f(y)µ(dy)

∫g(x)µ(dx)

=∫ (∫

[f(x)− f(y)]g(x)µ(dx))µ(dy)

=∫ (∫

h(x, y)g(x)µ(dx))µ(dy),

Page 605: A Probabilistic Theory of Pattern Recognitionlibvolume8.xyz/telecommunications/btech/semester7/...Preface 7 Prototypes327“contentslinesectionProblemsandExercises330“contentsline

A.6 Convergence of Random Variables 604

where $h(x,y) = f(x) - f(y)$. By Fubini's theorem the last integral equals

\[
\int_{\mathbb{R}^2} h(x,y)\, g(x)\, \mu^2(dx\,dy)
= \int_{x > y} h(x,y)\, g(x)\, \mu^2(dx\,dy) + \int_{x < y} h(x,y)\, g(x)\, \mu^2(dx\,dy),
\]

since $h(x,x) = 0$ for all x. Here $\mu^2(dx\,dy) = \mu(dx) \cdot \mu(dy)$. The second integral on the right-hand side is just

\[
\int_x \left( \int_{y > x} h(x,y)\, g(x)\, \mu(dy) \right) \mu(dx)
= \int_y \left( \int_{x > y} h(y,x)\, g(y)\, \mu(dx) \right) \mu(dy).
\]

Thus, we have

\[
\begin{aligned}
E\{f(X)g(X)\} - E f(X)\, E g(X)
&= \int_{\mathbb{R}^2} h(x,y)\, g(x)\, \mu^2(dx\,dy) \\
&= \int_y \left( \int_{x > y} [h(x,y) g(x) + h(y,x) g(y)]\, \mu(dx) \right) \mu(dy) \\
&= \int_y \left( \int_{x > y} h(x,y) [g(x) - g(y)]\, \mu(dx) \right) \mu(dy) \\
&\ge 0,
\end{aligned}
\]

since $h(y,x) = -h(x,y)$, and by the fact that $h(x,y) \ge 0$ and $g(x) - g(y) \ge 0$ if $x > y$. $\Box$
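A quick Monte Carlo illustration of the first association inequality (a sketch added here for convenience, not part of the original text; it assumes Python with NumPy): take X standard normal and the nondecreasing functions $f(x) = x$ and $g(x) = x^3$, and compare the two sides.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
f, g = x, x**3                       # both monotone nondecreasing
lhs = np.mean(f * g)                 # estimate of E f(X) g(X)
rhs = np.mean(f) * np.mean(g)        # estimate of E f(X) E g(X)
print(lhs, rhs, lhs >= rhs)          # the inequality predicts lhs >= rhs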

A.6 Convergence of Random Variables

Definition A.23. Let $X_n$, $n = 1, 2, \ldots$, be a sequence of random variables. We say that

\[
\lim_{n \to \infty} X_n = X \quad \text{in probability}
\]

if for each $\epsilon > 0$

\[
\lim_{n \to \infty} P\{|X_n - X| \ge \epsilon\} = 0.
\]

We say that

\[
\lim_{n \to \infty} X_n = X \quad \text{with probability one (or almost surely)},
\]

if $X_n \to X \pmod P$, that is,

\[
P\left\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\right\} = 1.
\]


For a fixed number $p \ge 1$ we say that

\[
\lim_{n \to \infty} X_n = X \quad \text{in } L_p,
\]

if

\[
\lim_{n \to \infty} E|X_n - X|^p = 0.
\]

Theorem A.20. Convergence in Lp implies convergence in probability.

Theorem A.21. $\lim_{n \to \infty} X_n = X$ with probability one if and only if

\[
\lim_{n \to \infty} \sup_{m \ge n} |X_m - X| = 0 \quad \text{in probability}.
\]

Thus, convergence with probability one implies convergence in probability.

Theorem A.22. (Borel-Cantelli lemma). Let $A_n$, $n = 1, 2, \ldots$, be a sequence of events. Introduce the notation

\[
[A_n \text{ i.o.}] = \limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m.
\]

("i.o." stands for "infinitely often.") If

\[
\sum_{n=1}^{\infty} P\{A_n\} < \infty,
\]

then $P\{[A_n \text{ i.o.}]\} = 0$.

By Theorems A.21 and A.22, we have

Theorem A.23. If for each $\epsilon > 0$

\[
\sum_{n=1}^{\infty} P\{|X_n - X| \ge \epsilon\} < \infty,
\]

then $\lim_{n \to \infty} X_n = X$ with probability one.

A.7 Conditional Expectation

If Y is a random variable with finite expectation and A is an event with positive probability, then the conditional expectation of Y given A is defined by

\[
E\{Y|A\} = \frac{E\{Y I_A\}}{P\{A\}}.
\]

The conditional probability of an event B given A is

\[
P\{B|A\} = E\{I_B|A\} = \frac{P\{A \cap B\}}{P\{A\}}.
\]


Definition A.24. Let Y be a random variable with finite expectation and X be a d-dimensional vector-valued random variable. Let $\mathcal{F}_X$ be the $\sigma$-algebra generated by X:

\[
\mathcal{F}_X = \{X^{-1}(B); B \in \mathcal{B}^d\}.
\]

The conditional expectation $E\{Y|X\}$ of Y given X is a random variable with the property that for all $A \in \mathcal{F}_X$,

\[
\int_A Y \, dP = \int_A E\{Y|X\} \, dP.
\]

The existence and uniqueness (with probability one) of $E\{Y|X\}$ is a consequence of the Radon-Nikodym theorem if we apply it to the measures

\[
\nu^+(A) = \int_A Y^+ \, dP, \qquad \nu^-(A) = \int_A Y^- \, dP, \qquad A \in \mathcal{F}_X,
\]

such that

\[
E\{Y^+|X\} = \frac{d\nu^+}{dP}, \qquad E\{Y^-|X\} = \frac{d\nu^-}{dP},
\]

and

\[
E\{Y|X\} = E\{Y^+|X\} - E\{Y^-|X\}.
\]

Definition A.25. Let C be an event and X be a d-dimensional vector-valued random variable. Then the conditional probability of C given X is $P\{C|X\} = E\{I_C|X\}$.

Theorem A.24. Let Y be a random variable with finite expectation. Let C be an event, and let X and Z be vector-valued random variables. Then

(i) There is a measurable function g on $\mathbb{R}^d$ such that $E\{Y|X\} = g(X)$ with probability one.

(ii) $E Y = E\{E\{Y|X\}\}$, $P\{C\} = E\{P\{C|X\}\}$.

(iii) $E\{Y|X\} = E\{E\{Y|X,Z\}|X\}$, $P\{C|X\} = E\{P\{C|X,Y\}|X\}$.

(iv) If Y is a function of X, then $E\{Y|X\} = Y$.

(v) If $(Y,X)$ and Z are independent, then $E\{Y|X,Z\} = E\{Y|X\}$.

(vi) If $Y = f(X,Z)$ for a measurable function f, and X and Z are independent, then $E\{Y|X\} = g(X)$, where $g(x) = E f(x,Z)$.
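Property (vi) can be checked numerically. The sketch below is illustrative only and not from the book; it assumes Python with NumPy, and the particular choice $f(x,z) = (x+z)^2$ is arbitrary. It estimates $E\{Y|X = x_0\}$ by averaging Y over a thin slice of X-values around $x_0$ and compares the result with $g(x_0) = E f(x_0, Z) = x_0^2 + 1$ for Z standard normal.

import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
x = rng.uniform(-1.0, 1.0, size=n)
z = rng.standard_normal(n)
y = (x + z) ** 2                      # Y = f(X, Z), with X and Z independent
x0 = 0.5
mask = np.abs(x - x0) < 0.01          # crude conditioning on X close to x0
print(y[mask].mean(), x0**2 + 1.0)    # should be close to g(x0) = x0^2 + 1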

A.8 The Binomial Distribution

An integer-valued random variable X is said to be binomially distributed with parameters n and p if

\[
P\{X = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.
\]


If $A_1, \ldots, A_n$ are independent events with $P\{A_i\} = p$, then $X = \sum_{i=1}^n I_{A_i}$ is binomial $(n,p)$. $I_{A_i}$ is called a Bernoulli random variable with parameter p.

Lemma A.2. Let the random variable $B(n,p)$ be binomially distributed with parameters n and p. Then

(i)
\[
E\left\{\frac{1}{1 + B(n,p)}\right\} \le \frac{1}{(n+1)p},
\]

and (ii)
\[
E\left\{\frac{1}{B(n,p)} I_{\{B(n,p) > 0\}}\right\} \le \frac{2}{(n+1)p}.
\]

Proof. (i) follows from the following simple calculation:

\[
\begin{aligned}
E\left\{\frac{1}{1 + B(n,p)}\right\}
&= \sum_{k=0}^{n} \frac{1}{k+1} \binom{n}{k} p^k (1-p)^{n-k} \\
&= \frac{1}{(n+1)p} \sum_{k=0}^{n} \binom{n+1}{k+1} p^{k+1} (1-p)^{n-k} \\
&\le \frac{1}{(n+1)p} \sum_{k=0}^{n+1} \binom{n+1}{k} p^k (1-p)^{n-k+1} = \frac{1}{(n+1)p}.
\end{aligned}
\]

For (ii) we have

\[
E\left\{\frac{1}{B(n,p)} I_{\{B(n,p) > 0\}}\right\} \le E\left\{\frac{2}{1 + B(n,p)}\right\} \le \frac{2}{(n+1)p}
\]

by (i). $\Box$
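Since the expectation in (i) is a finite sum, the bound is easy to verify exactly. A short sketch (not part of the original text; plain Python, using math.comb):

from math import comb

def expect_inv_one_plus_B(n, p):
    # exact E{1/(1 + B(n,p))} as a finite sum over the binomial distribution
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) / (k + 1) for k in range(n + 1))

for n, p in [(10, 0.3), (50, 0.1), (200, 0.5)]:
    print(n, p, expect_inv_one_plus_B(n, p), 1.0 / ((n + 1) * p))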

Lemma A.3. Let B be a binomial random variable with parameters n and p. Then for every $0 \le p \le 1$,

\[
P\left\{B = \left\lfloor \frac{n}{2} \right\rfloor\right\} \le \sqrt{\frac{2}{n\pi}},
\]

and for $p = 1/2$,

\[
P\left\{B = \left\lfloor \frac{n}{2} \right\rfloor\right\} \ge \sqrt{\frac{1}{2n\pi}}\, \frac{1}{e^{1/12}}.
\]


Proof. The lemma follows from Stirling's formula:

\[
\sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{1/(12n+1)} \le n! \le \sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{1/(12n)}
\]

(see, e.g., Feller (1968)). $\Box$

Lemma A.4. (Devroye and Gyorfi (1985), p. 194). For any random variable X with finite fourth moment,

\[
E|X| \ge \frac{\left(E X^2\right)^{3/2}}{\left(E X^4\right)^{1/2}}.
\]

Proof. Fix $a > 0$. The function $1/x + ax^2$ is minimal on $(0,\infty)$ when $x^3 = 1/(2a)$. Thus,

\[
\frac{x + a x^4}{x^2} \ge (2a)^{1/3} + \frac{a}{(2a)^{2/3}} = \frac{3}{2} (2a)^{1/3}.
\]

Replace x by $|X|$ and take expectations:

\[
E|X| \ge \frac{3}{2} (2a)^{1/3} E X^2 - a E X^4.
\]

The lower bound, considered as a function of a, is maximized if we take $a = \frac{1}{2}\left(E X^2 / E X^4\right)^{3/2}$. Resubstitution yields the given inequality. $\Box$
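For instance, for X uniform on $[-1,1]$ one has $E|X| = 1/2$, $E X^2 = 1/3$, and $E X^4 = 1/5$, so the lemma guarantees $E|X| \ge (1/3)^{3/2}/(1/5)^{1/2} \approx 0.43$. A two-line check (illustrative only, not from the book; plain Python):

# Exact moments of X uniform on [-1, 1].
m1, m2, m4 = 0.5, 1.0 / 3.0, 1.0 / 5.0
lower = m2**1.5 / m4**0.5
print(lower, m1, lower <= m1)    # lower bound ~0.43 versus E|X| = 0.5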

Lemma A.5. Let B be a binomial $(n, 1/2)$ random variable. Then

\[
E\left|B - \frac{n}{2}\right| \ge \sqrt{\frac{n}{8}}.
\]

Proof. This bound is a special case of Khintchine's inequality (see Szarek (1976), Haagerup (1978), and also Devroye and Gyorfi (1985), p. 139). Rather than proving the given inequality, we will show how to apply the previous lemma to get (without further work) the inequality

\[
E\left|B - \frac{n}{2}\right| \ge \sqrt{\frac{n}{12}}.
\]

Indeed, $E\left\{(B - n/2)^2\right\} = n/4$ and $E\left\{(B - n/2)^4\right\} = 3n^2/16 - n/8 \le 3n^2/16$. Thus,

\[
E\left|B - \frac{n}{2}\right| \ge \frac{(n/4)^{3/2}}{(3n^2/16)^{1/2}} = \sqrt{\frac{n}{12}}. \quad \Box
\]
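Both constants can be compared with the exact value of $E|B - n/2|$, which is again a finite sum. A short sketch (not part of the original text; plain Python, using math.comb):

from math import comb, sqrt

def mean_abs_dev(n):
    # exact E|B - n/2| for B binomial(n, 1/2)
    return sum(comb(n, k) * abs(k - n / 2) for k in range(n + 1)) / 2**n

for n in [10, 100, 1000]:
    print(n, mean_abs_dev(n), sqrt(n / 8), sqrt(n / 12))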


Lemma A.6. (Slud (1977)). Let B be a binomial $(n,p)$ random variable with $p \le 1/2$. Then for $n(1-p) \ge k \ge np$,

\[
P\{B \ge k\} \ge P\left\{N \ge \frac{k - np}{\sqrt{np(1-p)}}\right\},
\]

where N is normal $(0,1)$.

A.9 The Hypergeometric Distribution

Let N, b, and n be positive integers with $N > n$ and $N > b$. A random variable X taking values on the integers $0, 1, \ldots, b$ is hypergeometric with parameters N, b, and n if

\[
P\{X = k\} = \frac{\binom{b}{k}\binom{N-b}{n-k}}{\binom{N}{n}}, \qquad k = 0, 1, \ldots, b.
\]

X models the number of blue balls in a sample of n balls drawn without replacement from an urn containing b blue and $N - b$ red balls.

Theorem A.25. (Hoeffding (1963)). Let the set A consist of N numbers $a_1, \ldots, a_N$. Let $Z_1, \ldots, Z_n$ denote a random sample taken without replacement from A, where $n \le N$. Denote

\[
m = \frac{1}{N} \sum_{i=1}^{N} a_i \qquad \text{and} \qquad c = \max_{i,j \le N} |a_i - a_j|.
\]

Then for any $\epsilon > 0$ we have

\[
P\left\{\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - m\right| \ge \epsilon\right\} \le 2 e^{-2n\epsilon^2/c^2}.
\]

Specifically, if X is hypergeometrically distributed with parameters N, b, and n, then

\[
P\left\{\left|X - \frac{nb}{N}\right| \ge n\epsilon\right\} \le 2 e^{-2n\epsilon^2}.
\]

For more inequalities of this type, see Hoeffding (1963) and Serfling (1974).
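A Monte Carlo sketch of the first inequality (added for illustration, not part of the original text; it assumes Python with NumPy) for sampling without replacement from the population $\{0, 1, \ldots, 99\}$:

import numpy as np

rng = np.random.default_rng(3)
a = np.arange(100, dtype=float)                 # population of N = 100 numbers
m, c = a.mean(), a.max() - a.min()              # population mean and range
n, eps, trials = 30, 15.0, 50_000
hits = 0
for _ in range(trials):
    z = rng.choice(a, size=n, replace=False)    # sample without replacement
    hits += abs(z.mean() - m) >= eps
print(hits / trials, 2 * np.exp(-2 * n * eps**2 / c**2))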

A.10 The Multinomial Distribution

A vector $(N_1, \ldots, N_k)$ of integer-valued random variables is multinomially distributed with parameters $(n, p_1, \ldots, p_k)$ if

\[
P\{N_1 = i_1, \ldots, N_k = i_k\} =
\begin{cases}
\dfrac{n!}{i_1! \cdots i_k!} \, p_1^{i_1} \cdots p_k^{i_k} & \text{if } \sum_{j=1}^{k} i_j = n,\ i_j \ge 0, \\
0 & \text{otherwise.}
\end{cases}
\]


Lemma A.7. The moment-generating function of a multinomial $(n, p_1, \ldots, p_k)$ vector is

\[
E\left\{e^{t_1 N_1 + \cdots + t_k N_k}\right\} = \left(p_1 e^{t_1} + \cdots + p_k e^{t_k}\right)^n.
\]

A.11 The Exponential and Gamma Distributions

A nonnegative random variable has the exponential distribution with parameter $\lambda > 0$ if it has a density

\[
f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.
\]

A nonnegative-valued random variable has the gamma distribution with parameters $a, b > 0$ if it has density

\[
f(x) = \frac{x^{a-1} e^{-x/b}}{\Gamma(a) b^a}, \qquad x \ge 0, \qquad \text{where } \Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x} \, dx.
\]

The sum of n i.i.d. exponential $(\lambda)$ random variables has the gamma distribution with parameters n and $1/\lambda$.
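A quick consistency check of the last statement (illustrative only, not part of the original text; it assumes Python with NumPy): under the gamma parametrization above, the sum of n exponential $(\lambda)$ variables should have mean $ab = n/\lambda$ and variance $ab^2 = n/\lambda^2$.

import numpy as np

rng = np.random.default_rng(4)
lam, n = 2.0, 5
s = rng.exponential(scale=1.0 / lam, size=(1_000_000, n)).sum(axis=1)
print(s.mean(), n / lam)          # ~ 2.5
print(s.var(), n / lam**2)        # ~ 1.25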

A.12 The Multivariate Normal Distribution

A d-dimensional random variable $X = \left(X^{(1)}, \ldots, X^{(d)}\right)$ has the multivariate normal distribution if it has a density

\[
f(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \, e^{-\frac{1}{2}(x-m)^T \Sigma^{-1} (x-m)},
\]

where $m \in \mathbb{R}^d$, $\Sigma$ is a positive definite symmetric $d \times d$ matrix with entries $\sigma_{ij}$, and $\det(\Sigma)$ denotes the determinant of $\Sigma$. Then $E X = m$, and for all $i, j = 1, \ldots, d$,

\[
E\left\{(X^{(i)} - E X^{(i)})(X^{(j)} - E X^{(j)})\right\} = \sigma_{ij}.
\]

$\Sigma$ is called the covariance matrix of X.
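A small sampling sketch (not part of the original text; it assumes Python with NumPy) that draws from a bivariate normal distribution and checks that the empirical mean and covariance match m and $\Sigma$:

import numpy as np

rng = np.random.default_rng(5)
m = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                 # positive definite covariance matrix
x = rng.multivariate_normal(m, sigma, size=500_000)
print(x.mean(axis=0))                          # approximately m
print(np.cov(x, rowvar=False))                 # approximately sigma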


Notation

• $I_A$ indicator of an event A.
• $I_B(x) = I_{\{x \in B\}}$ indicator function of a set B.
• $|A|$ cardinality of a finite set A.
• $A^c$ complement of a set A.
• $A \triangle B$ symmetric difference of sets A, B.
• $f \circ g$ composition of functions f, g.
• $\log$ natural logarithm (base e).
• $\lfloor x \rfloor$ integer part of the real number x.
• $\lceil x \rceil$ upper integer part of the real number x.
• $X \stackrel{L}{=} Z$ if X and Z have the same distribution.
• $x^{(1)}, \ldots, x^{(d)}$ components of the d-dimensional column vector x.
• $\|x\| = \sqrt{\sum_{i=1}^{d} \left(x^{(i)}\right)^2}$ $L_2$-norm of $x \in \mathbb{R}^d$.
• $X \in \mathbb{R}^d$ observation, vector-valued random variable.
• $Y \in \{0,1\}$ label, binary random variable.
• $D_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$ training data, sequence of i.i.d. pairs that are independent of $(X,Y)$, and have the same distribution as that of $(X,Y)$.
• $\eta(x) = P\{Y = 1|X = x\}$, $1 - \eta(x) = P\{Y = 0|X = x\}$ a posteriori probabilities.
• $p = P\{Y = 1\}$, $1 - p = P\{Y = 0\}$ class probabilities.
• $g^* : \mathbb{R}^d \to \{0,1\}$ Bayes decision function.
• $\phi : \mathbb{R}^d \to \{0,1\}$, $g_n : \mathbb{R}^d \times \left(\mathbb{R}^d \times \{0,1\}\right)^n \to \{0,1\}$ classification functions. The short notation $g_n(x) = g_n(x, D_n)$ is also used.
• $L^* = P\{g^*(X) \ne Y\}$ Bayes risk, the error probability of the Bayes decision.
• $L_n = L(g_n) = P\{g_n(X, D_n) \ne Y \,|\, D_n\}$ error probability of a classification function $g_n$.
• $L_n(\phi) = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(X_i) \ne Y_i\}}$ empirical error probability of a classifier $\phi$.
• $\mu(A) = P\{X \in A\}$ probability measure of X.
• $\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \in A\}}$ empirical measure corresponding to $X_1, \ldots, X_n$.
• $\lambda$ Lebesgue measure on $\mathbb{R}^d$.
• $f(x)$ density of X, Radon-Nikodym derivative of $\mu$ with respect to $\lambda$ (if it exists).
• $f_0(x), f_1(x)$ conditional densities of X given $Y = 0$ and $Y = 1$, respectively (if they exist).
• $\mathcal{P}$ partition of $\mathbb{R}^d$.
• $X_{(k)}(x), X_{(k)}$ k-th nearest neighbor of x among $X_1, \ldots, X_n$.
• $K : \mathbb{R}^d \to \mathbb{R}$ kernel function.
• $h, h_n > 0$ smoothing factor for a kernel rule.
• $K_h(x) = (1/h) K(x/h)$ scaled kernel function.
• $T_m = ((X_{n+1}, Y_{n+1}), \ldots, (X_{n+m}, Y_{n+m}))$ testing data, sequence of i.i.d. pairs that are independent of $(X,Y)$ and $D_n$, and have the same distribution as that of $(X,Y)$.
• $\mathcal{A}$ class of sets.
• $\mathcal{C}, \mathcal{C}_n$ classes of classification functions.
• $s(\mathcal{A}, n)$ n-th shatter coefficient of the class of sets $\mathcal{A}$.
• $V_{\mathcal{A}}$ Vapnik-Chervonenkis dimension of the class of sets $\mathcal{A}$.
• $S(\mathcal{C}, n)$ n-th shatter coefficient of the class of classifiers $\mathcal{C}$.
• $V_{\mathcal{C}}$ Vapnik-Chervonenkis dimension of the class of classifiers $\mathcal{C}$.
• $S_{x,r} = \{y \in \mathbb{R}^d : \|y - x\| \le r\}$ closed Euclidean ball in $\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$, with radius $r > 0$.


References

Abou-Jaoude, S. (1976a). Conditions necessaires et suffisantes de con-vergence L1 en probabilite de l’histogramme pour une densite. Annalesde l’Institut Henri Poincare, 12:213–231.

Abou-Jaoude, S. (1976b). La convergence L1 et L∞ de l’estimateurde la partition aleatoire pour une densite. Annales de l’Institut HenriPoincare, 12:299–317.

Abou-Jaoude, S. (1976c). Sur une condition necessaire et suffisante deL1-convergence presque complete de l’estimateur de la partition fixepour une densite. Comptes Rendus de l’Academie des Sciences de Paris,283:1107–1110.

Aitchison, J. and Aitken, C. (1976). Multivariate binary discriminationby the kernel method. Biometrika, 63:413–420.

Aizerman, M., Braverman, E., and Rozonoer, L. (1964a). The methodof potential functions for the problem of restoring the characteristic ofa function converter from randomly observed points. Automation andRemote Control, 25:1546–1556.

Aizerman, M., Braverman, E., and Rozonoer, L. (1964b). The probabil-ity problem of pattern recognition learning and the method of potentialfunctions. Automation and Remote Control, 25:1307–1323.

Aizerman, M., Braverman, E., and Rozonoer, L. (1964c). Theoreti-cal foundations of the potential function method in pattern recognitionlearning. Automation and Remote Control, 25:917–936.

Aizerman, M., Braverman, E., and Rozonoer, L. (1970). Extrapolativeproblems in automatic control and the method of potential functions.American Mathematical Society Translations, 87:281–303.

Akaike, H. (1954). An approximation to the density function. Annals ofthe Institute of Statistical Mathematics, 6:127–132.


Akaike, H. (1974). A new look at the statistical model identification.IEEE Transactions on Automatic Control, 19:716–723.

Alexander, K. (1984). Probability inequalities for empirical processesand a law of the iterated logarithm. Annals of Probability, 4:1041–1067.

Anderson, A. and Fu, K. (1979). Design and development of a linear bi-nary tree classifier for leukocytes. Technical Report TR-EE-79-31, Pur-due University, Lafayette, IN.

Anderson, J. (1982). Logistic discrimination. In Handbook of Statistics,Krishnaiah, P. and Kanal, L., editors, volume 2, pages 169–191. North-Holland, Amsterdam.

Anderson, M. and Benning, R. (1970). A distribution-free discrimina-tion procedure based on clustering. IEEE Transactions on InformationTheory, 16:541–548.

Anderson, T. (1958). An Introduction to Multivariate Statistical Analy-sis. John Wiley, New York.

Anderson, T. (1966). Some nonparametric multivariate procedures basedon statistically equivalent blocks. In Multivariate Analysis, Krishnaiah,P., editor, pages 5–27. Academic Press, New York.

Angluin, D. and Valiant, L. (1979). Fast probabilistic algorithms forHamiltonian circuits and matchings. Journal of Computing System Sci-ence, 18:155–193.

Anthony, M. and Holden, S. (1993). On the power of polynomial discrim-inators and radial basis function networks. In Proceedings of the SixthAnnual ACM Conference on Computational Learning Theory, pages 158–164. Association for Computing Machinery, New York.

Anthony, M. and Shawe-Taylor, J. (1990). A result of Vapnik with appli-cations. Technical Report CSD-TR-628, University of London, Surrey.

Argentiero, P., Chin, R., and Beaudet, P. (1982). An automated ap-proach to the design of decision tree classifiers. IEEE Transactions onPattern Analysis and Machine Intelligence, 4:51–57.

Arkadjew, A. and Braverman, E. (1966). Zeichenerkennung undMaschinelles Lernen. Oldenburg Verlag, Munchen, Wien.

Ash, R. (1972). Real Analysis and Probability. Academic Press, NewYork.

Assouad, P. (1983a). Densite et dimension. Annales de l’Institut Fourier,33:233–282.

Assouad, P. (1983b). Deux remarques sur l’estimation. Comptes Rendusde l’Academie des Sciences de Paris, 296:1021–1024.

Azuma, K. (1967). Weighted sums of certain dependent random vari-ables. Tohoku Mathematical Journal, 68:357–367.

Bahadur, R. (1961). A representation of the joint distribution of re-sponses to n dichotomous items. In Studies in Item Analysis and Pre-diction, Solomon, H., editor, pages 158–168. Stanford University Press,Stanford, CA.

Bailey, T. and Jain, A. (1978). A note on distance-weighted k-nearestneighbor rules. IEEE Transactions on Systems, Man, and Cybernetics,8:311–313.


Barron, A. (1985). Logically smooth density estimation. Technical Re-port TR 56, Department of Statistics, Stanford University.

Barron, A. (1989). Statistical properties of artificial neural networks.In Proceedings of the 28th Conference on Decision and Control, pages280–285. Tampa, FL.

Barron, A. (1991). Complexity regularization with application to ar-tificial neural networks. In Nonparametric Functional Estimation andRelated Topics, Roussas, G., editor, pages 561–576. NATO ASI Series,Kluwer Academic Publishers, Dordrecht.

Barron, A. (1993). Universal approximation bounds for superpositionsof a sigmoidal function. IEEE Transactions on Information Theory,39:930–944.

Barron, A. (1994). Approximation and estimation bounds for artificialneural networks. Machine Learning, 14:115–133.

Barron, A. and Barron, R. (1988). Statistical learning networks: a unify-ing view. In Proceedings of the 20th Symposium on the Interface: Com-puting Science and Statistics, Wegman, E., Gantz, D., and Miller, J.,editors, pages 192–203. AMS, Alexandria, VA.

Barron, A. and Cover, T. (1991). Minimum complexity density estima-tion. IEEE Transactions on Information Theory, 37:1034–1054.

Barron, A., Gyorfi, L., and van der Meulen, E. (1992). Distributionestimation consistent in total variation and in two types of informationdivergence. IEEE Transactions on Information Theory, 38:1437–1454.

Barron, R. (1975). Learning networks improve computer-aided predic-tion and control. Computer Design, 75:65–70.

Bartlett, P. (1993). Lower bounds on the Vapnik-Chervonenkis dimen-sion of multi-layer threshold networks. In Proceedings of the Sixth annualACM Conference on Computational Learning Theory, pages 144–150. As-sociation for Computing Machinery, New York.

Bashkirov, O., Braverman, E., and Muchnik, I. (1964). Potential func-tion algorithms for pattern recognition learning machines. Automationand Remote Control, 25:692–695.

Baum, E. (1988). On the capabilities of multilayer perceptrons. Journalof Complexity, 4:193–215.

Baum, E. and Haussler, D. (1989). What size net gives valid generaliza-tion? Neural Computation, 1:151–160.

Beakley, G. and Tuteur, F. (1972). Distribution-free pattern verificationusing statistically equivalent blocks. IEEE Transactions on Computers,21:1337–1347.

Beck, J. (1979). The exponential rate of convergence of error for kn-nnnonparametric regression and decision. Problems of Control and Infor-mation Theory, 8:303–311.

Becker, P. (1968). Recognition of Patterns. Polyteknisk Forlag, Copen-hagen.

Ben-Bassat, M. (1982). Use of distance measures, information measuresand error bounds in feature evaluation. In Handbook of Statistics, Krish-naiah, P. and Kanal, L., editors, volume 2, pages 773–792. North-Holland,Amsterdam.


Benedek, G. and Itai, A. (1988). Learnability by fixed distributions.In Computational Learning Theory: Proceedings of the 1988 Workshop,pages 80–90. Morgan Kaufman, San Mateo, CA.

Benedek, G. and Itai, A. (1994). Nonuniform learnability. Journal ofComputer and Systems Sciences, 48:311–323.

Bennett, G. (1962). Probability inequalities for the sum of indepen-dent random variables. Journal of the American Statistical Association,57:33–45.

Beran, R. (1977). Minimum Hellinger distance estimates for parametricmodels. Annals of Statistics, 5:445–463.

Beran, R. (1988). Comments on ”A new theoretical and algorithmicalbasis for estimation, identification and control” by P. Kovanic. Auto-matica, 24:283–287.

Bernstein, S. (1946). The Theory of Probabilities. Gastehizdat Publish-ing House, Moscow.

Bhattacharya, P. and Mack, Y. (1987). Weak convergence of k–nn den-sity and regression estimators with varying k and applications. Annalsof Statistics, 15:976–994.

Bhattacharyya, A. (1946). On a measure of divergence between twomultinomial populations. Sankhya, Series A, 7:401–406.

Bickel, P. and Breiman, L. (1983). Sums of functions of nearest neighbordistances, moment bounds, limit theorems and a goodness of fit test.Annals of Probability, 11:185–214.

Birge, L. (1983). Approximation dans les espaces metriques et theoriede l’estimation. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandteGebiete, 65:181–237.

Birge, L. (1986). On estimating a density using Hellinger distance andsome other strange facts. Probability Theory and Related Fields, 71:271–291.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. (1989).Learnability and the Vapnik-Chervonenkis dimension. Journal of theACM, 36:929–965.

Braverman, E. (1965). The method of potential functions. Automationand Remote Control, 26:2130–2138.

Braverman, E. and Pyatniskii, E. (1966). Estimation of the rate of con-vergence of algorithms based on the potential function method. Automa-tion and Remote Control, 27:80–100.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classifica-tion and Regression Trees. Wadsworth International, Belmont, CA.

Breiman, L., Meisel, W., and Purcell, E. (1977). Variable kernel esti-mates of multivariate densities. Technometrics, 19:135–144.

Brent, R. (1991). Fast training algorithms for multilayer neural nets.IEEE Transactions on Neural Networks, 2:346–354.

Broder, A. (1990). Strategies for efficient incremental nearest neighborsearch. Pattern Recognition, 23:171–178.

Broomhead, D. and Lowe, D. (1988). Multivariable functional interpo-lation and adaptive networks. Complex Systems, 2:321–323.


Buescher, K. and Kumar, P. (1994a). Learning by canonical smoothestimation, Part I: Simultaneous estimation. To appear in IEEE Trans-actions on Automatic Control.

Buescher, K. and Kumar, P. (1994b). Learning by canonical smoothestimation, Part II: Learning and choice of model complexity. To appearin IEEE Transactions on Automatic Control.

Burbea, J. (1984). The convexity with respect to gaussian distributionsof divergences of order α. Utilitas Mathematica, 26:171–192.

Burbea, J. and Rao, C. (1982). On the convexity of some divergencemeasures based on entropy functions. IEEE Transactions on InformationTheory, 28:48–495.

Burshtein, D., Della Pietra, V., Kanevsky, D., and Nadas, A. (1992).Minimum impurity partitions. Annals of Statistics, 20:1637–1646.

Cacoullos, T. (1965). Estimation of a multivariate density. Annals ofthe Institute of Statistical Mathematics, 18:179–190.

Carnal, H. (1970). Die konvexe Hulle von n rotationssymmetrisch verteil-ten Punkten. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandteGebiete, 15:168–176.

Casey, R. and Nagy, G. (1984). Decision tree design using a probabilisticmodel. IEEE Transactions on Information Theory, 30:93–99.

Cencov, N. (1962). Evaluation of an unknown distribution density fromobservations. Soviet Math. Doklady, 3:1559–1562.

Chang, C. (1974). Finding prototypes for nearest neighbor classifiers.IEEE Transactions on Computers, 26:1179–1184.

Chang, C. (1973). Dynamic programming as applied to feature selectionin pattern recognition systems. IEEE Transactions on Systems, Man,and Cybernetics, 3:166–171.

Chen, T., Chen, H., and Liu, R. (1990). A constructive proof and an ex-tension of Cybenko’s approximation theorem. In Proceedings of the 22ndSymposium of the Interface: Computing Science and Statistics, pages163–168. American Statistical Association, Alexandria, VA.

Chen, X. and Zhao, L. (1987). Almost sure L1-norm convergence fordata-based histogram density estimates. Journal of Multivariate Analy-sis, 21:179–188.

Chen, Z. and Fu, K. (1973). Nonparametric Bayes risk estimation forpattern classification. In Proceedings of the IEEE Conference on Sys-tems, Man, and Cybernetics. Boston, MA.

Chernoff, H. (1952). A measure of asymptotic efficiency of tests of ahypothesis based on the sum of observations. Annals of MathematicalStatistics, 23:493–507.

Chernoff, H. (1971). A bound on the classification error for discriminat-ing between populations with specified means and variances. In Studi diprobabilita, statistica e ricerca operativa in onore di Giuseppe Pompilj,pages 203–211. Oderisi, Gubbio.

Chou, P. (1991). Optimal partitioning for classification and regressiontrees. IEEE Transactions on Pattern Analysis and Machine Intelligence,13:340–354.


Chou, W. and Chen, Y. (1992). A new fast algorithm for effective train-ing of neural classifiers. Pattern Recognition, 25:423–429.

Chow, C. (1965). Statistical independence and threshold functions.IEEE Transactions on Computers, E-14:66–68.

Chow, C. (1970). On optimum recognition error and rejection tradeoff.IEEE Transactions on Information Theory, 16:41–46.

Chow, Y. and Teicher, H. (1978). Probability Theory, Independence,Interchangeability, Martingales. Springer-Verlag, New York.

Ciampi, A. (1991). Generalized regression trees. Computational Statis-tics and Data Analysis, 12:57–78.

Collomb, G. (1979). Estimation de la regression par la methode des kpoints les plus proches: proprietes de convergence ponctuelle. ComptesRendus de l’Academie des Sciences de Paris, 289:245–247.

Collomb, G. (1980). Estimation de la regression par la methode des kpoints les plus proches avec noyau. Lecture Notes in Mathematics #821,Springer-Verlag, Berlin. 159–175.

Collomb, G. (1981). Estimation non parametrique de la regression: revuebibliographique. International Statistical Review, 49:75–93.

Conway, J. and Sloane, N. (1993). Sphere-Packings, Lattices and Groups.Springer-Verlag, Berlin.

Coomans, D. and Broeckaert, I. (1986). Potential Pattern Recognition inChemical and Medical Decision Making. Research Studies Press, Letch-worth, Hertfordshire, England.

Cormen, T., Leiserson, C., and Rivest, R. (1990). Introduction to Algo-rithms. MIT Press, Boston.

Cover, T. (1965). Geometrical and statistical properties of systems oflinear inequalities with applications in pattern recognition. IEEE Trans-actions on Electronic Computers, 14:326–334.

Cover, T. (1968a). Estimation by the nearest neighbor rule. IEEE Trans-actions on Information Theory, 14:50–55.

Cover, T. (1968b). Rates of convergence for nearest neighbor procedures.In Proceedings of the Hawaii International Conference on Systems Sci-ences, pages 413–415. Honolulu.

Cover, T. (1969). Learning in pattern recognition. In Methodologiesof Pattern Recognition, Watanabe, S., editor, pages 111–132. AcademicPress, New York.

Cover, T. (1974). The best two independent measurements are not thetwo best. IEEE Transactions on Systems, Man, and Cybernetics, 4:116–117.

Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification.IEEE Transactions on Information Theory, 13:21–27.

Cover, T. and Thomas, J. (1991). Elements of Information Theory. JohnWiley, New York.

Cover, T. and Van Campenhout, J. (1977). On the possible orderingsin the measurement selection problem. IEEE Transactions on Systems,Man, and Cybernetics, 7:657–661.


Cover, T. and Wagner, T. (1975). Topics in statistical pattern recogni-tion. Communication and Cybernetics, 10:15–46.

Cramer, H. and Wold, H. (1936). Some theorems on distribution func-tions. Journal of the London Mathematical Society, 11:290–294.

Csibi, S. (1971). Simple and compound processes in iterative machinelearning. Technical Report, CISM Summer Course, Udine, Italy.

Csibi, S. (1975). Using indicators as a base for estimating optimal deci-sion functions. In Colloquia Mathematica Societatis Janos Bolyai: Topicsin Information Theory, pages 143–153. Keszthely, Hungary.

Csiszar, I. (1967). Information-type measures of difference of probabilitydistributions and indirect observations. Studia Scientiarium Mathemati-carum Hungarica, 2:299–318.

Csiszar, I. (1973). Generalized entropy and quantization problems.In Transactions of the Sixth Prague Conference on Information The-ory, Statistical Decision Functions, Random Processes, pages 159–174.Academia, Prague.

Csiszar, I. and Korner, J. (1981). Information Theory: Coding Theoremsfor Discrete Memoryless Systems. Academic Press, New York.

Cybenko, G. (1989). Approximations by superpositions of sigmoidalfunctions. Math. Control, Signals, Systems, 2:303–314.

Darken, C., Donahue, M., Gurvits, L., and Sontag, E. (1993). Rateof approximation results motivated by robust neural network learning.In Proceedings of the Sixth ACM Workshop on Computational Learn-ing Theory, pages 303–309. Association for Computing Machinery, NewYork.

Das Gupta, S. (1964). Nonparametric classification rules. Sankhya Se-ries A, 26:25–30.

Das Gupta, S. and Lin, H. (1980). Nearest neighbor rules of statisticalclassification based on ranks. Sankhya Series A, 42:419–430.

Dasarathy, B. (1991). Nearest Neighbor Pattern Classification Tech-niques. IEEE Computer Society Press, Los Alamitos, CA.

Day, N. and Kerridge, D. (1967). A general maximum likelihood dis-criminant. Biometrics, 23:313–324.

de Guzman, M. (1975). Differentiation of Integrals in Rn. Lecture Notesin Mathematics #481, Springer-Verlag, Berlin.

Deheuvels, P. (1977). Estimation nonparametrique de la densite parhistogrammes generalises. Publications de l’Institut de Statistique del’Universite de Paris, 22:1–23.

Devijver, P. (1978). A note on ties in voting with the k-nn rule. PatternRecognition, 10:297–298.

Devijver, P. (1979). New error bounds with the nearest neighbor rule.IEEE Transactions on Information Theory, 25:749–753.

Devijver, P. (1980). An overview of asymptotic properties of nearestneighbor rules. In Pattern Recognition in Practice, Gelsema, E. andKanal, L., editors, pages 343–350. Elsevier Science Publishers, Amster-dam.


Devijver, P. and Kittler, J. (1980). On the edited nearest neighbor rule.In Proceedings of the Fifth International Conference on Pattern Recog-nition, pages 72–80. Pattern Recognition Society, Los Alamitos, CA.

Devijver, P. and Kittler, J. (1982). Pattern Recognition: A StatisticalApproach. Prentice-Hall, Englewood Cliffs, NJ.

Devroye, L. (1978). A universal k-nearest neighbor procedure in dis-crimination. In Proceedings of the 1978 IEEE Computer Society Confer-ence on Pattern Recognition and Image Processing, pages 142–147. IEEEComputer Society, Long Beach, CA.

Devroye, L. (1981a). On the almost everywhere convergence of nonpara-metric regression function estimates. Annals of Statistics, 9:1310–1309.

Devroye, L. (1981b). On the asymptotic probability of error in nonpara-metric discrimination. Annals of Statistics, 9:1320–1327.

Devroye, L. (1981c). On the inequality of Cover and Hart in nearestneighbor discrimination. IEEE Transactions on Pattern Analysis andMachine Intelligence, 3:75–78.

Devroye, L. (1982a). Bounds for the uniform deviation of empirical mea-sures. Journal of Multivariate Analysis, 12:72–79.

Devroye, L. (1982b). Necessary and sufficient conditions for the al-most everywhere convergence of nearest neighbor regression functionestimates. Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Ge-biete, 61:467–481.

Devroye, L. (1983). The equivalence of weak, strong and complete con-vergence in L1 for kernel density estimates. Annals of Statistics, 11:896–904.

Devroye, L. (1987). A Course in Density Estimation. Birkhauser,Boston.

Devroye, L. (1988a). Applications of the theory of records in the studyof random trees. Acta Informatica, 26:123–130.

Devroye, L. (1988b). Automatic pattern recognition: A study of theprobability of error. IEEE Transactions on Pattern Analysis and Ma-chine Intelligence, 10:530–543.

Devroye, L. (1988c). The expected size of some graphs in computationalgeometry. Computers and Mathematics with Applications, 15:53–64.

Devroye, L. (1988d). The kernel estimate is relatively stable. ProbabilityTheory and Related Fields, 77:521–536.

Devroye, L. (1991a). Exponential inequalities in nonparametric esti-mation. In Nonparametric Functional Estimation and Related Topics,Roussas, G., editor, pages 31–44. NATO ASI Series, Kluwer AcademicPublishers, Dordrecht.

Devroye, L. (1991b). On the oscillation of the expected number of pointson a random convex hull. Statistics and Probability Letters, 11:281–286.

Devroye, L. and Gyorfi, L. (1983). Distribution-free exponential boundon the L1 error of partitioning estimates of a regression function. In Pro-ceedings of the Fourth Pannonian Symposium on Mathematical Statis-tics, Konecny, F., Mogyorodi, J., and Wertz, W., editors, pages 67–76.Akademiai Kiado, Budapest, Hungary.


Devroye, L. and Gyorfi, L. (1985). Nonparametric Density Estimation: The L1 View. John Wiley, New York.

Devroye, L. and Gyorfi, L. (1992). No empirical probability measurecan converge in the total variation sense for all distributions. Annals ofStatistics, 18:1496–1499.

Devroye, L., Gyorfi, L., Krzyzak, A., and Lugosi, G. (1994). On thestrong universal consistency of nearest neighbor regression function esti-mates. Annals of Statistics, 22:1371–1385.

Devroye, L. and Krzyzak, A. (1989). An equivalence theorem for L1 con-vergence of the kernel regression estimate. Journal of Statistical Planningand Inference, 23:71–82.

Devroye, L. and Laforest, L. (1990). An analysis of random d-dimensional quadtrees. SIAM Journal on Computing, 19:821–832.

Devroye, L. and Lugosi, G. (1995). Lower bounds in pattern recognitionand learning. Pattern Recognition. To appear.

Devroye, L. and Machell, F. (1985). Data structures in kernel densityestimation. IEEE Transactions on Pattern Analysis and Machine Intel-ligence, 7:360–366.

Devroye, L. and Wagner, T. (1976a). A distribution-free performancebound in error estimation. IEEE Transactions Information Theory,22:586–587.

Devroye, L. and Wagner, T. (1976b). Nonparametric discrimination anddensity estimation. Technical Report 183, Electronics Research Center,University of Texas.

Devroye, L. and Wagner, T. (1979a). Distribution-free inequalities forthe deleted and holdout error estimates. IEEE Transactions on Infor-mation Theory, 25:202–207.

Devroye, L. and Wagner, T. (1979b). Distribution-free performancebounds for potential function rules. IEEE Transactions on InformationTheory, 25:601–604.

Devroye, L. and Wagner, T. (1979c). Distribution-free performancebounds with the resubstitution error estimate. IEEE Transactions onInformation Theory, 25:208–210.

Devroye, L. and Wagner, T. (1980a). Distribution-free consistency re-sults in nonparametric discrimination and regression function estimation.Annals of Statistics, 8:231–239.

Devroye, L. and Wagner, T. (1980b). On the L1 convergence of kernelestimators of regression functions with applications in discrimination.Zeitschrift fur Wahrscheinlichkeitstheorie und verwandte Gebiete, 51:15–21.

Devroye, L. and Wagner, T. (1982). Nearest neighbor methods in dis-crimination. In Handbook of Statistics, Krishnaiah, P. and Kanal, L.,editors, volume 2, pages 193–197. North Holland, Amsterdam.

Devroye, L. and Wise, G. (1980). Consistency of a recursive nearestneighbor regression function estimate. Journal of Multivariate Analysis,10:539–550.

Diaconis, P. and Shahshahani, M. (1984). On nonlinear functions oflinear combinations. SIAM Journal on Scientific and Statistical Com-puting, 5:175–191.


Do-Tu, H. and Installe, M. (1975). On adaptive solution of piecewiselinear approximation problem—application to modeling and identifica-tion. In Milwaukee Symposium on Automatic Computation and Control,pages 1–6. Milwaukee, WI.

Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis.John Wiley, New York.

Dudani, S. (1976). The distance-weighted k-nearest-neighbor rule. IEEETransactions on Systems, Man, and Cybernetics, 6:325–327.

Dudley, R. (1978). Central limit theorems for empirical measures. An-nals of Probability, 6:899–929.

Dudley, R. (1979). Balls in Rk do not cut all subsets of k + 2 points.Advances in Mathematics, 31 (3):306–308.

Dudley, R. (1984). Empirical processes. In Ecole de Probabilite de St.Flour 1982. Lecture Notes in Mathematics #1097, Springer-Verlag, NewYork.

Dudley, R. (1987). Universal Donsker classes and metric entropy. Annalsof Probability, 15:1306–1326.

Dudley, R., Kulkarni, S., Richardson, T., and Zeitouni, O. (1994). Ametric entropy bound is not sufficient for learnability. IEEE Transactionson Information Theory, 40:883–885.

Durrett, R. (1991). Probability: Theory and Examples. Wadsworth andBrooks/Cole, Pacific Grove, CA.

Dvoretzky, A. (1956). On stochastic approximation. In Proceedings ofthe First Berkeley Symposium on Mathematical Statistics and Proba-bility, Neyman, J., editor, pages 39–55. University of California Press,Berkeley, Los Angeles.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic minimaxcharacter of a sample distribution function and of the classical multino-mial estimator. Annals of Mathematical Statistics, 33:642–669.

Edelsbrunner, H. (1987). Algorithms for Computational Geometry.Springer-Verlag, Berlin.

Efron, B. (1979). Bootstrap methods: another look at the jackknife.Annals of Statistics, 7:1–26.

Efron, B. (1983). Estimating the error rate of a prediction rule: improve-ment on cross validation. Journal of the American Statistical Associa-tion, 78:316–331.

Efron, B. and Stein, C. (1981). The jackknife estimate of variance. An-nals of Statistics, 9:586–596.

Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. (1989). Ageneral lower bound on the number of examples needed for learning.Information and Computation, 82:247–261.

Elashoff, J., Elashoff, R., and Goldmann, G. (1967). On the choiceof variables in classification problems with dichotomous variables.Biometrika, 54:668–670.

Fabian, V. (1971). Stochastic approximation. In Optimizing Methods inStatistics, Rustagi, J., editor, pages 439–470. Academic Press, New York,London.


Fano, R. (1952). Class Notes for Transmission of Information. Course6.574. MIT, Cambridge, MA.

Farago, A., Linder, T., and Lugosi, G. (1993). Fast nearest neighborsearch in dissimilarity spaces. IEEE Transactions on Pattern Analysisand Machine Intelligence, 15:957–962.

Farago, A. and Lugosi, G. (1993). Strong universal consistency of neuralnetwork classifiers. IEEE Transactions on Information Theory, 39:1146–1151.

Farago, T. and Gyorfi, L. (1975). On the continuity of the error distor-tion function for multiple hypothesis decisions. IEEE Transactions onInformation Theory, 21:458–460.

Feinholz, L. (1979). Estimation of the performance of partitioning algo-rithms in pattern classification. Master’s Thesis, Department of Mathe-matics, McGill University, Montreal.

Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1. John Wiley, New York.

Finkel, R. and Bentley, J. (1974). Quad trees: A data structure forretrieval on composite keys. Acta Informatica, 4:1–9.

Fisher, R. (1936). The case of multiple measurements in taxonomicproblems. Annals of Eugenics, 7, part II:179–188.

Fitzmaurice, G. and Hand, D. (1987). A comparison of two averageconditional error rate estimators. Pattern Recognition Letters, 6:221–224.

Fix, E. and Hodges, J. (1951). Discriminatory analysis. Nonparamet-ric discrimination: Consistency properties. Technical Report 4, ProjectNumber 21-49-004, USAF School of Aviation Medicine, Randolph Field,TX.

Fix, E. and Hodges, J. (1952). Discriminatory analysis: small sampleperformance. Technical Report 21-49-004, USAF School of AviationMedicine, Randolph Field, TX.

Fix, E. and Hodges, J. (1991a). Discriminatory analysis, nonparamet-ric discrimination, consistency properties. In Nearest Neighbor Pat-tern Classification Techniques, Dasarathy, B., editor, pages 32–39. IEEEComputer Society Press, Los Alamitos, CA.

Fix, E. and Hodges, J. (1991b). Discriminatory analysis: small sam-ple performance. In Nearest Neighbor Pattern Classification Techniques,Dasarathy, B., editor, pages 40–56. IEEE Computer Society Press, LosAlamitos, CA.

Flick, T., Jones, L., Priest, R., and Herman, C. (1990). Pattern classifi-cation using protection pursuit. Pattern Recognition, 23:1367–1376.

Forney, G. (1968). Exponential error bounds for erasure, list, and de-cision feedback schemes. IEEE Transactions on Information Theory,14:41–46.

Friedman, J. (1977). A recursive partitioning decision rule for nonpara-metric classification. IEEE Transactions on Computers, 26:404–408.

Friedman, J., Baskett, F., and Shustek, L. (1975). An algorithm forfinding nearest neighbor. IEEE Transactions on Computers, 24:1000–1006.


Friedman, J., Bentley, J., and Finkel, R. (1977). An algorithm for find-ing best matches in logarithmic expected time. ACM Transactions onMathematical Software, 3:209–226.

Friedman, J. and Silverman, B. (1989). Flexible parsimonious smoothingand additive modeling. Technometrics, 31:3–39.

Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression.Journal of the American Statistical Association, 76:817–823.

Friedman, J., Stuetzle, W., and Schroeder, A. (1984). Projection pur-suit density estimation. Journal of the American Statistical Association,79:599–608.

Friedman, J. and Tukey, J. (1974). A projection pursuit algorithm forexploratory data analysis. IEEE Transactions on Computers, 23:881–889.

Fritz, J. and Gyorfi, L. (1976). On the minimization of classificationerror probability in statistical pattern recognition. Problems of Controland Information Theory, 5:371–382.

Fu, K., Min, P., and Li, T. (1970). Feature selection in pattern recogni-tion. IEEE Transactions on Systems Science and Cybernetics, 6:33–39.

Fuchs, H., Abram, G., and Grant, E. (1983). Near real-time shadeddisplay of rigid objects. Computer Graphics, 17:65–72.

Fuchs, H., Kedem, Z., and Naylor, B. (1980). On visible surface gener-ation by a priori tree structures. In Proceedings SIGGRAPH ’80, pages124–133. Published as Computer Graphics, volume 14.

Fukunaga, K. and Flick, T. (1984). An optimal global nearest neighbormetric. IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 6:314–318.

Fukunaga, K. and Hostetler, L. (1973). Optimization of k-nearestneighbor density estimates. IEEE Transactions on Information Theory,19:320–326.

Fukunaga, K. and Hummels, D. (1987). Bayes error estimation usingParzen and k-nn procedures. IEEE Transactions on Pattern Analysisand Machine Intelligence, 9:634–643.

Fukunaga, K. and Kessel, D. (1971). Estimation of classification error.IEEE Transactions on Computers, 20:1521–1527.

Fukunaga, K. and Kessel, D. (1973). Nonparametric Bayes error estima-tion using unclassified samples. IEEE Transactions Information Theory,19:434–440.

Fukunaga, K. and Mantock, J. (1984). Nonparametric data reduction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:115–118.

Fukunaga, K. and Narendra, P. (1975). A branch and bound algorithmfor computing k-nearest neighbors. IEEE Transactions on Computers,24:750–753.

Funahashi, K. (1989). On the approximate realization of continuousmappings by neural networks. Neural Networks, 2:183–192.

Gabor, D. (1961). A universal nonlinear filter, predictor, and simulatorwhich optimizes itself. Proceedings of the Institute of Electrical Engi-neers, 108B:422–438.


Gabriel, K. and Sokal, R. (1969). A new statistical approach to geo-graphic variation analysis. Systematic Zoology, 18:259–278.

Gaenssler, P. (1983). Empirical processes. Lecture Notes–MonographSeries, Institute of Mathematical Statistics, Hayward, CA.

Gaenssler, P. and Stute, W. (1979). Empirical processes: A survey ofresults for independent identically distributed random variables. Annalsof Probability, 7:193–243.

Gallant, A. (1987). Nonlinear Statistical Models. John Wiley, New York.

Ganesalingam, S. and McLachlan, G. (1980). Error rate estimation onthe basis of posterior probabilities. Pattern Recognition, 12:405–413.

Garnett, J. and Yau, S. (1977). Nonparametric estimation of the Bayeserror of feature extractors using ordered nearest neighbour sets. IEEETransactions on Computers, 26:46–54.

Gates, G. (1972). The reduced nearest neighbor rule. IEEE Transactionson Information Theory, 18:431–433.

Gelfand, S. and Delp, E. (1991). On tree structured classifiers. In Arti-ficial Neural Networks and Statistical Pattern Recognition, Old and NewConnections, Sethi, I. and Jain, A., editors, pages 71–88. Elsevier SciencePublishers, Amsterdam.

Gelfand, S., Ravishankar, C., and Delp, E. (1989). An iterative growingand pruning algorithm for classification tree design. In Proceedings of the1989 IEEE International Conference on Systems, Man, and Cybernetics,pages 818–823. IEEE Press, Piscataway, NJ.

Gelfand, S., Ravishankar, C., and Delp, E. (1991). An iterative growingand pruning algorithm for classification tree design. IEEE Transactionson Pattern Analysis and Machine Intelligence, 13:163–174.

Geman, S. and Hwang, C. (1982). Nonparametric maximum likelihoodestimation by the method of sieves. Annals of Statistics, 10:401–414.

Gessaman, M. (1970). A consistent nonparametric multivariate densityestimator based on statistically equivalent blocks. Annals of Mathemat-ical Statistics, 41:1344–1346.

Gessaman, M. and Gessaman, P. (1972). A comparison of some multi-variate discrimination procedures. Journal of the American StatisticalAssociation, 67:468–472.

Geva, S. and Sitte, J. (1991). Adaptive nearest neighbor pattern classi-fication. IEEE Transactions on Neural Networks, 2:318–322.

Gine, E. and Zinn, J. (1984). Some limit theorems for empirical pro-cesses. Annals of Probability, 12:929–989.

Glick, N. (1973). Sample-based multinomial classification. Biometrics,29:241–256.

Glick, N. (1976). Sample-based classification procedures related to em-piric distributions. IEEE Transactions on Information Theory, 22:454–461.

Glick, N. (1978). Additive estimators for probabilities of correct classifi-cation. Pattern Recognition, 10:211–222.


Goldberg, P. and Jerrum, M. (1993). Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real num-bers. In Proceedings of the Sixth ACM Workshop on ComputationalLearning Theory, pages 361–369. Association for Computing Machinery,New York.

Goldstein, M. (1977). A two-group classification procedure for multivari-ate dichotomous responses. Multivariate Behavorial Research, 12:335–346.

Goldstein, M. and Dillon, W. (1978). Discrete Discriminant Analysis.John Wiley, New York.

Golea, M. and Marchand, M. (1990). A growth algorithm for neuralnetwork decision trees. Europhysics Letters, 12:205–210.

Goodman, R. and Smyth, P. (1988). Decision tree design from a commu-nication theory viewpoint. IEEE Transactions on Information Theory,34:979–994.

Gordon, L. and Olshen, R. (1984). Almost surely consistent nonpara-metric regression from recursive partitioning schemes. Journal of Multi-variate Analysis, 15:147–163.

Gordon, L. and Olshen, R. (1978). Asymptotically efficient solutions tothe classification problem. Annals of Statistics, 6:515–533.

Gordon, L. and Olshen, R. (1980). Consistent nonparametric regressionfrom recursive partitioning schemes. Journal of Multivariate Analysis,10:611–627.

Gowda, K. and Krishna, G. (1979). The condensed nearest neighbor ruleusing the concept of mutual nearest neighborhood. IEEE Transactionson Information Theory, 25:488–490.

Greblicki, W. (1974). Asymptotically optimal probabilistic algorithmsfor pattern recognition and identification. Technical Report, MonografieNo. 3, Prace Naukowe Instytutu Cybernetyki Technicznej PolitechnikiWroclawsjiej No. 18, Wroclaw, Poland.

Greblicki, W. (1978a). Asymptotically optimal pattern recognition pro-cedures with density estimates. IEEE Transactions on Information The-ory, 24:250–251.

Greblicki, W. (1978b). Pattern recognition procedures with nonpara-metric density estimates. IEEE Transactions on Systems, Man and Cy-bernetics, 8:809–812.

Greblicki, W. (1981). Asymptotic efficiency of classifying proceduresusing the Hermite series estimate of multivariate probability densities.IEEE Transactions on Information Theory, 27:364–366.

Greblicki, W., Krzyzak, A., and Pawlak, M. (1984). Distribution-freepointwise consistency of kernel regression estimate. Annals of Statistics,12:1570–1575.

Greblicki, W. and Pawlak, M. (1981). Classification using the Fourierseries estimate of multivariate density functions. IEEE Transactions onSystems, Man and Cybernetics, 11:726–730.

Greblicki, W. and Pawlak, M. (1982). A classification procedure usingthe multiple Fourier series. Information Sciences, 26:115–126.


Greblicki, W. and Pawlak, M. (1983). Almost sure convergence of clas-sification procedures using Hermite series density estimates. PatternRecognition Letters, 2:13–17.

Greblicki, W. and Pawlak, M. (1985). Pointwise consistency of the Her-mite series density estimate. Statistics and Probability Letters, 3:65–69.

Greblicki, W. and Pawlak, M. (1987). Necessary and sufficient conditionsfor Bayes risk consistency of a recursive kernel classification rule. IEEETransactions on Information Theory, 33:408–412.

Grenander, U. (1981). Abstract Inference. John Wiley, New York.

Grimmett, G. and Stirzaker, D. (1992). Probability and Random Pro-cesses. Oxford University Press, Oxford.

Guo, H. and Gelfand, S. (1992). Classification trees with neural networkfeature extraction. IEEE Transactions on Neural Networks, 3:923–933.

Gyorfi, L. (1975). An upper bound of error probabilities for multihypoth-esis testing and its application in adaptive pattern recognition. Problemsof Control and Information Theory, 5:449–457.

Gyorfi, L. (1978). On the rate of convergence of nearest neighbor rules.IEEE Transactions on Information Theory, 29:509–512.

Gyorfi, L. (1981). Recent results on nonparametric regression estimateand multiple classification. Problems of Control and Information Theory,10:43–52.

Gyorfi, L. (1984). Adaptive linear procedures under general conditions.IEEE Transactions on Information Theory, 30:262–267.

Gyorfi, L. and Gyorfi, Z. (1975). On the nonparametric estimate of a pos-teriori probabilities of simple statistical hypotheses. In Colloquia Math-ematica Societatis Janos Bolyai: Topics in Information Theory, pages299–308. Keszthely, Hungary.

Gyorfi, L. and Gyorfi, Z. (1978). An upper bound on the asymptoticerror probability of the k-nearest neighbor rule for multiple classes. IEEETransactions on Information Theory, 24:512–514.

Gyorfi, L., Gyorfi, Z., and Vajda, I. (1978). Bayesian decision with re-jection. Problems of Control and Information Theory, 8:445–452.

Gyorfi, L. and Vajda, I. (1980). Upper bound on the error probabilityof detection in non-gaussian noise. Problems of Control and InformationTheory, 9:215–224.

Haagerup, U. (1978). Les meilleures constantes de l'inegalite de Khintchine. Comptes Rendus de l'Academie des Sciences de Paris A, 286:259–262.

Habbema, J., Hermans, J., and van den Broek, K. (1974). A stepwisediscriminant analysis program using density estimation. In COMPSTAT1974, Bruckmann, G., editor, pages 101–110. Physica Verlag, Wien.

Hagerup, T. and Rub, C. (1990). A guided tour of Chernoff bounds.Information Processing Letters, 33:305–308.

Hall, P. (1981). On nonparametric multivariate binary discrimination.Biometrika, 68:287–294.

Hall, P. (1989). On projection pursuit regression. Annals of Statistics,17:573–588.


Hall, P. and Wand, M. (1988). On nonparametric discrimination usingdensity differences. Biometrika, 75:541–547.

Hand, D. (1981). Discrimination and Classification. John Wiley, Chich-ester, U.K.

Hand, D. (1986). Recent advances in error rate estimation. PatternRecognition Letters, 4:335–346.

Hardle, W. and Marron, J. (1985). Optimal bandwidth selection in non-parametric regression function estimation. Annals of Statistics, 13:1465–1481.

Hart, P. (1968). The condensed nearest neighbor rule. IEEE Transac-tions on Information Theory, 14:515–516.

Hartigan, J. (1975). Clustering Algorithms. John Wiley, New York.

Hartmann, C., Varshney, P., Mehrotra, K., and Gerberich, C. (1982).Application of information theory to the construction of efficient decisiontrees. IEEE Transactions on Information Theory, 28:565–577.

Hashlamoun, W., Varshney, P., and Samarasooriya, V. (1994). A tightupper bound on the Bayesian probability of error. IEEE Transactionson Pattern Analysis and Machine Intelligence, 16:220–224.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models.Chapman and Hall, London, U. K.

Haussler, D. (1991). Sphere packing numbers for subsets of the booleann-cube with bounded Vapnik-Chervonenkis dimension. Technical Re-port, Computer Research Laboratory, University of California, SantaCruz.

Haussler, D. (1992). Decision theoretic generalizations of the PAC modelfor neural net and other learning applications. Information and Compu-tation, 100:78–150.

Haussler, D., Littlestone, N., and Warmuth, M. (1988). Predicting 0, 1functions from randomly drawn points. In Proceedings of the 29th IEEESymposium on the Foundations of Computer Science, pages 100–109.IEEE Computer Society Press, Los Alamitos, CA.

Hecht-Nielsen, R. (1987). Kolmogorov’s mapping network existence the-orem. In IEEE First International Conference on Neural Networks, vol-ume 3, pages 11–13. IEEE, Piscataway, NJ.

Hellman, M. (1970). The nearest neighbor classification rule with a rejectoption. IEEE Transactions on Systems, Man and Cybernetics, 2:179–185.

Henrichon, E. and Fu, K. (1969). A nonparametric partitioning pro-cedure for pattern classification. IEEE Transactions on Computers,18:614–624.

Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theoryof Neural Computation. Addison-Wesley, Redwood City, CA.

Hills, M. (1966). Allocation rules and their error rates. Journal of theRoyal Statistical Society, B28:1–31.

Hjort, N. (1986a). Contribution to the discussion of a paper by P. Dia-conis and Freeman. Annals of Statistics, 14:49–55.


Hjort, N. (1986b). Notes on the theory of statistical symbol recognition. Technical Report 778, Norwegian Computing Centre, Oslo.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30.

Holmstrom, L. and Klemela, J. (1992). Asymptotic bounds for the expected L1 error of a multivariate kernel density estimator. Journal of Multivariate Analysis, 40:245–255.

Hora, S. and Wilcox, J. (1982). Estimation of error rates in several-population discriminant analysis. Journal of Marketing Research, 19:57–61.

Horibe, Y. (1970). On zero error probability of binary decisions. IEEE Transactions on Information Theory, 16:347–348.

Horne, B. and Hush, D. (1990). On the optimality of the sigmoid perceptron. In International Joint Conference on Neural Networks, volume 1, pages 269–272. Lawrence Erlbaum Associates, Hillsdale, NJ.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251–257.

Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6:1069–1072.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366.

Huber, P. (1985). Projection pursuit. Annals of Statistics, 13:435–525.

Hudimoto, H. (1957). A note on the probability of the correct classification when the distributions are not specified. Annals of the Institute of Statistical Mathematics, 9:31–36.

Ito, T. (1969). Note on a class of statistical recognition functions. IEEE Transactions on Computers, C-18:76–79.

Ito, T. (1972). Approximate error bounds in pattern recognition. In Machine Intelligence, Meltzer, B. and Michie, D., editors, pages 369–376. Edinburgh University Press, Edinburgh.

Ivakhnenko, A. (1968). The group method of data handling—a rival of the method of stochastic approximation. Soviet Automatic Control, 1:43–55.

Ivakhnenko, A. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, 1:364–378.

Ivakhnenko, A., Konovalenko, V., Tulupchuk, Y., and Tymchenko, I. (1968). The group method of data handling in pattern recognition and decision problems. Soviet Automatic Control, 1:31–41.

Ivakhnenko, A., Petrache, G., and Krasyts'kyy, M. (1968). A GMDH algorithm with random selection of pairs. Soviet Automatic Control, 5:23–30.

Jain, A., Dubes, R., and Chen, C. (1987). Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:628–633.

Jeffreys, H. (1948). Theory of Probability. Clarendon Press, Oxford.

Johnson, D. and Preparata, F. (1978). The densest hemisphere problem. Theoretical Computer Science, 6:93–107.


Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5:501–506.

Kailath, T. (1967). The divergence and Bhattacharyya distance measures in signal detection. IEEE Transactions on Communication Technology, 15:52–60.

Kanal, L. (1974). Patterns in pattern recognition. IEEE Transactions on Information Theory, 20:697–722.

Kaplan, M. (1985). The uses of spatial coherence in ray tracing. SIGGRAPH '85 Course Notes, 11:22–26.

Karlin, A. (1968). Total Positivity, Volume 1. Stanford University Press, Stanford, CA.

Karp, R. (1988). Probabilistic Analysis of Algorithms. Class Notes, University of California, Berkeley.

Karpinski, M. and Macintyre, A. (1994). Quadratic bounds for VC dimension of sigmoidal neural networks. Submitted to the ACM Symposium on Theory of Computing.

Kazmierczak, H. and Steinbuch, K. (1963). Adaptive systems in pattern recognition. IEEE Transactions on Electronic Computers, 12:822–835.

Kemp, R. (1984). Fundamentals of the Average Case Analysis of Particular Algorithms. B.G. Teubner, Stuttgart.

Kemperman, J. (1969). On the optimum rate of transmitting information. In Probability and Information Theory, pages 126–169. Springer Lecture Notes in Mathematics, Springer-Verlag, Berlin.

Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462–466.

Kim, B. and Park, S. (1986). A fast k-nearest neighbor finding algorithm based on the ordered partition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:761–766.

Kittler, J. and Devijver, P. (1981). An efficient estimator of pattern recognition system error probability. Pattern Recognition, 13:245–249.

Knoke, J. (1986). The robust estimation of classification error rates. Computers and Mathematics with Applications, 12A:253–260.

Kohonen, T. (1988). Self-Organization and Associative Memory. Springer-Verlag, Berlin.

Kohonen, T. (1990). Statistical pattern recognition revisited. In Advanced Neural Computers, Eckmiller, R., editor, pages 137–144. North-Holland, Amsterdam.

Kolmogorov, A. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk USSR, 114:953–956.

Kolmogorov, A. and Tikhomirov, V. (1961). ε-entropy and ε-capacity of sets in function spaces. Translations of the American Mathematical Society, 17:277–364.

Koutsougeras, C. and Papachristou, C. (1989). Training of a neural network for pattern classification based on an entropy measure. In Proceedings of IEEE International Conference on Neural Networks, volume 1, pages 247–254. IEEE San Diego Section, San Diego, CA.


Kraaijveld, M. and Duin, R. (1991). Generalization capabilities of minimal kernel-based methods. In International Joint Conference on Neural Networks, volume 1, pages 843–848. Piscataway, NJ.

Kronmal, R. and Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63:925–952.

Krzanowski, W. (1987). A comparison between two distance-based discriminant principles. Journal of Classification, 4:73–84.

Krzyzak, A. (1983). Classification procedures using multivariate variable kernel density estimate. Pattern Recognition Letters, 1:293–298.

Krzyzak, A. (1986). The rates of convergence of kernel regression estimates and classification rules. IEEE Transactions on Information Theory, 32:668–679.

Krzyzak, A. (1991). On exponential bounds on the Bayes risk of the kernel classification rule. IEEE Transactions on Information Theory, 37:490–499.

Krzyzak, A., Linder, T., and Lugosi, G. (1993). Nonparametric estimation and classification using radial basis function nets and empirical risk minimization. IEEE Transactions on Neural Networks. To appear.

Krzyzak, A. and Pawlak, M. (1983). Universal consistency results for Wolverton-Wagner regression function estimate with application in discrimination. Problems of Control and Information Theory, 12:33–42.

Krzyzak, A. and Pawlak, M. (1984a). Almost everywhere convergence of recursive kernel regression function estimates. IEEE Transactions on Information Theory, 31:91–93.

Krzyzak, A. and Pawlak, M. (1984b). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Transactions on Information Theory, 30:78–81.

Kulkarni, S. (1991). Problems of computational and information complexity in machine vision and learning. PhD Thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA.

Kullback, S. (1967). A lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, 13:126–127.

Kullback, S. and Leibler, A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:79–86.

Kushner, H. (1984). Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory. MIT Press, Cambridge, MA.

Lachenbruch, P. (1967). An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics, 23:639–645.

Lachenbruch, P. and Mickey, M. (1968). Estimation of error rates in discriminant analysis. Technometrics, 10:1–11.

LeCam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Annals of Mathematical Statistics, 41:802–828.


LeCam, L. (1973). Convergence of estimates under dimensionality restrictions. Annals of Statistics, 1:38–53.

Li, X. and Dubes, R. (1986). Tree classifier design with a permutation statistic. Pattern Recognition, 19:229–235.

Lin, Y. and Fu, K. (1983). Automatic classification of cervical cells using a binary tree classifier. Pattern Recognition, 16:69–80.

Linde, Y., Buzo, A., and Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28:84–95.

Linder, T., Lugosi, G., and Zeger, K. (1994). Rates of convergence in the source coding theorem, empirical quantizer design, and universal lossy source coding. IEEE Transactions on Information Theory, 40:1728–1740.

Lissack, T. and Fu, K. (1976). Error estimation in pattern recognition via Lα distance between posterior density functions. IEEE Transactions on Information Theory, 22:34–45.

Ljung, L., Pflug, G., and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems. Birkhauser, Basel, Boston, Berlin.

Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.

Loftsgaarden, D. and Quesenberry, C. (1965). A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–1051.

Logan, B. (1975). The uncertainty principle in reconstructing functions from projections. Duke Mathematical Journal, 42:661–706.

Logan, B. and Shepp, L. (1975). Optimal reconstruction of a function from its projections. Duke Mathematical Journal, 42:645–660.

Loh, W. and Vanichsetakul, N. (1988). Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association, 83:715–728.

Loizou, G. and Maybank, S. (1987). The nearest neighbor and the Bayes error rates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:254–262.

Lorentz, G. (1976). The thirteenth problem of Hilbert. In Proceedings of Symposia in Pure Mathematics, volume 28, pages 419–430. Providence, RI.

Lugosi, G. (1992). Learning with an unreliable teacher. Pattern Recognition, 25:79–87.

Lugosi, G. (1995). Improved upper bounds for probabilities of uniform deviations. Statistics and Probability Letters, 25:71–77.

Lugosi, G. and Nobel, A. (1993). Consistency of data-driven histogram methods for density estimation and classification. Technical Report, Beckman Institute, University of Illinois, Urbana-Champaign. To appear in the Annals of Statistics.

Lugosi, G. and Pawlak, M. (1994). On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Transactions on Information Theory, 40:475–481.


Lugosi, G. and Zeger, K. (1994). Concept learning using complexity regularization. To appear in IEEE Transactions on Information Theory.

Lugosi, G. and Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41:677–678.

Lukacs, E. and Laha, R. (1964). Applications of Characteristic Functions in Probability Theory. Griffin, London.

Lunts, A. and Brailovsky, V. (1967). Evaluation of attributes obtained in statistical decision rules. Engineering Cybernetics, 3:98–109.

Maass, W. (1993). Bounds for the computational power and learning complexity of analog neural nets. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pages 335–344. Association of Computing Machinery, New York.

Maass, W. (1994). Neural nets with superlinear VC-dimension. Neural Computation, 6:875–882.

Macintyre, A. and Sontag, E. (1993). Finiteness results for sigmoidal "neural" networks. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pages 325–334. Association of Computing Machinery, New York.

Mack, Y. (1981). Local properties of k-nearest neighbor regression estimates. SIAM Journal on Algebraic and Discrete Methods, 2:311–323.

Mahalanobis, P. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2:49–55.

Mahalanobis, P. (1961). A method of fractile graphical analysis. Sankhya Series A, 23:41–64.

Marron, J. (1983). Optimal rates of convergence to Bayes risk in nonparametric discrimination. Annals of Statistics, 11:1142–1155.

Massart, P. (1983). Vitesse de convergence dans le theoreme de la limite centrale pour le processus empirique. PhD Thesis, Universite Paris-Sud, Orsay, France.

Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18:1269–1283.

Mathai, A. and Rathie, P. (1975). Basic Concepts in Information Theory and Statistics. Wiley Eastern Ltd., New Delhi.

Matloff, N. and Pruitt, R. (1984). The asymptotic distribution of an estimator of the Bayes error rate. Pattern Recognition Letters, 2:271–274.

Matula, D. and Sokal, R. (1980). Properties of Gabriel graphs relevant to geographic variation research and the clustering of points in the plane. Geographical Analysis, 12:205–222.

Matushita, K. (1956). Decision rule, based on distance, for the classification problem. Annals of the Institute of Statistical Mathematics, 8:67–77.

Matushita, K. (1973). Discrimination and the affinity of distributions. In Discriminant Analysis and Applications, Cacoullos, T., editor, pages 213–223. Academic Press, New York.


Max, J. (1960). Quantizing for minimum distortion. IEEE Transactions on Information Theory, 6:7–12.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press, Cambridge.

McLachlan, G. (1976). The bias of the apparent error rate in discriminant analysis. Biometrika, 63:239–244.

McLachlan, G. (1992). Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York.

Meisel, W. (1969). Potential functions in mathematical pattern recognition. IEEE Transactions on Computers, 18:911–918.

Meisel, W. (1972). Computer Oriented Approaches to Pattern Recognition. Academic Press, New York.

Meisel, W. (1990). Parsimony in neural networks. In International Conference on Neural Networks, pages 443–446. Lawrence Erlbaum Associates, Hillsdale, NJ.

Meisel, W. and Michalopoulos, D. (1973). A partitioning algorithm with application in pattern classification and the optimization of decision tree. IEEE Transactions on Computers, 22:93–103.

Michel-Briand, C. and Milhaud, X. (1994). Asymptotic behavior of the aid method. Technical Report, Universite Montpellier 2, Montpellier.

Mielniczuk, J. and Tyrcha, J. (1993). Consistency of multilayer perceptron regression estimators. Neural Networks. To appear.

Minsky, M. (1961). Steps towards artificial intelligence. In Proceedings of the IRE, volume 49, pages 8–30.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.

Mitchell, A. and Krzanowski, W. (1985). The Mahalanobis distance and elliptic distributions. Biometrika, 72:464–467.

Mizoguchi, R., Kizawa, M., and Shimura, M. (1977). Piecewise linear discriminant functions in pattern recognition. Systems-Computers-Controls, 8:114–121.

Moody, J. and Darken, J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–294.

Moore, D., Whitsitt, S., and Landgrebe, D. (1976). Variance comparisons for unbiased estimators of probability of correct classification. IEEE Transactions on Information Theory, 22:102–105.

Morgan, J. and Sonquist, J. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58:415–434.

Mui, J. and Fu, K. (1980). Automated classification of nucleated blood cells using a binary tree classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2:429–443.

Myles, J. and Hand, D. (1990). The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition, 23:1291–1297.

Nadaraya, E. (1964). On estimating regression. Theory of Probability and its Applications, 9:141–142.


Nadaraya, E. (1970). Remarks on nonparametric estimates for density functions and regression curves. Theory of Probability and its Applications, 15:134–137.

Narendra, P. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 26:917–922.

Natarajan, B. (1991). Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA.

Nevelson, M. and Khasminskii, R. (1973). Stochastic Approximation and Recursive Estimation. Translations of Mathematical Monographs, Vol. 47. American Mathematical Society, Providence, RI.

Niemann, H. and Goppert, R. (1988). An efficient branch-and-bound nearest neighbour classifier. Pattern Recognition Letters, 7:67–72.

Nilsson, N. (1965). Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, New York.

Nishizeki, T. and Chiba, N. (1988). Planar Graphs: Theory and Algorithms. North Holland, Amsterdam.

Nobel, A. (1994). Histogram regression estimates using data-dependent partitions. Technical Report, Beckman Institute, University of Illinois, Urbana-Champaign.

Nobel, A. (1992). On uniform laws of averages. PhD Thesis, Department of Statistics, Stanford University, Stanford, CA.

Nolan, D. and Pollard, D. (1987). U-processes: Rates of convergence. Annals of Statistics, 15:780–799.

Okamoto, M. (1958). Some inequalities relating to the partial sum of binomial probabilities. Annals of the Institute of Statistical Mathematics, 10:29–35.

Olshen, R. (1977). Comments on a paper by C.J. Stone. Annals of Statistics, 5:632–633.

Ott, J. and Kronmal, R. (1976). Some classification procedures for multivariate binary data using orthogonal functions. Journal of the American Statistical Association, 71:391–399.

Papadimitriou, C. and Bentley, J. (1980). A worst-case analysis of nearest neighbor searching by projection. In Automata, Languages and Programming 1980, pages 470–482. Lecture Notes in Computer Science #85, Springer-Verlag, Berlin.

Park, J. and Sandberg, I. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257.

Park, J. and Sandberg, I. (1993). Approximation and radial-basis-function networks. Neural Computation, 5:305–316.

Park, Y. and Sklansky, J. (1990). Automated design of linear tree classifiers. Pattern Recognition, 23:1393–1412.

Parthasarathy, K. and Bhattacharya, P. (1961). Some limit theorems in regression theory. Sankhya Series A, 23:91–102.

Parzen, E. (1962). On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33:1065–1076.


Patrick, E. (1966). Distribution-free minimum conditional risk learning systems. Technical Report TR-EE-66-18, Purdue University, Lafayette, IN.

Patrick, E. and Fischer, F. (1970). A generalized k-nearest neighbor rule. Information and Control, 16:128–152.

Patrick, E. and Fisher, F. (1967). Introduction to the performance of distribution-free conditional risk learning systems. Technical Report TR-EE-67-12, Purdue University, Lafayette, IN.

Pawlak, M. (1988). On the asymptotic properties of smoothed estimators of the classification error rate. Pattern Recognition, 21:515–524.

Payne, H. and Meisel, W. (1977). An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, 26:905–916.

Pearl, J. (1979). Capacity and error estimates for boolean classifiers with limited complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:350–355.

Penrod, C. and Wagner, T. (1977). Another look at the edited nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 7:92–94.

Peterson, D. (1970). Some convergence properties of a nearest neighbor decision rule. IEEE Transactions on Information Theory, 16:26–31.

Petrov, V. (1975). Sums of Independent Random Variables. Springer-Verlag, Berlin.

Pippenger, N. (1977). Information theory and the complexity of boolean functions. Mathematical Systems Theory, 10:124–162.

Poggio, T. and Girosi, F. (1990). A theory of networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497.

Pollard, D. (1981). Strong consistency of k-means clustering. Annals of Statistics, 9:135–140.

Pollard, D. (1982). Quantization and the method of k-means. IEEE Transactions on Information Theory, 28:199–205.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Pollard, D. (1986). Rates of uniform almost sure convergence for empirical processes indexed by unbounded classes of functions. Manuscript.

Pollard, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.

Powell, M. (1987). Radial basis functions for multivariable interpolation: a review. Algorithms for Approximation. Clarendon Press, Oxford.

Preparata, F. and Shamos, M. (1985). Computational Geometry—An Introduction. Springer-Verlag, New York.

Psaltis, D., Snapp, R., and Venkatesh, S. (1994). On the finite sample performance of the nearest neighbor classifier. IEEE Transactions on Information Theory, 40:820–837.

Qing-Yun, S. and Fu, K. (1983). A method for the design of binary tree classifiers. Pattern Recognition, 16:593–603.


Quesenberry, C. and Gessaman, M. (1968). Nonparametric discrimination using tolerance regions. Annals of Mathematical Statistics, 39:664–673.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.

Rabiner, L., Levinson, S., Rosenberg, A., and Wilson, J. (1979). Speaker-independent recognition of isolated words using clustering techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27:339–349.

Rao, R. (1962). Relations between weak and uniform convergence of measures with applications. Annals of Mathematical Statistics, 33:659–680.

Raudys, S. (1972). On the amount of a priori information in designing the classification algorithm. Technical Cybernetics, 4:168–174.

Raudys, S. (1976). On dimensionality, learning sample size and complexity of classification algorithms. In Proceedings of the 3rd International Conference on Pattern Recognition, pages 166–169. IEEE Computer Society, Long Beach, CA.

Raudys, S. and Pikelis, V. (1980). On dimensionality, sample size, classification error, and complexity of classification algorithm in pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2:242–252.

Raudys, S. and Pikelis, V. (1982). Collective selection of the best version of a pattern recognition system. Pattern Recognition Letters, 1:7–13.

Rejto, L. and Revesz, P. (1973). Density estimation and pattern classification. Problems of Control and Information Theory, 2:67–80.

Renyi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium, pages 547–561. University of California Press, Berkeley.

Revesz, P. (1973). Robbins-Monroe procedures in a Hilbert space and its application in the theory of learning processes. Studia Scientiarium Mathematicarum Hungarica, 8:391–398.

Ripley, B. (1993). Statistical aspects of neural networks. In Networks and Chaos—Statistical and Probabilistic Aspects, Barndorff-Nielsen, O., Jensen, J., and Kendall, W., editors, pages 40–123. Chapman and Hall, London, U.K.

Ripley, B. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society, 56:409–456.

Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11:416–431.

Ritter, G., Woodruff, H., Lowry, S., and Isenhour, T. (1975). An algorithm for a selective nearest neighbor decision rule. IEEE Transactions on Information Theory, 21:665–669.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.

Rogers, W. and Wagner, T. (1978). A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6:506–514.


Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832–837.

Rounds, E. (1980). A combined nonparametric approach to feature selection and binary decision tree design. Pattern Recognition, 12:313–317.

Royall, R. (1966). A Class of Nonparametric Estimators of a Smooth Regression Function. PhD Thesis, Stanford University, Stanford, CA.

Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing Vol. I, Rumelhart, D., J. L. McClelland, and the PDP Research Group, editors. MIT Press, Cambridge, MA. Reprinted in: J.A. Anderson and E. Rosenfeld, Neurocomputing—Foundations of Research, MIT Press, Cambridge, pp. 673–695, 1988.

Ruppert, D. (1991). Stochastic approximation. In Handbook of Sequential Analysis, Ghosh, B. and Sen, P., editors, pages 503–529. Marcel Dekker, New York.

Samet, H. (1984). The quadtree and related hierarchical data structures. Computing Surveys, 16:187–260.

Samet, H. (1990a). Applications of Spatial Data Structures. Addison-Wesley, Reading, MA.

Samet, H. (1990b). The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA.

Sansone, G. (1969). Orthogonal Functions. Interscience, New York.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory Series A, 13:145–147.

Scheffe, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics, 18:434–458.

Schlaffli, L. (1950). Gesammelte Mathematische Abhandlungen. Birkhauser-Verlag, Basel.

Schmidt, W. (1994). Neural Pattern Classifying Systems. PhD Thesis, Technical University, Delft, The Netherlands.

Schoenberg, I. (1950). On Polya frequency functions, II: variation-diminishing integral operators of the convolution type. Acta Scientiarium Mathematicarum Szeged, 12:97–106.

Schwartz, S. (1967). Estimation of probability density by an orthogonal series. Annals of Mathematical Statistics, 38:1261–1265.

Schwemer, G. and Dunn, O. (1980). Posterior probability estimators in classification simulations. Communications in Statistics—Simulation, B9:133–140.

Sebestyen, G. (1962). Decision-Making Processes in Pattern Recognition. Macmillan, New York.

Serfling, R. (1974). Probability inequalities for the sum in sampling without replacement. Annals of Statistics, 2:39–48.

Sethi, I. (1981). A fast algorithm for recognizing nearest neighbors. IEEE Transactions on Systems, Man and Cybernetics, 11:245–248.


Sethi, I. (1990). Entropy nets: from decision trees to neural nets. Proceedings of the IEEE, 78:1605–1613.

Sethi, I. (1991). Decision tree performance enhancement using an artificial neural network interpretation. In Artificial Neural Networks and Statistical Pattern Recognition, Old and New Connections, Sethi, I. and Jain, A., editors, pages 71–88. Elsevier Science Publishers, Amsterdam.

Sethi, I. and Chatterjee, B. (1977). Efficient decision tree design for discrete variable pattern recognition problems. Pattern Recognition, 9:197–206.

Sethi, I. and Sarvarayudu, G. (1982). Hierarchical classifier design using mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4:441–445.

Shannon, C. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27:379–423.

Shawe-Taylor, J. (1994). Sample sizes for sigmoidal neural networks. Technical Report, Department of Computer Science, Royal Holloway, University of London, Egham, England.

Shiryayev, A. (1984). Probability. Springer-Verlag, New York.

Short, R. and Fukunaga, K. (1981). The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27:622–627.

Simon, H. (1991). The Vapnik-Chervonenkis dimension of decision trees with bounded rank. Information Processing Letters, 39:137–141.

Simon, H. (1993). General lower bounds on the number of examples needed for learning probabilistic concepts. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 402–412. Association for Computing Machinery, New York.

Sklansky, J. and Michelotti (1980). Locally trained piecewise linear classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2:101–111.

Sklansky, J. and Wassel, G. (1979). Pattern Classifiers and Trainable Machines. Springer-Verlag, New York.

Slud, E. (1977). Distribution inequalities for the binomial law. Annals of Probability, 5:404–412.

Specht, D. (1967). Generation of polynomial discriminant functions for pattern classification. IEEE Transactions on Electronic Computers, 15:308–319.

Specht, D. (1971). Series estimation of a probability density function. Technometrics, 13:409–424.

Specht, D. (1990). Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification. IEEE Transactions on Neural Networks, 1:111–121.

Spencer, J. (1987). Ten Lectures on the Probabilistic Method. SIAM, Philadelphia, PA.

Sprecher, D. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115:340–355.


Steele, J. (1975). Combinatorial entropy and uniform limit laws. PhD Thesis, Stanford University, Stanford, CA.

Steele, J. (1986). An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14:753–758.

Stengle, G. and Yukich, J. (1989). Some new Vapnik-Chervonenkis classes. Annals of Statistics, 17:1441–1446.

Stoffel, J. (1974). A classifier design technique for discrete variable pattern recognition problems. IEEE Transactions on Computers, 23:428–441.

Stoller, D. (1954). Univariate two-population distribution-free discrimination. Journal of the American Statistical Association, 49:770–777.

Stone, C. (1977). Consistent nonparametric regression. Annals of Statistics, 8:1348–1360.

Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040–1053.

Stone, C. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13:689–705.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111–147.

Stute, W. (1984). Asymptotic normality of nearest neighbor regression function estimates. Annals of Statistics, 12:917–926.

Sung, K. and Shirley, P. (1992). Ray tracing with the BSP tree. In Graphics Gems III, Kirk, D., editor, pages 271–274. Academic Press, Boston.

Swonger, C. (1972). Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition. In Frontiers of Pattern Recognition, Watanabe, S., editor, pages 511–519. Academic Press, New York.

Szarek, S. (1976). On the best constants in the Khintchine inequality. Studia Mathematica, 63:197–208.

Szego, G. (1959). Orthogonal Polynomials, volume 32. American Mathematical Society, Providence, RI.

Talagrand, M. (1987). The Glivenko-Cantelli problem. Annals of Probability, 15:837–870.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76.

Talmon, J. (1986). A multiclass nonparametric partitioning algorithm. In Pattern Recognition in Practice II, Gelsema, E. and Kanal, L., editors. Elsevier Science Publishers, Amsterdam.

Taneja, I. (1983). On characterization of J-divergence and its generalizations. Journal of Combinatorics, Information and System Sciences, 8:206–212.

Taneja, I. (1987). Statistical aspects of divergence measures. Journal of Statistical Planning and Inference, 16:137–145.

Tarter, M. and Kronmal, R. (1970). On multivariate density estimates based on orthogonal expansions. Annals of Mathematical Statistics, 41:718–722.


Tomek, I. (1976a). A generalization of the k-NN rule. IEEE Transactions on Systems, Man and Cybernetics, 6:121–126.

Tomek, I. (1976b). Two modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics, 6:769–772.

Toussaint, G. (1971). Note on optimal selection of independent binary-valued features for pattern recognition. IEEE Transactions on Information Theory, 17:618.

Toussaint, G. (1974a). Bibliography on estimation of misclassification. IEEE Transactions on Information Theory, 20:472–479.

Toussaint, G. (1974b). On the divergence between two distributions and the probability of misclassification of several decision rules. In Proceedings of the Second International Joint Conference on Pattern Recognition, pages 27–34. Copenhagen.

Toussaint, G. and Donaldson, R. (1970). Algorithms for recognizing contour-traced handprinted characters. IEEE Transactions on Computers, 19:541–546.

Tsypkin, Y. (1971). Adaptation and Learning in Automatic Systems. Academic Press, New York.

Tutz, G. (1985). Smoothed additive estimators for non-error rates in multiple discriminant analysis. Pattern Recognition, 18:151–159.

Tutz, G. (1986). An alternative choice of smoothing for kernel-based density estimates in discrete discriminant analysis. Biometrika, 73:405–411.

Tutz, G. (1988). Smoothing for discrete kernels in discrimination. Biometrics Journal, 6:729–739.

Tutz, G. (1989). On cross-validation for discrete kernel estimates in discrimination. Communications in Statistics—Theory and Methods, 18:4145–4162.

Ullmann, J. (1974). Automatic selection of reference data for use in a nearest-neighbor method of pattern classification. IEEE Transactions on Information Theory, 20:541–543.

Vajda, I. (1968). The estimation of minimal error probability for testing finite or countable number of hypotheses. Problemy Peredaci Informacii, 4:6–14.

Vajda, I. (1989). Theory of Statistical Inference and Information. Kluwer Academic Publishers, Dordrecht.

Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27:1134–1142.

Van Campenhout, J. (1980). The arbitrary relation between probability of error and measurement subset. Journal of the American Statistical Association, 75:104–109.

Van Ryzin, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhya Series A, 28:161–170.

Vapnik, V. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280.


Vapnik, V. and Chervonenkis, A. (1974a). Ordered risk minimization. I. Automation and Remote Control, 35:1226–1235.

Vapnik, V. and Chervonenkis, A. (1974b). Ordered risk minimization. II. Automation and Remote Control, 35:1403–1412.

Vapnik, V. and Chervonenkis, A. (1974c). Theory of Pattern Recognition. Nauka, Moscow. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.

Vapnik, V. and Chervonenkis, A. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26:821–832.

Vidal, E. (1986). An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Letters, 4:145–157.

Vilmansen, T. (1973). Feature evaluation with measures of probabilistic dependence. IEEE Transactions on Computers, 22:381–388.

Vitushkin, A. (1961). The absolute ε-entropy of metric spaces. Translations of the American Mathematical Society, 17:365–367.

Wagner, T. (1971). Convergence of the nearest neighbor rule. IEEE Transactions on Information Theory, 17:566–571.

Wagner, T. (1973). Convergence of the edited nearest neighbor. IEEE Transactions on Information Theory, 19:696–699.

Wang, Q. and Suen, C. (1984). Analysis and design of decision tree based on entropy reduction and its application to large character set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:406–417.

Warner, H., Toronto, A., Veasey, L., and Stephenson, R. (1961). A mathematical approach to medical diagnosis. Journal of the American Medical Association, 177:177–183.

Wassel, G. and Sklansky, J. (1972). Training a one-dimensional classifier to minimize the probability of error. IEEE Transactions on Systems, Man, and Cybernetics, 2:533–541.

Watson, G. (1964). Smooth regression analysis. Sankhya Series A, 26:359–372.

Weiss, S. and Kulikowski, C. (1991). Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA.

Wenocur, R. and Dudley, R. (1981). Some special Vapnik-Chervonenkis classes. Discrete Mathematics, 33:313–318.

Wheeden, R. and Zygmund, A. (1977). Measure and Integral. Marcel Dekker, New York.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.

White, H. (1991). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium of the Interface: Computing Science and Statistics, pages 190–199. American Statistical Association, Alexandria, VA.

Widrow, B. (1959). Adaptive sampled-data systems—a statistical theory of adaptation. In IRE WESCON Convention Record, volume part 4, pages 74–85.


Widrow, B. and Hoff, M. (1960). Adaptive switching circuits. In IRE WESCON Convention Record, volume part 4, pages 96–104. Reprinted in J.A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, MIT Press, Cambridge, 1988.

Wilson, D. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2:408–421.

Winder, R. (1963). Threshold logic in artificial intelligence. In Artificial Intelligence, pages 107–128. IEEE Special Publication S–142.

Wolverton, C. and Wagner, T. (1969a). Asymptotically optimal discriminant functions for pattern classification. IEEE Transactions on Systems, Science and Cybernetics, 15:258–265.

Wolverton, C. and Wagner, T. (1969b). Recursive estimates of probability densities. IEEE Transactions on Systems, Science and Cybernetics, 5:307.

Wong, W. and Shen, X. (1992). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Technical Report 346, Department of Statistics, University of Chicago, Chicago, IL.

Xu, L., Krzyzak, A., and Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Transactions on Neural Networks, 4:636–649.

Xu, L., Krzyzak, A., and Yuille, A. (1994). On radial basis function nets and kernel regression: Approximation ability, convergence rate and receptive field size. Neural Networks. To appear.

Yatracos, Y. (1985). Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Annals of Statistics, 13:768–774.

Yau, S. and Lin, T. (1968). On the upper bound of the probability of error of a linear pattern classifier for probabilistic pattern classes. Proceedings of the IEEE, 56:321–322.

You, K. and Fu, K. (1976). An approach to the design of a linear binary tree classifier. In Proceedings of the Symposium of Machine Processing of Remotely Sensed Data, pages 3A–10. Purdue University, Lafayette, IN.

Yukich, J. (1985). Laws of large numbers for classes of functions. Journal of Multivariate Analysis, 17:245–260.

Yunck, T. (1976). A technique to identify nearest neighbors. IEEE Transactions on Systems, Man, and Cybernetics, 6:678–683.

Zhao, L. (1987). Exponential bounds of mean error for the nearest neighbor estimates of regression functions. Journal of Multivariate Analysis, 21:168–178.

Zhao, L. (1989). Exponential bounds of mean error for the kernel estimates of regression functions. Journal of Multivariate Analysis, 29:260–273.

Zhao, L., Krishnaiah, P., and Chen, X. (1990). Almost sure Lr-norm convergence for data-based histogram estimates. Theory of Probability and its Applications, 35:396–403.

Zygmund, A. (1959). Trigonometric Series I. University Press, Cambridge.


Subject Index

absolute error, 11
accuracy, 201, 202, 206, 246
Adaline, 531
additive model, 536, 538, 546
admissible transformation, 570, 571, 572, 573
agreement, 163
aid criterion, 339
Alexander's inequality, 207, 210, 211, 278, 295, 300
ancestral rule, 180, 181, 185, 420
a posteriori probability, 10, 11, 16, 92, 97, 102, 108, 113, 301, 479, 483, 493, 549, 551, 552, 554, 555, 563, 568, 570, 572
apparent error rate, see resubstitution error estimate
approximation error, 188, 284, 290, 295, 297, 395, 430, 431, 454, 487, 517, 523, 529, 547
arrangement, 511, 512, 513, 514, 515, 516, 517, 542
arrangement classifier, 511, 512, 514, 515, 516, 517
association inequality, 22, 583
asymptotic optimality, 390, 393, 431, 445

back propagation, 525
Bahadur-Lazarsfeld expansion, 472
bandwidth, see smoothing factor
Barron network, 545
basis function, 225, 266, 279, 285, 290, 398, 501, 506
  complete orthonormal, 280, 281, 282, 287
  Haar, 281, 288
  Hermite, 279, 281, 288
  Laguerre, 281, 288
  Legendre, 281, 288
  Rademacher, 282
  Rademacher-Walsh, 470, 472
  trigonometric, 279, 287, 288
  Walsh, 282
Bayes classifier (or rule), 2, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 47, 48, 57, 58, 59, 61, 69, 202, 226, 239, 249, 264, 265, 266, 271, 272, 274, 275, 276, 277, 280, 294, 295, 300, 301, 307, 357, 387, 464, 466, 467, 471, 472, 473, 474, 487, 527, 569, 570
Bayes decision, see Bayes classifier
Bayes error, 2, 11, 12, 13, 14, 15, 17, 18, 19, 21, 23, 28, 29, 30, 31, 35, 36, 39, 57, 79, 82, 90, 91, 111, 112, 113, 128, 129, 176, 188, 236, 237, 241, 276, 285, 289, 295, 297, 298, 307, 308, 313, 357, 389, 401, 451, 533, 545, 561, 562, 563, 565, 567, 568, 569, 570, 571, 572, 573
  estimation of, 4, 128, 566
Bayes problem, 2, 9
Bayes risk, see Bayes error
Bennett's inequality, 124, 130
Beppo-Levy theorem, 577
Bernstein perceptron, 544
Bernstein polynomial, 544
Bernstein's inequality, 124, 130, 210, 494
beta distribution, 77, 88, 266, 276
Bhattacharyya affinity, 23, 24, 343
bias, 121, 125, 126, 144, 151, 395, 555
binary space partition tree, 316, 317, 318, 338, 340, 341, 342, 343, 344, 347, 362, 383, 542
  balanced, 362
  raw, 342, 362
binomial distribution, 51, 71, 72, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 89, 94, 106, 109, 122, 124, 127, 129, 130, 131, 152, 184, 230, 238, 242, 276, 306, 309, 321, 332, 345, 410, 444, 449, 462, 586, 587, 588
binomial theorem, 218
boolean classification, 461, 468
Borel-Cantelli lemma, 142, 193, 227, 284, 293, 357, 367, 381, 425, 452, 505, 585
bracketing metric entropy, 253, 254, 262
branch-and-bound method, 563
bsp tree, see binary space partition tree

cart, 336, 338, 339
Catalan number, 469, 476
Cauchy-Schwarz inequality, 20, 33, 52, 76, 93, 95, 99, 103, 104, 141, 155, 158, 173, 256, 345, 401, 402, 414, 463, 582, 583
central limit theorem, 85, 270
channel, 109
characteristic function, 45
Chebyshev-Cantelli inequality, 42, 237, 463, 582
Chebyshev's inequality, 45, 56, 97, 194, 210, 269, 276, 322, 332, 414, 494, 582
Chernoff's affinity, 24, 35
Chernoff's bounding method, 117, 123, 129, 130, 131, 135, 230, 261
χ2-divergence, 34, 36
class-conditional density, 14, 15, 17, 19, 31, 148, 149, 150, 166, 184, 260, 263, 264, 266, 268, 269, 271, 272, 276, 277, 280, 282, 285
class-conditional distribution, 56, 118, 263, 264, 265, 271, 274, 275, 276
classifier selection, 125, 127, 132, 187, 188, 189, 193, 199, 201, 233, 234, 245, 250, 404, 549, 554, 555, 559
  lower bounds for, 206, 233, 234, 239, 245, 246, 275
class probability, 15, 17, 18, 19, 32, 34, 58, 150
clustering, 343, 363, 377, 378, 379, 380, 384, 392, 506, 542
committee machine, 525, 526, 543
complexity penalty, 291, 292, 297, 300, 301, 395
complexity regularization, 266, 289, 290, 291, 292, 293, 294, 295, 296, 298, 300, 301, 395, 524, 525, 527, 531
concept learning, see learning
confidence, 201, 202, 246
consistency, 92, 93, 104, 106, 109, 128, 132, 184, 188, 232, 268, 296, 389, 390, 468, 513, 555
  definition of, 3, 91
  of Fourier series rule, 282, 284
  of generalized linear rules, 286
  of kernel rules, 160, 161, 163, 171, 439, 441, 442, 446, 448, 474
  of maximum likelihood, 249, 250, 252, 253, 254, 257, 259, 262, 268, 473
  of nearest neighbor rules, 80, 174, 176, 307, 308, 309, 310, 311, 312, 313, 457
  of neural network classifiers, 514, 523, 525, 527, 531, 534, 542, 544, 545
  of partitioning rules, 94, 96, 319, 320, 322, 323, 324, 332, 333, 335, 338, 339, 340, 341, 342, 343, 344, 346, 348, 349, 350, 357, 358, 359, 361, 362, 364, 368, 370, 374, 376, 377, 381, 384, 385, 392, 394, 395
  of skeleton estimates, 483, 487
  of squared error minimization, 490, 493
  strong, see strong consistency
  strong universal, see strong universal consistency
  universal, see universal consistency
  weak, 91, 100, 103, 133, 139, 149, 162, 169, 184, 425, 430, 452
consistent rule, see consistency
convex hull, 229, 232
Cover-Hart inequality, 5, 22, 24, 27, 35, 62, 70, 448
covering,
  empirical, 211, 298, 485
  of a class of classifiers, 211, 212
  of a class of functions, 479, 480, 483, 484, 485, 486, 487, 489, 499, 500
  simple empirical, 289, 296, 297, 298, 300, 301
covering lemma, 67, 153, 157, 158, 164, 165, 171, 176, 181, 182
covering number, 479, 483, 492, 493, 494, 496, 499, 504, 526, 530, 547
cross-validation, 125, 445
curse of dimensionality, 469, 486, 561

decision with rejection, 18
denseness,
  in L1, 286, 287, 517, 520, 542
  in L2, 502, 503
  in Lp, 543, 579
  with respect to the supremum norm, 287, 517, 518, 533, 539
density estimation, 107, 118, 143, 149, 150, 151, 161, 162, 169, 184, 208, 266, 278, 279, 364, 433, 445, 447, 484, 527
density of a class of sets, 231
diamond, 231
dimension, 17, 49, 52, 222, 363, 482, 567
  reduction of, 561, 567
distance,
  data-based, 169, 391, 451, 456, 459
  empirical, 178, 182
  Euclidean, 63, 101, 149, 169, 368, 370, 378, 456
  generalized, 182
  Lp, 182
  rotation-invariant, 179, 180
  tie breaking, 4, 61, 69, 101, 170, 174, 175, 176, 177, 178, 180, 181, 182, 304, 305, 377, 413, 414, 451, 456, 458, 553
distribution,
  beta, 77, 88, 266, 276
  binomial, 51, 71, 72, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86, 89, 94, 106, 109, 122, 124, 127, 129, 130, 131, 152, 184, 230, 238, 242, 276, 306, 309, 321, 332, 345, 410, 444, 449, 462, 586, 587, 588
  class-conditional, 56, 118, 263, 264, 265, 271, 274, 275, 276
  exponential, 361, 590
  gamma, 266, 268, 276, 573, 590
  geometric, 80, 153, 161, 167
  hypergeometric, 589
  Maxwell, 266
  multinomial, 227, 321, 589
  normal, 31, 35, 47, 48, 49, 56, 57, 59, 76, 85, 86, 249, 264, 265, 266, 276, 343, 390, 399, 415, 564, 566, 567, 572, 588, 590
  Rayleigh, 266, 277
  strictly separable, 104, 162
distribution function, 142
  class-conditional, 40, 41, 45
  empirical, 142, 193, 325, 375
divergence, see Kullback-Leibler divergence
dominated convergence theorem, 63, 156, 267, 329, 514, 577
Dvoretzky-Kiefer-Wolfowitz-Massart inequality, 44, 57, 207

elementary cut, 350
embedding, 64
empirical classifier selection, see classifier selection
empirical covering, see covering
empirical error, 43, 49, 54, 56, 125, 187, 188, 191, 200, 310, 335, 351, 362, 482
empirically optimal classifier, 187, 188, 200, 206, 225, 275
empirical measure, 43, 142, 190, 193, 203, 208, 210, 211, 213, 265, 270, 271, 272, 278, 304, 365, 366, 367, 398, 414, 443, 553, 556
empirical risk, see empirical error
empirical risk minimization, 5, 6, 49, 50, 53, 54, 56, 125, 126, 131, 187, 191, 192, 193, 199, 201, 202, 209, 210, 211, 212, 225, 227, 232, 234, 235, 239, 246, 249, 251, 257, 258, 265, 266, 268, 274, 275, 277, 285, 286, 287, 289, 290, 291, 292, 294, 306, 310, 311, 334, 339, 346, 347, 354, 375, 388, 404, 420, 428, 441, 444, 454, 455, 457, 458, 467, 469, 476, 477, 479, 483, 487, 489, 491, 493, 501, 502, 506, 512, 514, 521, 523, 525, 526, 527, 534, 537, 539, 540, 541, 545, 546, 549, 551
empirical squared Euclidean distance error, 377, 378, 380, 384
entropy, 25, 26, 29, 35, 218, 219, 252, 286, 337, 339, 373
ε-covering number, see covering number
ε-entropy, see metric entropy
εm-optimality, 390, 393, 454
ε-packing number, see packing number
error estimation, 4, 121, 124, 125, 127, 132, 343, 397, 408, 409, 549
  0.632 estimator, 557
  bootstrap, 125, 549, 556, 557, 558
  complexity penalized, 291, 297, 396
  cross validated, see leave-one-out estimate
  deleted, see leave-one-out estimate
  double bootstrap, 557
  E0 estimator, 557
  error counting, 121, 123, 126, 131, 549, 550
  holdout, 125, 347, 388, 389, 391, 393, 395, 397, 428, 430, 453, 549, 556
  leave-one-out, 4, 125, 309, 397, 407, 408, 410, 413, 414, 415, 417, 419, 420, 441, 444, 446, 458, 459, 474, 549, 550, 554, 555, 556, 557
    variance of, 415, 419, 552, 556
  posterior probability, 549, 554, 555, 559
  randomized bootstrap, 557
  resubstitution, 125, 304, 343, 347, 382, 397, 398, 399, 400, 402, 405, 407, 415, 419, 439, 458, 474, 476, 549, 550, 557
    bias of, 343, 397, 398, 399, 400, 405, 419, 550
  rotation, 125, 556
  smoothed, 125, 549, 550, 551, 552, 554, 555
  U-method, see leave-one-out estimate
estimation error, 188, 284, 290, 295, 395, 431, 454, 455, 471, 487, 523, 529
Euclidean norm, 99, 370, 567
Euler's theorem, 380
exponential distribution, 361, 590
exponential family, 266, 275, 276

Fano's inequality, 26, 35
Fatou's lemma, 577
f-divergence, 31, 32, 34, 36
feature extraction, 14, 21, 128, 561, 562, 567, 572, 573
F-error, 29, 32, 34, 36, 573
fingering, 191, 192, 277
fingering dimension, 192, 209
Fisher linear discriminant, 47, 49, 58
flexible grid, 367, 368, 369
Fourier series classifier, 279, 280, 282, 285, 471
Fourier series density estimation, 279
Fubini's theorem, 139, 157, 371, 579, 582, 583
fundamental rule, 440, 461, 462, 463, 471, 474
fundamental theorem of mathematical statistics, see Glivenko-Cantelli theorem

Gabriel graph, 90
Gabriel neighbor, 90
gamma distribution, 266, 268, 276, 573, 590
gaussian distribution, see normal distribution
gaussian noise, 57, 285
geometric distribution, 80, 153, 161, 167
Gessaman's rule, 375, 376, 384, 393
ghost sample, 193, 203, 210, 213
Gini index, 338, 339, 340, 343, 361, 362
Glivenko-Cantelli theorem, 192, 193, 321, 374
gradient optimization, 54, 56, 251, 525
Grenander's method of sieves, 527, 540
grid complexity, 357
group method of data handling, 532


Hamming distance, 474, 477ham-sandwich theorem, 339Hart’s rule, 304, 305, 306Hellinger distance, 23, 32, 243, 474Hellinger integral, 32, 36histogram density estimation, 107, 143,

208histogram rule, 94, 95, 96, 98, 106, 109,

133, 138, 139, 143, 144, 145, 147, 152,153, 155, 159, 170, 253, 261, 287, 288,290, 301, 318, 342, 347, 351, 363, 368,370, 378, 384, 394, 399, 400, 401, 404,405, 408, 409, 417, 418, 419, 420, 462,463, 506, 518, 549, 552, 553, 555, 559cubic, 95, 96, 107, 133, 138, 144, 381,

382, 384, 465, 475, 512data-dependent, 317, 365, 368, 369,

373, 382, 385, 391, 403, 404lazy, 143, 152, 408, 409, 414randomized, 475

Hoeffding’s inequality, 51, 52, 75, 122, 124,127, 133, 134, 135, 137, 190, 193, 198,212, 213, 284, 299, 306, 311, 366, 428,440, 445, 448, 483, 487, 490, 495

Holder’s inequality, 108, 582hypergeometric distribution, 589

imperfect training, 89, 109impurity function, 336, 337, 338, 340, 343,

361, 362inequality,

Alexander’s, 207, 210, 211, 278, 295,300

association, 22, 583Bennett’s, 124, 130Bernstein’s, 124, 130, 210, 494Cauchy-Schwarz, 20, 33, 52, 76, 93,

95, 99, 103, 104, 141, 155, 158, 173,256, 345, 401, 402, 414, 463, 582,583

Chebyshev-Cantelli, 42, 237, 463, 582Chebyshev’s, 45, 56, 97, 194, 210, 269,

276, 322, 332, 414, 494, 582Cover-Hart, 5, 22, 24, 27, 35, 62, 70,

448Dvoretzky-Kiefer-Wolfowitz-Massart,

44, 57, 207Fano’s, 26, 35

Hoeffding’s, 51, 52, 75, 122, 124, 127,133, 134, 135, 137, 190, 193, 198,212, 213, 284, 299, 306, 311, 366,428, 440, 445, 448, 483, 487, 490,495

Holder’s, 108, 582Jensen’s, 23, 26, 29, 32, 76, 99, 141,

242, 252, 288, 401, 402, 418, 463,558, 568, 583

Khinchine’s, 410, 588large deviation, 136, 190, 364, 365, 379LeCam’s, 33, 244Markov’s, 76, 117, 123, 256, 309, 379,

582McDiarmid’s, 3, 136, 142, 158, 162,

172, 174, 201, 211, 346, 414, 448,553, 558

Pinsker’s, 37Rogers-Wagner, 4, 411, 413, 417, 418Vapnik-Chervonenkis, 5, 138, 192, 196,

197, 198, 201, 202, 203, 206, 225,226, 229, 233, 271, 273, 275, 278,295, 304, 312, 356, 357, 364, 365,366, 388, 399, 429, 436, 467, 476,492, 494

information divergence, see Kullback-Leib-ler divergence

Jeffreys’ divergence, 27, 28, 29Jensen’s inequality, 23, 26, 29, 32, 76,

99, 141, 242, 252, 288, 401, 402, 418,463, 558, 568, 583

Jessen-Marcinkiewitz-Zygmund theorem,329, 580, 581

Karhunen-Loeve expansion, 568k-d tree, 321, 326, 327, 328, 333, 334,

358, 359, 375chronological, 327, 328, 343, 344, 358permutation-optimized chronological,

344deep, 327, 333, 358well-populated, 327

kernel, 147, 148, 149, 151, 153, 159, 160, 161, 162, 163, 165, 166, 185, 416, 420, 421, 423, 428, 430, 432, 438, 439, 440, 441, 446, 447, 448, 474, 476, 518, 540, 541, 542, 546, 549
    Cauchy, 148, 149, 165
    Deheuvels’, 433
    de la Vallee-Poussin, 446
    devilish, 162
    Epanechnikov, 148, 149, 162
    exponential, 433
    gaussian, 105, 109, 148, 149, 165, 416, 421, 426, 433, 448, 531
    Hermite, 151
    multiparameter, 435, 447
    naïve, 148, 149, 150, 153, 156, 157, 158, 163, 167, 416, 420, 426, 442, 445
    negative valued, 151, 153
    polynomial, 434
    product, 435, 447
    regular, 149, 150, 156, 158, 164, 165, 166, 185, 415, 424, 425, 426, 433, 446
    star-shaped, 432
    uniform, see naïve kernel
    window, see naïve kernel
kernel complexity, 429, 430, 431, 432, 433, 434, 435, 436, 446, 447
kernel density estimation, 149, 150, 151, 161, 162, 208, 433, 447
kernel rule, 98, 105, 109, 147, 148, 149, 150, 151, 153, 159, 160, 161, 162, 163, 166, 167, 171, 185, 285, 290, 390, 408, 409, 415, 416, 417, 423, 428, 429, 430, 439, 440, 442, 444, 445, 446, 447, 462, 474, 476, 477, 531, 532, 540, 547, 552
    automatic, 150, 423, 424, 425, 426
    variable, 447, 448
Khinchine’s inequality, 410, 588
k-local rule, 64, 69, 414
k-means clustering, 380, 381, 384, 542
Kolmogorov-Lorentz network, 535
Kolmogorov-Lorentz representation, 511, 534, 535, 538, 546
Kolmogorov-Smirnov distance, 41, 271, 272, 274
    generalized, 271, 272, 273, 274, 277, 278
Kolmogorov-Smirnov statistic, 142, 207
Kolmogorov variational distance, 22, 36
k-spacing rule, 320, 363, 373, 374, 376, 420
Kullback-Leibler divergence, 27, 34, 36, 252, 258, 261

L1 consistency, 109, 143, 161, 162, 184
L1 convergence, 93, 270
L1 distance, 15, 16, 492
L1 error, 20, 107, 108, 150, 162, 208, 525, 526, 527, 528, 542
L2 consistency, 102, 284, 559
L2 error, 104, 283, 284, 445, 526, 542
large deviation inequality, 136, 190, 364, 365, 379
learning, 2, 201, 202, 211, 233
    algorithm, 201
    supervised, 2
    with a teacher, 2
learning vector quantization, 310
Lebesgue’s density theorem, 580
LeCam’s inequality, 33, 244
likelihood product, 250, 258, 259, 260, 268
linear classifier, 14, 19, 39, 40, 44, 47, 48, 49, 50, 51, 53, 54, 55, 57, 58, 59, 191, 224, 225, 260, 311, 342, 343, 390, 399, 466, 467, 468, 476, 490, 501, 507, 510, 540, 567
    generalized, 39, 160, 224, 225, 266, 279, 285, 287, 289, 290, 398, 399, 501, 505, 506, 532, 549
linear discriminant, see linear classifier
linear independence, 222
linear ordering by inclusion, 231
Lloyd-Max algorithm, 380
local average estimator, 98
logistic discrimination, 259
log-linear model, 472
Lp error, 525, 526, 527
lying teacher, see imperfect training

Mahalanobis distance, 30, 36, 47, 48, 57, 566, 572
majority vote, 61, 84, 86, 94, 96, 98, 101, 106, 109, 147, 152, 166, 170, 176, 177, 178, 183, 317, 319, 322, 327, 328, 333, 334, 339, 341, 342, 343, 346, 347, 349, 356, 361, 363, 368, 376, 377, 383, 384, 392, 420, 435, 463, 474, 475, 512, 513, 514, 516, 525

Markov’s inequality, 76, 117, 123, 256, 309, 379, 582
martingale, 134, 135, 142
martingale difference sequence, 133, 134, 135, 136
Matushita error, 23, 29, 34, 36, 109
maximum likelihood, 249, 250, 251, 252, 253, 254, 247, 259, 260, 261, 262, 265, 266, 268, 269, 270, 276, 277, 444, 467, 472, 473, 475, 527, 540, 551
    distribution format, 250, 260
    regression format, 249, 250, 260
    set format, 249
Maxwell distribution, 266
McDiarmid’s inequality, 3, 136, 142, 158, 162, 172, 174, 201, 211, 346, 414, 448, 553, 558
median tree, 322, 323, 324, 325, 334, 358, 362, 375
    theoretical, 323, 324, 358
method of bounded differences, 133, 135
metric, see distance
metric entropy, 480, 481, 484, 486, 492, 526
minimax lower bound, 234, 235, 236
minimum description length principle, 292, 295
minimum distance estimation, 265, 266, 270, 271, 272, 274, 277, 278
moment generating function, 56, 589
monomial, 468, 469, 476, 531, 533, 534
monotone layer, 183, 226, 227, 229, 232, 487
moving window rule, 147, 148, 150, 552, 553, 555, 559
multinomial discrimination, 461
multinomial distribution, 227, 321, 589
multivariate normal distribution, see normal distribution

natural classifier, 317, 318, 320, 323, 327, 342, 347, 358, 513, 514, 542
nearest neighbor, 62, 63, 69, 80, 84, 85, 89, 90, 101, 166, 169, 170, 171, 175, 177, 179, 180, 181, 182, 184, 185, 307, 310, 392, 413, 414, 441, 448, 453, 456, 553
nearest neighbor clustering, 377, 379
nearest neighbor density estimation, 184
nearest neighbor error, 22, 23, 29, 31, 57, 63, 69, 132, 307, 308, 361, 551, 559
nearest neighbor rule,
    1-nn, 5, 70, 80, 84, 87, 88, 106, 107, 152, 169, 180, 181, 185, 285, 303, 304, 305, 306, 307, 308, 309, 310, 311, 313, 359, 360, 387, 390, 391, 395, 398, 405, 407, 414, 454, 458, 475, 550
    3-nn, 14, 80, 84, 86, 405, 565, 566
    admissibility of the 1-nn rule, 5, 81, 87, 174
    asymptotic error probability of the 1-nn rule, see nearest neighbor error
    asymptotic error probability of the k-nn rule, 70, 71, 73, 75, 79, 82, 169
    automatic, 169, 387, 451
    based on reference data, see prototype nn rule
    condensed, 303, 304, 307, 309, 455
    data-based, see automatic nearest neighbor rule
    edited, 62, 307, 309, 313, 455
    k-nn, 3, 4, 5, 6, 61, 62, 64, 69, 70, 73, 74, 80, 81, 82, 83, 85, 86, 87, 98, 100, 101, 169, 170, 175, 176, 177, 179, 180, 181, 182, 185, 290, 309, 313, 343, 389, 397, 408, 413, 414, 415, 420, 421, 451, 452, 453, 455, 456, 457, 458, 459, 549, 552, 553, 555
    (k, l)-nn, 81, 82, 88
    layered, 183
    prototype, 307, 309, 310, 311, 377, 455
    recursive, 177, 179, 182
    relabeling, 169, 180, 181, 185, 309, 420
    selective, 455
    variable metric, 455
    weighted, 61, 71, 73, 74, 85, 87, 88, 90, 179, 183, 451, 453
neural network, 444, 507, 510, 514, 520, 527, 528, 530, 531
    classifier, 5, 39, 290, 516, 517, 543, 549
    multilayer, 509, 523, 544, 545
    with one hidden layer, 507, 508, 509, 510, 517, 518, 519, 520, 521, 525, 538, 542, 543, 544
    with two hidden layers, 509, 510, 514, 515, 516, 517, 522, 537, 542, 546

Neyman-Pearson lemma, 18
normal distribution, 31, 35, 47, 48, 49, 56, 57, 59, 76, 85, 86, 249, 264, 265, 266, 276, 343, 390, 399, 415, 564, 566, 567, 572, 588, 590

order statistics, 88, 178, 257, 320, 328, 358, 363, 372, 382
outer layer, see monotone layer
overfitting, 188, 286, 291, 297, 348, 395, 454, 527, 532, 544

packing number, 481, 496
Padaline, 531, 532, 544
parameter estimation, 250, 268, 270, 277, 399
parametric classification, 24, 250, 263, 270, 275, 390, 415, 472
Parseval’s identity, 283
partition, 94, 106, 109, 140, 143, 144, 147, 152, 287, 301, 307, 310, 312, 315, 317, 318, 319, 320, 321, 327, 328, 333, 343, 346, 347, 349, 350, 351, 352, 353, 354, 356, 357, 360, 363, 364, 365, 366, 368, 372, 373, 374, 378, 381, 382, 383, 392, 394, 395, 399, 400, 402, 403, 404, 417, 418, 419, 420, 454, 477, 506, 511, 513, 553
    complexity of a family of, 365, 367, 369, 403, 404
    cubic, 95, 96, 107, 133, 138, 143, 382, 475
    data-dependent, 94, 363, 364, 370, 404, 420
    ε-effective cardinality of, 144
    family of, 365, 367, 368, 369, 403, 404, 405
    µ-approximating, 144
partitioning rule, see histogram rule
perceptron, 6, 14, 39, 40, 44, 507, 509, 510, 514, 515, 516, 532, 544
perceptron criterion, 59
pigeonhole principle, 238, 476
Pinsker’s inequality, 37
planar graph, 312, 380
plug-in decision, 15, 16, 17, 19, 20, 24, 93, 265, 266, 267, 270, 271, 276, 277, 467, 468, 471, 472, 475
polynomial discriminant function, 160, 279, 287, 531
polynomial network, 532, 533, 534
potential function rule, 159, 160, 279, 540
principal component analysis, 455
probabilistic method, 112, 234, 496
projection pursuit, 538, 539, 546
pseudo dimension, 498, 506
pseudo probability of error, 82

quadratic discrimination rule, 225, 265, 276, 343, 390, 468, 476
quadtree, 321, 333, 334, 348, 361
    chronological, 333, 361
    deep, 333, 361
quantization, 380, 461, 463, 567

radial basis function, 539, 540, 541, 542, 547
Radon-Nikodym derivative, 272, 313
Radon-Nikodym theorem, 578, 586
rate of convergence, 4, 102, 104, 108, 111, 113, 114, 118, 128, 239, 262, 264, 267, 269, 275, 276, 284, 294, 295, 296, 297, 300, 357, 390, 483, 485, 486, 494, 527
Rayleigh distribution, 266, 277
record, 330, 331, 358
recursive kernel rule, 160, 161, 166
regression function, 9, 11, 93, 254, 261, 445, 454, 474, 475, 479, 483, 484, 487, 489
    estimation, 92, 93, 102, 103, 104, 108, 114, 149, 169, 172, 185, 405, 445, 527
relative stability, 162
Rogers-Wagner inequality, 4, 411, 413, 417, 418
rotation invariance, see transformation invariance
Royall’s rule, 185


sample complexity, 201, 202, 205, 206, 211, 234, 245
sample scatter, 46, 47
sampling without replacement, 213
scale invariance, see transformation invariance
scaling, 144, 166, 177, 179, 381, 385, 387, 391
scatter matrix, 46
Scheffe’s theorem, 32, 36, 208, 211, 278, 383
search tree, 62, 326
    balanced, 322
shatter coefficient, 196, 197, 198, 200, 212, 215, 216, 219, 220, 222, 223, 226, 271, 273, 312, 347, 365, 389, 390, 404, 429, 453, 492, 493, 496, 497, 512, 522, 541
shattering, 196, 198, 199, 217, 218, 219, 220, 221, 222, 223, 226, 229, 235, 239, 277, 436, 437, 438, 476, 477, 498, 499, 521, 541
sieve method, 444
sigmoid, 56, 59, 507, 510, 515, 517, 519, 520, 521, 524, 526, 528, 542, 543, 544, 545
    arctan, 508, 524
    gaussian, 508, 524
    logistic, see standard sigmoid
    piecewise polynomial, 524
    standard, 508, 524, 525
    threshold, 507, 508, 516, 517, 519, 521, 522, 523, 524, 525, 526, 537, 543
signed measure, 272, 273
skeleton estimate, 292, 480, 482, 483, 484, 486, 487, 489, 493
smart rule, 106, 109
smoothing factor, 148, 149, 153, 160, 162, 163, 165, 166, 185, 390, 416, 423, 424, 428, 429, 435, 436, 439, 441, 442, 444, 446, 462, 474, 531
    data-dependent, 423, 424, 425, 426, 474
spacing, 328, 358, 372
splitting criterion, 334, 335, 336, 337, 338, 339, 348
splitting function, 334
splitting the data, 132, 185, 211, 298, 309, 347, 387, 388, 389, 390, 391, 392, 397, 428, 436, 452, 453, 454, 541
squared error, 11, 54, 55, 56, 58, 59, 445, 448, 489, 490, 491, 492, 493, 501, 502, 505, 506, 525, 526, 527, 551, 559
standard empirical measure, see empirical measure
statistically equivalent blocks, 178, 363, 372, 374, 375, 384, 393
Stirling’s formula, 83, 238, 415, 449, 587
stochastic approximation, 161, 505
stochastic process,
    expansion of, 573
    sampling of, 573
Stoller split, 40, 334, 335, 336, 338, 350, 361, 375
    empirical, 43, 339, 343
    theoretical, 40, 41, 42, 43, 56
Stoller’s rule, 43, 44
Stone’s lemma, 65, 66, 69, 83
Stone’s theorem, 97, 100, 101, 107, 181, 182, 183, 185, 456
Stone-Weierstrass theorem, 281, 287, 519, 538, 543, 580
strictly separable distribution, 104, 162
strong consistency, 92, 93, 128
    definition of, 3, 91
    of Fourier series rule, 282, 284
    of kernel rules, 149, 161, 162, 431
    of nearest neighbor rules, 100, 169, 174, 451
    of neural network classifiers, 528
    of partitioning rules, 143, 155, 341, 347, 368, 369, 370, 376, 378, 383
strong universal consistency, 395
    of complexity regularization, 290, 292, 298
    definition of, 92
    of generalized linear rules, 287, 502, 505
    of kernel rules, 149, 150, 166, 424, 426, 429, 430, 435
    of nearest neighbor rules, 170, 176, 453, 458, 459
    of neural network classifiers, 523, 528, 545
    of partitioning rules, 96, 133, 138, 142, 143, 144, 170, 288, 369, 376, 381, 382, 384, 462
structural risk minimization, see complexity regularization


sufficient statistics, 268, 270, 571, 572, 573, 574
superclassifier, 5, 118
support, 63, 80, 84, 579
symmetric rule, 109, 411, 413, 414, 419
symmetrization, 193, 194, 203, 278

tennis rule, 82, 88
testing sequence, 121, 124, 125, 126, 127, 131, 388, 390, 392, 393, 397, 428, 453, 455, 457
total boundedness, 480, 483, 486, 487, 493, 494
total positivity, 434
total variation, 22, 32, 36, 208
transformation invariance, 6, 62, 84, 177, 178, 179, 182, 183, 285, 318, 319, 320, 328, 334, 337, 348, 373, 375, 385, 512, 531
tree,
    α-balanced, 358
    binary, 315, 316, 364
    bsp, see binary space partition tree
    classifier, 6, 39, 316, 319, 322, 341, 343, 347, 364, 383, 390, 394, 477, 516, 542
    depth of, 315, 333, 358
    expression, 469
    greedy, 348, 349, 350
    height of, 315, 334, 474
    Horton-Strahler number of, 477
    hyperplane, see binary space partition tree
    k-d, see k-d tree
    median, see median tree
    ordered, 316
    ordinary, 316, 317, 318
    perpendicular splitting, 334, 336, 348
    pruning of, see trimming of a tree
    sphere, 316, 317
    trimming of, 327, 336, 338, 339

unbiased estimator, 121, 285
uniform deviations, see uniform laws of large numbers
uniform laws of large numbers, 190, 196, 198, 199, 207, 228, 366, 490, 529
uniform marginal transformation, 319
universal consistency, 97, 98, 106, 109, 113, 118, 226, 285, 387, 510
    of complexity regularization, 293
    definition of, 3, 92
    of generalized linear rules, 286, 501, 502
    of kernel rules, 149, 153, 163, 164, 165, 166, 167, 423, 425, 441, 442
    of nearest neighbor rules, 61, 100, 169, 170, 175, 177, 178, 180, 181, 182, 183, 184, 185, 453, 454, 455
    of neural network classifiers, 512, 524, 526, 528, 532, 539, 540, 547
    of partitioning rules, 95, 96, 107, 133, 253, 344, 346, 364, 382, 393

Vapnik-Chervonenkis dimension, see vc dimension
Vapnik-Chervonenkis inequality, 5, 138, 192, 196, 197, 198, 201, 202, 203, 206, 225, 226, 229, 233, 271, 273, 275, 278, 295, 304, 312, 356, 357, 364, 365, 366, 388, 399, 429, 436, 467, 476, 492, 494
variation, 144
vc class, 197, 210, 251, 290, 297
vc dimension, 5, 192, 196, 197, 200, 201, 202, 206, 210, 215, 216, 218, 219, 220, 221, 223, 224, 226, 229, 230, 231, 232, 233, 235, 236, 238, 239, 240, 243, 245, 247, 257, 272, 273, 275, 277, 278, 286, 289, 290, 291, 292, 294, 295, 297, 298, 300, 301, 346, 347, 398, 436, 439, 444, 447, 467, 471, 476, 477, 486, 496, 521, 522, 523, 524, 526, 530, 533, 534, 537, 538, 541, 542, 544, 545, 546, 547
Voronoi cell, 62, 304, 312, 378
Voronoi partition, 62, 304, 312, 377, 378, 380

weighted average estimator, 98

X-property, 319, 320, 322, 333, 334, 339, 342, 343, 357, 358, 362, 513


Author Index

Abou-Jaoude, S., 143, 144, 364
Abram, G.D., 342
Aitchison, J., 474, 476
Aitken, C.G.G., 474, 476
Aizerman, M.A., 159
Akaike, H., 149, 292, 295
Alexander, K., 207
Anderson, A.C., 340
Anderson, J.A., 259
Anderson, M.W., 375
Anderson, T.W., 49, 322, 374, 375
Angluin, D., 130
Anthony, M., 493, 534
Antos, A., vi
Argentiero, P., 340, 342
Arkadjew, A.G., 149
Ash, R.B., 575
Assouad, P., 231, 232, 243
Azuma, K., 133, 135

Bahadur, R.R., 472
Bailey, T., 73
Barron, A.R., vi, 144, 208, 292, 295, 364, 494, 509, 518, 520, 527, 531, 533
Barron, R.L., vi, 509, 532, 533
Bartlett, P., 522, 544
Bashkirov, O., 159, 540
Baskett, F., 62
Baum, E.B., 521, 522
Beakley, G.W., 375
Beaudet, P., 340, 342
Beck, J., 169
Becker, P., 42, 56
Beirlant, J., vi
Ben-Bassat, M., 562
Benedek, G.M., 211, 297, 301, 485
Bennett, G., 124
Benning, R.D., 375
Bentley, J.L., 62, 333
Beran, R.H., 23
Berlinet, A., vi
Bernstein, S.N., 124
Bhattacharya, P.K., 169, 320
Bhattacharyya, A., 23
Bickel, P.J., 169
Birge, L., 118, 243
Blumer, A., 202, 233, 236, 245
Bosq, D., vi
Brailovsky, V.L., 125
Braverman, E.M., 149, 159, 540
Breiman, L., 169, 185, 318, 336, 337, 339, 342, 347, 360, 364, 376, 383, 447
Brent, R.P., 542


Broder, A.J., 62
Broeckaert, I., 149
Broniatowski, M., vi
Broomhead, D.S., 540
Buescher, K.L., 211, 212, 297, 485
Burbea, J., 28
Burman, P., vi
Burshtein, D., 340
Buzo, A., 380

Cacoullos, T., 149
Cao, R., vi
Carnal, H., 229
Casey, R.G., 340
Cencov, N.N., 279
Chang, C.L., 310
Chang, C.Y., 563
Chatterjee, B., 340
Chen, C., 125, 557, 558
Chen, H., 518
Chen, T., 518
Chen, X.R., 364
Chen, Y.C., 540, 542
Chen, Z., 128
Chernoff, H., 24, 41, 51, 56, 122, 123, 129
Chervonenkis, A.Y., v, 5, 50, 126, 187, 189, 190, 197, 198, 199, 203, 216, 235, 240, 291, 491, 492
Chin, R., 340, 342
Chou, P.A., 339
Chou, W.S., 540, 542
Chow, C.K., 18, 466
Chow, Y.S., 575
Ciampi, A., 339
Collomb, G., 169
Conway, J.H., 475
Coomans, D., 149
Cormen, T.H., 316, 326
Cover, T.M., v, vi, 5, 22, 25, 26, 62, 70, 80, 81, 87, 114, 125, 169, 198, 219, 222, 223, 230, 292, 295, 395, 562, 564, 579
Cramer, H., 45
Csibi, S., v, 16, 506
Csiszar, I., 31, 37, 144, 219
Csuros, M., vi, 301
Cuevas, A., vi
Cybenko, G., 518

Darken, C., 540, 547
Dasarathy, B.V., 62, 309
Das Gupta, S., 84, 375
Day, N.E., 260
de Guzman, M., 580
Deheuvels, P., vi, 433
Della Pietra, V.D., 340
Delp, E.J., 339
Devijver, P.A., vi, 22, 30, 45, 62, 74, 77, 82, 88, 304, 309, 313, 455, 554, 562, 563
Devroye, L., 16, 33, 62, 70, 76, 77, 86, 87, 88, 90, 112, 114, 126, 134, 137, 138, 143, 149, 150, 160, 161, 162, 164, 169, 170, 171, 176, 177, 178, 182, 183, 198, 202, 206, 208, 212, 223, 229, 236, 240, 243, 278, 304, 333, 334, 388, 389, 398, 403, 411, 414, 415, 416, 421, 430, 484, 588
Diaconis, P., 538, 546
Dillon, W.R., 463
Donahue, M., 547
Donaldson, R.W., 556
Do-Tu, H., 56
Dubes, R.C., 125, 339, 557, 558
Duda, R.O., 49, 225, 270, 472, 501, 566, 572
Dudani, S.A., 88
Dudley, R.M., 193, 198, 211, 221, 223, 231, 365, 485, 496, 498
Duin, R.P.W., 540, 542
Dunn, O.J., 554
Durrett, R., 575
Dvoretzky, A., 207, 505

Edelsbrunner, H., 339, 380, 512
Efron, B., 125, 137, 556
Ehrenfeucht, A., 202, 233, 236, 245
Elashoff, J.D., 564
Elashoff, R.M., 564

Fabian, V., 505
Fano, R.M., 26


Farago, A., vi, 62, 231, 523, 525
Farago, T., 567
Feinholz, L., 198, 394
Feller, W., 587
Finkel, R.A., 62, 333
Fisher, F.P., 85, 184, 374
Fisher, R.A., 46
Fitzmaurice, G.M., 554
Fix, E., 61, 149, 169
Flick, T.E., 455, 538
Forney, G.D., 18
Fraiman, R., vi
Friedman, J.H., 62, 318, 336, 337, 339, 342, 347, 360, 364, 375, 376, 383, 535, 538
Fritz, J., v, 56, 58
Fu, K.S., 36, 128, 340, 342, 343, 375, 562
Fuchs, H., 341, 342
Fukunaga, K., 62, 128, 307, 415, 455, 554, 555, 563
Funahashi, K., 518

Gabor, D., 532
Gabriel, K.R., 90
Gaenssler, P., 198, 207
Gallant, A.R., 527
Ganesalingam, S., 554
Garnett, J.M., 128
Gates, G.W., 307, 455
Gelfand, S.B., 339
Geman, S., 527, 540
Gerberich, C.L., 340
Gessaman, M.P., 374, 375, 393
Gessaman, P.H., 374
Geva, S., 310
Gijbels, I., vi
Gine, E., 198
Girosi, F., 540, 547
Glick, N., vi, 16, 96, 125, 225, 463, 474, 550
Goldberg, P., 524
Goldmann, G.E., 564
Goldstein, M., 463
Golea, M., 542
Gonzalez-Manteiga, W., vi
Goodman, R.M., 339
Goppert, R., 62
Gordon, L., 96, 339, 347, 364, 376
Gowda, K.C., 307
Grant, E.D., 342
Gray, R.M., 380
Greblicki, W., 149, 161, 165, 166, 279
Grenander, U., 527, 540
Grimmett, G.R., 575
Guo, H., 339
Gurvits, L., 547
Gyorfi, L., 16, 18, 33, 56, 57, 58, 76, 86, 138, 144, 150, 161, 162, 169, 170, 171, 176, 208, 430, 505, 567, 588
Gyorfi, Z., 18, 76, 169

Haagerup, U., 588
Habbema, J.D.F., 444
Hagerup, T., 130
Hall, P., vi, 474, 538
Hand, D.J., 149, 455, 554, 557, 558
Hardle, W., 445
Hart, P.E., v, 5, 22, 49, 62, 70, 81, 87, 225, 270, 304, 455, 472, 501, 566, 572, 579
Hartigan, J.A., 377
Hartmann, C.R.P., 340
Hashlamoun, W.A., 29, 36
Hastie, T., 356, 537, 538
Haussler, D., 202, 210, 233, 235, 236, 245, 493, 496, 498, 506, 522, 526, 527
Hecht-Nielsen, R., 534
Hellman, M.E., 81
Henrichon, E.G., 340, 342, 343, 375
Herman, C., 538
Hermans, J., 444
Hertz, J., 56, 509
Hills, M., 474
Hinton, G.E., 525
Hjort, N., 270
Hodges, J.L., 61, 149, 169
Hoeffding, W., 122, 133, 135, 589
Hoff, M.E., 54, 531
Holden, S.B., 534
Holmstrom, L., 150, 430
Hora, S.C., 554
Horibe, Y., 24
Horne, B., 59
Hornik, K., 518, 542, 543
Hostetler, L.D., 455
Huber, P.J., 538


Hudimoto, H., 24
Hummels, D.M., 128
Hush, D., 59
Hwang, C.R., 527, 540

Installe, M., 56
Isenhour, T.L., 307, 455
Isogai, E., vi
Itai, A., 211, 297, 485
Ito, T., 24, 472
Ivakhnenko, A.G., 532, 533

Jain, A.K., 73, 125, 557, 558
Jeffreys, H., 27
Jerrum, M., 524
Johnson, D.S., 54
Jones, L.K., 538

Kailath, T., 24
Kanal, L.N., 125, 562
Kanevsky, D., 340
Kaplan, M.R., 342
Karlin, A.S., 434
Karp, R.M., 130
Karpinski, M., 525
Kazmierczak, H., 476
Kearns, M., 233, 236, 245
Kedem, Z.M., 342
Kegl, B., vi
Kemp, R., 469, 476
Kemperman, J.H.B., 37
Kerridge, N.E., 260
Kessel, D.L., 128, 415, 554, 555
Khasminskii, R.Z., 505
Kiefer, J., 207, 505
Kim, B.S., 62
Kittler, J., 22, 30, 45, 77, 88, 304, 309, 313, 455, 554, 562, 563
Kizawa, M., 342
Klemela, J., 150, 430
Knoke, J.D., 550
Kohonen, T., 310
Kolmogorov, A.N., 22, 479, 481, 482, 486, 487, 510, 534, 536, 537
Konovalenko, V.V., 532
Korner, J., 219
Koutsougeras, C., 542
Kraaijveld, M.A., 540, 542
Krasits’kyy, M.S., 532
Krishna, G., 307
Krishnaiah, P.R., 364
Krogh, A., 56, 509
Kronmal, R.A., 279, 473
Krzanowski, W.J., 31, 463
Krzyzak, A., vi, 149, 150, 161, 163, 164, 165, 166, 169, 170, 176, 447, 540, 541, 547
Kulikowski, C.A., 509
Kulkarni, S.R., 211, 485
Kullback, S., 27, 37
Kumar, P.R., 211, 212, 297
Kurkova, V., 537
Kushner, H.J., 505

Lachenbruch, P.A., 415
Laforest, L., 334
Laha, R.G., 45
Landgrebe, D.A., 554
LeCam, L., 23, 33
Leibler, A., 27
Leiserson, C.E., 316, 326
Levinson, S.E., 85
Li, T.L., 562
Li, X., 339
Lin, H.E., 84
Lin, T.T., 45
Lin, Y.K., 343
Linde, Y., 380
Linder, T., vi, 62, 506, 540, 541, 547
Lissack, T., 36
Littlestone, N., 210, 233, 235
Liu, R., 518
Ljung, L., 505
Lloyd, S.P., 380
Loftsgaarden, D.O., 184
Logan, B.F., 539
Loh, W.Y., 342, 343
Loizou, G., 84
Lorentz, G.G., 510, 534
Lowe, D., 540
Lowry, S.L., 307, 455
Lugosi, G., 62, 89, 109, 169, 170, 176, 210, 236, 240, 243, 292, 301, 364, 365, 369, 378, 383, 506, 523, 525, 527, 528, 540, 541, 547, 551, 554, 555, 559

Lukacs, E., 45
Lunts, A.L., 125

Maass, W., 523, 524
Machell, F., 160
Macintyre, A., 524, 525
Mack, Y.P., vi, 169, 185
Mahalanobis, P.C., 30, 320
Mantock, J.M., 307
Marchand, M., 542
Marron, J.S., 114, 445
Massart, P., 44, 198, 207
Mathai, A.M., 22
Matloff, N., 554
Matula, D.W., 90
Matushita, K., 23, 31
Max, J., 380
Maybank, S.J., 84
McDiarmid, C., 3, 133, 134, 136
McLachlan, G.J., 49, 125, 251, 259, 270, 399, 415, 468, 472, 554, 557, 558
Mehrotra, K.G., 340
Meisel, W., 149, 185, 339, 359, 365, 447, 542, 562
Michalopoulos, W.S., 359, 365
Michel-Briand, C., 339
Michelotti, 342
Mickey, P.A., 415
Mielniczuk, J., 527
Milhaud, X., 339
Min, P.J., 562
Minsky, M.L., 466, 509
Mitchell, A.F.S., 31
Mizoguchi, R., 342
Monro, S., 505
Moody, J., 540
Moore, D.S., 554
Morgan, J.N., 339
Muchnik, I.E., 159, 540
Mui, J.K., 343
Myles, J.P., 455

Nadaraya, E.A., 149
Nadas, A., vi, 340
Nagy, G., 340
Narendra, P.M., 62, 563
Natarajan, B.K., 202, 468
Naylor, B., 342
Nevelson, M.B., 505
Niemann, H., 62
Nilsson, N.J., 44, 509, 525
Nobel, A.B., vi, 364, 365, 369, 378, 383, 499
Nolan, D., 498, 499

Oja, E., 540
Okamoto, M., 51, 75, 122, 131
Olshen, R.A., 96, 178, 318, 336, 337, 339, 342, 347, 360, 364, 376, 383
Ott, J., 472

Pali, I., vi
Palmer, R.G., 56, 509
Papachristou, C.A., 542
Papadimitriou, C.H., 62
Papert, S., 509
Park, J., 547
Park, S.B., 62
Park, Y., 342
Parthasarathy, K.R., 320
Parzen, E., 149, 161
Patrick, E.A., 85, 184, 374
Pawlak, M., vi, 149, 161, 165, 166, 279, 551, 554, 555, 559
Payne, H.J., 339
Pearl, J., 470
Penrod, C.S., 309, 313
Peterson, D.W., 87
Petrache, G., 532
Petrov, V.V., 416
Petry, H., vi
Pflug, G., vi, 505
Pikelis, V., 49, 567
Pinsker, M., 37
Pinter, M., vi
Pippinger, N., 470
Poggio, T., 540, 546
Pollard, D., 193, 198, 365, 492, 493, 496, 498, 499, 500, 506
Powell, M.J.D., 540
Preparata, F.P., 54, 177
Priest, R.G., 538

Pruitt, R., 554
Psaltis, D., 87
Purcell, E., 185, 447
Pyatniskii, E.S., 159

Qing-Yun, S., 340, 342
Quesenberry, C.P., 184, 374
Quinlan, J.R., 339

Rabiner, L.R., 85
Rao, C.R., 28
Rao, R.R., 228
Rathie, P.N., 22
Raudys, S., 49, 567
Ravishankar, C.S., 339
Rejto, L., 149
Renyi, A., 28
Revesz, P., v, 149, 161
Richardson, T., 211, 485
Ripley, B.D., 509
Rissanen, J., 292, 295
Ritter, G.L., 307, 455
Rivest, R.L., 316, 326
Robbins, H., 505
Rogers, W.H., 4, 411, 413
Rosenberg, A.E., 85
Rosenblatt, F., 5, 14, 39, 44, 509
Rosenblatt, M., 149, 161
Rounds, E.M., 339
Roussas, G., vi
Royall, R.M., 71, 73, 179, 185, 453
Rozonoer, L.I., 159
Rub, C., 130
Rumelhart, D.E., 525
Ruppert, D., 505

Samarasooriya, V.N.S., 29, 36
Samet, H., 326, 333, 342
Sandberg, I.W., 547
Sansone, G., 281
Sarvarayudu, G.P.R., 339
Sauer, N., 216
Scheffe, H., 32, 208, 211, 383
Schlaffli, L., 395
Schmidt, W.F., 525
Schoenberg, I.J., 434
Schroeder, A., 538
Schwartz, S.C., 279
Schwemer, G.T., 554
Sebestyen, G., 149, 340, 432, 562
Serfling, R.J., 589
Sethi, I.K., 62, 339, 340, 542
Shahshahani, M., 538, 546
Shamos, M.I., 177
Shannon, C.E., 25
Shawe-Taylor, J., 493, 525
Shen, X., 254, 527
Shepp, L., 539
Shimura, M., 342
Shirley, P., 342
Shiryayev, A.N., 575
Short, R.D., 455
Shustek, L.J., 62
Silverman, B.W., 536, 538
Simar, L., vi
Simon, H.U., 198, 240, 477
Sitte, J., 310
Sklansky, J., 56, 342
Sloane, N.J.A., 475
Slud, E.V., 87, 588
Smyth, P., 339
Snapp, R.R., 87
Sokal, R.R., 90
Sonquist, J.A., 339
Sontag, E., 524, 547
Specht, D.F., 160, 279, 531, 532
Spencer, J., 496
Sprecher, D.A., 534
Steele, J.M., 137, 221, 229
Stein, C., 137
Steinbuch, K., 476
Stengle, G., 198
Stephenson, R., 468
Stinchcombe, M., 518
Stirzaker, D.R., 575
Stoffel, J.C., 340
Stoller, D.S., 40, 43
Stone, C.J., v, 3, 6, 65, 70, 97, 98, 100, 101, 114, 169, 175, 179, 181, 183, 318, 336, 337, 339, 342, 347, 360, 364, 376, 383, 538
Stone, M., 125
Stuetzle, W., 538
Stute, W., vi, 169, 207
Suen, C.Y., 339
Sung, K., 342


Swonger, C.W., 307
Szabados, T., vi
Szabo, R., 301
Szarek, S.J., 588
Szego, G., 289

Talagrand, M., 199, 207
Talmon, J.L., 339
Taneja, I.J., 28
Tarter, M.E., 279
Thomas, J.A., 25, 26, 219
Tibshirani, R.J., 536, 537, 538
Tikhomirov, V.M., 479, 481, 482, 486, 487
Tomek, I., 307, 309, 455
Toronto, A.F., 468
Toussaint, G.T., vi, 28, 35, 125, 556, 564, 565
Tsypkin, Y.Z., 505
Tukey, J., 538
Tulupchuk, Y.M., 532
Tuteur, F.B., 375
Tutz, G.E., 441, 474, 477, 550
Tymchenko, I.K., 532
Tyrcha, J., 527

Ullmann, J.R., 307, 455

Vajda, Igor, vi, 22, 32
Vajda, Istvan, 18, 56, 57
Valiant, L.G., 130, 202, 233, 236, 245
Van Campenhout, J.M., 562, 564
van den Broek, K., 444
van der Meulen, E.C., vi, 144, 208
Vanichsetakul, N., 342, 343
Van Ryzin, J., 16, 149
Vapnik, V.N., v, 5, 50, 126, 187, 189, 190, 197, 198, 199, 202, 203, 206, 207, 211, 216, 235, 240, 291, 485, 491, 492, 493, 501
Varshney, P.K., 29, 36, 340
Veasey, L.G., 468
Venkatesh, S.S., 87
Vidal, E., 62
Vilmansen, T.E., 563
Vitushkin, A.G., 482

Wagner, T.J., v, vi, 4, 16, 20, 62, 84, 125, 149, 161, 198, 202, 206, 304, 309, 313, 389, 398, 403, 411, 413, 414, 415, 416, 421, 455
Walk, H., vi, 505
Wand, M.P., 474
Wang, Q.R., 339
Warmuth, M.K., 202, 210, 233, 235, 236, 245
Warner, H.R., 468
Wassel, G.N., 56
Watson, G.S., 149
Weiss, S.M., 509
Wenocur, R.S., 498
Wheeden, R.L., 580
White, H., 518, 527
Whitsitt, S.J., 554
Widrow, B., 54, 531
Wilcox, J.B., 554
Williams, R.J., 525
Wilson, D.L., 309, 313, 455
Wilson, J.G., 85
Winder, R.O., 466
Wise, G.L., 177, 182
Wold, H., 45
Wolfowitz, J., 207, 505
Wolverton, C.T., 16, 20, 161
Wong, W.H., 254, 527
Woodruff, H.B., 307, 455
Wouters, W., v

Xu, L., 540

Yakowitz, S., vi
Yatracos, Y.G., vi, 278
Yau, S.S., 45, 128
You, K.C., 340
Yuille, A.L., 540
Yukich, J.E., 198
Yunck, T.P., 62