Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications...

40
L Large Deviations and Applications F C Professor Université Paris Diderot, Paris, France Large deviations is concerned with the study of rare events and of small probabilities. Let X i , i n, be independent identically distributed (i.i.d.) real random variables with expectation m, and ¯ X n =(X + ... + X n )n their empirical mean. e law of large numbers shows that, for any Borel set A R not containing m in its closure, P( ¯ X n A)→ as n →∞, but does not tell us how fast the probability vanishes. Large deviations theory gives us the rate of decay, which is exponential in n. Cramér’s theorem states that, P( ¯ X n A)= exp -n inf {I (x); x A}+ o() as n →∞, for all interval A. e rate function I can be computed as the Legendre conjugate of the logarithmic moment generating function of X, I (x)= sup{λx - ln E exp(λX )R}, and is called the Cramér transform of the common law of the X i ’s. e natural assumption is the niteness of the 7moment generating function in a neighborhood of the origin, i.e., the property of exponential tails. e func- tion I : R →[, +∞] is convex with I (m)= . In the Gaussian case X i ∼ N(m, σ ), we nd I (x)= (x - m) (σ ). In the Bernoulli case P(X i = )= p = - P(X i = ), we nd the entropy function I (x)=x ln(xp)+( - x) ln(-x)(-p) for x ∈[, ], and I (x) = +∞ otherwise. To emphasize the importance of rare events, let us mention a consequence, the Erdös–Rényi law: consider an innite sequence X i , i , of Bernoulli i.i.d. variables with parameter p, and let R n denote the length of the longest consecutive run, contained within the rst n tosses, in which the fraction of s is at least a (a > p). Erdös and Rényi proved that, almost surely as n →∞, R n ln n I (a) - , with the function I from the Bernoulli case above. ough it may look paradoxical, large deviations are at the core of this event of full probability. is result is the basis of 7bioinformatics applications like sequence matching, and of statistical tests for sequence randomness. e theory does not only apply to independent vari- ables, but allows for many variations, including weakly dependent variables in a general state space, Markov or 7Gaussian processes, large deviations from 7ergodic the- orems, non-asymptotic bounds, asymptotic expansions (Edgeworth expansions), etc. Here is the formal denition. Given a Polish space (i.e., a separable complete metric space) X , let {P n } be a sequence of Borel probability measures on X , let a n be a positive sequence tending to innity, and nally let I : X→[, +∞] be a lower semicontinuous functional on X. We say that the sequence {P n } satises a large deviation principle with speed a n and rate I , if for each measurable set E X - inf xE I (x)≤ lim n a - n ln P n (E) lim n a - n ln P n (E)≤- inf x¯ E I (x) where ¯ E and E denote respectively the closure and interior of E. e rate function can be obtained as I (x)=- lim δ lim n→∞ a - n ln P n (B(x, δ)), with B(x, δ) the ball of center x and radius δ. Sanov’s theorem and sampling with replacement: let μ be a probability measure on a set Σ that we assume nite for simplicity, with μ(y)> for all y Σ. Let Y i , i , an i.i.d. sequence with law μ, and N n the score vector of the n-sample, N n (y)= n i= y (Y i ). By the law of large numbers, N n n μ almost surely. From the 7multinomial distribution, one can check that, for all ν such that is a possible score vector for the n-sample, (n + ) - Σ e -nH(ν μ) P(n - N n = ν)≤ e -nH(ν μ) , where H(ν μ)=∑ yΣ ν(y) ln ν(y) μ(y) is the relative entropy of ν with respect to μ. 
e large deviations theorem holds Miodrag Lovric (ed.), International Encyclopedia of Statistical Science, DOI ./----, © Springer-Verlag Berlin Heidelberg

Transcript of Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications...

Page 1: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

L

Large Deviations andApplications

Francis CometsProfessorUniversiteacute Paris Diderot Paris France

Large deviations is concerned with the study of rare eventsand of small probabilities Let Xi le i le n be independentidentically distributed (iid) real random variables withexpectationm and Xn = (X + +Xn)n their empiricalmeane law of large numbers shows that for any Borelset A sub R not containing m in its closure P(Xn isin A) rarr as n rarr infin but does not tell us how fast the probabilityvanishes Large deviations theory gives us the rate of decaywhich is exponential in n Crameacuterrsquos theorem states that

P(Xn isin A) = exp (minusn( infI(x) x isin A + o()))

as n rarr infin for all interval Ae rate function I can becomputed as the Legendre conjugate of the logarithmicmoment generating function of X

I(x) = supλx minus lnE exp(λX) λ isin R

and is called the Crameacuter transform of the common lawof the Xirsquose natural assumption is the niteness of the7moment generating function in a neighborhood of theorigin ie the property of exponential tails e func-tion I Rrarr [+infin] is convex with I(m) =

In the Gaussian case Xi sim N (m σ ) we nd I(x) =(x minusm)(σ )

In the Bernoulli case P(Xi = ) = p = minus P(Xi = )we nd the entropy function I(x)=x ln(xp) + ( minus x)ln(minusx)(minusp) for x isin [ ] and I(x) = +infinotherwise

To emphasize the importance of rare events let usmention a consequence the ErdoumlsndashReacutenyi law consider aninnite sequence Xi i ge of Bernoulli iid variables withparameter p and let Rn denote the length of the longestconsecutive run contained within the rst n tosses inwhich the fraction of s is at least a (a gt p) Erdoumls andReacutenyi proved that almost surely as nrarrinfin

Rn lnnETHrarr I(a)minus

with the function I from the Bernoulli case aboveoughit may look paradoxical large deviations are at the coreof this event of full probabilityis result is the basis of7bioinformatics applications like sequence matching andof statistical tests for sequence randomness

e theory does not only apply to independent vari-ables but allows for many variations including weaklydependent variables in a general state space Markov or7Gaussian processes large deviations from 7ergodic the-orems non-asymptotic bounds asymptotic expansions(Edgeworth expansions) etcHere is the formal denition Given a Polish space

(ie a separable complete metric space) X let Pn bea sequence of Borel probability measures on X let an bea positive sequence tending to innity and nally let I X rarr [+infin] be a lower semicontinuous functional on XWe say that the sequence Pn satises a large deviationprinciple with speed an and rate I if for each measurableset E sub X

minus infxisinEI(x) le lim

naminusn lnPn(E)

le limnaminusn lnPn(E) le minus inf

xisinEI(x)

where E andE denote respectively the closure and interiorof Ee rate function can be obtained as

I(x) = minus limδ

limnrarrinfin a

minusn lnPn(B(x δ))

with B(x δ) the ball of center x and radius δSanovrsquos theorem and sampling with replacement let micro

be a probability measure on a set Σ that we assume nitefor simplicity with micro(y) gt for all y isin Σ Let Yi i ge aniid sequence with law micro and Nn the score vector of then-sample

Nn(y) =n

sumi=y(Yi)

By the law of large numbersNnnrarr micro almost surely Fromthe 7multinomial distribution one can check that for allν such that nν is a possible score vector for the n-sample

(n + )minus∣Σ∣eminusnH(ν∣micro) le P(nminusNn = ν) le eminusnH(ν∣micro)

where H(ν∣micro) = sumyisinΣ ν(y) ln ν(y)micro(y) is the relative entropy

of ν with respect to microe large deviations theorem holds

Miodrag Lovric (ed) International Encyclopedia of Statistical Science DOI ----copy Springer-Verlag Berlin Heidelberg

L Laws of Large Numbers

for the empirical distribution of a general n-sample withspeed n and rate I(ν) = H(ν∣micro) given by the natu-ral generalization of the above formula is result dueto Sanov has many consequences in information the-ory and statistical mechanics (Dembo and Zeitouni den Hollander ) and for exponential families instatistics Applications in statistics also include point esti-mation (by giving the exponential rate of convergenceof M-estimators) and for hypothesis testing (Bahadureciency) (Kester ) and concentration inequalities(Dembo and Zeitouni )

e FreidlinndashWentzell theory deals with diusion pro-cesses with small noise

dXєt = b (Xє

t )dt +radicє σ (Xє

t )dBt Xє = y

e coecients b σ are uniformly lipshitz functions andB is a standard Brownian motion (see 7Brownian Motionand Diusions)e sequence Xє can be viewed as є as a small random perturbation of the ordinary dierentialequation

dxt = b(xt)dt x = yIndeed Xє rarr x in the supremum norm on bounded time-intervals Freidlin andWentzell have shown that on a nitetime interval [T] the sequenceXє with values in the pathspace obeys the LDP with speed єminus and rate function

I(ϕ) = int

T

σ(ϕ(t))minus( ˙ϕ(t) minus b(ϕ(t)))

dt

if ϕ is absolutely continuous with square-integrable deriva-tive and ϕ() = y I(ϕ) =infin otherwise (To t in the aboveformal denition take a sequence є = єn and for Pnthe law of Xєn )

e FreidlinndashWentzell theory has applications inphysics (metastability phenomena) and engineering (track-ing loops statistical analysis of signals stabilization of sys-tems and algorithms) (Freidlin andWentzell Demboand Zeitouni Olivieri and Vares )

About the AuthorFrancis Comets is Professor of Applied Mathematics atUniversity of Paris - Diderot (Paris ) France He is theHead of the team ldquoStochastic Modelsrdquo in the CNRS lab-oratory ldquoProbabiliteacute et modegraveles aleacuteatoiresrdquo since Heis the Deputy Director of the Foundation Sciences Matheacute-matiques de Paris since its creation in He hascoauthored research papers with a focus on randommedium and one book (ldquoCalcul stochastique et mod-egraveles de diusionsrdquo in French Dunod with ierryMeyre) He has supervised more than PhD thesis and

served as Associate Editor for Stochastic Processes and theirApplications (ndash)

Cross References7Asymptotic Relative Eciency in Testing7Cherno Bound7Entropy and Cross Entropy as Diversity and DistanceMeasures7Laws of Large Numbers7Limiteorems of Probabilityeory7Moderate Deviations7Robust Statistics

References and Further ReadingDembo A Zeitouni O () Large deviations techniques and

applications Springer New Yorkden Hollander F () Large deviations Am Math Soc Provi-

dence RIFreidlin MI Wentzell AD () Random perturbations of dynami-

cal systems Springer New YorkKester A () Some large deviation results in statistics CWI Tract

Centrum voor Wiskunde en Informatica AmsterdamOlivieri E Vares ME () Large deviations and metastability

Cambridge University Press Cambridge

Laws of Large Numbers

Andrew RosalskyProfessorUniversity of Florida Gainesville FL USA

e laws of large numbers (LLNs) provide bounds on theuctuation behavior of sums of random variables and aswe will discuss herein lie at the very foundation of sta-tistical science ey have a history going back over yearse literature on the LLNs is of epic proportions asthis concept is indispensable in probability and statisticaltheory and their applicationProbability theory like some other areas of mathemat-

ics such as geometry for example is a subject arising froman attempt to provide a rigorous mathematical model forreal world phenomena In the case of probability theorythe real world phenomena are chance behavior of biologi-cal processes or physical systems such as gambling gamesand their associated monetary gains or losses

e probability of an event is the abstract counterpartto the notion of the long-run relative frequency of the

Laws of Large Numbers L

L

occurence of the event through innitelymany replicationsof the experiment For example if a quality control engi-neer asserts that the probability is that a widget pro-duced by her production team meets specications thenshe is asserting that in the long-run of those widgetsmeet specicationse phrase ldquoin the long-runrdquo requiresthe notion of limit as the sample size approaches innitye long-run relative frequency approach for describingthe probability of an event is natural and intuitive but nev-ertheless it raises serious mathematical questions Doesthe limiting relative frequency always exist as the samplesize approaches innity and is the limit the same irrespec-tive of the sequence of experimental outcomes It is easyto see that the answers are negative Indeed in the aboveexample depending on the sequence of experimental out-comes the proportion of widgets meeting specicationscould uctuate repeatedly from near to near as thenumber of widgets sampled approaches innity So in whatsense can it be asserted that the limit exists and is To provide an answer to this question one needs to applya LLN

e LLNs are of two types viz weak LLNs (WLLNs)and strong LLNs (SLLNs) Each type involves a dierentmode of convergence In general a WLLN (resp a SLLN)involves convergence in probability (resp convergencealmost surely (as)) e denitions of these two modesof convergence will now be reviewedLet Unn ge be a sequence of random variables

dened on a probability space (ΩF P) and let c isin R Wesay thatUn converges in probability to c (denotedUn

Prarr c) if

limnrarrinfinP(∣Un minus c∣ gt ε) = for all ε gt

We say that Un converges as to c (denoted Un rarr c as) if

P(ω isin Ω limnrarrinfinUn(ω) = c) =

If Un rarr c as then UnPrarr c the converse is not true in

generale celebrated Kolmogorov SLLN (see eg Chow

and Teicher [] p ) is the following result LetXnn ge be a sequence of independent and identicallydistributed (iid) random variables and let c isin Ren

sumni= Xin

rarr c as if and only if EX = c ()

Using statistical terminology the suciency half of ()asserts that the sample mean converges as to the popula-tionmean as the sample size n approaches innity providedthe population mean exists and is nite is result is of

fundamental importance in statistical science It followsfrom () that

if EX = c isin R then sumni= Xin

Prarr c ()

this result is the KhintchineWLLN (see eg Petrov []p )Next suppose Ann ge is a sequence of independent

events all with the same probability p A special case of theKolmogorov SLLN is the limit result

pn rarr p as ()

where pn = sumni= IAin is the proportion of A An tooccur n ge (Here IAi is the indicator function ofAi i ge )is result is the rst SLLN ever proved and was discov-ered by Emile Borel in Hence with probability thesample proportion pn approaches the population propor-tion p as the sample size n rarr infin It is this SLLN whichthus provides the theoretical justication for the long-runrelative frequency approach to interpreting probabilitiesNote however that the convergence in () is not point-wise on Ω but rather is pointwise on some subset of Ωhaving probability Consequently any interpretation ofp = P(A) via () necessitates that one has a priori anintuitive understanding of the notion of an event havingprobability

e SLLN () is a key component in the proof of theGlivenkondashCantelli theorem (see7Glivenko-Cantellieo-rems) which roughly speaking asserts that with probabil-ity a population distribution function can be uniformlyapproximated by a sample (or empirical) distribution func-tion as the sample size approaches innity is result isreferred to by Reacutenyi ( p ) as the fundamental the-orem of mathematical statistics and by Loegraveve ( p )as the central statistical theoremIn Jacob Bernoulli (ndash) proved the rst

WLLN

pnPrarr p ()

Bernoullirsquos renowned book Ars Conjectandi (e Art ofConjecturing) was published posthumously in and it ishere where the proof of his WLLN was rst published It isinteresting to note that there is over a year gap betweenthe WLLN () of Bernoulli and the corresponding SLLN() of BorelAn interesting example is the following modication

of one of Stout ( p ) Suppose that the quality con-trol engineer referred to above would like to estimate theproportion p of widgets produced by her production team

L Laws of Large Numbers

that meet specications She estimates p by using the pro-portion pn of the rst n widgets produced that meet speci-cations and she is interested in knowing if there will everbe a point in the sequence of examined widgets such thatwith probability (at least) a specied large value pn will bewithin ε of p and stay within ε of p as the sampling contin-ues (where ε gt is a prescribed tolerance)e answer isarmative since () is equivalent to the assertion that fora given ε gt and δ gt there exists a positive integer Nεδsuch that

P (capinfinn=Nεδ [∣pn minus p∣ le ε]) ge minus δ

at is the probability is arbitrarily close to that pn will bearbitrarily close to p simultaneously for all n beyond somepoint If one applied instead the WLLN () then it couldonly be asserted that for a given ε gt and δ gt thereexists a positive integer Nεδ such that

P(∣pn minus p∣ le ε) ge minus δ for all n ge Nεδ

ere are numerous other versions of the LLNs and wewill discuss only a few of them Note that the expressionsin () and () can be rewritten respectively as

sumni= Xi minus ncn

rarr as and sumni= Xi minus ncn

Prarr

thereby suggesting the following denitions A sequenceof random variables Xnn ge is said to obey a generalSLLN (resp WLLN) with centering sequence ann ge and norming sequence bnn ge (where lt bn rarrinfin) if

sumni= Xi minus anbn

rarr as (resp sumni= Xi minus anbn

Prarr )

A famous result of Marcinkiewicz and Zygmund (seeeg Chow and Teicher () p ) extended the Kol-mogorov SLLN as follows Let Xnn ge be a sequence ofiid random variables and let lt p lt en

sumni= Xi minus ncnp

rarr as for some c isin R if and only

if E∣X∣p ltinfin

In such a case necessarily c = EX if p ge whereas c isarbitrary if p lt Feller () extended the MarcinkiewiczndashZygmund

SLLN to the case of a more general norming sequencebnn ge satisfying suitable growth conditions

e followingWLLN is ascribed to Feller by Chow andTeicher ( p ) If Xnn ge is a sequence of iidrandom variables then there exist real numbers ann ge such that

sumni= Xi minus ann

Prarr ()

if and only if

nP(∣X∣ gt n)rarr as nrarrinfin ()

In such a case an minus nE(XI[∣X ∣len])rarr as nrarrinfine condition () is weaker than E∣X∣ lt infin If Xn

n ge is a sequence of iid random variables where X hasprobability density function

f (x) =⎧⎪⎪⎨⎪⎪⎩

cx log ∣x∣ ∣x∣ ge e ∣x∣ lt e

where c is a constant then E∣X∣ = infin and the SLLNsumni= Xin rarr c as fails for every c isin R but () and hencethe WLLN () hold with an = n ge Klass and Teicher () extended the Feller WLLN

to the case of a more general norming sequencebnn ge thereby obtaining a WLLN analog of Fellerrsquos() extension of the MarcinkiewiczndashZygmund SLLNGood references for studying the LLNs are the books

by Reacuteveacutesz () Stout () Loegraveve () Chow andTeicher () and Petrov () While the LLNs havebeen studied extensively in the case of independent sum-mands some of the LLNs presented in these books involvesummands obeying a dependence structure other than thatof independenceA large literature of investigation on the LLNs for

sequences of Banach space valued random elements hasemerged beginning with the pioneering work of Mourier() See themonograph byTaylor () for backgroundmaterial and results up to Excellent references are thebooks by Vakhania Tarieladze and Chobanyan () andLedoux and Talagrand () More recent results are pro-vided by Adler et al () Cantrell and Rosalsky ()and the references in these two articles

About the AuthorProfessor Rosalsky is Associate Editor Journal of AppliedMathematics and Stochastic Analysis (ndashpresent) andAssociate Editor International Journal of Mathematics andMathematical Sciences (ndashpresent) He has collabo-rated with research workers from dierent countriesAt the University of Florida he twice received a TeachingImprovement Program Award and on ve occasions wasnamed an Anderson Scholar Faculty Honoree an honorreserved for facultymembers designated by undergraduatehonor students as being the most eective and inspiring

Cross References7Almost Sure Convergence of Random Variables7BorelndashCantelli Lemma and Its Generalizations

Learning Statistics in a Foreign Language L

L

7Chebyshevrsquos Inequality7Convergence of Random Variables7Ergodiceorem7Estimation An Overview7Expected Value7Foundations of Probability7Glivenko-Cantellieorems7Probabilityeory An Outline7Random Field7Statistics History of7Strong Approximations in Probability and Statistics

References and Further ReadingAdler A Rosalsky A Taylor RL () A weak law for normed

weighted sums of random elements in Rademacher type pBanach spaces J Multivariate Anal ndash

Cantrell A Rosalsky A () A strong law for compactly uni-formly integrable sequences of independent random elementsin Banach spaces Bull Inst Math Acad Sinica ndash

Chow YS Teicher H () Probability theory indepen-dence interchangeability martingales rd edn SpringerNew York

Feller W () A limit theorem for random variables with infinitemoments Am J Math ndash

Klass M Teicher H () Iterated logarithm laws for asymmet-ric random variables barely with or without finite mean AnnProbab ndash

Ledoux M Talagrand M () Probability in Banach spacesisoperimetry and processes Springer Berlin

Loegraveve M () Probability theory vol I th edn Springer New YorkMourier E () Eleacutements aleacuteatoires dans un espace de Banach

Annales de lrsquo Institut Henri Poincareacute ndashPetrov VV () Limit theorems of probability theory sequences

of independent random variables Clarendon OxfordReacutenyi A () Probability theory North-Holland AmsterdamReacuteveacutesz P () The laws of large numbers Academic New YorkStout WF () Almost sure convergence Academic New YorkTaylor RL () Stochastic convergence of weighted sums of ran-

dom elements in linear spaces Lecture notes in mathematicsvol Springer Berlin

Vakhania NN Tarieladze VI Chobanyan SA () Probabilitydistributions on Banach spaces D Reidel Dordrecht Holland

Learning Statistics in a ForeignLanguage

KhidirM AbdelbasitSultan Qaboos University Muscat Sultanate of Oman

Backgrounde Sultanate of Oman is an Arabic-speaking countrywhere the medium of instruction in pre-university edu-cation is Arabic In Sultan Qaboos University (SQU) all

sciences (including Statistics) are taught in English ereason is that most of the scientic literature is in Englishand teaching in the native language may leave graduatesat a disadvantage Since only few instructors speak Ara-bic the university adopts a policy of no communicationin Arabic in classes and oce hours Students are requiredto achieve a minimum level in English (about IELTSscore) before they start their study program Very few stu-dents achieve that level on entry and the majority spendsabout two semesters doing English only

Language and Cultural ProblemsIt is to be expected that students from a non-English-speaking background will face serious diculties whenlearning in English especially in the rst year or two Mostof the literature discusses problems faced by foreign stu-dents pursuing study programs in an English-speakingcountry or a minority in a multi-cultural society (seefor example Coutis P and Wood L Hubbard R Koh E)Such students live (at least while studying) in an English-speaking community with which they have to interact on adaily basisese diculties are more serious for our stu-dents who are studying in their own countrywhere Englishis not the ocial languageey hardly use English outsideclassrooms and avoid talking in class as much as they can

My SQU ExperienceStatistical concepts and methods are most eectivelytaught through real-life examples that the students appre-ciate and understand We use the most popular textbooksin the USA for our courses ese textbooks use thisapproach with US students in mind Our main prob-lems are

Most of the examples and exercises used are com-pletely alien to our studentse discussions meant tomaintain the studentsrsquo interest only serve to put ourso With limited English they have serious dicultiesunderstanding what is explained and hence tend not tolisten to what the instructor is sayingey do not readthe textbooks because they contain pages and pages oflengthy explanations and discussions they cannot fol-low A direct eect is that students may nd the subjectboring and quickly lose interesteir attention thenturns to the art of passing tests instead of acquiring theintended knowledge and skills To pass their tests theyuse both their class and study times looking throughexamples concentrating on what formula to use andwhere to plug the numbers they have to get the answeris way they manage to do the mechanics fairly wellbut the concepts are almost entirely missed

L Least Absolute Residuals Procedure

e problem is worse with introductory probabilitycourses where games of chance are extensively usedas illustrative examples in textbooks Most of our stu-dents have never seen a deck of playing cards and somemay even be oended by discussing card games in aclassroom

e burden of nding strategies to overcome these dicul-ties falls on the instructor Statistical terms and conceptssuch as parameterstatistic sampling distribution unbi-asedness consistency suciency and ideas underlyinghypothesis testing are not easy to get across even in the stu-dentsrsquo own language To do that in a foreign language is areal challenge For the Statistics program to be successfulall (or at least most of the) instructors involved should beup to this challengeis is a time-consuming task with lit-tle reward other than self satisfaction In SQU the problemis compounded further by the fact that most of the instruc-tors are expatriates on short-term contracts who are morelikely to use their time for personal career advancementrather than time-consuming community service jobs

What Can Be DoneFor our rst Statistics course we produced a manual thatcontains very brief notes and many samples of previousquizzes tests and examinations It contains a good col-lection of problems from local culture to motivate thestudentse manual was well received by the students tothe extent that students prefer to practice with examplesfrom the manual rather than the textbookTextbooks written in English that are brief and to the

point are neededese should include examples and exer-cises from the studentsrsquo own culture A student tryingto understand a specic point gets distracted by lengthyexplanations and discouraged by thick textbooks to beginwith In a classroom where studentsrsquo faces clearly indicatethat you have got nothing across it is natural to try explain-ing more using more examples In the end of semesterevaluation of a course I taught a student once wrote ldquoeinstructor explains things more than needed He makessimple points dicultrdquois indicates that when teachingin a foreign language lengthy oral or written explanationsare not helpful A better strategywill be to explain conceptsand techniques briey and provide plenty of examples andexercises that will help the students absorb the material byosmosis e basic statistical concepts can only be eec-tively communicated to students in their own languageFor this reason textbooks should contain a good glossarywhere technical terms and concepts are explained using thelocal language

I expect such textbooks to go a long way to enhancestudentsrsquo understanding of Statistics An internationalproject can be initiated to produce an introductory statis-tics textbook with dierent versions intended for dierentgeographical arease English material will be the samethe examples vary to some extent from area to area andglossaries in local languages Universities in the develop-ing world naturally look at western universities asmodelsand international (western) involvement in such a projectis needed for it to succeede project will be a major con-tribution to the promotion of understanding Statistics andexcellence in statistical education in developing countriese international statistical institute takes pride in sup-porting statistical progress in the developing world isproject can lay the foundation for this progress and henceis worth serious consideration by the institute

Cross References7Online Statistics Education7Promoting Fostering and Development of Statistics inDeveloping Countries7Selection of Appropriate Statistical Methods in Develop-ing Countries7Statistical Literacy Reasoning andinking7Statistics Education

References and Further ReadingCoutis P Wood L () Teaching statistics and academic language

in culturally diverse classrooms httpwwwmathuocgr~ictmProceedingspappdf

Hubbard R () Teaching statistics to students who are learningin a foreign language ICOTS

Koh E () Teaching statistics to students with limited languageskills using MINITAB httparchivesmathutkeduICTCMVOLCpaperpdf

Least Absolute ResidualsProcedureRichardWilliam FarebrotherHonorary Reader in EconometricsVictoria University of Manchester Manchester UK

For i = n let xi xi xiq yi represent the ithobservation on a set of q + variables and suppose that wewish to t a linear model of the form

yi = xiβ + xiβ +⋯ + xiqβq + єi

Least Absolute Residuals Procedure L

L

to these n observations en for p gt the Lp-normtting procedure chooses values for b b bq to min-imise the Lp-norm of the residuals [sumni= ∣ei∣p]

p where fori = n the ith residual is dened by

ei = yi minus xib minus xib minus minus xiqbq

e most familiar Lp-norm tting procedure knownas the 7least squares procedure sets p = and choosesvalues for b b bq tominimize the sum of the squaredresidualssumni= ei A second choice to be discussed in the present article

sets p = and chooses b b bq to minimize the sumof the absolute residualssumni= ∣ei∣A third choice sets p =infin and chooses b b bq to

minimize the largest absolute residualmaxni=∣ei∣Setting ui = ei and vi = if ei ge and ui =

and vi = minusei if ei lt we nd that ei = ui minus vi so thatthe least absolute residuals (LAR) tting problem choosesb b bq to minimize the sum of the absolute residuals

n

sumi=

(ui + vi)

subject to

xib + xib +⋯ + xiqbq +Ui minus vi = yi for i = n

and Ui ge vi ge for i = ne LAR tting problem thus takes the form of a linearprogramming problem and is oen solved by means of avariant of the dual simplex procedureGauss has noted (when q = ) that solutions of this

problem are characterized by the presence of a set of qzero residuals Such solutions are robust to the presence ofoutlying observations Indeed they remain constant undervariations in the other n minus q observations provided thatthese variations do not cause any of the residuals to changetheir signs

e LAR tting procedure corresponds to the maxi-mum likelihood estimator when the є-disturbances followa double exponential (Laplacian) distributionis estima-tor is more robust to the presence of outlying observationsthan is the standard least squares estimator which maxi-mizes the likelihood function when the є-disturbances arenormal (Gaussian)Nevertheless theLAR estimator has anasymptotic normal distribution as it is amember ofHuberrsquosclass ofM-estimators

ere are many variants of the basic LAR proce-dure but the one of greatest historical interest is thatproposed in by the Croatian Jesuit scientist Rugjer(or Rudjer) Josip Boškovic (ndash) (Latin RogeriusJosephusBoscovich Italian RuggieroGiuseppeBoscovich)

In his variant of the standard LAR procedure there are twoexplanatory variables of which the rst is constant xi = and the values of b and b are constrained to satisfy theadding-up condition sumni=(yi minus b minus xib) = usuallyassociated with the least squares procedure developed byGauss in and published by Legendre in Com-puter algorithms implementing this variant of the LARprocedure with q ge variables are still to be found in theliteratureFor an account of recent developments in this area

see the series of volumes edited by Dodge ( ) For a detailed history of the LAR procedureanalyzing the contributions of Boškovic Laplace GaussEdgeworth Turner Bowley and Rhodes see Farebrother() And for a discussion of the geometrical andmechanical representation of the least squares and LARtting procedures see Farebrother ()

About the AuthorRichard William Farebrother was a member of the teach-ing sta of the Department of Econometrics and SocialStatistics in the (Victoria) University of Manchester from until when he took early retirement (he hasbeen blind since ) From until he was anHonorary Reader in Econometrics in the Department ofEconomic Studies of the sameUniversity He has publishedthree books Linear Least Squares Computations in Fitting Linear Relationships A History of the Calculus ofObservations ndash in and Visualizing statisticalModels and Concepts in He has also published morethan research papers in a wide range of subject areasincluding econometric theory computer algorithms sta-tistical distributions statistical inference and the history ofstatistics Dr Farebrother was also visiting Associate Pro-fessor () at Monash University Australia and VisitingProfessor () at the Institute for Advanced Studies inVienna Austria

Cross References7Least Squares7Residuals

References and Further ReadingDodge Y (ed) () Statistical data analysis based on the L-norm

and related methods North-Holland AmsterdamDodge Y (ed) () L-statistical analysis and related methods

North-Holland AmsterdamDodge Y (ed) () L-statistical procedures and related topics

Institute of Mathematical Statistics HaywardDodge Y (ed) () Statistical data analysis based on the L-norm

and related methods Birkhaumluser Basel

L Least Squares

Farebrother RW () Fitting linear relationships a history of thecalculus of observations ndash Springer New York

Farebrother RW () Visualizing statistical models and conceptsMarcel Dekker New York

Least Squares

Czesław StepniakProfessorMaria Curie-Skłodowska University Lublin PolandUniversity of Rzeszoacutew Rzeszoacutew Poland

Least Squares (LS) problem involves some algebraic andnumerical techniques used in ldquosolvingrdquo overdeterminedsystems F(x) asymp b of equations where b isin Rn while F(x)is a column of the form

F(x) =

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

f(x)

f(x)

fmminus(x)

fm(x)

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

with entries fi = fi(x) i = n where x = (x xp)T e LS problem is linear when each fi is a linear functionand nonlinear ndash if notLinear LS problem refers to a system Ax = b of linear

equations Such a system is overdetermined if n gt p If b notinrange(A) the system has no proper solution and will bedenoted by Ax asymp b In this situation we are seeking for asolution of some optimization probleme name ldquoLeastSquaresrdquo is justied by the l-norm commonly used as ameasure of imprecision

e LS problem has a clear statistical interpretation inregression terms Consider the usual regression model

yi = fi(xi xip β βp) + ei for i = n ()

where xij i = n j = p are some constants givenby experimental design fi i = n are given functionsdepending on unknown parameters βj j = p whileyi i = n are values of these functions observed withsome random errors ei We want to estimate the unknownparameters βi on the basis of the data set xij yi

In linear regression each fi is a linear function of typefi = sump

j= cij(x xn)βj and the model () may bepresented in vector-matrix notation as

y = Xβ + e

where y = (y yn)T e = (e en)T and β =(β βp)T while X is a n times p matrix with entries xijIf e en are not correlated with mean zero and a com-mon (perhaps unknown) variance then the problem ofBest Linear Unbiased Estimation (BLUE) of β reduces tonding a vector β that minimizes the norm ∣∣y minus Xβ ∣∣

=

(y minus Xβ)T(y minus Xβ)Such a vector is said to be the ordinary LS solution of

the overparametrized system Xβ asymp y On the other handthe last one reduces to solving the consistent system

XTXβ = XTy

of linear equations called normal equations In particularif rank(X) = p then the system has a unique solution of theform

β = (XTX)minusXTy

For linear regression yi = α+βxi+ei with one regressorx the BLU estimators of the parameters α and β may bepresented in the convenient form as

β = nsxynsx

and α = y minus βx

where

nsxy =sumixiyi minus

(sumi xi) (sumi yi)n

nsx =sumixi minus

(sumi xi)

n x = sumi xi

nand y = sumi yi

n

For its computation we only need to use a simple pocketcalculator

Example e following table presents the number of resi-dents in thousands (x) and the unemployment rate in (y)for some cities of Poland Estimate the parameters β and α

xi

yi

In this casesumi xi = sumi yi = sumi xi = and sumi xiyi = erefore nsx = and nsxy =minusus β = minus and α = and hence f (x) =minusx +

Leacutevy Processes L

L

If the variancendashcovariance matrix of the error vector ecoincides (except a multiplicative scalar σ ) with a positivedenite matrix V then the Best Linear Unbiased estima-tion reduces to the minimization of (y minusXβ)TV(y minusXβ)called the weighed LS problem Moreover if rank(X) = pthen its solution is given in the form

β = (XTVminusX)minusXTVminusy

It is worth to add that a nonlinear LS problem is morecomplicated and its explicit solution is usually not knownInstead of this some algorithms are suggestedTotal least squares problem e problem has been

posed in recent years in numerical analysis as an alterna-tive for the LS problem in the casewhen all data are aectedby errorsConsider an overdetermined system of n linear equa-

tionsAx asymp b with k unknown xe TLS problem consistsin minimizing the Frobenius norm

∣∣[A b] minus [A b]∣∣Ffor all A isin Rntimesk and b isin range(A) where the Frobeniusnorm is dened by ∣∣(aij)∣∣F = sumij aij Once a minimizing[A b] is found then any x satisfying Ax = b is called a TLSsolution of the initial system Ax asymp b

e trouble is that the minimization problem maynot be solvable or its solution may not be unique As anexample one can set

A =⎡⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎦and b =

⎡⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎦

It is known that the TLS solution (if exists) is alwaysbetter than the ordinary LS in the sense that the cor-rection b minus Ax has smaller l-norm e main tool insolving the TLS problems is the following Singular ValueDecompositionFor any matrix A of n times k with real entries there

exist orthonormal matrices P = [p pn] and

Q = [q qk] such that

PTAQ = diag(σ σm) where σ ge ⋯ ge σm andm = minn k

About the AuthorFor biography see the entry 7Random Variable

Cross References7Absolute Penalty Estimation7Adaptive Linear Regression

7Analysis of Areal and Spatial Interaction Data7Autocorrelation in Regression7Best Linear Unbiased Estimation in Linear Models7Estimation7Gauss-Markoveorem7General Linear Models7Least Absolute Residuals Procedure7Linear Regression Models7Nonlinear Models7Optimum Experimental Design7Partial Least Squares Regression Versus Other Methods7Simple Linear Regression7Statistics History of7Two-Stage Least Squares

References and Further ReadingBjoumlrk A () Numerical methods for least squares problems

SIAM PhiladelphiaKariya T () Generalized least squares Wiley New YorkRao CR Toutenberg H () Linear models least squares and

alternatives Springer New YorkVan Huffel S Vandewalle J () The total least squares problem

computational aspects and analysis SIAM PhiladelphiaWolberg J () Data analysis using the method of least squares

extracting the most information from experiments SpringerNew York

Leacutevy Processes

Mohamed Abdel-HameedProfessorUnited Arab Emirates University Al Ain United ArabEmirates

IntroductionLeacutevy processes have become increasingly popular in engi-neering (reliability dams telecommunication) andmathe-matical nanceeir applications in reliability stems fromthe fact that they provide a realistic model for the degrada-tion of devices while their applications in the mathemati-cal theory of dams as they provide a basis for describingthe water input of dams eir popularity in nance isbecause they describe the nancialmarkets in amore accu-rate way than the celebrated BlackndashScholes model elatter model assumes that the rate of returns on assetsare normally distributed thus the process describing theasset price over time is continuous process In reality theasset prices have jumps or spikes and the asset returnsexhibit fat tails and 7skewness which negates the nor-mality assumption inherited in the BlackndashScholes model

L Leacutevy Processes

Because of the deciencies in the BlackndashScholes modelresearchers in mathematical nance have been trying tondmore suitable models for asset prices Certain types ofLeacutevy processes have been found to provide a good modelfor creep of concrete fatigue crack growth corroded steelgates and chloride ingress into concrete Furthermore cer-tain types of Leacutevy processes have been used to model thewater input in damsIn this entry we will review Leacutevy processes and give

important examples of such processes and state somereferences to their applications

Leacutevy ProcessesA stochastic process X = Xt t ge that has right con-tinuous sample paths with le limits is said to be a Leacutevyprocess if the following hold

X has stationary increments ie for every s t ge thedistribution of Xt+s minus Xt is independent of t

X has independent increments ie for every t sge Xt+s minus Xt is independent of ( Xuu le t)

X is stochastically continuous ie for every t ge andє gt

limsrarrtP(∣Xt minus Xs∣ gt є) = at is to say a Leacutevy process is a stochastically continu-ous process with stationary and independent incrementswhose sample paths are right continuous with le handlimitsIf Φ(z) is the characteristic function of a Leacutevy process

then its characteristic component φ(z) def= ln Φ(z)t is of the

form

iza minus zb+ int

R[exp(izx) minus minus izxI∣x∣lt]ν(dx)

where a isin R b isin R+ and ν is a measure on R satisfyingν() = intR( and x

)ν(dx) ltinfine measure ν characterizes the size and frequency of

the jumps If the measure is innite then the process hasinnitelymany jumps of very small sizes in any small inter-vale constant a dened above is called the dri term ofthe process and b is the variance (volatility) term

e LeacutevyndashIt^o decomposition identify any Leacutevy process

as the sum of three independent processes it is stated asfollowsGiven any a isin R b isin R+ and measure ν on R satisfying

ν() = intR(andx)ν(dx) ltinfin there exists a probability

space (Ω P) on which a Leacutevy process X is denede

process X is the sum of three independent processes()X

()X and

()X where

()X is a Brownianmotionwith dri a and

volatility b (in the sense dened below)()X is a compound

Poisson process and()X is a square integrable martingale

e characteristic components of()X

()X and

()X (denoted

by()φ (z)

()φ (z) and

()φ (z) respectively) are as follows

()φ (z) = iza minus zb

()φ (z)= int∣x∣ge(exp(izx) minus )ν(dx)()φ (z)= int∣x∣lt(exp(izx) minus minus izx)ν(dx)

Examples of the Leacutevy ProcessesThe Brownian MotionA Leacutevy process is said to be a Brownian motion (see7Brownian Motion and Diusions) with dri micro andvolatility rate σ if micro = a b = σ and ν (R)= Brow-nian motion is the only nondeterministic Leacutevy processeswith continuous sample paths

The Inverse Brownian ProcessLet X be a Brownian motion with micro gt and volatility rateσ For any x gt let Tx = inft Xt gt xen Tx is anincreasing Leacutevy process (called inverse Brownianmotion)its Leacutevy measure is given by

υ(dx) = radicπσ x

exp(minusxmicro

σ )

The Compound Poisson Processe compound Poisson process (see 7Poisson Processes)is a Leacutevy process where b = and ν is a nite measure

The Gamma Processe gamma process is a nonnegative increasing Leacutevy pro-cess X where b = a minus int xν(dx) = and its Leacutevymeasure is given by

ν(dx) = αxexp(minusxβ)dx x gt

where α β gt It follows that the mean term (E(X ))and the variance term (V(X )) for the process are equalto αβ and αβ respectively

e following is a simulated sample path of a gammaprocess where α = and β = (Fig )

The Variance Gamma Processe variance gamma process is a Leacutevy process that can berepresented as either the dierence between two indepen-dent gamma processes or as a Brownian process subordi-nated by a gamma processe latter is accomplished by arandom time change replacing the time of the Brownianprocess by a gamma process with a mean term equal to e variance gamma process has three parameters micro ndash the

Life Expectancy L

L

0 2 4 6 8 100

2

4

6

8

10

12

t

x

Leacutevy Processes Fig Gamma process

0 2 4 6 8 10minus08

minus06

minus04

minus02

0

02

04

06

t

x

Brownian motionVariance gamma

Leacutevy Processes Fig Brownian motion and variance gammasample paths

Brownian process dri term σ ndash the volatility of theBrownian process and ν ndash the variance term of the thegamma process

e following are two simulated sample paths one fora Brownian motion with a dri term micro = and volatilityterm σ = and the other is for a variance gamma processwith the same values for the dri term and the volatilityterms and ν = (Fig )

About the AuthorProfessor Mohamed Abdel-Hameed was the chairman ofthe Department of Statistics and Operations ResearchKuwait University (ndash) He was on the editorialboard of Applied Stochastic Models and Data Analysis(ndash) He was the Editor (jointly with E Cinlarand J Quinn) of the text Survival Models Maintenance

Replacement Policies and Accelerated Life Testing (Aca-demic Press ) He is an elected member of the Inter-national Statistical Institute He has published numerouspapers in Statistics and Operations research

Cross References7Non-Uniform Random Variate Generations7Poisson Processes7Statistical Modeling of Financial Markets7Stochastic Processes7Stochastic Processes Classication

References and Further ReadingAbdel-Hameed MS () Life distribution of devices subject to a

Leacutevy wear process Math Oper Res ndashAbdel-Hameed M () Optimal control of a dam using PMλτ

policies and penalty cost when the input process is a com-pound Poisson process with a positive drift J Appl Prob ndash

Black F Scholes MS () The pricing of options and corporateliabilities J Pol Econ ndash

Cont R Tankov P () Financial modeling with jump processesChapman amp HallCRC Press Boca Raton

Madan DB Carr PP Chang EC () The variance gamma processand option pricing Eur Finance Rev ndash

Patel A Kosko B () Stochastic resonance in continuous andspiking Neutron models with Leacutevy noise IEEE Trans NeuralNetw ndash

van Noortwijk JM () A survey of the applications of gammaprocesses in maintenance Reliab Eng Syst Safety ndash

Life Expectancy

Maja Biljan-AugustFull ProfessorUniversity of Rijeka Rijeka Croatia

Life expectancy is dened as the average number of yearsa person is expected to live from age x as determined bystatisticsStatistics on life expectancy are derived from a mathe-

matical model known as the 7life table In order to calcu-late this indicator the mortality rate at each age is assumedto be constant Life expectancy (ex) can be evaluated atany age and in a hypothetical stationary population canbe written in discrete form as

ex =Txlx

where x is age Tx is the number of person-years livedaged x and over and lx is the number of survivors at agex according to the life table

L Life Expectancy

Life expectancy can be calculated for combined sexesor separately for males and femalesere can be signi-cant dierences between sexesLife expectancy at birth (e) is the average number of

years a newborn child can expect to live if currentmortalitytrends remain constant

e =Tl

where T is the total size of the population and l is thenumber of births (the original number of persons in thebirth cohort)Life expectancy declines with age Life expectancy at

birth is highly inuenced by infant mortality rate eparadox of the life table refers to a situation where lifeexpectancy at birth increases for several years aer birth(e lt e lt e and even beyond)e paradox reects thehigher rates of infant and child mortality in populationsin pre-transition and middle stages of the demographictransitionLife expectancy at birth is a summary measure of mor-

tality in a population It is a frequently used indicatorof health standards and socio-economic living standardsLife expectancy is also one of the most commonly usedindicators of social development is indicator is easilycomparable through time and between areas includingcountries Inequalities in life expectancy usually indicateinequalities in health and socio-economic developmentLife expectancy rose throughout human history In

ancient Greece and Rome the average life expectancywas below years between the years and life expectancy at birth rose from about years to aglobal average of years and to more than years inthe richest countries (Riley ) Furthermore in mostindustrialized countries in the early twenty-rst centurylife expectancy averaged at about years (WHO)esechanges called the ldquohealth transitionrdquo are essentially theresult of improvements in public health medicine andnutritionLife expectancy varies signicantly across regions and

continents from life expectancies of about years insome central African populations to life expectancies of years and above in many European countries emore developed regions have an average life expectancy of years while the population of less developed regionsis at birth expected to live an average years less etwo continents that display the most extreme dierencesin life expectancies are North America ( years) andAfrica ( years) where as of recently the gap betweenlife expectancies amounts to years (UN ndash data)

Countries with the highest life expectancies in theworld ( years) are Australia Iceland Italy Switzerlandand Japan ( years) Japanese men and women live anaverage of and years respectively (WHO )In countries with a high rate of HIV infection prin-

cipally in Sub-Saharan Africa the average life expectancyis years and below Some of the worldrsquos lowest lifeexpectancies are in Sierra Leone ( years) Afghanistan( years) Lesotho ( years) and Zimbabwe ( years)In nearly all countries women live longer than men

e worldrsquos average life expectancy at birth is years formales and years for females the gap is about ve yearse female-to-male gap is expected to narrow in the moredeveloped regions andwiden in the less developed regionse Russian Federation has the greatest dierence in lifeexpectancies between the sexes ( years less for men)whereas in Tonga life expectancy for males exceeds thatfor females by years (WHO )Life expectancy is assumed to rise continuously

According to estimation by the UN global life expectancyat birth is likely to rise to an average years by ndashBy life expectancy is expected to vary across countriesfrom to years Long-range United Nations popula-tion projections predict that by on average peoplecan expect to live more than years from (Liberia) upto years (Japan)For more details on the calculation of life expectancy

including continuous notation see Keytz ( )and Preston et al ()

Cross References7Biopharmaceutical Research Statistics in7Biostatistics7Demography7Life Table7Mean Residual Life7Measurement of Economic Progress

References and Further ReadingKeyfitz N () Introduction to the Mathematics of Population

Addison-Wesly MassachusettsKeyfitz N () Applied mathematical demography Springer

New YorkPreston SH Heuveline P Guillot M () Demography measuring

and modelling potion processes Blackwell OxfordRiley JC () Rising life expectancy a global history Cambridge

University Press CambridgeRowland DT () Demographic methods and concepts Oxford

University Press New YorkSiegel JS Swanson DA () The methods and materials of demog-

raphy nd edn Elsevier Academic Press AmsterdamWorld Health Organization () World health statistics

Geneva

Life Table L

L

World Population Prospects The revision Highlights UnitedNations New York

World Population to () United Nations New York

Life Table

JanM HoemProfessor Director EmeritusMax Planck Institute for Demographic Research RostockGermany

e life table is a classical tabular representation of centralfeatures of the distribution function F of a positive variablesay X which normally is taken to represent the lifetime ofa newborn individuale life table was introduced wellbefore modern conventions concerning statistical distri-butions were developed and it comes with some specialterminology and notation as follows Suppose that F has a

density f (x) = ddxF(x) and dene the force of mortality or

death intensity at age x as the function

micro(x) = minus ddxln minus F(x) = f (x)

minus F(x)

Heuristically it is interpreted by the relation micro(x)dx =Px lt X lt x + dx∣X gt x Conversely F(x) =

minus expminusx

intmicro(s)ds e survivor function is dened as

ℓ(x) = ℓ() minus F(x) normally with ℓ() = Inmortality applications ℓ(x) is the expected number of sur-vivors to exact age x out of an original cohort of newborn babiese survival probability is

tpx = PX gt x + t∣X gt x = ℓ(x + t)ℓ(x)

= expminusintt

micro(x + s)ds

and the non-survival probability is (the converse) tqx = minustpx For t = one writes qx = qx and px = px In partic-ular we get ℓ(x + ) = ℓ(x)pxis is a practical recursionformula that permits us to compute all values of ℓ(x) oncewe know the values of px for all relevant x

e life expectancy is eo = EX = int infin ℓ(x)dxℓ()(e subscript in eo indicates that the expected value iscomputed at age (ie for newborn individuals) and thesuperscript o indicates that the computation is made in

the continuous mode)e remaining life expectancy atage x is

eox = E(X minus x∣X gt x) = intinfin

ℓ(x + t)dtℓ(x)

ie it is the expected lifetime remaining to someone whohas attained age xTo turn to the statistical estimation of these various

quantities suppose that the function micro(x) is piecewiseconstant which means that we take it to equal some con-stant say microj over each interval (xj xj+) for some partitionxj of the age axis For a collection Xi of independentobservations of X let Dj be the number of Xi that fall inthe interval (xj xj+) In mortality applications this is thenumber of deaths observed in the given interval For thecohort of the initially newborn Dj is the number of indi-viduals whodie in the interval (called the occurrences in theinterval) If individual i dies in the interval he or she willof course have lived for Xi minus xj time units during the inter-val Individuals who survive the interval will have lived forxj+ minus xj time units in the interval and individuals whodo not survive to age xj will not have lived during thisinterval at all When we aggregate the time units lived in(xj xj+) over all individuals we get a total Rj which iscalled the exposures for the interval the idea being thatindividuals are exposed to the risk of death for as long asthey live in the interval In the simple case where there areno relations between the individual parameters microj the col-lection DjRj constitutes a statistically sucient set ofobservations with a likelihood Λ that satises the relationln Λ = sumjminusmicrojRj+Dj ln microjwhich is easily seen to bemax-imized by microj = DjRje latter fraction is therefore themaximum-likelihood estimator for microj (In some connec-tions an age schedule of mortality will be specied such asthe classical GompertzndashMakeham function microx = a + bcxwhich does represent a relationship between the intensityvalues at the dierent ages x normally for single-year agegroupsMaximum likelihood estimators can then be foundby plugging this functional specication of the intensitiesinto the likelihood function nding the values a b andc that maximize Λ and using a + bcx for the intensity inthe rest of the life table computations Methods that do notamount tomaximum likelihood estimationwill sometimesbe used because they involve simpler computations Withsome luck they provide starting values for the iterative pro-cess that must usually be applied to produce the maximumlikelihood estimators For an example see Forseacuten ())is whole schema can be extended trivially to cover cen-soring (withdrawals) provided the censoringmechanism isunrelated to the mortality process

L Likelihood

If the force of mortality is constant over a single-yearage interval (x x + ) say and is estimated by microx in thisinterval then px = eminusmicrox is an estimator of the single-yearsurvival probability pxis allows us to estimate the sur-vival function recursively for all corresponding ages usingℓ(x + ) = ℓ(x)px for x = and the rest of thelife table computations follow suit Life table constructionconsists in the estimation of the parameters and the tab-ulation of functions like those above from empirical datae data can be for age at death for individuals as in theexample indicated above but they can also be observa-tions of duration until recovery from an illness of intervalsbetween births of time until breakdown of some piece ofmachinery or of any other positive duration variableSo far we have argued as if the life table is computed for

a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a p_x schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the p̂_x values.

Life table techniques are described in detail in most

introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), and Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J, –
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in the history of probability and statistics: multiple decrements or competing risks. Biometrika ():–
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of R^k, and the variable y is assumed to take values in a subset of R^n; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the

likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y}


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which records the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−n log σ − (1/2σ²) Σ(y_i − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y_1, …, y_n, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σy_i and sample variance s² = (n − 1)⁻¹Σ(y_i − ȳ)², or equivalently through Σy_i and Σy_i². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
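A minimal sketch of the binomial example, tabulating the likelihood standardized by its maximum and the MLE θ̂ = y/n (the numbers y = 7, n = 10 are our illustration, not the entry's):

```python
import numpy as np

y, n = 7, 10                          # illustrative data: y successes in n trials
theta = np.linspace(0.01, 0.99, 99)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)  # log L, constant dropped
theta_hat = y / n                     # the MLE theta_hat = y/n
rel_lik = np.exp(loglik - loglik.max())   # likelihood standardized by its maximum
# e.g., {theta : rel_lik > 0.15} is one version of a "plausible" range for theta
plausible = theta[rel_lik > 0.15]
print(theta_hat, plausible[0], plausible[-1])
```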

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference

for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ | y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure, and π(θ | y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials and assuming π(θ) = 1, the posterior probability that θ > c is p," and so on.
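For the binomial model with the uniform prior π(θ) = 1, the integral in Bayes' rule is available in closed form: the posterior is Beta(y + 1, n − y + 1). A small sketch of the kind of posterior statement quoted above (the data values are again invented):

```python
from scipy.stats import beta

y, n = 7, 10                          # illustrative data
posterior = beta(y + 1, n - y + 1)    # Beta posterior under the uniform prior
print(posterior.sf(0.5))              # posterior probability that theta > 0.5
```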

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be

calibrated with reference to the probability model f(y; θ) by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = ∂/∂θ log L(θ; Y).

If Y = (Y_1, …, Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ)(θ̂ − θ) →_d χ²_k, (2)

U(θ)^T i^{−1}(θ) U(θ) →_d χ²_k, (3)

2{ℓ(θ̂) − ℓ(θ)} →_d χ²_k, (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result: the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically


normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.
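The χ² approximation in (4) is easy to check by simulation in a one-parameter model; the sketch below uses an exponential model with known true rate (all choices, including the model, are ours for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 1.0, 50, 5000
w = np.empty(reps)
for r in range(reps):
    yv = rng.exponential(1 / theta0, size=n)
    theta_hat = 1 / yv.mean()                 # MLE of the exponential rate
    # w = 2{l(theta_hat) - l(theta0)}, the statistic in (4)
    w[r] = 2 * ((n * np.log(theta_hat) - theta_hat * yv.sum())
                - (n * np.log(theta0) - theta0 * yv.sum()))
# Compare with chi-square(1): P(chi2_1 > 3.84) = 0.05
print((w > 3.84).mean())   # should be close to 0.05
```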

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →_d χ²_{k₁}.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
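As an illustration of profiling, take ψ = μ in the N(μ, σ²) model, with λ = σ² the nuisance parameter; here the constrained MLE has the closed form σ̂²_μ = n⁻¹Σ(y_i − μ)², and the χ²₁ calibration yields an interval. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
yv = rng.normal(10.0, 2.0, size=30)
n = len(yv)

def profile_loglik(mu):
    # Constrained MLE of the nuisance parameter: sigma^2_hat(mu)
    s2_mu = np.mean((yv - mu) ** 2)
    return -0.5 * n * np.log(s2_mu)        # additive constants dropped

mu_grid = np.linspace(yv.mean() - 3, yv.mean() + 3, 601)
lp = np.array([profile_loglik(m) for m in mu_grid])
# 95% profile-likelihood interval: 2{l_P(psi_hat) - l_P(psi)} <= 3.84
keep = 2 * (lp.max() - lp) <= 3.84
print(mu_grid[keep][0], mu_grid[keep][-1])
```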

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

π_M(ψ | y) ∝ ∫ π(ψ, λ | y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f_1(y_1; ψ) f_2(y_2 | y_1; λ), or
f(y; θ) = f_1(y_1 | y_2; ψ) f_2(y_2; λ).

A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc :–
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B :– (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika :–
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A :–
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat :–
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat :–
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika :–
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci :–
Reid N () Likelihood. J Am Stat Assoc :–
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml. Accessed Aug
Varin C () On composite marginal likelihood. Adv Stat Anal :–


Limit Theorems of ProbabilityTheory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X_1, X_2, X_3, … is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX_1 exists, then n^{−1} S_n → a almost surely, i.e., with probability 1. Thus the value na can be called the first order approximation for the sums S_n.

The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X_1 − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity nEX_1 + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
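Both approximations are easily visualized by simulation; a minimal sketch with exponential summands (our choice of distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
a, sigma, n, reps = 1.0, 1.0, 1000, 20000   # exponential(1): a = sigma = 1
x = rng.exponential(a, size=(reps, n))
s_n = x.sum(axis=1)
print(np.abs(s_n / n - a).mean())            # LLN: n^{-1} S_n is close to a
zeta = (s_n - n * a) / (sigma * np.sqrt(n))  # standardized sums
print((zeta < 1.96).mean())                  # CLT: approx Phi(1.96) = 0.975
```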

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1600s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both the methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX_1² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X_1 > x) + P(X_1 < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X_1 > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}), P^{−1} being the generalized inverse of the function P (and we assume that EX_1 = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X_1 > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.

6. Considering observations X_1, …, X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X_1 − EX_1)(X_1 − EX_1)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X_1, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX_1 and σ² = E(X_1 − a)² also follows from the LLN since, as n → ∞,

a* = n^{−1} Σ_{j=1}^n X_j → a (a.s.),

(σ²)* = n^{−1} Σ_{j=1}^n (X_j − a*)² = n^{−1} Σ_{j=1}^n X_j² − (a*)² → σ² (a.s.).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
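A minimal sketch of the resulting asymptotic confidence interval for a (the gamma-distributed sample is our illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(2.0, size=500)              # sample from an unknown F; true a = 2
a_star = x.mean()
s = x.std(ddof=0)                         # consistent estimate of sigma
half = 1.96 * s / np.sqrt(len(x))         # from sqrt(n)(a* - a) -> N(0, sigma^2)
print(a_star - half, a_star + half)       # asymptotic 95% CI for a
```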

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here:

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ_0, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ_0. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX_1 = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ a.s. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X_1) < ∞ then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X_1−θ)} = 1 has a solution μ > 0, then in the above example one has

P(S > x) ∼ c e^{−μx}, x → ∞,

where c is a known constant. If the distribution F of X_1 is subexponential (in this case E e^{μ(X_1−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x.

For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S_1, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).
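One way to glimpse the Brownian-bridge limit is to simulate the Kolmogorov statistic √n sup_u |F*_n(u) − F(u)|, whose limiting law is that of sup_t |w⁰(t)| (95% point ≈ 1.36); a rough sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 2000
ks = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))      # F = uniform(0,1), so F(u) = u
    grid = np.arange(1, n + 1) / n
    # sup |F_n*(u) - u| taken over the jump points of the empirical d.f.
    ks[r] = np.sqrt(n) * np.maximum(grid - u, u - (grid - 1 / n)).max()
print(np.quantile(ks, 0.95))              # close to 1.36, the Kolmogorov 95% point
```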

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of a branching process and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i | b_i ∼ N(X_i β + Z_i b_i, Σ_i), (1)
b_i ∼ N(0, D), (2)

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ² I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i), V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of the unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{−1}(α)(y_i − X_i β). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i^{−1} X_i β̂ + (I_{n_i} − Σ_i V_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i^{−1} and I_{n_i} − Σ_i V_i^{−1}, respectively. Note that the "numerator" Σ_i of Σ_i V_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
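For the random-intercept example (Z_i = 1_{n_i}, Σ_i = σ²I_{n_i}, D = d scalar), the EB estimate D Z_i′ V_i^{−1}(y_i − X_i β) reduces to shrinking each unit's mean residual by the scalar factor d/(d + σ²/n_i). A sketch under those assumptions, with all numbers invented and the fixed intercept taken as known for simplicity:

```python
import numpy as np

rng = np.random.default_rng(5)
d, sigma2, n_units, n_i = 4.0, 1.0, 200, 5     # D = d, Sigma_i = sigma2 * I
beta0 = 2.0                                     # known fixed intercept, X_i = 1
b = rng.normal(0, np.sqrt(d), n_units)          # true random intercepts
y = beta0 + b[:, None] + rng.normal(0, np.sqrt(sigma2), (n_units, n_i))

resid_mean = y.mean(axis=1) - beta0             # raw unit-level mean residuals
w = d / (d + sigma2 / n_i)                      # scalar form of D Z' V^{-1}
b_eb = w * resid_mean                           # empirical Bayes estimates
print(b_eb.var(), d)                            # EB variance w*d < var(b_i) = d
```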

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references, at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω_1, ω_2, embedded in white noise background: y[t] = β_1 + β_2 sin[ω_1 t] + β_3 sin[ω_2 t] + ε. We write β_1 + β_2 x_2 + β_3 x_3 + ε for β = (β_1, β_2, β_3), x = (1, x_2, x_3), x_2 ≡ x_2(t) = sin[ω_1 t], x_3(t) = sin[ω_2 t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y | x, β) = β_1 + β_2 x_2 + β_3 x_3, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0; Eε_i ε_k ≡ σ² > 0 if i = k ≤ n; and Eε_i ε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ω_1 t_i], sin[ω_2 t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = the column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose 3n entries are x_{ik} = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as the Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. (Each sweet pea seed has but one parent, and the distributions of w and y are the same.) Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and within a few years he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ). Echoes of those long ago events reverberate today in

our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.

L Linear Regression Models

Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-a-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach,

it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as y = xβ + ε, an abbreviated matrix formulation of the system of equations

y_i = x_{i1}β_1 + ⋯ + x_{ip}β_p + ε_i, i ≤ n, (1)

in which the random errors ε are assumed to satisfy

Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n.

The interpretation is that response y_i is observed for the ith sample in conjunction with numerical values (x_{i1}, …, x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^tr x)^{−1} x^tr y = β + Mε, (2)

for M = (x^tr x)^{−1} x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)^{−1} the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂_LS)_k = β_k, k ≤ p, (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
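A sketch of (2) for the two-frequency sound-signal model above (the frequencies, coefficients, and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 1.0, 200)
w1, w2 = 2 * np.pi * 3, 2 * np.pi * 7           # hypothetical known frequencies
x = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
beta = np.array([1.0, 2.0, -1.5])
y = x @ beta + rng.normal(0, 0.3, size=len(t))  # y = x beta + eps

# beta_LS = (x'x)^{-1} x'y, computed stably by a least-squares solver
beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_ls)
```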

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of the model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_k ε and M_k(ε + c1) = M_k ε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²) s²(y), where s(·) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
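A sketch of the generalized least squares solution (5), assuming for illustration an AR(1)-type known error covariance Σ:

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 100, 0.6
i, j = np.indices((n, n))
Sigma = rho ** np.abs(i - j)                    # assumed known error covariance
x = np.column_stack([np.ones(n), np.arange(n) / n])
beta = np.array([0.5, 2.0])
eps = np.linalg.cholesky(Sigma) @ rng.normal(size=n)
y = x @ beta + eps

Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # equation (5)
print(beta_gls)
```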

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(·) are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y | x) = Ey + E((y − Ey) | x) = Ey + xβ for every x,

and the discrepancies y − E(y | x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual discrepancies y − xβ̂_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
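The suggested exercise is easy to set up: resample residuals (errors model) and resample (w, y) pairs (sampling model), then compare the two bootstrap distributions of the fitted slope. A sketch on simulated bivariate data (not Galton's):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
w = rng.normal(size=n)
y = 0.33 * w + rng.normal(size=n)              # weak-correlation data
X = np.column_stack([np.ones(n), w])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

boot_err, boot_pair = [], []
for _ in range(2000):
    # Errors model: resample residuals, keep the design fixed.
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot_err.append(np.linalg.lstsq(X, y_star, rcond=None)[0][1])
    # Sampling model: resample (w_i, y_i) pairs.
    idx = rng.integers(0, n, size=n)
    boot_pair.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][1])

print(np.percentile(boot_err, [2.5, 97.5]))    # slope CI, errors model
print(np.percentile(boot_pair, [2.5, 97.5]))   # slope CI, sampling model
```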

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition Let random variables XY be identicallydistributed with nite mean EX and x any c gtmax(EX) If P(X gt c and Y gt c) lt P(Y gt c) then thereis reversion to mediocrity or beyond for that c

Proof For any given c gt max(EX) dene the dierenceJ of indicator random variables J = (X gt c) minus (Y gt c) J iszero unless one indicator is and the other YJ is less orequal cJ = c on J = (ie on X gt cY le c) and YJ is strictlyless than cJ = minusc on J = minus (ie on X le cY gt c) Sincethe event J = minus has positive probability by assumption theprevious implies E YJ lt c EJ and so

EY(X gt c) = E(Y(Y gt c) + YJ) = EX(X gt c) + E YJlt EX(X gt c) + cEJ = EX(X gt c)

yielding E(Y ∣X gt c) lt E(X∣X gt c) ◻
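A quick numerical illustration of the proposition (a sketch only; the exponential distribution, independence of X and Y, and the threshold c = 2 are arbitrary choices satisfying the hypotheses):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=1_000_000)   # X and Y identically distributed,
y = rng.exponential(size=1_000_000)   # here independent for simplicity
c = 2.0                                # any c > max(0, EX) = 1

sel = x > c
print(y[sel].mean(), x[sel].mean())    # E(Y|X>c) is clearly below E(X|X>c)
```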

Cross References
▸ Adaptive Linear Regression
▸ Analysis of Areal and Spatial Interaction Data
▸ Business Forecasting Methods
▸ Gauss–Markov Theorem
▸ General Linear Models
▸ Heteroscedasticity
▸ Least Squares
▸ Optimum Experimental Design
▸ Partial Least Squares Regression Versus Other Methods
▸ Regression Diagnostics
▸ Regression Models with Increasing Numbers of Unknown Parameters
▸ Ridge and Surrogate Ridge Regressions
▸ Simple Linear Regression
▸ Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Freedman D (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21:291–313
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1:292–298
Galton F (1877) Typical laws of heredity. Nature 15:492–495, 512–514, 532–533
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland 15:246–263
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A (1961) The advanced theory of statistics, volume 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal, Elsevier
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32(4):951–989
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean more universal than regression toward the mean. Am Stat 45:344–346
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xₙ) is a sample from a stochastic process x = {x₁, x₂, …}. Let pₙ(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n), θₙ)/pₙ(x(n), θ)], where θₙ = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density pₙ(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

\[ \Lambda_n = n^{-1/2}\, h^t S_n(\theta) - \tfrac{1}{2}\, n^{-1}\, h^t J_n(\theta)\, h + o_p(1), \tag{1} \]

where S_n(θ) = ∂ ln pₙ(x(n), θ)/∂θ, J_n(θ) = −∂² ln pₙ(x(n), θ)/∂θ ∂θᵗ, and

\[ \text{(i)}\;\; n^{-1/2} S_n(\theta) \xrightarrow{d} N_k\big(0, F(\theta)\big), \qquad \text{(ii)}\;\; n^{-1} J_n(\theta) \xrightarrow{p} F(\theta), \tag{2} \]

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang (1990) for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal and efficient estimator of θ, with

\[ \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N_k\big(0, F^{-1}(\theta)\big). \tag{3} \]

A large class of models involving classical i.i.d. (independent and identically distributed) observations is covered by the LAN framework. Many time series models and ▸Markov processes are also included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norming √n by a random norming J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

\[ J_n^{1/2}(\theta)\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, I), \tag{4} \]

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose, conditionally on V = v, (x₁, x₂, …, xₙ) are i.i.d. N(θ, v^{−1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

\[ p_n(x(n), \theta) \propto \Big[1 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\Big]^{-(\frac{n}{2}+1)}. \]

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →d t(2), a t-distribution with 2 degrees of freedom. It is interesting to note that the variance of the limit distribution of x̄ is ∞.
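A small simulation sketch of Example 1 (sample size and replication count are arbitrary), exploiting the fact that, conditionally on V, the sample mean is exactly normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps, theta = 400, 100_000, 0.0

# Draw V ~ Exp(mean 1); given V = v, the sample mean of n iid N(theta, 1/v)
# variables is N(theta, 1/(n*v)), so sqrt(n)*(xbar - theta) ~ N(0, 1/V).
v = rng.exponential(size=reps)
z = np.sqrt(n) * (rng.normal(theta, np.sqrt(1 / (n * v))) - theta)

# Kolmogorov-Smirnov distance to t(2); a small value indicates agreement.
print(stats.kstest(z, stats.t(df=2).cdf).statistic)
```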

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, …, with x₀ = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

\[ p_n(x(n), \theta) \propto \exp\Big[-\tfrac{1}{2}\sum_{t=1}^n (x_t - \theta x_{t-1})^2\Big]. \]

For the stationary case |θ| < 1 this model belongs to the LAN family. However, for |θ| > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

\[ \hat{\theta}_n = \Big(\sum_{t=1}^n x_t x_{t-1}\Big)\Big(\sum_{t=1}^n x_{t-1}^2\Big)^{-1}. \]

One can verify that

\[ \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N\big(0,\, 1 - \theta^2\big) \quad \text{for } |\theta| < 1, \]

and

\[ (\theta^2 - 1)^{-1}\,\hat{\theta}_n^{\,n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \text{Cauchy} \quad \text{for } |\theta| > 1. \]
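An illustrative simulation of the explosive case (the parameter values are arbitrary choices; the statistic follows the last display above):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.2, 50, 20_000
stat = np.empty(reps)

for b in range(reps):
    e = rng.standard_normal(n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):                         # x_t = theta*x_{t-1} + e_t
        x[t] = theta * x[t - 1] + e[t - 1]
    th = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)  # ML estimator
    stat[b] = th**n * (th - theta) / (theta**2 - 1)    # random norming

# Compare with the standard Cauchy, whose quartiles are -1, 0, 1.
print(np.percentile(stat, [25, 50, 75]))
```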

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▸ Asymptotic Normality
▸ Optimal Statistical Inference in Financial Engineering
▸ Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (1990) Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

\[ F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\Big(\frac{x-a}{b}\Big), \quad a \in \mathbb{R},\; b > 0, \]

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

\[ Y = \frac{X-a}{b} \]

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

\[ f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx}, \]

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

\[ f_X(x \mid a, b) = \frac{1}{b}\, f\Big(\frac{x-a}{b}\Big) \]

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + bμ_Y, and the standard deviation of X is √Var(X) = bσ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (1975), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

\[ X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i, \]

where ε_i is a random variable expressing the difference between X_{i:n} and its mean, E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

\[ \mathbf{x} = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \boldsymbol{\alpha} = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \boldsymbol{\theta} = \begin{pmatrix} a \\ b \end{pmatrix}, \quad A = (\mathbf{1}\;\; \boldsymbol{\alpha}), \]

the regression model now reads

\[ \mathbf{x} = A\,\boldsymbol{\theta} + \boldsymbol{\varepsilon} \]

with variance–covariance matrix

\[ \operatorname{Var}(\mathbf{x}) = b^2 B, \]

where B is the variance–covariance matrix of the reduced order statistics. The GLS estimator of θ is

\[ \hat{\boldsymbol{\theta}} = (A' B^{-1} A)^{-1} A' B^{-1} \mathbf{x}, \]

and its variance–covariance matrix reads

\[ \operatorname{Var}(\hat{\boldsymbol{\theta}}) = b^2 (A' B^{-1} A)^{-1}. \]

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
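A minimal numerical sketch of the GLS (Lloyd) estimator, assuming a standard normal reduced distribution and Monte Carlo approximations of α and B in place of exact tables or integrals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 20_000

# Monte Carlo approximation of alpha = E(Y_(i:n)) and B = Cov(Y_(i:n), Y_(j:n))
# for the reduced (standard) distribution -- here the standard normal.
Y = np.sort(rng.standard_normal((reps, n)), axis=1)
alpha = Y.mean(axis=0)
B = np.cov(Y, rowvar=False)

# Ordered sample assumed to come from the family with a = 2, b = 3.
x = np.sort(2 + 3 * rng.standard_normal(n))

A = np.column_stack([np.ones(n), alpha])
Binv = np.linalg.inv(B)
theta_hat = np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x)  # BLUE of (a, b)
print(theta_hat)
```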

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and joined Volkswagen AG to do operations research, before being appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
▸ Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat 24:95–108
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc 55:546–560
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika 39:88–95
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ▸beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ~ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x₁, …, x_p. The resulting model can be summarized as

\[ R_i \mid P_i \sim \text{Binomial}(m_i, P_i), \]
\[ \operatorname{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z, \]
\[ Z \sim N(0, 1). \]

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

\[ E[R_i] = m_i \pi_i \quad\text{and}\quad \operatorname{Var}(R_i) = m_i \pi_i (1 - \pi_i)\big[1 + \sigma^2 (m_i - 1)\, \pi_i (1 - \pi_i)\big], \]

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
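A short simulation sketch of this overdispersion (all parameter values are arbitrary choices; the displayed formula is the small-σ² approximation above):

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, eta, reps = 20, 0.5, 0.0, 200_000

# Simulate grouped counts from the logit-normal binomial model.
z = rng.standard_normal(reps)
p = 1 / (1 + np.exp(-(eta + sigma * z)))     # logit(P) = eta + sigma*Z
r = rng.binomial(m, p)

pi = 1 / (1 + np.exp(-eta))                   # logit(pi) = eta
binom_var = m * pi * (1 - pi)                 # plain binomial variance
williams = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), binom_var, williams)           # empirical variance exceeds binomial
```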

About the Author
For biography, see the entry ▸Logistic Distribution.

Cross References
▸ Logistic Distribution
▸ Mixed Membership Models
▸ Multivariate Normal Distributions
▸ Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44:139–177
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63:1–12
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67:261–272
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat 31:144–148

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest, for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

\[ f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \tag{1} \]

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ▸Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

\[ f(y_i; \theta_i, \phi) = \exp\Big\{ \frac{y_i \theta_i - b(\theta_i)}{\alpha(\phi)} + c(y_i; \phi) \Big\}, \tag{2} \]

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

\[ f(y_i; \pi_i) = \exp\big\{ y_i \ln\big(\pi_i/(1 - \pi_i)\big) + \ln(1 - \pi_i) \big\}. \tag{3} \]

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

\[ L(\mu_i; y_i) = \sum_{i=1}^n \big\{ y_i \ln\big(\mu_i/(1 - \mu_i)\big) + \ln(1 - \mu_i) \big\} \tag{4} \]

or

\[ L(\mu_i; y_i) = \sum_{i=1}^n \big\{ y_i \ln(\mu_i) + (1 - y_i) \ln(1 - \mu_i) \big\}. \tag{5} \]

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

\[ D = 2 \sum_{i=1}^n \big\{ y_i \ln(y_i/\mu_i) + (1 - y_i) \ln\big((1 - y_i)/(1 - \mu_i)\big) \big\}. \tag{6} \]

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

\[ x_i'\beta = \ln\big(\mu_i/(1 - \mu_i)\big) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \tag{7} \]

The value of μ_i for each observation in the logistic model is calculated as

\[ \mu_i = \frac{1}{1 + \exp(-x_i'\beta)} = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \tag{8} \]

The functions to the right of μ in (8) are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

\[ \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^n (y_i - \mu_i)\, x_i \tag{9} \]

\[ \frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^n x_i x_i'\, \mu_i (1 - \mu_i). \tag{10} \]
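As a sketch (not any particular package's implementation), a Newton–Raphson iteration built directly from (8)–(10) might look as follows; the data shapes and the convergence tolerance are arbitrary choices:

```python
import numpy as np

def logistic_newton(X, y, tol=1e-10, max_iter=25):
    """Newton-Raphson for binary logistic regression, using the gradient
    sum_i (y_i - mu_i) x_i and Hessian -sum_i x_i x_i' mu_i (1 - mu_i)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1 / (1 + np.exp(-X @ beta))                # inverse link, eq. (8)
        grad = X.T @ (y - mu)                           # eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X    # eq. (10)
        step = np.linalg.solve(hess, grad)
        beta -= step                                    # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy usage: intercept plus one continuous predictor.
X = np.column_stack([np.ones(100), np.linspace(-2, 2, 100)])
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1])))
y = (np.random.default_rng(5).uniform(size=100) < p_true).astype(float)
print(logistic_newton(X, y))
```

Standard errors would follow as the square roots of the diagonal of the inverse of the negative Hessian at convergence.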

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

             Age (child vs adult)
Survived    child    adults    Total
no             52      1438     1490
yes            57       654      711
Total         109      2092     2201

The odds of survival for adult passengers is 654/1438, or 0.455. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (654/1438)/(57/52), or 0.415. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std Err       z      P>|z|    [95% Conf Interval]
age           0.4149       0.0819    -4.46     0.000     0.2817    0.6110
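These quantities can be checked directly from the 2 × 2 counts with the standard large-sample formulas (a sketch, not any particular package's output):

```python
import numpy as np

# counts: survivors and deaths among children and adults
child_yes, child_no = 57, 52
adult_yes, adult_no = 654, 1438

or_hat = (adult_yes / adult_no) / (child_yes / child_no)   # odds ratio for age

# Wald interval on the log-odds-ratio scale
se_log = np.sqrt(1/child_yes + 1/child_no + 1/adult_yes + 1/adult_no)
ci = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_log)
print(or_hat, ci)   # ~0.415 and roughly (0.28, 0.61)
```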

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

“The odds of an adult surviving were about half the odds of a child surviving.”

By inverting the estimated odds ratio above (1/0.415 ≈ 2.4), we may conclude that children had some 140% – or nearly two and a half times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

“The odds of surviving is one and a half percent greater for each increasing year of age.”

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

\[ f(y_i; \pi_i, m_i) = \binom{m_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i}, \tag{11} \]

with a corresponding log-likelihood function expressed as

\[ L(\mu_i; y_i, m_i) = \sum_{i=1}^n \Big\{ y_i \ln\big(\mu_i/(1 - \mu_i)\big) + m_i \ln(1 - \mu_i) + \ln\binom{m_i}{y_i} \Big\}. \tag{12} \]

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y    cases    x1    x2    x3
⋮      ⋮       ⋮     ⋮     ⋮

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having a common pattern of predictor values x1, x2 and x3; of those three cases, one has a value of y equal to 1 and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio    OIM std err     z     P>|z|    [95% conf interval]
x1        ⋯             ⋯           ⋯       ⋯              ⋯
x2        ⋯             ⋯           ⋯       ⋯              ⋯
x3        ⋯             ⋯           ⋯       ⋯              ⋯
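A hypothetical grouped-data fit of this kind can be reproduced with any GLM routine; in the following sketch the counts and covariate values are illustrative stand-ins, not the entries of the table above:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: each row is a covariate pattern with
# y successes out of `cases` trials.
y     = np.array([1, 3, 2, 1])
cases = np.array([3, 6, 4, 4])
X = sm.add_constant(np.array([[1, 0, 1],
                              [1, 1, 0],
                              [0, 1, 1],
                              [0, 0, 1]]))

# Binomial GLM with logit link: the response is (successes, failures).
fit = sm.GLM(np.column_stack([y, cases - y]), X,
             family=sm.families.Binomial()).fit()
print(np.exp(fit.params))   # exponentiated coefficients = odds ratios
```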

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many observations being grouped into only a handful of covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see ▸Akaike's Information Criterion and ▸Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track and field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▸ Case-Control Studies
▸ Categorical Data Analysis
▸ Generalized Linear Models
▸ Multivariate Data Analysis: An Overview
▸ Probit Analysis
▸ Recursive Partitioning
▸ Regression Models with Symmetrical Errors
▸ Robust Regression Estimation in Generalized Linear Models
▸ Statistics: An Overview
▸ Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

\[ f(x) = \frac{1}{\tau}\, \frac{\exp\{-(x - \mu)/\tau\}}{\big[1 + \exp\{-(x - \mu)/\tau\}\big]^2}, \quad -\infty < x < \infty, \tag{1} \]

and the cumulative distribution function is

\[ F(x) = \big[1 + \exp\{-(x - \mu)/\tau\}\big]^{-1}, \quad -\infty < x < \infty. \]

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

\[ S(x) = \big[1 + \exp\{(x - \mu)/\tau\}\big]^{-1}, \quad -\infty < x < \infty, \]
\[ h(x) = \frac{1}{\tau}\, \big[1 + \exp\{-(x - \mu)/\tau\}\big]^{-1}. \]

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

\[ h(t) = \frac{\alpha}{\theta}\, \frac{(t/\theta)^{\alpha - 1}}{1 + (t/\theta)^{\alpha}}, \quad t \ge 0,\; \theta, \alpha > 0, \]

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

\[ f(x) = F(x)\,\big[1 - F(x)\big], \]

which in turn, by elementary calculus, implies that

\[ \operatorname{logit}(F(x)) = \log_e\Big[\frac{F(x)}{1 - F(x)}\Big] = x \tag{2} \]

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ~ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
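A minimal sketch of this simulation recipe (the values of μ and τ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, tau = 1.5, 0.5

# Inverse-CDF sampling: X = log(U/(1-U)) is standard logistic by the
# identity logit(F(x)) = x; then scale and shift.
u = rng.uniform(size=100_000)
x = mu + tau * np.log(u / (1 - u))

print(x.mean(), x.var())             # compare with mu and tau^2 * pi^2 / 3
print(mu, tau**2 * np.pi**2 / 3)
```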

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ▸probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

\[ \operatorname{logit}(P(d)) = \log_e\Big[\frac{P(d)}{1 - P(d)}\Big] = \beta_0 + \beta_1 d, \]

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/|β₁|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ▸generalized linear models. A comprehensive treatment of ▸logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▸ Asymptotic Relative Efficiency in Estimation
▸ Asymptotic Relative Efficiency in Testing
▸ Bivariate Distributions
▸ Location-Scale Distributions
▸ Logistic Regression
▸ Multivariate Statistical Distributions
▸ Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

“Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.”

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

“A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.”

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with Lorenz ordering.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (1912). This index is the ratio


Lorenz Curve. Fig. 1: Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2: Two intersecting Lorenz curves. Using the Gini index, L₁(p) has greater inequality (G = …) than L₂(p) (G = …)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
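A minimal sketch for computing an empirical Lorenz curve and Gini index from sample data (the lognormal sample is an arbitrary illustration):

```python
import numpy as np

def lorenz_gini(income):
    """Empirical Lorenz curve ordinates and Gini index for a sample of incomes."""
    x = np.sort(np.asarray(income, dtype=float))
    n = x.size
    p = np.arange(1, n + 1) / n          # cumulative population shares
    L = np.cumsum(x) / x.sum()           # cumulative income shares, L(p)
    # Gini index via the mean-difference formula; G lies in [0, 1]
    G = np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / (n**2 * x.mean())
    return p, L, G

p, L, G = lorenz_gini(np.random.default_rng(3).lognormal(0, 1, 100_000))
print(G)   # for a lognormal with sigma = 1, G = 2*Phi(1/sqrt(2)) - 1 ≈ 0.52
```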

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
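A quick numerical check of case (I), assuming an illustrative progressive rule u(x) = x^0.8, for which u(x)/x = x^(−0.2) is decreasing:

```python
import numpy as np

def gini(x):
    # Gini index via the mean-difference formula
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    return np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / (n**2 * x.mean())

x = np.random.default_rng(4).lognormal(0, 1, 100_000)   # pre-tax incomes
print(gini(x), gini(x ** 0.8))   # the Gini index falls after the progressive rule
```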

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations," and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▸ Econometrics
▸ Economic Statistics
▸ Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▸Decision Theory: An Introduction and ▸Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is then zero; otherwise, the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information from data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

\[ L : \Theta \times \Theta \to [0, \infty). \]

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

\[ L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta \]

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative set: Θ = Θ₀ + Θ₁. The usual loss function is then given by

\[ L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0. \end{cases} \]

For point estimation problems we assume that Θ is a normed linear space, and let ∥⋅∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses,

\[ \ell(t) = \tfrac{1}{2} t^2 \;\text{ if } |t| \le \gamma, \qquad \ell(t) = \gamma |t| - \tfrac{1}{2}\gamma^2 \;\text{ if } |t| > \gamma, \]

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses,

\[ \ell(t) = b\big(\exp(at) - at - 1\big), \]

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works, the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ▸Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff (). In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate for the unknown regression function by the "best approximation," i.e., the function in F that minimizes

\[ \sum_{i=1}^n \ell\big(r_i^f\big), \quad f \in F, \]

where r_i^f = y_i − f(t_i) is the residual at the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
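A small sketch contrasting the estimates that different choices of ℓ produce when fitting a constant to data with an outlier (the data and tuning constants are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 1.2, 0.8, 1.1, 5.0])   # one outlier

def fit_constant(loss):
    """Best constant fit c minimizing sum_i loss(y_i - c)."""
    return minimize_scalar(lambda c: np.sum(loss(y - c))).x

huber = lambda t, g=1.0: np.where(np.abs(t) <= g, t**2 / 2, g * np.abs(t) - g**2 / 2)
linex = lambda t, a=1.0, b=1.0: b * (np.exp(a * t) - a * t - 1)

print(fit_constant(lambda t: t**2))   # square loss -> mean (~1.82)
print(fit_constant(np.abs))           # L1 loss     -> median (~1.1)
print(fit_constant(huber))            # Huber       -> a compromise
print(fit_constant(linex))            # LinEx       -> asymmetric, guards against underestimation
```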

Cross References
▸ Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▸ Bayesian Statistics
▸ Decision Theory: An Introduction
▸ Decision Theory: An Overview
▸ Entropy and Cross Entropy as Diversity and Distance Measures
▸ Sequential Sampling
▸ Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam

  • L
    • Large Deviations and Applications
      • About the Author
      • Cross References
      • References and Further Reading
        • Laws of Large Numbers
          • About the Author
          • Cross References
          • References and Further Reading
            • Learning Statistics in a Foreign Language
              • Background
              • Language and Cultural Problems
              • My SQU Experience
              • What Can Be Done
              • Cross References
              • References and Further Reading

These bounds yield a large deviation principle for the empirical distribution of a general n-sample, with speed n and rate I(ν) = H(ν∣µ) given by the natural generalization of the above formula. This result, due to Sanov, has many consequences in information theory and statistical mechanics (Dembo and Zeitouni; den Hollander) and for exponential families in statistics. Applications in statistics also include point estimation (by giving the exponential rate of convergence of M-estimators), hypothesis testing (Bahadur efficiency) (Kester), and concentration inequalities (Dembo and Zeitouni).
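To see how sharply the relative entropy controls these probabilities, one can compare the exact multinomial probability of a score vector with the upper bound e^{-nH(ν∣µ)}. The following is a minimal sketch; the distributions µ and ν and the sample size are made-up values for illustration:

```python
import numpy as np
from scipy.stats import multinomial

# Made-up example: true law mu and an atypical empirical law nu on a
# three-point alphabet; n*nu must be a vector of integers (a score vector).
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.3, 0.3, 0.4])
n = 10

# Relative entropy H(nu | mu) = sum_y nu(y) * ln(nu(y) / mu(y))
H = float(np.sum(nu * np.log(nu / mu)))

counts = (n * nu).round().astype(int)     # the score vector N_n with N_n/n = nu
exact = multinomial.pmf(counts, n=n, p=mu)
print(exact, np.exp(-n * H))              # exact probability vs. e^{-n H(nu|mu)}
```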

The Freidlin–Wentzell theory deals with diffusion processes with small noise,
\[ dX^{\varepsilon}_t = b(X^{\varepsilon}_t)\,dt + \sqrt{\varepsilon}\,\sigma(X^{\varepsilon}_t)\,dB_t, \qquad X^{\varepsilon}_0 = y. \]

The coefficients b, σ are uniformly Lipschitz functions and B is a standard Brownian motion (see 7Brownian Motion and Diffusions). The sequence Xᵋ can be viewed, as ε → 0, as a small random perturbation of the ordinary differential equation
\[ dx_t = b(x_t)\,dt, \qquad x_0 = y. \]
Indeed, Xᵋ → x in the supremum norm on bounded time intervals. Freidlin and Wentzell have shown that, on a finite time interval [0, T], the sequence Xᵋ with values in the path space obeys the LDP with speed ε⁻¹ and rate function

\[ I(\varphi) = \frac{1}{2}\int_0^T \sigma(\varphi(t))^{-2}\,\big(\dot{\varphi}(t) - b(\varphi(t))\big)^2\,dt \]
if φ is absolutely continuous with square-integrable derivative and φ(0) = y, and I(φ) = +∞ otherwise. (To fit in the above formal definition, take a sequence ε = εₙ → 0, aₙ = εₙ⁻¹, and for Pₙ the law of X^{εₙ}.)
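The concentration of Xᵋ around the deterministic path x as ε → 0 is easy to observe by simulation. Here is a minimal Euler–Maruyama sketch; the drift b, the diffusion coefficient σ, and all parameter values are made up for illustration:

```python
import numpy as np

def euler_maruyama(eps, b, sigma, y, T=1.0, nsteps=1000, seed=0):
    """Simulate dX = b(X)dt + sqrt(eps)*sigma(X)dB on [0, T], X_0 = y."""
    rng = np.random.default_rng(seed)
    dt = T / nsteps
    x = np.empty(nsteps + 1)
    x[0] = y
    for k in range(nsteps):
        dB = rng.normal(0.0, np.sqrt(dt))
        x[k + 1] = x[k] + b(x[k]) * dt + np.sqrt(eps) * sigma(x[k]) * dB
    return x

b = lambda u: -u            # hypothetical drift
sigma = lambda u: 1.0       # hypothetical diffusion coefficient

ode = euler_maruyama(0.0, b, sigma, y=1.0)     # eps = 0 recovers dx = b(x)dt
for eps in (1.0, 0.1, 0.01):
    path = euler_maruyama(eps, b, sigma, y=1.0)
    # sup-norm distance to the deterministic path shrinks as eps -> 0
    print(eps, np.max(np.abs(path - ode)))
```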

The Freidlin–Wentzell theory has applications in physics (metastability phenomena) and engineering (tracking loops, statistical analysis of signals, stabilization of systems and algorithms) (Freidlin and Wentzell; Dembo and Zeitouni; Olivieri and Vares).

About the Author
Francis Comets is Professor of Applied Mathematics at University of Paris-Diderot (Paris 7), France. He is the Head of the team "Stochastic Models" in the CNRS laboratory "Probabilité et modèles aléatoires," and he has been the Deputy Director of the Foundation Sciences Mathématiques de Paris since its creation. He has coauthored numerous research papers, with a focus on random medium, and one book ("Calcul stochastique et modèles de diffusions," in French, Dunod, with Thierry Meyre). He has supervised many PhD theses and served as Associate Editor for Stochastic Processes and their Applications.

Cross References
7Asymptotic Relative Efficiency in Testing
7Chernoff Bound
7Entropy and Cross Entropy as Diversity and Distance Measures
7Laws of Large Numbers
7Limit Theorems of Probability Theory
7Moderate Deviations
7Robust Statistics

References and Further Reading
Dembo A, Zeitouni O () Large deviations techniques and applications. Springer, New York
den Hollander F () Large deviations. Am Math Soc, Providence, RI
Freidlin MI, Wentzell AD () Random perturbations of dynamical systems. Springer, New York
Kester A () Some large deviation results in statistics. CWI Tract, Centrum voor Wiskunde en Informatica, Amsterdam
Olivieri E, Vares ME () Large deviations and metastability. Cambridge University Press, Cambridge

Laws of Large Numbers

Andrew Rosalsky
Professor
University of Florida, Gainesville, FL, USA

The laws of large numbers (LLNs) provide bounds on the fluctuation behavior of sums of random variables and, as we will discuss herein, lie at the very foundation of statistical science. They have a history going back several centuries. The literature on the LLNs is of epic proportions, as this concept is indispensable in probability and statistical theory and their applications.

Probability theory, like some other areas of mathematics, such as geometry for example, is a subject arising from an attempt to provide a rigorous mathematical model for real world phenomena. In the case of probability theory, the real world phenomena are chance behavior of biological processes or physical systems, such as gambling games and their associated monetary gains or losses.

The probability of an event is the abstract counterpart to the notion of the long-run relative frequency of the occurrence of the event through infinitely many replications of the experiment.


For example, if a quality control engineer asserts that there is a given probability that a widget produced by her production team meets specifications, then she is asserting that, in the long run, the corresponding proportion of those widgets meets specifications. The phrase "in the long-run" requires the notion of limit as the sample size approaches infinity. The long-run relative frequency approach for describing the probability of an event is natural and intuitive but, nevertheless, it raises serious mathematical questions. Does the limiting relative frequency always exist as the sample size approaches infinity, and is the limit the same irrespective of the sequence of experimental outcomes? It is easy to see that the answers are negative. Indeed, in the above example, depending on the sequence of experimental outcomes, the proportion of widgets meeting specifications could fluctuate repeatedly from near 0 to near 1 as the number of widgets sampled approaches infinity. So in what sense can it be asserted that the limit exists and equals the asserted probability? To provide an answer to this question, one needs to apply a LLN.

The LLNs are of two types, viz., weak LLNs (WLLNs) and strong LLNs (SLLNs). Each type involves a different mode of convergence. In general, a WLLN (resp., a SLLN) involves convergence in probability (resp., convergence almost surely (a.s.)). The definitions of these two modes of convergence will now be reviewed. Let {Uₙ, n ≥ 1} be a sequence of random variables defined on a probability space (Ω, F, P), and let c ∈ ℝ. We say that Uₙ converges in probability to c (denoted Uₙ →P c) if
\[ \lim_{n\to\infty} P(|U_n - c| > \varepsilon) = 0 \quad\text{for all } \varepsilon > 0. \]
We say that Uₙ converges a.s. to c (denoted Uₙ → c a.s.) if
\[ P\big(\{\omega\in\Omega : \lim_{n\to\infty} U_n(\omega) = c\}\big) = 1. \]

If Uₙ → c a.s., then Uₙ →P c; the converse is not true in general.

The celebrated Kolmogorov SLLN (see, e.g., Chow and Teicher) is the following result: Let {Xₙ, n ≥ 1} be a sequence of independent and identically distributed (i.i.d.) random variables and let c ∈ ℝ. Then
\[ \frac{\sum_{i=1}^{n} X_i}{n} \to c \ \text{a.s.} \quad\text{if and only if}\quad E X_1 = c. \tag{1} \]
Using statistical terminology, the sufficiency half of (1) asserts that the sample mean converges a.s. to the population mean as the sample size n approaches infinity, provided the population mean exists and is finite. This result is of fundamental importance in statistical science. It follows from (1) that
\[ \text{if } E X_1 = c \in \mathbb{R}, \text{ then } \frac{\sum_{i=1}^{n} X_i}{n} \xrightarrow{P} c; \tag{2} \]
this result is the Khintchine WLLN (see, e.g., Petrov).
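The convergence in (1) and (2) is easy to watch numerically. A small sketch follows; the exponential distribution and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
c = 2.0                                    # the population mean E(X_1)
for n in (10, 1_000, 100_000):
    x = rng.exponential(scale=c, size=n)   # i.i.d. sample with mean c
    print(n, x.mean())                     # sample means settle near c
```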

Next, suppose {Aₙ, n ≥ 1} is a sequence of independent events, all with the same probability p. A special case of the Kolmogorov SLLN is the limit result
\[ \hat{p}_n \to p \ \text{a.s.,} \tag{3} \]
where p̂ₙ = ∑ᵢ₌₁ⁿ I_{Aᵢ}/n is the proportion of A₁, …, Aₙ to occur, n ≥ 1. (Here I_{Aᵢ} is the indicator function of Aᵢ, i ≥ 1.) This result is the first SLLN ever proved and was discovered by Émile Borel in 1909. Hence, with probability 1, the sample proportion p̂ₙ approaches the population proportion p as the sample size n → ∞. It is this SLLN which thus provides the theoretical justification for the long-run relative frequency approach to interpreting probabilities. Note, however, that the convergence in (3) is not pointwise on Ω but, rather, is pointwise on some subset of Ω having probability 1. Consequently, any interpretation of p = P(A₁) via (3) necessitates that one has a priori an intuitive understanding of the notion of an event having probability 1.

The SLLN (3) is a key component in the proof of the Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems) which, roughly speaking, asserts that, with probability 1, a population distribution function can be uniformly approximated by a sample (or empirical) distribution function as the sample size approaches infinity. This result is referred to by Rényi as the fundamental theorem of mathematical statistics and by Loève as the central statistical theorem.

Jacob Bernoulli (1654–1705) proved the first WLLN,
\[ \hat{p}_n \xrightarrow{P} p. \tag{4} \]
Bernoulli's renowned book Ars Conjectandi (The Art of Conjecturing) was published posthumously in 1713, and it is here where the proof of his WLLN was first published. It is interesting to note that there is a gap of nearly two centuries between the WLLN (4) of Bernoulli and the corresponding SLLN (3) of Borel.

An interesting example is the following modification of one of Stout. Suppose that the quality control engineer referred to above would like to estimate the proportion p of widgets produced by her production team that meet specifications. She estimates p by using the proportion p̂ₙ of the first n widgets produced that meet specifications, and she is interested in knowing if there will ever be a point in the sequence of examined widgets such that, with probability (at least) a specified large value, p̂ₙ will be within ε of p and stay within ε of p as the sampling continues (where ε > 0 is a prescribed tolerance). The answer is affirmative, since (3) is equivalent to the assertion that for a given ε > 0 and δ > 0 there exists a positive integer N_{ε,δ} such that
\[ P\left(\bigcap_{n=N_{\varepsilon,\delta}}^{\infty} \big[\,|\hat{p}_n - p| \le \varepsilon\,\big]\right) \ge 1 - \delta. \]
That is, the probability is arbitrarily close to 1 that p̂ₙ will be arbitrarily close to p simultaneously for all n beyond some point. If one applied instead the WLLN (4), then it could only be asserted that for a given ε > 0 and δ > 0 there exists a positive integer N_{ε,δ} such that
\[ P(|\hat{p}_n - p| \le \varepsilon) \ge 1 - \delta \quad\text{for all } n \ge N_{\varepsilon,\delta}. \]

There are numerous other versions of the LLNs, and we will discuss only a few of them. Note that the expressions in (1) and (2) can be rewritten, respectively, as
\[ \frac{\sum_{i=1}^{n} X_i - nc}{n} \to 0 \ \text{a.s.} \quad\text{and}\quad \frac{\sum_{i=1}^{n} X_i - nc}{n} \xrightarrow{P} 0, \]
thereby suggesting the following definitions. A sequence of random variables {Xₙ, n ≥ 1} is said to obey a general SLLN (resp., WLLN) with centering sequence {aₙ, n ≥ 1} and norming sequence {bₙ, n ≥ 1} (where 0 < bₙ → ∞) if
\[ \frac{\sum_{i=1}^{n} X_i - a_n}{b_n} \to 0 \ \text{a.s.} \quad \left(\text{resp., } \frac{\sum_{i=1}^{n} X_i - a_n}{b_n} \xrightarrow{P} 0\right). \]

A famous result of Marcinkiewicz and Zygmund (see, e.g., Chow and Teicher) extended the Kolmogorov SLLN as follows: Let {Xₙ, n ≥ 1} be a sequence of i.i.d. random variables and let 0 < p < 2. Then
\[ \frac{\sum_{i=1}^{n} X_i - nc}{n^{1/p}} \to 0 \ \text{a.s. for some } c \in \mathbb{R} \quad\text{if and only if}\quad E|X_1|^p < \infty. \]
In such a case, necessarily c = EX₁ if p ≥ 1, whereas c is arbitrary if p < 1.

Feller extended the Marcinkiewicz–Zygmund SLLN to the case of a more general norming sequence {bₙ, n ≥ 1} satisfying suitable growth conditions.

The following WLLN is ascribed to Feller by Chow and Teicher: If {Xₙ, n ≥ 1} is a sequence of i.i.d. random variables, then there exist real numbers {aₙ, n ≥ 1} such that
\[ \frac{\sum_{i=1}^{n} X_i - a_n}{n} \xrightarrow{P} 0 \tag{5} \]
if and only if
\[ nP(|X_1| > n) \to 0 \ \text{as } n \to \infty. \tag{6} \]

In such a case, aₙ − nE(X₁I_{[|X₁|≤n]}) → 0 as n → ∞.

The condition (6) is weaker than E|X₁| < ∞. If {Xₙ, n ≥ 1} is a sequence of i.i.d. random variables where X₁ has probability density function
\[ f(x) = \begin{cases} \dfrac{c}{x^2 \log |x|}, & |x| \ge e, \\[1ex] 0, & |x| < e, \end{cases} \]
where c is a constant, then E|X₁| = ∞ and the SLLN ∑ᵢ₌₁ⁿ Xᵢ/n → c a.s. fails for every c ∈ ℝ, but (6) and hence the WLLN (5) hold with aₙ = 0, n ≥ 1.
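Condition (6) and the divergence of E|X₁| can both be checked numerically for this density. A quick sketch using numerical quadrature; nothing is assumed beyond the density just displayed:

```python
import numpy as np
from scipy.integrate import quad

g = lambda x: 1.0 / (x**2 * np.log(x))      # unnormalized density on [e, inf)
half_mass, _ = quad(g, np.e, np.inf)
c = 1.0 / (2.0 * half_mass)                 # normalizing constant of f

for n in (10, 100, 1_000, 10_000):
    tail, _ = quad(g, n, np.inf)            # P(|X_1| > n) = 2 * c * tail
    print(n, n * 2.0 * c * tail)            # n * P(|X_1| > n) -> 0, slowly

# E|X_1| = 2c * integral_e^inf dx/(x log x), which diverges at infinity
```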

Klass and Teicher extended the Feller WLLN to the case of a more general norming sequence {bₙ, n ≥ 1}, thereby obtaining a WLLN analog of Feller's extension of the Marcinkiewicz–Zygmund SLLN.

Good references for studying the LLNs are the books by Révész, Stout, Loève, Chow and Teicher, and Petrov. While the LLNs have been studied extensively in the case of independent summands, some of the LLNs presented in these books involve summands obeying a dependence structure other than that of independence.

A large literature of investigation on the LLNs for sequences of Banach space valued random elements has emerged, beginning with the pioneering work of Mourier. See the monograph by Taylor for background material and early results. Excellent references are the books by Vakhania, Tarieladze, and Chobanyan, and by Ledoux and Talagrand. More recent results are provided by Adler et al., Cantrell and Rosalsky, and the references in these two articles.

About the Author
Professor Rosalsky is Associate Editor, Journal of Applied Mathematics and Stochastic Analysis, and Associate Editor, International Journal of Mathematics and Mathematical Sciences. He has collaborated with research workers from many different countries. At the University of Florida, he twice received a Teaching Improvement Program Award and on five occasions was named an Anderson Scholar Faculty Honoree, an honor reserved for faculty members designated by undergraduate honor students as being the most effective and inspiring.

Cross References
7Almost Sure Convergence of Random Variables
7Borel–Cantelli Lemma and Its Generalizations
7Chebyshev's Inequality
7Convergence of Random Variables
7Ergodic Theorem
7Estimation: An Overview
7Expected Value
7Foundations of Probability
7Glivenko-Cantelli Theorems
7Probability Theory: An Outline
7Random Field
7Statistics, History of
7Strong Approximations in Probability and Statistics

References and Further Reading
Adler A, Rosalsky A, Taylor RL () A weak law for normed weighted sums of random elements in Rademacher type p Banach spaces. J Multivariate Anal
Cantrell A, Rosalsky A () A strong law for compactly uniformly integrable sequences of independent random elements in Banach spaces. Bull Inst Math Acad Sinica
Chow YS, Teicher H () Probability theory: independence, interchangeability, martingales, 3rd edn. Springer, New York
Feller W () A limit theorem for random variables with infinite moments. Am J Math
Klass M, Teicher H () Iterated logarithm laws for asymmetric random variables barely with or without finite mean. Ann Probab
Ledoux M, Talagrand M () Probability in Banach spaces: isoperimetry and processes. Springer, Berlin
Loève M () Probability theory, vol I, 4th edn. Springer, New York
Mourier E () Eléments aléatoires dans un espace de Banach. Annales de l'Institut Henri Poincaré
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon, Oxford
Rényi A () Probability theory. North-Holland, Amsterdam
Révész P () The laws of large numbers. Academic, New York
Stout WF () Almost sure convergence. Academic, New York
Taylor RL () Stochastic convergence of weighted sums of random elements in linear spaces. Lecture notes in mathematics, Springer, Berlin
Vakhania NN, Tarieladze VI, Chobanyan SA () Probability distributions on Banach spaces. D Reidel, Dordrecht, Holland

Learning Statistics in a Foreign Language

Khidir M. Abdelbasit
Sultan Qaboos University, Muscat, Sultanate of Oman

Background
The Sultanate of Oman is an Arabic-speaking country where the medium of instruction in pre-university education is Arabic. In Sultan Qaboos University (SQU), all sciences (including Statistics) are taught in English. The reason is that most of the scientific literature is in English, and teaching in the native language may leave graduates at a disadvantage. Since only few instructors speak Arabic, the university adopts a policy of no communication in Arabic in classes and office hours. Students are required to achieve a minimum level in English (a minimum IELTS score) before they start their study program. Very few students achieve that level on entry, and the majority spend about two semesters doing English only.

Language and Cultural Problems
It is to be expected that students from a non-English-speaking background will face serious difficulties when learning in English, especially in the first year or two. Most of the literature discusses problems faced by foreign students pursuing study programs in an English-speaking country, or by a minority in a multi-cultural society (see, for example, Coutis and Wood; Hubbard; Koh). Such students live (at least while studying) in an English-speaking community with which they have to interact on a daily basis. These difficulties are more serious for our students, who are studying in their own country, where English is not the official language. They hardly use English outside classrooms and avoid talking in class as much as they can.

My SQU Experience
Statistical concepts and methods are most effectively taught through real-life examples that the students appreciate and understand. We use the most popular textbooks in the USA for our courses. These textbooks use this approach with US students in mind. Our main problems are:
• Most of the examples and exercises used are completely alien to our students. The discussions meant to maintain the students' interest only serve to put ours off. With limited English, they have serious difficulties understanding what is explained and hence tend not to listen to what the instructor is saying. They do not read the textbooks, because these contain pages and pages of lengthy explanations and discussions they cannot follow. A direct effect is that students may find the subject boring and quickly lose interest. Their attention then turns to the art of passing tests instead of acquiring the intended knowledge and skills. To pass their tests, they use both their class and study times looking through examples, concentrating on what formula to use and where to plug the numbers they have to get the answer. This way they manage to do the mechanics fairly well, but the concepts are almost entirely missed.


• The problem is worse with introductory probability courses, where games of chance are extensively used as illustrative examples in textbooks. Most of our students have never seen a deck of playing cards, and some may even be offended by discussing card games in a classroom.

The burden of finding strategies to overcome these difficulties falls on the instructor. Statistical terms and concepts such as parameter/statistic, sampling distribution, unbiasedness, consistency, sufficiency, and ideas underlying hypothesis testing are not easy to get across even in the students' own language. To do that in a foreign language is a real challenge. For the Statistics program to be successful, all (or at least most of the) instructors involved should be up to this challenge. This is a time-consuming task with little reward other than self-satisfaction. In SQU the problem is compounded further by the fact that most of the instructors are expatriates on short-term contracts, who are more likely to use their time for personal career advancement rather than time-consuming community service jobs.

What Can Be Done
For our first Statistics course, we produced a manual that contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good collection of problems from local culture to motivate the students. The manual was well received by the students, to the extent that students prefer to practice with examples from the manual rather than the textbook.

Textbooks written in English that are brief and to the point are needed. These should include examples and exercises from the students' own culture. A student trying to understand a specific point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students' faces clearly indicate that you have got nothing across, it is natural to try explaining more, using more examples. In an end-of-semester evaluation of a course I taught, a student once wrote: "The instructor explains things more than needed. He makes simple points difficult." This indicates that, when teaching in a foreign language, lengthy oral or written explanations are not helpful. A better strategy will be to explain concepts and techniques briefly and provide plenty of examples and exercises that will help the students absorb the material by osmosis. The basic statistical concepts can only be effectively communicated to students in their own language. For this reason, textbooks should contain a good glossary where technical terms and concepts are explained using the local language.

I expect such textbooks to go a long way to enhance students' understanding of Statistics. An international project can be initiated to produce an introductory statistics textbook with different versions intended for different geographical areas. The English material will be the same; the examples will vary to some extent from area to area, with glossaries in local languages. Universities in the developing world naturally look at western universities as models, and international (western) involvement in such a project is needed for it to succeed. The project will be a major contribution to the promotion of understanding Statistics and excellence in statistical education in developing countries. The International Statistical Institute takes pride in supporting statistical progress in the developing world. This project can lay the foundation for this progress and hence is worth serious consideration by the institute.

Cross References
7Online Statistics Education
7Promoting, Fostering and Development of Statistics in Developing Countries
7Selection of Appropriate Statistical Methods in Developing Countries
7Statistical Literacy, Reasoning, and Thinking
7Statistics Education

References and Further Reading
Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. httpwwwmathuocgr~ictmProceedingspappdf
Hubbard R () Teaching statistics to students who are learning in a foreign language. ICOTS
Koh E () Teaching statistics to students with limited language skills using MINITAB. httparchivesmathutkeduICTCMVOLCpaperpdf

Least Absolute Residuals Procedure

Richard William Farebrother
Honorary Reader in Econometrics
Victoria University of Manchester, Manchester, UK

For i = 1, …, n, let {xᵢ₁, xᵢ₂, …, x_{iq}, yᵢ} represent the ith observation on a set of q + 1 variables, and suppose that we wish to fit a linear model of the form
\[ y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{iq}\beta_q + \epsilon_i \]


to these n observations. Then, for p > 0, the Lp-norm fitting procedure chooses values for b₁, …, b_q to minimise the Lp-norm of the residuals, \((\sum_{i=1}^{n} |e_i|^p)^{1/p}\), where, for i = 1, …, n, the ith residual is defined by
\[ e_i = y_i - x_{i1}b_1 - x_{i2}b_2 - \dots - x_{iq}b_q. \]
The most familiar Lp-norm fitting procedure, known as the 7least squares procedure, sets p = 2 and chooses values for b₁, …, b_q to minimize the sum of the squared residuals, ∑ᵢ₌₁ⁿ eᵢ².

A second choice, to be discussed in the present article, sets p = 1 and chooses b₁, …, b_q to minimize the sum of the absolute residuals, ∑ᵢ₌₁ⁿ |eᵢ|.

A third choice sets p = ∞ and chooses b₁, …, b_q to minimize the largest absolute residual, maxᵢ₌₁ⁿ |eᵢ|.

Setting uᵢ = eᵢ and vᵢ = 0 if eᵢ ≥ 0, and uᵢ = 0 and vᵢ = −eᵢ if eᵢ < 0, we find that eᵢ = uᵢ − vᵢ, so that the least absolute residuals (LAR) fitting problem chooses b₁, …, b_q to minimize the sum of the absolute residuals
\[ \sum_{i=1}^{n} (u_i + v_i) \]
subject to
\[ x_{i1}b_1 + x_{i2}b_2 + \dots + x_{iq}b_q + u_i - v_i = y_i \quad\text{for } i = 1, \dots, n \]
and uᵢ ≥ 0, vᵢ ≥ 0 for i = 1, …, n. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.
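This linear program can be handed directly to a general-purpose LP solver. The following sketch uses scipy; the data, including one deliberate outlier, are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def lar_fit(X, y):
    """L1 (least absolute residuals) fit: minimize sum(u_i + v_i)
    subject to X b + u - v = y, u >= 0, v >= 0, b free."""
    n, q = X.shape
    cost = np.concatenate([np.zeros(q), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * q + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:q]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 30)
y[0] += 50.0                                    # a gross outlier
X = np.column_stack([np.ones_like(x), x])
print(lar_fit(X, y))                            # near (1, 2) despite the outlier
```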

Gauss has noted (when q = 2) that solutions of this problem are characterized by the presence of a set of q zero residuals. Such solutions are robust to the presence of outlying observations. Indeed, they remain constant under variations in the other n − q observations, provided that these variations do not cause any of the residuals to change their signs.

The LAR fitting procedure corresponds to the maximum likelihood estimator when the є-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the є-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711–1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant, xᵢ₁ = 1, and the values of b₁ and b₂ are constrained to satisfy the adding-up condition ∑ᵢ₌₁ⁿ (yᵢ − b₁ − xᵢ₂b₂) = 0, usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with q ≥ 3 variables are still to be found in the literature.

For an account of recent developments in this area, see the series of volumes edited by Dodge. For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley, and Rhodes, see Farebrother (Fitting Linear Relationships). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother (Visualizing Statistical Models and Concepts).

About the Author
Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement (he is blind). He was thereafter an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations; Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900; and Visualizing Statistical Models and Concepts. He has also published many research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr. Farebrother was also Visiting Associate Professor at Monash University, Australia, and Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
7Least Squares
7Residuals

References and Further Reading
Dodge Y (ed) () Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) () L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) () L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) () Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel
Farebrother RW () Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW () Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stepniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems F(x) ≈ b of equations, where b ∈ ℝⁿ while F(x) is a column of the form
\[ F(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{n-1}(x) \\ f_n(x) \end{bmatrix} \]
with entries fᵢ = fᵢ(x), i = 1, …, n, where x = (x₁, …, x_p)ᵀ. The LS problem is linear when each fᵢ is a linear function, and nonlinear if not.

The linear LS problem refers to a system Ax = b of linear equations. Such a system is overdetermined if n > p. If b ∉ range(A), the system has no proper solution and will be denoted by Ax ≈ b. In this situation we are seeking a solution of some optimization problem. The name "Least Squares" is justified by the l₂-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model
\[ y_i = f_i(x_{i1}, \dots, x_{ip}; \beta_1, \dots, \beta_p) + e_i \quad\text{for } i = 1, \dots, n, \tag{1} \]
where xᵢⱼ, i = 1, …, n, j = 1, …, p, are some constants given by experimental design; fᵢ, i = 1, …, n, are given functions depending on unknown parameters βⱼ, j = 1, …, p; while yᵢ, i = 1, …, n, are values of these functions observed with some random errors eᵢ. We want to estimate the unknown parameters βⱼ on the basis of the data set {xᵢⱼ, yᵢ}.

In linear regression each fᵢ is a linear function of type fᵢ = ∑ⱼ₌₁ᵖ cᵢⱼ(x₁, …, xₙ)βⱼ, and the model (1) may be presented in vector-matrix notation as
\[ y = X\beta + e, \]

where y = (y₁, …, yₙ)ᵀ, e = (e₁, …, eₙ)ᵀ, and β = (β₁, …, β_p)ᵀ, while X is an n × p matrix with entries xᵢⱼ. If e₁, …, eₙ are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to finding a vector β̂ that minimizes the norm
\[ \|y - X\beta\|^2 = (y - X\beta)^T (y - X\beta). \]
Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand, the latter reduces to solving the consistent system
\[ X^T X \beta = X^T y \]
of linear equations, called the normal equations. In particular, if rank(X) = p, then the system has a unique solution of the form
\[ \hat{\beta} = (X^T X)^{-1} X^T y. \]
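In code, the normal equations can be solved directly, although orthogonal-factorization routines are numerically preferable because they avoid forming XᵀX. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # made-up design matrix
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=50)  # response with small noise

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)    # solve X'X b = X'y directly
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # QR/SVD-based least squares
print(beta_ne, beta_ls)                        # the two solutions agree
```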

For linear regression yᵢ = α + βxᵢ + eᵢ with one regressor x, the BLU estimators of the parameters α and β may be presented in the convenient form
\[ \hat{\beta} = \frac{n s_{xy}}{n s_x^2} \quad\text{and}\quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \]
where
\[ n s_{xy} = \sum_i x_i y_i - \frac{(\sum_i x_i)(\sum_i y_i)}{n}, \qquad n s_x^2 = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n}, \]
\[ \bar{x} = \frac{\sum_i x_i}{n} \quad\text{and}\quad \bar{y} = \frac{\sum_i y_i}{n}. \]
For its computation we only need to use a simple pocket calculator.

Example: Consider data giving the number of residents in thousands (x) and the unemployment rate in % (y) for several cities of Poland, and estimate the parameters β and α. From the table one computes ∑ᵢxᵢ, ∑ᵢyᵢ, ∑ᵢxᵢ², and ∑ᵢxᵢyᵢ; the quantities ns²ₓ and ns_{xy} then follow, and hence the estimates β̂ and α̂ and the fitted line f(x) = α̂ + β̂x.
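The "pocket calculator" recipe translates directly into a few lines of code. The data below are invented purely for illustration:

```python
# Invented data: residents in thousands (x), unemployment rate in % (y)
x = [45, 120, 250, 310, 500]
y = [12.1, 10.5, 9.8, 9.1, 8.4]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

nsxy = sum_xy - sum_x * sum_y / n       # n * s_xy
nsx2 = sum_xx - sum_x ** 2 / n          # n * s_x^2
beta = nsxy / nsx2
alpha = sum_y / n - beta * sum_x / n    # y-bar - beta * x-bar
print(beta, alpha)
```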


If the variance–covariance matrix of the error vector e coincides (except for a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased estimation reduces to the minimization of (y − Xβ)ᵀV⁻¹(y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form
\[ \hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y. \]

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known. Instead, iterative algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative for the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm
\[ \big\|[A, b] - [\hat{A}, \hat{b}]\big\|_F \]
for all  ∈ ℝ^{n×k} and b̂ ∈ range(Â), where the Frobenius norm is defined by ‖(aᵢⱼ)‖²_F = ∑ᵢⱼ a²ᵢⱼ. Once a minimizing [Â, b̂] is found, then any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique; simple low-dimensional examples of both phenomena can be constructed.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction b − Ax̂ has smaller l₂-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: For any n × k matrix A with real entries there exist orthonormal matrices P = [p₁, …, pₙ] and Q = [q₁, …, q_k] such that
\[ P^T A Q = \mathrm{diag}(\sigma_1, \dots, \sigma_m), \quad\text{where } \sigma_1 \ge \dots \ge \sigma_m \ge 0 \text{ and } m = \min\{n, k\}. \]
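The SVD yields the TLS solution through the right singular vector attached to the smallest singular value of the augmented matrix [A, b]. A sketch of this classical construction follows; the data are made up, with noise added to both sides:

```python
import numpy as np

def tls_solve(A, b):
    """TLS solution of Ax ~ b from the SVD of the augmented matrix [A, b];
    assumes the solution exists and is unique."""
    n, k = A.shape
    _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
    v = Vt[-1]                       # right singular vector of smallest sigma
    if np.isclose(v[k], 0.0):
        raise ValueError("TLS solution does not exist")
    return -v[:k] / v[k]

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))
x_true = np.array([1.0, 2.0])
b = A @ x_true
# Perturb both A and b, as in the TLS error model
A_noisy = A + rng.normal(scale=0.05, size=A.shape)
b_noisy = b + rng.normal(scale=0.05, size=b.shape)
print(tls_solve(A_noisy, b_noisy))   # close to x_true
```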

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björk A () Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenberg H () Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J () The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J () Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe the financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed; thus the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and the asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model. Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {Xₜ, t ≥ 0} that has right continuous sample paths with left limits is said to be a Lévy process if the following hold:
1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of X_{t+s} − Xₜ is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, X_{t+s} − Xₜ is independent of σ(Xᵤ, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0, lim_{s→t} P(|Xₜ − Xₛ| > є) = 0.
That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left hand limits.

If Φₜ(z) is the characteristic function of a Lévy process at time t, then its characteristic component φ(z) := ln Φₜ(z)/t is of the form
\[ \varphi(z) = iza - \frac{z^2 b}{2} + \int_{\mathbb{R}} \big[\exp(izx) - 1 - izx\, I_{\{|x|<1\}}\big]\, \nu(dx), \]
where a ∈ ℝ, b ∈ ℝ₊, and ν is a measure on ℝ satisfying ν({0}) = 0 and ∫_ℝ (1 ∧ x²) ν(dx) < ∞.

The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows:
Given any a ∈ ℝ, b ∈ ℝ₊, and measure ν on ℝ satisfying ν({0}) = 0 and ∫_ℝ (1 ∧ x²) ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square integrable martingale. The characteristic components of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z), and φ⁽³⁾(z), respectively) are as follows:
\[ \varphi^{(1)}(z) = iza - \frac{z^2 b}{2}, \]
\[ \varphi^{(2)}(z) = \int_{|x|\ge 1} \big(\exp(izx) - 1\big)\, \nu(dx), \]
\[ \varphi^{(3)}(z) = \int_{|x|< 1} \big(\exp(izx) - 1 - izx\big)\, \nu(dx). \]

Examples of the Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift µ and volatility rate σ² if µ = a, b = σ², and ν(ℝ) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift µ > 0 and volatility rate σ². For any x > 0, let Tₓ = inf{t : Xₜ > x}. Then {Tₓ} is an increasing Lévy process (called the inverse Brownian process); its Lévy measure is given by
\[ \nu(dx) = \frac{1}{\sqrt{2\pi\sigma^2 x^3}} \exp\!\left(-\frac{x\mu^2}{2\sigma^2}\right) dx. \]

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ x ν(dx) = 0, and its Lévy measure is given by
\[ \nu(dx) = \frac{\alpha}{x} \exp(-x/\beta)\, dx, \quad x > 0, \]
where α, β > 0. It follows that the mean term E(X₁) and the variance term V(X₁) for the process are equal to αβ and αβ², respectively.

The following is a simulated sample path of a gamma process (Fig. 1).
Lévy Processes. Fig. 1 Gamma process
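A path like the one in Fig. 1 can be generated from independent gamma increments. A minimal sketch; the parameter values are arbitrary:

```python
import numpy as np

def gamma_process_path(alpha, beta, T=10.0, nsteps=1000, seed=0):
    """Gamma process with Levy measure (alpha/x) exp(-x/beta) dx:
    the increment over dt is Gamma(shape=alpha*dt, scale=beta)."""
    rng = np.random.default_rng(seed)
    dt = T / nsteps
    inc = rng.gamma(shape=alpha * dt, scale=beta, size=nsteps)
    return np.concatenate([[0.0], np.cumsum(inc)])

path = gamma_process_path(alpha=1.0, beta=1.0)
print(path[-1])    # X_T, with E(X_T) = alpha*beta*T and V(X_T) = alpha*beta^2*T
```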

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: µ – the Brownian process drift term, σ – the volatility of the Brownian process, and ν – the variance term of the gamma process.

The following are two simulated sample paths: one for a Brownian motion with a given drift term µ and volatility term σ, and the other for a variance gamma process with the same values for the drift and volatility terms (Fig. 2).
Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths
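Paths like those in Fig. 2 can be produced in a few lines. Here the variance gamma path is obtained by subordinating a Brownian motion to a gamma process with unit mean rate; all parameter values are made up:

```python
import numpy as np

def variance_gamma_path(mu, sigma, nu, T=10.0, nsteps=1000, seed=0):
    """Brownian motion with drift mu and volatility sigma, run on the random
    clock of a gamma subordinator with mean dt and variance nu*dt per step."""
    rng = np.random.default_rng(seed)
    dt = T / nsteps
    dG = rng.gamma(shape=dt / nu, scale=nu, size=nsteps)   # gamma time change
    dX = mu * dG + sigma * np.sqrt(dG) * rng.normal(size=nsteps)
    return np.concatenate([[0.0], np.cumsum(dX)])

path = variance_gamma_path(mu=0.05, sigma=0.2, nu=0.5)
print(path[-1])
```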

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Cinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in Statistics and Operations Research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using PMλ,τ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS () The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P () Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC () The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM () A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (eₓ) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as
\[ e_x = \frac{T_x}{l_x}, \]
where x is age, Tₓ is the number of person-years lived aged x and over, and lₓ is the number of survivors at age x according to the life table.
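Given the lₓ column of a life table, the whole eₓ schedule follows mechanically. A minimal sketch; the survivor counts are invented, and the table is truncated, so the resulting values only illustrate the arithmetic:

```python
import numpy as np

# Invented survivors l_x at exact ages 0..5 (a truncated toy table)
lx = np.array([100000.0, 99880.0, 99865.0, 99856.0, 99849.0, 99843.0])

Lx = (lx[:-1] + lx[1:]) / 2.0          # person-years lived in each age interval
Tx = np.cumsum(Lx[::-1])[::-1]         # person-years lived at age x and over
ex = Tx / lx[:-1]                      # e_x = T_x / l_x
print(np.round(ex, 2))
```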


Life expectancy can be calculated for combined sexes or separately for males and females. There can be significant differences between the sexes.

Life expectancy at birth (e₀) is the average number of years a newborn child can expect to live if current mortality trends remain constant:
\[ e_0 = \frac{T_0}{l_0}, \]
where T₀ is the total size of the population and l₀ is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e₀ < e₁ < e₂ and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was below 30 years; since then, life expectancy at birth has risen to a global average of about 67 years, and to more than 80 years in the richest countries (Riley). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents: from below 50 years in some central African populations to 80 years and above in many European countries. The more developed regions enjoy a substantially higher average life expectancy at birth than the less developed regions. The two continents that display the most extreme differences in life expectancies are North America and Africa, where, as of recently, the gap between average life expectancies amounts to more than 20 years (UN data).

Countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and, above all, Japan; Japanese women enjoy the longest average life expectancy in the world (WHO).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is far lower. Some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries women live longer than men; worldwide, the gap between female and male life expectancy at birth is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes, whereas Tonga is exceptional in that life expectancy for males exceeds that for females (WHO).

Life expectancy is assumed to rise continuously. According to UN estimates, global life expectancy at birth is likely to keep rising over the coming decades, although it will continue to vary considerably across countries. Long-range United Nations population projections predict that by 2300 people in the longest-lived countries can expect to live about 100 years or more, with Liberia and Japan marking the low and high ends of the projected range.

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz and Preston et al.

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. WHO, Geneva
World population prospects: the revision, highlights () United Nations, New York
World population to 2300 () United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function
\[ \mu(x) = -\frac{d}{dx} \ln\{1 - F(x)\} = \frac{f(x)}{1 - F(x)}. \]
Heuristically, it is interpreted by the relation µ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely,
\[ F(x) = 1 - \exp\left\{-\int_0^x \mu(s)\, ds\right\}. \]
The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is
\[ {}_{t}p_x = P\{X > x + t \mid X > x\} = \frac{\ell(x+t)}{\ell(x)} = \exp\left\{-\int_0^t \mu(x+s)\, ds\right\}, \]
and the non-survival probability is (the converse) ₜqₓ = 1 − ₜpₓ. For t = 1 one writes qₓ = ₁qₓ and pₓ = ₁pₓ. In particular, we get ℓ(x + 1) = ℓ(x)pₓ. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of pₓ for all relevant x.

The life expectancy is e̊₀ = EX = ∫₀^∞ ℓ(x)dx / ℓ(0). (The subscript 0 in e̊₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript o indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is
\[ \mathring{e}_x = E(X - x \mid X > x) = \int_0^{\infty} \ell(x+t)\, dt \,\big/\, \ell(x), \]
i.e., it is the expected lifetime remaining to someone who has attained age x.

ie it is the expected lifetime remaining to someone whohas attained age xTo turn to the statistical estimation of these various

quantities suppose that the function micro(x) is piecewiseconstant which means that we take it to equal some con-stant say microj over each interval (xj xj+) for some partitionxj of the age axis For a collection Xi of independentobservations of X let Dj be the number of Xi that fall inthe interval (xj xj+) In mortality applications this is thenumber of deaths observed in the given interval For thecohort of the initially newborn Dj is the number of indi-viduals whodie in the interval (called the occurrences in theinterval) If individual i dies in the interval he or she willof course have lived for Xi minus xj time units during the inter-val Individuals who survive the interval will have lived forxj+ minus xj time units in the interval and individuals whodo not survive to age xj will not have lived during thisinterval at all When we aggregate the time units lived in(xj xj+) over all individuals we get a total Rj which iscalled the exposures for the interval the idea being thatindividuals are exposed to the risk of death for as long asthey live in the interval In the simple case where there areno relations between the individual parameters microj the col-lection DjRj constitutes a statistically sucient set ofobservations with a likelihood Λ that satises the relationln Λ = sumjminusmicrojRj+Dj ln microjwhich is easily seen to bemax-imized by microj = DjRje latter fraction is therefore themaximum-likelihood estimator for microj (In some connec-tions an age schedule of mortality will be specied such asthe classical GompertzndashMakeham function microx = a + bcxwhich does represent a relationship between the intensityvalues at the dierent ages x normally for single-year agegroupsMaximum likelihood estimators can then be foundby plugging this functional specication of the intensitiesinto the likelihood function nding the values a b andc that maximize Λ and using a + bcx for the intensity inthe rest of the life table computations Methods that do notamount tomaximum likelihood estimationwill sometimesbe used because they involve simpler computations Withsome luck they provide starting values for the iterative pro-cess that must usually be applied to produce the maximumlikelihood estimators For an example see Forseacuten ())is whole schema can be extended trivially to cover cen-soring (withdrawals) provided the censoringmechanism isunrelated to the mortality process


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂ₓ in this interval, then p̂ₓ = e^{−µ̂ₓ} is an estimator of the single-year survival probability pₓ. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂ₓ for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
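The estimation chain µ̂ⱼ = Dⱼ/Rⱼ, p̂ₓ = e^{−µ̂ₓ}, ℓ̂(x + 1) = ℓ̂(x)p̂ₓ is a one-screen computation. A sketch with invented occurrence-exposure data:

```python
import numpy as np

# Invented single-year deaths D_x and exposures R_x for ages 0..4
D = np.array([120.0, 15.0, 9.0, 7.0, 6.0])
R = np.array([99000.0, 98800.0, 98700.0, 98650.0, 98600.0])

mu_hat = D / R               # occurrence-exposure rates: ML estimates of mu_x
p_hat = np.exp(-mu_hat)      # single-year survival probabilities

ell = [100000.0]             # radix l(0)
for p in p_hat:
    ell.append(ell[-1] * p)  # l(x + 1) = l(x) * p_x
print(np.round(ell, 1))
```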

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and is also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a pₓ schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the pₓ values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang, Elandt-Johnson and Johnson, Manton and Stallard, and Preston et al. For the history of the topic, consult Seal, Smith and Keyfitz, and Dupâquier.

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of ℝᵏ, and the variable y is assumed to take values in a subset of ℝⁿ; the likelihood function is defined by
\[ L(\theta) = L(\theta; y) = c\, f(y; \theta), \tag{1} \]
where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart, and Murphy and Van der Vaart. This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other and, for this reason, are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = \(\binom{n}{y}\) θʸ(1 − θ)ⁿ⁻ʸ, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)
\[ L(\theta; y) = \theta^y (1 - \theta)^{n-y} \]
and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme that recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ.
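Plotting and maximizing this likelihood takes only a few lines of code. A sketch with made-up data; the cutoff value used to delimit "plausible" values is an arbitrary choice:

```python
import numpy as np

n, y = 10, 7                               # made-up data: y successes in n trials
theta = np.linspace(0.001, 0.999, 999)
lik = theta**y * (1.0 - theta)**(n - y)    # L(theta; y), up to a constant

print(theta[np.argmax(lik)])               # maximized near y/n = 0.7
rel = lik / lik.max()                      # standardized by the maximum value
plausible = theta[rel > 0.15]              # a region of "plausible" theta values
print(plausible.min(), plausible.max())
```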

Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:
\[ L(\theta; y) = \exp\left\{-n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2\right\}, \]
where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y₁, …, yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹∑yᵢ and sample variance s² = (n − 1)⁻¹∑(yᵢ − ȳ)², or equivalently through ∑yᵢ and ∑yᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ̂)/L(θ) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of θ that is of interest; see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ)dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials, and assuming the uniform prior π(θ) = 1, the posterior probability that θ exceeds 1/2 is such-and-such," and so on.
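For the binomial model above with uniform prior π(θ) = 1, the posterior is the Beta(y + 1, n − y + 1) distribution, a standard conjugacy fact, so such posterior probabilities can be read off directly. A brief sketch, again with the hypothetical values y = 7 and n = 20:

```python
from scipy import stats

y, n = 7, 20                               # hypothetical data, as above
posterior = stats.beta(y + 1, n - y + 1)   # pi(theta | y) under pi(theta) = 1
print(posterior.sf(0.5))                   # posterior probability that theta > 1/2
```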

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior, through the likelihood, to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = ∂ log L(θ; Y)/∂θ.

If Y = (Y_1, . . . , Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →d χ²_k,  (2)

U(θ)ᵀ i⁻¹(θ)U(θ) →d χ²_k,  (3)

2{ℓ(θ̂) − ℓ(θ)} →d χ²_k,  (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result, that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.
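Result (4) is easy to check by simulation. The sketch below is a hedged illustration: the i.i.d. N(θ, 1) model is chosen so that the likelihood ratio statistic reduces to n(ȳ − θ₀)², and the simulated exceedance frequency of the χ²₁ upper 5% point, 3.841, should be close to 0.05:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0, reps = 50, 1.0, 20_000
# For i.i.d. N(theta, 1), 2{l(theta-hat) - l(theta0)} = n * (ybar - theta0)^2.
w = np.array([n * (rng.normal(theta0, 1.0, n).mean() - theta0) ** 2
              for _ in range(reps)])
print((w > 3.841).mean())   # upper 5% point of chi-square_1; prints ~0.05
```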

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²_{k₁}.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters: in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
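The divisor-n phenomenon can be seen directly by profiling out β. The sketch below (a hypothetical simulated design and coefficient values, for illustration only) maximizes the normal log-likelihood over β and then compares the profile-likelihood estimate of σ² with the usual degrees-of-freedom-corrected one:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, -0.3, 2.0])          # hypothetical coefficients
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0] # maximizes l(beta, sigma2) over beta
rss = np.sum((y - X @ beta_hat) ** 2)
print(rss / n)        # profile (ML) estimate of sigma^2: divisor n
print(rss / (n - p))  # usually preferred estimator: divisor n - p
```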

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

π_M(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y)dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ ∣ y₁; λ) or f(y; θ) = f₁(y₁ ∣ y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorizations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox; Fraser). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison, and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada. She has also served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation, "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci
Reid N () Likelihood. J Am Stat Assoc
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/…
Varin C () On composite marginal likelihood. Adv Stat Anal


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, . . . is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX₁ exists, then n⁻¹S_n → a

almost surely, i.e., with probability 1. Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
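Both approximations are easy to see in simulation. A small Python sketch (the exponential step distribution and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2_000, 1_000
x = rng.exponential(scale=2.0, size=(reps, n))   # a = EX_1 = 2, sigma = 2

print(x.mean(axis=1)[:3])                        # LLN: each average is near a = 2
zeta = (x.sum(axis=1) - n * 2.0) / (2.0 * np.sqrt(n))
print((zeta < 1.96).mean())                      # CLT: near Phi(1.96) ~ 0.975
```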

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX₁² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X₁ > x) + P(X₁ < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X₁ > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n⁻¹), P^{−1} being the generalized inverse of the function P (and we assume that EX₁ = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j, under which the LLN and CLT still hold true, have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X₁ > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, . . . , X_n of a more complex nature, first of all multivariate random vectors. If X_j ∈ ℝ^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X₁ − EX₁)(X₁ − EX₁)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, . . . , X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u ∣F*_n(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from the random sample X*_n of a large enough size n.
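A quick numerical illustration of the Glivenko–Cantelli theorem (a sketch; the standard normal F is an arbitrary choice, and the sup is evaluated at the jump points of F*_n, where it is attained):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    x = np.sort(rng.normal(size=n))
    F = norm.cdf(x)
    # F*_n jumps at the order statistics; check both sides of each jump.
    d = max(np.abs(np.arange(1, n + 1) / n - F).max(),
            np.abs(np.arange(0, n) / n - F).max())
    print(n, d)   # sup-distance shrinks toward 0 as n grows
```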

The existence of consistent estimators for the unknown parameters a = EX₁ and σ² = E(X₁ − a)² also follows from the LLN, since, as n → ∞,

a* = (1/n) Σ_{j=1}^n X_j → a almost surely,

(σ²)* = (1/n) Σ_{j=1}^n (X_j − a*)² = (1/n) Σ_{j=1}^n X_j² − (a*)² → σ² almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX₁ = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X₁) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X₁−θ)} = 1 has a solution μ > 0, then in the above example one has

P(S > x) ∼ c e^{−μx}, x → ∞,

where c is a known constant. If the distribution F of X₁ is subexponential (in this case E e^{μ(X₁−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, . . . , S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of a branching process and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M () Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one


or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i ∣ b_i ∼ N(X_i β + Z_i b_i, Σ_i),  (1)
b_i ∼ N(0, D),  (2)

where X_i and Z_i are (n_i × p) and (n_i × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ² I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i), V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i⁻¹(α)(y_i − X_i β). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i⁻¹ X_i β̂ + (I_{n_i} − Σ_i V_i⁻¹) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i⁻¹ and I_{n_i} − Σ_i V_i⁻¹, respectively. Note that the "numerator" of Σ_i V_i⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
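The shrinkage weighting is easy to verify numerically. A minimal sketch for a single unit with a random intercept (all variance values, the unit mean, and n_i below are hypothetical illustrative choices, and β is treated as known rather than estimated):

```python
import numpy as np

ni, sigma2, D = 5, 4.0, 9.0                  # hypothetical n_i, sigma^2, D
Zi = np.ones((ni, 1))                        # random-intercept design
Sigma_i = sigma2 * np.eye(ni)
Vi = Zi @ (D * Zi.T) + Sigma_i               # marginal covariance Z_i D Z_i' + Sigma_i

rng = np.random.default_rng(0)
Xi_beta = np.full(ni, 2.0)                   # fixed-effects profile X_i beta
yi = rng.multivariate_normal(Xi_beta, Vi)    # one simulated response profile

bi_hat = D * Zi.T @ np.linalg.solve(Vi, yi - Xi_beta)   # EB estimate of b_i
W = Sigma_i @ np.linalg.inv(Vi)
yi_hat = W @ Xi_beta + (np.eye(ni) - W) @ yi            # shrunken prediction
print(np.allclose(yi_hat, Xi_beta + Zi @ bi_hat))       # True: the formulas agree
```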

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.

L Linear Regression Models

Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ν₁, ν₂, embedded in white noise background: y[t] = β₁ + β₂ sin[ν₁t] + β₃ sin[ν₂t] + ε. We write β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (1, x₂, x₃), x₂ ≡ x₂(t) = sin[ν₁t], x₃(t) = sin[ν₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 for i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon instances i at which we select observations and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ν₁t_i], sin[ν₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_ik = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view shortly following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Keppler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ) random residuals whose variance θ > 0 was the same for all w. Of course this likewise amazed him, and in time he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-a-vis x. The method principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time, based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

y_i = x_i1 β_1 + ⋯ + x_ip β_p + ε_i, i ≤ n,  (1)

in which random errors ε are assumed to satisfy

Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n.

The interpretation is that response y_i is observed for the ith sample, in conjunction with numerical values (x_i1, . . . , x_ip) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^tr x)⁻¹ x^tr y = β + Mε,  (2)

for M = (x^tr x)⁻¹ x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)⁻¹ the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂_LS)_k = β_k, k ≤ p,  (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)², for every k ≤ p.  (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
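A compact numerical sketch of the least squares estimate (2), using a sound-signal design of the kind discussed above (the frequencies 1.3 and 2.7, the coefficient values, and the noise scale are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 20.0, n)
# Design matrix: constant term plus sine waves at two fixed known frequencies.
x = np.column_stack([np.ones(n), np.sin(1.3 * t), np.sin(2.7 * t)])
beta = np.array([1.0, 2.0, -0.5])                 # hypothetical true coefficients
y = x @ beta + rng.normal(scale=0.3, size=n)      # y = x beta + epsilon

M = np.linalg.inv(x.T @ x) @ x.T                  # M = (x'x)^{-1} x'
beta_ls = M @ y                                   # least squares estimate, Eq. (2)
print(beta_ls)                                    # close to (1.0, 2.0, -0.5)
```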

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_k ε and M_k(ε + c1) = M_k ε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ y = β + (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ ε.  (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
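A sketch of (5) under an assumed, known error covariance (the AR(1)-type correlation structure and all parameter values below are illustrative assumptions, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 2.0])
# Assumed known covariance: correlation 0.9^|i-k| among the errors.
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = x @ beta + rng.multivariate_normal(np.zeros(n), Sigma)

Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # Eq. (5)
print(beta_gls)                                          # close to (0.5, 2.0)
```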

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + (x − Ex)β for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_i1, . . . , x_ip, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(EX, 0). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(EX, 0), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + cEJ = E X1(X > c),

yielding E(Y ∣X > c) < E(X∣X > c). ◻
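A quick simulation check of the proposition (the bivariate normal pair with correlation 0.5 and the threshold c = 1 are arbitrary illustrative choices; any identically distributed pair satisfying the hypothesis would do):

```python
import numpy as np

rng = np.random.default_rng(0)
# X and Y identically distributed: standard normal margins, correlation 0.5.
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500_000)
x, y = xy[:, 0], xy[:, 1]

c = 1.0                                  # any c > EX = 0
sel = x > c
print(y[sel].mean(), x[sel].mean())      # E(Y | X>c) < E(X | X>c): reversion
```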

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics: inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V BasawaProfessorUniversity of Georgia Athens GA USA

Suppose x(n) = (x xn) is a sample from a stochasticprocess x = x x Let pn(x(n) θ) denote the jointdensity function of x(n) where θєΩ sub Rk is a parameterDene the log-likelihood ratio Λn = [ pn(x(n)θn)pn(x(n)θ) ] whereθn = θ + nminus h and h is a (k times ) vectore joint densitypn(x(n) θ) belongs to a local asymptotic normal (LAN)family if Λn satises

Λn = nminus htSn(θ) minus nminus (

htJn(θ)h) + op() ()

where Sn(θ) = dlnpn(x(n)θ)dθ Jn(θ) = minus d

lnpn(x(n)θ)dθdθ t and

(i)nminus Sn(θ) dETHrarr Nk(F(θ)) (ii)nminusJn(θ) pETHrarr F(θ)

()

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F⁻¹(θ)).  (3)

A large class of models involving the classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),  (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose, conditionally on V = v, (x_1, x_2, ..., x_n) are i.i.d. N(θ, v^{-1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

p_n(x(n), \theta) \propto \Big[ 1 + \tfrac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2 \Big]^{-(n/2 + 1)}

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →_d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
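As a quick numerical illustration of this mixture result, the following sketch (my own illustration, not part of the original entry; the sample sizes and seed are arbitrary) simulates the model and compares √n(x̄ − θ) with the t(2) distribution.

```python
# Sketch: variance mixture of normals giving a t(2) limit for sqrt(n)*(xbar - theta).
# Assumes numpy/scipy are available; n, n_rep, theta are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_rep, theta = 200, 20000, 0.0

v = rng.exponential(scale=1.0, size=n_rep)           # V ~ Exp(mean 1)
x_bar = rng.normal(theta, np.sqrt(1.0 / (n * v)))    # xbar | V=v ~ N(theta, 1/(n*v))
z = np.sqrt(n) * (x_bar - theta)                     # should behave like t(2)

# Compare a few upper quantiles with those of Student's t with 2 df
for q in (0.9, 0.95, 0.99):
    print(q, np.quantile(z, q).round(3), stats.t(df=2).ppf(q).round(3))
```

The heavy t(2) tails (infinite variance) are visible in the extreme quantiles, consistent with the remark above.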

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, ..., with x_0 = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

p_n(x(n), \theta) \propto \exp \Big[ -\tfrac{1}{2} \sum_{t=1}^{n} (x_t - \theta x_{t-1})^2 \Big]

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

\hat{\theta}_n = \Big( \sum_{t=1}^{n} x_t x_{t-1} \Big) \Big( \sum_{t=1}^{n} x_{t-1}^2 \Big)^{-1}

One can verify that √n(θ̂_n − θ) →_d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{-1} θ^n (θ̂_n − θ) →_d Cauchy for ∣θ∣ > 1.
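A small simulation (again my own sketch, with illustrative parameter values) makes the contrast visible: in the explosive case the normalized estimation error is heavy-tailed and Cauchy-like rather than normal.

```python
# Sketch: explosive AR(1) -- theta^n * (theta_hat - theta) / (theta^2 - 1) ~ Cauchy.
# Assumes numpy; theta = 1.05 and n = 200 are illustrative choices.
import numpy as np

rng = np.random.default_rng(2)
theta, n, n_rep = 1.05, 200, 5000
norm_err = np.empty(n_rep)

for r in range(n_rep):
    e = rng.normal(size=n)
    x = np.zeros(n + 1)                         # x_0 = 0
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    norm_err[r] = theta**n * (theta_hat - theta) / (theta**2 - 1)

# A Cauchy limit has no mean; its interquartile range is stable (2 for standard Cauchy)
print(np.percentile(norm_err, [25, 75]))
```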

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution function is a function only of (x − a)/b:

F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\Big( \frac{x - a}{b} \Big), \quad a \in \mathbb{R},\ b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = \frac{X - a}{b}

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

f_X(x \mid a, b) = \frac{1}{b} f\Big( \frac{x - a}{b} \Big)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.
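These moment relations are easy to check numerically; the following minimal sketch (not part of the original entry, using the logistic family purely as an example) verifies E(X) = a + bμ_Y and sd(X) = bσ_Y.

```python
# Sketch: E(X) = a + b*mu_Y and sd(X) = b*sigma_Y for X = a + b*Y.
# Assumes numpy; a, b and the reduced (standard logistic) variate are illustrative.
import numpy as np

rng = np.random.default_rng(3)
a, b = 10.0, 2.5
y = rng.logistic(loc=0.0, scale=1.0, size=10**6)   # reduced variable Y
x = a + b * y                                      # member of the same family

print(x.mean(), a + b * y.mean())                  # both close to a (since mu_Y = 0)
print(x.std(), b * y.std())                        # equal by construction
```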

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett ( ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.
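As an illustration of the plotting idea, here is a small sketch (my own, using the common plotting position (i − 0.5)/n as one of the choices discussed by the cited authors) that fits a least-squares line to a normal probability plot; the intercept and slope then estimate a and b.

```python
# Sketch: probability plotting with a fitted least-squares line.
# Plotting position (i - 0.5)/n is one common convention; a, b, n are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a, b, n = 5.0, 2.0, 100
sample = np.sort(rng.normal(a, b, size=n))       # ordered sample X_(1) <= ... <= X_(n)

pp = (np.arange(1, n + 1) - 0.5) / n             # plotting positions
q = stats.norm.ppf(pp)                           # percentile function of the reduced dist.

slope, intercept = np.polyfit(q, sample, 1)      # least-squares line
print("a_hat =", intercept, "b_hat =", slope)
```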

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b \alpha_{i:n} + \varepsilon_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd () – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

\mathbf{x} = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \boldsymbol{\alpha} = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \boldsymbol{\theta} = \begin{pmatrix} a \\ b \end{pmatrix}, \quad \mathbf{A} = (\mathbf{1}\ \boldsymbol{\alpha}),

the regression model now reads

x = A \theta + \varepsilon

with variance–covariance matrix

\mathrm{Var}(x) = b^2 B,

where B is the variance–covariance matrix of the reduced order statistics. The GLS estimator of θ is

\hat{\theta} = (A' \Omega A)^{-1} A' \Omega x, \quad \Omega = B^{-1},

and its variance–covariance matrix reads

\mathrm{Var}(\hat{\theta}) = b^2 (A' \Omega A)^{-1}.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
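The GLS estimator is straightforward to implement once α and B are available; in the sketch below (an illustration of the formula above, not Rinne's own code) they are approximated by Monte Carlo simulation of reduced order statistics from the logistic distribution.

```python
# Sketch: BLUE of (a, b) from order statistics, theta_hat = (A' B^{-1} A)^{-1} A' B^{-1} x.
# alpha and B are approximated by simulating reduced (standard logistic) order statistics.
import numpy as np

rng = np.random.default_rng(5)
n, n_sim = 20, 20000
y_ord = np.sort(rng.logistic(size=(n_sim, n)), axis=1)   # reduced order statistics
alpha = y_ord.mean(axis=0)                               # E(Y_{i:n})
B = np.cov(y_ord, rowvar=False)                          # Cov(Y_{i:n}, Y_{j:n})

a_true, b_true = 3.0, 1.5
x = np.sort(rng.logistic(loc=a_true, scale=b_true, size=n))

A = np.column_stack([np.ones(n), alpha])
Omega = np.linalg.inv(B)                                 # Omega = B^{-1}
theta_hat = np.linalg.solve(A.T @ Omega @ A, A.T @ Omega @ x)
print("a_hat, b_hat =", theta_hat)
```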

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ()), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ~ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

R_i \mid P_i \sim \mathrm{Binomial}(m_i, P_i)
\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \sigma Z
Z \sim N(0, 1)

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[R_i] = m_i \pi_i \quad \text{and} \quad \mathrm{Var}(R_i) = m_i \pi_i (1 - \pi_i) [1 + \sigma^2 (m_i - 1) \pi_i (1 - \pi_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability: the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
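The overdispersion formula can be checked by simulation; the following sketch (with illustrative values of η, σ and m, not taken from the entry) compares the empirical variance of R with the binomial variance and with Williams' small-σ² approximation.

```python
# Sketch: logit-normal overdispersion, Var(R) ~ m*pi*(1-pi)*[1 + sigma^2*(m-1)*pi*(1-pi)].
# eta, sigma and m are illustrative; assumes numpy.
import numpy as np

rng = np.random.default_rng(6)
eta, sigma, m, n_rep = 0.5, 0.4, 50, 200000

z = rng.normal(size=n_rep)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))     # P ~ logistic-normal
r = rng.binomial(m, p)                           # R | P ~ Binomial(m, P)

pi = 1.0 / (1.0 + np.exp(-eta))                  # logit(pi) = eta
var_binom = m * pi * (1 - pi)
var_williams = var_binom * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), var_williams, var_binom)          # empirical > binomial; ~ Williams
```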

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; \theta_i, \phi) = \exp\{ (y_i \theta_i - b(\theta_i))/\alpha(\phi) + c(y_i; \phi) \}  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; \pi_i) = \exp\{ y_i \ln(\pi_i/(1 - \pi_i)) + \ln(1 - \pi_i) \}  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(\mu_i; y_i) = \sum_{i=1}^{n} \{ y_i \ln(\mu_i/(1 - \mu_i)) + \ln(1 - \mu_i) \}  (4)

or

L(\mu_i; y_i) = \sum_{i=1}^{n} \{ y_i \ln(\mu_i) + (1 - y_i) \ln(1 - \mu_i) \}  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 \sum_{i=1}^{n} \{ y_i \ln(y_i/\mu_i) + (1 - y_i) \ln((1 - y_i)/(1 - \mu_i)) \}  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We therefore have

x_i'\beta = \ln(\mu_i/(1 - \mu_i)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n  (7)

The value of μ_i for each observation in the logistic model is calculated as

\mu_i = 1/(1 + \exp(-x_i'\beta)) = \exp(x_i'\beta)/(1 + \exp(x_i'\beta))  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{n} (y_i - \mu_i) x_i  (9)

\frac{\partial^2 L(\beta)}{\partial \beta \partial \beta'} = -\sum_{i=1}^{n} x_i x_i' \, \mu_i (1 - \mu_i)  (10)
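Equations (9) and (10) translate directly into a Newton–Raphson iteration. The sketch below (an illustration with simulated data, not the entry's own code) solves the score equation; the standard errors come from the inverse of the negative Hessian.

```python
# Sketch: Newton-Raphson for logistic regression using gradient (9) and Hessian (10).
# Assumes numpy; data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))                    # fitted probabilities, eq. (8)
    grad = X.T @ (y - mu)                               # eq. (9)
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X        # eq. (10)
    step = np.linalg.solve(hess, -grad)                 # beta_new = beta - H^{-1} g
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))             # SEs from observed information
print("beta_hat =", beta, "SE =", se, "odds ratios =", np.exp(beta))
```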

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of adults surviving divided by the number not surviving; likewise, the odds of survival for children is the number of children surviving divided by the number not surviving. The ratio of the odds of survival for adults to the odds of survival for children is the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
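Since the cell counts of the table above are not reproduced here, the sketch below uses hypothetical 2×2 counts (clearly labeled as such) simply to show how the component odds and the odds ratio are computed, and why the same number emerges as exp(β₁) from a logistic fit with a single binary predictor.

```python
# Sketch: odds and odds ratio from a 2x2 table. The counts are HYPOTHETICAL,
# chosen only to illustrate the arithmetic (odds ratio ~ 0.5, as in the text).
surv_adult, died_adult = 650, 1400      # hypothetical counts
surv_child, died_child = 55, 55         # hypothetical counts

odds_adult = surv_adult / died_adult            # odds of survival, adults
odds_child = surv_child / died_child            # odds of survival, children
oddsratio = odds_adult / odds_child
print(odds_adult, odds_child, oddsratio)

# For a single binary predictor, the logistic MLE reproduces this ratio:
# logit(P) = b0 + b1*adult  =>  exp(b1) equals the sample odds ratio above.
import math
b1 = math.log(oddsratio)
print(math.exp(b1))                             # same odds ratio
```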

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; \pi_i, m_i) = \binom{m_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i}  (11)

with a corresponding log-likelihood function expressed as

L(\mu_i; y_i, m_i) = \sum_{i=1}^{n} \Big\{ y_i \ln(\mu_i/(1 - \mu_i)) + m_i \ln(1 - \mu_i) + \ln \binom{m_i}{y_i} \Big\}  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i). Consider the data below:

y    cases    x1    x2    x3
1    3
⋮    ⋮

y indicates the number of times a specific pattern of covariates is successful, and cases is the number of observations having that specific covariate pattern. The first observation in the table informs us that there are three cases having a common set of values of x1, x2 and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1
x2
x3
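The equivalence of grouped and individual formats described next can be demonstrated directly. The sketch below uses hypothetical stand-in values (the table's actual entries are not reproduced here) and fits both formats with the same Newton iteration; the grouped rows expand to ten individual 0/1 observations.

```python
# Sketch: grouped (binomial) and individual (Bernoulli) data give identical
# logistic estimates. Data values are HYPOTHETICAL stand-ins for the table above.
import numpy as np

# grouped format: y successes out of `cases`, one row per covariate pattern
y     = np.array([1.0, 3.0, 2.0])
cases = np.array([3.0, 4.0, 3.0])
X     = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # intercept + one predictor

def fit(y, m, X, iters=50):
    """Newton-Raphson for binomial-logistic; m = 1 gives the binary model."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = m / (1 + np.exp(-X @ beta))                 # fitted counts m*pi
        grad = X.T @ (y - mu)
        W = mu * (1 - mu / m)                            # binomial variance m*pi*(1-pi)
        beta += np.linalg.solve((X * W[:, None]).T @ X, grad)
    return beta

b_grouped = fit(y, cases, X)

# expand to ten individual 0/1 observations with the same covariate patterns
rows = np.repeat(np.arange(3), cases.astype(int))
X_ind = X[rows]
y_ind = np.concatenate([np.r_[np.ones(int(s)), np.zeros(int(m - s))]
                        for s, m in zip(y, cases)])
b_individual = fit(y_ind, np.ones(len(y_ind)), X_ind)

print(b_grouped, b_individual)   # identical up to numerical precision
```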

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a table with only a few rows or covariate patterns, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across one fixed set of levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., Akaike Information Criteria (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties and Refinements) (AIC) and Bayesian Information Criteria (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al ().

The probability density function is

f(x) = \frac{1}{\tau} \frac{\exp\{-(x - \mu)/\tau\}}{[1 + \exp\{-(x - \mu)/\tau\}]^2}, \quad -\infty < x < \infty,  (1)

and the cumulative distribution function is

F(x) = [1 + \exp\{-(x - \mu)/\tau\}]^{-1}, \quad -\infty < x < \infty.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + \exp\{(x - \mu)/\tau\}]^{-1}, \quad -\infty < x < \infty,

h(x) = \frac{1}{\tau} [1 + \exp\{-(x - \mu)/\tau\}]^{-1}.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = \frac{(\alpha/\theta)(t/\theta)^{\alpha - 1}}{1 + (t/\theta)^{\alpha}}, \quad t \ge 0,\ \theta, \alpha > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 - F(x)],

which in turn, by elementary calculus, implies that

\mathrm{logit}(F(x)) = \log_e \left[ \frac{F(x)}{1 - F(x)} \right] = x  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ~ U(0, 1); for the general logistic distribution in (1) we take τX + μ.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

\mathrm{logit}(P(d)) = \log_e \left[ \frac{P(d)}{1 - P(d)} \right] = \beta_0 + \beta_1 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().
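The conversion from fitted logit coefficients to the tolerance-distribution parameters is a one-liner; the coefficient values in this small sketch are hypothetical.

```python
# Sketch: from a fitted linear logit, logit(P(d)) = b0 + b1*d, recover the
# logistic tolerance distribution parameters mu = -b0/b1 and tau = 1/|b1|.
# b0, b1 are hypothetical fitted values.
b0, b1 = -3.2, 0.8
mu, tau = -b0 / b1, 1 / abs(b1)
print("mu =", mu, "tau =", tau)    # mu is the dose at which P(d) = 0.5
```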

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini ().

[Lorenz Curve, Fig. 1: Lorenz curves L_X(p) and L_Y(p) with Lorenz ordering, that is, L_X(p) ≥ L_Y(p).]

[Lorenz Curve, Fig. 2: Two intersecting Lorenz curves L_1(p) and L_2(p). Using the Gini index, L_1(p) has greater inequality (G = ) than L_2(p) (G = ).]

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
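For a finite sample of incomes, the empirical Lorenz curve and Gini index are easy to compute; the sketch below (illustrative, using a lognormal income sample) follows the definitions above.

```python
# Sketch: empirical Lorenz curve and Gini index from a sample of incomes.
# Assumes numpy; the lognormal income sample is illustrative.
import numpy as np

def lorenz(income):
    inc = np.sort(income)                       # poorest to richest
    p = np.arange(1, len(inc) + 1) / len(inc)   # population share
    L = np.cumsum(inc) / inc.sum()              # income share held by poorest p
    return p, L

def gini(income):
    inc = np.sort(income)
    n = len(inc)
    i = np.arange(1, n + 1)
    # standard sample formula for the area-based definition of G
    return 2 * np.sum(i * inc) / (n * inc.sum()) - (n + 1) / n

rng = np.random.default_rng(12)
income = rng.lognormal(mean=0.0, sigma=0.8, size=100000)
p, L = lorenz(income)
print("L(0.5) =", L[len(L) // 2 - 1], " Gini =", gini(income))
```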

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
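The theorem can be illustrated numerically: a transformation with u(x)/x decreasing (a progressive-style tax) lowers the Gini index, while a flat tax leaves it unchanged. The sketch below is self-contained; the tax rules are illustrative.

```python
# Sketch: condition (I) in action -- u(x)/x decreasing reduces the Gini index,
# while a flat tax (u(x)/x constant) leaves it unchanged. Tax rules are illustrative.
import numpy as np

def gini(income):
    inc = np.sort(income)
    n = len(inc)
    i = np.arange(1, n + 1)
    return 2 * np.sum(i * inc) / (n * inc.sum()) - (n + 1) / n

rng = np.random.default_rng(13)
income = rng.lognormal(mean=0.0, sigma=0.8, size=100000)

print("pre-tax:     ", gini(income))
print("progressive: ", gini(income**0.8))     # u(x)/x = x^{-0.2}, decreasing -> lower Gini
print("flat 30% tax:", gini(0.7 * income))    # u(x)/x constant -> Gini unchanged
```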

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss which is suffered by the decision being unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L: \Theta \times \Theta \to [0, \infty).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L: Θ × Θ → [0, ∞) with the property

L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next, we describe examples of loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative: Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0. \end{cases}

For point estimation problems we assume that Θ is a normed linear space and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ: R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L_1-loss. Another class of robust losses are the famous Huber losses:

\ell(t) = t^2 \text{ if } |t| \le \gamma, \qquad \ell(t) = 2\gamma |t| - \gamma^2 \text{ if } |t| > \gamma,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses:

\ell(t) = b(\exp(at) - at - 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
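Written out in code, the losses just introduced look as follows; this is a minimal sketch (the constants γ, a, b are illustrative choices):

```python
# Sketch: the loss functions discussed above, as plain numpy functions.
# gamma, a, b are illustrative constants.
import numpy as np

def lp_loss(t, p=2):                       # |t|^p: p=2 square loss, p=1 L1-loss
    return np.abs(t) ** p

def huber(t, gamma=1.0):                   # quadratic inside [-gamma, gamma], linear outside
    return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):                # asymmetric: exponential one side, linear the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-3, 3, 7)
print(huber(t, gamma=1.0))
print(linex(t, a=1.0))                     # note the asymmetry around 0
```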

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works, the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined so as to get no counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, ..., y_n, observed at design points t_1, ..., t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes

\sum_{i=1}^{n} \ell(r_i^f), \quad f \in F,

where r_i^f = y_i − f(t_i) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
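Choosing ℓ changes the fitted function. The sketch below (illustrative, using scipy.optimize for the non-quadratic cases) fits a straight line under square, L₁ and Huber losses; with ℓ(t) = t² it reproduces least squares, while the robust losses are less affected by heavy-tailed noise.

```python
# Sketch: "best approximation" under different residual losses ell, for a line f(t) = c0 + c1*t.
# Assumes numpy/scipy; the data and the Huber constant are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(15)
t = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * t + rng.standard_t(df=2, size=50) * 0.3   # heavy-tailed noise

def fit(ell):
    obj = lambda c: np.sum(ell(y - (c[0] + c[1] * t)))
    return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

square = lambda r: r**2
l1     = lambda r: np.abs(r)
huber  = lambda r, g=0.5: np.where(np.abs(r) <= g, r**2, 2*g*np.abs(r) - g**2)

for name, ell in [("square", square), ("L1", l1), ("Huber", huber)]:
    print(name, fit(ell))
```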

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Laws of Large Numbers

occurrence of the event through infinitely many replications of the experiment. For example, if a quality control engineer asserts a given probability that a widget produced by her production team meets specifications, then she is asserting that in the long-run that proportion of the widgets meet specifications. The phrase "in the long-run" requires the notion of limit as the sample size approaches infinity. The long-run relative frequency approach for describing the probability of an event is natural and intuitive but, nevertheless, it raises serious mathematical questions. Does the limiting relative frequency always exist as the sample size approaches infinity, and is the limit the same irrespective of the sequence of experimental outcomes? It is easy to see that the answers are negative. Indeed, in the above example, depending on the sequence of experimental outcomes, the proportion of widgets meeting specifications could fluctuate repeatedly from near 0 to near 1 as the number of widgets sampled approaches infinity. So in what sense can it be asserted that the limit exists and equals the asserted value? To provide an answer to this question, one needs to apply a LLN.

The LLNs are of two types, viz., weak LLNs (WLLNs) and strong LLNs (SLLNs). Each type involves a different mode of convergence. In general, a WLLN (resp., a SLLN) involves convergence in probability (resp., convergence almost surely (a.s.)). The definitions of these two modes of convergence will now be reviewed. Let $\{U_n, n \ge 1\}$ be a sequence of random variables

defined on a probability space $(\Omega, \mathcal{F}, P)$ and let $c \in \mathbb{R}$. We say that $U_n$ converges in probability to $c$ (denoted $U_n \stackrel{P}{\to} c$) if

$$\lim_{n\to\infty} P(|U_n - c| > \varepsilon) = 0 \quad \text{for all } \varepsilon > 0.$$

We say that $U_n$ converges a.s. to $c$ (denoted $U_n \to c$ a.s.) if

$$P\left(\left\{\omega \in \Omega : \lim_{n\to\infty} U_n(\omega) = c\right\}\right) = 1.$$

If $U_n \to c$ a.s., then $U_n \stackrel{P}{\to} c$; the converse is not true in general.

The celebrated Kolmogorov SLLN (see, e.g., Chow and Teicher 1997) is the following result. Let $\{X_n, n \ge 1\}$ be a sequence of independent and identically distributed (i.i.d.) random variables and let $c \in \mathbb{R}$. Then

$$\frac{\sum_{i=1}^n X_i}{n} \to c \ \text{a.s.} \quad \text{if and only if} \quad E X_1 = c. \tag{1}$$

Using statistical terminology, the sufficiency half of (1) asserts that the sample mean converges a.s. to the population mean as the sample size $n$ approaches infinity, provided the population mean exists and is finite. This result is of fundamental importance in statistical science. It follows from (1) that

$$\text{if } E X_1 = c \in \mathbb{R}, \text{ then } \frac{\sum_{i=1}^n X_i}{n} \stackrel{P}{\to} c; \tag{2}$$

this result is the Khintchine WLLN (see, e.g., Petrov 1995).

Next, suppose $\{A_n, n \ge 1\}$ is a sequence of independent events, all with the same probability $p$. A special case of the Kolmogorov SLLN is the limit result

$$\hat{p}_n \to p \ \text{a.s.,} \tag{3}$$

where $\hat{p}_n = \sum_{i=1}^n I_{A_i}/n$ is the proportion of $A_1, \dots, A_n$ to occur, $n \ge 1$. (Here $I_{A_i}$ is the indicator function of $A_i$, $i \ge 1$.) This result is the first SLLN ever proved and was discovered by Émile Borel in 1909. Hence, with probability 1, the sample proportion $\hat{p}_n$ approaches the population proportion $p$ as the sample size $n \to \infty$. It is this SLLN which thus provides the theoretical justification for the long-run relative frequency approach to interpreting probabilities. Note, however, that the convergence in (3) is not pointwise on $\Omega$ but, rather, is pointwise on some subset of $\Omega$ having probability 1. Consequently, any interpretation of $p = P(A_1)$ via (3) necessitates that one has a priori an intuitive understanding of the notion of an event having probability 1.
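The convergence in (3) is easy to watch numerically. The following short Python simulation (an illustration added here, not part of the original entry; the value of $p$ and the seed are arbitrary) tracks $\hat{p}_n$ along a single realization of the sequence of trials:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
p = 0.6          # true probability of each event A_i (arbitrary choice)
n = 100_000      # number of independent trials

indicators = rng.random(n) < p                         # I_{A_1}, ..., I_{A_n}
p_hat = np.cumsum(indicators) / np.arange(1, n + 1)    # \hat{p}_n for n = 1, ..., N

# Along this single realization the sample proportion settles near p,
# illustrating the almost sure convergence in (3).
for m in (10, 100, 1_000, 10_000, 100_000):
    print(m, p_hat[m - 1])
```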

The SLLN (3) is a key component in the proof of the Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), which, roughly speaking, asserts that with probability 1 a population distribution function can be uniformly approximated by a sample (or empirical) distribution function as the sample size approaches infinity. This result is referred to by Rényi (1970) as the fundamental theorem of mathematical statistics and by Loève (1977) as the central statistical theorem.

Jacob Bernoulli (1654–1705) proved the first WLLN

$$\hat{p}_n \stackrel{P}{\to} p. \tag{4}$$

Bernoulli's renowned book Ars Conjectandi (The Art of Conjecturing) was published posthumously in 1713, and it is here where the proof of his WLLN was first published. It is interesting to note that there is a gap of almost two hundred years between the WLLN (4) of Bernoulli and the corresponding SLLN (3) of Borel.

An interesting example is the following modification of one of Stout (1974). Suppose that the quality control engineer referred to above would like to estimate the proportion $p$ of widgets produced by her production team


that meet specifications. She estimates $p$ by using the proportion $\hat{p}_n$ of the first $n$ widgets produced that meet specifications, and she is interested in knowing if there will ever be a point in the sequence of examined widgets such that with probability (at least) a specified large value, $\hat{p}_n$ will be within $\varepsilon$ of $p$ and stay within $\varepsilon$ of $p$ as the sampling continues (where $\varepsilon > 0$ is a prescribed tolerance). The answer is affirmative, since (3) is equivalent to the assertion that for a given $\varepsilon > 0$ and $\delta > 0$, there exists a positive integer $N_{\varepsilon,\delta}$ such that

$$P\left(\bigcap_{n=N_{\varepsilon,\delta}}^{\infty} \left[\,|\hat{p}_n - p| \le \varepsilon\,\right]\right) \ge 1 - \delta.$$

That is, the probability is arbitrarily close to 1 that $\hat{p}_n$ will be arbitrarily close to $p$ simultaneously for all $n$ beyond some point. If one applied instead the WLLN (4), then it could only be asserted that for a given $\varepsilon > 0$ and $\delta > 0$, there exists a positive integer $N_{\varepsilon,\delta}$ such that

$$P(|\hat{p}_n - p| \le \varepsilon) \ge 1 - \delta \quad \text{for all } n \ge N_{\varepsilon,\delta}.$$

There are numerous other versions of the LLNs, and we will discuss only a few of them. Note that the expressions in (1) and (2) can be rewritten, respectively, as

$$\frac{\sum_{i=1}^n X_i - nc}{n} \to 0 \ \text{a.s.} \quad \text{and} \quad \frac{\sum_{i=1}^n X_i - nc}{n} \stackrel{P}{\to} 0,$$

thereby suggesting the following definitions. A sequence of random variables $\{X_n, n \ge 1\}$ is said to obey a general SLLN (resp., WLLN) with centering sequence $\{a_n, n \ge 1\}$ and norming sequence $\{b_n, n \ge 1\}$ (where $0 < b_n \to \infty$) if

$$\frac{\sum_{i=1}^n X_i - a_n}{b_n} \to 0 \ \text{a.s.} \quad \left(\text{resp., } \frac{\sum_{i=1}^n X_i - a_n}{b_n} \stackrel{P}{\to} 0\right).$$

A famous result of Marcinkiewicz and Zygmund (see, e.g., Chow and Teicher 1997) extended the Kolmogorov SLLN as follows. Let $\{X_n, n \ge 1\}$ be a sequence of i.i.d. random variables and let $0 < p < 2$. Then

$$\frac{\sum_{i=1}^n X_i - nc}{n^{1/p}} \to 0 \ \text{a.s. for some } c \in \mathbb{R} \quad \text{if and only if} \quad E|X_1|^p < \infty.$$

In such a case, necessarily $c = E X_1$ if $p \ge 1$, whereas $c$ is arbitrary if $p < 1$.

Feller (1946) extended the Marcinkiewicz–Zygmund SLLN to the case of a more general norming sequence $\{b_n, n \ge 1\}$ satisfying suitable growth conditions.

The following WLLN is ascribed to Feller by Chow and Teicher (1997). If $\{X_n, n \ge 1\}$ is a sequence of i.i.d. random variables, then there exist real numbers $\{a_n, n \ge 1\}$ such that

$$\frac{\sum_{i=1}^n X_i - a_n}{n} \stackrel{P}{\to} 0 \tag{5}$$

if and only if

$$n P(|X_1| > n) \to 0 \quad \text{as} \quad n \to \infty. \tag{6}$$

In such a case, $a_n - n E(X_1 I_{[|X_1| \le n]}) \to 0$ as $n \to \infty$. The condition (6) is weaker than $E|X_1| < \infty$. If $\{X_n, n \ge 1\}$ is a sequence of i.i.d. random variables where $X_1$ has probability density function

$$f(x) = \begin{cases} \dfrac{c}{x^2 \log |x|}, & |x| \ge e, \\[4pt] 0, & |x| < e, \end{cases}$$

where $c$ is a constant, then $E|X_1| = \infty$ and the SLLN $\sum_{i=1}^n X_i / n \to c$ a.s. fails for every $c \in \mathbb{R}$, but (6) and hence the WLLN (5) hold with $a_n = 0$, $n \ge 1$.
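Two elementary calculations (added here to make the example self-contained) confirm these two claims about the density above:

$$E|X_1| = 2\int_e^{\infty} x \cdot \frac{c}{x^2 \log x}\,dx = 2c \int_e^{\infty} \frac{dx}{x \log x} = 2c\,\big[\log \log x\big]_e^{\infty} = \infty,$$

$$n P(|X_1| > n) = 2n \int_n^{\infty} \frac{c\,dx}{x^2 \log x} \le \frac{2cn}{\log n}\int_n^{\infty} \frac{dx}{x^2} = \frac{2c}{\log n} \to 0.$$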

Klass and Teicher (1977) extended the Feller WLLN to the case of a more general norming sequence $\{b_n, n \ge 1\}$, thereby obtaining a WLLN analog of Feller's (1946) extension of the Marcinkiewicz–Zygmund SLLN.

Good references for studying the LLNs are the books by Révész (1968), Stout (1974), Loève (1977), Chow and Teicher (1997), and Petrov (1995). While the LLNs have been studied extensively in the case of independent summands, some of the LLNs presented in these books involve summands obeying a dependence structure other than that of independence.

A large literature of investigation on the LLNs for sequences of Banach space valued random elements has emerged, beginning with the pioneering work of Mourier (1953). See the monograph by Taylor (1978) for background material and results up to that time. Excellent references are the books by Vakhania, Tarieladze, and Chobanyan (1987) and Ledoux and Talagrand (1991). More recent results are provided by Adler et al., Cantrell and Rosalsky, and the references in these two articles.

About the Author
Professor Rosalsky is Associate Editor of the Journal of Applied Mathematics and Stochastic Analysis and Associate Editor of the International Journal of Mathematics and Mathematical Sciences. He has collaborated with research workers from many different countries. At the University of Florida, he twice received a Teaching Improvement Program Award and on five occasions was named an Anderson Scholar Faculty Honoree, an honor reserved for faculty members designated by undergraduate honor students as being the most effective and inspiring.

Cross References
7Almost Sure Convergence of Random Variables
7Borel–Cantelli Lemma and Its Generalizations


7Chebyshev's Inequality
7Convergence of Random Variables
7Ergodic Theorem
7Estimation: An Overview
7Expected Value
7Foundations of Probability
7Glivenko-Cantelli Theorems
7Probability Theory: An Outline
7Random Field
7Statistics, History of
7Strong Approximations in Probability and Statistics

References and Further Reading
Adler A, Rosalsky A, Taylor RL () A weak law for normed weighted sums of random elements in Rademacher type p Banach spaces. J Multivariate Anal
Cantrell A, Rosalsky A () A strong law for compactly uniformly integrable sequences of independent random elements in Banach spaces. Bull Inst Math Acad Sinica
Chow YS, Teicher H (1997) Probability theory: independence, interchangeability, martingales, 3rd edn. Springer, New York
Feller W (1946) A limit theorem for random variables with infinite moments. Am J Math
Klass M, Teicher H (1977) Iterated logarithm laws for asymmetric random variables barely with or without finite mean. Ann Probab
Ledoux M, Talagrand M (1991) Probability in Banach spaces: isoperimetry and processes. Springer, Berlin
Loève M (1977) Probability theory, vol I, 4th edn. Springer, New York
Mourier E (1953) Eléments aléatoires dans un espace de Banach. Annales de l'Institut Henri Poincaré
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon, Oxford
Rényi A (1970) Probability theory. North-Holland, Amsterdam
Révész P (1968) The laws of large numbers. Academic, New York
Stout WF (1974) Almost sure convergence. Academic, New York
Taylor RL (1978) Stochastic convergence of weighted sums of random elements in linear spaces. Lecture notes in mathematics. Springer, Berlin
Vakhania NN, Tarieladze VI, Chobanyan SA (1987) Probability distributions on Banach spaces. D Reidel, Dordrecht, Holland

Learning Statistics in a Foreign Language

Khidir M. Abdelbasit
Sultan Qaboos University, Muscat, Sultanate of Oman

Background
The Sultanate of Oman is an Arabic-speaking country where the medium of instruction in pre-university education is Arabic. In Sultan Qaboos University (SQU), all sciences (including Statistics) are taught in English. The reason is that most of the scientific literature is in English, and teaching in the native language may leave graduates at a disadvantage. Since only a few instructors speak Arabic, the university adopts a policy of no communication in Arabic in classes and office hours. Students are required to achieve a minimum level in English (in terms of an IELTS score) before they start their study program. Very few students achieve that level on entry, and the majority spend about two semesters doing English only.

Language and Cultural Problems
It is to be expected that students from a non-English-speaking background will face serious difficulties when learning in English, especially in the first year or two. Most of the literature discusses problems faced by foreign students pursuing study programs in an English-speaking country, or a minority in a multi-cultural society (see, for example, Coutis and Wood; Hubbard; Koh). Such students live (at least while studying) in an English-speaking community with which they have to interact on a daily basis. These difficulties are more serious for our students, who are studying in their own country, where English is not the official language. They hardly use English outside classrooms and avoid talking in class as much as they can.

My SQU Experience
Statistical concepts and methods are most effectively taught through real-life examples that the students appreciate and understand. We use the most popular textbooks in the USA for our courses. These textbooks use this approach with US students in mind. Our main problems are:

• Most of the examples and exercises used are completely alien to our students. The discussions meant to maintain the students' interest only serve to put ours off. With limited English, they have serious difficulties understanding what is explained and hence tend not to listen to what the instructor is saying. They do not read the textbooks because they contain pages and pages of lengthy explanations and discussions they cannot follow. A direct effect is that students may find the subject boring and quickly lose interest. Their attention then turns to the art of passing tests instead of acquiring the intended knowledge and skills. To pass their tests, they use both their class and study times looking through examples, concentrating on what formula to use and where to plug the numbers they have to get the answer. This way they manage to do the mechanics fairly well, but the concepts are almost entirely missed.


• The problem is worse with introductory probability courses, where games of chance are extensively used as illustrative examples in textbooks. Most of our students have never seen a deck of playing cards, and some may even be offended by discussing card games in a classroom.

The burden of finding strategies to overcome these difficulties falls on the instructor. Statistical terms and concepts such as parameter/statistic, sampling distribution, unbiasedness, consistency, sufficiency, and the ideas underlying hypothesis testing are not easy to get across even in the students' own language. To do that in a foreign language is a real challenge. For the Statistics program to be successful, all (or at least most of the) instructors involved should be up to this challenge. This is a time-consuming task with little reward other than self-satisfaction. In SQU the problem is compounded further by the fact that most of the instructors are expatriates on short-term contracts, who are more likely to use their time for personal career advancement rather than time-consuming community service jobs.

What Can Be Done
For our first Statistics course we produced a manual that contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good collection of problems from local culture to motivate the students. The manual was well received by the students, to the extent that students prefer to practice with examples from the manual rather than the textbook.

Textbooks written in English that are brief and to the point are needed. These should include examples and exercises from the students' own culture. A student trying to understand a specific point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students' faces clearly indicate that you have got nothing across, it is natural to try explaining more, using more examples. In an end-of-semester evaluation of a course I taught, a student once wrote: "The instructor explains things more than needed. He makes simple points difficult." This indicates that when teaching in a foreign language, lengthy oral or written explanations are not helpful. A better strategy will be to explain concepts and techniques briefly and provide plenty of examples and exercises that will help the students absorb the material by osmosis. The basic statistical concepts can only be effectively communicated to students in their own language. For this reason textbooks should contain a good glossary where technical terms and concepts are explained using the local language.

I expect such textbooks to go a long way toward enhancing students' understanding of Statistics. An international project can be initiated to produce an introductory statistics textbook with different versions intended for different geographical areas: the English material would be the same, the examples would vary to some extent from area to area, and glossaries would be in local languages. Universities in the developing world naturally look at western universities as models, and international (western) involvement in such a project is needed for it to succeed. The project would be a major contribution to the promotion of understanding Statistics and excellence in statistical education in developing countries. The international statistical institute takes pride in supporting statistical progress in the developing world. This project can lay the foundation for this progress and hence is worth serious consideration by the institute.

Cross References
7Online Statistics Education
7Promoting, Fostering and Development of Statistics in Developing Countries
7Selection of Appropriate Statistical Methods in Developing Countries
7Statistical Literacy, Reasoning, and Thinking
7Statistics Education

References and Further Reading
Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. httpwwwmathuocgr~ictmProceedingspappdf
Hubbard R () Teaching statistics to students who are learning in a foreign language. ICOTS
Koh E () Teaching statistics to students with limited language skills using MINITAB. httparchivesmathutkeduICTCMVOLCpaperpdf

Least Absolute Residuals Procedure

Richard William Farebrother
Honorary Reader in Econometrics
Victoria University of Manchester, Manchester, UK

For $i = 1, \dots, n$, let $x_{i1}, x_{i2}, \dots, x_{iq}, y_i$ represent the $i$th observation on a set of $q + 1$ variables, and suppose that we wish to fit a linear model of the form

$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iq}\beta_q + \epsilon_i$$


to these $n$ observations. Then, for $p > 0$, the $L_p$-norm fitting procedure chooses values for $b_1, b_2, \dots, b_q$ to minimise the $L_p$-norm of the residuals $\left[\sum_{i=1}^n |e_i|^p\right]^{1/p}$, where for $i = 1, \dots, n$ the $i$th residual is defined by

$$e_i = y_i - x_{i1}b_1 - x_{i2}b_2 - \cdots - x_{iq}b_q.$$

The most familiar $L_p$-norm fitting procedure, known as the 7least squares procedure, sets $p = 2$ and chooses values for $b_1, b_2, \dots, b_q$ to minimize the sum of the squared residuals $\sum_{i=1}^n e_i^2$.

A second choice, to be discussed in the present article, sets $p = 1$ and chooses $b_1, b_2, \dots, b_q$ to minimize the sum of the absolute residuals $\sum_{i=1}^n |e_i|$.

A third choice sets $p = \infty$ and chooses $b_1, b_2, \dots, b_q$ to minimize the largest absolute residual $\max_{i=1}^n |e_i|$.

Setting $u_i = e_i$ and $v_i = 0$ if $e_i \ge 0$, and $u_i = 0$ and $v_i = -e_i$ if $e_i < 0$, we find that $e_i = u_i - v_i$, so that the least absolute residuals (LAR) fitting problem chooses $b_1, b_2, \dots, b_q$ to minimize the sum of the absolute residuals

$$\sum_{i=1}^n (u_i + v_i)$$

subject to

$$x_{i1}b_1 + x_{i2}b_2 + \cdots + x_{iq}b_q + u_i - v_i = y_i \quad \text{for } i = 1, \dots, n$$

and $u_i \ge 0$, $v_i \ge 0$ for $i = 1, \dots, n$. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.
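This linear programming formulation is easy to try out directly. The sketch below (an illustration added to this entry, with invented data; scipy's general-purpose LP solver stands in for the dual simplex variant mentioned above) stacks the decision vector as $(b, u, v)$ and imposes the equality constraints $Xb + u - v = y$:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])     # constant column included
y = X @ np.array([2.0, 1.0, -0.5]) + rng.standard_t(df=2, size=n)  # heavy-tailed errors

# Variables z = (b_1..b_q, u_1..u_n, v_1..v_n); minimize sum(u_i + v_i).
c = np.concatenate([np.zeros(q), np.ones(n), np.ones(n)])
A_eq = np.hstack([X, np.eye(n), -np.eye(n)])    # X b + u - v = y
bounds = [(None, None)] * q + [(0, None)] * (2 * n)

res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
b_lar = res.x[:q]
print("LAR coefficients:", b_lar)
```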

Gauss noted that solutions of this problem are characterized by the presence of a set of $q$ zero residuals. Such solutions are robust to the presence of outlying observations. Indeed, they remain constant under variations in the other $n - q$ observations, provided that these variations do not cause any of the residuals to change their signs.

The LAR fitting procedure corresponds to the maximum likelihood estimator when the $\epsilon$-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the $\epsilon$-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711–1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant ($x_{i1} = 1$), and the values of $b_1$ and $b_2$ are constrained to satisfy the adding-up condition $\sum_{i=1}^n (y_i - b_1 - x_{i2}b_2) = 0$ usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with larger numbers of variables are still to be found in the literature.

For an account of recent developments in this area, see the series of volumes edited by Dodge. For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley, and Rhodes, see Farebrother (1999). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother (2002).

About the Author
Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement; he subsequently served as an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations (1988), Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900 (1999), and Visualizing Statistical Models and Concepts (2002). He has also published many research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr. Farebrother was also Visiting Associate Professor at Monash University, Australia, and Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
7Least Squares
7Residuals

References and Further Reading
Dodge Y (ed) (1987) Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1992) L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1997) L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) (2002) Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel


Farebrother RW (1999) Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW (2002) Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stępniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems $F(x) \approx b$ of equations, where $b \in \mathbb{R}^n$ while $F(x)$ is a column of the form

$$F(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{n-1}(x) \\ f_n(x) \end{bmatrix}$$

with entries $f_i = f_i(x)$, $i = 1, \dots, n$, where $x = (x_1, \dots, x_p)^T$. The LS problem is linear when each $f_i$ is a linear function, and nonlinear if not.

equations Such a system is overdetermined if n gt p If b notinrange(A) the system has no proper solution and will bedenoted by Ax asymp b In this situation we are seeking for asolution of some optimization probleme name ldquoLeastSquaresrdquo is justied by the l-norm commonly used as ameasure of imprecision

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

$$y_i = f_i(x_{i1}, \dots, x_{ip}; \beta_1, \dots, \beta_p) + e_i \quad \text{for } i = 1, \dots, n \tag{1}$$

where $x_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, p$, are some constants given by the experimental design, $f_i$, $i = 1, \dots, n$, are given functions depending on unknown parameters $\beta_j$, $j = 1, \dots, p$, while $y_i$, $i = 1, \dots, n$, are values of these functions observed with some random errors $e_i$. We want to estimate the unknown parameters $\beta_j$ on the basis of the data set $\{x_{ij}, y_i\}$.

In linear regression each $f_i$ is a linear function of the type $f_i = \sum_{j=1}^p c_{ij}\beta_j$, and the model (1) may be presented in vector-matrix notation as

$$y = X\beta + e,$$

where $y = (y_1, \dots, y_n)^T$, $e = (e_1, \dots, e_n)^T$, and $\beta = (\beta_1, \dots, \beta_p)^T$, while $X$ is an $n \times p$ matrix with entries $x_{ij}$. If $e_1, \dots, e_n$ are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of $\beta$ reduces to finding a vector $\hat{\beta}$ that minimizes the norm

$$\|y - X\beta\|^2 = (y - X\beta)^T (y - X\beta).$$

Such a vector is said to be the ordinary LS solution of the overparametrized system $X\beta \approx y$. On the other hand, the latter reduces to solving the consistent system

$$X^T X \beta = X^T y$$

of linear equations, called the normal equations. In particular, if $\operatorname{rank}(X) = p$, then the system has a unique solution of the form

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$

For linear regression $y_i = \alpha + \beta x_i + e_i$ with one regressor $x$, the BLU estimators of the parameters $\alpha$ and $\beta$ may be presented in the convenient form

$$\hat{\beta} = \frac{n s_{xy}}{n s_x^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$$

where

$$n s_{xy} = \sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}, \qquad n s_x^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n},$$

$$\bar{x} = \frac{\sum_i x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_i y_i}{n}.$$

For its computation we only need to use a simple pocket calculator.

Example. Given data on the number of residents in thousands ($x_i$) and the unemployment rate in percent ($y_i$) for a sample of cities of Poland, the parameters $\beta$ and $\alpha$ are estimated by computing $\sum_i x_i$, $\sum_i y_i$, $\sum_i x_i^2$, and $\sum_i x_i y_i$ from the data, from which $n s_x^2$ and $n s_{xy}$, and thus $\hat{\beta}$, $\hat{\alpha}$, and the fitted line $f(x) = \hat{\alpha} + \hat{\beta} x$, follow.
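The "pocket calculator" arithmetic is easily checked by machine. Here is a minimal Python sketch of the same computation (added for illustration; the data values are invented, since only the structure of the original worked example is described above):

```python
# Simple linear regression via the "pocket calculator" formulas above.
x = [61, 32, 210, 120, 88]          # residents in thousands (hypothetical values)
y = [14.2, 18.5, 6.9, 10.1, 12.0]   # unemployment rate in percent (hypothetical)
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

n_sxy = sxy - sx * sy / n   # n * s_xy
n_sx2 = sxx - sx * sx / n   # n * s_x^2

beta = n_sxy / n_sx2
alpha = sy / n - beta * sx / n
print(f"fitted line: f(x) = {alpha:.3f} + {beta:.3f} x")
```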


If the variance–covariance matrix of the error vector $e$ coincides (except for a multiplicative scalar $\sigma^2$) with a positive definite matrix $V$, then Best Linear Unbiased Estimation reduces to the minimization of $(y - X\beta)^T V^{-1} (y - X\beta)$, called the weighted LS problem. Moreover, if $\operatorname{rank}(X) = p$, then its solution is given in the form

$$\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known. Instead, some iterative algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative for the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of $n$ linear equations $Ax \approx b$ with $k$ unknowns $x$. The TLS problem consists in minimizing the Frobenius norm

$$\|[A, b] - [\hat{A}, \hat{b}]\|_F$$

for all $\hat{A} \in \mathbb{R}^{n \times k}$ and $\hat{b} \in \operatorname{range}(\hat{A})$, where the Frobenius norm is defined by $\|(a_{ij})\|_F^2 = \sum_{i,j} a_{ij}^2$. Once a minimizing $[\hat{A}, \hat{b}]$ is found, then any $x$ satisfying $\hat{A}x = \hat{b}$ is called a TLS solution of the initial system $Ax \approx b$.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique; simple low-dimensional examples of both phenomena can be constructed.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction $b - Ax$ has a smaller $l_2$-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any matrix $A$ of size $n \times k$ with real entries, there exist orthogonal matrices $P = [p_1, \dots, p_n]$ and $Q = [q_1, \dots, q_k]$ such that

$$P^T A Q = \operatorname{diag}(\sigma_1, \dots, \sigma_m), \quad \text{where } \sigma_1 \ge \cdots \ge \sigma_m \text{ and } m = \min\{n, k\}.$$
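A minimal numerical sketch of the SVD route to a TLS solution follows (added for illustration, assuming a single right-hand side and that a unique TLS solution exists; the data are invented). The solution is read off from the right singular vector of $[A, b]$ associated with the smallest singular value:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2
A = rng.normal(size=(n, k)) + rng.normal(scale=0.05, size=(n, k))  # noisy design
b = A @ np.array([1.0, -2.0]) + rng.normal(scale=0.05, size=n)     # noisy right-hand side

# SVD of the augmented matrix [A b]; the TLS solution comes from the
# right singular vector associated with the smallest singular value.
_, _, Vt = np.linalg.svd(np.column_stack([A, b]))
v = Vt[-1]                     # last row of V^T = last column of V
x_tls = -v[:k] / v[k]          # requires v[k] != 0 (TLS solution exists)
print("TLS solution:", x_tls)
```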

About the Author
For biography, see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenburg H () Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J () Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed, so that the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model.


Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process $X = \{X_t, t \ge 0\}$ that has right continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. $X$ has stationary increments, i.e., for every $s, t \ge 0$, the distribution of $X_{t+s} - X_t$ is independent of $t$.
2. $X$ has independent increments, i.e., for every $t, s \ge 0$, $X_{t+s} - X_t$ is independent of $(X_u, u \le t)$.
3. $X$ is stochastically continuous, i.e., for every $t \ge 0$ and $\varepsilon > 0$:
$$\lim_{s \to t} P(|X_t - X_s| > \varepsilon) = 0.$$

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left hand limits.

If $\Phi(z)$ is the characteristic function of a Lévy process at time $t$, then its characteristic exponent $\varphi(z) \stackrel{\text{def}}{=} \ln \Phi(z)/t$ is of the form

$$\varphi(z) = iza - \frac{z^2 b}{2} + \int_{\mathbb{R}} \left[\exp(izx) - 1 - izx\, I_{\{|x|<1\}}\right] \nu(dx),$$

where $a \in \mathbb{R}$, $b \in \mathbb{R}_+$, and $\nu$ is a measure on $\mathbb{R}$ satisfying $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} (1 \wedge x^2)\, \nu(dx) < \infty$.

The measure $\nu$ characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant $a$ defined above is called the drift term of the process, and $b$ is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes. It is stated as follows: given any $a \in \mathbb{R}$, $b \in \mathbb{R}_+$, and measure $\nu$ on $\mathbb{R}$ satisfying $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} (1 \wedge x^2)\,\nu(dx) < \infty$, there exists a probability space $(\Omega, \mathcal{F}, P)$ on which a Lévy process $X$ is defined. The process $X$ is the sum of three independent processes $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$, where $X^{(1)}$ is a Brownian motion with drift $a$ and volatility $b$ (in the sense defined below), $X^{(2)}$ is a compound Poisson process, and $X^{(3)}$ is a square integrable martingale. The characteristic exponents of $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ (denoted by $\varphi^{(1)}(z)$, $\varphi^{(2)}(z)$, and $\varphi^{(3)}(z)$, respectively) are as follows:

$$\varphi^{(1)}(z) = iza - \frac{z^2 b}{2},$$

$$\varphi^{(2)}(z) = \int_{|x| \ge 1} (\exp(izx) - 1)\, \nu(dx),$$

$$\varphi^{(3)}(z) = \int_{|x| < 1} (\exp(izx) - 1 - izx)\, \nu(dx).$$

Examples of Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift $\mu$ and volatility rate $\sigma^2$ if $\mu = a$, $b = \sigma^2$, and $\nu(\mathbb{R}) = 0$. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let $X$ be a Brownian motion with drift $\mu > 0$ and volatility rate $\sigma^2$. For any $x > 0$, let $T_x = \inf\{t : X_t > x\}$. Then $\{T_x\}$ is an increasing Lévy process (called an inverse Brownian motion); its Lévy measure is given by

$$\nu(dx) = \frac{1}{\sqrt{2\pi\sigma^2 x^3}} \exp\left(-\frac{x\mu^2}{2\sigma^2}\right) dx.$$

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where $b = 0$ and $\nu$ is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process $X$ where $b = 0$, $a - \int_0^1 x\,\nu(dx) = 0$, and its Lévy measure is given by

$$\nu(dx) = \frac{\alpha}{x} \exp(-x/\beta)\, dx, \quad x > 0,$$

where $\alpha, \beta > 0$. It follows that the mean term $(E(X_1))$ and the variance term $(V(X_1))$ for the process are equal to $\alpha\beta$ and $\alpha\beta^2$, respectively.

The following is a simulated sample path of a gamma process, for particular values of $\alpha$ and $\beta$ (Fig. 1).

(Lévy Processes, Fig. 1: Gamma process, simulated sample path; time $t$ on the horizontal axis, $x$ on the vertical axis.)
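A path like the one in Fig. 1 can be generated from the fact that, for the Lévy measure above, the increments of $X$ over disjoint intervals of length $\Delta$ are independent $\mathrm{Gamma}(\alpha\Delta, \beta)$ variables (shape–scale parametrization). A Python sketch (added for illustration; the parameter values are arbitrary, since the originals are not recoverable):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.0, 1.0     # arbitrary parameter choices for the sketch
T, steps = 10.0, 1000
dt = T / steps

# Independent Gamma(alpha*dt, beta) increments, cumulated into a path.
increments = rng.gamma(shape=alpha * dt, scale=beta, size=steps)
path = np.concatenate([[0.0], np.cumsum(increments)])
t = np.linspace(0.0, T, steps + 1)
print("X_T =", path[-1], "(E[X_T] = alpha*beta*T =", alpha * beta * T, ")")
```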

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: $\mu$ – the


drift term of the Brownian process, $\sigma$ – the volatility of the Brownian process, and $\nu$ – the variance term of the gamma process.

The following are two simulated sample paths, one for a Brownian motion with a given drift term $\mu$ and volatility term $\sigma$, and the other for a variance gamma process with the same values for the drift and volatility terms and a given variance term $\nu$ (Fig. 2).

(Lévy Processes, Fig. 2: Brownian motion and variance gamma sample paths; time $t$ on the horizontal axis, $x$ on the vertical axis.)
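The variance gamma path can be simulated by subordination, exactly as described above: replace calendar time by a gamma process with unit mean rate and variance term $\nu$, then run the Brownian motion on that random clock. A sketch (added for illustration; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, nu = 0.05, 0.2, 0.5   # drift, volatility, gamma variance term (arbitrary)
T, steps = 10.0, 1000
dt = T / steps

# Gamma time increments with mean dt and variance nu*dt: shape dt/nu, scale nu.
dG = rng.gamma(shape=dt / nu, scale=nu, size=steps)
# Brownian motion with drift mu and volatility sigma, evaluated on the gamma clock.
dX = mu * dG + sigma * np.sqrt(dG) * rng.standard_normal(steps)
vg_path = np.concatenate([[0.0], np.cumsum(dX)])
print("X_T =", vg_path[-1])
```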

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Çınlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in Statistics and Operations Research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using PMλτ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P (2004) Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age $x$, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy ($e_x$) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

$$e_x = \frac{T_x}{l_x},$$

where $x$ is age, $T_x$ is the number of person-years lived at ages $x$ and over, and $l_x$ is the number of survivors at age $x$ according to the life table.


Life expectancy can be calculated for combined sexes or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth ($e_0$) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

$$e_0 = \frac{T_0}{l_0},$$

where $T_0$ is the total number of person-years lived by the cohort and $l_0$ is the number of births (the original number of persons in the birth cohort).
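In practice, $T_x$ is accumulated from the life-table person-years lived in each age interval. A minimal Python sketch of the discrete computation (added for illustration; the survivorship column $l_x$ is invented, and the common approximation that deaths occur on average mid-interval is used):

```python
# Toy survivorship column l_x for ages 0..5 (invented values).
l = [100_000, 99_000, 98_800, 98_650, 98_500, 98_300]

# Person-years lived in each single-year age interval (trapezoid rule),
# T_x as tail sums, then e_x = T_x / l_x.
L = [(l[i] + l[i + 1]) / 2 for i in range(len(l) - 1)]
for x in range(len(L)):
    T_x = sum(L[x:])
    # Only the ages in this toy table contribute, so these e_x are partial.
    print(f"e_{x} over these ages only: {T_x / l[x]:.3f}")
```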

birth is highly inuenced by infant mortality rate eparadox of the life table refers to a situation where lifeexpectancy at birth increases for several years aer birth(e lt e lt e and even beyond)e paradox reects thehigher rates of infant and child mortality in populationsin pre-transition and middle stages of the demographictransitionLife expectancy at birth is a summary measure of mor-

tality in a population It is a frequently used indicatorof health standards and socio-economic living standardsLife expectancy is also one of the most commonly usedindicators of social development is indicator is easilycomparable through time and between areas includingcountries Inequalities in life expectancy usually indicateinequalities in health and socio-economic developmentLife expectancy rose throughout human history In

ancient Greece and Rome the average life expectancywas below years between the years and life expectancy at birth rose from about years to aglobal average of years and to more than years inthe richest countries (Riley ) Furthermore in mostindustrialized countries in the early twenty-rst centurylife expectancy averaged at about years (WHO)esechanges called the ldquohealth transitionrdquo are essentially theresult of improvements in public health medicine andnutritionLife expectancy varies signicantly across regions and

continents from life expectancies of about years insome central African populations to life expectancies of years and above in many European countries emore developed regions have an average life expectancy of years while the population of less developed regionsis at birth expected to live an average years less etwo continents that display the most extreme dierencesin life expectancies are North America ( years) andAfrica ( years) where as of recently the gap betweenlife expectancies amounts to years (UN ndash data)

Countries with the highest life expectancies in the world include Australia, Iceland, Italy, Switzerland, and Japan; Japanese men and women live among the longest of any population (WHO). In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is far lower; some of the world's lowest life expectancies are found in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries women live longer than men; worldwide, the gap between female and male life expectancy at birth is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes, whereas Tonga is unusual in that life expectancy for males exceeds that for females (WHO).

According to estimation by the UN global life expectancyat birth is likely to rise to an average years by ndashBy life expectancy is expected to vary across countriesfrom to years Long-range United Nations popula-tion projections predict that by on average peoplecan expect to live more than years from (Liberia) upto years (Japan)For more details on the calculation of life expectancy

including continuous notation see Keytz ( )and Preston et al ()

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. WHO, Geneva

World population prospects: the revision, highlights. United Nations, New York
World population to 2300. United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function $F$ of a positive variable, say $X$, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that $F$ has a density $f(x) = \frac{d}{dx}F(x)$, and define the force of mortality or death intensity at age $x$ as the function

$$\mu(x) = -\frac{d}{dx} \ln\{1 - F(x)\} = \frac{f(x)}{1 - F(x)}.$$

Heuristically, it is interpreted by the relation $\mu(x)\,dx = P\{x < X < x + dx \mid X > x\}$. Conversely, $F(x) = 1 - \exp\left\{-\int_0^x \mu(s)\,ds\right\}$. The survivor function is defined as $\ell(x) = \ell(0)\{1 - F(x)\}$, normally with $\ell(0) = 100{,}000$. In mortality applications, $\ell(x)$ is the expected number of survivors to exact age $x$ out of an original cohort of $100{,}000$ newborn babies. The survival probability is

$$ {}_t p_x = P\{X > x + t \mid X > x\} = \frac{\ell(x+t)}{\ell(x)} = \exp\left\{-\int_0^t \mu(x+s)\,ds\right\},$$

and the non-survival probability is (the converse) ${}_t q_x = 1 - {}_t p_x$. For $t = 1$ one writes $q_x = {}_1q_x$ and $p_x = {}_1p_x$. In particular, we get $\ell(x+1) = \ell(x)\,p_x$. This is a practical recursion formula that permits us to compute all values of $\ell(x)$ once we know the values of $p_x$ for all relevant $x$.

The life expectancy is $e^o_0 = EX = \int_0^\infty \ell(x)\,dx \,/\, \ell(0)$. (The subscript $0$ in $e^o_0$ indicates that the expected value is computed at age $0$, i.e., for newborn individuals, and the superscript $o$ indicates that the computation is made in the continuous mode.) The remaining life expectancy at age $x$ is

$$e^o_x = E(X - x \mid X > x) = \int_0^\infty \ell(x+t)\,dt \,/\, \ell(x),$$

i.e., it is the expected lifetime remaining to someone who has attained age $x$.

quantities suppose that the function micro(x) is piecewiseconstant which means that we take it to equal some con-stant say microj over each interval (xj xj+) for some partitionxj of the age axis For a collection Xi of independentobservations of X let Dj be the number of Xi that fall inthe interval (xj xj+) In mortality applications this is thenumber of deaths observed in the given interval For thecohort of the initially newborn Dj is the number of indi-viduals whodie in the interval (called the occurrences in theinterval) If individual i dies in the interval he or she willof course have lived for Xi minus xj time units during the inter-val Individuals who survive the interval will have lived forxj+ minus xj time units in the interval and individuals whodo not survive to age xj will not have lived during thisinterval at all When we aggregate the time units lived in(xj xj+) over all individuals we get a total Rj which iscalled the exposures for the interval the idea being thatindividuals are exposed to the risk of death for as long asthey live in the interval In the simple case where there areno relations between the individual parameters microj the col-lection DjRj constitutes a statistically sucient set ofobservations with a likelihood Λ that satises the relationln Λ = sumjminusmicrojRj+Dj ln microjwhich is easily seen to bemax-imized by microj = DjRje latter fraction is therefore themaximum-likelihood estimator for microj (In some connec-tions an age schedule of mortality will be specied such asthe classical GompertzndashMakeham function microx = a + bcxwhich does represent a relationship between the intensityvalues at the dierent ages x normally for single-year agegroupsMaximum likelihood estimators can then be foundby plugging this functional specication of the intensitiesinto the likelihood function nding the values a b andc that maximize Λ and using a + bcx for the intensity inthe rest of the life table computations Methods that do notamount tomaximum likelihood estimationwill sometimesbe used because they involve simpler computations Withsome luck they provide starting values for the iterative pro-cess that must usually be applied to produce the maximumlikelihood estimators For an example see Forseacuten ())is whole schema can be extended trivially to cover cen-soring (withdrawals) provided the censoringmechanism isunrelated to the mortality process


If the force of mortality is constant over a single-year age interval $(x, x+1)$, say, and is estimated by $\hat{\mu}_x$ in this interval, then $\hat{p}_x = e^{-\hat{\mu}_x}$ is an estimator of the single-year survival probability $p_x$. This allows us to estimate the survival function recursively for all corresponding ages, using $\hat{\ell}(x+1) = \hat{\ell}(x)\hat{p}_x$ for $x = 0, 1, 2, \dots$, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
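A compact Python sketch of this recursion from occurrence–exposure data (added for illustration; the $D_x$ and $R_x$ values are invented, and single-year intervals are assumed):

```python
import math

# Invented single-year occurrences D_x (deaths) and exposures R_x (person-years).
D = [50, 4, 3, 2, 2]
R = [99_000, 98_900, 98_850, 98_800, 98_750]

mu_hat = [d / r for d, r in zip(D, R)]     # maximum likelihood: mu_x = D_x / R_x
p_hat = [math.exp(-m) for m in mu_hat]     # single-year survival probabilities

ell = [100_000.0]                          # radix ell(0)
for p in p_hat:
    ell.append(ell[-1] * p)                # ell(x+1) = ell(x) * p_x
print([round(v) for v in ell])
```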

a group of mutually independent individuals who have allbeen observed in parallel essentially a cohort that is fol-lowed from a signicant common starting point (namelyfrom birth in our mortality example) and which is dimin-ished over time due to decrements (attrition) caused bythe risk in question and also subject to reduction due tocensoring (withdrawals)e corresponding table is thencalled a cohort life table It is more common however toestimate a px schedule from data collected for the mem-bers of a population during a limited time period and touse the mechanics of life-table construction to produce aperiod life table from the px valuesLife table techniques are described in detail in most

introductory textbooks in actuarial statistics7biostatistics7demography and epidemiology See eg Chiang ()Elandt-Johnson and Johnson () Manton and Stallard() Preston et al () For the history of thetopic consult Seal () Smith and Keytz () andDupacircquier ()

About the Author
For biography, see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in the history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model $f(y; \theta)$, where the parameter $\theta$ is assumed to take values in a subset of $\mathbb{R}^k$ and the variable $y$ is assumed to take values in a subset of $\mathbb{R}^n$; the likelihood function is defined by

$$L(\theta) = L(\theta; y) = c\, f(y; \theta), \tag{1}$$

where $c$ can depend on $y$ but not on $\theta$. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart (1998) and Murphy and Van der Vaart (1997). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of $\theta$ for a given observed data point $y$. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is $f(y; \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$, $y = 0, 1, \dots, n$, $\theta \in [0, 1]$, then the likelihood function is (any function proportional to)

$$L(\theta; y) = \theta^y (1 - \theta)^{n-y},$$


and can be plotted as a function of $\theta$ for any fixed value of $y$. The likelihood function is maximized at $\hat{\theta} = y/n$. This model might be appropriate for a sampling scheme which recorded the number of successes among $n$ independent trials that result in success or failure, each trial having the same probability of success $\theta$. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean $\mu$ and variance $\sigma^2$:

$$L(\theta; y) = \exp\left\{-n \log \sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right\},$$

where $\theta = (\mu, \sigma^2)$. This could also be plotted as a function of $\mu$ and $\sigma^2$ for a given sample $y_1, \dots, y_n$, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean $\bar{y} = n^{-1}\sum y_i$ and sample variance $s^2 = (n-1)^{-1}\sum (y_i - \bar{y})^2$, or equivalently through $\sum y_i$ and $\sum y_i^2$. It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
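The binomial likelihood above is easily inspected numerically. The Python sketch below (added for illustration; the data values are invented) standardizes by the maximum, so the resulting curve shows relative plausibility:

```python
import numpy as np

y, n = 7, 10                              # invented data: y successes in n trials
theta = np.linspace(0.001, 0.999, 999)

loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
rel_lik = np.exp(loglik - loglik.max())   # L(theta) / L(theta_hat)

theta_hat = theta[np.argmax(loglik)]
print("theta_hat ~", theta_hat, "(exact MLE is y/n =", y / n, ")")
# Values of theta with rel_lik > 1/8, say, form a pure likelihood interval.
print("1/8 likelihood interval:", theta[rel_lik > 1 / 8][[0, -1]])
```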

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as $L(\theta)/L(\hat{\theta}) > k$ to define a range of "likely" or "plausible" values of $\theta$. Many authors, including Royall (1997) and Edwards (1992), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that $\theta$ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of $\theta$ that is of interest; see section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for $\theta$ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter $\theta$, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

$$\pi(\theta \mid y) = \frac{L(\theta; y)\,\pi(\theta)}{\int_\theta L(\theta; y)\,\pi(\theta)\,d\theta},$$

where $\pi(\theta)$ is the density for the prior measure and $\pi(\theta \mid y)$ provides a probabilistic assessment of $\theta$ after observing $Y = y$ in the model $f(y; \theta)$. We could then make conclusions of the form "having observed $y$ successes in $n$ trials, and assuming the uniform prior $\pi(\theta) = 1$, the posterior probability that $\theta$ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model $f(y; \theta)$, by examining the distribution of $L(\theta; Y)$ as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that $2\log\{L(\hat{\theta}; Y)/L(\theta; Y)\}$, where $\hat{\theta} = \hat{\theta}(Y)$ is the value of $\theta$ at which $L(\theta; Y)$ is maximized, is approximately distributed as a $\chi^2_k$ random variable, where $k$ is the dimension of $\theta$. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

$$U(\theta; Y) = \frac{\partial}{\partial \theta}\log L(\theta; Y).$$

If $Y = (Y_1, \dots, Y_n)$ has independent components, then $U(\theta)$ is a sum of $n$ independent components, which under mild regularity conditions will be asymptotically normal. To obtain the $\chi^2$ result quoted above, it is also necessary to investigate the convergence of $\hat{\theta}$ to the true value governing the model $f(y; \theta)$. Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see the overview by Scholz for a summary of some of the issues that arise.

has sucient regularity the follow asymptotic results canbe established

(θ minus θ)T i(θ)(θ minus θ) drarr χk ()

U(θ)T iminus(θ)U(θ) drarr χk ()

ℓ(θ) minus ℓ(θ) drarr χk ()

where i(θ) = Eminusℓprimeprime(θY) θ is the expected Fisher infor-mation function ℓ(θ) = logL(θ) is the log-likelihoodfunction and χk is the 7chi-square distribution with kdegrees of freedom
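For the binomial model these quantities are simple to write down: $U(\theta) = y/\theta - (n-y)/(1-\theta)$ and $i(\theta) = n/\{\theta(1-\theta)\}$. A Python sketch comparing the three statistics in (2)–(4) at a hypothesized $\theta_0$ (added for illustration; the data are invented, and the information in the Wald statistic is evaluated at $\hat{\theta}$, as is conventional):

```python
import math

y, n = 7, 10        # invented data
theta0 = 0.5        # hypothesized value
theta_hat = y / n

def loglik(t):
    return y * math.log(t) + (n - y) * math.log(1 - t)

info0 = n / (theta0 * (1 - theta0))              # expected Fisher information at theta0
score = y / theta0 - (n - y) / (1 - theta0)      # U(theta0)

wald = (theta_hat - theta0) ** 2 * n / (theta_hat * (1 - theta_hat))
score_stat = score ** 2 / info0
lr = 2 * (loglik(theta_hat) - loglik(theta0))

# Each statistic is approximately chi-squared with 1 degree of freedom.
print(wald, score_stat, lr)
```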

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of $\theta$ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for $\hat{\theta}$ or the $\chi^2_k$ distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of $\theta$. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model $f(y; \theta)$ that $\theta = (\psi, \lambda)$, where $\psi$ is a $k_1$-dimensional parameter of interest (where $k_1$ will often be 1). The profile log-likelihood function of $\psi$ is

$$\ell_P(\psi) = \ell(\psi, \hat{\lambda}_\psi),$$

where $\hat{\lambda}_\psi$ is the constrained maximum likelihood estimate: it maximizes the likelihood function $L(\psi, \lambda)$ when $\psi$ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of $\theta$, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

$$2\{\ell_P(\hat{\psi}) - \ell_P(\psi)\} = 2\{\ell(\hat{\psi}, \hat{\lambda}) - \ell(\psi, \hat{\lambda}_\psi)\} \stackrel{d}{\to} \chi^2_{k_1}.$$

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of $\lambda$. For example, the profile likelihood approach to estimation of $\sigma^2$ in the linear regression model (see 7Linear Regression Models) $y \sim N(X\beta, \sigma^2)$ will lead to the estimator $\hat{\sigma}^2 = \sum (y_i - \hat{y}_i)^2/n$, whereas the estimator usually preferred has divisor $n - p$, where $p$ is the dimension of $\beta$.
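A small Python sketch of profiling in the normal example mentioned earlier, with $\psi = \mu$ of interest and $\lambda = \sigma^2$ a nuisance parameter; maximizing over $\sigma^2$ for fixed $\mu$ gives $\hat{\sigma}^2_\mu = \sum (y_i - \mu)^2/n$ (added for illustration; the sample is invented):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=1.0, scale=2.0, size=25)   # invented sample
n = len(y)

def profile_loglik(mu):
    sigma2_mu = np.mean((y - mu) ** 2)        # constrained MLE of sigma^2 given mu
    return -0.5 * n * np.log(sigma2_mu) - 0.5 * n

mu_grid = np.linspace(y.mean() - 2, y.mean() + 2, 401)
lp = np.array([profile_loglik(m) for m in mu_grid])

# Approximate 95% profile likelihood interval: 2*(lp_max - lp) <= 3.84,
# the 0.95 quantile of chi-squared with 1 degree of freedom.
keep = 2 * (lp.max() - lp) <= 3.84
print("mu_hat ~", mu_grid[np.argmax(lp)], "interval:", mu_grid[keep][[0, -1]])
```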

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for $\psi$:

$$\pi_M(\psi \mid y) \propto \int \pi(\psi, \lambda \mid y)\, d\lambda,$$

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

$$f(y; \theta) = f_1(y_1; \psi)\, f_2(y_2 \mid y_1; \lambda) \quad \text{or} \quad f(y; \theta) = f_1(y_1 \mid y_2; \psi)\, f_2(y_2; \lambda).$$

A discussion of conditional inference and density factorisations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function $U(\theta)$ and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser 2003). There is recent interest, for multi-dimensional responses $Y_i$, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

"From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century."


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables (with discussion). J R Stat Soc B
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci
Reid N () Likelihood. J Am Stat Assoc
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml
Varin C () On composite marginal likelihood. Adv Stat Anal


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, … is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX₁ exists, then n^{−1}S_n → a

(almost surely, i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
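A small simulation in Python (a sketch; the exponential choice of summand distribution and all names are arbitrary) illustrates both approximations: the sample means settle near a, while the standardized sums ζ_n behave like a standard normal variable:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 1000, 5000
a, sigma = 1.0, 1.0                      # mean and sd of Exp(1)
X = rng.exponential(scale=1.0, size=(reps, n))
S = X.sum(axis=1)
print(np.mean(S / n))                    # LLN: close to a = 1
zeta = (S - n * a) / (sigma * np.sqrt(n))
# CLT: empirical P(zeta < 1) is close to Phi(1)
print(np.mean(zeta < 1.0), norm.cdf(1.0))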

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}), P^{−1} being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from the random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) Σ_{j=1}^n X_j → a (a.s.),

(σ²)* = (1/n) Σ_{j=1}^n (X_j − a*)² = (1/n) Σ_{j=1}^n X_j² − (a*)² → σ² (a.s.).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
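For the mean, for instance, the CLT-based interval can be computed directly (a sketch under the usual finite-variance assumption; the gamma data and the 95% level are arbitrary choices):

import numpy as np
from scipy.stats import norm

x = np.random.default_rng(3).gamma(shape=2.0, size=400)
n = len(x)
a_star = x.mean()
sd_star = x.std(ddof=0)                  # square root of (sigma^2)*
z = norm.ppf(0.975)                      # two-sided 95% level
half = z * sd_star / np.sqrt(n)
print(a_star - half, a_star + half)      # asymptotic CI for a = EX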

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables

is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ a.s. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞ then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
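The limit can be checked by simulation (a sketch; the centered-exponential increments and the truncation of the supremum at a fixed horizon are my own choices):

import numpy as np

rng = np.random.default_rng(4)
theta, sigma2, reps, horizon = 0.05, 1.0, 1000, 5000
X = rng.exponential(1.0, size=(reps, horizon)) - 1.0   # EX = 0, Var X = 1
drift = theta * np.arange(1, horizon + 1)
S = np.maximum(np.cumsum(X, axis=1) - drift, 0.0).max(axis=1)
x = 1.0
# empirical P(theta*S > x) versus the Kingman-Prokhorov limit
print(np.mean(theta * S > x), np.exp(-2 * x / sigma2))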

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ₀ > 0, then in the above example one has

P(S > x) ∼ c e^{−μ₀x}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x.

For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).
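The empirical-process limit is what underlies, for example, the Kolmogorov–Smirnov statistic; a quick sketch (sample size and distribution arbitrary) compares √n sup_u |F*_n(u) − F(u)| with its limiting (Kolmogorov) distribution:

import numpy as np
from scipy.stats import kstwobign, norm

rng = np.random.default_rng(5)
n, reps = 500, 2000
stats = np.empty(reps)
for r in range(reps):
    u = np.sort(norm.cdf(rng.normal(size=n)))   # F(X_i), sorted
    k = np.arange(1, n + 1)
    d = np.maximum(k / n - u, u - (k - 1) / n).max()
    stats[r] = np.sqrt(n) * d
# an upper quantile of sqrt(n)*D_n versus the Kolmogorov limit law
print(np.quantile(stats, 0.95), kstwobign.ppf(0.95))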

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i ∣ b_i ∼ N(X_i β + Z_i b_i, Σ_i), (1)
b_i ∼ N(0, D), (2)

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.

Note that the random-effects structure in (1) and (2) implicitly implies the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ² I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i), V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{−1}(α)(y_i − X_i β). Replacing β

and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i^{−1} X_i β̂ + (I_{n_i} − Σ_i V_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i^{−1} and I_{n_i} − Σ_i V_i^{−1}, respectively. Note that the "numerator" of Σ_i V_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
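A minimal random-intercept sketch (all names and the data-generating choices are hypothetical) makes the shrinkage visible: the empirical Bayes intercepts have smaller spread than the true random intercepts they estimate:

import numpy as np

rng = np.random.default_rng(6)
m, n_i = 200, 5                       # units, measurements per unit
d, sigma2 = 1.0, 4.0                  # Var(b_i) and residual variance
b = rng.normal(scale=np.sqrt(d), size=m)
y = b[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(m, n_i))
# For a random intercept with known variances, the EB estimate of b_i
# is the unit mean shrunk toward 0 by the factor d / (d + sigma2 / n_i)
shrink = d / (d + sigma2 / n_i)
b_eb = shrink * y.mean(axis=1)
print(b.var(), b_eb.var())            # EB estimates are less variable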

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction."
(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y ∣ x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f, known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω₁, ω₂, embedded in white noise background: y[t] = β₁ + β₂ sin[ω₁t] + β₃ sin[ω₂t] + ε. We write β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (1, x₂, x₃), x₂ ≡ x₂(t) = sin[ω₁t], x₃ ≡ x₃(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y ∣ x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ω₁t_i], sin[ω₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].
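A short sketch (the frequencies, sample times, and coefficient values are arbitrary illustrative choices, since the article leaves them generic) fits this sine-wave model by ordinary least squares:

import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 10.0, 200)
w1, w2 = 1.0, 2.5                      # assumed known frequencies
X = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
beta = np.array([0.5, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.3, size=t.size)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                        # close to (0.5, 2.0, -1.0)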

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having under some conditions behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal) and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope ∼ 1/3 of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1 then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and within a few years he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in

our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.

Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as y = xβ + ε, an abbreviated matrix formulation of the system of equations

y_i = x_{i1}β₁ + ⋯ + x_{ip}β_p + ε_i, i ≤ n, (1)

in which the random errors ε are assumed to satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n.

The interpretation is that response y_i is observed for the ith sample in conjunction with numerical values (x_{i1}, …, x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^{tr}x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^{tr}x)^{−1}x^{tr}y = β + Mε, (2)

for M = (x^{tr}x)^{−1}x^{tr}, with x^{tr} denoting the matrix transpose of x and (x^{tr}x)^{−1} the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3)–(4):

E(β̂_LS)_k = β_k, k ≤ p, (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p. (4)

The least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3)–(4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_kε and M_k(ε + c1) = M_kε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving

an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.
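The flavor of residual resampling can be conveyed in a few lines (a sketch, not the precise procedure of LePage and Podgorski: here the least squares residuals are randomly permuted and the contrast on the refit coefficients is recorded each time):

import numpy as np

rng = np.random.default_rng(8)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
M = np.linalg.pinv(X)                      # equals (X'X)^{-1} X' here
beta_hat = M @ y
resid = y - X @ beta_hat
v = np.array([0.0, 1.0])                   # contrast on the slope
draws = np.empty(3000)
for b in range(3000):
    y_star = X @ beta_hat + rng.permutation(resid)
    draws[b] = v @ (M @ y_star - beta_hat)
# 'draws' approximates the sampling distribution of v(beta_LS - beta)
print(np.quantile(draws, [0.025, 0.975]))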

Generalized Least Squares
If the errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3)–(4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^{tr}Σ^{−1}x)^{−1}x^{tr}Σ^{−1}y = β + (x^{tr}Σ^{−1}x)^{−1}x^{tr}Σ^{−1}ε. (5)
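A direct transcription of (5) follows (a sketch; the AR(1)-style covariance matrix is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(9)
n = 80
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
Sigma = 0.8 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
y = X @ np.array([1.0, 0.5]) + L @ rng.normal(size=n)
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(beta_gls)                            # GLS estimate, cf. equation (5)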

Generalized least squares solution (5) retains properties (3)–(4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3)–(4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + xβ for every x,

and the discrepancies y − E(y ∣ x) are for each x distributed N(0, σ²), for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model the ε_i are simply the residual discrepancies y − xβ̂_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation ∼ 1/3, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less or equal cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E(YJ) < c EJ, and so

EY 1(X > c) = E(Y 1(Y > c) + YJ) = EX 1(X > c) + E(YJ)
< EX 1(X > c) + c EJ = EX 1(X > c),

yielding E(Y ∣ X > c) < E(X ∣ X > c). ◻

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropological Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics: inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal, Elsevier
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture Notes in Computer Science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, x_n) is a sample from a stochastic process x = {x₁, x₂, …}. Let p_n(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n); θ_n)/p_n(x(n); θ)], where θ_n = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density p_n(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} h^t J_n(θ) h + o_p(1), (1)

where S_n(θ) = d ln p_n(x(n); θ)/dθ, J_n(θ) = −d² ln p_n(x(n); θ)/dθ dθ^t, and

(i) n^{−1/2} S_n(θ) →_d N_k(0, F(θ)), (ii) n^{−1} J_n(θ) →_p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal and efficient estimator of θ, with

√n(θ̂_n − θ) →_d N_k(0, F^{−1}(θ)). (3)

A large class of models involving the classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.

J_n^{1/2}(θ)(θ̂_n − θ) →_d N(0, I), (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals.

Suppose, conditionally on V = v, (x₁, x₂, …, x_n) are i.i.d. N(θ, v^{−1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

p_n(x(n); θ) ∝ [1 + (1/2) Σ_{i=1}^n (x_i − θ)²]^{−(n/2 + 1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →_d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process.

Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, …, with x₀ = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

p_n(x(n); θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θx_{t−1})²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂_n = (Σ_{t=1}^n x_t x_{t−1}) (Σ_{t=1}^n x_{t−1}²)^{−1}.

One can verify that √n(θ̂_n − θ) →_d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1}θ^n(θ̂_n − θ) →_d Cauchy for ∣θ∣ > 1.
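A quick simulation sketch (the sample sizes and θ values are arbitrary) of the ML estimator in Example 2:

import numpy as np

rng = np.random.default_rng(10)

def ar1_mle(theta, n):
    # Simulate x_t = theta * x_{t-1} + e_t and return the ML estimator.
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])

print(ar1_mle(0.5, 2000))   # stationary case (LAN): close to 0.5
print(ar1_mle(1.05, 200))   # explosive case (LAMN): close to 1.05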

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is

√Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads:

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean, E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′, 1 = (1, 1, …, 1)′, α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,
ε = (ε₁, ε₂, …, ε_n)′, θ = (a, b)′, A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′B^{−1}A)^{−1} A′B^{−1} x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B^{−1}A)^{−1}.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
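For the normal case, a simple ordinary-least-squares version of this probability-plot idea can be sketched as follows (a sketch only: an exact BLUE would need the full covariance matrix B, so here α_{i:n} is approximated by normal quantiles at standard plotting positions):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
a_true, b_true, n = 10.0, 2.0, 100
x = np.sort(rng.normal(loc=a_true, scale=b_true, size=n))
i = np.arange(1, n + 1)
alpha = norm.ppf((i - 0.375) / (n + 0.25))   # Blom's plotting positions
b_hat, a_hat = np.polyfit(alpha, x, 1)       # slope ~ b, intercept ~ a
print(a_hat, b_hat)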

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover before joining Volkswagen AG to do operations research. He was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA - Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC 2009).

L Logistic Normal Distribution

Cross References
►Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J (2006) MLE of parameters of location-scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H (2010) Location-scale distributions - linear estimation and probability plotting. httpgebuni-giessengebvolltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ►beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).
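
A small numpy sketch of both cases, with hypothetical parameter values: a univariate logit-normal draw on (0, 1), and a vector of proportions obtained by the additive logistic transformation of a multivariate normal vector:

    import numpy as np

    rng = np.random.default_rng(0)

    # Univariate: logit(P) ~ N(mu, sigma^2)  =>  P = 1/(1 + exp(-X))
    mu, sigma = 0.5, 1.2                       # hypothetical values
    x = rng.normal(mu, sigma, size=10_000)
    p = 1.0 / (1.0 + np.exp(-x))               # logistic-normal draws on (0, 1)

    # Multivariate: additive logistic transform of a bivariate normal
    # gives proportions (p1, p2, p3) on the simplex
    m = np.array([0.2, -0.1])
    S = np.array([[1.0, 0.3], [0.3, 0.5]])
    z = rng.multivariate_normal(m, S, size=10_000)
    e = np.exp(z)
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    props = np.column_stack([e / denom, 1.0 / denom])   # rows sum to 1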

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, . . . , n), the response probabilities Pi are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, . . . , xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi)
logit(Pi) ∣ Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ
Z ∼ N(0, 1)

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (1982),

E[Ri] = miπi and Var(Ri) = miπi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
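
A quick simulation check of the Williams (1982) variance approximation, with hypothetical values of m, η and σ (a sketch, not part of the original entry):

    import numpy as np

    rng = np.random.default_rng(1)
    m, eta, sigma = 20, 0.3, 0.4                       # hypothetical values
    z = rng.normal(size=200_000)
    p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))       # logit-normal Pi
    r = rng.binomial(m, p)                             # Ri | Pi ~ Binomial(m, Pi)

    pi = 1.0 / (1.0 + np.exp(-eta))                    # logit(pi) = eta
    approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
    print(r.var(), approx)   # simulated variance vs. Williams approximation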

About the Author
For biography see the entry ►Logistic Distribution.

Cross References
►Logistic Distribution
►Mixed Membership Models
►Multivariate Normal Distributions
►Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika



Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor
University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics
Arizona State University, Tempe, AZ, USA
Solar System Ambassador
California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

f(yi; πi) = πi^yi (1 − πi)^(1−yi)   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton-Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ►generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{[yiθi − b(θi)]/α(ϕ) + c(yi; ϕ)}   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π); these two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

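These derivative calculations are easy to verify symbolically; a minimal sympy sketch (not from the original entry) differentiating the cumulant b(θ) = −ln(1 − π), with π = e^θ/(1 + e^θ) the inverse of the canonical logit link:

    import sympy as sp

    theta = sp.symbols('theta')
    pi = sp.exp(theta) / (1 + sp.exp(theta))   # inverse logit link
    b = -sp.log(1 - pi)                        # Bernoulli cumulant

    mean = sp.diff(b, theta)                   # b'(theta)  -> pi
    var = sp.diff(b, theta, 2)                 # b''(theta) -> pi*(1 - pi)
    print(sp.simplify(mean - pi))              # 0
    print(sp.simplify(var - pi * (1 - pi)))    # 0
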
Estimation of statistical models using the GLM algorithm, as well as MLE, is based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as:

L(μi; yi) = ∑_{i=1}^n {yi ln(μi/(1 − μi)) + ln(1 − μi)}   (4)

or

L(μi; yi) = ∑_{i=1}^n {yi ln(μi) + (1 − yi) ln(1 − μi)}   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 ∑_{i=1}^n {yi ln(yi/μi) + (1 − yi) ln[(1 − yi)/(1 − μi)]}   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore,

x′iβ = ln(μi/(1 − μi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn   (7)

The value of μi for each observation in the logistic model is calculated as

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ))   (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton-Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑_{i=1}^n (yi − μi)xi   (9)

∂²L(β)/∂β∂β′ = −∑_{i=1}^n xix′i μi(1 − μi)   (10)

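The gradient (9) and Hessian (10) are all that is needed for a Newton-Raphson fit. The following self-contained numpy sketch, with simulated data and hypothetical parameter values, iterates β ← β − H⁻¹g and reports coefficients, standard errors, the log-likelihood (5), and exponentiated coefficients:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + predictor
    beta_true = np.array([-0.5, 1.0])                      # hypothetical values
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta = np.zeros(X.shape[1])
    for _ in range(25):
        mu = 1 / (1 + np.exp(-X @ beta))                 # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                            # gradient, Eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X     # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                               # Newton-Raphson update
        if np.max(np.abs(step)) < 1e-10:
            break

    mu = 1 / (1 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))  # Eq. (5)
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # SEs from the Hessian
    print(beta, se, loglik)
    print(np.exp(beta[1:]))                       # odds ratio(s)
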
One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived     child    adults    Total
no              52       765      817
yes             57       442      499
Total          109      1207     1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.      z      P>∣z∣    [95% Conf. Interval]
age           0.5271        0.1059     -3.19    0.001      0.3555    0.7813

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

► The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above (1/0.527 ≈ 1.9), we may conclude that children had some 90%, or nearly two times, greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

► The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or

proportional data. For these models, the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi−yi)   (11)

with a corresponding log-likelihood function expressed as

L(μi; yi, mi) = ∑_{i=1}^n {yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln[(mi choose yi)]}   (12)

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = miπi and variance μi(1 − μi/mi).

Consider the data below:

y    cases    x1    x2    x3
1      3       ·     ·     ·
·      ·       ·     ·     ·
·      ·       ·     ·     ·

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases with the predictor values shown for x1, x2 and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio    OIM std. err.     z     P>∣z∣    [95% conf. interval]
x1        ·              ·            ·       ·              ·
x2        ·              ·            ·       ·              ·
x3        ·              ·            ·       ·              ·
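
In practice such grouped (success/failure) logistic models can be fitted directly with standard GLM software; here is a short statsmodels sketch with hypothetical grouped data (the counts and covariates below are illustrative, not the table from the entry):

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical grouped data: y successes out of `cases` trials
    y = np.array([1, 3, 2, 4])
    cases = np.array([3, 5, 4, 6])
    X = sm.add_constant(np.array([[1.0, 0.0],
                                  [0.0, 1.0],
                                  [1.0, 1.0],
                                  [0.0, 0.0]]))

    # Binomial GLM with logit link; endog is (successes, failures), Eq. (11)-(12)
    fit = sm.GLM(np.column_stack([y, cases - y]), X,
                 family=sm.families.Binomial()).fit()
    print(np.exp(fit.params))   # exponentiated coefficients = odds ratios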

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described, and modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many thousands of observations grouped into a table of only a few covariate-pattern rows like the one described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some

of the more popular tests include the Hosmer-Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer-Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data are classified. Rather than comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see ►Akaike's Information Criterion and ►Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall/CRC), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track and field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for



modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic

distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,  −∞ < x < ∞,

h(x) = (1/τ) / [1 + exp{−(x − μ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],  t ≥ 0, θ > 0, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = loge[F(x)/(1 − F(x))] = x   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = loge[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
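
This inversion is a one-line simulator; a small numpy sketch (with hypothetical μ and τ) that also checks the location and variance of the draws against the theoretical values given above:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, tau = 2.0, 0.5                        # hypothetical values
    u = rng.uniform(size=100_000)
    x = tau * np.log(u / (1 - u)) + mu        # tau*X + mu, X standard logistic

    print(x.mean())                           # approximately mu
    print(x.var(), tau**2 * np.pi**2 / 3)     # empirical vs theoretical variance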

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = loge[P(d)/(1 − P(d))] = β0 + β1d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β0/β1 and τ = 1/∣β1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

► Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

► A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If

LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (1912).


Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves. Using the Gini index, L1(p) has greater inequality than L2(p)

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
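
A short numpy sketch, with hypothetical (simulated) income data, that computes an empirical Lorenz curve and the corresponding Gini index as the ratio of the area between the diagonal and the curve to the whole area under the diagonal:

    import numpy as np

    def lorenz_gini(income):
        """Empirical Lorenz curve on the grid p = i/n, and the Gini index."""
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        p = np.arange(n + 1) / n
        L = np.concatenate([[0.0], np.cumsum(x)]) / x.sum()
        area = np.sum(L[1:] + L[:-1]) / (2 * n)   # trapezoidal area under L
        gini = (0.5 - area) / 0.5                 # = 1 - 2*area
        return p, L, gini

    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)  # skewed incomes
    p, L, G = lorenz_gini(income)
    print(G)   # close to the lognormal's theoretical Gini for sigma = 0.8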

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem. Let u(x) be a continuous monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μY/μX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
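
To see the theorem in action, one can apply a flat and a progressive tax rule to simulated incomes and compare Gini indices; a self-contained sketch with hypothetical tax rules:

    import numpy as np

    def gini(income):
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        L = np.concatenate([[0.0], np.cumsum(x)]) / x.sum()
        return 1.0 - np.sum(L[1:] + L[:-1]) / n

    rng = np.random.default_rng(0)
    x = rng.lognormal(sigma=0.8, size=10_000)

    flat = 0.8 * x               # u(x)/x constant: Gini unchanged, case (II)
    progressive = x ** 0.7       # u(x)/x decreasing: Gini reduced, case (I)
    print(gini(x), gini(flat), gini(progressive))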

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics at the University of Helsinki with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

L Loss Function

Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt-Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise the value gives the loss which is suffered by the decision being unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative: Θ = Θ0 + Θ1. The usual loss function is then given by

L(θ, ϑ) = 0  if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1  if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space, and let ∥⋅∥ be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥², θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses:

ℓ(t) = t²  if ∣t∣ ≤ γ,   and   ℓ(t) = 2γ∣t∣ − γ²  if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian (1975) introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
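
For concreteness, here is a small numpy sketch (not from the entry) implementing the square, L1, Huber and LinEx losses as functions of t = ϑ − θ; the default tuning constants are hypothetical choices:

    import numpy as np

    def square(t):
        return np.asarray(t, dtype=float) ** 2

    def l1(t):
        return np.abs(t)

    def huber(t, gamma=1.0):
        # t^2 inside [-gamma, gamma]; linear growth 2*gamma*|t| - gamma^2 outside
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

    def linex(t, a=1.0, b=1.0):
        # asymmetric: exponential on one side, approximately linear on the other
        t = np.asarray(t, dtype=float)
        return b * (np.exp(a * t) - a * t - 1)

    t = np.linspace(-3, 3, 7)
    print(huber(t))
    print(linex(t))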

In theoretical works, the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ►Rao-Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ( ).

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback-Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, . . . , yn, observed at design points t1, . . . , tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function as the 'best approximation', i.e., the function in F that minimizes ∑_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = yi − f(ti) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
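
A compact illustration of how the choice of ℓ changes the fitted regression function, using scipy.optimize.minimize; the data and the model class (straight lines) are hypothetical:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 50)
    y = 1.0 + 2.0 * t + rng.normal(scale=0.2, size=t.size)
    y[::10] += 3.0                       # a few outliers

    def fit(loss):
        # F = straight lines f(t) = c0 + c1*t; minimize the summed loss
        obj = lambda c: np.sum(loss(y - (c[0] + c[1] * t)))
        return minimize(obj, x0=np.zeros(2), method='Nelder-Mead').x

    print(fit(lambda r: r**2))           # least squares: pulled by the outliers
    print(fit(np.abs))                   # L1 loss: more robust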

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W ( ) Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Laws of Large Numbers

that meet specifications. She estimates p by using the proportion p̂n of the first n widgets produced that meet specifications, and she is interested in knowing if there will ever be a point in the sequence of examined widgets such that, with probability (at least) a specified large value, p̂n will be within ε of p and stay within ε of p as the sampling continues (where ε > 0 is a prescribed tolerance). The answer is affirmative, since (1) is equivalent to the assertion that for a given ε > 0 and δ > 0, there exists a positive integer Nε,δ such that

P(∩_{n=Nε,δ}^∞ [∣p̂n − p∣ ≤ ε]) ≥ 1 − δ.

That is, the probability is arbitrarily close to 1 that p̂n will be arbitrarily close to p simultaneously for all n beyond some point. If one applied instead the WLLN (2), then it could only be asserted that for a given ε > 0 and δ > 0, there exists a positive integer Nε,δ such that

P(∣p̂n − p∣ ≤ ε) ≥ 1 − δ for all n ≥ Nε,δ.

There are numerous other versions of the LLNs, and we will discuss only a few of them. Note that the expressions in (1) and (2) can be rewritten, respectively, as

(∑_{i=1}^n Xi − nc)/n → 0 a.s.   and   (∑_{i=1}^n Xi − nc)/n → 0 in probability,

thereby suggesting the following definitions. A sequence of random variables {Xn, n ≥ 1} is said to obey a general SLLN (resp., WLLN) with centering sequence {an, n ≥ 1} and norming sequence {bn, n ≥ 1} (where 0 < bn → ∞) if

(∑_{i=1}^n Xi − an)/bn → 0 a.s.   (resp., (∑_{i=1}^n Xi − an)/bn → 0 in probability).

A famous result of Marcinkiewicz and Zygmund (see, e.g., Chow and Teicher 1997) extended the Kolmogorov SLLN as follows. Let {Xn, n ≥ 1} be a sequence of i.i.d. random variables, and let 0 < p < 2. Then

(∑_{i=1}^n Xi − nc)/n^(1/p) → 0 a.s. for some c ∈ R if and only if E∣X1∣^p < ∞.

In such a case, necessarily c = EX1 if p ≥ 1, whereas c is arbitrary if p < 1. Feller (1946) extended the Marcinkiewicz-Zygmund SLLN to the case of a more general norming sequence {bn, n ≥ 1} satisfying suitable growth conditions.

The following WLLN is ascribed to Feller by Chow and Teicher (1997): if {Xn, n ≥ 1} is a sequence of i.i.d. random variables, then there exist real numbers an, n ≥ 1, such that

(∑_{i=1}^n Xi − an)/n → 0 in probability   (3)

if and only if

nP(∣X1∣ > n) → 0 as n → ∞.   (4)

In such a case, an − nE(X1 I[∣X1∣ ≤ n]) → 0 as n → ∞. The condition (4) is weaker than E∣X1∣ < ∞. If {Xn, n ≥ 1} is a sequence of i.i.d. random variables where X1 has probability density function

f(x) = c/(x² log ∣x∣) for ∣x∣ ≥ e,   f(x) = 0 for ∣x∣ < e,

where c is a constant, then E∣X1∣ = ∞ and the SLLN ∑_{i=1}^n Xi/n → c a.s. fails for every c ∈ R, but (4), and hence the WLLN (3), hold with an = 0, n ≥ 1.
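
A quick numpy experiment contrasting the two modes of convergence for this density (illustrative only): draws can be generated by rejection from the envelope g(x) = e/(2x²), ∣x∣ ≥ e, accepting with probability 1/log∣x∣:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # Rejection sampling from f(x) proportional to 1/(x^2 log|x|), |x| >= e
        out = []
        while len(out) < n:
            m = n - len(out)
            absx = np.e / rng.uniform(size=m)      # |X| with density e/x^2 on [e, inf)
            keep = rng.uniform(size=m) < 1.0 / np.log(absx)
            signs = rng.choice([-1.0, 1.0], size=m)
            out.extend((signs * absx)[keep])
        return np.array(out[:n])

    x = sample(1_000_000)
    means = np.cumsum(x) / np.arange(1, x.size + 1)
    # WLLN with a_n = 0: at a fixed large n the running mean is near 0 ...
    print(means[-1])
    # ... but E|X| is infinite, so the SLLN fails: the path of running means
    # keeps making occasional large excursions, however far out one looks.
    print(np.abs(means[100_000:]).max())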

Klass and Teicher (1977) extended the Feller WLLN to the case of a more general norming sequence {bn, n ≥ 1}, thereby obtaining a WLLN analog of Feller's (1946) extension of the Marcinkiewicz-Zygmund SLLN.

Good references for studying the LLNs are the books by Révész (1968), Stout (1974), Loève (1977), Chow and Teicher (1997), and Petrov (1995). While the LLNs have been studied extensively in the case of independent summands, some of the LLNs presented in these books involve summands obeying a dependence structure other than independence.

A large literature of investigation on the LLNs for sequences of Banach space valued random elements has emerged, beginning with the pioneering work of Mourier (1953). See the monograph by Taylor (1978) for background material and results up to 1978. Excellent references are the books by Vakhania, Tarieladze, and Chobanyan (1987) and Ledoux and Talagrand (1991). More recent results are provided by Adler, Rosalsky, and Taylor, by Cantrell and Rosalsky, and by the references in these two articles.

About the Author
Professor Rosalsky is Associate Editor of the Journal of Applied Mathematics and Stochastic Analysis and Associate Editor of the International Journal of Mathematics and Mathematical Sciences. He has collaborated with research workers from many different countries. At the University of Florida, he twice received a Teaching Improvement Program Award, and on five occasions was named an Anderson Scholar Faculty Honoree, an honor reserved for faculty members designated by undergraduate honor students as being the most effective and inspiring.

Cross References
►Almost Sure Convergence of Random Variables
►Borel-Cantelli Lemma and Its Generalizations



►Chebyshev's Inequality
►Convergence of Random Variables
►Ergodic Theorem
►Estimation: An Overview
►Expected Value
►Foundations of Probability
►Glivenko-Cantelli Theorems
►Probability Theory: An Outline
►Random Field
►Statistics, History of
►Strong Approximations in Probability and Statistics

References and Further Reading
Adler A, Rosalsky A, Taylor RL ( ) A weak law for normed weighted sums of random elements in Rademacher type p Banach spaces. J Multivariate Anal
Cantrell A, Rosalsky A ( ) A strong law for compactly uniformly integrable sequences of independent random elements in Banach spaces. Bull Inst Math Acad Sinica
Chow YS, Teicher H (1997) Probability theory: independence, interchangeability, martingales, 3rd edn. Springer, New York
Feller W (1946) A limit theorem for random variables with infinite moments. Am J Math
Klass M, Teicher H (1977) Iterated logarithm laws for asymmetric random variables barely with or without finite mean. Ann Probab
Ledoux M, Talagrand M (1991) Probability in Banach spaces: isoperimetry and processes. Springer, Berlin
Loève M (1977) Probability theory, vol I, 4th edn. Springer, New York
Mourier E (1953) Eléments aléatoires dans un espace de Banach. Annales de l'Institut Henri Poincaré
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon, Oxford
Rényi A (1970) Probability theory. North-Holland, Amsterdam
Révész P (1968) The laws of large numbers. Academic, New York
Stout WF (1974) Almost sure convergence. Academic, New York
Taylor RL (1978) Stochastic convergence of weighted sums of random elements in linear spaces. Lecture notes in mathematics, vol 672. Springer, Berlin
Vakhania NN, Tarieladze VI, Chobanyan SA (1987) Probability distributions on Banach spaces. D Reidel, Dordrecht, Holland

Learning Statistics in a Foreign Language

Khidir M. Abdelbasit
Sultan Qaboos University, Muscat, Sultanate of Oman

Background
The Sultanate of Oman is an Arabic-speaking country, where the medium of instruction in pre-university education is Arabic. In Sultan Qaboos University (SQU) all sciences (including Statistics) are taught in English. The reason is that most of the scientific literature is in English, and teaching in the native language may leave graduates at a disadvantage. Since only few instructors speak Arabic, the university adopts a policy of no communication in Arabic in classes and office hours. Students are required to achieve a minimum level in English (in terms of IELTS score) before they start their study program. Very few students achieve that level on entry, and the majority spends about two semesters doing English only.

Language and Cultural Problems
It is to be expected that students from a non-English-speaking background will face serious difficulties when learning in English, especially in the first year or two. Most of the literature discusses problems faced by foreign students pursuing study programs in an English-speaking country, or by a minority in a multi-cultural society (see, for example, Coutis and Wood; Hubbard; Koh). Such students live (at least while studying) in an English-speaking community with which they have to interact on a daily basis. These difficulties are more serious for our students, who are studying in their own country, where English is not the official language. They hardly use English outside classrooms, and avoid talking in class as much as they can.

My SQU Experience
Statistical concepts and methods are most effectively taught through real-life examples that the students appreciate and understand. We use the most popular textbooks in the USA for our courses; these textbooks use this approach with US students in mind. Our main problems are:

● Most of the examples and exercises used are completely alien to our students. The discussions meant to maintain the students' interest only serve to put ours off. With limited English, they have serious difficulties understanding what is explained, and hence tend not to listen to what the instructor is saying. They do not read the textbooks, because they contain pages and pages of lengthy explanations and discussions they cannot follow. A direct effect is that students may find the subject boring and quickly lose interest. Their attention then turns to the art of passing tests instead of acquiring the intended knowledge and skills. To pass their tests they use both their class and study times looking through examples, concentrating on what formula to use and where to plug in the numbers they have to get the answer. This way they manage to do the mechanics fairly well, but the concepts are almost entirely missed.


● The problem is worse with introductory probability courses, where games of chance are extensively used as illustrative examples in textbooks. Most of our students have never seen a deck of playing cards, and some may even be offended by discussing card games in a classroom.

The burden of finding strategies to overcome these difficulties falls on the instructor. Statistical terms and concepts such as parameter/statistic, sampling distribution, unbiasedness, consistency, sufficiency, and the ideas underlying hypothesis testing are not easy to get across even in the students' own language. To do that in a foreign language is a real challenge. For the Statistics program to be successful, all (or at least most of the) instructors involved should be up to this challenge. This is a time-consuming task with little reward other than self-satisfaction. In SQU the problem is compounded further by the fact that most of the instructors are expatriates on short-term contracts, who are more likely to use their time for personal career advancement rather than time-consuming community service jobs.

What Can Be Done
For our first Statistics course we produced a manual that contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good collection of problems from the local culture to motivate the students. The manual was well received by the students, to the extent that they prefer to practice with examples from the manual rather than from the textbook.

Textbooks written in English that are brief and to the point are needed. These should include examples and exercises from the students' own culture. A student trying to understand a specific point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students' faces clearly indicate that you have got nothing across, it is natural to try explaining more, using more examples. In an end-of-semester evaluation of a course I taught, a student once wrote: "The instructor explains things more than needed. He makes simple points difficult." This indicates that, when teaching in a foreign language, lengthy oral or written explanations are not helpful. A better strategy is to explain concepts and techniques briefly and to provide plenty of examples and exercises that will help the students absorb the material by osmosis. The basic statistical concepts can only be effectively communicated to students in their own language. For this reason, textbooks should contain a good glossary where technical terms and concepts are explained using the local language.

I expect such textbooks to go a long way toward enhancing students' understanding of Statistics. An international project can be initiated to produce an introductory statistics textbook with different versions intended for different geographical areas: the English material would be the same, the examples would vary to some extent from area to area, and the glossaries would be in local languages. Universities in the developing world naturally look at western universities as models, and international (western) involvement in such a project is needed for it to succeed. The project would be a major contribution to the promotion of understanding Statistics and excellence in statistical education in developing countries. The International Statistical Institute takes pride in supporting statistical progress in the developing world. This project can lay the foundation for this progress and hence is worth serious consideration by the institute.

Cross References
7Online Statistics Education
7Promoting, Fostering and Development of Statistics in Developing Countries
7Selection of Appropriate Statistical Methods in Developing Countries
7Statistical Literacy, Reasoning, and Thinking
7Statistics Education

References and Further Reading
Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. http://www.math.uoc.gr/~ictm/Proceedings/pap.pdf
Hubbard R () Teaching statistics to students who are learning in a foreign language. ICOTS
Koh E () Teaching statistics to students with limited language skills using MINITAB. http://archives.math.utk.edu/ICTCM/VOL/C/paper.pdf

Least Absolute Residuals Procedure

Richard William Farebrother
Honorary Reader in Econometrics, Victoria University of Manchester, Manchester, UK

For $i = 1, 2, \dots, n$, let $x_{i1}, x_{i2}, \dots, x_{iq}, y_i$ represent the $i$th observation on a set of $q + 1$ variables, and suppose that we wish to fit a linear model of the form

$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iq}\beta_q + \epsilon_i$$



to these $n$ observations. Then, for $p > 0$, the $L_p$-norm fitting procedure chooses values for $b_1, b_2, \dots, b_q$ to minimise the $L_p$-norm of the residuals $\left[\sum_{i=1}^{n}|e_i|^p\right]^{1/p}$, where, for $i = 1, 2, \dots, n$, the $i$th residual is defined by

$$e_i = y_i - x_{i1}b_1 - x_{i2}b_2 - \cdots - x_{iq}b_q.$$

The most familiar $L_p$-norm fitting procedure, known as the 7least squares procedure, sets $p = 2$ and chooses values for $b_1, b_2, \dots, b_q$ to minimize the sum of the squared residuals $\sum_{i=1}^{n} e_i^2$.

A second choice, to be discussed in the present article, sets $p = 1$ and chooses $b_1, b_2, \dots, b_q$ to minimize the sum of the absolute residuals $\sum_{i=1}^{n} |e_i|$.

A third choice sets $p = \infty$ and chooses $b_1, b_2, \dots, b_q$ to minimize the largest absolute residual $\max_{i=1}^{n} |e_i|$.

Setting $u_i = e_i$ and $v_i = 0$ if $e_i \ge 0$, and $u_i = 0$ and $v_i = -e_i$ if $e_i < 0$, we find that $e_i = u_i - v_i$, so that the least absolute residuals (LAR) fitting problem chooses $b_1, b_2, \dots, b_q$ to minimize the sum of the absolute residuals

$$\sum_{i=1}^{n} (u_i + v_i)$$

subject to

$$x_{i1}b_1 + x_{i2}b_2 + \cdots + x_{iq}b_q + u_i - v_i = y_i \quad \text{for } i = 1, 2, \dots, n$$

and $u_i \ge 0$, $v_i \ge 0$ for $i = 1, 2, \dots, n$. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.

Gauss noted (when $q = 2$) that solutions of this problem are characterized by the presence of a set of $q$ zero residuals. Such solutions are robust to the presence of outlying observations. Indeed, they remain constant under variations in the other $n - q$ observations, provided that these variations do not cause any of the residuals to change their signs.
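Because the formulation above is an explicit linear program, it can be handed directly to a general-purpose LP solver. The following is a minimal sketch using SciPy's `linprog` (a modern stand-in for the dual simplex variants mentioned above); the design matrix, coefficients, and error distribution are all hypothetical.

```python
# Minimal sketch: LAR fitting as a linear program (hypothetical data).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, q = 50, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # constant + one regressor
y = X @ np.array([1.0, 2.0]) + rng.laplace(0, 1, n)       # Laplacian disturbances

# Variables: (b_1..b_q, u_1..u_n, v_1..v_n); minimize sum_i (u_i + v_i)
c = np.concatenate([np.zeros(q), np.ones(2 * n)])
A_eq = np.hstack([X, np.eye(n), -np.eye(n)])              # X b + u - v = y
bounds = [(None, None)] * q + [(0, None)] * (2 * n)       # b free; u, v >= 0
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)

b_lar = res.x[:q]
resid = y - X @ b_lar
print("LAR coefficients:", b_lar)
# A vertex solution typically exhibits q zero residuals (Gauss's observation):
print("zero residuals:", int(np.sum(np.isclose(resid, 0.0, atol=1e-8))))
```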

The LAR fitting procedure corresponds to the maximum likelihood estimator when the $\epsilon$-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the $\epsilon$-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed in 1760 by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711–1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant, $x_{i1} = 1$, and the values of $b_1$ and $b_2$ are constrained to satisfy the adding-up condition $\sum_{i=1}^{n}(y_i - b_1 - x_{i2}b_2) = 0$ usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with $q \ge 2$ variables are still to be found in the literature.

For an account of recent developments in this area, see the series of volumes edited by Dodge (1987, 1992, 1997, 2002). For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley, and Rhodes, see Farebrother (1999). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother (2002).

About the Author
Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement (he has long been blind), and he subsequently served as an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations (1988), Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900 (1999), and Visualizing Statistical Models and Concepts (2002). He has also published numerous research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr Farebrother was also Visiting Associate Professor at Monash University, Australia, and Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
7Least Squares
7Residuals

References and Further Reading
Dodge Y (ed) (1987) Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1992) L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1997) L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) (2002) Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel


Farebrother RW (1999) Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW (2002) Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stępniak
Professor, Maria Curie-Skłodowska University, Lublin, Poland; University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems $F(x) \approx b$ of equations, where $b \in \mathbb{R}^n$ while $F(x)$ is a column of the form

$$F(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{n-1}(x) \\ f_n(x) \end{bmatrix}$$

with entries $f_i = f_i(x)$, $i = 1, \dots, n$, where $x = (x_1, \dots, x_p)^T$. The LS problem is linear when each $f_i$ is a linear function, and nonlinear if not.

The linear LS problem refers to a system $Ax = b$ of linear equations. Such a system is overdetermined if $n > p$. If $b \notin \mathrm{range}(A)$, the system has no proper solution and will be denoted by $Ax \approx b$. In this situation we are seeking a solution of some optimization problem. The name "Least Squares" is justified by the $l_2$-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

$$y_i = f_i(x_{i1}, \dots, x_{ip}; \beta_1, \dots, \beta_p) + e_i \quad \text{for } i = 1, \dots, n, \qquad (1)$$

where $x_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, p$, are some constants given by the experimental design; $f_i$, $i = 1, \dots, n$, are given functions depending on unknown parameters $\beta_j$, $j = 1, \dots, p$; while $y_i$, $i = 1, \dots, n$, are the values of these functions, observed with some random errors $e_i$. We want to estimate the unknown parameters $\beta_j$ on the basis of the data set $\{x_{ij}, y_i\}$.

In linear regression, each $f_i$ is a linear function of the type $f_i = \sum_{j=1}^{p} c_{ij}(x_1, \dots, x_n)\beta_j$, and the model (1) may be presented in vector–matrix notation as

$$y = X\beta + e,$$

where $y = (y_1, \dots, y_n)^T$, $e = (e_1, \dots, e_n)^T$ and $\beta = (\beta_1, \dots, \beta_p)^T$, while $X$ is an $n \times p$ matrix with entries $x_{ij}$. If $e_1, \dots, e_n$ are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of $\beta$ reduces to finding a vector $\hat{\beta}$ that minimizes the norm

$$\|y - X\beta\|^2 = (y - X\beta)^T(y - X\beta).$$

Such a vector is said to be the ordinary LS solution of the overparametrized system $X\beta \approx y$. On the other hand, the latter reduces to solving the consistent system

$$X^T X \beta = X^T y$$

of linear equations, called the normal equations. In particular, if $\mathrm{rank}(X) = p$, then the system has a unique solution of the form

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
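As a small illustration, the normal equations can be solved directly; this is only a sketch with simulated, hypothetical data.

```python
# Minimal sketch: ordinary LS via the normal equations X^T X b = X^T y.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# rank(X) = p is assumed, so X^T X is invertible and the solution is unique
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("LS estimate:", beta_hat)
```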

For linear regression $y_i = \alpha + \beta x_i + e_i$ with one regressor $x$, the BLU estimators of the parameters $\alpha$ and $\beta$ may be presented in the convenient form

$$\hat{\beta} = \frac{ns_{xy}}{ns_x^2} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$$

where

$$ns_{xy} = \sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}, \qquad ns_x^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n},$$

$$\bar{x} = \frac{\sum_i x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_i y_i}{n}.$$

For its computation we only need to use a simple pocket calculator.

Example. The following table presents the number of residents in thousands ($x_i$) and the unemployment rate in % ($y_i$) for some cities of Poland; the task is to estimate the parameters $\beta$ and $\alpha$. From the four sums $\sum_i x_i$, $\sum_i y_i$, $\sum_i x_i^2$ and $\sum_i x_i y_i$ one computes $ns_x^2$ and $ns_{xy}$, hence $\hat{\beta}$ and $\hat{\alpha}$, and thus the fitted line $f(x) = \hat{\beta}x + \hat{\alpha}$.
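The same "pocket calculator" recipe is trivial to mechanize. In the sketch below, the data stand in for the table values and are purely hypothetical.

```python
# Minimal sketch: simple linear regression from the four running sums.
x = [35, 82, 150, 200, 420, 640]       # residents in thousands (hypothetical)
y = [12.1, 10.4, 9.8, 9.0, 7.2, 6.1]   # unemployment rate in % (hypothetical)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_xy = sum(u * v for u, v in zip(x, y))

ns2_x = sum_xx - sum_x ** 2 / n        # n * s_x^2
ns_xy = sum_xy - sum_x * sum_y / n     # n * s_xy
beta = ns_xy / ns2_x
alpha = sum_y / n - beta * sum_x / n
print(f"f(x) = {beta:.4f} x + {alpha:.4f}")
```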



If the variance–covariance matrix of the error vector $e$ coincides (up to a multiplicative scalar $\sigma^2$) with a positive definite matrix $V$, then Best Linear Unbiased estimation reduces to the minimization of $(y - X\beta)^T V^{-1}(y - X\beta)$, called the weighted LS problem. Moreover, if $\mathrm{rank}(X) = p$, then its solution is given in the form

$$\hat{\beta} = (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known; instead, iterative algorithms are used.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of $n$ linear equations $Ax \approx b$ with $k$ unknowns $x$. The TLS problem consists in minimizing the Frobenius norm

$$\left\|[A, b] - [\hat{A}, \hat{b}]\right\|_F$$

for all $\hat{A} \in \mathbb{R}^{n \times k}$ and $\hat{b} \in \mathrm{range}(\hat{A})$, where the Frobenius norm is defined by $\|(a_{ij})\|_F^2 = \sum_{i,j} a_{ij}^2$. Once a minimizing $[\hat{A}, \hat{b}]$ is found, any $x$ satisfying $\hat{A}x = \hat{b}$ is called a TLS solution of the initial system $Ax \approx b$.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique, as simple low-dimensional examples show.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution, in the sense that the correction $\hat{b} - \hat{A}x$ has a smaller $l_2$-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any $n \times k$ matrix $A$ with real entries, there exist orthonormal matrices $P = [p_1, \dots, p_n]$ and $Q = [q_1, \dots, q_k]$ such that

$$P^T A Q = \mathrm{diag}(\sigma_1, \dots, \sigma_m), \quad \text{where } \sigma_1 \ge \cdots \ge \sigma_m \text{ and } m = \min\{n, k\}.$$
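In the generic case, the TLS solution can be read off the SVD of the augmented matrix $[A, b]$: take the right singular vector belonging to the smallest singular value and rescale it so that its last coordinate equals $-1$. A minimal sketch with hypothetical data:

```python
# Minimal sketch: TLS solution of Ax ~ b via the SVD of [A, b].
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 2
x_true = np.array([1.5, -0.7])
A0 = rng.normal(size=(n, k))                        # error-free design (unobserved)
A = A0 + rng.normal(scale=0.05, size=(n, k))        # observed A, affected by errors
b = A0 @ x_true + rng.normal(scale=0.05, size=n)    # observed b, also noisy

C = np.column_stack([A, b])    # augmented matrix [A, b]
_, _, Vt = np.linalg.svd(C)
v = Vt[-1]                     # right singular vector for the smallest sigma
x_tls = -v[:k] / v[k]          # defined when v[k] != 0 (the generic case)
print("TLS solution:", x_tls)
```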

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T (2004) Generalized least squares. Wiley, New York
Rao CR, Toutenburg H (1999) Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J (2006) Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor, United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed, so that the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model.


Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams. In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process $X = \{X_t,\ t \ge 0\}$ that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. $X$ has stationary increments, i.e., for every $s, t \ge 0$, the distribution of $X_{t+s} - X_t$ is independent of $t$.
2. $X$ has independent increments, i.e., for every $t, s \ge 0$, $X_{t+s} - X_t$ is independent of $\sigma(X_u,\ u \le t)$.
3. $X$ is stochastically continuous, i.e., for every $t \ge 0$ and $\epsilon > 0$, $\lim_{s \to t} P(|X_t - X_s| > \epsilon) = 0$.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If $\Phi(z)$ is the characteristic function of a Lévy process,

then its characteristic component $\varphi(z) \overset{\mathrm{def}}{=} \ln\Phi(z)/t$ is of the form

$$\varphi(z) = iza - \frac{z^2 b}{2} + \int_{\mathbb{R}} \left[\exp(izx) - 1 - izx\, I_{\{|x|<1\}}\right]\nu(dx),$$

where $a \in \mathbb{R}$, $b \in \mathbb{R}_+$, and $\nu$ is a measure on $\mathbb{R}$ satisfying $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} (1 \wedge x^2)\,\nu(dx) < \infty$.

The measure $\nu$ characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small size in any small interval. The constant $a$ defined above is called the drift term of the process, and $b$ is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows. Given any $a \in \mathbb{R}$, $b \in \mathbb{R}_+$ and a measure $\nu$ on $\mathbb{R}$ satisfying $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} (1 \wedge x^2)\,\nu(dx) < \infty$, there exists a probability space $(\Omega, \mathcal{F}, P)$ on which a Lévy process $X$ is defined. The process $X$ is the sum of three independent processes $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$, where $X^{(1)}$ is a Brownian motion with drift $a$ and volatility $b$ (in the sense defined below), $X^{(2)}$ is a compound Poisson process, and $X^{(3)}$ is a square integrable martingale. The characteristic components of $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ (denoted by $\varphi^{(1)}(z)$, $\varphi^{(2)}(z)$ and $\varphi^{(3)}(z)$, respectively) are as follows:

$$\varphi^{(1)}(z) = iza - \frac{z^2 b}{2},$$

$$\varphi^{(2)}(z) = \int_{|x| \ge 1} \left(\exp(izx) - 1\right)\nu(dx),$$

$$\varphi^{(3)}(z) = \int_{|x| < 1} \left(\exp(izx) - 1 - izx\right)\nu(dx).$$

Examples of Lévy Processes

The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift $\mu$ and volatility rate $\sigma^2$ if $a = \mu$, $b = \sigma^2$, and $\nu(\mathbb{R}) = 0$. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let $X$ be a Brownian motion with drift $\mu > 0$ and volatility rate $\sigma^2$. For any $x > 0$, let $T_x = \inf\{t : X_t > x\}$. Then $\{T_x\}$ is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

$$\nu(dx) = \frac{1}{\sqrt{2\pi\sigma^2 x^3}}\exp\left(-\frac{x\mu^2}{2\sigma^2}\right)dx.$$

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where $b = 0$ and $\nu$ is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process $X$ where $b = 0$, $a - \int_0^1 x\,\nu(dx) = 0$, and its Lévy measure is given by

$$\nu(dx) = \frac{\alpha}{x}\exp(-x/\beta)\,dx, \quad x > 0,$$

where $\alpha, \beta > 0$. It follows that the mean term $E(X_1)$ and the variance term $V(X_1)$ of the process are equal to $\alpha\beta$ and $\alpha\beta^2$, respectively. Figure 1 shows a simulated sample path of a gamma process.
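Since the increments of a gamma process over disjoint intervals are independent gamma variables, a sample path is easy to simulate. The sketch below uses hypothetical parameter values (those used for the original figure were lost):

```python
# Minimal sketch: simulating a gamma process path from its increments.
import numpy as np

alpha, beta = 1.0, 1.0          # hypothetical parameters of the Levy measure
T, steps = 10.0, 1000
dt = T / steps
rng = np.random.default_rng(3)

# The increment over an interval of length dt is Gamma(shape=alpha*dt, scale=beta)
increments = rng.gamma(shape=alpha * dt, scale=beta, size=steps)
path = np.concatenate([[0.0], np.cumsum(increments)])
print("X_T =", path[-1], "  (E X_T =", alpha * beta * T, ")")
```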

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes, or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: $\mu$ – the drift term of the Brownian process, $\sigma$ – the volatility of the Brownian process, and $\nu$ – the variance term of the gamma process.

Lévy Processes. Fig. 1: A simulated sample path of a gamma process

Lévy Processes. Fig. 2: Brownian motion and variance gamma sample paths

Figure 2 shows two simulated sample paths, one for a Brownian motion with drift term $\mu$ and volatility term $\sigma$, and the other for a variance gamma process with the same values of the drift and volatility terms.
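A variance gamma path can be simulated by the subordination just described: draw gamma time increments with unit mean rate and variance rate $\nu$, and run the Brownian motion on that random clock. The parameter values in this sketch are hypothetical:

```python
# Minimal sketch: variance gamma via a gamma-subordinated Brownian motion.
import numpy as np

mu, sigma, nu = 0.1, 0.3, 0.5    # hypothetical drift, volatility, gamma variance
T, steps = 10.0, 1000
dt = T / steps
rng = np.random.default_rng(4)

dG = rng.gamma(shape=dt / nu, scale=nu, size=steps)   # subordinator: mean dt, var nu*dt
Z = rng.standard_normal(steps)
vg = np.concatenate([[0.0], np.cumsum(mu * dG + sigma * np.sqrt(dG) * Z)])

# Plain Brownian motion with the same drift and volatility, for comparison
dW = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(steps)
bm = np.concatenate([[0.0], np.cumsum(dW)])
```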

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Cinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in Statistics and Operations Research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using $P^M_{\lambda,\tau}$ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ 81:637–654
Cont R, Tankov P (2004) Financial modelling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev 2:79–105
Patel A, Kosko B (2008) Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety 94:2–21

Life Expectancy

Maja Biljan-August
Full Professor, University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age $x$, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy ($e_x$) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

$$e_x = \frac{T_x}{l_x},$$

where $x$ is age, $T_x$ is the number of person-years lived at age $x$ and over, and $l_x$ is the number of survivors at age $x$ according to the life table.
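A minimal sketch of this computation, with a short, hypothetical $l_x$ column and person-years approximated by trapezoids:

```python
# Minimal sketch: e_x = T_x / l_x from life-table columns (hypothetical l_x).
lx = [100000, 99000, 98800, 98600, 98300]   # survivors at ages 0..4 (hypothetical)

# Person-years lived in [x, x+1), approximated by (l_x + l_{x+1}) / 2;
# the tail beyond the last tabulated age is ignored in this toy example.
Lx = [(lx[i] + lx[i + 1]) / 2 for i in range(len(lx) - 1)]
Tx = [sum(Lx[i:]) for i in range(len(Lx))]  # person-years lived above age x
ex = [Tx[i] / lx[i] for i in range(len(Lx))]
print(["%.3f" % e for e in ex])
```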


Life expectancy can be calculated for combined sexes, or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth ($e_0$) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

$$e_0 = \frac{T_0}{l_0},$$

where $T_0$ is the total size of the population and $l_0$ is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth ($e_0 < e_1 < e_2$, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

ancient Greece and Rome the average life expectancywas below years between the years and life expectancy at birth rose from about years to aglobal average of years and to more than years inthe richest countries (Riley ) Furthermore in mostindustrialized countries in the early twenty-rst centurylife expectancy averaged at about years (WHO)esechanges called the ldquohealth transitionrdquo are essentially theresult of improvements in public health medicine andnutritionLife expectancy varies signicantly across regions and

continents from life expectancies of about years insome central African populations to life expectancies of years and above in many European countries emore developed regions have an average life expectancy of years while the population of less developed regionsis at birth expected to live an average years less etwo continents that display the most extreme dierencesin life expectancies are North America ( years) andAfrica ( years) where as of recently the gap betweenlife expectancies amounts to years (UN ndash data)

The countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and Japan; Japanese men and women have among the longest life expectancies in the world (WHO).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is dramatically lower. Some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries women live longer than men. The world's average life expectancy at birth is higher for females than for males; the gap is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes, whereas in Tonga life expectancy for males exceeds that for females (WHO).

According to estimation by the UN global life expectancyat birth is likely to rise to an average years by ndashBy life expectancy is expected to vary across countriesfrom to years Long-range United Nations popula-tion projections predict that by on average peoplecan expect to live more than years from (Liberia) upto years (Japan)For more details on the calculation of life expectancy

including continuous notation see Keytz ( )and Preston et al ()

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N (1985) Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. Geneva



World population prospects: the revision. Highlights. United Nations, New York
World population to 2300 (2004) United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function $F$ of a positive variable, say $X$, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that $F$ has a density $f(x) = \frac{d}{dx}F(x)$, and define the force of mortality, or death intensity, at age $x$ as the function

$$\mu(x) = -\frac{d}{dx}\ln\{1 - F(x)\} = \frac{f(x)}{1 - F(x)}.$$

Heuristically, it is interpreted by the relation $\mu(x)\,dx = P\{x < X < x + dx \mid X > x\}$. Conversely,

$$F(x) = 1 - \exp\left\{-\int_0^x \mu(s)\,ds\right\}.$$

The survivor function is defined as $\ell(x) = \ell(0)\{1 - F(x)\}$, normally with $\ell(0) = 100{,}000$. In mortality applications, $\ell(x)$ is the expected number of survivors to exact age $x$ out of an original cohort of $100{,}000$ newborn babies. The survival probability is

$$\,_{t}p_x = P\{X > x + t \mid X > x\} = \frac{\ell(x+t)}{\ell(x)} = \exp\left\{-\int_0^t \mu(x+s)\,ds\right\},$$

and the non-survival probability is (the converse) $\,_t q_x = 1 - \,_t p_x$. For $t = 1$ one writes $q_x = \,_1 q_x$ and $p_x = \,_1 p_x$. In particular, we get $\ell(x+1) = \ell(x)\,p_x$. This is a practical recursion formula that permits us to compute all values of $\ell(x)$ once we know the values of $p_x$ for all relevant $x$.

The life expectancy is $e_0^o = EX = \int_0^\infty \ell(x)\,dx / \ell(0)$. (The subscript $0$ in $e_0^o$ indicates that the expected value is computed at age $0$, i.e., for newborn individuals, and the superscript $o$ indicates that the computation is made in the continuous mode.) The remaining life expectancy at age $x$ is

$$e_x^o = E(X - x \mid X > x) = \int_0^\infty \ell(x+t)\,dt \,/\, \ell(x),$$

i.e., it is the expected lifetime remaining to someone who has attained age $x$.

To turn to the statistical estimation of these various quantities, suppose that the function $\mu(x)$ is piecewise constant, which means that we take it to equal some constant, say $\mu_j$, over each interval $(x_j, x_{j+1})$ for some partition $\{x_j\}$ of the age axis. For a collection $\{X_i\}$ of independent observations of $X$, let $D_j$ be the number of $X_i$ that fall in the interval $(x_j, x_{j+1})$. In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, $D_j$ is the number of individuals who die in the interval (called the occurrences in the interval). If individual $i$ dies in the interval, he or she will of course have lived for $X_i - x_j$ time units during the interval. Individuals who survive the interval will have lived for $x_{j+1} - x_j$ time units in the interval, and individuals who do not survive to age $x_j$ will not have lived during this interval at all. When we aggregate the time units lived in $(x_j, x_{j+1})$ over all individuals, we get a total $R_j$, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters $\mu_j$, the collection $\{D_j, R_j\}$ constitutes a statistically sufficient set of observations, with a likelihood $\Lambda$ that satisfies the relation $\ln\Lambda = \sum_j \{-\mu_j R_j + D_j \ln \mu_j\}$, which is easily seen to be maximized by $\hat{\mu}_j = D_j/R_j$. The latter fraction is therefore the maximum-likelihood estimator for $\mu_j$. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function $\mu_x = a + bc^x$, which does represent a relationship between the intensity values at the different ages $x$, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values $\hat{a}$, $\hat{b}$, and $\hat{c}$ that maximize $\Lambda$, and using $\hat{a} + \hat{b}\hat{c}^x$ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.


If the force of mortality is constant over a single-year age interval $(x, x+1)$, say, and is estimated by $\hat{\mu}_x$ in this interval, then $\hat{p}_x = e^{-\hat{\mu}_x}$ is an estimator of the single-year survival probability $p_x$. This allows us to estimate the survival function recursively for all corresponding ages, using $\hat{\ell}(x+1) = \hat{\ell}(x)\hat{p}_x$ for $x = 0, 1, 2, \dots$, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and is also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a $\{p_x\}$ schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the $p_x$ values.
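A minimal sketch of this occurrence/exposure scheme, with hypothetical death counts and exposures:

```python
# Minimal sketch: mu_j = D_j / R_j, p_x = exp(-mu_x), l(x+1) = l(x) p_x.
import math

D = [50, 30, 25, 40, 60]   # deaths in single-year age intervals (hypothetical)
R = [99500.0, 99200.0, 99000.0, 98700.0, 98200.0]  # exposures in person-years

mu = [d / r for d, r in zip(D, R)]   # maximum-likelihood estimates of mu_x
p = [math.exp(-m) for m in mu]       # single-year survival probabilities

ell = [100000.0]                     # survivor function with radix l(0) = 100,000
for px in p:
    ell.append(ell[-1] * px)
print([round(v, 1) for v in ell])
```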

introductory textbooks in actuarial statistics7biostatistics7demography and epidemiology See eg Chiang ()Elandt-Johnson and Johnson () Manton and Stallard() Preston et al () For the history of thetopic consult Seal () Smith and Keytz () andDupacircquier ()

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in the history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor, University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model $f(y; \theta)$, where the parameter $\theta$ is assumed to take values in a subset of $\mathbb{R}^k$ and the variable $y$ is assumed to take values in a subset of $\mathbb{R}^n$; the likelihood function is defined by

$$L(\theta) = L(\theta; y) = c\, f(y; \theta), \qquad (1)$$

where $c$ can depend on $y$ but not on $\theta$. In more general settings where the model is semi-parametric or nonparametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart, and Murphy and Van der Vaart (1997). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of $\theta$ for a given observed data point $y$. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is $f(y; \theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$, $y = 0, 1, \dots, n$, $\theta \in [0, 1]$, then the likelihood function is (any function proportional to)

$$L(\theta; y) = \theta^y(1-\theta)^{n-y},$$



and can be plotted as a function of $\theta$ for any fixed value of $y$. The likelihood function is maximized at $\hat{\theta} = y/n$. This model might be appropriate for a sampling scheme which recorded the number of successes among $n$ independent trials that result in success or failure, each trial having the same probability of success $\theta$. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean $\mu$ and variance $\sigma^2$:

$$L(\theta; y) = \exp\left\{-n\log\sigma - \frac{1}{2\sigma^2}\Sigma(y_i - \mu)^2\right\},$$

where $\theta = (\mu, \sigma^2)$. This could also be plotted as a function of $\mu$ and $\sigma^2$ for a given sample $y_1, \dots, y_n$, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean $\bar{y} = n^{-1}\Sigma y_i$ and sample variance $s^2 = (n-1)^{-1}\Sigma(y_i - \bar{y})^2$, or equivalently through $\Sigma y_i$ and $\Sigma y_i^2$. It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
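A minimal sketch for the binomial example (with hypothetical $y$ and $n$): evaluate the likelihood on a grid, standardize by its maximum, and read off a "plausible" range of the kind Fisher suggested, here with $k = 1/8$:

```python
# Minimal sketch: standardized binomial likelihood and a plausible range.
import numpy as np

n, y = 20, 13                       # hypothetical observed data
theta = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
rel_lik = np.exp(loglik - loglik.max())   # standardized likelihood, maximum 1

print("MLE:", y / n)
plausible = theta[rel_lik > 1 / 8]        # values with relative likelihood > 1/8
print("plausible range:", plausible[0], "to", plausible[-1])
```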

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as $L(\hat{\theta})/L(\theta) > k$ to define a range of "likely" or "plausible" values of $\theta$. Many authors, including Royall (1997) and Edwards, have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that $\theta$ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of $\theta$ that is of interest; see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference for $\theta$ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter $\theta$, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude that

$$\pi(\theta \mid y) = \frac{L(\theta; y)\,\pi(\theta)}{\int_\theta L(\theta; y)\,\pi(\theta)\,d\theta},$$

where $\pi(\theta)$ is the density for the prior measure, and $\pi(\theta \mid y)$ provides a probabilistic assessment of $\theta$ after observing $Y = y$ in the model $f(y; \theta)$. We could then make conclusions of the form "having observed $y$ successes in $n$ trials, and assuming the flat prior $\pi(\theta) = 1$, the posterior probability that $\theta$ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model $f(y; \theta)$, by examining the distribution of $L(\theta; Y)$ as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that $2\log\{L(\hat{\theta}; Y)/L(\theta; Y)\}$, where $\hat{\theta} = \hat{\theta}(Y)$ is the value of $\theta$ at which $L(\theta; Y)$ is maximized, is approximately distributed as a $\chi^2_k$ random variable, where $k$ is the dimension of $\theta$. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

$$U(\theta; Y) = \frac{\partial}{\partial\theta}\log L(\theta; Y).$$

If $Y = (Y_1, \dots, Y_n)$ has independent components, then $U(\theta)$ is a sum of $n$ independent components, which under mild regularity conditions will be asymptotically normal. To obtain the $\chi^2_k$ result quoted above, it is also necessary to investigate the convergence of $\hat{\theta}$ to the true value governing the model $f(y; \theta)$. Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz (2006) for a summary of some of the issues that arise.

Assuming that $\hat{\theta}$ is consistent for $\theta$ and that $L(\theta; Y)$ has sufficient regularity, the following asymptotic results can be established:

$$(\hat{\theta} - \theta)^T\, i(\theta)\,(\hat{\theta} - \theta) \overset{d}{\to} \chi^2_k, \qquad (2)$$

$$U(\theta)^T\, i^{-1}(\theta)\, U(\theta) \overset{d}{\to} \chi^2_k, \qquad (3)$$

$$2\{\ell(\hat{\theta}) - \ell(\theta)\} \overset{d}{\to} \chi^2_k, \qquad (4)$$

where $i(\theta) = E\{-\ell''(\theta; Y); \theta\}$ is the expected Fisher information function, $\ell(\theta) = \log L(\theta)$ is the log-likelihood function, and $\chi^2_k$ is the 7chi-square distribution with $k$ degrees of freedom.

These results are all versions of a more general result: that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.
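The quality of the $\chi^2_k$ approximation in (4) is easy to check by simulation. The following sketch does this for a one-parameter exponential model; the rate, sample size, and replication count are arbitrary choices:

```python
# Minimal sketch: Monte Carlo check of 2{l(theta-hat) - l(theta_0)} ~ chi2(1).
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 2.0, 50, 20000
w = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1 / theta0, size=n)
    theta_hat = 1 / x.mean()                 # MLE of the exponential rate
    # log-likelihood: l(theta) = n log(theta) - theta * sum(x)
    ll = lambda th: n * np.log(th) - th * x.sum()
    w[r] = 2 * (ll(theta_hat) - ll(theta0))

# Compare with the chi-square(1) upper 5% point, 3.841:
print("P(W > 3.841) ~", (w > 3.841).mean())  # should be near 0.05
```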

Nuisance Parameters
In models where the dimension of $\theta$ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for $\hat{\theta}$, or on the $\chi^2_k$ distribution of the log-likelihood ratio, does not lead easily to interval estimates for components of $\theta$. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model $f(y; \theta)$ that $\theta = (\psi, \lambda)$, where $\psi$ is a $k_1$-dimensional parameter of interest (and $k_1$ will often be 1). The profile log-likelihood function of $\psi$ is

$$\ell_P(\psi) = \ell(\psi, \hat{\lambda}_\psi),$$

where $\hat{\lambda}_\psi$ is the constrained maximum likelihood estimate: it maximizes the likelihood function $L(\psi, \lambda)$ when $\psi$ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of $\theta$, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

$$2\{\ell_P(\hat{\psi}) - \ell_P(\psi)\} = 2\{\ell(\hat{\psi}, \hat{\lambda}) - \ell(\psi, \hat{\lambda}_\psi)\} \overset{d}{\to} \chi^2_{k_1}.$$

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of $\lambda$. For example, the profile likelihood approach to estimation of $\sigma^2$ in the linear regression model (see 7Linear Regression Models) $y \sim N(X\beta, \sigma^2 I)$ leads to the estimator $\hat{\sigma}^2 = \Sigma(y_i - \hat{y}_i)^2/n$, whereas the estimator usually preferred has divisor $n - p$, where $p$ is the dimension of $\beta$.
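A minimal sketch of this example: with $\beta$ profiled out of the normal-regression log-likelihood, maximizing over $\sigma^2$ numerically recovers the divisor-$n$ estimator (the data are hypothetical):

```python
# Minimal sketch: profile log-likelihood of sigma^2 in y ~ N(X beta, sigma^2 I).
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # beta-hat does not depend on sigma^2
rss = np.sum((y - X @ beta_hat) ** 2)

def profile_loglik(sigma2):
    # l_P(sigma2) = l(sigma2, beta_hat): beta has been profiled out
    return -0.5 * n * np.log(sigma2) - rss / (2 * sigma2)

grid = np.linspace(0.5, 6.0, 2000)
sigma2_profile = grid[np.argmax(profile_loglik(grid))]
print(sigma2_profile, "vs rss/n =", rss / n, "vs unbiased rss/(n-p) =", rss / (n - p))
```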

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for $\psi$:

$$\pi_M(\psi \mid y) \propto \int \pi(\psi, \lambda \mid y)\, d\lambda,$$

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

$$f(y; \theta) = f_1(y_1; \psi)\, f_2(y_2 \mid y_1; \lambda), \quad \text{or} \quad f(y; \theta) = f_1(y_1 \mid y_2; \psi)\, f_2(y_2; \lambda).$$

A discussion of conditional inference and density factorizations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; quasi-likelihood (McCullagh 1983), which starts from the score function $U(\theta)$ and works backwards to an inference function; bootstrap likelihood (Davison et al. 1992); and many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser). There is recent interest, for multi-dimensional responses $Y_i$, in composite likelihoods, which are products of lower-dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

"From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century."



Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (1992), Azzalini (1996), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison, and Reid (2007). A short review paper is Reid (2000). An excellent overview of consistency results for maximum likelihood estimators is Scholz (2006); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation, "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics?
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A (1996) Statistical inference: based on the likelihood. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc 57:269–306
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34:187–220
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika 79:113–130
Edwards AF (1992) Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A 222:309–368
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat 11:59–67
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat 25:1471–1509
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75:237–249
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci 10:138–157
Reid N (2000) Likelihood. J Am Stat Assoc 95:1335–1340
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS (2003) Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B 65:391–404
Scholz F (2006) Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html
Varin C (2008) On composite marginal likelihood. Adv Stat Anal 92:1–28


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of the Probability and Statistics Chair, Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if $X, X_1, X_2, \dots$ is a sequence of i.i.d. random variables,

$$S_n = \sum_{j=1}^{n} X_j,$$

and the expectation $a = EX$ exists, then $n^{-1}S_n \overset{a.s.}{\longrightarrow} a$ (almost surely, i.e., with probability 1). Thus the value $na$ can be called the first order approximation for the sums $S_n$. The Central Limit Theorem (CLT) gives a more precise approximation for $S_n$. It says that if $\sigma^2 = E(X - a)^2 < \infty$, then the distribution of the standardized sum $\zeta_n = (S_n - na)/(\sigma\sqrt{n})$ converges, as $n \to \infty$, to the standard normal (Gaussian) law. That is, for all $x$,

$$P(\zeta_n < x) \to \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$

The quantity $na + \zeta\sigma\sqrt{n}$, where $\zeta$ is a standard normal random variable (so that $P(\zeta < x) = \Phi(x)$), can be called the second order approximation for $S_n$.
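A small simulation sketch of the CLT, comparing the empirical distribution of the standardized sums $\zeta_n$ for uniform summands with $\Phi$ at a few points:

```python
# Minimal sketch: empirical distribution of zeta_n versus Phi.
import numpy as np
from math import erf, sqrt

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
rng = np.random.default_rng(7)
n, reps = 100, 100000
a, sigma = 0.5, sqrt(1 / 12)                # mean and sd of Uniform(0, 1)
S = rng.uniform(size=(reps, n)).sum(axis=1)
zeta = (S - n * a) / (sigma * sqrt(n))      # standardized sums
for x in (-1.0, 0.0, 1.0):
    print(x, (zeta < x).mean(), Phi(x))
```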

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late seventeenth century (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

● Relaxing the assumption $EX^2 < \infty$. When the second moment is infinite, one needs to assume that the "tail" $P(x) = P(X > x) + P(X < -x)$ is a function regularly varying at infinity, such that the limit $\lim_{x\to\infty} P(X > x)/P(x)$ exists. Then the distribution of the normalized sum $S_n/\sigma(n)$, where $\sigma(n) = P^{-1}(n^{-1})$, $P^{-1}$ being the generalized inverse of the function $P$ (and we assume that $EX = 0$ when the expectation is finite), converges to one of the so-called stable laws as $n \to \infty$. The 7characteristic functions of these laws have simple closed-form representations.

● Relaxing the assumption that the $X_j$'s are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands $X_j = X_{j,n}$ forming the sum $S_n$ depend not only on $j$ but on $n$ as well. In this case the class of all limit laws for the distribution of $S_n$ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

● Relaxing the assumption of independence of the $X_j$'s. Several types of "weak dependence" assumptions on $X_j$ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

● Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence $P(\zeta_n < x) - \Phi(x) \to 0$ and asymptotic expansions for this difference (in powers of $n^{-1/2}$ in the case of i.i.d. summands) have been obtained under broad assumptions.

● Studying large deviation probabilities for the sums $S_n$ (theorems on rare events). If $x \to \infty$ together with $n$, then the CLT can only assert that $P(\zeta_n > x) \to 0$. Theorems on large deviation probabilities aim to find a function $P(x, n)$ such that

$$\frac{P(\zeta_n > x)}{P(x, n)} \to 1 \quad \text{as } n \to \infty,\ x \to \infty.$$

The nature of the function $P(x, n)$ essentially depends on the rate of decay of $P(X > x)$ as $x \to \infty$ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio $x/n$ as $n \to \infty$.

● Considering observations $X_1, \dots, X_n$ of a more complex nature – first of all, multivariate random vectors. If $X_j \in \mathbb{R}^d$, then the role of the limit law in the CLT will be played by a $d$-dimensional normal (Gaussian) distribution with covariance matrix $E(X - EX)(X - EX)^T$.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let $X_n^* = (X_1, \dots, X_n)$ be a sample from a distribution $F$ and $F_n^*(u)$ the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that $\sup_u |F_n^*(u) - F(u)| \overset{a.s.}{\longrightarrow} 0$ as $n \to \infty$, is of the same nature as the LLN and basically means that the unknown distribution $F$ can be estimated arbitrarily well from a random sample $X_n^*$ of a large enough size $n$.
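A minimal sketch of the Glivenko–Cantelli statement for a standard normal sample: the sup-distance between $F_n^*$ and $F$ shrinks as $n$ grows:

```python
# Minimal sketch: sup-distance between the empirical and true CDFs.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
for n in (100, 1000, 10000):
    x = np.sort(rng.normal(size=n))
    F = norm.cdf(x)
    # F_n* jumps by 1/n at each order statistic; the sup distance to F
    # is attained immediately before or at a jump point.
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    d = max((ecdf_hi - F).max(), (F - ecdf_lo).max())
    print(n, round(float(d), 4))
```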

The existence of consistent estimators for the unknown parameters $a = EX$ and $\sigma^2 = E(X - a)^2$ also follows from the LLN since, as $n \to \infty$,

$$a^* = \frac{1}{n}\sum_{j=1}^{n} X_j \overset{a.s.}{\longrightarrow} a,$$

$$(\sigma^2)^* = \frac{1}{n}\sum_{j=1}^{n} (X_j - a^*)^2 = \frac{1}{n}\sum_{j=1}^{n} X_j^2 - (a^*)^2 \overset{a.s.}{\longrightarrow} \sigma^2.$$

Under additional moment assumptions on the distribution $F$, one can also construct asymptotic confidence intervals for the parameters $a$ and $\sigma^2$, as the distributions of the quantities $\sqrt{n}(a^* - a)$ and $\sqrt{n}((\sigma^2)^* - \sigma^2)$ converge, as $n \to \infty$, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution $F$.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov 1998). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory. Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for the objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution $F_\theta$ depends on a parameter $\theta$ such that, as $\theta$ approaches some "critical" value $\theta_0$, the distributions $F_\theta$ become "degenerate" in one sense or another. Then, in a number of cases, one can find an approximation for $F_\theta$ that is valid for the values of $\theta$ close to $\theta_0$. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of $S = \sup_{k \ge 0}(S_k - \theta k)$ under the assumption that $EX = 0$. If $\theta > 0$, then $S$ is a proper random variable. If, however, $\theta \to 0$, then $S \overset{a.s.}{\longrightarrow} \infty$. Here we deal with the so-called "transient phenomena." It turns out that if $\sigma^2 = \mathrm{Var}(X) < \infty$, then there exists the limit

$$\lim_{\theta \downarrow 0} P(\theta S > x) = e^{-2x/\sigma^2}, \quad x > 0.$$

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of $S$ in situations where $\theta$ is small.
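A simulation sketch of this transient-phenomena limit: the supremum is approximated by truncating the random walk at a large finite horizon, and the tail of $\theta S$ is compared with its exponential limit (all settings are arbitrary choices):

```python
# Minimal sketch: Kingman-Prokhorov approximation for S = sup_k (S_k - theta*k).
import numpy as np

rng = np.random.default_rng(8)
theta, sigma, K, reps = 0.1, 1.0, 20000, 2000
S = np.empty(reps)
for r in range(reps):
    steps = rng.normal(0.0, sigma, K)                 # X_i with E X = 0
    walk = np.cumsum(steps) - theta * np.arange(1, K + 1)
    S[r] = max(walk.max(), 0.0)                       # sup over k >= 0 (S_0 = 0)

x = 1.0
print("P(theta*S > x) ~", (theta * S > x).mean(),
      " vs  exp(-2x/sigma^2) =", np.exp(-2 * x / sigma ** 2))
```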

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation $Ee^{\mu(X-\theta)} = 1$ has a solution $\mu > 0$, then in the above example one has

$$P(S > x) \sim c\,e^{-\mu x}, \quad x \to \infty,$$

where $c$ is a known constant. If the distribution $F$ of $X$ is subexponential (in this case $Ee^{\mu(X-\theta)} = \infty$ for any $\mu > 0$), then

$$P(S > x) \sim \theta^{-1}\int_x^\infty (1 - F(t))\,dt, \quad x \to \infty.$$


This LT enables one to find approximations for $P(S > x)$ for large $x$. For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as $n \to \infty$, the processes $\zeta_n(t) = (S_{\lfloor nt\rfloor} - ant)/(\sigma\sqrt{n})$, $t \in [0,1]$, converge in distribution to the standard Wiener process $\{w(t)\}_{t\in[0,1]}$. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (random walk) $S_1, \ldots, S_n$ can also be classified as LTs for stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary $\varepsilon > 0$, the random walk $\{S_k\}_{k=1}^\infty$ crosses the boundary $(1-\varepsilon)\sigma\sqrt{2k\ln\ln k}$ infinitely many times, but crosses the boundary $(1+\varepsilon)\sigma\sqrt{2k\ln\ln k}$ only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes $\{w(t)\}_{t\in[0,1]}$ and $\{w(t)\}_{t\in[0,\infty)}$.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" $\sqrt{n}(F^*_n(u) - F(u))$, $u \in (-\infty,\infty)$, converges in distribution to $\{w^0(F(u))\}_{u\in(-\infty,\infty)}$, where $w^0(t) = w(t) - t\,w(1)$ is the Brownian bridge process. This LT implies asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function $F^*_n(u)$.
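As an illustrative check (added here, not from the original entry), the following Python sketch simulates the empirical process at a few points for uniform data and compares its variance with the Brownian bridge variance $t(1-t)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 500, 4_000
grid = np.array([0.25, 0.5, 0.75])

u = rng.uniform(size=(n_rep, n))              # F(u) = u on [0, 1]
emp = (u[:, :, None] <= grid).mean(axis=1)    # F*_n evaluated at the grid points
proc = np.sqrt(n) * (emp - grid)              # empirical process

# Brownian bridge w0(t) has Var w0(t) = t(1 - t)
print(proc.var(axis=0), grid * (1 - grid))
```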

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for martingales, the asymptotics of the extinction probability of a branching process and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and a Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal Siberian Advances in Mathematics and Associate Editor of the journals Theory of Probability and its Applications, Siberian Mathematical Journal, Mathematical Methods of Statistics, and Electronic Journal of Probability.

Cross References
- Almost Sure Convergence of Random Variables
- Approximations to Distributions
- Asymptotic Normality
- Asymptotic Relative Efficiency in Estimation
- Asymptotic Relative Efficiency in Testing
- Central Limit Theorems
- Empirical Processes
- Ergodic Theorem
- Glivenko-Cantelli Theorems
- Large Deviations and Applications
- Laws of Large Numbers
- Martingale Central Limit Theorem
- Probability Theory: An Outline
- Random Matrix Theory
- Strong Approximations in Probability and Statistics
- Weak Convergence of Probability Measures

References and Further Reading
- Billingsley P. Convergence of probability measures. Wiley, New York
- Borovkov AA. Mathematical statistics. Gordon & Breach, Amsterdam
- Gnedenko BV, Kolmogorov AN. Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
- Lévy P. Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
- Loève M. Probability theory, vols I and II, 4th edn. Springer, New York
- Petrov VV. Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

$$y_i \mid b_i \sim N(X_i\beta + Z_i b_i,\ \Sigma_i), \qquad (1)$$
$$b_i \sim N(0, D), \qquad (2)$$

where $X_i$ and $Z_i$ are $(n_i \times p)$ and $(n_i \times q)$ dimensional matrices of known covariates, $\beta$ is a $p$-dimensional vector of regression parameters called the fixed effects, $D$ is a general $(q \times q)$ covariance matrix, and $\Sigma_i$ is a $(n_i \times n_i)$ covariance matrix which depends on $i$ only through its dimension $n_i$, i.e., the set of unknown parameters in $\Sigma_i$ will not depend upon $i$. Finally, $b_i$ is a vector of subject-specific or random effects.
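As a practical illustration (added here, not part of the original entry), the random-intercept special case discussed below can be fit with, e.g., the MixedLM class of the Python statsmodels package; the data and variable names are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
m, n_i = 30, 5                                   # 30 units, 5 measurements each
unit = np.repeat(np.arange(m), n_i)
t = np.tile(np.arange(n_i), m)
b = rng.normal(0.0, 1.0, size=m)                 # random intercepts, D = 1
y = 2.0 + 0.5 * t + b[unit] + rng.normal(0.0, 0.5, size=m * n_i)

df = pd.DataFrame({"y": y, "t": t, "unit": unit})
fit = smf.mixedlm("y ~ t", df, groups=df["unit"]).fit()
print(fit.summary())                             # fixed effects and variance components
```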

The above model can be interpreted as a linear regression model (see Linear Regression Models) for the vector $y_i$ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, $b_i$) while others are not (fixed effects, $\beta$). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, $E(b_i) = 0$ implies that the mean of $y_i$ still equals $X_i\beta$, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect, in the population, of changing the same covariates. Second, the normality assumption immediately implies that, marginally, $y_i$ also follows a normal distribution with mean vector $X_i\beta$ and with covariance matrix $V_i = Z_iDZ_i' + \Sigma_i$.

Note that the random effects in (1) implicitly imply the marginal covariance matrix $V_i$ of $y_i$ to be of the very specific form $V_i = Z_iDZ_i' + \Sigma_i$. Let us consider two examples under the assumption of conditional independence, i.e., assuming $\Sigma_i = \sigma^2 I_{n_i}$. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates $Z_i$ which are $n_i$-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

$$y_i \sim N(X_i\beta,\ V_i), \qquad V_i = Z_iDZ_i' + \Sigma_i,$$

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix $V_i$.

With respect to the estimation of unit-specific parameters $b_i$, the posterior distribution of $b_i$, given the observed data $y_i$, can be shown to be (multivariate) normal with mean vector equal to $DZ_i'V_i^{-1}(\alpha)(y_i - X_i\beta)$, where $\alpha$ denotes the parameters in $V_i$. Replacing $\beta$ and $\alpha$ by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates $\hat b_i$ for the $b_i$. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction $\hat y_i \equiv X_i\hat\beta + Z_i\hat b_i$ of the $i$th profile. It can easily be shown that

$$\hat y_i = \Sigma_i V_i^{-1} X_i\hat\beta + (I_{n_i} - \Sigma_i V_i^{-1})\, y_i,$$

which can be interpreted as a weighted average of the population-averaged profile $X_i\hat\beta$ and the observed data $y_i$, with weights $\Sigma_i V_i^{-1}$ and $I_{n_i} - \Sigma_i V_i^{-1}$, respectively. Note that the "numerator" of $\Sigma_i V_i^{-1}$ represents within-unit variability and the "denominator" is the overall covariance matrix $V_i$. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile $X_i\hat\beta$. An immediate consequence of shrinkage is that the EB estimates show less variability than is actually present in the random-effects distribution, i.e., for any linear combination $\lambda$ of the random effects,

$$\operatorname{var}(\lambda'\hat b_i) \le \operatorname{var}(\lambda' b_i) = \lambda' D \lambda.$$

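The shrinkage identity above is easy to verify numerically; the following sketch (added here, with made-up dimensions and parameter values, and with $\beta$ treated as known for simplicity) computes the EB estimate and checks that the two forms of the prediction coincide for one unit under a random-intercept model.

```python
import numpy as np

rng = np.random.default_rng(5)
n_i, d, s2 = 6, 2.0, 0.5                     # n_i obs; D = d (scalar); Sigma_i = s2*I
X = np.column_stack([np.ones(n_i), np.arange(n_i)])
Z = np.ones((n_i, 1))
beta = np.array([1.0, 0.3])
y = X @ beta + Z @ rng.normal(0, np.sqrt(d), 1) + rng.normal(0, np.sqrt(s2), n_i)

Sigma = s2 * np.eye(n_i)
V = Z @ (d * np.eye(1)) @ Z.T + Sigma        # V_i = Z D Z' + Sigma_i
Vinv = np.linalg.inv(V)

b_hat = d * (Z.T @ Vinv @ (y - X @ beta))    # empirical Bayes estimate
pred1 = X @ beta + (Z @ b_hat).ravel()       # X beta + Z b_hat
pred2 = Sigma @ Vinv @ (X @ beta) + (np.eye(n_i) - Sigma @ Vinv) @ y
print(np.allclose(pred1, pred2))             # the two forms of y_i^hat coincide
```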
About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics and is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
- Best Linear Unbiased Estimation in Linear Models
- General Linear Models
- Testing Variance Components in Mixed Linear Models
- Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

    I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction. (F. Galton)

The linear regression model of statistics is any functional relationship $y = f(x, \beta, \varepsilon)$ involving a dependent real-valued variable $y$, independent variables $x$, model parameters $\beta$, and random variables $\varepsilon$, such that a measure of central tendency for $y$ in relation to $x$, termed the regression function, is linear in $\beta$. Possible regression functions include the conditional mean $E(y \mid x, \beta)$ (as when $\beta$ is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps $y$ is corn yield from a given plot of earth, and the variables $x$ include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on $y$. The form of this linkage is specified by a function $f$ known to the experimenter, one that depends upon parameters $\beta$ whose values are not known, and also upon unseen random errors $\varepsilon$ about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in $\beta$. Draper and Smith is a plainly written elementary introduction to linear regression models; Rao is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation

Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies, embedded in a white-noise background: $y[t] = \beta_0 + \beta_1\sin[\omega_1 t] + \beta_2\sin[\omega_2 t] + \varepsilon$. We write $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ for $\beta = (\beta_0, \beta_1, \beta_2)$, $x = (1, x_1, x_2)$, $x_1 \equiv x_1(t) = \sin[\omega_1 t]$, $x_2(t) = \sin[\omega_2 t]$, $t \ge 0$. A natural choice of regression function is $m(x, \beta) = E(y \mid x, \beta) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, provided $E\varepsilon \equiv 0$. In the classical linear regression model one assumes, for different instances "$i$" of observation, that the random errors satisfy $E\varepsilon_i \equiv 0$, $E\varepsilon_i\varepsilon_k \equiv \sigma^2 > 0$ for $i = k \le n$, and $E\varepsilon_i\varepsilon_k \equiv 0$ otherwise. Errors in linear regression models typically depend upon the instances $i$ at which we select observations, and may in some formulations depend also on the values of $x$ associated with instance $i$ (perhaps the errors are correlated, and that correlation depends upon the $x$ values). What we observe are $y_i$ and associated values of the independent variables $x$. That is, we observe $(y_i, \sin[\omega_1 t_i], \sin[\omega_2 t_i])$, $i \le n$. The linear model on data may be expressed $y = x\beta + \varepsilon$, with $y$ = the column vector $(y_i,\ i \le n)$, likewise for $\varepsilon$, and matrix $x$ (the design matrix) whose entries are $x_{ik} = x_k[t_i]$.

Terminology

"Independent variables," as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If $y$ is not scalar-valued, the model is instead a multivariate linear regression. In some versions either $x$, $\beta$, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data

Early work (the classical linear model) emphasized independent, identically distributed (i.i.d.) additive normal errors in linear regression, where least squares has particular advantages (connections with multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as the Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al.). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao).

Background

C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre published a substantive account of least squares in 1805, following which the method became widely adopted in astronomy and other fields. See Stigler.

By contrast, Galton, working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: $w$ = standard score of weight of parental sweet pea seed(s), $y$ = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of $w$ and $y$ are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight $w$, median filial seed weight $m(w)$). Since for this data $s_y \sim s_w$, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was $r\,s_y/s_w = r$ (the correlation). Medians $m(w)$ being essentially equal to means of $y$ for each $w$ greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of $w$. Galton gave the name co-relation (later correlation) to the slope, around 1/3, of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if $0 < r < 1$, then for a value $w > Ew$ the relation $Ew < m(w) < w$ follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points $(w, y)$ departed on each vertical from the regression line by statistically independent $N(0, \theta^2)$ random residuals whose variance $\theta^2 > 0$ was the same for all $w$. Of course this likewise amazed him, and he later identified all these properties as a consequence of bivariate normal observations $(w, y)$ (Galton).

Echoes of those long-ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models

Data of the real world seldom conform exactly to any deterministic mathematical model $y = f(x, \beta)$, and through the device of incorporating random errors we now have an established paradigm for fitting models to data $(x, y)$ by statistically estimating model parameters. In consequence, we obtain methods for such purposes as predicting what will be the average response $y$ to particular given inputs $x$, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function $f$ and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of $y$ vis-à-vis $x$. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of $y$, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne, Berk, and Freedman.

Classical Linear Regression Model and Least Squares

The classical linear regression model may be expressed $y = x\beta + \varepsilon$, an abbreviated matrix formulation of the system of equations, in which the random errors $\varepsilon$ are assumed to satisfy $E\varepsilon_i \equiv 0$, $E\varepsilon_i^2 \equiv \sigma^2 > 0$, $i \le n$:

$$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \varepsilon_i, \quad i \le n. \qquad (1)$$

The interpretation is that response $y_i$ is observed for the $i$th sample, in conjunction with numerical values $(x_{i1}, \ldots, x_{ip})$ of the independent variables. If these errors $\varepsilon_i$ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix $x^{tr}x$ is non-singular, then the maximum likelihood (ML) estimates of the model coefficients $\beta_k$, $k \le p$, are produced by ordinary least squares (LS) as follows:

$$\hat\beta_{ML} = \hat\beta_{LS} = (x^{tr}x)^{-1}x^{tr}y = \beta + M\varepsilon \qquad (2)$$

for $M = (x^{tr}x)^{-1}x^{tr}$, with $x^{tr}$ denoting the matrix transpose of $x$ and $(x^{tr}x)^{-1}$ the matrix inverse. These coefficient estimates $\hat\beta_{LS}$ are linear functions in $y$ and satisfy the Gauss–Markov properties (3), (4):

$$E(\hat\beta_{LS})_k = \beta_k, \quad k \le p, \qquad (3)$$

and, among all unbiased estimators $\beta^*_k$ (of $\beta_k$) that are linear in $y$,

$$E((\hat\beta_{LS})_k - \beta_k)^2 \le E(\beta^*_k - \beta_k)^2 \quad \text{for every } k \le p. \qquad (4)$$

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions ($F$, $t$, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
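A minimal numerical illustration of (2) (added here; data simulated, names arbitrary): the LS coefficients equal $\beta + M\varepsilon$ exactly, as the following check confirms.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(0.0, 1.0, n)
y = x @ beta + eps

M = np.linalg.inv(x.T @ x) @ x.T              # M = (x'x)^{-1} x'
beta_ls = M @ y                               # equation (2)
print(np.allclose(beta_ls, beta + M @ eps))   # beta_LS = beta + M eps, exactly
```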

Algebraic Properties of Least Squares

Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If $y = x\beta + \varepsilon$ then $\hat\beta_{LS} = My = M(x\beta + \varepsilon) = \beta + M\varepsilon$, as in (2). That is, the least squares estimate of model coefficients acts on $x\beta + \varepsilon$, returning $\beta$ plus the result of applying least squares to $\varepsilon$. This has nothing to do with the model being correct or $\varepsilon$ being error, but is purely algebraic. If $\varepsilon$ itself has the form $xb + e$, then $M\varepsilon = b + Me$. Another useful observation is that if $x$ has first column identically one, as would typically be the case for a model with constant term, then each row $M_k$, $k > 1$, of $M$ satisfies $M_k 1 = 0$ (i.e., $M_k$ is a contrast), so $(\hat\beta_{LS})_k = \beta_k + M_k\varepsilon$ and $M_k(\varepsilon + c) = M_k\varepsilon$; so $\varepsilon$ may as well be assumed to be centered, for $k > 1$. There are many of these interesting algebraic properties, such as $s^2(y - x\hat\beta_{LS}) = (1 - R^2)\,s^2(y)$, where $s(\cdot)$ denotes the sample standard deviation and $R$ is the multiple correlation, defined as the correlation between $y$ and the fitted values $x\hat\beta_{LS}$. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors $\varepsilon$ and contrasts $v$, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of $v(\hat\beta_{LS} - \beta)$, conditional on the order statistics of $\varepsilon$ (see LePage and Podgorski). Freedman and Lane advocated tests based on the permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares

If the errors $\varepsilon$ are $N(0, \Sigma)$ distributed, for a covariance matrix $\Sigma$ known up to a constant multiple, then the maximum likelihood estimates of the coefficients $\beta$ are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of $\Sigma$ will produce the same result), given by

$$\hat\beta_{ML} = (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}y = \beta + (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}\varepsilon. \qquad (5)$$

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with generalized linear models, which refers to models equating moments of $y$ to nonlinear functions of $x\beta$.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis, and their consequent distribution theory.
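A minimal sketch of (5) (added here), assuming an AR(1)-type correlation structure chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta = np.array([0.5, 2.0])

# Assumed covariance (known up to a constant): AR(1)-type correlation
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
eps = rng.multivariate_normal(np.zeros(n), Sigma)
y = x @ beta + eps

Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.inv(x.T @ Si @ x) @ x.T @ Si @ y    # equation (5)
print(beta_gls)
```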

Reproducing Kernel Generalized Linear Regression Model

Parzen developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function $\varepsilon = \{\varepsilon(t),\ t \in T\}$ has zero means, $E\varepsilon(t) \equiv 0$, and a covariance function $K(s,t) = E\varepsilon(s)\varepsilon(t)$ that is assumed known up to some positive constant multiple. In this formulation,

$$y(t) = m(t) + \varepsilon(t), \quad t \in T; \qquad m(t) = Ey(t) = \Sigma_i \beta_i w_i(t), \quad t \in T,$$

where the $w_i(\cdot)$ are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) $H(K)$ of the kernel $K$. For reasons having to do with the singularity of Gaussian measures, it is assumed that the series defining $m$ is convergent in $H(K)$. Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients $\beta$, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares

For the moment, think of $(x, y)$ as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of $y$ relative to $x$ is then, for some $\beta$,

$$E(y \mid x) = Ey + E((y - Ey) \mid x) = Ey + x\beta \quad \text{for every } x,$$

and the discrepancies $y - E(y \mid x)$ are, for each $x$, distributed $N(0, \sigma^2)$ for a fixed $\sigma^2$, independent for different $x$.

Comparing Two Basic Linear Regression Models

Freedman compares the analysis of two superficially similar but differing models.

Errors model: Model (1) above.

Sampling model: Data $(x_{i1}, \ldots, x_{ip}, y_i)$, $i \le n$, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the $\varepsilon_i$ are simply the residual discrepancies $y - x\hat\beta_{LS}$ of a least squares fit of the linear model $x\beta$ to the population. Galton's seed study is an example of this, if we regard his data $(w, y)$ as resulting from equal-probability without-replacement random sampling of a population of pairs $(w, y)$, with $w$ restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility

A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict $y$ from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of $y$ from $w$ owing to the modest correlation of around 1/3, was arguably best for his bivariate normal data $(w, y)$. Tossing in another independent variable, such as $w^2$ for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data, and a model closer to truth. How could this be? If the more parsimonious choice of $x$ accounts for enough of the variation of $y$ in relation to the variables of interest to be useful, and if its fewer coefficients $\beta$ are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggest are not required to explain the greater share of the variation of $y$ (Raudenbush and Bryk). New results on data compression (Candès and Tao) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond

Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., $X > c > EX$, will not average so highly on another scoring $Y$ as they do on $X$, i.e., $E(Y \mid X > c) < E(X \mid X > c)$. Termed reversion to mediocrity or beyond by Samuels, this property is easily come by when $X, Y$ have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables $X, Y$ be identically distributed with finite mean $EX$, and fix any $c > \max(EX, 0)$. If $P(X > c \text{ and } Y > c) < P(Y > c)$, then there is reversion to mediocrity or beyond for that $c$.

Proof. For any given $c > \max(EX, 0)$, define the difference $J$ of indicator random variables, $J = \mathbf{1}(X > c) - \mathbf{1}(Y > c)$. $J$ is zero unless one indicator is $1$ and the other $0$; $YJ$ is less than or equal to $cJ = c$ on $J = 1$ (i.e., on $X > c,\ Y \le c$), and $YJ$ is strictly less than $cJ = -c$ on $J = -1$ (i.e., on $X \le c,\ Y > c$). Since the event $J = -1$ has positive probability by assumption, the previous implies $E\,YJ < c\,EJ$, and so

$$EY\,\mathbf{1}(X > c) = E(Y\,\mathbf{1}(Y > c) + YJ) = EX\,\mathbf{1}(X > c) + E\,YJ < EX\,\mathbf{1}(X > c) + c\,EJ = EX\,\mathbf{1}(X > c),$$

yielding $E(Y \mid X > c) < E(X \mid X > c)$. ∎

Cross References
- Adaptive Linear Regression
- Analysis of Areal and Spatial Interaction Data
- Business Forecasting Methods
- Gauss-Markov Theorem
- General Linear Models
- Heteroscedasticity
- Least Squares
- Optimum Experimental Design
- Partial Least Squares Regression Versus Other Methods
- Regression Diagnostics
- Regression Models with Increasing Numbers of Unknown Parameters
- Ridge and Surrogate Ridge Regressions
- Simple Linear Regression
- Two-Stage Least Squares

References and Further Reading
- Berk RA. Regression analysis: a constructive critique. Sage, Newbury Park
- Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
- Efron B. Bootstrap methods: another look at the jackknife. Ann Stat
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat
- Freedman D. Bootstrapping regression models. Ann Stat
- Freedman D. Statistical models and shoe leather. Sociol Methodol
- Freedman D. Statistical models: theory and practice. Cambridge University Press, New York
- Freedman D, Lane D. A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
- Galton F. Typical laws of heredity. Nature
- Galton F. Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
- Hinkelmann K, Kempthorne O. Design and analysis of experiments, 2nd edn. Wiley, Hoboken
- Kendall M, Stuart A. The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
- LePage R, Podgorski K. Resampling permutations in regression without second moments. J Multivariate Anal
- Parzen E. An approach to time series analysis. Ann Math Stat
- Rao CR. Linear statistical inference and its applications. Wiley, New York
- Raudenbush S, Bryk A. Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
- Samuels ML. Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
- Stigler S. The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
- Vovk V. On-line regression competitive with reproducing kernel Hilbert spaces. Lecture Notes in Computer Science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose $x(n) = (x_1, \ldots, x_n)$ is a sample from a stochastic process $x = \{x_1, x_2, \ldots\}$. Let $p_n(x(n); \theta)$ denote the joint density function of $x(n)$, where $\theta \in \Omega \subset \mathbb{R}^k$ is a parameter. Define the log-likelihood ratio $\Lambda_n = \ln\!\left[\frac{p_n(x(n);\,\theta_n)}{p_n(x(n);\,\theta)}\right]$, where $\theta_n = \theta + n^{-1/2}h$ and $h$ is a $(k \times 1)$ vector. The joint density $p_n(x(n); \theta)$ belongs to a local asymptotic normal (LAN) family if $\Lambda_n$ satisfies

$$\Lambda_n = n^{-1/2}\,h^t S_n(\theta) - \tfrac{1}{2}\,n^{-1}\,h^t J_n(\theta)\,h + o_p(1), \qquad (1)$$

where $S_n(\theta) = \dfrac{d\ln p_n(x(n);\theta)}{d\theta}$, $J_n(\theta) = -\dfrac{d^2 \ln p_n(x(n);\theta)}{d\theta\, d\theta^t}$, and

$$\text{(i) } n^{-1/2} S_n(\theta) \xrightarrow{d} N_k(0, F(\theta)), \qquad \text{(ii) } n^{-1} J_n(\theta) \xrightarrow{p} F(\theta), \qquad (2)$$

$F(\theta)$ being the limiting Fisher information matrix. Here $F(\theta)$ is assumed to be non-random. See LeCam and Yang for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator $\hat\theta_n$ is a consistent, asymptotically normal and efficient estimator of $\theta$, with

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N_k(0, F^{-1}(\theta)). \qquad (3)$$

A large class of models involving classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family. If the limiting Fisher information matrix $F(\theta)$ is non-degenerate random, we obtain a generalization of the LAN family, for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If $\Lambda_n$ satisfies (1) and (2) with $F(\theta)$ random, the density $p_n(x(n); \theta)$ belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm $\sqrt{n}$ by a random norm $J_n^{1/2}(\theta)$ to get the limiting normal distribution, viz.,

$$J_n^{1/2}(\theta)(\hat\theta_n - \theta) \xrightarrow{d} N(0, I), \qquad (4)$$

where $I$ is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose, conditionally on $V = v$, $(x_1, x_2, \ldots, x_n)$ are i.i.d. $N(\theta, v^{-1})$ random variables, and $V$ is an exponential random variable with mean 1. The marginal joint density of $x(n)$ is then given by

$$p_n(x(n); \theta) \propto \left[1 + \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\right]^{-\left(\frac{n}{2}+1\right)}.$$

It can be verified that $F(\theta)$ is an exponential random variable with mean 1. The ML estimator is $\hat\theta_n = \bar x$, and $\sqrt{n}(\bar x - \theta) \xrightarrow{d} t(2)$. It is interesting to note that the variance of the limit distribution of $\bar x$ is $\infty$.

Example 2: Autoregressive process. Consider a first-order autoregressive process $\{x_t\}$ defined by $x_t = \theta x_{t-1} + e_t$, $t = 1, 2, \ldots$, with $x_0 = 0$, where the $\{e_t\}$ are assumed to be i.i.d. $N(0, 1)$ random variables. We then have

$$p_n(x(n); \theta) \propto \exp\left[-\frac{1}{2}\sum_{t=1}^n (x_t - \theta x_{t-1})^2\right].$$

For the stationary case, $|\theta| < 1$, this model belongs to the LAN family. However, for $|\theta| > 1$ the model belongs to the LAMN family. For any $\theta$, the ML estimator is $\hat\theta_n = \left(\sum x_t x_{t-1}\right)\left(\sum x_{t-1}^2\right)^{-1}$. One can verify that $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N(0, 1 - \theta^2)$ for $|\theta| < 1$, and $(\theta^2 - 1)^{-1}\theta^n(\hat\theta_n - \theta) \xrightarrow{d} \text{Cauchy}$ for $|\theta| > 1$.
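A simulation sketch (added here, with arbitrary parameter choices) illustrates the explosive case: the $\theta^n$-normed estimation error behaves like a standard Cauchy variable, whose absolute value has median 1.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, n_rep = 1.05, 300, 500
est = np.empty(n_rep)
for r in range(n_rep):
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + rng.normal()          # explosive AR(1)
    est[r] = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)  # ML estimator

err = (est - theta) * theta ** n / (theta ** 2 - 1)
print(np.median(np.abs(err)))   # close to 1 for a standard Cauchy limit
```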

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
- Asymptotic Normality
- Optimal Statistical Inference in Financial Engineering
- Sampling Problems for Stochastic Processes

References and Further Reading
- Basawa IV, Scott DJ. Asymptotic optimal inference for non-ergodic models. Springer, New York
- LeCam L, Yang GL. Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable $X$ with realization $x$ belongs to the location-scale family when its cumulative distribution is a function only of $(x-a)/b$:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x-a}{b}\right), \quad a \in \mathbb{R},\ b > 0,$$

where $F(\cdot)$ is a distribution having no other parameters. Different $F(\cdot)$'s correspond to different members of the family. $(a, b)$ is called the location–scale parameter, $a$ being the location parameter and $b$ being the scale parameter. For fixed $b = 1$ we have a subfamily which is a location family with parameter $a$, and for fixed $a = 0$ we have a scale family with parameter $b$. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable. It has $a = 0$ and $b = 1$. If the distribution of $X$ is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},$$

then $(a, b)$ is a location-scale parameter for the distribution of $X$ if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\, f\!\left(\frac{x-a}{b}\right)$$

for some density $f(\cdot)$, called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When $Y$ has mean $\mu_Y$ and standard deviation $\sigma_Y$, then the mean of $X$ is $E(X) = a + b\,\mu_Y$ and the standard deviation of $X$ is $\sqrt{\operatorname{Var}(X)} = b\,\sigma_Y$.

The location parameter $a$, $a \in \mathbb{R}$, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of $a$ causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or an upper or lower threshold parameter. The scale parameter $b$, $b > 0$, is responsible for the dispersion or variation of the variate $X$. Increasing (decreasing) $b$ results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; $b$ may be the standard deviation, the full or half length of the support, or the length of a central $(1 - \alpha)$-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density $f(y) = [\exp(y) - 1]\exp[1 + y - \exp(y)]$, $y \ge 0$
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett, Blom, and Kimball. When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for $a$ and $b$ as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of $a$ and $b$ will be the parameters of this line.

The latter approach takes the order statistics $X_{i:n}$, $X_{1:n} \le X_{2:n} \le \cdots \le X_{n:n}$, as regressand and the means of the reduced order statistics, $\alpha_{i:n} = E(Y_{i:n})$, as regressor, which under these circumstances act as plotting positions. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where $\varepsilon_i$ is a random variable expressing the difference between $X_{i:n}$ and its mean $E(X_{i:n}) = a + b\,\alpha_{i:n}$. As the order statistics $X_{i:n}$, and as a consequence the disturbance terms $\varepsilon_i$, are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd, the generalized-least-squares method to find best linear unbiased estimators of $a$ and $b$. Introducing the following vectors and matrices,

$$x = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix},\quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},\quad \alpha = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix},\quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},\quad \theta = \begin{pmatrix} a \\ b \end{pmatrix},\quad A = (\mathbf{1}\ \ \alpha),$$

the regression model now reads

$$x = A\,\theta + \varepsilon,$$

with variance–covariance matrix

$$\operatorname{Var}(x) = b^2\,B.$$

The GLS estimator of $\theta$ is

$$\hat\theta = (A' B^{-1} A)^{-1} A' B^{-1} x,$$

and its variance–covariance matrix reads

$$\operatorname{Var}(\hat\theta) = b^2\,(A' B^{-1} A)^{-1}.$$

The vector $\alpha$ and the matrix $B$ are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining $E(Y_{i:n})$ and $E(Y_{i:n} Y_{j:n})$. For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne. Maximum likelihood estimation for location–scale distributions is treated by Mi.
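For the reduced exponential distribution the needed moments have standard closed forms, which makes a small numerical illustration possible. The following sketch (added here; sample size and parameter values are arbitrary) computes the BLUE of $(a, b)$ from an ordered sample.

```python
import numpy as np

rng = np.random.default_rng(9)
n, a_true, b_true = 20, 10.0, 2.0

# Reduced exponential (F(y) = 1 - exp(-y)): closed forms for the
# means and covariances of the order statistics Y_(1) <= ... <= Y_(n)
w = 1.0 / (n - np.arange(n))          # 1/n, 1/(n-1), ..., 1/1
alpha = np.cumsum(w)                  # alpha_i = sum_{r<=i} 1/(n-r+1)
v = np.cumsum(w ** 2)                 # Var(Y_(i))
B = np.minimum.outer(v, v)            # Cov(Y_(i), Y_(j)) = v_{min(i,j)}

A = np.column_stack([np.ones(n), alpha])
x = np.sort(a_true + b_true * rng.exponential(size=n))

Binv = np.linalg.inv(B)
theta = np.linalg.inv(A.T @ Binv @ A) @ A.T @ Binv @ x   # BLUE of (a, b)
print(theta)
```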

About the Author
Dr. Horst Rinne is Professor Emeritus of Statistics and Econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he later served as Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute, and is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
- Statistical Distributions: An Overview

References and Further Reading
- Barnett V. Probability plotting methods and order statistics. Appl Stat
- Barnett V. Convenient probability plotting positions for the normal distribution. Appl Stat
- Blom G. Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
- Kimball BF. On the choice of plotting positions on probability paper. J Am Stat Assoc
- Lloyd EH. Least-squares estimation of location and scale parameters using order statistics. Biometrika
- Mi J. MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
- Rinne H. Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen. In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti), and different formulations may be appropriate in particular applications (Aitchison). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1, \ldots, n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution, with $\operatorname{logit}(P_i) = \log(P_i/(1 - P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1, \ldots, x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \text{Binomial}(m_i, P_i),$$
$$\operatorname{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle. Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if $\sigma^2$ is small then, as derived in Williams,

$$E[R_i] = m_i\pi_i \quad \text{and} \quad \operatorname{Var}(R_i) = m_i\pi_i(1 - \pi_i)\,[1 + \sigma^2(m_i - 1)\pi_i(1 - \pi_i)],$$

where $\operatorname{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect $Z$ allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
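The exact moments require integrating the logistic transform against the normal density; a quick sketch (added here, with assumed parameter values) approximates them by Gauss–Hermite quadrature and compares the exact variance of $R_i$ with Williams' small-$\sigma^2$ approximation.

```python
import numpy as np

mu, sigma, m = 0.5, 0.4, 10                      # assumed values for illustration
nodes, weights = np.polynomial.hermite_e.hermegauss(40)
wq = weights / np.sqrt(2.0 * np.pi)              # quadrature for N(0, 1)

p = 1.0 / (1.0 + np.exp(-(mu + sigma * nodes)))  # P at the quadrature nodes
Ep = np.sum(wq * p)                              # E[P]
Ep2 = np.sum(wq * p ** 2)                        # E[P^2]

# exact Var(R) = m E[P(1-P)] + m^2 Var(P), rearranged:
var_R = m * Ep * (1 - Ep) + m * (m - 1) * (Ep2 - Ep ** 2)

pi_i = 1.0 / (1.0 + np.exp(-mu))                 # Williams: logit(pi) = mu
var_W = m * pi_i * (1 - pi_i) * (1 + sigma ** 2 * (m - 1) * pi_i * (1 - pi_i))
print(var_R, var_W)                              # close for small sigma^2
```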

About the Author
For biography, see the entry Logistic Distribution.

Cross References
- Logistic Distribution
- Mixed Membership Models
- Multivariate Normal Distributions
- Normal Distribution, Univariate

References and Further Reading
- Agresti A. Categorical data analysis, 2nd edn. Wiley, New York
- Aitchison J. The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
- Aitchison J. The statistical analysis of compositional data. Chapman & Hall, London
- Aitchison J, Begg CB. Statistical diagnosis when basic cases are not classified with certainty. Biometrika
- Aitchison J, Shen SM. Logistic-normal distributions: some properties and uses. Biometrika
- McCulloch CE, Searle SR. Generalized linear and mixed models. Wiley, New York
- Williams D. Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, $\sigma^2$, is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\{(y_i\theta_i - b(\theta_i))/\alpha(\phi) + c(y_i; \phi)\}, \qquad (2)$$

with $\theta$ representing the link function, $\alpha(\phi)$ the scale parameter, $b(\theta)$ the cumulant, and $c(y; \phi)$ the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to $\theta$, $b'(\theta)$; the variance is given as the second derivative, $b''(\theta)$. Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\{y_i \ln(\pi_i/(1 - \pi_i)) + \ln(1 - \pi_i)\}. \qquad (3)$$

The link function is therefore $\ln(\pi/(1 - \pi))$, and the cumulant $-\ln(1 - \pi)$, or $\ln(1/(1 - \pi))$. For the Bernoulli, $\pi$ is defined as the probability of success. The first derivative of the cumulant is $\pi$, the second derivative $\pi(1 - \pi)$. These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

rithm as well asMLE are both based on the log-likelihoodfunctione likelihood is simply a re-parameterization ofthe pdf which seeks to estimate π for example rather thanye log-likelihood is formed from the likelihood by tak-ing the natural log of the function allowing summation

L Logistic Regression

across observations during the estimation process ratherthan multiplication

The traditional GLM symbol for the mean, $\mu$, is typically substituted for $\pi$ when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^n \{y_i \ln(\mu_i/(1 - \mu_i)) + \ln(1 - \mu_i)\}, \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^n \{y_i \ln(\mu_i) + (1 - y_i)\ln(1 - \mu_i)\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^n \{y_i \ln(y_i/\mu_i) + (1 - y_i)\ln((1 - y_i)/(1 - \mu_i))\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of $\mu$ for each observation in the model is calculated on the basis of the linear predictor, $x'\beta$. For the normal model, the predicted fit, $\hat y$, is identical to $x'\beta$, the right side of (1). However, for logistic models, the response is expressed in terms of the link function, $\ln(\mu_i/(1 - \mu_i))$. We have, therefore,

$$x_i'\beta = \ln(\mu_i/(1 - \mu_i)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (7)$$

The value of $\mu_i$ for each observation in the logistic model is calculated as

$$\mu_i = 1/(1 + \exp(-x_i'\beta)) = \exp(x_i'\beta)/(1 + \exp(x_i'\beta)). \qquad (8)$$

The functions to the right of $\mu$ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, $\mu$ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of $x'\beta$ rather than $\mu$. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to $\beta$, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to $\beta$ produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^n (y_i - \mu_i)\,x_i, \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial \beta\,\partial \beta'} = -\sum_{i=1}^n x_i x_i'\,\mu_i(1 - \mu_i). \qquad (10)$$
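Equations (9) and (10) translate directly into a Newton–Raphson loop; the sketch below (added here, with simulated data) maximizes the log-likelihood with the update $\beta \leftarrow \beta - H^{-1}g$.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))                # inverse link, equation (8)
    g = X.T @ (y - mu)                              # gradient, equation (9)
    H = -(X * (mu * (1 - mu))[:, None]).T @ X       # Hessian, equation (10)
    step = np.linalg.solve(H, g)
    beta = beta - step                              # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)        # ML estimates; exp(beta) gives odds ratios
```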

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of $\mu$, $\ln(\mu/(1 - \mu))$, where the odds are understood as the success of $\mu$ over its failure, $1 - \mu$. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults     Total
no             52       765       817
yes            57       442       499
Total         109     1,207     1,316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived | Odds Ratio | Std. Err. | z | P>|z| | [95% Conf. Interval]
age      |            |           |   |       |

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

► The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had roughly double – nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data and the odds ratio were calculated as 1.015, we would interpret the relationship as:

► The odds of surviving is one and a half percent greater for each increasing year of age.
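The odds ratio from a 2 × 2 table is a direct computation; a small sketch with hypothetical counts (the published Titanic tabulation is not reproduced here):

# Hypothetical 2x2 survival counts: columns adult/child, rows survived or not
adult_yes, adult_no = 654, 1438   # illustrative counts only
child_yes, child_no = 57, 52      # illustrative counts only

odds_adult = adult_yes / adult_no
odds_child = child_yes / child_no
odds_ratio = odds_adult / odds_child
print(odds_ratio, 1 / odds_ratio)  # OR for adults vs children, and its inverse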

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes for a specific covariate pattern, and a denominator (m), the number of observations having that specific covariate pattern. The response y/m is binomially distributed as

f(yᵢ; πᵢ, mᵢ) = C(mᵢ, yᵢ) πᵢ^{yᵢ} (1 − πᵢ)^{mᵢ − yᵢ}

where C(mᵢ, yᵢ) is the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μᵢ; yᵢ, mᵢ) = Σ_{i=1}^n { yᵢ ln(μᵢ/(1 − μᵢ)) + mᵢ ln(1 − μᵢ) + ln C(mᵢ, yᵢ) }

Taking derivatives of the cumulant, −mᵢ ln(1 − μᵢ), as we did for the binary response model, produces a mean of μᵢ = mᵢπᵢ and variance μᵢ(1 − μᵢ/mᵢ). Consider the data below (a sketch of the grouped log-likelihood follows the table):

y | cases | x1 | x2 | x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases sharing that pattern of values for x1, x2, and x3; of those three cases, one has an individual outcome of 1 and the other two have outcomes of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.
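The grouped (binomial) log-likelihood above is straightforward to evaluate; a minimal sketch with invented counts:

import numpy as np
from scipy.special import gammaln

def binomial_loglik(y, m, mu):
    # Grouped logistic log-likelihood; mu is the fitted probability per pattern
    log_comb = gammaln(m + 1) - gammaln(y + 1) - gammaln(m - y + 1)
    return np.sum(y * np.log(mu / (1 - mu)) + m * np.log(1 - mu) + log_comb)

y = np.array([1, 3, 2])         # hypothetical successes per covariate pattern
m = np.array([3, 4, 5])         # hypothetical cases per pattern
mu = np.array([0.4, 0.7, 0.5])  # hypothetical fitted probabilities
print(binomial_loglik(y, m, mu))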

y  | Odds ratio | OIM std. err. | z | P>|z| | [95% conf. interval]
x1 |            |               |   |       |
x2 |            |               |   |       |
x3 |            |               |   |       |

The data in the above table may be restructured so that they are in individual-observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set grouped into a small number of covariate-pattern rows as described above. Data in tables are nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once widely used, is now employed only with caution: the test is heavily influenced by the manner in which tied data are classified. Rather than comparing observed with expected probabilities across one fixed set of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see ►Akaike's Information Criterion and ►Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that both AIC and BIC are biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and an Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − μ)/τ}],  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − μ)/τ}],  −∞ < x < ∞,

h(x) = 1/(τ[1 + exp{−(x − μ)/τ}]).

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^α],  t ≥ 0, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
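The inverse-logit construction in (2) is all that is needed to simulate logistic variates; a minimal sketch:

import numpy as np

rng = np.random.default_rng(42)

def rlogistic(n, mu=0.0, tau=1.0):
    # Simulate from the logistic distribution via X = log(U/(1-U))
    u = rng.uniform(size=n)
    return mu + tau * np.log(u / (1 - u))

x = rlogistic(100_000, mu=2.0, tau=1.5)
print(x.mean(), x.var())  # compare with mu and tau^2 * pi^2 / 3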

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti () and Hilbe ().
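The correspondence between the fitted logit line and the tolerance distribution is a one-line conversion; a sketch with made-up coefficients:

import numpy as np

beta0, beta1 = -4.0, 0.8   # hypothetical fitted intercept and slope
mu = -beta0 / beta1        # tolerance distribution location
tau = 1.0 / abs(beta1)     # tolerance distribution scale
d = 6.0                    # a dose level
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * d)))  # P(response) at dose d
print(mu, tau, p)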

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected Member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

► Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the following general rule:

► A unique Lorenz curve corresponds to every distribution. The contrary does not hold: every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with Lorenz ordering.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio between the area between the diagonal and the Lorenz curve, and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p) [plot]

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L₁(p) has greater inequality than L₂(p) [plot]

The higher the G value, the greater the inequality in the income distribution.
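From a sample of incomes, the empirical Lorenz curve and Gini index can be computed directly from the sorted values; a minimal sketch:

import numpy as np

def lorenz_gini(income):
    # Empirical Lorenz curve ordinates and Gini index from a sample
    x = np.sort(np.asarray(income, dtype=float))
    cum = np.cumsum(x)
    L = np.concatenate([[0.0], cum / cum[-1]])  # Lorenz ordinates at p = i/n
    p = np.linspace(0.0, 1.0, len(L))
    gini = 1.0 - 2.0 * np.trapz(L, p)           # area-based definition
    return p, L, gini

incomes = np.array([12, 15, 20, 28, 41, 64, 120])  # hypothetical incomes
p, L, G = lorenz_gini(incomes)
print(round(G, 3))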

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman, Jakobsson, Kakwani):

Theorem. Let u(x) be a continuous, monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen () gave an alternative condition for Lorenz dominance, namely that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance. A numerical illustration of the first two cases is sketched below.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
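As a quick numerical illustration of cases (I) and (II), one can reuse the lorenz_gini helper sketched above on a hypothetical income vector; the tax rules here are invented for illustration only:

import numpy as np

incomes = np.array([12, 15, 20, 28, 41, 64, 120], dtype=float)

flat = 0.7 * incomes          # flat tax: u(x)/x constant, case (II)
progressive = incomes ** 0.8  # u(x)/x = x^(-0.2) decreasing, case (I)

_, _, g0 = lorenz_gini(incomes)      # pre-tax Gini
_, _, g1 = lorenz_gini(flat)         # unchanged under the flat tax
_, _, g2 = lorenz_gini(progressive)  # reduced under the progressive rule
print(g0, g1, g2)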

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework within which to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. If the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss suffered from the decision being unequal to the truth. The larger the value, the larger the loss suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information through data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative, Θ = Θ₀ + Θ₁. The usual loss function is the 0–1 loss:

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
L(θ, ϑ) = 1 if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly (see the sketch below).

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties of loss functions can be quite different. Classically it was assumed that the loss is convex (see ►Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. If the loss is not continuous, then it must be carefully defined so as not to produce counterintuitive results in practice; see Bischoff ().

If a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation', i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, least squares estimation is obtained with ℓ(t) = t².
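The losses named above are one-liners to implement; this sketch collects them, with arbitrary illustrative values for the Huber and LinEx constants:

import numpy as np

def square_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma=1.0):
    # Quadratic near zero, linear in the tails
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    # Asymmetric: exponential on one side of zero, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-2.0, 2.0, 5)
print(huber_loss(t))
print(linex_loss(t))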

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam


Cross References
►Chebyshev's Inequality
►Convergence of Random Variables
►Ergodic Theorem
►Estimation: An Overview
►Expected Value
►Foundations of Probability
►Glivenko–Cantelli Theorems
►Probability Theory: An Outline
►Random Field
►Statistics, History of
►Strong Approximations in Probability and Statistics


Learning Statistics in a Foreign Language

Khidir M. Abdelbasit
Sultan Qaboos University, Muscat, Sultanate of Oman

Background
The Sultanate of Oman is an Arabic-speaking country, where the medium of instruction in pre-university education is Arabic. In Sultan Qaboos University (SQU), all sciences (including Statistics) are taught in English. The reason is that most of the scientific literature is in English, and teaching in the native language may leave graduates at a disadvantage. Since only few instructors speak Arabic, the university adopts a policy of no communication in Arabic in classes and office hours. Students are required to achieve a minimum level in English (in terms of an IELTS score) before they start their study program. Very few students achieve that level on entry, and the majority spend about two semesters studying English only.

Language and Cultural Problems
It is to be expected that students from a non-English-speaking background will face serious difficulties when learning in English, especially in the first year or two. Most of the literature discusses problems faced by foreign students pursuing study programs in an English-speaking country, or by a minority in a multi-cultural society (see, for example, Coutis and Wood; Hubbard; Koh). Such students live (at least while studying) in an English-speaking community, with which they have to interact on a daily basis. These difficulties are more serious for our students, who are studying in their own country, where English is not the official language. They hardly use English outside classrooms, and avoid talking in class as much as they can.

My SQU Experience
Statistical concepts and methods are most effectively taught through real-life examples that the students appreciate and understand. We use the most popular textbooks in the USA for our courses. These textbooks use this approach with US students in mind. Our main problems are:

Most of the examples and exercises used are completely alien to our students. The discussions meant to maintain the students' interest only serve to put ours off. With limited English, they have serious difficulties understanding what is explained, and hence tend not to listen to what the instructor is saying. They do not read the textbooks, because these contain pages and pages of lengthy explanations and discussions they cannot follow. A direct effect is that students may find the subject boring and quickly lose interest. Their attention then turns to the art of passing tests instead of acquiring the intended knowledge and skills. To pass their tests they use both their class and study times looking through examples, concentrating on what formula to use and where to plug the numbers they have to get the answer. This way they manage to do the mechanics fairly well, but the concepts are almost entirely missed.


The problem is worse with introductory probability courses, where games of chance are extensively used as illustrative examples in textbooks. Most of our students have never seen a deck of playing cards, and some may even be offended by discussing card games in a classroom.

The burden of finding strategies to overcome these difficulties falls on the instructor. Statistical terms and concepts such as parameter/statistic, sampling distribution, unbiasedness, consistency, sufficiency, and the ideas underlying hypothesis testing are not easy to get across even in the students' own language. To do that in a foreign language is a real challenge. For the Statistics program to be successful, all (or at least most of the) instructors involved should be up to this challenge. This is a time-consuming task with little reward other than self-satisfaction. In SQU the problem is compounded further by the fact that most of the instructors are expatriates on short-term contracts, who are more likely to use their time for personal career advancement rather than time-consuming community service jobs.

What Can Be Done
For our first Statistics course we produced a manual that contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good collection of problems from the local culture to motivate the students. The manual was well received by the students, to the extent that students prefer to practice with examples from the manual rather than the textbook.

Textbooks written in English that are brief and to the point are needed. These should include examples and exercises from the students' own culture. A student trying to understand a specific point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students' faces clearly indicate that you have got nothing across, it is natural to try explaining more, using more examples. In an end-of-semester evaluation of a course I taught, a student once wrote: "The instructor explains things more than needed. He makes simple points difficult." This indicates that, when teaching in a foreign language, lengthy oral or written explanations are not helpful. A better strategy is to explain concepts and techniques briefly and provide plenty of examples and exercises that will help the students absorb the material by osmosis. The basic statistical concepts can only be effectively communicated to students in their own language. For this reason textbooks should contain a good glossary, where technical terms and concepts are explained using the local language.

I expect such textbooks to go a long way towards enhancing students' understanding of Statistics. An international project can be initiated to produce an introductory statistics textbook with different versions intended for different geographical areas. The English material will be the same; the examples will vary to some extent from area to area, with glossaries in local languages. Universities in the developing world naturally look at western universities as models, and international (western) involvement in such a project is needed for it to succeed. The project would be a major contribution to the promotion of understanding Statistics and excellence in statistical education in developing countries. The International Statistical Institute takes pride in supporting statistical progress in the developing world; this project can lay the foundation for that progress and hence is worth serious consideration by the institute.

Cross References
►Online Statistics Education
►Promoting, Fostering and Development of Statistics in Developing Countries
►Selection of Appropriate Statistical Methods in Developing Countries
►Statistical Literacy, Reasoning, and Thinking
►Statistics Education

References and Further Reading
Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. httpwwwmathuocgr~ictmProceedingspappdf
Hubbard R () Teaching statistics to students who are learning in a foreign language. ICOTS
Koh E () Teaching statistics to students with limited language skills using MINITAB. httparchivesmathutkeduICTCMVOLCpaperpdf

Least Absolute Residuals Procedure

Richard William Farebrother
Honorary Reader in Econometrics
Victoria University of Manchester, Manchester, UK

For i = 1, …, n, let x_{i1}, x_{i2}, …, x_{iq}, y_i represent the ith observation on a set of q + 1 variables, and suppose that we wish to fit a linear model of the form

y_i = x_{i1}β₁ + x_{i2}β₂ + ⋯ + x_{iq}β_q + ε_i


to these n observations. Then, for p > 0, the Lp-norm fitting procedure chooses values for b₁, b₂, …, b_q to minimise the Lp-norm of the residuals, [Σ_{i=1}^n ∣e_i∣^p]^{1/p}, where, for i = 1, …, n, the ith residual is defined by

e_i = y_i − x_{i1}b₁ − x_{i2}b₂ − ⋯ − x_{iq}b_q.

The most familiar Lp-norm fitting procedure, known as the ►least squares procedure, sets p = 2 and chooses values for b₁, b₂, …, b_q to minimize the sum of the squared residuals, Σ_{i=1}^n e_i². A second choice, to be discussed in the present article, sets p = 1 and chooses b₁, b₂, …, b_q to minimize the sum of the absolute residuals, Σ_{i=1}^n ∣e_i∣. A third choice sets p = ∞ and chooses b₁, b₂, …, b_q to minimize the largest absolute residual, max_{i=1}^n ∣e_i∣.

Setting u_i = e_i and v_i = 0 if e_i ≥ 0, and u_i = 0 and v_i = −e_i if e_i < 0, we find that e_i = u_i − v_i and ∣e_i∣ = u_i + v_i, so that the least absolute residuals (LAR) fitting problem chooses b₁, b₂, …, b_q to minimize the sum of the absolute residuals

Σ_{i=1}^n (u_i + v_i)

subject to

x_{i1}b₁ + x_{i2}b₂ + ⋯ + x_{iq}b_q + u_i − v_i = y_i for i = 1, …, n,

and u_i ≥ 0, v_i ≥ 0 for i = 1, …, n. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.

Gauss has noted that solutions of this problem are characterized by the presence of a set of q zero residuals. Such solutions are robust to the presence of outlying observations; indeed, they remain constant under variations in the other n − q observations, provided that these variations do not cause any of the residuals to change their signs.
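The linear program above can be handed to any LP solver. A minimal sketch using scipy.optimize.linprog, with the variables ordered as (b, u, v) and b left unbounded (the toy data are invented):

import numpy as np
from scipy.optimize import linprog

def lar_fit(X, y):
    # Least absolute residuals fit via the LP formulation above
    n, q = X.shape
    c = np.concatenate([np.zeros(q), np.ones(n), np.ones(n)])  # min sum(u + v)
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])               # X b + u - v = y
    bounds = [(None, None)] * q + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:q]

X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.8, 8.0])  # last point is an outlier
print(lar_fit(X, y))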

The LAR fitting procedure corresponds to the maximum likelihood estimator when the ε-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the ε-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711–1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant, x_{i1} = 1, and the values of b₁ and b₂ are constrained to satisfy the adding-up condition Σ_{i=1}^n (y_i − b₁ − x_{i2}b₂) = 0 usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with q ≥ 2 variables are still to be found in the literature.

For an account of recent developments in this area, see the series of volumes edited by Dodge. For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley, and Rhodes, see Farebrother (). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother ().

About the Author
Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement, and was thereafter an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations; Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900; and Visualizing Statistical Models and Concepts. He has also published numerous research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr. Farebrother was also Visiting Associate Professor at Monash University, Australia, and Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
►Least Squares
►Residuals

References and Further Reading
Dodge Y (ed) () Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) () L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) () L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) () Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel
Farebrother RW () Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW () Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stępniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves algebraic and numerical techniques used in "solving" overdetermined systems F(x) ≈ b of equations, where b ∈ Rⁿ while F(x) is a column of the form

F(x) = (f₁(x), f₂(x), …, f_{n−1}(x), f_n(x))ᵀ

with entries f_i = f_i(x), i = 1, …, n, where x = (x₁, …, x_p)ᵀ. The LS problem is linear when each f_i is a linear function, and nonlinear if not.

The linear LS problem refers to a system Ax = b of linear equations. Such a system is overdetermined if n > p. If b ∉ range(A), the system has no proper solution and will be denoted by Ax ≈ b. In this situation we seek a solution of some optimization problem. The name "Least Squares" is justified by the l₂-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

y_i = f_i(x_{i1}, …, x_{ip}; β₁, …, β_p) + e_i for i = 1, …, n,

where x_{ij}, i = 1, …, n, j = 1, …, p, are constants given by the experimental design; f_i, i = 1, …, n, are given functions depending on the unknown parameters β_j, j = 1, …, p; and y_i, i = 1, …, n, are the values of these functions observed with some random errors e_i. We want to estimate the unknown parameters β_j on the basis of the data set {x_{ij}, y_i}.

In linear regression each f_i is a linear function of the type f_i = Σ_{j=1}^p c_{ij}(x₁, …, x_n)β_j, and the model may be presented in vector–matrix notation as

y = Xβ + e,

where y = (y₁, …, y_n)ᵀ, e = (e₁, …, e_n)ᵀ and β = (β₁, …, β_p)ᵀ, while X is an n × p matrix with entries x_{ij}. If e₁, …, e_n are uncorrelated with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to finding a vector β̂ that minimizes the norm ∣∣y − Xβ∣∣² = (y − Xβ)ᵀ(y − Xβ).

Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand, the latter reduces to solving the consistent system

XᵀXβ = Xᵀy

of linear equations, called the normal equations. In particular, if rank(X) = p, then the system has a unique solution of the form

β̂ = (XᵀX)⁻¹Xᵀy.
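The normal equations translate directly into code; a minimal sketch (in practice numpy.linalg.lstsq or a QR factorization is preferred for numerical stability):

import numpy as np

def ols(X, y):
    # Ordinary least squares via the normal equations X'X beta = X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])  # toy design matrix
y = np.array([2.1, 2.9, 4.2, 4.8])
print(ols(X, y))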

For linear regression y_i = α + βx_i + e_i with one regressor x, the BLU estimators of the parameters α and β may be presented in the convenient form

β̂ = ns_{xy}/ns_x²  and  α̂ = ȳ − β̂x̄,

where

ns_{xy} = Σ_i x_i y_i − (Σ_i x_i)(Σ_i y_i)/n,
ns_x² = Σ_i x_i² − (Σ_i x_i)²/n,
x̄ = (Σ_i x_i)/n and ȳ = (Σ_i y_i)/n.

For its computation we only need a simple pocket calculator.

Example. A table presenting the number of residents in thousands (x_i) and the unemployment rate in % (y_i) for some cities of Poland can be used to estimate the parameters β and α: from the table one computes Σ_i x_i, Σ_i y_i, Σ_i x_i², and Σ_i x_i y_i, hence ns_x² and ns_{xy}, and finally the estimates β̂ and α̂, giving the fitted line f(x) = β̂x + α̂.
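The pocket-calculator recipe is equally direct in code; the (x, y) values below are hypothetical, since the table's numbers are not reproduced here:

import numpy as np

x = np.array([45.0, 120.0, 250.0, 640.0])  # residents in thousands (made up)
y = np.array([19.0, 14.0, 11.0, 6.0])      # unemployment rate in % (made up)

n = len(x)
nsxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
nsx2 = np.sum(x ** 2) - np.sum(x) ** 2 / n
beta = nsxy / nsx2
alpha = y.mean() - beta * x.mean()
print(beta, alpha)  # slope and intercept of the fitted line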


If the variance–covariance matrix of the error vector e coincides (up to a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased Estimation reduces to the minimization of (y − Xβ)ᵀV⁻¹(y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form

β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known; instead, iterative algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem in the case when all data are affected by errors. Consider an overdetermined system of n linear equations, Ax ≈ b, with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

∣∣[A, b] − [Â, b̂]∣∣_F

over all Â ∈ R^{n×k} and b̂ ∈ range(Â), where the Frobenius norm is defined by ∣∣(a_{ij})∣∣²_F = Σ_{ij} a²_{ij}. Once a minimizing [Â, b̂] is found, any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique; simple small examples of pairs A, b with no TLS solution, or with several, can be constructed.

It is known that the TLS solution (if it exists) is always better than the ordinary LS one, in the sense that the correction b̂ − Âx has smaller l₂-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any n × k matrix A with real entries, there exist orthonormal matrices P = [p₁, …, p_n] and Q = [q₁, …, q_k] such that

PᵀAQ = diag(σ₁, …, σ_m), where σ₁ ≥ ⋯ ≥ σ_m ≥ 0 and m = min{n, k}.
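numpy exposes the decomposition directly, and the classical SVD-based construction (as in Van Huffel and Vandewalle, cited below) yields the TLS solution for a single right-hand side; a minimal sketch, assuming the solution exists:

import numpy as np

def tls(A, b):
    # Total least squares via the SVD of the augmented matrix [A, b]
    n, k = A.shape
    _, _, Vt = np.linalg.svd(np.hstack([A, b.reshape(-1, 1)]))
    v = Vt[-1]            # right singular vector of the smallest sigma
    return -v[:k] / v[k]  # assumes v[k] != 0

A = np.array([[1.0, 0.9], [2.1, 2.0], [2.9, 3.1]])  # noisy toy data
b = np.array([2.0, 4.1, 5.9])
print(tls(A, b))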

About the Author
For biography, see the entry ►Random Variable.

Cross References
►Absolute Penalty Estimation
►Adaptive Linear Regression
►Analysis of Areal and Spatial Interaction Data
►Autocorrelation in Regression
►Best Linear Unbiased Estimation in Linear Models
►Estimation
►Gauss–Markov Theorem
►General Linear Models
►Least Absolute Residuals Procedure
►Linear Regression Models
►Nonlinear Models
►Optimum Experimental Design
►Partial Least Squares Regression Versus Other Methods
►Simple Linear Regression
►Statistics, History of
►Two-Stage Least Squares

References and Further Reading
Björck A () Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenberg H () Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J () The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J () Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams arise because they provide a basis for describing the water input of dams. Their popularity in finance is because they describe the financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed, so that the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and ►skewness, which negates the normality assumption inherent in the Black–Scholes model. Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {X_t, t ≥ 0} that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of X_{t+s} − X_t is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, X_{t+s} − X_t is independent of (X_u, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and ε > 0, lim_{s→t} P(∣X_t − X_s∣ > ε) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If Φ_t(z) is the characteristic function of a Lévy process at time t, then its characteristic component φ(z) := ln Φ_t(z)/t is of the form

φ(z) = iza − z²b/2 + ∫_R [exp(izx) − 1 − izx I_{∣x∣<1}] ν(dx),

where a ∈ R, b ∈ R₊, and ν is a measure on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞.

The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small size in any small interval. The constant a defined above is called the drift term of the process, and b its variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows.

Given any a ∈ R, b ∈ R₊, and measure ν on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X^(1), X^(2), and X^(3), where X^(1) is a Brownian motion with drift a and volatility b (in the sense defined below), X^(2) is a compound Poisson process, and X^(3) is a square integrable martingale. The characteristic components of X^(1), X^(2), and X^(3) (denoted by φ^(1)(z), φ^(2)(z), and φ^(3)(z), respectively) are as follows:

φ^(1)(z) = iza − z²b/2,
φ^(2)(z) = ∫_{|x|≥1} (exp(izx) − 1) ν(dx),
φ^(3)(z) = ∫_{|x|<1} (exp(izx) − 1 − izx) ν(dx).
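As a numerical sanity check on these formulas, the following sketch simulates a compound Poisson process (where the integral in the characteristic component can be computed in closed form) and compares the empirical characteristic function at one point with the theoretical one. The jump rate, the standard normal jump law, and the evaluation point are illustrative choices, not taken from the entry.

```python
import numpy as np

rng = np.random.default_rng(0)

lam, t, z = 2.0, 1.0, 1.5          # illustrative jump rate, horizon, argument
n_paths = 100_000

# Compound Poisson value at time t: a Poisson(lam*t) number of N(0,1) jumps.
counts = rng.poisson(lam * t, size=n_paths)
X_t = np.array([rng.normal(size=k).sum() for k in counts])

empirical = np.exp(1j * z * X_t).mean()

# Prediction for a finite measure nu = lam * N(0,1):
# E exp(izX_t) = exp(t * integral of (e^{izx} - 1) nu(dx)) = exp(t*lam*(e^{-z^2/2} - 1)).
theoretical = np.exp(t * lam * (np.exp(-z**2 / 2) - 1.0))

print(empirical, theoretical)      # should agree up to Monte Carlo error
```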

Examples of the Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift μ and volatility rate σ² if μ = a, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift μ > 0 and volatility rate σ². For any x > 0, let T_x = inf{t : X_t > x}. Then {T_x, x ≥ 0} is an increasing Lévy process (called inverse Brownian motion); its Lévy measure is given by

υ(dx) = (2πσ²x³)^{−1/2} exp(−μ²x/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ x ν(dx) = 0, and its Lévy measure is given by

ν(dx) = αx^{−1} exp(−x/β) dx,  x > 0,

where α, β > 0. It follows that the mean term (E(X₁)) and the variance term (V(X₁)) for the process are equal to αβ and αβ², respectively.

The following is a simulated sample path of a gamma process (Fig. 1).

Lévy Processes. Fig. 1 Gamma process

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: μ – the Brownian process drift term, σ – the volatility of the Brownian process, and ν – the variance term of the gamma process.

The following are two simulated sample paths, one for a Brownian motion and the other for a variance gamma process with the same values for the drift and volatility terms (Fig. 2).

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths
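A minimal simulation sketch of both constructions (this is not the code behind Figs. 1–2; all parameter values and the time grid are assumptions made for illustration): the gamma path is built from independent gamma increments, and the variance gamma path from a Brownian motion run on a gamma clock with unit mean rate.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 1.0, 1.0          # gamma process parameters: E X_t = alpha*beta*t
mu, sigma, nu = 1.0, 0.5, 1.0   # variance gamma: drift, volatility, gamma variance
T, n = 10.0, 1_000
dt = T / n

# Gamma process: independent Gamma(alpha*dt, beta) increments, summed.
gamma_path = np.cumsum(rng.gamma(shape=alpha * dt, scale=beta, size=n))

# Gamma subordinator ("clock") with mean t and variance nu*t, then the
# variance gamma path as a Brownian motion with drift run on that clock.
dG = rng.gamma(shape=dt / nu, scale=nu, size=n)
vg_path = np.cumsum(mu * dG + sigma * np.sqrt(dG) * rng.normal(size=n))
```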

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University (–). He was on the editorial board of Applied Stochastic Models and Data Analysis (–). He was the Editor (jointly with E. Çınlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press, ). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res –
Abdel-Hameed M () Optimal control of a dam using P^M_{λ,τ} policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob –
Black F, Scholes MS () The pricing of options and corporate liabilities. J Pol Econ –
Cont R, Tankov P () Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC () The variance gamma process and option pricing. Eur Finance Rev –
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw –
van Noortwijk JM () A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety –

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (e_x) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

e_x = T_x / l_x,

where x is age, T_x is the number of person-years lived at age x and over, and l_x is the number of survivors at age x according to the life table.
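As a small worked illustration of e_x = T_x/l_x, the sketch below builds T_x from a toy abridged life table; the survivor counts l_x are made-up numbers, and person-years are approximated by the trapezoid rule.

```python
# Toy life-table computation of life expectancy e_x = T_x / l_x.
# The survivor counts l_x below are illustrative, not real data.
ages = [0, 1, 5, 15, 45, 65, 85]             # start of each age interval
lx   = [100_000, 98_000, 97_500, 97_000, 93_000, 80_000, 30_000]

# Person-years lived in each interval, approximated by the trapezoid rule;
# assume (as a simplification) nobody survives past the last age shown.
Lx = [(lx[i] + lx[i+1]) / 2 * (ages[i+1] - ages[i]) for i in range(len(lx) - 1)]
Lx.append(0.0)

Tx = [sum(Lx[i:]) for i in range(len(Lx))]   # person-years lived at age x and over
ex = [T / l for T, l in zip(Tx, lx)]

print(f"life expectancy at birth e0 = {ex[0]:.1f} years")
```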


Life expectancy can be calculated for combined sexes or separately for males and females. There can be significant differences between sexes.

Life expectancy at birth (e₀) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e₀ = T₀ / l₀,

where T₀ is the total size of the population and l₀ is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e₀ < e₁ < e₂, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was below  years; between the years  and , life expectancy at birth rose from about  years to a global average of  years, and to more than  years in the richest countries (Riley ). Furthermore, in most industrialized countries in the early twenty-first century, life expectancy averaged about  years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents: from life expectancies of about  years in some central African populations to life expectancies of  years and above in many European countries. The more developed regions have an average life expectancy of  years, while the population of less developed regions is at birth expected to live an average of  years less. The two continents that display the most extreme differences in life expectancies are North America ( years) and Africa ( years), where as of recently the gap between life expectancies amounts to  years (UN, – data).

Countries with the highest life expectancies in the world ( years) are Australia, Iceland, Italy, Switzerland, and Japan ( years). Japanese men and women live an average of  and  years, respectively (WHO ).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, the average life expectancy is  years and below. Some of the world's lowest life expectancies are in Sierra Leone ( years), Afghanistan ( years), Lesotho ( years), and Zimbabwe ( years).

In nearly all countries women live longer than men. The world's average life expectancy at birth is  years for males and  years for females; the gap is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes ( years less for men), whereas in Tonga life expectancy for males exceeds that for females by  years (WHO ).

Life expectancy is assumed to rise continuously. According to estimations by the UN, global life expectancy at birth is likely to rise to an average of  years by –. By , life expectancy is expected to vary across countries from  to  years. Long-range United Nations population projections predict that by , on average, people can expect to live more than  years, from  years (Liberia) up to  years (Japan).

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz (, ) and Preston et al. ().

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. Geneva


World Population Prospects: The  Revision. Highlights. United Nations, New York
World Population to  (). United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

μ(x) = −(d/dx) ln{1 − F(x)} = f(x)/{1 − F(x)}.

Heuristically it is interpreted by the relation μ(x)dx = P{x < X < x + dx | X > x}. Conversely, F(x) = 1 − exp{−∫₀ˣ μ(s) ds}. The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 1. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of ℓ(0) newborn babies. The survival probability is

ₜp_x = P{X > x + t | X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ μ(x + s) ds},

and the non-survival probability is (the converse) ₜq_x = 1 − ₜp_x. For t = 1 one writes q_x = ₁q_x and p_x = ₁p_x. In particular we get ℓ(x + 1) = ℓ(x)p_x. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of p_x for all relevant x.

The life expectancy is e°₀ = EX = ∫₀^∞ ℓ(x) dx / ℓ(0). (The subscript 0 in e°₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript o indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°_x = E(X − x | X > x) = ∫₀^∞ ℓ(x + t) dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function μ(x) is piecewise constant, which means that we take it to equal some constant, say μ_j, over each interval (x_j, x_{j+1}), for some partition {x_j} of the age axis. For a collection {X_i} of independent observations of X, let D_j be the number of the X_i that fall in the interval (x_j, x_{j+1}). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, D_j is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for X_i − x_j time units during the interval. Individuals who survive the interval will have lived for x_{j+1} − x_j time units in the interval, and individuals who do not survive to age x_j will not have lived during this interval at all. When we aggregate the time units lived in (x_j, x_{j+1}) over all individuals, we get a total R_j, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters μ_j, the collection {D_j, R_j} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = Σ_j {−μ_j R_j + D_j ln μ_j}, which is easily seen to be maximized by μ̂_j = D_j/R_j. The latter fraction is therefore the maximum-likelihood estimator for μ_j. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function μ_x = a + bc^x, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉ^x for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators; for an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by μ̂_x in this interval, then p̂_x = e^{−μ̂_x} is an estimator of the single-year survival probability p_x. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂_x for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
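A minimal sketch of the whole occurrence–exposure scheme just described (the deaths D_x and exposures R_x are made-up numbers): it computes μ̂_x = D_x/R_x, then p̂_x = e^{−μ̂_x}, then the recursion ℓ̂(x + 1) = ℓ̂(x)p̂_x.

```python
import numpy as np

# Made-up occurrences (deaths D_x) and exposures (person-years R_x)
# for single-year age intervals x = 0, 1, ..., 4.
D = np.array([30.0, 5.0, 3.0, 4.0, 6.0])
R = np.array([9_900.0, 9_800.0, 9_750.0, 9_700.0, 9_650.0])

mu_hat = D / R                 # maximum-likelihood estimates mu_hat_x = D_x / R_x
p_hat = np.exp(-mu_hat)        # single-year survival probabilities

# Survivor function by the recursion l(x + 1) = l(x) * p_hat_x, with l(0) = 1.
l = np.concatenate(([1.0], np.cumprod(p_hat)))
print(np.round(l, 5))
```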

a group of mutually independent individuals who have allbeen observed in parallel essentially a cohort that is fol-lowed from a signicant common starting point (namelyfrom birth in our mortality example) and which is dimin-ished over time due to decrements (attrition) caused bythe risk in question and also subject to reduction due tocensoring (withdrawals)e corresponding table is thencalled a cohort life table It is more common however toestimate a px schedule from data collected for the mem-bers of a population during a limited time period and touse the mechanics of life-table construction to produce aperiod life table from the px valuesLife table techniques are described in detail in most

introductory textbooks in actuarial statistics7biostatistics7demography and epidemiology See eg Chiang ()Elandt-Johnson and Johnson () Manton and Stallard() Preston et al () For the history of thetopic consult Seal () Smith and Keytz () andDupacircquier ()

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J –
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika (): –
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of R^k, and the variable y is assumed to take values in a subset of R^n; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ),  (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist: see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ, for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y},


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success, θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(y_i − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y₁, …, y_n, and it is not difficult to show that this likelihood function only depends on the sample through the sample mean ȳ = n⁻¹ Σ y_i and sample variance s² = (n − 1)⁻¹ Σ(y_i − ȳ)², or equivalently through Σ y_i and Σ y_i². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
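To make the binomial example concrete, here is a short sketch (with made-up data, y = 7 successes in n = 10 trials) that evaluates the likelihood over a grid, standardizes it by its maximum, and reads off the maximizing value θ̂ = y/n together with a plausibility region of the kind discussed in the next section.

```python
import numpy as np

n, y = 10, 7                      # made-up data: 7 successes in 10 trials
theta = np.linspace(0.001, 0.999, 999)

# Likelihood (constants free of theta dropped) and its standardized version.
L = theta**y * (1 - theta)**(n - y)
L_rel = L / L.max()               # maximized at theta_hat = y/n

theta_hat = y / n
# "Plausible" values by a rule of the form L(theta)/L(theta_hat) > k; the
# cutoff k = 0.15 is an arbitrary illustrative choice.
plausible = theta[L_rel > 0.15]
print(theta_hat, plausible.min(), plausible.max())
```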

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of θ that is of interest: see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ | y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure, and π(θ | y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed  successes in  trials, and assuming π(θ) = 1, the posterior probability that θ >  is ," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior, through the likelihood, to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y₁, …, Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult: see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ)(θ̂ − θ) →d χ²_k,  (2)
U(θ)^T i^{−1}(θ) U(θ) →d χ²_k,  (3)
2{ℓ(θ̂) − ℓ(θ)} →d χ²_k,  (4)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.
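A small Monte Carlo sketch of result (4) in a one-parameter Bernoulli model (the sample size, true parameter, and replication count are arbitrary choices): the statistic 2{ℓ(θ̂) − ℓ(θ)} should behave approximately like χ²₁.

```python
import numpy as np

rng = np.random.default_rng(2)

n, theta0, reps = 100, 0.3, 20_000
y = rng.binomial(n, theta0, size=reps)        # number of successes per sample
theta_hat = y / n

def loglik(theta, y):
    return y * np.log(theta) + (n - y) * np.log(1 - theta)

# Clip theta_hat away from 0 and 1, where the statistic is defined by continuity.
th = np.clip(theta_hat, 1e-12, 1 - 1e-12)
W = 2 * (loglik(th, y) - loglik(theta0, y))   # the statistic in (4)

print(W.mean())                # approx 1, the chi^2_1 mean
print((W > 3.84).mean())       # approx 0.05, the chi^2_1 upper-5% tail
```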

These results are all versions of a more general result: that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²_k distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (which will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function: in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²_{k₁}.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters: in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
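The contrast between the two divisors can be seen in a few lines; the design, coefficients, and noise level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear model y = X beta + error, contrasting the profile-likelihood
# variance estimate (divisor n) with the usual unbiased one (divisor n - p).
n, p, sigma = 30, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.arange(1.0, p + 1.0)
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

sigma2_mle = resid @ resid / n         # profile (maximum) likelihood estimator
sigma2_unb = resid @ resid / (n - p)   # divisor n - p, usually preferred
print(sigma2_mle, sigma2_unb)
```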

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

π_M(ψ | y) ∝ ∫ π(ψ, λ | y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ | y₁; λ), or
f(y; θ) = f₁(y₁ | y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During (–) she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada () and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–), and Metrika (–).

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html. Accessed Aug
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, … is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX exists, then n^{−1}S_n → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
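Both approximations are easy to watch numerically; the sketch below uses i.i.d. Exponential(1) summands (an arbitrary choice, for which a = σ² = 1).

```python
import numpy as np

rng = np.random.default_rng(4)

# LLN and CLT for i.i.d. Exponential(1) summands, where a = EX = 1 and
# sigma^2 = 1, so zeta_n = (S_n - n) / sqrt(n).
n, reps = 500, 20_000
S_n = rng.exponential(1.0, size=(reps, n)).sum(axis=1)

print(np.abs(S_n / n - 1.0).mean())   # LLN: n^{-1} S_n concentrates near a = 1

zeta = (S_n - n) / np.sqrt(n)
print((zeta < 1.0).mean())            # CLT: close to Phi(1) ≈ 0.8413
```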

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}), P^{−1} being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j, under which the LLN and CLT still hold true, have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN, and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of a large enough size n.
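A short sketch of the Glivenko–Cantelli phenomenon for F = Exponential(1) (an arbitrary choice with a closed-form c.d.f.): the sup-distance between F*_n and F shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)

# sup_u |F*_n(u) - F(u)| for F(u) = 1 - exp(-u), u >= 0, at increasing n.
for n in (100, 10_000, 1_000_000):
    x = np.sort(rng.exponential(1.0, size=n))
    F = 1.0 - np.exp(-x)                  # true c.d.f. at the order statistics
    Fn_right = np.arange(1, n + 1) / n    # empirical c.d.f. just after each point
    Fn_left = np.arange(0, n) / n         # ... and just before
    sup_dist = np.max(np.maximum(np.abs(Fn_right - F), np.abs(Fn_left - F)))
    print(n, sup_dist)
```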

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = n^{−1} Σ_{j=1}^n X_j → a almost surely,
(σ²)* = n^{−1} Σ_{j=1}^n (X_j − a*)² = n^{−1} Σ_{j=1}^n X_j² − (a*)² → σ² almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question.

At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
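A rough simulation sketch of this limit (the drift, variance, replication count, and the finite horizon standing in for the supremum over all k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Transient phenomena: for small theta > 0 and EX = 0,
# S = sup_k (S_k - theta*k) satisfies P(theta*S > x) ≈ exp(-2x/sigma^2).
theta, sigma, x = 0.05, 1.0, 1.0
reps, horizon = 2_000, 20_000   # horizon long enough for the sup to be attained

hits = 0
for _ in range(reps):
    # increments of S_k - theta*k are X_k - theta with X_k ~ N(0, sigma^2)
    path = np.cumsum(rng.normal(-theta, sigma, size=horizon))
    hits += theta * max(0.0, path.max()) > x

print(hits / reps, np.exp(-2 * x / sigma**2))   # simulation vs. the limit
```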

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ > 0, then, in the above example, one has

P(S > x) ∼ c e^{−μx}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x.

For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov has been Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), since , and Head of the Probability and Statistics Chair at the Novosibirsk University since . He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences (), and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i | b_i ∼ N(X_i β + Z_i b_i, Σ_i),  (1)
b_i ∼ N(0, D),  (2)

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i),  V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{−1}(α)(y_i − X_i β). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i^{−1} X_i β̂ + (I_{n_i} − Σ_i V_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i^{−1} and I_{n_i} − Σ_i V_i^{−1}, respectively. Note that the "numerator" of Σ_i V_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
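Shrinkage is easy to exhibit in the simplest random-intercept case, where the EB estimate has a closed form; in the sketch below all variance components and sample sizes are illustrative, and the variance components are treated as known rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random-intercept model (illustrative values):
# y_ij = beta0 + b_i + e_ij, b_i ~ N(0, d), e_ij ~ N(0, sigma2), j = 1..n_i.
beta0, d, sigma2, n_i, m = 5.0, 4.0, 1.0, 6, 1_000

b = rng.normal(0.0, np.sqrt(d), size=m)
y = beta0 + b[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(m, n_i))

# With known variance components, the EB estimate of b_i reduces to
# b_hat_i = (d / (d + sigma2/n_i)) * (ybar_i - beta0): a shrunken residual mean.
shrink = d / (d + sigma2 / n_i)
b_hat = shrink * (y.mean(axis=1) - beta0)

print(b_hat.var(), b.var())   # the EB estimates vary less than the true b_i
```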

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies, embedded in white noise background: y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε, for β = (β₀, β₁, β₂), x = (1, x₁, x₂), x₁ ≡ x₁(t) = sin[ω₁t], x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y | x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 if i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ω₁t_i], sin[ω₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].
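Under the classical assumptions just stated, this model can be fit by ordinary least squares; in the sketch below the frequencies, coefficients, and noise level are illustrative choices (the entry leaves them unspecified).

```python
import numpy as np

rng = np.random.default_rng(8)

# Least-squares fit of the sine-superposition model described above.
w1, w2 = 1.0, 3.0                        # illustrative known frequencies
beta_true = np.array([0.5, 2.0, -1.0])   # illustrative unknown amplitudes
t = np.linspace(0.0, 10.0, 200)

# Design matrix x with columns 1, sin(w1*t), sin(w2*t); then y = x beta + eps.
x = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
y = x @ beta_true + rng.normal(scale=0.3, size=t.size)

beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_hat)                          # should be close to beta_true
```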

Terminology
"Independent variables," as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators, without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following its discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of the weight of parental sweet pea seed(s), y = standard score of the seed weights of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew, the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by  he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we now have an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.
If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2004), Freedman (1991), Freedman (2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ, i ≤ n, (1)

in which the random errors ε are assumed to satisfy

Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n.

The interpretation is that response yᵢ is observed for the ith sample in conjunction with numerical values (xᵢ₁, …, xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂ML = β̂LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε, (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂LS)ₖ = βₖ, k ≤ p, (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂LS)ₖ = βₖ + Mₖε and Mₖ(ε + c) = Mₖε; thus ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂LS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂LS − β), conditional on the order statistics of ε (see LePage and Podgorski). Freedman and Lane (1983) advocated tests based on permutation bootstrap of residuals as a descriptive method.
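The purely algebraic identities of this section are easy to verify numerically. A minimal sketch (any full-rank design with a leading column of ones will do; the one below is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Any full-rank design with first column identically one.
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
M = np.linalg.inv(x.T @ x) @ x.T  # the least squares operator

b = np.array([1.0, -2.0, 0.5])
e = rng.normal(size=n)

# M(xb + e) = b + Me, whether or not the model is "correct".
assert np.allclose(M @ (x @ b + e), b + M @ e)

# Rows M_k, k > 1, sum to zero (contrasts), so centering e is immaterial there.
assert np.allclose(M[1:].sum(axis=1), 0.0)
```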

Generalized Least Squares
If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂ML = (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹y = β + (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis, and their consequent distribution theory.
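A minimal sketch of the generalized least squares solution (5) above, assuming a hypothetical covariance matrix Σ known up to a constant multiple (here an AR(1)-type correlation structure, invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Hypothetical covariance, known up to a constant multiple.
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
eps = np.linalg.cholesky(Sigma) @ rng.normal(size=n)  # correlated errors
y = x @ beta + eps

# Generalized least squares: (x' S^-1 x)^-1 x' S^-1 y, as in (5).
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)
print(beta_gls)  # should be close to beta
```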

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = ∑ᵢ βᵢwᵢ(t), t ∈ T,

where the wᵢ(⋅) are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Jointly Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + (x − Ex)β for every x,

and the discrepancies y − E(y ∣ x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (1981) compares the analysis of two superficially similar but differing models:
Errors model: Model (1) above.
Sampling model: Data (xᵢ₁, …, xᵢₚ, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model the εᵢ are simply the residual discrepancies y − xβ̂LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability, without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).
Freedman (1981) established the applicability of Efron's Bootstrap (Efron 1979) to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models and obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called overfitting. Models overfit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have overfit the data, possibly spoiling discovery of the principle of regression to mediocrity.
A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear; going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.
One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + c EJ = E X1(X > c),

where the middle equality uses identical distributions of X and Y, as does EJ = 0. Dividing by P(X > c) = P(Y > c) yields E(Y ∣ X > c) < E(X ∣ X > c). ◻

Cross References
▸ Adaptive Linear Regression
▸ Analysis of Areal and Spatial Interaction Data
▸ Business Forecasting Methods
▸ Gauss–Markov Theorem
▸ General Linear Models
▸ Heteroscedasticity
▸ Least Squares
▸ Optimum Experimental Design
▸ Partial Least Squares Regression Versus Other Methods
▸ Regression Diagnostics
▸ Regression Models with Increasing Numbers of Unknown Parameters
▸ Ridge and Surrogate Ridge Regressions
▸ Simple Linear Regression
▸ Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat
Freedman D (1981) Bootstrapping regression models. Ann Stat


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F (1877) Typical laws of heredity. Nature
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean more universal than regression toward the mean. Am Stat
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xₙ) is a sample from a stochastic process x = {x₁, x₂, …}. Let pₙ(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n); θₙ)/pₙ(x(n); θ)], where θₙ = θ + n^(−1/2)h and h is a (k × 1) vector. The joint density pₙ(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^(−1/2) hᵗ Sₙ(θ) − (1/2) hᵗ [n⁻¹ Jₙ(θ)] h + oₚ(1), (1)

where Sₙ(θ) = d ln pₙ(x(n); θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n); θ)/dθ dθᵗ, and

(i) n^(−1/2) Sₙ(θ) →d Nₖ(0, F(θ)), (ii) n⁻¹ Jₙ(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang (1990) for a review of the LAN family.
For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal, and efficient estimator of θ, with

√n (θ̂ₙ − θ) →d Nₖ(0, F⁻¹(θ)). (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family.
If the limiting Fisher information matrix F(θ) is non-degenerate and random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.
For the LAMN family one can replace the norm √n by a random norm Jₙ^(1/2)(θ) to get the limiting normal distribution, viz.,

Jₙ^(1/2)(θ) (θ̂ₙ − θ) →d N(0, I), (4)

where I is the identity matrix.
Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals.

Suppose that, conditionally on V = v, (x₁, x₂, …, xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n); θ) ∝ [1 + (1/2) ∑ᵢ₌₁ⁿ (xᵢ − θ)²]^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n (x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process.

Consider a first-order autoregressive process {xₜ} defined by xₜ = θxₜ₋₁ + eₜ, t = 1, 2, …, with x₀ = 0, where the eₜ are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n); θ) ∝ exp[−(1/2) ∑ₜ₌₁ⁿ (xₜ − θxₜ₋₁)²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is θ̂ₙ = (∑ₜ₌₁ⁿ xₜxₜ₋₁)(∑ₜ₌₁ⁿ xₜ₋₁²)⁻¹. One can verify that √n (θ̂ₙ − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹ θⁿ (θ̂ₙ − θ) →d Cauchy for ∣θ∣ > 1.
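A small simulation sketch of the stationary case of Example 2 (parameter values are illustrative; errors are iid N(0, 1)), checking the limit N(0, 1 − θ²):

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1_mle(theta, n, reps=1000):
    """Simulate the ML estimator of an AR(1) coefficient with N(0,1) errors."""
    est = np.empty(reps)
    for r in range(reps):
        e = rng.normal(size=n)
        x = np.zeros(n + 1)
        for t in range(1, n + 1):
            x[t] = theta * x[t - 1] + e[t - 1]
        est[r] = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    return est

# Stationary case: sqrt(n)(theta_hat - theta) is approximately N(0, 1 - theta^2).
theta, n = 0.5, 200
z = np.sqrt(n) * (ar1_mle(theta, n) - theta)
print(z.std(), np.sqrt(1 - theta**2))  # the two should be close
```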

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored many publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▸ Asymptotic Normality
▸ Optimal Statistical Inference in Financial Engineering
▸ Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (1990) Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family; (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

fX(x ∣ a, b) = dFX(x ∣ a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

fX(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μY and standard deviation σY, then the mean of X is E(X) = a + b μY, and the standard deviation of X is √Var(X) = b σY.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distributions of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (1975, 1976), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as an abscissa value or a difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data; the estimates of a and b will then be the parameters of this line.

The latter approach takes the order statistics Xi:n, X1:n ≤ X2:n ≤ ⋯ ≤ Xn:n, as regressand and the means of the reduced order statistics, αi:n = E(Yi:n), as regressor, which under these circumstances act as plotting positions. The regression model reads

Xi:n = a + b αi:n + εi,

where εi is a random variable expressing the difference between Xi:n and its mean, E(Xi:n) = a + b αi:n. As the order statistics Xi:n and, as a consequence, the disturbance terms εi are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (1952), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X1:n, X2:n, …, Xn:n)′,  1 = (1, 1, …, 1)′,  α = (α1:n, α2:n, …, αn:n)′,
ε = (ε1, ε2, …, εn)′,  θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹ A′B⁻¹ x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yi:n) and E(Yi:n Yj:n). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (2010). Maximum likelihood estimation for location–scale distributions is treated by Mi.
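As an illustration, the reduced exponential distribution is one of the closed-form cases: by the Rényi representation, αi:n = ∑ₖ₌₁ⁱ 1/(n − k + 1) and Cov(Yi:n, Yj:n) = ∑ₖ₌₁^min(i,j) 1/(n − k + 1)². A minimal sketch of the resulting GLS (best linear unbiased) fit follows; the values of a and b are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20

# Reduced exponential: closed forms for alpha = E(Y_{i:n}) and B = Cov matrix.
w = 1.0 / (n - np.arange(n))      # 1/n, 1/(n-1), ..., 1/1
alpha = np.cumsum(w)              # E(Y_{i:n})
v = np.cumsum(w ** 2)             # Var(Y_{i:n})
B = np.minimum.outer(v, v)        # Cov(Y_{i:n}, Y_{j:n}) = v_{min(i,j)}

# Sample from a shifted, scaled exponential (hypothetical a, b) and sort.
a, b = 2.0, 3.0
x = np.sort(a + b * rng.exponential(size=n))

A = np.column_stack([np.ones(n), alpha])
Bi = np.linalg.inv(B)
theta = np.linalg.solve(A.T @ Bi @ A, A.T @ Bi @ x)  # BLUE of (a, b)
print(theta)  # should be close to (a, b)
```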

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and then joined Volkswagen AG to do operations research. He was later appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute, and he is Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
▸ Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H (2010) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution, with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi),
logit(Pi) ∣ Z = ηi + σZ = β₀ + β₁xi1 + ⋯ + βpxip + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model, with the inclusion of a single normally distributed random effect in the linear predictor: an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[Ri] = mi πi and
Var(Ri) = mi πi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, the random effect Z allowing for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
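A small simulation sketch of this overdispersion (the values of m, η, and σ below are hypothetical), comparing the simulated variance of Ri with Williams' approximation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical values: one group size and linear predictor, small sigma.
m, eta, sigma, reps = 40, 0.3, 0.25, 200_000

z = rng.normal(size=reps)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))  # logistic-normal P_i
r = rng.binomial(m, p)                         # R_i | P_i ~ Binomial(m, P_i)

pi = 1.0 / (1.0 + np.exp(-eta))
approx_var = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx_var)  # simulated variance vs. Williams' approximation
```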

About the Author
For biography, see the entry Logistic Distribution.

Cross References
▸ Logistic Distribution
▸ Mixed Membership Models
▸ Multivariate Normal Distributions
▸ Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika


Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yi; πi) = πi^yi (1 − πi)^(1−yi). (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.
In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{[yi θi − b(θi)]/α(ϕ) + c(yi; ϕ)}, (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors; the scale, for binary and count models, is constrained to a value of 1; and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.
We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}. (3)

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π); these two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.
Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μi; yi) = ∑ᵢ₌₁ⁿ {yi ln(μi/(1 − μi)) + ln(1 − μi)}, (4)

or

L(μi; yi) = ∑ᵢ₌₁ⁿ {yi ln(μi) + (1 − yi) ln(1 − μi)}. (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model the deviance is expressed as

D = 2 ∑ᵢ₌₁ⁿ {yi ln(yi/μi) + (1 − yi) ln[(1 − yi)/(1 − μi)]}. (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit ŷ is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore,

x′iβ = ln(μi/(1 − μi)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ. (7)

The value of μi for each observation in the logistic model is calculated as

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ)). (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value; for the logistic model, μ is a probability.
When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived.

The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑ᵢ₌₁ⁿ (yi − μi)xi, (9)

∂²L(β)/∂β∂β′ = −∑ᵢ₌₁ⁿ xi x′i μi(1 − μi). (10)
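A minimal sketch of Newton–Raphson estimation built directly on the gradient (9) and Hessian (10); the data are simulated for illustration, and this is not any particular package's implementation:

```python
import numpy as np

def logistic_newton(x, y, iters=25, tol=1e-10):
    """Newton-Raphson for logistic regression using the gradient (9)
    and Hessian (10); x includes a leading column of ones."""
    beta = np.zeros(x.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-x @ beta))
        grad = x.T @ (y - mu)                          # equation (9)
        hess = -(x * (mu * (1 - mu))[:, None]).T @ x   # equation (10)
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse of the negative Hessian.
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    return beta, se

# Hypothetical data for illustration.
rng = np.random.default_rng(6)
x = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ np.array([-0.5, 1.0])))))
print(logistic_newton(x, y)[0])  # should be close to (-0.5, 1.0)
```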

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.
We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived    child    adults    Total
no             52      765       817
yes            57      442       499
Total         109     1207      1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.      z      P>|z|    [95% Conf. Interval]
age            0.5272       0.1059    −3.19     0.001      0.3557    0.7817

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

▸ The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had (∼1.9) some 90% – or nearly two times – greater odds of surviving than did adults.
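A short sketch reproducing the odds-ratio arithmetic from the tabulated counts above, together with a large-sample confidence interval based on the standard error of the log odds ratio:

```python
import numpy as np

# 2 x 2 table from the text: rows = survived (no, yes), cols = (child, adult).
no_child, no_adult = 52, 765
yes_child, yes_adult = 57, 442

odds_adult = yes_adult / no_adult
odds_child = yes_child / no_child
or_age = odds_adult / odds_child
print(or_age)  # about 0.53

# Large-sample 95% CI via the standard error of the log odds ratio.
se_log = np.sqrt(1/no_child + 1/yes_child + 1/no_adult + 1/yes_adult)
lo, hi = np.exp(np.log(or_age) + np.array([-1.96, 1.96]) * se_log)
print(lo, hi)
```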

Logistic Regression L

L

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

▸ The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.
Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi − yi), (11)

with a corresponding log-likelihood function expressed as

L(μi; yi, mi) = ∑ᵢ₌₁ⁿ {yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln(mi choose yi)}. (12)

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = mi πi and variance μi(1 − μi/mi).
Consider the data below:

y    cases    x1    x2    x3
1      3       1     0     1
⋮      ⋮       ⋮     ⋮     ⋮

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of x1 = 1, x2 = 0, and x3 = 1. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1       –              –           –      –              –
x2       –              –           –      –              –
x3       –              –           –      –              –

The data in the above table may be restructured so that it is in individual-observation format rather than grouped. The new table would have ten observations, having the same logic as described, and modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, thousands of observations being grouped into a far smaller number of covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.
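A minimal sketch (hypothetical grouped data, ten observations in all) showing that the grouped binomial fit and the expanded individual-format fit return identical coefficient estimates; statsmodels is used here purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: y successes out of `cases` for each
# covariate pattern (x1, x2); ten individual observations in all.
y = np.array([1, 2, 1, 1])
cases = np.array([3, 3, 2, 2])
X = sm.add_constant(np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float))

# Grouped (binomial) logistic fit: response holds (successes, failures).
grouped = sm.GLM(np.column_stack([y, cases - y]), X,
                 family=sm.families.Binomial()).fit()

# The same data expanded to individual (Bernoulli) format.
Xi = np.repeat(X, cases, axis=0)
yi = np.concatenate([[1] * s + [0] * (c - s) for s, c in zip(y, cases)])
individual = sm.GLM(yi, Xi, family=sm.families.Binomial()).fit()

print(grouped.params)
print(individual.params)  # identical coefficient estimates
```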

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now employed only with caution: the test is heavily influenced by the manner in which tied data are classified. Rather than comparing observed with expected probabilities across a single set of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.
Information criteria tests, e.g., the Akaike Information Criterion (AIC) (see Akaike's Information Criterion; and Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.
Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as of small and unbalanced datasets. In these cases logistic models estimated using GLM or full maximum likelihood will not converge; exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall/CRC), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▸ Case-Control Studies
▸ Categorical Data Analysis
▸ Generalized Linear Models
▸ Multivariate Data Analysis: An Overview
▸ Probit Analysis
▸ Recursive Partitioning
▸ Regression Models with Symmetrical Errors
▸ Robust Regression Estimation in Generalized Linear Models
▸ Statistics: An Overview
▸ Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]², −∞ < x < ∞, (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹, −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹, −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α], t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution. Hazards of this form may be appropriate in the analysis of data such as heart transplant survival: there may be an initial period of increasing hazard, associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x, (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response P(d) at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This carries across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of generalized linear models. A comprehensive treatment of logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▸ Asymptotic Relative Efficiency in Estimation
▸ Asymptotic Relative Efficiency in Testing
▸ Bivariate Distributions
▸ Location-Scale Distributions
▸ Logistic Regression
▸ Multivariate Statistical Distributions
▸ Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and PropertiesIt is a general rule that income distributions are skewedAlthough various distributionmodels such as the Lognor-mal and the Pareto have been proposed they are usuallyapplied in specic situations For general studies morewide-ranging tools have to be applied the rst and mostcommon tool of which is the Lorenz curve Lorenz ()developed it in order to analyze the distribution of incomeand wealth within populations describing it in the follow-ing way

"Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the following general rule:

"A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant."

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.
Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912).



Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L1(p) has greater inequality than L2(p)

This index is the ratio of the area between the diagonal and the Lorenz curve to the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
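To make the definitions concrete, here is a minimal Python sketch (the income vector is illustrative) that computes the empirical Lorenz curve points L(p) and the Gini index as one minus twice the area under the curve.

    import numpy as np

    def lorenz_gini(income):
        # Empirical Lorenz curve points and Gini index from a sample.
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        p = np.arange(1, n + 1) / n        # cumulative population shares
        L = np.cumsum(x) / x.sum()         # cumulative income shares
        # Area under the Lorenz curve by the trapezoidal rule, from (0, 0).
        area = np.trapz(np.concatenate(([0.0], L)),
                        np.concatenate(([0.0], p)))
        return p, L, 1.0 - 2.0 * area

    income = [12, 15, 20, 28, 33, 47, 60, 85, 120, 180]   # illustrative data
    p, L, G = lorenz_gini(income)
    print("Gini index:", round(G, 3))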

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem. Let u(x) be a continuous monotone increasing function and assume that µY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level µY/µX once from above.

If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.
A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
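A quick numerical check of part (I) of the Theorem: with the hypothetical progressive rule u(x) = x**0.8 (so that u(x)/x = x**(-0.2) is monotone decreasing), the Gini index cannot increase. A minimal self-contained sketch:

    import numpy as np

    def gini(x):
        # Sample Gini index via the mean-difference formula.
        x = np.sort(np.asarray(x, dtype=float))
        n = x.size
        return np.sum((2 * np.arange(1, n + 1) - n - 1) * x) / (n * x.sum())

    income = np.array([12, 15, 20, 28, 33, 47, 60, 85, 120, 180])  # illustrative
    post = income ** 0.8                 # hypothetical progressive transfer rule
    print(gini(income), gini(post))      # the second value is smaller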

About the Author
Dr Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight First Class of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single-crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.
Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is then zero; otherwise the value gives the loss suffered by a decision unequal to the truth. The larger the value, the larger the loss suffered.
To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.
Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative: Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣², θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses:

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian (1975) introduced LinEx losses:

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
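The losses discussed above are easy to state in code; the following minimal sketch (γ, a, b are the fixed constants of the text, given illustrative default values) may help fix ideas.

    import numpy as np

    def squared(t):
        return t ** 2                    # classical squared loss, p = 2

    def l1(t):
        return np.abs(t)                 # robust L1-loss, p = 1

    def huber(t, gamma=1.0):
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= gamma, t ** 2,
                        2 * gamma * np.abs(t) - gamma ** 2)

    def linex(t, a=1.0, b=1.0):
        # Asymmetric: exponential on one side, roughly linear on the other.
        return b * (np.exp(a * t) - a * t - 1)

    t = np.linspace(-2.0, 2.0, 9)
    print(huber(t), linex(t))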



For other estimation problems, corresponding losses are used. For instance, consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
In theoretical works the assumed properties of loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().
In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes Σᵢ ℓ(ri(f)) over f ∈ F, where ri(f) = yi − f(ti) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, least squares estimation is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam



The problem is worse with introductory probability courses, where games of chance are extensively used as illustrative examples in textbooks. Most of our students have never seen a deck of playing cards, and some may even be offended by discussing card games in a classroom.

The burden of finding strategies to overcome these difficulties falls on the instructor. Statistical terms and concepts such as parameter/statistic, sampling distribution, unbiasedness, consistency, sufficiency, and the ideas underlying hypothesis testing are not easy to get across even in the students' own language. To do that in a foreign language is a real challenge. For the Statistics program to be successful, all (or at least most of) the instructors involved should be up to this challenge. This is a time-consuming task with little reward other than self-satisfaction. In SQU the problem is compounded further by the fact that most of the instructors are expatriates on short-term contracts, who are more likely to use their time for personal career advancement rather than time-consuming community service jobs.

What Can Be Done
For our first Statistics course we produced a manual that contains very brief notes and many samples of previous quizzes, tests, and examinations. It contains a good collection of problems from the local culture to motivate the students. The manual was well received by the students, to the extent that students prefer to practice with examples from the manual rather than the textbook.
Textbooks written in English that are brief and to the point are needed. These should include examples and exercises from the students' own culture. A student trying to understand a specific point gets distracted by lengthy explanations and discouraged by thick textbooks to begin with. In a classroom where students' faces clearly indicate that you have got nothing across, it is natural to try explaining more, using more examples. In an end-of-semester evaluation of a course I taught, a student once wrote: "The instructor explains things more than needed. He makes simple points difficult." This indicates that, when teaching in a foreign language, lengthy oral or written explanations are not helpful. A better strategy is to explain concepts and techniques briefly and provide plenty of examples and exercises that will help the students absorb the material by osmosis. The basic statistical concepts can only be effectively communicated to students in their own language. For this reason, textbooks should contain a good glossary where technical terms and concepts are explained using the local language.

I expect such textbooks to go a long way toward enhancing students' understanding of Statistics. An international project could be initiated to produce an introductory statistics textbook with different versions intended for different geographical areas: the English material would be the same, the examples would vary to some extent from area to area, and there would be glossaries in local languages. Universities in the developing world naturally look at western universities as models, and international (western) involvement in such a project is needed for it to succeed. The project would be a major contribution to the promotion of understanding Statistics and excellence in statistical education in developing countries. The International Statistical Institute takes pride in supporting statistical progress in the developing world; this project can lay the foundation for this progress and hence is worth serious consideration by the institute.

Cross References
7Online Statistics Education
7Promoting, Fostering and Development of Statistics in Developing Countries
7Selection of Appropriate Statistical Methods in Developing Countries
7Statistical Literacy, Reasoning, and Thinking
7Statistics Education

References and Further Reading
Coutis P, Wood L () Teaching statistics and academic language in culturally diverse classrooms. httpwwwmathuocgr~ictmProceedingspappdf
Hubbard R () Teaching statistics to students who are learning in a foreign language. ICOTS
Koh E () Teaching statistics to students with limited language skills using MINITAB. httparchivesmathutkeduICTCMVOLCpaperpdf

Least Absolute Residuals Procedure

Richard William Farebrother
Honorary Reader in Econometrics
Victoria University of Manchester, Manchester, UK

For i = 1, …, n let xi1, xi2, …, xiq, yi represent the ith observation on a set of q + 1 variables, and suppose that we wish to fit a linear model of the form

yi = xi1β1 + xi2β2 + ⋯ + xiqβq + єi



to these n observations. Then, for p > 0, the Lp-norm fitting procedure chooses values for b1, b2, …, bq to minimize the Lp-norm of the residuals [Σⁿᵢ₌₁ ∣ei∣^p]^(1/p), where for i = 1, …, n the ith residual is defined by

ei = yi − xi1b1 − xi2b2 − ⋯ − xiqbq.

The most familiar Lp-norm fitting procedure, known as the 7least squares procedure, sets p = 2 and chooses values for b1, b2, …, bq to minimize the sum of the squared residuals Σⁿᵢ₌₁ ei².
A second choice, to be discussed in the present article, sets p = 1 and chooses b1, b2, …, bq to minimize the sum of the absolute residuals Σⁿᵢ₌₁ ∣ei∣. A third choice sets p = ∞ and chooses b1, b2, …, bq to minimize the largest absolute residual maxᵢ ∣ei∣.
Setting ui = ei and vi = 0 if ei ≥ 0, and ui = 0 and vi = −ei if ei < 0, we find that ei = ui − vi, so that the least absolute residuals (LAR) fitting problem chooses b1, b2, …, bq to minimize the sum of the absolute residuals

Σⁿᵢ₌₁ (ui + vi)

subject to

xi1b1 + xi2b2 + ⋯ + xiqbq + ui − vi = yi for i = 1, …, n

and ui ≥ 0, vi ≥ 0 for i = 1, …, n. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.
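The LP formulation can be handed directly to a general-purpose linear programming solver; the following is a minimal sketch using scipy.optimize.linprog with illustrative simulated data (this solver choice is an assumption of the example, not the dual simplex variant discussed in the text).

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, q = 30, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
    y = X @ np.array([1.0, 2.0]) + rng.laplace(size=n)

    # Variables: (b_1, ..., b_q, u_1, ..., u_n, v_1, ..., v_n).
    c = np.concatenate([np.zeros(q), np.ones(2 * n)])      # minimize sum(u_i + v_i)
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])           # X b + u - v = y
    bounds = [(None, None)] * q + [(0, None)] * (2 * n)    # b free; u, v >= 0

    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    print("LAR coefficients:", res.x[:q])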

Gauss noted (when q = 2) that solutions of this problem are characterized by the presence of a set of q zero residuals. Such solutions are robust to the presence of outlying observations; indeed, they remain constant under variations in the other n − q observations, provided that these variations do not cause any of the residuals to change their signs.

The LAR fitting procedure corresponds to the maximum likelihood estimator when the є-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the є-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed in 1760 by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711–1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich).

In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant, xi1 = 1, and the values of b1 and b2 are constrained to satisfy the adding-up condition Σⁿᵢ₌₁(yi − b1 − xi2b2) = 0 usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with q ≥ 3 variables are still to be found in the literature.
For an account of recent developments in this area, see the series of volumes edited by Dodge (1987, 1992, 1997, 2002). For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley, and Rhodes, see Farebrother (1999). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother (2002).

About the Author
Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement, and he was subsequently an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations; Fitting Linear Relationships: A History of the Calculus of Observations 1750–1900; and Visualizing Statistical Models and Concepts. He has also published numerous research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr Farebrother was also a Visiting Associate Professor at Monash University, Australia, and a Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
7Least Squares
7Residuals

References and Further Reading
Dodge Y (ed) (1987) Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1992) L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1997) L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) (2002) Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel


Farebrother RW (1999) Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW (2002) Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stępniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems F(x) ≈ b of equations, where b ∈ Rⁿ while F(x) is a column of the form

F(x) = [f1(x), f2(x), …, fn−1(x), fn(x)]ᵀ

with entries fi = fi(x), i = 1, …, n, where x = (x1, …, xp)ᵀ. The LS problem is linear when each fi is a linear function, and nonlinear if not.
The linear LS problem refers to a system Ax = b of linear equations. Such a system is overdetermined if n > p. If b ∉ range(A), the system has no proper solution and will be denoted by Ax ≈ b. In this situation we seek a solution of some optimization problem. The name "Least Squares" is justified by the l2-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

yi = fi(xi1, …, xip; β1, …, βp) + ei for i = 1, …, n, (1)

where xij, i = 1, …, n, j = 1, …, p, are some constants given by the experimental design; fi, i = 1, …, n, are given functions depending on the unknown parameters βj, j = 1, …, p; and yi, i = 1, …, n, are the values of these functions observed with some random errors ei. We want to estimate the unknown parameters βj on the basis of the data set {xij, yi}.

In linear regression each fi is a linear function of the type fi = Σᵖⱼ₌₁ cij(x1, …, xn)βj, and the model (1) may be presented in vector–matrix notation as

y = Xβ + e,

where y = (y1, …, yn)ᵀ, e = (e1, …, en)ᵀ, and β = (β1, …, βp)ᵀ, while X is an n × p matrix with entries xij. If e1, …, en are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to finding a vector β̂ that minimizes the norm ∣∣y − Xβ∣∣² = (y − Xβ)ᵀ(y − Xβ).
Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand, the latter reduces to solving the consistent system

XᵀXβ = Xᵀy

of linear equations, called the normal equations. In particular, if rank(X) = p, then the system has a unique solution of the form

β̂ = (XᵀX)⁻¹Xᵀy.

For linear regression yi = α + βxi + ei with one regressor x, the BLU estimators of the parameters α and β may be presented in the convenient form

β̂ = nsxy / ns²x and α̂ = ȳ − β̂x̄,

where

nsxy = Σᵢ xiyi − (Σᵢ xi)(Σᵢ yi)/n,  ns²x = Σᵢ xi² − (Σᵢ xi)²/n,  x̄ = (Σᵢ xi)/n,  ȳ = (Σᵢ yi)/n.

For its computation we only need a simple pocket calculator.

Example. A small data set gives the number of residents in thousands (x) and the unemployment rate in % (y) for some cities of Poland, and we are to estimate the parameters β and α. Computing Σᵢ xi, Σᵢ yi, Σᵢ xi², and Σᵢ xiyi yields ns²x and nsxy, hence β̂ and α̂, and thereby the fitted line f(x) = α̂ + β̂x.
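The pocket-calculator formulas translate directly into code; a minimal sketch with illustrative (x, y) values:

    import numpy as np

    x = np.array([680.0, 395.0, 210.0, 120.0, 95.0])  # residents, thousands (illustrative)
    y = np.array([5.2, 7.1, 9.8, 12.3, 13.0])         # unemployment rate, % (illustrative)
    n = x.size

    ns2x = np.sum(x ** 2) - np.sum(x) ** 2 / n
    nsxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    beta = nsxy / ns2x
    alpha = y.mean() - beta * x.mean()
    print("beta =", beta, " alpha =", alpha)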



If the variance–covariance matrix of the error vector e coincides (up to a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased Estimation reduces to the minimization of (y − Xβ)ᵀV⁻¹(y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form

β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known; instead, some algorithms are suggested.
Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem for the case when all data are affected by errors. Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

∣∣[A, b] − [Â, b̂]∣∣F

over all Â ∈ Rⁿ×ᵏ and b̂ ∈ range(Â), where the Frobenius norm is defined by ∣∣(aij)∣∣²F = Σᵢⱼ aij². Once a minimizing [Â, b̂] is found, any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique, as simple small matrix examples show.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction b̂ − Âx has smaller l2-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any n × k matrix A with real entries there exist orthonormal matrices P = [p1, …, pn] and Q = [q1, …, qk] such that

PᵀAQ = diag(σ1, …, σm), where σ1 ≥ ⋯ ≥ σm and m = min{n, k}.
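The SVD yields the classical construction of the TLS solution: take the right singular vector of the augmented matrix [A, b] belonging to its smallest singular value and rescale. A minimal sketch (A and b are illustrative):

    import numpy as np

    def tls(A, b):
        # TLS solution of Ax ~ b via the SVD of the augmented matrix [A, b].
        n, k = A.shape
        C = np.hstack([A, b.reshape(-1, 1)])
        _, _, Vt = np.linalg.svd(C)
        v = Vt[-1]                     # right singular vector for smallest sigma
        if abs(v[k]) < 1e-12:          # nongeneric case: no TLS solution
            raise ValueError("TLS solution does not exist")
        return -v[:k] / v[k]

    rng = np.random.default_rng(1)
    A = rng.normal(size=(20, 2))
    b = A @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=20)
    print(tls(A, b))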

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenburg H (1999) Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J (2005) Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe the financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of returns on assets are normally distributed, and thus the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model.


Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.
In this entry we review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {Xt, t ≥ 0} that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of Xt+s − Xt is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, Xt+s − Xt is independent of (Xu, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0, lim s→t P(∣Xt − Xs∣ > є) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.
If Φt(z) is the characteristic function of a Lévy process,

then its characteristic component φ(z) ≝ ln Φt(z)/t is of the form

φ(z) = iza − bz²/2 + ∫R [exp(izx) − 1 − izx I(∣x∣ < 1)] ν(dx),

where a ∈ R, b ∈ R₊, and ν is a measure on R satisfying ν({0}) = 0 and ∫R (1 ∧ x²) ν(dx) < ∞.
The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows. Given any a ∈ R, b ∈ R₊, and measure ν on R satisfying ν({0}) = 0 and ∫R (1 ∧ x²)ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square integrable martingale. The characteristic components of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z), and φ⁽³⁾(z), respectively) are as follows:

φ⁽¹⁾(z) = iza − bz²/2,
φ⁽²⁾(z) = ∫ over ∣x∣ ≥ 1 of (exp(izx) − 1) ν(dx),
φ⁽³⁾(z) = ∫ over ∣x∣ < 1 of (exp(izx) − 1 − izx) ν(dx).

Examples of the Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift µ and volatility rate σ² if a = µ, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift µ > 0 and volatility rate σ². For any x > 0, let Tx = inf{t : Xt > x}. Then {Tx} is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

υ(dx) = (2πσ²x³)^(−1/2) exp(−µ²x/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ xν(dx) = 0, and the Lévy measure is given by

ν(dx) = (α/x) exp(−x/β) dx, x > 0,

where α, β > 0. It follows that the mean term E(X1) and the variance term V(X1) of the process are equal to αβ and αβ², respectively.

The following is a simulated sample path of a gamma process (Fig. 1).

Lévy Processes. Fig. 1 Gamma process
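A sample path like the one in Fig. 1 can be simulated from independent gamma-distributed increments; a minimal sketch with illustrative parameter values:

    import numpy as np

    alpha, beta = 1.0, 1.0          # illustrative values of the parameters above
    T, nsteps = 10.0, 1000
    dt = T / nsteps
    rng = np.random.default_rng(2)

    # Increments over [t, t + dt] are Gamma(alpha * dt, scale=beta), so that
    # E(X_1) = alpha * beta and V(X_1) = alpha * beta**2, matching the text.
    increments = rng.gamma(shape=alpha * dt, scale=beta, size=nsteps)
    path = np.concatenate([[0.0], np.cumsum(increments)])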

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: µ – the drift term of the Brownian process, σ – the volatility of the Brownian process, and ν – the variance term of the gamma process.

The following are two simulated sample paths, one for a Brownian motion with a drift term µ and volatility term σ, and the other for a variance gamma process with the same values for the drift and volatility terms (Fig. 2).

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths
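The subordination just described gives a direct simulation recipe: replace time by gamma increments with mean dt and variance ν·dt. A minimal sketch with illustrative parameter values:

    import numpy as np

    mu, sigma, nu = 0.0, 0.25, 0.2   # illustrative drift, volatility, variance term
    T, nsteps = 10.0, 1000
    dt = T / nsteps
    rng = np.random.default_rng(3)

    dG = rng.gamma(shape=dt / nu, scale=nu, size=nsteps)  # gamma time change
    dX = mu * dG + sigma * np.sqrt(dG) * rng.standard_normal(nsteps)
    vg_path = np.concatenate([[0.0], np.cumsum(dX)])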

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Çınlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using P(M, λ, τ) policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ 81:637–654
Cont R, Tankov P (2004) Financial modelling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev 2:79–105
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety 94:2–21

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.
Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (ex) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

ex = Tx / lx,

where x is age, Tx is the number of person-years lived aged x and over, and lx is the number of survivors at age x according to the life table.


Life expectancy can be calculated for the sexes combined or separately for males and females; there can be significant differences between the sexes.
Life expectancy at birth (e0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e0 = T0 / l0,

where T0 is the total size of the population and l0 is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e0 < e1 < e2, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.
Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and of socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was below 30 years; between 1800 and 2000, life expectancy at birth rose from about 30 years to a global average of about 67 years, and to more than 75 years in the richest countries (Riley 2001). Furthermore, in most industrialized countries in the early twenty-first century, life expectancy averaged about 80 years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.
Life expectancy varies significantly across regions and continents, from the low life expectancies of some central African populations to 80 years and above in many European countries. The more developed regions have a markedly higher average life expectancy than the less developed regions; the two continents that display the most extreme differences in life expectancy are North America and Africa (UN data).
Countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and Japan (WHO). In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is far lower; some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.
In nearly all countries women live longer than men; worldwide, the gap between female and male life expectancy at birth is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes, whereas Tonga is unusual in that life expectancy for males exceeds that for females (WHO).
Life expectancy is assumed to rise continuously. According to UN estimates, global life expectancy at birth is likely to rise substantially by mid-century, though it will continue to vary widely across countries, and long-range United Nations population projections predict further gains, with the projected range running from the lowest value (Liberia) up to the highest (Japan).
For more details on the calculation of life expectancy, including continuous notation, see Keyfitz (1968, 1985) and Preston et al. (2001).

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N (1985) Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. Geneva



World Population Prospects: The Revision, Highlights. United Nations, New York
World Population to 2300 (2004) United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = dF(x)/dx, and define the force of mortality or death intensity at age x as the function

µ(x) = −(d/dx) ln[1 − F(x)] = f(x)/[1 − F(x)].

Heuristically it is interpreted by the relation µ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely, F(x) = 1 − exp{−∫₀ˣ µ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0)[1 − F(x)], normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is

tpx = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ µ(x + s)ds},

and the non-survival probability is (the converse) tqx = 1 − tpx. For t = 1 one writes qx = 1qx and px = 1px. In particular, we get ℓ(x + 1) = ℓ(x)px. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of px for all relevant x.

The life expectancy is e̊0 = EX = ∫₀^∞ ℓ(x)dx / ℓ(0). (The subscript 0 in e̊0 indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e̊x = E(X − x ∣ X > x) = ∫₀^∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function µ(x) is piecewise constant, which means that we take it to equal some constant, say µj, over each interval (xj, xj+1) for some partition {xj} of the age axis. For a collection {Xi} of independent observations of X, let Dj be the number of Xi that fall in the interval (xj, xj+1). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dj is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xi − xj time units during the interval. Individuals who survive the interval will have lived for xj+1 − xj time units in the interval, and individuals who do not survive to age xj will not have lived during this interval at all. When we aggregate the time units lived in (xj, xj+1) over all individuals, we get a total Rj, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters µj, the collection {Dj, Rj} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = Σj{−µjRj + Dj ln µj}, which is easily seen to be maximized by µ̂j = Dj/Rj. The latter fraction is therefore the maximum-likelihood estimator for µj. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function µx = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators; for an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂x in this interval, then p̂x = exp(−µ̂x) is an estimator of the single-year survival probability px. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ(x + 1) = ℓ(x)p̂x for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of the duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
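The estimation scheme above fits in a few lines of code; a minimal sketch building a single-year life table from illustrative occurrence and exposure data:

    import numpy as np

    D = np.array([8, 5, 3, 4, 6])                      # deaths per age interval (illustrative)
    R = np.array([990.0, 980.0, 975.0, 970.0, 962.0])  # person-years of exposure (illustrative)

    mu_hat = D / R                 # ML estimates of the force of mortality
    p_hat = np.exp(-mu_hat)        # single-year survival probabilities
    # Recursion l(x + 1) = l(x) * p_x from the radix l(0) = 100,000.
    l = 100_000 * np.concatenate([[1.0], np.cumprod(p_hat)])
    print(np.round(l).astype(int))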

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a px schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the px values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang (1984), Elandt-Johnson and Johnson (1980), Manton and Stallard (1984), and Preston et al. (2001). For the history of the topic, consult Seal (1977), Smith and Keyfitz (1977), and Dupâquier (1996).

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika 64(3):429–439
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of Rᵏ, and the variable y is assumed to take values in a subset of Rⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart (1998) and Murphy and Van der Vaart (2000). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y), y = 0, 1, …, n; θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^(n−y),



and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(yi − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y1, …, yn, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyi and sample variance s² = (n − 1)⁻¹Σ(yi − ȳ)², or equivalently through Σyi and Σyi². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
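The binomial likelihood above is simple to compute and maximize numerically; a minimal sketch with illustrative y and n, confirming θ̂ = y/n:

    import numpy as np

    y, n = 7, 20                               # illustrative data
    theta = np.linspace(0.001, 0.999, 999)
    loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)

    print(theta[np.argmax(loglik)], y / n)     # both are (about) 0.35
    rel = np.exp(loglik - loglik.max())        # likelihood standardized by its maximum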

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall (1997) and Edwards (1972), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫_Θ L(θ; y)π(θ)dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials and assuming the uniform prior π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

$U(\theta; Y) = \frac{\partial}{\partial \theta} \log L(\theta; Y).$

If Y = (Y₁, …, Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.
Assuming that θ̂ is consistent for θ and that L(θ; Y)

has sufficient regularity, the following asymptotic results can be established:

$(\hat\theta - \theta)^{T}\, i(\theta)\,(\hat\theta - \theta) \stackrel{d}{\to} \chi^2_k,$ (1)

$U(\theta)^{T}\, i^{-1}(\theta)\, U(\theta) \stackrel{d}{\to} \chi^2_k,$ (2)

$2\{\ell(\hat\theta) - \ell(\theta)\} \stackrel{d}{\to} \chi^2_k,$ (3)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.
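The χ²_k approximation in (3) is easy to check by simulation. A minimal sketch for Bernoulli sampling (k = 1; the true value θ = 0.4, the sample size, and the replication count are our own choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
theta0, n, reps = 0.4, 200, 5000   # true value, sample size, replications

def loglik(theta, y, n):
    return y * np.log(theta) + (n - y) * np.log(1 - theta)

w = np.empty(reps)
for r in range(reps):
    y = rng.binomial(n, theta0)
    theta_hat = y / n
    w[r] = 2 * (loglik(theta_hat, y, n) - loglik(theta0, y, n))

# Empirical 95% quantile of the LR statistic vs the chi-square(1) quantile
print(np.quantile(w, 0.95), chi2(1).ppf(0.95))
```

The two printed values should be close to 3.84, illustrating the limiting result (3).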

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically


normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.
Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ

is a k-dimensional parameter of interest (where k will often be 1). The profile log-likelihood function of ψ is

$\ell_P(\psi) = \ell(\psi, \hat\lambda_\psi),$

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.
It can be verified, under suitable smoothness conditions,

that results similar to those at (1)–(3) hold as well for the profile log-likelihood function; in particular,

$2\{\ell_P(\hat\psi) - \ell_P(\psi)\} = 2\{\ell(\hat\psi, \hat\lambda) - \ell(\psi, \hat\lambda_\psi)\} \stackrel{d}{\to} \chi^2_k.$

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters: in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
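The divisor-n versus divisor-(n − p) point can be seen numerically. A sketch under an assumed simulated regression model (design, coefficients, and error scale are ours); the profile log-likelihood of σ², namely ℓ_P(σ²) = −(n/2) log σ² − RSS/(2σ²), is maximized at RSS/n:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

print(rss / n)        # profile-likelihood estimate of sigma^2 (divisor n)
print(rss / (n - p))  # usual unbiased estimate (divisor n - p)
```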

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

$\pi_M(\psi \mid y) \propto \int \pi(\psi, \lambda \mid y)\, d\lambda,$

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian

inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

$f(y; \theta) = f_1(y_1; \psi)\, f_2(y_2 \mid y_1; \lambda) \quad \text{or} \quad f(y; \theta) = f_1(y_1 \mid y_2; \psi)\, f_2(y_2; \lambda).$

A discussion of conditional inference and density factorisations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (1)–(3); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser ). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin 2008). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (1996), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison and Reid (2007). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada. She has also served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A (1996) Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc 57:269–326
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables. J R Stat Soc B 34:187–220 (with discussion)
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika 79:113–130
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A 222:309–368
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat 11:59–67
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat 25:1471–1509
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75:237–249
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci 10:138–157
Reid N () Likelihood. J Am Stat Assoc
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html
Varin C (2008) On composite marginal likelihoods. Adv Stat Anal 92:1–28


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.
By its very nature, Probability Theory is concerned

with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, … is a sequence of i.i.d. random variables,

$S_n = \sum_{j=1}^{n} X_j,$

and the expectation a = EX₁ exists, then n⁻¹S_n → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

$P(\zeta_n < x) \to \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
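A minimal simulation sketch of the CLT (our own choice of Exponential(1) summands, for which a = σ = 1), comparing the empirical distribution of ζ_n with Φ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 1000, 5000
# Exponential(1) summands: a = EX = 1, sigma^2 = Var X = 1
x = rng.exponential(scale=1.0, size=(reps, n))

zeta = (x.sum(axis=1) - n * 1.0) / np.sqrt(n)   # standardized sums (sigma = 1)
print(np.mean(zeta < 1.0), norm.cdf(1.0))       # empirical P(zeta_n < 1) vs Phi(1)
```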

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on the X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and


asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

$P(\zeta_n > x)/P(x, n) \to 1 \quad \text{as } n \to \infty,\ x \to \infty.$

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of x relative to n as n → ∞.

Considering observations X₁, …, X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X₁ − EX₁)(X₁ − EX₁)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from the random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX₁ and σ² = E(X₁ − a)² also follows from the LLN since, as n → ∞,

$a^* = \frac{1}{n}\sum_{j=1}^{n} X_j \to a, \qquad (\sigma^2)^* = \frac{1}{n}\sum_{j=1}^{n}(X_j - a^*)^2 = \frac{1}{n}\sum_{j=1}^{n} X_j^2 - (a^*)^2 \to \sigma^2$

almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
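As a sketch of such an asymptotic confidence interval for a (all numerical choices ours: Uniform(0, 1) observations, nominal level 95%), one may check the coverage by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 2000
a_true = 0.5                      # mean of Uniform(0, 1)

cover = 0
for _ in range(reps):
    x = rng.uniform(size=n)
    a_star = x.mean()
    half = 1.96 * x.std(ddof=0) / np.sqrt(n)   # normal-approximation half-width
    cover += (a_star - half <= a_true <= a_star + half)

print(cover / reps)   # should be close to the nominal 0.95
```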

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov 1998). Furthermore, in

estimation theory and hypotheses testing, one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.
It is worth noting that summation of random variables

is by no means the only situation in which LTs appear in Probability Theory.
Generally speaking, the main objective of Probability

Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question.
At least two possible approaches to this problem should

be noted here.

Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX₁ = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X₁) < ∞, then there exists the limit

$\lim_{\theta \downarrow 0} P(\theta S > x) = e^{-2x/\sigma^2}, \quad x > 0.$

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
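The Kingman–Prokhorov approximation can be illustrated by simulation. A sketch, with standard normal summands (σ² = 1), a small θ, and a finite truncation of the supremum, all our own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, K, reps = 0.05, 20000, 2000   # small drift, truncation horizon, replications
x0 = 1.0                             # evaluate P(theta * S > x0)

exceed = 0
for _ in range(reps):
    steps = rng.normal(size=K) - theta            # X_k - theta, with EX = 0, sigma^2 = 1
    s = max(np.cumsum(steps).max(), 0.0)          # sup_{k >= 0} (S_k - theta k), truncated at K
    exceed += (theta * s > x0)

print(exceed / reps, np.exp(-2 * x0))  # empirical vs the limit e^{-2x/sigma^2}
```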

Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ₀ > 0, then in the above example one has

$P(S > x) \sim c\, e^{-\mu_0 x}, \quad x \to \infty,$

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

$P(S > x) \sim \frac{1}{\theta}\int_{x}^{\infty}(1 - F(t))\, dt, \quad x \to \infty.$


This LT enables one to find approximations for P(S > x) for large x.
For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.
In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of a branching process and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics"

and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P (1968) Convergence of probability measures. Wiley, New York
Borovkov AA (1998) Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (1977–1978) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one


or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

$y_i \mid b_i \sim N(X_i\beta + Z_i b_i, \Sigma_i),$ (1)
$b_i \sim N(0, D),$ (2)

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_iβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_iβ and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.
Note that the random effects in (1) implicitly imply the

marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

$y_i \sim N(X_i\beta, V_i), \qquad V_i = Z_i D Z_i' + \Sigma_i,$

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i, given the observed data y_i, can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{-1}(α)(y_i − X_iβ), where α denotes the vector of variance components in V_i. Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_iβ̂ + Z_i b̂_i of the i-th profile. It can easily be shown that

$\hat y_i = \Sigma_i V_i^{-1} X_i\hat\beta + (I_{n_i} - \Sigma_i V_i^{-1})\, y_i,$

which can be interpreted as a weighted average of the population-averaged profile X_iβ̂ and the observed data y_i, with weights Σ_iV_i^{-1} and I_{n_i} − Σ_iV_i^{-1}, respectively. Note that the "numerator" of Σ_iV_i^{-1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_iβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

$\operatorname{var}(\lambda' \hat b_i) \le \operatorname{var}(\lambda' b_i) = \lambda' D \lambda.$
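A small numerical sketch of this shrinkage effect for a random-intercept model (a special case of (1)–(2) with X_iβ an overall intercept and Z_i a vector of ones; all variances are treated as known, and every numerical value is our own choice):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_i = 200, 5          # number of units and measurements per unit
d, sigma2 = 4.0, 1.0     # random-intercept variance D and error variance
beta0 = 10.0             # single fixed effect: an overall intercept

b = rng.normal(scale=np.sqrt(d), size=m)   # true random intercepts
y = beta0 + b[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(m, n_i))

# The EB formula D Z' V^{-1} (y_i - X_i beta) reduces, for a random
# intercept, to a shrinkage factor times the unit's mean residual.
shrink = d * n_i / (sigma2 + n_i * d)
b_hat = shrink * (y.mean(axis=1) - beta0)

print(b.var(), b_hat.var())   # EB estimates show less variability than D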

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ (1990) Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM (1995) Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS (2002) Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E (2004) Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL (2002) Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G (2001) Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ (1996) Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD (2006) Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB (1995) Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H (2001) Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O (2006) SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT (1993) Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G (2005) Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM (2000) Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE (1992) Variance components. Wiley, New York
Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM (1997) Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE (2005) Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT (2007) Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T (2006) Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L (2010) Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith (1998) is a plainly written elementary introduction to linear regression models; Rao (1973) is one of many established general references, at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω₁, ω₂, embedded in a white noise background: y[t] = β₁ + β₂ sin[ω₁t] + β₃ sin[ω₂t] + ε. We write β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (1, x₂, x₃), x₂ ≡ x₂(t) = sin[ω₁t], x₃(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y | x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 for i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ω₁t_i], sin[ω₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector (y_i, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which

asteroid Ceres would re-appear from behind the sun after it had been lost to view only days following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).
By contrast, Galton (1877), working with what might

today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and in time he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

our many statistical models ldquodrivenrdquo as we now proclaimby random errors subject to ever broadening statisticalmodeling In the beginning it was very close to truth


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence, we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-a-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.
If there is any weakness to the statistical approach,

it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2003), Freedman (1991, 2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \varepsilon_i, \quad i \le n,$ (1)

in which the random errors ε_i are assumed to satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n.

The interpretation is that response y_i is observed for the i-th sample in conjunction with numerical values (x_{i1}, …, x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^{tr}x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

$\hat\beta_{ML} = \hat\beta_{LS} = (x^{tr}x)^{-1}x^{tr}y = \beta + M\varepsilon,$ (2)

for M = (x^{tr}x)^{-1}x^{tr}, with x^{tr} denoting the matrix transpose of x and (x^{tr}x)^{-1} the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

$E(\hat\beta_{LS})_k = \beta_k, \quad k \le p,$ (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

$E((\hat\beta_{LS})_k - \beta_k)^2 \le E(\beta^*_k - \beta_k)^2 \quad \text{for every } k \le p.$ (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
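A minimal sketch of (2)–(3) in Python (the design matrix, coefficients, and error distribution are our own illustrative choices): repeated simulation shows the average of the least squares estimates approaching β, as unbiasedness (3) asserts.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 100, 3, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([2.0, -1.0, 0.5])

est = np.empty((reps, p))
for r in range(reps):
    eps = rng.normal(scale=2.0, size=n)          # iid errors, mean 0
    y = X @ beta + eps
    est[r] = np.linalg.solve(X.T @ X, X.T @ y)   # beta_LS = (x'x)^{-1} x'y

print(est.mean(axis=0))   # close to beta, illustrating property (3)
```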

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_kε and M_k(ε + c1) = M_kε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²) s²(y), where s²(·) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving


an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski). Freedman and Lane (1983) advocated tests based on a permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

$\hat\beta_{ML} = (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}y = \beta + (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}\varepsilon.$ (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
A very large body of work has been devoted to linear

regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
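A sketch of the generalized least squares solution (5) for a heteroscedastic diagonal Σ assumed known (all numerical choices ours):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 3.0])

# Heteroscedastic errors: diagonal covariance known up to a constant multiple
var = np.linspace(0.5, 5.0, n)
Sigma_inv = np.diag(1.0 / var)
y = X @ beta + rng.normal(scale=np.sqrt(var))

# Generalized least squares, equation (5)
A = X.T @ Sigma_inv @ X
beta_gls = np.linalg.solve(A, X.T @ Sigma_inv @ y)
print(beta_gls)
```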

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

$y(t) = m(t) + \varepsilon(t), \quad t \in T,$
$m(t) = E\,y(t) = \sum_i \beta_i w_i(t), \quad t \in T,$

where the w_i(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup

has been examined from an on-line learning perspective (Vovk ).

Jointly Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

$E(y \mid x) = Ey + E((y - Ey) \mid x) = Ey + x\beta \quad \text{for every } x,$

and the discrepancies y − E(y | x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models.
Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the ε_i are simply the residual

discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).
Freedman (1981) established the applicability of Efron's

Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.
A model might well be used even when it is under-

stood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.
One possible resolution to this tradeoff between reli-

able estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion

to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(EX, 0). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(EX, 0), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

$E\,Y\,1(X > c) = E(Y\,1(Y > c) + YJ) = E\,X\,1(X > c) + E\,YJ$
$\quad < E\,X\,1(X > c) + c\,EJ = E\,X\,1(X > c),$

yielding E(Y | X > c) < E(X | X > c). ◻

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA (2003) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Draper NR, Smith H (1998) Applied regression analysis, 3rd edn. Wiley, New York
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Freedman D (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21:291–313
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1:292–298
Galton F (1877) Typical laws of heredity. Nature 15:492–495, 512–514, 532–533
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland 15:246–263
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal, Elsevier
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32(4):951–989
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat 45:344–346
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, x_n) is a sample from a stochastic process x = {x₁, x₂, …}. Let p_n(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n); θ_n)/p_n(x(n); θ)], where θ_n = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density p_n(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

$\Lambda_n = n^{-1/2} h^t S_n(\theta) - \tfrac{1}{2} n^{-1} h^t J_n(\theta) h + o_p(1),$ (1)

where $S_n(\theta) = \frac{d \ln p_n(x(n); \theta)}{d\theta}$, $J_n(\theta) = -\frac{d^2 \ln p_n(x(n); \theta)}{d\theta\, d\theta^t}$, and

(i) $n^{-1/2} S_n(\theta) \stackrel{d}{\to} N_k(0, F(\theta))$, (ii) $n^{-1} J_n(\theta) \stackrel{p}{\to} F(\theta),$ (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.
For the LAN family defined by (1) and (2), it is well

known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

$\sqrt{n}(\hat\theta_n - \theta) \stackrel{d}{\to} N_k(0, F^{-1}(\theta)).$ (3)

A large class of models involving classical i.i.d. (independent and identically distributed) observations is covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.
If the limiting Fisher information matrix F(θ) is non-

degenerate and random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.
For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

$J_n^{1/2}(\theta)(\hat\theta_n - \theta) \stackrel{d}{\to} N(0, I),$ (4)

where I is the identity matrix.
Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals.

Suppose, conditionally on V = v, (x₁, x₂, …, x_n) are i.i.d. N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

$p_n(x(n); \theta) \propto \left[1 + \tfrac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right]^{-(n/2+1)}.$

It can be verified that F(θ) is

an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2), a t-distribution with 2 degrees of freedom. It is interesting to note that the variance of the limit distribution of x̄ is ∞.
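Because, conditionally on V, √n(x̄ − θ) is exactly N(0, V⁻¹), the t(2) limit can be checked directly by simulation (a sketch; all numerical settings are ours):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(10)
theta, n, reps = 0.0, 400, 10000

z = np.empty(reps)
for r in range(reps):
    v = rng.exponential(1.0)                    # V with mean 1
    x = rng.normal(theta, 1.0 / np.sqrt(v), n)  # iid N(theta, 1/v) given V = v
    z[r] = np.sqrt(n) * (x.mean() - theta)

# Quantiles should match a t distribution with 2 degrees of freedom
print(np.quantile(z, [0.05, 0.5, 0.95]))
print(t(2).ppf([0.05, 0.5, 0.95]))
```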

Example 2: Autoregressive process.

Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, …, with x₀ = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables.


We then have

$p_n(x(n); \theta) \propto \exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(x_t - \theta x_{t-1})^2\right].$

For the stationary case |θ| < 1, this model belongs to the LAN family. However, for |θ| > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

$\hat\theta_n = \left(\sum_{t=1}^{n} x_t x_{t-1}\right)\left(\sum_{t=1}^{n} x_{t-1}^2\right)^{-1}.$

One can verify that $\sqrt{n}(\hat\theta_n - \theta) \stackrel{d}{\to} N(0, 1 - \theta^2)$ for |θ| < 1, and $(\theta^2 - 1)^{-1}\theta^{n}(\hat\theta_n - \theta) \stackrel{d}{\to}$ Cauchy for |θ| > 1.
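A simulation sketch of the explosive case (θ = 1.05 and the other settings are our own choices), comparing quantiles of the normalized estimator with those of the standard Cauchy law (quartiles ±1):

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 1.05, 200, 4000    # explosive case |theta| > 1

stats = np.empty(reps)
for r in range(reps):
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t_idx in range(1, n + 1):
        x[t_idx] = theta * x[t_idx - 1] + e[t_idx - 1]
    theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    stats[r] = theta**n * (theta_hat - theta) / (theta**2 - 1)

# Quartiles of a standard Cauchy distribution are -1 and 1
print(np.quantile(stats, [0.25, 0.5, 0.75]))
```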

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected Member of the International Statistical Institute. He has co-authored two books, co-edited eight Proceedings/Monographs/Special Issues of journals, and authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (2000) Asymptotics in statistics, 2nd edn. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location-scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell-Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distributions of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (1975, 1976), Blom (1958), and Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or a difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.
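A minimal sketch of this graphical fit for the normal case, using Blom's plotting positions (the simulated sample and the chosen positions are illustrative assumptions; any of the plotting positions cited above could be substituted):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.sort(rng.normal(loc=10.0, scale=2.0, size=50))  # sample with a = 10, b = 2

n = len(x)
p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom's plotting positions
q = stats.norm.ppf(p)                            # reduced (standard normal) quantiles

# On normal probability paper the points (q_i, x_i) scatter around x = a + b*q,
# so a least-squares line gives estimates of a (intercept) and b (slope)
b_hat, a_hat = np.polyfit(q, x, 1)
print(a_hat, b_hat)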

The latter approach takes the order statistics X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n} as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (1952), the general least-squares method to find best linear unbiased estimators of a and b. Introducing the vectors x = (X_{1:n}, X_{2:n}, …, X_{n:n})′, 1 = (1, 1, …, 1)′, α = (α_{1:n}, α_{2:n}, …, α_{n:n})′, ε = (ε_1, ε_2, …, ε_n)′, θ = (a, b)′ and the matrix A = (1 α),

the regression model now reads

x = A θ + ε

with variance-covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′B^{−1}A)^{−1} A′B^{−1} x,

and its variance-covariance matrix reads

Var(θ̂) = b² (A′B^{−1}A)^{−1}.

The vector α and the matrix B are not always easy to find. Only for a few location-scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location-scale type, see Rinne (2010). Maximum likelihood estimation for location-scale distributions is treated by Mi (2006).
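For the exponential case the closed forms follow from the Rényi representation of exponential order statistics, Y_{i:n} = Σ_{k≤i} Z_k/(n − k + 1) with Z_k i.i.d. Exp(1), giving α_{i:n} = Σ_{k≤i} 1/(n − k + 1) and B_{ij} = Σ_{k≤min(i,j)} 1/(n − k + 1)². A sketch of the GLS estimator above with these ingredients (the true parameter values, sample size and seed are illustrative):

import numpy as np

rng = np.random.default_rng(3)
n, a_true, b_true = 30, 5.0, 2.0
x = np.sort(a_true + b_true * rng.exponential(size=n))  # ordered sample

w = 1.0 / (n - np.arange(n))                  # weights 1/n, 1/(n-1), ..., 1
alpha = np.cumsum(w)                          # alpha_{i:n} = E(Y_{i:n})
idx = np.minimum.outer(np.arange(1, n + 1), np.arange(1, n + 1))
B = np.cumsum(w ** 2)[idx - 1]                # B_{ij} = Cov(Y_{i:n}, Y_{j:n})

A = np.column_stack([np.ones(n), alpha])
Binv = np.linalg.inv(B)
theta_hat = np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x)  # BLUE of (a, b)
print(theta_hat)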

About the Author

Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and joined Volkswagen AG to do operations research, before being appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute, and is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC 2008).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat 24
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat 25
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc 55
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika 39
Mi J (2006) MLE of parameters of location-scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H (2010) Location-scale distributions: linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).
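In the univariate case the construction is simply the inverse logit of a normal variate; a minimal sketch (parameter values and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.0, 1.0

# logit(P) ~ N(mu, sigma^2)  =>  P = 1/(1 + exp(-(mu + sigma*Z))) lies in (0, 1)
p = 1.0 / (1.0 + np.exp(-(mu + sigma * rng.normal(size=100_000))))
print(p.min(), p.max(), p.mean())  # note E(P) != 1/(1 + exp(-mu)) in general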

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[R_i] = m_i π_i and Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
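A quick simulation check of this small-σ variance approximation (all numerical settings here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(5)
m, eta, sigma, reps = 20, 0.3, 0.4, 200_000

# R | P ~ Binomial(m, P) with logit(P) = eta + sigma*Z, Z ~ N(0, 1)
P = 1.0 / (1.0 + np.exp(-(eta + sigma * rng.normal(size=reps))))
R = rng.binomial(m, P)

pi = 1.0 / (1.0 + np.exp(-eta))
approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(R.var(), approx)  # simulated variance against Williams' approximation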

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat 31

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1−y_i}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton-Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{(y_i θ_i − b(θ_i))/α(ϕ) + c(y_i; ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b′′(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}.  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)},  (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i))}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.  (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)).  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton-Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i  (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i)  (10)
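The gradient (9) and Hessian (10) are all that a Newton-Raphson fit needs; a compact sketch (the simulated data and true coefficients are illustrative assumptions):

import numpy as np

def logistic_newton(X, y, tol=1e-10, max_iter=25):
    # Newton-Raphson using the gradient (9) and Hessian (10);
    # X is assumed to carry a column of ones for the intercept
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                  # Eq. (9)
        hess = -(X.T * (mu * (1 - mu))) @ X    # Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                     # beta_new = beta - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
print(logistic_newton(X, y))  # close to true_beta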

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived    child    adults    Total
no             52       765      817
yes            57       442      499
Total         109      1207     1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z       P>|z|    [95% Conf. Interval]
age             0.527        0.106    -3.19    0.001       0.356    0.781

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above (1/0.527 ≈ 1.90), we may conclude that children had some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data and the odds ratio were calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
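The component odds and the odds ratio can, of course, be computed directly from the 2 × 2 table; a sketch using the counts tabulated above, with a large-sample confidence interval from the standard error of the log odds ratio:

import numpy as np

# Counts from the tabulation above: (survived, died) for children and adults
surv = {"child": 57, "adult": 442}
died = {"child": 52, "adult": 765}

odds_adult = surv["adult"] / died["adult"]   # about 0.578
odds_child = surv["child"] / died["child"]   # about 1.096
oratio = odds_adult / odds_child             # about 0.527

se_log = np.sqrt(sum(1.0 / v for v in (57, 52, 442, 765)))
ci = np.exp(np.log(oratio) + np.array([-1.96, 1.96]) * se_log)
print(oratio, ci)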

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator (y), indicating the number of successes for a specific covariate pattern, and the denominator (m), the number of observations having that specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},  (11)

where C(m_i, y_i) is the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln C(m_i, y_i)}.  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider a small grouped data set with columns y, cases, x1, x2 and x3. Here y indicates the number of times a specific pattern of covariates is successful, and cases is the number of observations having that specific covariate pattern. If the first observation in such a table has cases = 3 and y = 1, this informs us that there are three cases having its pattern of predictor values; of those three cases, one has a response value of 1 and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.
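A sketch of such a grouped-data fit using GLM machinery (the data here are hypothetical, constructed only to mirror the layout just described; statsmodels accepts a two-column successes/failures response for binomial GLMs):

import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: y successes out of `cases` trials per covariate pattern
y     = np.array([1, 1, 2, 1])
cases = np.array([3, 2, 3, 2])
x1    = np.array([1, 0, 1, 0])
x2    = np.array([0, 1, 1, 0])

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(np.column_stack([y, cases - y]), X,
             family=sm.families.Binomial()).fit()
print(np.exp(fit.params))  # exponentiated coefficients are odds ratios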

Fitting the grouped model produces output in the same format as before: an odds ratio, (OIM) standard error, z-statistic, p-value, and confidence interval for each of the predictors x1, x2 and x3.

The data in such a table may be restructured so that it is in individual observation format rather than grouped; the new table would have ten observations (the total of the cases column), with the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many thousands of observations being grouped into only a handful of rows or covariate patterns, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer-Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer-Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across a single set of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with

appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author

Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]², −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]^{−1}, −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]^{−1}, −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]^{−1}.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α], t ≥ 0, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
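A direct check of this inversion method and of the moments quoted above (the parameter values and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(7)
mu, tau = 2.0, 1.5

U = rng.uniform(size=1_000_000)
X = mu + tau * np.log(U / (1 - U))     # tau*(standard logistic) + mu, via Eq. (2)

print(X.var(), tau**2 * np.pi**2 / 3)  # both approximately 7.40
kurt = np.mean((X - X.mean())**4) / X.var()**2
print(kurt)                            # approximately 4.2, against 3 for the normal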

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β_0 + β_1 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author

John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type, the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912). This index is the ratio

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; by the Gini index, L_1(p) has greater inequality than L_2(p)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
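Empirically, the Lorenz ordinates and the Gini index can be computed from the sorted incomes; a sketch (the lognormal sample is an illustrative choice):

import numpy as np

def lorenz_gini(incomes):
    # Empirical Lorenz ordinates L(i/n) and Gini index (trapezoidal area)
    x = np.sort(np.asarray(incomes, dtype=float))
    n = len(x)
    L = np.cumsum(x) / x.sum()
    gini = 1 - np.sum(L + np.concatenate(([0.0], L[:-1]))) / n
    return L, gini

rng = np.random.default_rng(8)
incomes = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)
L, g = lorenz_gini(incomes)
print(g)  # close to 2*Phi(0.8/sqrt(2)) - 1, about 0.43, for this lognormal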

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
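The first two cases of Theorem 1 are easy to see numerically: a power rule u(x) = x^0.8 has u(x)/x = x^{-0.2} decreasing (progressive), while a flat tax u(x) = 0.7x leaves the Lorenz curve, and hence the Gini index, unchanged. A sketch with illustrative settings:

import numpy as np

rng = np.random.default_rng(9)
x = rng.lognormal(sigma=0.8, size=10_000)  # pre-tax incomes

def gini(v):
    v = np.sort(v)
    L = np.cumsum(v) / v.sum()
    return 1 - np.sum(L + np.concatenate(([0.0], L[:-1]))) / len(v)

progressive = x ** 0.8   # u(x)/x monotone decreasing: case (I)
flat = 0.7 * x           # u(x)/x constant: case (II)
print(gini(x), gini(progressive), gini(flat))  # Gini falls, then is unchanged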

About the Author

Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt-Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and therefore the value of the loss function is zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative, Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1, and L(θ, ϑ) = 1 if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems, we assume that Θ is a normed linear space and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣², θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L_1-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t² if ∣t∣ ≤ γ and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian (1975) introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a

scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works, the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao-Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined so as to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback-Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual at the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
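The losses catalogued above, and the way a chosen loss drives the "best approximation" in regression, can be made concrete in a few lines (the data vector and tuning constants are illustrative assumptions):

import numpy as np

def square_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma=1.0):
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):  # asymmetric: exponential on one side
    return b * (np.exp(a * t) - a * t - 1)

# Best constant fit c minimizing the summed loss of the residuals y - c:
# the mean under square loss, a median under L1, something in between for Huber
y = np.array([1.0, 2.0, 2.5, 10.0])
grid = np.linspace(0.0, 10.0, 2001)
for loss in (square_loss, l1_loss, huber_loss):
    c = grid[np.argmin([loss(y - c).sum() for c in grid])]
    print(loss.__name__, c)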

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Least Absolute Residuals Procedure

to these n observations. Then, for p > 0, the Lp-norm fitting procedure chooses values for b_1, b_2, …, b_q to minimize the Lp-norm of the residuals, [Σ_{i=1}^n ∣e_i∣^p]^{1/p}, where, for i = 1, 2, …, n, the i-th residual is defined by

e_i = y_i − x_{i1}b_1 − x_{i2}b_2 − ⋯ − x_{iq}b_q.

The most familiar Lp-norm fitting procedure, known as the 7least squares procedure, sets p = 2 and chooses values for b_1, b_2, …, b_q to minimize the sum of the squared residuals, Σ_{i=1}^n e_i².

A second choice, to be discussed in the present article, sets p = 1 and chooses b_1, b_2, …, b_q to minimize the sum of the absolute residuals, Σ_{i=1}^n ∣e_i∣.

A third choice sets p = ∞ and chooses b_1, b_2, …, b_q to minimize the largest absolute residual, max_{i=1,…,n} ∣e_i∣.

Setting u_i = e_i and v_i = 0 if e_i ≥ 0, and u_i = 0 and v_i = −e_i if e_i < 0, we find that e_i = u_i − v_i, so that the least absolute residuals (LAR) fitting problem chooses b_1, b_2, …, b_q to minimize the sum of the absolute residuals

Σ_{i=1}^n (u_i + v_i)

subject to

x_{i1}b_1 + x_{i2}b_2 + ⋯ + x_{iq}b_q + u_i − v_i = y_i for i = 1, 2, …, n,

and u_i ≥ 0, v_i ≥ 0 for i = 1, 2, …, n. The LAR fitting problem thus takes the form of a linear programming problem and is often solved by means of a variant of the dual simplex procedure.

Gauss has noted (when q = 2) that solutions of this problem are characterized by the presence of a set of q zero residuals. Such solutions are robust to the presence of outlying observations. Indeed, they remain constant under variations in the other n − q observations, provided that these variations do not cause any of the residuals to change their signs.
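The linear program above can be handed to any LP solver; a sketch using scipy.optimize.linprog with decision variables (b, u, v) (the simulated data and the injected outlier are illustrative assumptions):

import numpy as np
from scipy.optimize import linprog

def lar_fit(X, y):
    # minimize sum(u + v) subject to X b + u - v = y, u >= 0, v >= 0
    n, q = X.shape
    c = np.concatenate([np.zeros(q), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * q + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:q]

rng = np.random.default_rng(10)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.laplace(size=50)
y[0] += 100.0                 # a gross outlier barely moves the LAR fit
print(lar_fit(X, y))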

The LAR fitting procedure corresponds to the maximum likelihood estimator when the ε-disturbances follow a double exponential (Laplacian) distribution. This estimator is more robust to the presence of outlying observations than is the standard least squares estimator, which maximizes the likelihood function when the ε-disturbances are normal (Gaussian). Nevertheless, the LAR estimator has an asymptotic normal distribution, as it is a member of Huber's class of M-estimators.

There are many variants of the basic LAR procedure, but the one of greatest historical interest is that proposed in 1760 by the Croatian Jesuit scientist Rugjer (or Rudjer) Josip Bošković (1711-1787) (Latin: Rogerius Josephus Boscovich; Italian: Ruggiero Giuseppe Boscovich). In his variant of the standard LAR procedure there are two explanatory variables, of which the first is constant, x_{i1} = 1, and the values of b_1 and b_2 are constrained to satisfy the adding-up condition Σ_{i=1}^n (y_i − b_1 − x_{i2}b_2) = 0 usually associated with the least squares procedure developed by Gauss in 1795 and published by Legendre in 1805. Computer algorithms implementing this variant of the LAR procedure with q ≥ 2 variables are still to be found in the literature.

For an account of recent developments in this area, see the series of volumes edited by Dodge (1987, 1992, 1997, 2002). For a detailed history of the LAR procedure, analyzing the contributions of Bošković, Laplace, Gauss, Edgeworth, Turner, Bowley and Rhodes, see Farebrother (1999). And for a discussion of the geometrical and mechanical representation of the least squares and LAR fitting procedures, see Farebrother (2002).

About the Author

Richard William Farebrother was a member of the teaching staff of the Department of Econometrics and Social Statistics in the (Victoria) University of Manchester until he took early retirement (he is blind), and was subsequently an Honorary Reader in Econometrics in the Department of Economic Studies of the same University. He has published three books: Linear Least Squares Computations (1988), Fitting Linear Relationships: A History of the Calculus of Observations 1750-1900 (1999), and Visualizing Statistical Models and Concepts (2002). He has also published numerous research papers in a wide range of subject areas, including econometric theory, computer algorithms, statistical distributions, statistical inference, and the history of statistics. Dr. Farebrother was also Visiting Associate Professor at Monash University, Australia, and Visiting Professor at the Institute for Advanced Studies in Vienna, Austria.

Cross References
7Least Squares
7Residuals

References and Further Reading
Dodge Y (ed) (1987) Statistical data analysis based on the L1-norm and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1992) L1-statistical analysis and related methods. North-Holland, Amsterdam
Dodge Y (ed) (1997) L1-statistical procedures and related topics. Institute of Mathematical Statistics, Hayward
Dodge Y (ed) (2002) Statistical data analysis based on the L1-norm and related methods. Birkhäuser, Basel
Farebrother RW (1999) Fitting linear relationships: a history of the calculus of observations 1750–1900. Springer, New York
Farebrother RW (2002) Visualizing statistical models and concepts. Marcel Dekker, New York

Least Squares

Czesław Stepniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems F(x) ≈ b of equations, where b ∈ Rⁿ while F(x) is a column of the form

F(x) = (f_1(x), f_2(x), …, f_{n−1}(x), f_n(x))ᵀ

with entries f_i = f_i(x), i = 1, …, n, where x = (x_1, …, x_p)ᵀ. The LS problem is linear when each f_i is a linear function, and nonlinear if not.

Linear LS problem refers to a system Ax = b of linear equations. Such a system is overdetermined if n > p. If b ∉ range(A), the system has no proper solution and will be denoted by Ax ≈ b. In this situation we are seeking a solution of some optimization problem. The name "Least Squares" is justified by the l₂-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

y_i = f_i(x_{i1}, …, x_{ip}; β_1, …, β_p) + e_i for i = 1, …, n  (1)

where x_{ij}, i = 1, …, n, j = 1, …, p, are some constants given by experimental design, f_i, i = 1, …, n, are given functions depending on unknown parameters β_j, j = 1, …, p, while y_i, i = 1, …, n, are values of these functions observed with some random errors e_i. We want to estimate the unknown parameters β_j on the basis of the data set {x_{ij}, y_i}.

In linear regression each f_i is a linear function of the type f_i = ∑_{j=1}^p c_{ij}β_j, and the model (1) may be presented in vector-matrix notation as

y = Xβ + e,

where y = (y_1, …, y_n)ᵀ, e = (e_1, …, e_n)ᵀ and β = (β_1, …, β_p)ᵀ, while X is an n × p matrix with entries x_{ij}. If e_1, …, e_n are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to finding a vector β̂ that minimizes the norm ‖y − Xβ‖² = (y − Xβ)ᵀ(y − Xβ).

Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand, the latter reduces to solving the consistent system

XᵀXβ = Xᵀy

of linear equations, called normal equations. In particular, if rank(X) = p, then the system has a unique solution of the form

β̂ = (XᵀX)⁻¹Xᵀy.

For linear regression y_i = α + βx_i + e_i with one regressor x, the BLU estimators of the parameters α and β may be presented in the convenient form

β̂ = ns_{xy}/ns_x² and α̂ = ȳ − β̂x̄,

where

ns_{xy} = ∑_i x_iy_i − (∑_i x_i)(∑_i y_i)/n, ns_x² = ∑_i x_i² − (∑_i x_i)²/n,
x̄ = (∑_i x_i)/n and ȳ = (∑_i y_i)/n.

For its computation we only need to use a simple pocket calculator.

Example. The following table presents the number of residents in thousands (x) and the unemployment rate (y) for some cities of Poland; the task is to estimate the parameters β and α. From the data one computes ∑_i x_i, ∑_i y_i, ∑_i x_i², and ∑_i x_iy_i, hence ns_x² and ns_{xy}, and therefore the estimates β̂ and α̂, which yield the fitted line f(x) = α̂ + β̂x.
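As a quick numerical illustration of these formulas, here is a small sketch; the data are hypothetical stand-ins (the numbers from the original table are not reproduced here), and the computation follows the pocket-calculator recipe above, cross-checked against the normal equations.

import numpy as np

# Hypothetical (x, y) pairs standing in for the table's cities:
x = np.array([50.0, 120.0, 200.0, 350.0, 700.0])  # residents, thousands
y = np.array([18.0, 15.5, 14.0, 11.0, 7.5])       # unemployment rate
n = len(x)

nsxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # n * s_xy
nsx2 = np.sum(x ** 2) - np.sum(x) ** 2 / n        # n * s_x^2
beta = nsxy / nsx2
alpha = y.mean() - beta * x.mean()
print(alpha, beta)                                # fitted line f(x) = alpha + beta*x

# Cross-check via the normal equations X'X b = X'y:
X = np.column_stack([np.ones(n), x])
print(np.linalg.solve(X.T @ X, X.T @ y))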


If the variance–covariance matrix of the error vector e coincides (except for a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased estimation reduces to the minimization of (y − Xβ)ᵀV⁻¹(y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form

β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.
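A corresponding sketch of the weighted LS estimator, with a made-up V and simulated data chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(6), np.arange(6.0)])
V = np.diag([1.0, 1.0, 2.0, 2.0, 4.0, 4.0])       # hypothetical error covariance
y = X @ np.array([1.0, 0.5]) + rng.multivariate_normal(np.zeros(6), V)

Vinv = np.linalg.inv(V)
beta_w = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(beta_w)                                     # weighted LS estimate of beta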

It is worth adding that a nonlinear LS problem is more complicated and its explicit solution is usually not known; instead, some algorithms are suggested.

Total Least Squares Problem
The problem has been posed in recent years in numerical analysis as an alternative for the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

‖[A, b] − [Â, b̂]‖_F

for all Â ∈ R^{n×k} and b̂ ∈ range(Â), where the Frobenius norm is defined by ‖(a_{ij})‖²_F = ∑_{ij} a²_{ij}. Once a minimizing [Â, b̂] is found, then any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique.

It is known that the TLS solution (if it exists) is always better than the ordinary LS one in the sense that the correction b̂ − Âx̂ has a smaller l₂-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any n × k matrix A with real entries there exist orthogonal matrices P = [p_1, …, p_n] and Q = [q_1, …, q_k] such that

PᵀAQ = diag(σ_1, …, σ_m), where σ_1 ≥ ⋯ ≥ σ_m and m = min{n, k}.
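The following is a minimal sketch of how the SVD yields a TLS solution; it follows the classical construction (take the right singular vector belonging to the smallest singular value of [A, b]), which is standard in the TLS literature although not spelled out in the entry above. The function tls and the data are illustrative.

import numpy as np

def tls(A, b):
    # Right singular vector of [A, b] for the smallest singular value.
    n, k = A.shape
    _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
    v = Vt[-1]
    if np.isclose(v[k], 0.0):
        raise ValueError("TLS solution does not exist or is not unique")
    return -v[:k] / v[k]

A = np.array([[1.0, 0.9], [2.1, 2.0], [2.9, 3.1]])
b = np.array([2.0, 4.0, 6.1])
print(tls(A, b))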

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss-Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T (2004) Generalized least squares. Wiley, New York
Rao CR, Toutenburg H (1999) Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J (2006) Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams stem from the fact that they provide a basis for describing the water input of dams. Their popularity in finance is because they describe financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed; thus the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model.


Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {X_t, t ≥ 0} that has right continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of X_{t+s} − X_t is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, X_{t+s} − X_t is independent of (X_u, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0,
lim_{s→t} P(|X_t − X_s| > є) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If Φ_t(z) is the characteristic function of the process at time t, then its characteristic component φ(z) := ln Φ_t(z)/t is of the form

φ(z) = iza − z²b/2 + ∫_R [exp(izx) − 1 − izx I_{|x|<1}] ν(dx),

where a ∈ R, b ∈ R₊, and ν is a measure on R satisfying

ν({0}) = 0, ∫_R (1 ∧ x²) ν(dx) < ∞.

The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows.

Given any a ∈ R, b ∈ R₊ and measure ν on R satisfying ν({0}) = 0, ∫_R (1 ∧ x²)ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square integrable martingale. The characteristic components of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z), and φ⁽³⁾(z), respectively) are as follows:

φ⁽¹⁾(z) = iza − z²b/2,
φ⁽²⁾(z) = ∫_{|x|≥1} (exp(izx) − 1) ν(dx),
φ⁽³⁾(z) = ∫_{|x|<1} (exp(izx) − 1 − izx) ν(dx).

Examples of the Lévy Processes

The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift µ and volatility rate σ² if µ = a, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift µ > 0 and volatility rate σ². For any x > 0, let T_x = inf{t : X_t > x}. Then {T_x, x ≥ 0} is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

υ(dx) = 1/√(2πσ²x³) · exp(−µ²x/(2σ²)) dx, x > 0.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ xν(dx) = 0, and its Lévy measure is given by

ν(dx) = (α/x) exp(−x/β) dx, x > 0,

where α, β > 0. It follows that the mean term (E(X_1)) and the variance term (V(X_1)) for the process are equal to αβ and αβ², respectively.

Figure 1 shows a simulated sample path of a gamma process.
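One common way to simulate such a path (a sketch, not the code used for the figure) exploits the independent-increments property: over a grid with step dt, the increments are independent Gamma(α·dt, β) variables. The parameter values below are illustrative, since the figure's values are not given in the text.

import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 1.0, 1.0          # illustrative values only
T, steps = 10.0, 1000
dt = T / steps

# Independent Gamma(alpha*dt, beta) increments, cumulated from X_0 = 0:
increments = rng.gamma(shape=alpha * dt, scale=beta, size=steps)
path = np.concatenate([[0.0], np.cumsum(increments)])
t = np.linspace(0.0, T, steps + 1)
# Check against the text: E X_1 = alpha*beta, Var X_1 = alpha*beta**2.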

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: the drift term µ of the Brownian process, the volatility σ of the Brownian process, and the variance term ν of the gamma process.

Lévy Processes. Fig. 1 Gamma process

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths

Figure 2 shows two simulated sample paths, one for a Brownian motion with drift term µ and volatility term σ, and the other for a variance gamma process with the same values for the drift and volatility terms.
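The subordination construction just described translates directly into a simulation sketch: replace calendar time of a Brownian motion with drift µ and volatility σ by a gamma process with unit mean rate and variance ν. All parameter values below are illustrative.

import numpy as np

rng = np.random.default_rng(7)
mu, sigma, nu = 0.0, 0.25, 0.2   # illustrative values only
T, steps = 10.0, 1000
dt = T / steps

# Gamma time change with unit mean rate and variance nu: E dG = dt.
dG = rng.gamma(shape=dt / nu, scale=nu, size=steps)
dX = mu * dG + sigma * np.sqrt(dG) * rng.standard_normal(steps)
vg = np.concatenate([[0.0], np.cumsum(dX)])     # variance gamma path

dB = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(steps)
bm = np.concatenate([[0.0], np.cumsum(dB)])     # Brownian path for comparison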

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He has served on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Çinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in Statistics and Operations Research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using P^M_{λ,τ} policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ 81:637–654
Cont R, Tankov P (2004) Financial modelling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev 2:79–105
Patel A, Kosko B (2008) Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw 19:1993–2008
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety 94:2–21

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (e_x) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

e_x = T_x / l_x,

where x is age, T_x is the number of person-years lived aged x and over, and l_x is the number of survivors at age x, according to the life table.


Life expectancy can be calculated for combined sexes or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth (e_0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e_0 = T_0 / l_0,

where T_0 is the total size of the population (in person-years) and l_0 is the number of births (the original number of persons in the birth cohort).

birth is highly inuenced by infant mortality rate eparadox of the life table refers to a situation where lifeexpectancy at birth increases for several years aer birth(e lt e lt e and even beyond)e paradox reects thehigher rates of infant and child mortality in populationsin pre-transition and middle stages of the demographictransitionLife expectancy at birth is a summary measure of mor-

tality in a population It is a frequently used indicatorof health standards and socio-economic living standardsLife expectancy is also one of the most commonly usedindicators of social development is indicator is easilycomparable through time and between areas includingcountries Inequalities in life expectancy usually indicateinequalities in health and socio-economic developmentLife expectancy rose throughout human history In

ancient Greece and Rome the average life expectancywas below years between the years and life expectancy at birth rose from about years to aglobal average of years and to more than years inthe richest countries (Riley ) Furthermore in mostindustrialized countries in the early twenty-rst centurylife expectancy averaged at about years (WHO)esechanges called the ldquohealth transitionrdquo are essentially theresult of improvements in public health medicine andnutritionLife expectancy varies signicantly across regions and

continents from life expectancies of about years insome central African populations to life expectancies of years and above in many European countries emore developed regions have an average life expectancy of years while the population of less developed regionsis at birth expected to live an average years less etwo continents that display the most extreme dierencesin life expectancies are North America ( years) andAfrica ( years) where as of recently the gap betweenlife expectancies amounts to years (UN ndash data)

Countries with the highest life expectancies in theworld ( years) are Australia Iceland Italy Switzerlandand Japan ( years) Japanese men and women live anaverage of and years respectively (WHO )In countries with a high rate of HIV infection prin-

cipally in Sub-Saharan Africa the average life expectancyis years and below Some of the worldrsquos lowest lifeexpectancies are in Sierra Leone ( years) Afghanistan( years) Lesotho ( years) and Zimbabwe ( years)In nearly all countries women live longer than men

e worldrsquos average life expectancy at birth is years formales and years for females the gap is about ve yearse female-to-male gap is expected to narrow in the moredeveloped regions andwiden in the less developed regionse Russian Federation has the greatest dierence in lifeexpectancies between the sexes ( years less for men)whereas in Tonga life expectancy for males exceeds thatfor females by years (WHO )Life expectancy is assumed to rise continuously

According to estimation by the UN global life expectancyat birth is likely to rise to an average years by ndashBy life expectancy is expected to vary across countriesfrom to years Long-range United Nations popula-tion projections predict that by on average peoplecan expect to live more than years from (Liberia) upto years (Japan)For more details on the calculation of life expectancy

including continuous notation see Keytz ( )and Preston et al ()

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N (1985) Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization (2009) World health statistics 2009. WHO, Geneva
World Population Prospects: The 2008 revision. Highlights. United Nations, New York
World Population to 2300 (2004) United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = dF(x)/dx, and define the force of mortality or death intensity at age x as the function

µ(x) = −(d/dx) ln{1 − F(x)} = f(x)/{1 − F(x)}.

Heuristically it is interpreted by the relation µ(x)dx = P{x < X < x + dx | X > x}. Conversely,

F(x) = 1 − exp{−∫₀ˣ µ(s)ds}.

The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is

ₜp_x = P{X > x + t | X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ µ(x + s)ds},

and the non-survival probability is (the converse) ₜq_x = 1 − ₜp_x. For t = 1 one writes q_x = ₁q_x and p_x = ₁p_x. In particular we get ℓ(x + 1) = ℓ(x)p_x. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of p_x for all relevant x.

The life expectancy is

e°₀ = E{X} = ∫₀^∞ ℓ(x)dx / ℓ(0).

(The subscript 0 in e°₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°_x = E{X − x | X > x} = ∫₀^∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function µ(x) is piecewise constant, which means that we take it to equal some constant, say µ_j, over each interval (x_j, x_{j+1}) for some partition {x_j} of the age axis. For a collection {X_i} of independent observations of X, let D_j be the number of X_i that fall in the interval (x_j, x_{j+1}). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, D_j is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for X_i − x_j time units during the interval. Individuals who survive the interval will have lived for x_{j+1} − x_j time units in the interval, and individuals who do not survive to age x_j will not have lived during this interval at all. When we aggregate the time units lived in (x_j, x_{j+1}) over all individuals, we get a total R_j which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters µ_j, the collection {D_j, R_j} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation

ln Λ = ∑_j {−µ_j R_j + D_j ln µ_j},

which is easily seen to be maximized by µ̂_j = D_j/R_j. The latter fraction is therefore the maximum-likelihood estimator for µ_j. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function µ_x = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.

If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂_x in this interval, then p̂_x = e^{−µ̂_x} is an estimator of the single-year survival probability p_x. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂_x for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
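The occurrence-exposure scheme above amounts to a short computation. The following sketch uses hypothetical D_j and R_j values and a conventional radix of 100,000 to run the recursion ℓ(x + 1) = ℓ(x)p̂_x.

import numpy as np

# Hypothetical occurrences (deaths) and exposures (person-years) by age:
D = np.array([50, 10, 5, 4, 6])
R = np.array([9900.0, 9850.0, 9800.0, 9780.0, 9750.0])

mu_hat = D / R                  # maximum-likelihood estimates mu_j = D_j / R_j
p_hat = np.exp(-mu_hat)         # single-year survival probabilities
ell = 100_000 * np.concatenate([[1.0], np.cumprod(p_hat)])  # l(0), l(1), ...
print(ell.round(0))             # estimated survivor column of the life table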

a group of mutually independent individuals who have allbeen observed in parallel essentially a cohort that is fol-lowed from a signicant common starting point (namelyfrom birth in our mortality example) and which is dimin-ished over time due to decrements (attrition) caused bythe risk in question and also subject to reduction due tocensoring (withdrawals)e corresponding table is thencalled a cohort life table It is more common however toestimate a px schedule from data collected for the mem-bers of a population during a limited time period and touse the mechanics of life-table construction to produce aperiod life table from the px valuesLife table techniques are described in detail in most

introductory textbooks in actuarial statistics7biostatistics7demography and epidemiology See eg Chiang ()Elandt-Johnson and Johnson () Manton and Stallard() Preston et al () For the history of thetopic consult Seal () Smith and Keytz () andDupacircquier ()

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika 64:429–439
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of Rᵏ and the variable y is assumed to take values in a subset of Rⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ),  (1)

where c can depend on y but not on θ. In more general settings where the model is semiparametric or nonparametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart (2002) and Murphy and Van der Vaart (1997). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y}

and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme that recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success, θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−n log σ − (1/2σ²) ∑(y_i − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y_1, …, y_n, and it is not difficult to show that this likelihood function only depends on the sample through the sample mean ȳ = n⁻¹∑y_i and sample variance s² = (n − 1)⁻¹∑(y_i − ȳ)², or equivalently through ∑y_i and ∑y_i². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
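For the binomial model above, the maximizing value y/n is easy to check numerically. This is a tiny sketch with hypothetical data, not part of the original article.

import numpy as np

y, n = 7, 20                                   # hypothetical: 7 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
print(theta[np.argmax(loglik)], y / n)         # both close to 0.35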

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ̂)/L(θ) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall (1997) and Edwards (1992), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ | y) = L(θ; y)π(θ) / ∫_θ L(θ; y)π(θ)dθ,

where π(θ) is the density for the prior measure and π(θ | y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials, and assuming π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of the prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = ∂ log L(θ; Y)/∂θ.

If Y = (Y_1, …, Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz (2006) for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →_d χ²_k,  (2)
U(θ)ᵀ i⁻¹(θ)U(θ) →_d χ²_k,  (3)
2{ℓ(θ̂) − ℓ(θ)} →_d χ²_k,  (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result: that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²_k distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →_d χ²_{k₁}.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = ∑(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
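As a sketch of the profile construction, consider the normal model with ψ = µ as the parameter of interest and λ = σ² as the nuisance parameter; here the constrained maximum likelihood estimate has the closed form σ̂²_µ = n⁻¹∑(y_i − µ)², and the χ²₁ calibration gives an approximate confidence interval. The data and grid below are purely illustrative.

import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=30)              # hypothetical sample
n = len(y)

def profile_loglik(mu):
    sigma2_mu = np.mean((y - mu) ** 2)         # constrained MLE of sigma^2
    return -0.5 * n * np.log(sigma2_mu) - 0.5 * n

mu_grid = np.linspace(y.mean() - 2.0, y.mean() + 2.0, 401)
lp = np.array([profile_loglik(m) for m in mu_grid])
inside = 2.0 * (lp.max() - lp) <= 3.84         # chi^2_1 0.95 quantile
print(mu_grid[inside][[0, -1]])                # approximate 95% limits for mu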

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

π_M(ψ | y) ∝ ∫ π(ψ, λ | y)dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ)f₂(y₂ | y₁; λ), or
f(y; θ) = f₁(y₁ | y₂; ψ)f₂(y₂; λ).

A discussion of conditional inference and density factorizations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semiparametric likelihoods: the most famous semiparametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser 2003). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (1992), Azzalini (1996), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison and Reid (2007). A short review paper is Reid (2000). An excellent overview of consistency results for maximum likelihood estimators is Scholz (2006); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is an Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A (1996) Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc 57:269–306
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables. J R Stat Soc B 34:187–220 (with discussion)
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika 79:113–130
Edwards AF (1992) Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A 222:309–368
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat 11:59–67
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat 25:1471–1509
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75:237–249
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci 10:138–157
Reid N (2000) Likelihood. J Am Stat Assoc 95:1335–1340
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS (2003) Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B 65:391–404
Scholz F (2006) Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW (2002) Infinite-dimensional likelihood methods in statistics. Available at www.stieltjes.org
Varin C (2008) On composite marginal likelihoods. Adv Stat Anal 92:1–28

Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X_1, X_2, … is a sequence of i.i.d. random variables,

S_n = ∑_{j=1}^n X_j,

and the expectation a = EX exists, then n⁻¹S_n → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
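Both theorems are easy to visualize by simulation; the sketch below (not part of the original entry) uses exponential summands purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
a, sigma = 1.0, 1.0                    # Exp(1): mean 1, standard deviation 1
n, reps = 10_000, 5_000

X = rng.exponential(a, size=(reps, n))
print(X[0].mean())                     # LLN: close to a = 1

zeta = (X.sum(axis=1) - n * a) / (sigma * np.sqrt(n))
print(zeta.mean(), zeta.std())         # CLT: close to 0 and 1
print((zeta < 1.96).mean())            # close to Phi(1.96) = 0.975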

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) := P(X > x) + P(X < −x) is a function regularly varying at infinity and such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on the convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on the X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.

6. Considering observations X_1, …, X_n of a more complex nature, first of all multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X − EX)(X − EX)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X_1, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN, and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN, since, as n → ∞,

a* = (1/n) ∑_{j=1}^n X_j → a (a.s.),
(σ²)* = (1/n) ∑_{j=1}^n (X_j − a*)² = (1/n) ∑_{j=1}^n X_j² − (a*)² → σ² (a.s.).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov 1998). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX_j = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{µ(X−θ)} = 1 has a solution µ₀ > 0, then in the above example one has

P(S > x) ∼ c e^{−µ₀x}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{µ(X−θ)} = ∞ for any µ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x.

For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes

ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n), t ∈ [0, 1],

converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S_1, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P (1968) Convergence of probability measures. Wiley, New York
Borovkov AA (1998) Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (1977–1978) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i | b_i ∼ N(X_iβ + Z_ib_i, Σ_i),  (1)
b_i ∼ N(0, D),  (2)

where X_i and Z_i are (n_i × p) and (n_i × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_iβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_iβ and with covariance matrix V_i = Z_iDZ_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_iDZ_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_iβ, V_i), V_i = Z_iDZ_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to DZ_i′V_i^{−1}(α)(y_i − X_iβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_iβ̂ + Z_ib̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_iV_i^{−1}X_iβ̂ + (I_{n_i} − Σ_iV_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_iβ̂ and the observed data y_i, with weights Σ_iV_i^{−1} and I_{n_i} − Σ_iV_i^{−1}, respectively. Note that the "numerator" of Σ_iV_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_iβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
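These quantities are straightforward to compute directly. The following minimal sketch (numpy only; the data, design, and variance components are hypothetical, and β is treated as known for simplicity) computes V_i, the empirical Bayes estimate, and the shrinkage-weighted prediction for a single random-intercept unit:

```python
import numpy as np

# Hypothetical random-intercept model for one unit with n_i = 4 measurements:
# y | b ~ N(X beta + Z b, sigma2 I),  b ~ N(0, d)
n_i = 4
X = np.column_stack([np.ones(n_i), np.arange(n_i)])  # intercept + time
Z = np.ones((n_i, 1))                                # random intercept only
beta = np.array([1.0, 0.5])                          # treated as known here
d, sigma2 = 2.0, 1.0                                 # variance components

rng = np.random.default_rng(0)
b = rng.normal(0.0, np.sqrt(d))
y = X @ beta + Z @ np.array([b]) + rng.normal(0.0, np.sqrt(sigma2), n_i)

# Marginal covariance V = Z D Z' + sigma2 I
D = np.array([[d]])
V = Z @ D @ Z.T + sigma2 * np.eye(n_i)

# Empirical Bayes estimate: E(b | y) = D Z' V^{-1} (y - X beta)
b_hat = (D @ Z.T @ np.linalg.solve(V, y - X @ beta)).item()

# Shrinkage form of the predicted profile, with Sigma = sigma2 I:
# y_hat = Sigma V^{-1} X beta + (I - Sigma V^{-1}) y
W = sigma2 * np.linalg.inv(V)
y_hat = W @ (X @ beta) + (np.eye(n_i) - W) @ y

print(f"true b = {b:.3f}, EB estimate = {b_hat:.3f}")      # shrunk toward prior mean 0
print(np.allclose(y_hat, X @ beta + Z.flatten() * b_hat))  # True: same prediction
```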

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies ω_1 ≠ ω_2, embedded in white noise background: y[t] = β_0 + β_1 sin[ω_1 t] + β_2 sin[ω_2 t] + ε. We write β_0 + β_1x_1 + β_2x_2 + ε for β = (β_0, β_1, β_2), x = (1, x_1, x_2), x_1 ≡ x_1(t) = sin[ω_1 t], x_2(t) = sin[ω_2 t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β_0 + β_1x_1 + β_2x_2, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0; Eε_iε_k ≡ σ² > 0 for i = k ≤ n; and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x at t_i. That is, we observe (y_i, sin[ω_1 t_i], sin[ω_2 t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k(t_i).

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view shortly following its discovery. Gauss's prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and he eventually identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence, we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

y_i = x_{i1}β_1 + ⋯ + x_{ip}β_p + ε_i, i ≤ n,  (1)

in which the random errors ε are assumed to satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n. The interpretation is that response y_i is observed for the ith sample in conjunction with numerical values (x_{i1}, ..., x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^tr x)^{−1} x^tr y = β + Mε,  (2)

for M = (x^tr x)^{−1} x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)^{−1} the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂_LS)_k = β_k, k ≤ p,  (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p.  (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
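A minimal numerical sketch of estimator (2) (numpy only; synthetic data, all values hypothetical), checking the purely algebraic identity β̂_LS = β + Mε:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta = np.array([2.0, -1.0, 0.5])                               # true coefficients
eps = rng.normal(scale=0.3, size=n)                             # iid errors
y = x @ beta + eps

# Ordinary least squares: beta_LS = (x'x)^{-1} x'y = beta + M eps
M = np.linalg.inv(x.T @ x) @ x.T
beta_ls = M @ y

print(beta_ls)
print(np.allclose(beta_ls, beta + M @ eps))  # True: the algebraic identity
```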

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M sums to zero (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_kε and M_k(ε + c) = M_kε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s²(·) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε.  (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
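A corresponding sketch of the generalized least squares solution (5), under an assumed known (heteroscedastic, diagonal) Σ; all values hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Errors with known covariance Sigma (here diagonal, increasing variances)
sigma = np.diag(np.linspace(0.5, 4.0, n))
eps = rng.multivariate_normal(np.zeros(n), sigma)
y = x @ beta + eps

# Generalized least squares: (x' Sigma^{-1} x)^{-1} x' Sigma^{-1} y
si = np.linalg.inv(sigma)
beta_gls = np.linalg.solve(x.T @ si @ x, x.T @ si @ y)
print(beta_gls)   # close to (1, 2)
```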

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_{i1}, ..., x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, ε_i are simply the residual discrepancies y − xβ̂_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as had been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.
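The two models suggest two different resampling schemes, sketched below on a synthetic stand-in for data like Galton's (slope and noise scale hypothetical): residual resampling for the errors model, pairs resampling for the sampling model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
w = rng.normal(size=n)
y = 0.33 * w + rng.normal(scale=0.9, size=n)   # hypothetical low-correlation data
x = np.column_stack([np.ones(n), w])

def ls(xm, yv):
    return np.linalg.lstsq(xm, yv, rcond=None)[0]

b_hat = ls(x, y)
resid = y - x @ b_hat

B = 2000
slopes_err, slopes_samp = np.empty(B), np.empty(B)
for b in range(B):
    # Errors model: resample residuals, keep the design fixed
    y_star = x @ b_hat + rng.choice(resid, size=n, replace=True)
    slopes_err[b] = ls(x, y_star)[1]
    # Sampling model: resample (w, y) pairs with replacement
    idx = rng.integers(0, n, size=n)
    slopes_samp[b] = ls(x[idx], y[idx])[1]

print(np.percentile(slopes_err, [2.5, 97.5]))   # errors-model interval
print(np.percentile(slopes_samp, [2.5, 97.5]))  # sampling-model interval
```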

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1{X > c} − 1{Y > c}. J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < c E[J], and so

E[Y 1{X > c}] = E[Y 1{Y > c} + YJ] = E[X 1{X > c}] + E[YJ] < E[X 1{X > c}] + c E[J] = E[X 1{X > c}],

yielding E(Y∣X > c) < E(X∣X > c). ◻
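A quick Monte Carlo check of the proposition, for hypothetical identically distributed, correlated bivariate normal scores:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
cov = [[1.0, 0.5], [0.5, 1.0]]            # X, Y identically distributed, correlated
x, y = rng.multivariate_normal([0, 0], cov, size=n).T

c = 1.0                                   # threshold with c > max(0, EX)
sel = x > c
print(y[sel].mean() < x[sel].mean())      # True: E(Y | X > c) < E(X | X > c)
```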

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x_1, ..., x_n) is a sample from a stochastic process x = {x_1, x_2, ...}. Let p_n(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n), θ_n)/p_n(x(n), θ)], where θ_n = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density p_n(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} (h^t J_n(θ) h) + o_p(1),  (1)

where S_n(θ) = d ln p_n(x(n), θ)/dθ, J_n(θ) = −d² ln p_n(x(n), θ)/(dθ dθ^t), and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)),  (ii) n^{−1} J_n(θ) →p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See Le Cam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F^{−1}(θ)).  (3)

A large class of models involving classical iid (independent and identically distributed) observations is covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),  (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals
Suppose, conditionally on V = v, (x_1, x_2, ..., x_n) are iid N(θ, v^{−1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

p_n(x(n), θ) ∝ [1 + (1/2) Σ_{i=1}^n (x_i − θ)²]^{−(n/2+1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
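A simulation sketch of this example (numpy/scipy; note that with V exponential of mean 1, √n(x̄ − θ) = Z/√V has the t(2) distribution for every n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, n, reps = 0.0, 400, 20000

v = rng.exponential(scale=1.0, size=reps)              # V ~ Exp(mean 1)
xbar = theta + rng.normal(size=reps) / np.sqrt(n * v)  # xbar | V ~ N(theta, 1/(nV))
z = np.sqrt(n) * (xbar - theta)                        # should follow t with 2 df

print(np.quantile(z, [0.05, 0.5, 0.95]))
print(stats.t(df=2).ppf([0.05, 0.5, 0.95]))            # close agreement
```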

Example 2: Autoregressive process
Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, ..., with x_0 = 0, where the e_t are assumed to be iid N(0, 1) random variables. We then have

p_n(x(n), θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θx_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is θ̂_n = (Σ_{t=1}^n x_t x_{t−1})(Σ_{t=1}^n x_{t−1}²)^{−1}. One can verify that √n(θ̂_n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1}θ^n(θ̂_n − θ) →d Cauchy for ∣θ∣ > 1.
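A short simulation of this estimator in the stationary case (numpy only; parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.6, 500, 5000
est = np.empty(reps)
for r in range(reps):
    e = rng.normal(size=n)
    x = np.zeros(n + 1)                    # x_0 = 0
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    est[r] = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])   # ML estimator

print(np.var(np.sqrt(n) * (est - theta)))  # approx 1 - theta**2 = 0.64
```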

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
Le Cam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.
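A minimal numerical check of the density relation above, taking scipy's logistic as the reduced density f:

```python
import numpy as np
from scipy import stats

a, b = 2.0, 3.0                   # location and scale
x = np.linspace(-10, 15, 7)

lhs = stats.logistic(loc=a, scale=b).pdf(x)
rhs = stats.logistic().pdf((x - a) / b) / b   # (1/b) f((x - a)/b)
print(np.allclose(lhs, rhs))                  # True
```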

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distributions of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the general least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X_{1:n}, X_{2:n}, ..., X_{n:n})′,  1 = (1, 1, ..., 1)′,  α = (α_{1:n}, α_{2:n}, ..., α_{n:n})′,
ε = (ε_1, ε_2, ..., ε_n)′,  θ = (a, b)′,  A = (1, α),

x = A θ + ε

with variancendashcovariance matrix

Var(x) = b B

e GLS estimator of θ is

θ = (AprimeΩA)minusAprimeΩ x

and its variancendashcovariance matrix reads

Var( θ ) = b (AprimeΩA)minus

e vector α and the matrix B are not always easy to ndFor only a few locationndashscale distributions like the expo-nential the reected exponential the extreme value thelogistic and the rectangular distributions we have closed-form expressions in all other cases we have to evalu-ate the integrals dening E(Yin) and E(Yin Yjn) Formore details on linear estimation and probability plot-ting for location-scale distributions and for distributionswhich can be transformed to locationndashscale type see Rinne() Maximum likelihood estimation for locationndashscaledistributions is treated by Mi ()
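For the rectangular (uniform) distribution the ingredients are available in closed form, E(Y_{i:n}) = i/(n + 1) and Cov(Y_{i:n}, Y_{j:n}) = i(n − j + 1)/[(n + 1)²(n + 2)] for i ≤ j, so the GLS estimator can be sketched directly (numpy; sample values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
a, b = 3.0, 2.0                         # true location and scale
x = np.sort(rng.uniform(a, a + b, n))   # order statistics of U(a, a + b)

i = np.arange(1, n + 1)
alpha = i / (n + 1)                     # means of reduced order statistics
I, J = np.meshgrid(i, i, indexing="ij")
lo, hi = np.minimum(I, J), np.maximum(I, J)
B = lo * (n - hi + 1) / ((n + 1) ** 2 * (n + 2))  # covariance matrix B

A = np.column_stack([np.ones(n), alpha])
Om = np.linalg.inv(B)                   # Omega = B^{-1}
theta = np.linalg.solve(A.T @ Om @ A, A.T @ Om @ x)
print(theta)                            # BLUE of (a, b), close to (3, 2)
```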

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University, and worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover. He joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute. He is Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case, this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log[P_i/(1 − P_i)] ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1x_{i1} + ⋯ + β_px_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[R_i] = m_iπ_i and Var(R_i) = m_iπ_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
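A simulation sketch of the logit-normal model against this moment approximation (numpy only; the values of m_i, η_i, and σ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
m, eta, sigma = 20, 0.5, 0.3          # trials, linear predictor, small sigma
reps = 200_000

z = rng.normal(size=reps)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))   # logit(P) = eta + sigma Z
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
var_approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), var_approx)            # close, for small sigma
print(m * pi * (1 - pi))              # plain binomial variance: smaller
```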

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i}(1 − π_i)^{1−y_i}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{[y_iθ_i − b(θ_i)]/α(ϕ) + c(y_i; ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln[π_i/(1 − π_i)] + ln(1 − π_i)}.  (3)

The link function is therefore ln[π/(1 − π)], and the cumulant −ln(1 − π), or ln[1/(1 − π)]. For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

rithm as well asMLE are both based on the log-likelihoodfunctione likelihood is simply a re-parameterization ofthe pdf which seeks to estimate π for example rather thanye log-likelihood is formed from the likelihood by tak-ing the natural log of the function allowing summation

L Logistic Regression

across observations during the estimation process ratherthan multiplication

e traditional GLM symbol for the mean micro is typ-ically substituted for π when GLM is used to estimate alogisticmodel In that form the log-likelihood function forthe binary-logistic model is given as

L(microi yi) =n

sumi=

yi ln(microi( minus microi)) + ln( minus microi) ()

or

L(microi yi) =n

sumi=

yi ln(microi) + ( minus yi) ln( minus microi) ()

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln[(1 − y_i)/(1 − μ_i)]}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β. However, for logistic models the response is expressed in terms of the link function, ln[μ_i/(1 − μ_i)]. We have, therefore,

x′_iβ = ln[μ_i/(1 − μ_i)] = β_0 + β_1x_1 + β_2x_2 + ⋯ + β_nx_n.  (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/[1 + exp(−x′_iβ)] = exp(x′_iβ)/[1 + exp(x′_iβ)].  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i)x_i,  (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n x_ix′_i μ_i(1 − μ_i).  (10)
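A compact sketch of such a Newton–Raphson scheme, built directly from the gradient (9) and Hessian (10) (numpy only; synthetic data, not any particular package's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + predictor
beta_true = np.array([-0.5, 1.2])
mu_true = 1 / (1 + np.exp(-(x @ beta_true)))
y = rng.binomial(1, mu_true)

beta = np.zeros(2)
for _ in range(25):                        # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-(x @ beta)))     # inverse link, eq. (8)
    grad = x.T @ (y - mu)                  # gradient, eq. (9)
    hess = -(x.T * (mu * (1 - mu))) @ x    # Hessian, eq. (10)
    step = np.linalg.solve(hess, grad)
    beta = beta - step                     # beta_new = beta - H^{-1} g
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))  # standard errors from the Hessian
print(beta, se)
```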

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln[μ/(1 − μ)], where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived    child    adults    Total
no             52      765       817
yes            57      442       499
Total         109     1207      1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (0.578)/(1.096), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships.
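These component odds computations are easy to reproduce (counts as tabulated above):

```python
# Odds of survival from the 2x2 table: odds = survived / died
odds_adult = 442 / 765           # about 0.578
odds_child = 57 / 52             # about 1.096
print(odds_adult / odds_child)   # odds ratio, about 0.527
```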

Using a logistic regression procedure to estimate the odds ratio of age reproduces this value, 0.527, together with its standard error, z statistic, p-value, and confidence interval.

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, y, indicating the number of successes for a specific covariate pattern, and a denominator, m, the number of observations having that specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = (m_i choose y_i) π_i^{y_i}(1 − π_i)^{m_i−y_i},  (11)

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln[μ_i/(1 − μ_i)] + m_i ln(1 − μ_i) + ln(m_i choose y_i)}.  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_iπ_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y    cases    x1    x2    x3

Here y indicates the number of times a specific pattern of covariates is successful, and cases is the number of observations having that specific covariate pattern. The first observation in the table informs us that there are three cases with the first pattern of predictor values x1, x2, x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

Fitting the grouped logistic model to these data yields an estimated odds ratio, standard error (OIM), z statistic, p-value, and confidence interval for each of the predictors x1, x2, and x3.

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many thousands of observations being grouped into a table of far fewer covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data are classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,  −∞ < x < ∞,

h(x) = 1 / (τ[1 + exp{−(x − μ)/τ}]).

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^α],  t ≥ 0; θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e [F(x)/{1 − F(x)}] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
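A minimal Python sketch of this simulation recipe, with illustrative (arbitrary) values of μ, τ, and the sample size:

```python
import numpy as np
rng = np.random.default_rng(0)

# Simulate from the logistic distribution via X = log(U/(1 - U)), U ~ U(0, 1),
# then scale and shift to get the general logistic variate tau*X + mu.
mu, tau = 2.0, 1.5
u = rng.uniform(size=100_000)
x = tau * np.log(u / (1 - u)) + mu

print(x.mean(), mu)                       # sample mean vs mu
print(x.var(), tau**2 * np.pi**2 / 3)     # sample variance vs tau^2 * pi^2 / 3
```

The sample moments should agree closely with the theoretical mean μ and variance τ²π²/3 quoted above.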

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = log_e [P(d)/{1 − P(d)}] = β₀ + β₁d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).
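The iteratively reweighted least-squares algorithm is easy to sketch directly. The following bare-bones Python fragment fits the linear logit model logit(P(d)) = β₀ + β₁d to hypothetical r-out-of-n dose–response data (the dose levels and counts are invented for illustration):

```python
import numpy as np

# Hypothetical grouped dose-response data: n[i] subjects at dose[i], r[i] respond.
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = np.array([40, 40, 40, 40, 40])
r = np.array([2, 8, 19, 30, 37])

X = np.column_stack([np.ones_like(dose), dose])   # design matrix for b0 + b1*d
beta = np.zeros(2)
for _ in range(25):                                # Fisher scoring / IRLS loop
    eta = X @ beta                                 # linear predictor
    p = 1 / (1 + np.exp(-eta))                     # inverse logit
    W = n * p * (1 - p)                            # binomial weights
    z = eta + (r - n * p) / W                      # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

b0, b1 = beta
print("mu =", -b0 / b1, " tau =", 1 / abs(b1))     # implied tolerance parameters
```

The last line recovers the tolerance-distribution parameters μ = −β₀/β₁ and τ = 1/∣β₁∣ given by the identity above.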

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected Member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

"Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

"A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant."

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.

The inequality can also be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912). This index is the ratio

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves, compared using the Gini index

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
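As a small illustration of these definitions, the following Python sketch computes the empirical Lorenz curve and the Gini index for a made-up vector of incomes (the data and the trapezoidal approximation are illustrative choices):

```python
import numpy as np

income = np.array([12.0, 20.0, 25.0, 40.0, 60.0, 95.0])   # hypothetical incomes

x = np.sort(income)
n = len(x)
p = np.concatenate([[0.0], np.arange(1, n + 1) / n])      # population shares
L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])       # Lorenz ordinates L(p)

# Gini index: area between the diagonal and the Lorenz curve,
# divided by the whole area under the diagonal (= 1/2).
area_under_L = np.sum((L[1:] + L[:-1]) * np.diff(p)) / 2  # trapezoidal rule
G = (0.5 - area_under_L) / 0.5
print("Gini index:", round(G, 3))
```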

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance. A small numerical sketch of case (I) is given below.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
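The numerical sketch promised above checks case (I) empirically: a concave net-income rule u, with u(x)/x decreasing, lifts the Lorenz curve pointwise. The incomes and the tax rule are invented for illustration:

```python
import numpy as np

def lorenz(x):
    x = np.sort(x)
    return np.concatenate([[0.0], np.cumsum(x) / x.sum()])

income = np.array([12.0, 20.0, 25.0, 40.0, 60.0, 95.0])   # hypothetical incomes
u = income - 0.3 * income**2 / income.max()   # increasing, with u(x)/x decreasing

LX, LY = lorenz(income), lorenz(u)
print(np.all(LY >= LX))   # True: post-tax incomes Lorenz-dominate pre-tax incomes
```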

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations," and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and is Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

L Loss Function

Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is then zero; otherwise, the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative, Θ = Θ₀ + Θ₁. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁;
L(θ, ϑ) = 1 if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian (1975) introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
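The loss functions above are straightforward to write down; the following Python sketch collects them (the parameter values γ, a, b are illustrative choices):

```python
import numpy as np

def squared(t):
    return t**2                                   # classical squared loss

def l1(t):
    return np.abs(t)                              # robust L1 loss

def huber(t, gamma=1.0):
    # t^2 inside [-gamma, gamma], linear growth 2*gamma*|t| - gamma^2 outside
    return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):
    # Varian's asymmetric LinEx loss: exponential on one side, linear on the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-3.0, 3.0, 7)
print(huber(t))
print(linex(t))
```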

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties of loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see the Bischoff reference below.

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, ..., yₙ, observed at design points t₁, ..., tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes Σⁿᵢ₌₁ ℓ(rᵢ^f), f ∈ F, where rᵢ^f = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W. Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam



Least Squares

Czesław Stepniak
Professor
Maria Curie-Skłodowska University, Lublin, Poland
University of Rzeszów, Rzeszów, Poland

The Least Squares (LS) problem involves some algebraic and numerical techniques used in "solving" overdetermined systems F(x) ≈ b of equations, where b ∈ Rⁿ, while F(x) is a column of the form

F(x) = (f₁(x), ..., fₙ(x))ᵀ,

with entries fᵢ = fᵢ(x), i = 1, ..., n, where x = (x₁, ..., xₚ)ᵀ. The LS problem is linear when each fᵢ is a linear function, and nonlinear if not.

The linear LS problem refers to a system Ax = b of linear equations. Such a system is overdetermined if n > p. If b ∉ range(A), the system has no proper solution and will be denoted by Ax ≈ b. In this situation we seek a solution of some optimization problem. The name "Least Squares" is justified by the l₂-norm commonly used as a measure of imprecision.

The LS problem has a clear statistical interpretation in regression terms. Consider the usual regression model

yᵢ = fᵢ(xᵢ₁, ..., xᵢₚ; β₁, ..., βₚ) + eᵢ, for i = 1, ..., n,  (1)

where xᵢⱼ, i = 1, ..., n, j = 1, ..., p, are some constants given by the experimental design, fᵢ, i = 1, ..., n, are given functions depending on the unknown parameters βⱼ, j = 1, ..., p, while yᵢ, i = 1, ..., n, are the values of these functions observed with some random errors eᵢ. We want to estimate the unknown parameters βⱼ on the basis of the data set {xᵢⱼ, yᵢ}.

In linear regression each fᵢ is a linear function of the type fᵢ = Σᵖⱼ₌₁ cᵢⱼ(x₁, ..., xₙ)βⱼ, and the model (1) may be presented in vector-matrix notation as

y = Xβ + e,

where y = (y₁, ..., yₙ)ᵀ, e = (e₁, ..., eₙ)ᵀ, and β = (β₁, ..., βₚ)ᵀ, while X is an n × p matrix with entries xᵢⱼ. If e₁, ..., eₙ are not correlated, with mean zero and a common (perhaps unknown) variance, then the problem of Best Linear Unbiased Estimation (BLUE) of β reduces to finding a vector β̂ that minimizes the norm

∣∣y − Xβ∣∣² = (y − Xβ)ᵀ(y − Xβ).

Such a vector is said to be the ordinary LS solution of the overparametrized system Xβ ≈ y. On the other hand, the latter reduces to solving the consistent system

XᵀXβ = Xᵀy

of linear equations, called the normal equations. In particular, if rank(X) = p, then the system has a unique solution of the form

β̂ = (XᵀX)⁻¹Xᵀy.

For linear regression yᵢ = α + βxᵢ + eᵢ with one regressor x, the BLU estimators of the parameters α and β may be presented in the convenient form

β̂ = ns_xy / ns_x²  and  α̂ = ȳ − β̂x̄,

where

ns_xy = Σᵢ xᵢyᵢ − (Σᵢ xᵢ)(Σᵢ yᵢ)/n,  ns_x² = Σᵢ xᵢ² − (Σᵢ xᵢ)²/n,  x̄ = (Σᵢ xᵢ)/n,  ȳ = (Σᵢ yᵢ)/n.

For its computation we only need to use a simple pocket calculator.

Example. Data of the following kind can be handled this way: the number of residents in thousands (x) and the unemployment rate in % (y) for a set of cities of Poland, with the task of estimating the parameters β and α. From the data pairs (xᵢ, yᵢ) one computes Σᵢ xᵢ, Σᵢ yᵢ, Σᵢ xᵢ², and Σᵢ xᵢyᵢ, hence ns_x² and ns_xy, and finally the estimates β̂ and α̂, giving the fitted line f(x) = β̂x + α̂.
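The same computational formulas translate directly into a short Python sketch; the (x, y) values below are invented for illustration:

```python
import numpy as np

x = np.array([50.0, 120.0, 200.0, 350.0, 700.0])   # residents, thousands (invented)
y = np.array([14.0, 12.5, 11.0, 9.0, 6.5])         # unemployment rate, % (invented)

n = len(x)
nsxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # n * s_xy
nsx2 = np.sum(x**2) - np.sum(x)**2 / n             # n * s_x^2
beta = nsxy / nsx2
alpha = y.mean() - beta * x.mean()
print(f"f(x) = {beta:.4f} x + {alpha:.2f}")
```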


If the variance–covariance matrix of the error vector e coincides (except for a multiplicative scalar σ²) with a positive definite matrix V, then Best Linear Unbiased estimation reduces to the minimization of (y − Xβ)ᵀV⁻¹(y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form

β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known; instead, some algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

∣∣[A, b] − [Â, b̂]∣∣_F

over all Â ∈ Rⁿˣᵏ and b̂ ∈ range(Â), where the Frobenius norm is defined by ∣∣(aᵢⱼ)∣∣²_F = Σᵢⱼ aᵢⱼ². Once a minimizing [Â, b̂] is found, any x satisfying Âx = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique; simple low-dimensional examples of both phenomena can be constructed.

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction b − Âx has smaller l₂-norm. The main tool in solving TLS problems is the following Singular Value Decomposition: for any n × k matrix A with real entries there exist orthogonal matrices P = [p₁, ..., pₙ] and Q = [q₁, ..., qₖ] such that

PᵀAQ = diag(σ₁, ..., σₘ), where σ₁ ≥ ⋯ ≥ σₘ and m = min{n, k}.
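A common way to exploit this decomposition, sketched below in Python, is to take the SVD of the augmented matrix [A, b] and read the TLS solution off the right singular vector belonging to the smallest singular value. This assumes the generic case in which that singular value is simple and the last component of the vector is nonzero; the data are invented:

```python
import numpy as np

A = np.array([[1.0, 0.9], [2.1, 1.8], [2.9, 3.2], [4.2, 3.9]])
b = np.array([2.0, 4.1, 5.9, 8.2])

M = np.column_stack([A, b])        # augmented matrix [A, b]
U, s, Vt = np.linalg.svd(M)
v = Vt[-1]                         # right singular vector of the smallest sigma
x_tls = -v[:-1] / v[-1]            # TLS solution (assumes v[-1] != 0)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print("TLS:", x_tls, " ordinary LS:", x_ls)
```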

About the Author
For biography see the entry 7Random Variable.

Cross References
7Absolute Penalty Estimation
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Autocorrelation in Regression
7Best Linear Unbiased Estimation in Linear Models
7Estimation
7Gauss–Markov Theorem
7General Linear Models
7Least Absolute Residuals Procedure
7Linear Regression Models
7Nonlinear Models
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Simple Linear Regression
7Statistics, History of
7Two-Stage Least Squares

References and Further Reading
Björck Å (1996) Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T (2004) Generalized least squares. Wiley, New York
Rao CR, Toutenburg H (1999) Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J (1991) The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J (2006) Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

Introduction
Lévy processes have become increasingly popular in engineering (reliability, dams, telecommunication) and mathematical finance. Their applications in reliability stem from the fact that they provide a realistic model for the degradation of devices, while their applications in the mathematical theory of dams arise because they provide a basis for describing the water input of dams. Their popularity in finance is because they describe the financial markets in a more accurate way than the celebrated Black–Scholes model. The latter model assumes that the rates of return on assets are normally distributed; thus the process describing the asset price over time is a continuous process. In reality, asset prices have jumps or spikes, and asset returns exhibit fat tails and 7skewness, which negates the normality assumption inherent in the Black–Scholes model. Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {Xₜ, t ≥ 0} that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of Xₜ₊ₛ − Xₜ is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, Xₜ₊ₛ − Xₜ is independent of (Xᵤ, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0, lim_{s→t} P(∣Xₜ − Xₛ∣ > є) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If Φₜ(z) is the characteristic function of a Lévy process at time t, then its characteristic exponent φ(z) = ln Φₜ(z)/t is of the form

φ(z) = iza − z²b/2 + ∫_R [exp(izx) − 1 − izx I{∣x∣<1}] ν(dx),

where a ∈ R, b ∈ R₊, and ν is a measure on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞.

The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows. Given any a ∈ R, b ∈ R₊, and measure ν on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square-integrable martingale. The characteristic exponents of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z), and φ⁽³⁾(z), respectively) are as follows:

φ⁽¹⁾(z) = iza − z²b/2,
φ⁽²⁾(z) = ∫_{∣x∣≥1} (exp(izx) − 1) ν(dx),
φ⁽³⁾(z) = ∫_{∣x∣<1} (exp(izx) − 1 − izx) ν(dx).

Examples of Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift μ and volatility rate σ² if a = μ, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift μ > 0 and volatility rate σ². For any x > 0, let T_x = inf{t : Xₜ > x}. Then {T_x} is an increasing Lévy process (called the inverse Brownian process); its Lévy measure is given by

ν(dx) = (2πσ²x³)^{−1/2} exp(−xμ²/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ xν(dx) = 0, and its Lévy measure is given by

ν(dx) = (α/x) exp(−x/β) dx,  x > 0,

where α, β > 0. It follows that the mean term (E(X₁)) and the variance term (V(X₁)) of the process are equal to αβ and αβ², respectively.

The following is a simulated sample path of a gamma process (Fig. 1).

Lévy Processes. Fig. 1 A simulated sample path of a gamma process
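A sample path of this kind can be generated from independent gamma increments, as in the following Python sketch (the parameter values and time grid are illustrative choices):

```python
import numpy as np
rng = np.random.default_rng(0)

# Gamma process on [0, T]: independent increments with
# X_{t+dt} - X_t ~ Gamma(shape = alpha*dt, scale = beta).
alpha, beta = 1.0, 1.0
T, n = 10.0, 1000
dt = T / n
increments = rng.gamma(shape=alpha * dt, scale=beta, size=n)
path = np.concatenate([[0.0], np.cumsum(increments)])
print("X_T =", path[-1], " E[X_T] =", alpha * beta * T)
```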

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes, or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: μ – the drift term of the Brownian process, σ – the volatility of the Brownian process, and ν – the variance term of the gamma process.

The following are two simulated sample paths, one for a Brownian motion with given drift term μ and volatility term σ, and the other for a variance gamma process with the same values of the drift and volatility terms (Fig. 2).

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths
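The subordination construction described above can be sketched the same way: replace the time of a drifted Brownian motion by a gamma process with mean rate 1 and variance rate ν (the parameter values are illustrative):

```python
import numpy as np
rng = np.random.default_rng(1)

mu, sigma, nu = 0.0, 0.3, 0.5
T, n = 10.0, 1000
dt = T / n

# Gamma time change with E[G_t] = t and Var[G_t] = nu*t.
dG = rng.gamma(shape=dt / nu, scale=nu, size=n)
# Conditionally on dG, the VG increment is N(mu*dG, sigma^2*dG).
dX = mu * dG + sigma * np.sqrt(dG) * rng.standard_normal(n)
vg_path = np.concatenate([[0.0], np.cumsum(dX)])
print(vg_path[-1])
```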

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Cinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies, and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS. Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M. Optimal control of a dam using PMλτ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P (2004) Financial modelling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B (2008) Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (eₓ) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

eₓ = Tₓ / lₓ,

where x is age, Tₓ is the number of person-years lived at age x and over, and lₓ is the number of survivors at age x according to the life table.
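For instance, given life-table columns lₓ and Tₓ (the numbers below are invented, toy values), the computation of eₓ is immediate:

```python
# l[x]: survivors at exact age x;  T[x]: person-years lived at age x and over.
lx = {0: 100000, 1: 99200, 65: 84000}
Tx = {0: 7800000, 1: 7700500, 65: 1520000}

ex = {x: Tx[x] / lx[x] for x in lx}
print(ex)   # life expectancy at ages 0, 1 and 65
```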

Life expectancy can be calculated for combined sexes or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth (e₀) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e₀ = T₀ / l₀,

where T₀ is the total number of person-years lived by the population and l₀ is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e₀ < e₁ < e₂, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and of socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was below 30 years; since then, life expectancy at birth has risen to a global average of well over 60 years, and to more than 80 years in the richest countries (Riley 2001). In most industrialized countries in the early twenty-first century, life expectancy averaged about 80 years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents, from very low levels in some central African populations to 80 years and above in many European countries. The more developed regions have a markedly higher average life expectancy than the less developed regions. The two continents that display the most extreme difference in life expectancies are North America and Africa, where the gap between the average life expectancies is currently the largest in the world (UN data).

Countries with the highest life expectancies in the world include Australia, Iceland, Italy, Switzerland, and, above all, Japan; Japanese women enjoy the longest life expectancy in the world (WHO 2009).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is drastically lower. Some of the world's lowest life expectancies are found in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries women live longer than men; worldwide, the gap between female and male life expectancy at birth is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes (more than ten years less for men), whereas Tonga is unusual in that life expectancy for males exceeds that for females (WHO 2009).

Life expectancy is assumed to rise continuously. According to UN estimates, global life expectancy at birth is likely to rise to an average of about 75 years by 2045–2050. Long-range United Nations population projections predict that by 2300 people can on average expect to live more than 85 years, ranging from 87 years (Liberia) up to 106 years (Japan).

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz (1968, 1985) and Preston et al. (2001).

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N (1985) Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization (2009) World health statistics. Geneva
World Population Prospects, Highlights. United Nations, New York
World Population to 2300 (2004) United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

μ(x) = −(d/dx) ln{1 − F(x)} = f(x) / {1 − F(x)}.

Heuristically it is interpreted by the relation μ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely, F(x) = 1 − exp{−∫₀ˣ μ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of ℓ(0) newborn babies. The survival probability is

ₜpₓ = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ μ(x + s)ds},

and the non-survival probability is (the converse) ₜqₓ = 1 − ₜpₓ. For t = 1 one writes qₓ = ₁qₓ and pₓ = ₁pₓ. In particular, we get ℓ(x + 1) = ℓ(x)pₓ. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of pₓ for all relevant x.

The life expectancy is e°₀ = EX = ∫₀^∞ ℓ(x)dx / ℓ(0). (The subscript 0 in e°₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°ₓ = E(X − x ∣ X > x) = ∫₀^∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function μ(x) is piecewise constant, which means that we take it to equal some constant, say μⱼ, over each interval (xⱼ, xⱼ₊₁), for some partition {xⱼ} of the age axis. For a collection {Xᵢ} of independent observations of X, let Dⱼ be the number of the Xᵢ that fall in the interval (xⱼ, xⱼ₊₁). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dⱼ is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xᵢ − xⱼ time units during the interval. Individuals who survive the interval will have lived for xⱼ₊₁ − xⱼ time units in the interval, and individuals who do not survive to age xⱼ will not have lived during this interval at all. When we aggregate the time units lived in (xⱼ, xⱼ₊₁) over all individuals, we get a total Rⱼ, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters μⱼ, the collection {Dⱼ, Rⱼ} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation

ln Λ = Σⱼ{−μⱼRⱼ + Dⱼ ln μⱼ},

which is easily seen to be maximized by μ̂ⱼ = Dⱼ/Rⱼ. The latter fraction is therefore the maximum-likelihood estimator for μⱼ. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function μₓ = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators; for an example, see the Forsén reference below.) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by μ̂ₓ in this interval, then p̂ₓ = e^{−μ̂ₓ} is an estimator of the single-year survival probability pₓ. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ(x + 1) = ℓ(x)p̂ₓ for x = 0, 1, ..., and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be ages at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a pₓ schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the pₓ values.
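The occurrence–exposure estimates and the survival recursion just described translate directly into code; in the Python sketch below, the deaths and exposures are invented single-year values, and the radix is the conventional 100,000:

```python
import numpy as np

D = np.array([30, 4, 3, 2, 2])                             # deaths in [x, x+1)
R = np.array([9900.0, 9880.0, 9870.0, 9865.0, 9860.0])     # exposures (person-years)

mu_hat = D / R               # maximum-likelihood estimates mu_x = D_x / R_x
p_hat = np.exp(-mu_hat)      # single-year survival probabilities p_x
ell = [100000.0]             # radix l(0)
for p in p_hat:
    ell.append(ell[-1] * p)  # recursion l(x+1) = l(x) * p_x
print(np.round(ell, 1))
```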

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang (1984), Elandt-Johnson and Johnson (1980), Manton and Stallard (1984), and Preston et al. (2001). For the history of the topic, consult Seal (1977), Smith and Keyfitz (1977), and Dupâquier (1996).

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L. The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in the history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of Rᵏ, and the variable y is assumed to take values in a subset of Rⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ),  (1)

where c can depend on y but not on θ. In more general settings, where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart (1998) and Murphy and Van der Vaart (2000). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = C(n, y) θʸ(1 − θ)ⁿ⁻ʸ, y = 0, 1, ..., n, θ ∈ [0, 1], where C(n, y) denotes the binomial coefficient, then the likelihood function is (any function proportional to)

L(θ; y) = θʸ(1 − θ)ⁿ⁻ʸ,

and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−(n/2) log σ² − (1/(2σ²)) Σ(yᵢ − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y₁, ..., yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
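A quick numerical illustration of the binomial likelihood above, with invented data y = 7 successes in n = 20 trials:

```python
import numpy as np

y, n = 7, 20
theta = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
rel_lik = np.exp(loglik - loglik.max())    # standardized by its maximum value

print("grid maximizer:", theta[np.argmax(loglik)], " exact MLE y/n:", y / n)
```

Plotting rel_lik against theta gives the standardized likelihood curve discussed above, with its maximum of 1 at θ̂ = y/n.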

Inferencee likelihood function was dened in a seminal paperof Fisher () and has since become the basis for mostmethods of statistical inference One version of likelihoodinference suggested by Fisher is to use some rule suchas L(θ)L(θ) gt k to dene a range of ldquolikelyrdquo or ldquoplau-siblerdquo values of θ Many authors including Royall ()and Edwards () have promoted the use of plots ofthe likelihood function along with interval estimates ofplausible valuesis approach is somewhat limited how-ever as it requires that θ have dimension or possibly or that a likelihood function can be constructed that onlydepends on a component of θ that is of interest see sectionldquo7Nuisance Parametersrdquo belowIn general we would wish to calibrate our inference

for θ by referring to the probabilistic properties of theinferential method One way to do this is to introduce aprobability measure on the unknown parameter θ typi-cally called a prior distribution and use Bayesrsquo rule forconditional probabilities to conclude

π(θ ∣ y) = L(θ y)π(θ)intθL(θ y)π(θ)dθ

where π(θ) is the density for the prior measure and π(θ ∣y) provides a probabilistic assessment of θ aer observingY = y in the model f (y θ) We could then make con-clusions of the form ldquohaving observed successes in trials and assuming π(θ) = the posterior probabilitythat θ gt is rdquo and so on

is is a very brief description of Bayesian inference inwhich probability statements refer to that generated from

the prior through the likelihood to the posterior A majordiculty with this approach is the choice of prior prob-ability function In some applications there may be anaccumulation of previous data that can be incorporatedinto a probability distribution but in general there is notand some rather ad hoc choices are oen made Anotherdiculty is themeaning to be attached to probability state-ments about the parameterInference based on the likelihood function can also be

calibrated with reference to the probability model f (y θ)by examining the distribution ofL(θY) as a random func-tion or more usually by examining the distribution ofvarious derived quantitiesis is the basis for likelihoodinference from a frequentist point of view In particularit can be shown that logL(θY)L(θY) where θ =θ(Y) is the value of θ at which L(θY) is maximized isapproximately distributed as a χk random variable wherek is the dimension of θ To make this precise requires anasymptotic theory for likelihood which is based on a cen-tral limit theorem (see 7Central Limiteorems) for thescore function

U(θ; Y) = ∂ log L(θ; Y)/∂θ.

If Y = (Y₁, …, Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f (y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.
Assuming that θ̂ is consistent for θ and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ̂)(θ̂ − θ) →d χ²_k, (1)

U(θ)^T i⁻¹(θ)U(θ) →d χ²_k, (2)

2{ℓ(θ̂) − ℓ(θ)} →d χ²_k, (3)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the 7chi-square distribution with k degrees of freedom.
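A small simulation makes result (3) concrete. The sketch below (Python with NumPy/SciPy; the values n = 200, θ = 0.4 and the Bernoulli model are invented for illustration) checks that 2{ℓ(θ̂) − ℓ(θ)} behaves like a χ²₁ variable.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, theta0, reps = 200, 0.4, 5000
w = np.empty(reps)
for r in range(reps):
    y = rng.binomial(n, theta0)           # number of successes in n trials
    th = min(max(y / n, 1e-9), 1 - 1e-9)  # MLE, kept strictly inside (0, 1)
    ll = lambda t: y * np.log(t) + (n - y) * np.log(1 - t)
    w[r] = 2 * (ll(th) - ll(theta0))      # log-likelihood ratio statistic
print(np.quantile(w, 0.95), stats.chi2(1).ppf(0.95))  # both near 3.84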

These results are all versions of a more general result: that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically


normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²_k distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.
Suppose in the model f (y; θ) that θ = (ψ, λ), where ψ

is a k-dimensional parameter of interest (where k will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.
It can be verified, under suitable smoothness conditions,

that results similar to those at (1)–(3) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²_k.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = ∑(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
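The divisor-n versus divisor-(n − p) point is easy to check numerically. The following sketch (Python/NumPy, with simulated data and invented dimensions n = 50, p = 4) compares the profile (ML) estimate of σ² with the usual unbiased one; here β̂ plays the role of λ̂_ψ, since the β maximizing the likelihood does not depend on σ².

import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 4, 2.0
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(p - 1)])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # constrained MLE of beta (same for every sigma)
rss = np.sum((y - X @ beta_hat) ** 2)
print(rss / n)        # profile / ML estimate of sigma^2: divisor n
print(rss / (n - p))  # usual unbiased estimate: divisor n - p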

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

π_M(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f (y; θ) = f₁(y₁; ψ) f₂(y₂ ∣ y₁; λ) or f (y; θ) = f₁(y₁ ∣ y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (1)–(3); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison and Reid (2007). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). She served as President of the Institute of Mathematical Statistics (–). Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation (), "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada () and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–), and Metrika (–).

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat –
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci –
Reid N (2000) Likelihood. J Am Stat Assoc –
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS (2003) Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial (accessed Aug)
Varin C (2008) On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of the Probability and Statistics Chair
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.
By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, X₃, … is a sequence of iid random variables,

Sₙ = ∑_{j=1}^{n} Xⱼ,

and the expectation a = EX₁ exists, then n⁻¹Sₙ → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums Sₙ. The Central Limit Theorem (CLT) gives one a more precise approximation for Sₙ. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζₙ = (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζₙ < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity nEX₁ + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sₙ.
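A minimal simulation sketch (Python with NumPy/SciPy; the exponential population with mean 1 and the sample sizes are arbitrary choices) illustrates both approximations: the empirical means concentrate at a (LLN) and the standardized sums match the normal law (CLT).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps, a, sigma = 1000, 5000, 1.0, 1.0
X = rng.exponential(scale=a, size=(reps, n))     # iid, mean a = 1, variance sigma^2 = 1
Sn = X.sum(axis=1)
print(np.mean(np.abs(Sn / n - a) < 0.1))         # LLN: proportion close to 1
zeta = (Sn - n * a) / (sigma * np.sqrt(n))       # standardized sums zeta_n
print(np.mean(zeta < 1.0), stats.norm.cdf(1.0))  # CLT: both near 0.841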

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both the methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the Xⱼ's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = Xⱼ,ₙ forming the sum Sₙ depend not only on j but on n as well. In this case the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on the Xⱼ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of iid summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζₙ > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, Xₙ of a more complex nature, first of all multivariate random vectors. If Xⱼ ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xₙ) be a sample from a distribution F and F*ₙ(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*ₙ(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from the random sample X*ₙ of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX₁ and σ² = E(X₁ − a)² also follows from the LLN, since, as n → ∞,

a* = n⁻¹ ∑_{j=1}^{n} Xⱼ → a almost surely,

(σ²)* = n⁻¹ ∑_{j=1}^{n} (Xⱼ − a*)² = n⁻¹ ∑_{j=1}^{n} Xⱼ² − (a*)² → σ² almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ()). Furthermore, in

estimation theory and hypotheses testing, one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.
It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.
Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question.
At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then, in a number of cases, one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk), under the assumption that EX = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
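The Kingman–Prokhorov approximation can be checked by simulation. The sketch below (Python/NumPy; the drift θ = 0.05 and the finite horizon are invented choices) truncates the supremum to a long finite walk, which is adequate because the maximum of a negatively drifting walk is essentially attained early.

import numpy as np

rng = np.random.default_rng(4)
theta, sigma, steps, reps = 0.05, 1.0, 5000, 4000
S = np.empty(reps)
for r in range(reps):
    walk = np.cumsum(rng.normal(loc=-theta, scale=sigma, size=steps))
    S[r] = max(walk.max(), 0.0)          # sup over k >= 0; S_0 = 0 is included
x = 1.0
print(np.mean(theta * S > x), np.exp(-2 * x / sigma**2))   # both near 0.135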

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ > 0, then in the above example one has

P(S > x) ∼ c e^{−μx}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_{x}^{∞} (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x.
For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζₙ(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sₙ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^{∞} crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.
In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*ₙ(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*ₙ(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov has been Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), since , and Head of the Probability and Statistics Chair at the Novosibirsk University since . He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences (), and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one


or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yᵢ ∣ bᵢ ∼ N(Xᵢβ + Zᵢbᵢ, Σᵢ), (1)
bᵢ ∼ N(0, D), (2)

where Xᵢ and Zᵢ are (nᵢ × p)- and (nᵢ × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is a (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ will not depend upon i. Finally, bᵢ is a vector of subject-specific, or random, effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bᵢ), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢ′ + Σᵢ.
Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢ′ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²Iₙᵢ. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yᵢ ∼ N(Xᵢβ, Vᵢ), Vᵢ = ZᵢDZᵢ′ + Σᵢ,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vᵢ.

With respect to the estimation of unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢ′Vᵢ⁻¹(α)(yᵢ − Xᵢβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile. It can easily be shown that

ŷᵢ = ΣᵢVᵢ⁻¹Xᵢβ̂ + (Iₙᵢ − ΣᵢVᵢ⁻¹) yᵢ,

which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and Iₙᵢ − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" of ΣᵢVᵢ⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vᵢ. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂ᵢ) ≤ var(λ′bᵢ) = λ′Dλ.
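A compact numerical sketch of this shrinkage property (Python/NumPy; a random-intercept model with invented values m = 30 units, nᵢ = 5, D = d = 1, σ² = 4, and, for simplicity, the true β used in place of its ML estimate):

import numpy as np

rng = np.random.default_rng(5)
m, ni, beta = 30, 5, np.array([2.0])       # m units, ni obs each, intercept-only fixed effect
d, sig2 = 1.0, 4.0                         # D = d (random-intercept variance), Sigma_i = sig2 * I
Xi = np.ones((ni, 1)); Zi = np.ones((ni, 1))
Vi = d * Zi @ Zi.T + sig2 * np.eye(ni)     # marginal covariance Z D Z' + Sigma
Vinv = np.linalg.inv(Vi)

b = rng.normal(scale=np.sqrt(d), size=m)   # true random intercepts
y = beta[0] + b[:, None] + rng.normal(scale=np.sqrt(sig2), size=(m, ni))

# empirical Bayes estimates D Z' V^{-1} (y_i - X_i beta)
b_hat = np.array([(d * Zi.T @ Vinv @ (y[i] - Xi @ beta))[0] for i in range(m)])
print(b_hat.var(), d)                      # shrinkage: var of the EB estimates is below d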

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f (x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references, at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies, embedded in white noise background: y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (x₀, x₁, x₂), x₀ ≡ 1, x₁(t) = sin[ω₁t], x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n, and Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated and that correlation depends upon the x values). What we observe are yᵢ and associated values of the independent variables x. That is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {yᵢ, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are xᵢₖ = xₖ[tᵢ].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun, after it had been lost to view shortly following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Keppler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).
By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1 then, for a value w > Ew, the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1885 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).
Echoes of those long-ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f (x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time, based on the model.
If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (), Freedman (1991, 2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations in which the random errors ε are assumed to satisfy Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n:

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ, i ≤ n. (1)

The interpretation is that response yᵢ is observed for the ith sample, in conjunction with numerical values (xᵢ₁, …, xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^tr x)⁻¹ x^tr y = β + Mε, (2)

for M = (x^tr x)⁻¹ x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)⁻¹ the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂_LS)ₖ = βₖ, k ≤ p, (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂_LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)², for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
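The sketch below (Python/NumPy) simulates the sine-wave model from the opening example, with invented frequencies and coefficients, and computes β̂_LS = (x^tr x)⁻¹ x^tr y of equation (2); averaging over replications would exhibit the unbiasedness property (3).

import numpy as np

rng = np.random.default_rng(6)
n = 200
t = np.linspace(0.0, 10.0, n)
w1, w2, beta = 2.0, 5.0, np.array([1.0, 3.0, -2.0])   # hypothetical frequencies, coefficients

x = np.column_stack([np.ones(n), np.sin(w1 * t), np.sin(w2 * t)])  # design matrix
y = x @ beta + rng.normal(scale=0.5, size=n)

M = np.linalg.inv(x.T @ x) @ x.T   # the matrix M of equation (2)
beta_ls = M @ y                    # least squares = beta + M eps
print(beta_ls)                     # close to (1, 3, -2)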

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂_LS)ₖ = βₖ + Mₖε, and Mₖ(ε + c1) = Mₖε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ y = β + (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = ∑ᵢ βᵢwᵢ(t), t ∈ T,

where the wᵢ(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Jointly Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (1981) compares the analysis of two superficially similar but differing models:
Errors model: Model (1) above.
Sampling model: Data (xᵢ₁, …, xᵢₚ, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the εᵢ are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).
Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example, to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.
A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.
One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J = 1(X > c) − 1(Y > c) of indicator random variables. J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y 1(X > c) = E(Y 1(Y > c) + YJ) = E X 1(X > c) + E YJ < E X 1(X > c) + c EJ = E X 1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat –
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat –
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat –
Freedman D (1981) Bootstrapping regression models. Ann Stat –
Freedman D (1991) Statistical models and shoe leather. Sociol Methodol –
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat –
Galton F (1877) Typical laws of heredity. Nature –
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland –
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal –
Parzen E (1961) An approach to time series analysis. Ann Math Stat –
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat –
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xₙ) is a sample from a stochastic process x = {x₁, x₂, …}. Let pₙ(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n); θₙ)/pₙ(x(n); θ)], where θₙ = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density pₙ(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^{−1/2} h^t Sₙ(θ) − (1/2) n^{−1} h^t Jₙ(θ) h + o_p(1), (1)

where Sₙ(θ) = d ln pₙ(x(n); θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n); θ)/(dθ dθ^t), and

(i) n^{−1/2} Sₙ(θ) →d N_k(0, F(θ)), (ii) n^{−1} Jₙ(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.
For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂ₙ − θ) →d N_k(0, F⁻¹(θ)). (3)

A large class of models involving classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.
If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.
For the LAMN family, one can replace the norm √n by a random norm Jₙ^{1/2}(θ) to get the limiting normal distribution, viz.

Jₙ^{1/2}(θ)(θ̂ₙ − θ) →d N(0, I), (4)

where I is the identity matrix.
Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose that, conditionally on V = v, (x₁, x₂, …, xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n); θ) ∝ [1 + (1/2) ∑_{i=1}^{n} (xᵢ − θ)²]^{−(n/2+1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
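A simulation sketch of Example 1 (Python/NumPy/SciPy; θ = 0 and n = 400 are arbitrary): conditional on V, the sample mean is normal with variance 1/(nV), so √n(x̄ − θ) can be drawn directly and compared with t(2) tail probabilities.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
theta, n, reps = 0.0, 400, 20000
V = rng.exponential(scale=1.0, size=reps)                # mixing variable, mean 1
xbar = theta + rng.normal(size=reps) / np.sqrt(n * V)    # conditionally N(theta, 1/(n v))
z = np.sqrt(n) * (xbar - theta)                          # should be approximately t(2)
print(np.mean(np.abs(z) > 2), 2 * stats.t(2).sf(2))      # both near 0.18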

Example 2: Autoregressive process. Consider a first-order autoregressive process {xₜ} defined by xₜ = θx_{t−1} + eₜ, t = 1, 2, …, with x₀ = 0, where the {eₜ} are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n); θ) ∝ exp[−(1/2) ∑_{t=1}^{n} (xₜ − θx_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂ₙ = (∑_{t=1}^{n} xₜx_{t−1})(∑_{t=1}^{n} x²_{t−1})⁻¹.

One can verify that √n(θ̂ₙ − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹θⁿ(θ̂ₙ − θ) →d Cauchy for ∣θ∣ > 1.
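The explosive case can be simulated as below (Python/NumPy; θ = 1.05 and n = 200 are invented values); the statistic θⁿ(θ̂ₙ − θ)/(θ² − 1) should behave like a standard Cauchy variable, whose absolute value has median 1.

import numpy as np

rng = np.random.default_rng(10)
theta, n, reps = 1.05, 200, 5000
stat = np.empty(reps)
for r in range(reps):
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]           # AR(1) recursion, x_0 = 0
    th_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    stat[r] = theta**n / (theta**2 - 1) * (th_hat - theta)
print(np.median(np.abs(stat)))   # near 1, the median of |Cauchy|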

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department (–), Executive Editor of the Journal of Statistical Planning and Inference (–), on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected Member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
►Asymptotic Normality
►Optimal Statistical Inference in Financial Engineering
►Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x | a, b) = Pr(X ≤ x | a, b) = F((x − a)/b),  a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x | a, b) = dF_X(x | a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

f_X(x | a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

● Arc-sine distribution
● Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
● Cauchy and half-Cauchy distributions
● Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
● Ordinary and raised cosine distributions
● Exponential and reflected exponential distributions
● Extreme value distributions of the maximum and the minimum, each of type I
● Hyperbolic secant distribution
● Laplace distribution
● Logistic and half-logistic distributions
● Normal and half-normal distributions
● Parabolic distributions
● Rectangular or uniform distribution
● Semi-elliptical distribution
● Symmetric triangular distribution
● Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
● V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), and Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data; the estimates of a and b will then be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, ..., X_{n:n})^T,  1 = (1, 1, ..., 1)^T,  α = (α_{1:n}, α_{2:n}, ..., α_{n:n})^T,  ε = (ε_1, ε_2, ..., ε_n)^T,

θ = (a, b)^T,  A = (1 α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B,

where B = (Cov(Y_{i:n}, Y_{j:n})) is the variance–covariance matrix of the reduced order statistics. The GLS estimator of θ is

θ̂ = (A′ B^{-1} A)^{-1} A′ B^{-1} x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ B^{-1} A)^{-1}.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
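As an illustrative sketch (not part of the original entry), the exponential case, where α and B have closed forms via the Rényi representation of exponential order statistics, can be coded directly; the function name and the simulated data are our own assumptions:

    import numpy as np

    rng = np.random.default_rng(7)

    def lloyd_gls_exponential(x_sorted):
        """BLUEs of location a and scale b from ordered data, assuming a
        standard-exponential reduced distribution (alpha, B in closed form)."""
        n = len(x_sorted)
        c = 1.0 / (n - np.arange(n))          # 1/n, 1/(n-1), ..., 1/1
        alpha = np.cumsum(c)                  # E(Y_{i:n}), Renyi representation
        v = np.cumsum(c ** 2)                 # Var(Y_{i:n})
        B = np.minimum.outer(v, v)            # Cov(Y_{i:n}, Y_{j:n}) = v_min(i,j)
        A = np.column_stack([np.ones(n), alpha])
        Binv = np.linalg.inv(B)
        return np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x_sorted)

    x = np.sort(10.0 + 2.0 * rng.exponential(size=50))   # a = 10, b = 2
    print(lloyd_gls_exponential(x))                      # approx (10, 2)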

About the Author

Dr. Horst Rinne is Professor Emeritus (since ) of statistics and econometrics. In he was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. From to he worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover. In he joined Volkswagen AG to do operations research. In he was appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis, from to . He is co-founder of the Econometric Board of the German Society of Economics. Since he has been an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).

Cross References
►Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat :–
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat :–
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc :–
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika :–
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Plann Inference :–
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ►beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

R_i | P_i ∼ Binomial(m_i, P_i),
logit(P_i) | Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model, with the inclusion of a single normally distributed random effect in the linear predictor — an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (),

E[R_i] = m_i π_i and Var(R_i) = m_i π_i (1 − π_i)[1 + σ² (m_i − 1) π_i (1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
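A small simulation sketch (illustrative, not from the original entry; all parameter values invented) shows the overdispersion: the empirical variance of R exceeds the binomial variance and is close to Williams' small-σ² approximation:

    import numpy as np

    rng = np.random.default_rng(3)

    # logit(P) ~ N(eta, sigma^2), then R | P ~ Binomial(m, P)
    m, eta, sigma = 20, 0.0, 0.3
    P = 1.0 / (1.0 + np.exp(-(eta + sigma * rng.standard_normal(100_000))))
    R = rng.binomial(m, P)

    pi = 1.0 / (1.0 + np.exp(-eta))
    binom_var = m * pi * (1 - pi)
    williams = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
    print(R.var(), binom_var, williams)  # empirical variance exceeds binomial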

About the Author
For biography see the entry ►Logistic Distribution.

Cross References
►Logistic Distribution
►Mixed Membership Models
►Multivariate Normal Distributions
►Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B :–
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika :–
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika :–
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat :–

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1 − y_i}.    (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ►Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, φ) = exp{(y_i θ_i − b(θ_i))/α(φ) + c(y_i; φ)},    (2)

with θ representing the link function, α(φ) the scale parameter, b(θ) the cumulant, and c(y; φ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}.    (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)},    (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.    (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i))}.    (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x_i′β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.    (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x_i′β)) = exp(x_i′β)/(1 + exp(x_i′β)).    (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i,    (9)

∂²L(β)/(∂β ∂β′) = −Σ_{i=1}^n x_i x_i′ μ_i (1 − μ_i).    (10)
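The gradient (9) and Hessian (10) are all that is needed for a bare-bones Newton–Raphson fit. The following sketch (illustrative code, not from the entry; the function name and simulated data are our own) implements them with NumPy:

    import numpy as np

    def logit_fit(X, y, iters=25):
        """Newton-Raphson for binary logistic regression:
        gradient X'(y - mu) as in (9), Hessian -X'WX, W = diag(mu(1-mu)) as in (10)."""
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))
            grad = X.T @ (y - mu)
            hess = -X.T @ (X * (mu * (1 - mu))[:, None])
            beta -= np.linalg.solve(hess, grad)
        se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # SEs from the Hessian
        return beta, se

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.standard_normal(500)])
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0])))))
    beta, se = logit_fit(X, y)
    print(beta, se, np.exp(beta[1]))   # exp(beta) is an odds ratio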

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation, of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)
Survived | child | adults | Total
no       |       |        |
yes      |       |        |
Total    |       |        |

The odds of survival for adult passengers is the number of surviving adults divided by the number who died; the odds of survival for children is computed likewise. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived | Odds Ratio | Std. Err. | z | P>|z| | [95% Conf. Interval]
age      |            |           |   |       |

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

► The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had some – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

► The odds of surviving is one and a half percent greater for each increasing year of age.
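Since the counts in the tabulation above were not preserved, here is the same calculation on a hypothetical 2 × 2 table (all numbers invented for illustration); the odds ratio and a 95% confidence interval follow directly from the cell counts:

    import numpy as np

    # Hypothetical counts standing in for the Titanic table:
    surv_child, died_child = 60, 50
    surv_adult, died_adult = 600, 1000

    or_hat = (surv_adult / died_adult) / (surv_child / died_child)
    # Large-sample standard error of the log odds ratio
    se = np.sqrt(1/surv_child + 1/died_child + 1/surv_adult + 1/died_adult)
    ci = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se)
    print(or_hat, ci)   # 0.5 here: adults had about half the odds of children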

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = binom(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},    (11)

where binom(m_i, y_i) denotes the binomial coefficient,

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln binom(m_i, y_i)}.    (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y | cases | x1 | x2 | x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the listed predictor values of x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y  | Odds ratio | OIM std. err. | z | P>|z| | [95% conf. interval]
x1 |            |               |   |       |
x2 |            |               |   |       |
x3 |            |               |   |       |
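A grouped-data fit can be sketched in the same way as the binary Newton–Raphson above, replacing the Bernoulli mean and variance by their binomial versions m_i π_i and m_i π_i(1 − π_i); the data here are invented, since the table's entries were not preserved:

    import numpy as np

    # Invented grouped data: y successes out of m cases per covariate pattern.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
                  [1, 1, 1], [1, 2, 0], [1, 2, 1]], dtype=float)
    y = np.array([2, 4, 3, 6, 5, 8], dtype=float)
    m = np.array([10, 10, 10, 10, 10, 10], dtype=float)

    beta = np.zeros(X.shape[1])
    for _ in range(50):
        pi = 1 / (1 + np.exp(-X @ beta))
        mu = m * pi                 # binomial mean m_i * pi_i
        W = m * pi * (1 - pi)       # binomial variance
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))
    print(np.exp(beta))   # odds ratios; expanding to 0/1 rows gives the same fit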

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set grouped into a much smaller number of covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once widely used, is now used only with caution. The test is heavily influenced by the manner in which tied data are classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see ►Akaike's Information Criterion and ►Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author

Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI Astrostatistics Committee and Network, the first global association of astrostatisticians. He is also chair of the ISI Sports Statistics Committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association (–). Hilbe is author of Negative Binomial Regression (, Cambridge University Press) and Logistic Regression Models (, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (, Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for years, –. He was founding editor of the Stata Technical Bulletin () and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,    (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]^{-1},  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]^{-1},  −∞ < x < ∞,

h(x) = 1/(τ[1 + exp{−(x − μ)/τ}]).

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,    (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
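The simulation recipe is a one-liner; this sketch (illustrative, not from the entry; parameter values invented) also checks the mean and the variance τ²π²/3:

    import numpy as np

    rng = np.random.default_rng(42)

    # X = mu + tau * log(U / (1 - U)), U ~ Uniform(0, 1)
    mu, tau = 2.0, 1.5
    u = rng.uniform(size=200_000)
    x = mu + tau * np.log(u / (1 - u))

    print(x.mean(), mu)                       # mean is mu
    print(x.var(), tau**2 * np.pi**2 / 3)     # variance is tau^2 * pi^2 / 3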

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β_1 + β_2 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_1/β_2 and τ = 1/|β_2|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author

John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association (–), the Statistical Modelling Society (–), and the European Regional Section of the International Association for Statistical Computing (–). He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region (). He is an Elected member of the International Statistical Institute (). He has authored or coauthored over papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press, ). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc :–
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A :–

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

► Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

► A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (). This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L_1(p) has greater inequality (G = ) than L_2(p) (G = )
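A short sketch (illustrative, not from the original entry; the function name and sample are our own) computes the empirical Lorenz curve and the Gini index of a sample by the trapezoidal rule:

    import numpy as np

    def lorenz_gini(income):
        """Empirical Lorenz curve and Gini index (twice the area
        between the diagonal and the curve, trapezoidal rule)."""
        x = np.sort(np.asarray(income, dtype=float))
        n = len(x)
        L = np.append(0.0, np.cumsum(x) / x.sum())   # L(0) = 0, ..., L(1) = 1
        p = np.arange(n + 1) / n
        gini = 1.0 - np.sum(L[1:] + L[:-1]) / n      # = 1 - 2 * area under L
        return p, L, gini

    rng = np.random.default_rng(5)
    income = rng.lognormal(sigma=1.0, size=100_000)
    print(lorenz_gini(income)[2])   # lognormal: G = 2*Phi(sigma/sqrt(2)) - 1, about 0.52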

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
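A quick numerical check of cases (I) and (II) of the Theorem (illustrative sketch only; the tax rules u(x) = 0.8x and u(x) = x^0.8 are invented examples):

    import numpy as np

    def gini(x):
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        L = np.append(0.0, np.cumsum(x) / x.sum())
        return 1.0 - np.sum(L[1:] + L[:-1]) / n

    rng = np.random.default_rng(8)
    x = rng.lognormal(sigma=1.0, size=50_000)

    print(gini(x))          # pre-tax inequality
    print(gini(0.8 * x))    # flat tax: u(x)/x constant, Gini unchanged (II)
    print(gini(x ** 0.8))   # progressive: u(x)/x decreasing, Gini falls (I)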

About the Author

Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) in, with the thesis On the Allocation of Linear Observations, and in addition he has published about printed articles. He served as professor in statistics at Hanken School of Economics (–) and has been a scientist at the Folkhälsan Institute of Genetics since. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica :–
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ :–
Jakobsson U () On the measurement of the degree of progression. J Publ Econ :–
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica :–
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser :–

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss which is suffered by a decision unequal to the truth: the larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration. Next we describe examples of loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets, Θ_0 and Θ_1, describing the null hypothesis and the alternative: Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0  if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1  if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems, we assume that Θ is a normed linear space and let |⋅| be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = |θ − ϑ|, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L_1-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t²  if |t| ≤ γ,  and  ℓ(t) = 2γ|t| − γ²  if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ►Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, ..., y_n, observed at design points t_1, ..., t_n of the experimental region, a loss function is used to determine an estimation for the unknown regression function by the "best approximation", i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual in the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².
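The following sketch (illustrative, not from the entry; huber and linex are assumed names, with γ, a, b the fixed constants of the text) evaluates two of the losses discussed above:

    import numpy as np

    def huber(t, gamma=1.5):
        """Huber loss: quadratic near zero, linear in the tails."""
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) <= gamma, t**2, 2*gamma*np.abs(t) - gamma**2)

    def linex(t, a=1.0, b=1.0):
        """LinEx loss: exponential penalty on one side, linear on the other."""
        t = np.asarray(t, dtype=float)
        return b * (np.exp(a*t) - a*t - 1)

    t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(huber(t))
    print(linex(t))   # linex(0) = 0; for a > 0 it grows much faster for t > 0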

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis :–
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



If the variance–covariance matrix of the error vector e coincides (except for a multiplicative scalar σ²) with a positive definite matrix V, then the best linear unbiased estimation reduces to the minimization of (y − Xβ)^T V^{-1} (y − Xβ), called the weighted LS problem. Moreover, if rank(X) = p, then its solution is given in the form

β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y.
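A minimal numerical sketch of the weighted LS estimator (all data simulated; illustrative only):

    import numpy as np

    rng = np.random.default_rng(2)

    # V known positive definite (here diagonal); sigma^2 is left free.
    n = 60
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    V = np.diag(rng.uniform(0.5, 3.0, size=n))
    e = rng.multivariate_normal(np.zeros(n), V)
    y = X @ np.array([1.0, 2.0]) + e

    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    print(beta)   # approx (1, 2)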

It is worth adding that a nonlinear LS problem is more complicated, and its explicit solution is usually not known; instead, iterative algorithms are suggested.

Total least squares problem. The problem has been posed in recent years in numerical analysis as an alternative to the LS problem in the case when all data are affected by errors.

Consider an overdetermined system of n linear equations Ax ≈ b with k unknowns x. The TLS problem consists in minimizing the Frobenius norm

‖[A, b] − [Â, b̂]‖_F

for all Â ∈ R^{n×k} and b̂ ∈ range(Â), where the Frobenius norm is defined by ‖(a_ij)‖²_F = Σ_{ij} a²_ij. Once a minimizing [Â, b̂] is found, then any x̂ satisfying Âx̂ = b̂ is called a TLS solution of the initial system Ax ≈ b.

The trouble is that the minimization problem may not be solvable, or its solution may not be unique; simple low-dimensional examples of both phenomena can be constructed (see Van Huffel and Vandewalle ()).

It is known that the TLS solution (if it exists) is always better than the ordinary LS solution in the sense that the correction b̂ − Âx̂ has a smaller l₂-norm. The main tool in solving TLS problems is the following singular value decomposition: For any n × k matrix A with real entries there exist orthonormal matrices P = [p_1, ..., p_n] and Q = [q_1, ..., q_k] such that

P^T A Q = diag(σ_1, ..., σ_m),  where σ_1 ≥ ⋯ ≥ σ_m and m = min{n, k}.
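The SVD yields a basic TLS solver: in the generic case, the solution is read off the right singular vector belonging to the smallest singular value of the augmented matrix [A, b]. The sketch below is illustrative only (the function name and the non-genericity tolerance are our assumptions):

    import numpy as np

    rng = np.random.default_rng(4)

    def tls(A, b):
        """Basic TLS solution via the SVD of the augmented matrix [A, b]."""
        n, k = A.shape
        _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
        v = Vt[-1]                 # right singular vector of smallest sigma
        if abs(v[-1]) < 1e-12:     # nongeneric case: no TLS solution this way
            raise np.linalg.LinAlgError("TLS solution does not exist")
        return -v[:k] / v[-1]

    A = rng.standard_normal((30, 2))
    b = A @ np.array([1.0, -2.0])
    A_noisy = A + 0.01 * rng.standard_normal(A.shape)  # errors in A and in b
    b_noisy = b + 0.01 * rng.standard_normal(30)
    print(tls(A_noisy, b_noisy))   # approx (1, -2)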

About the Author
For biography see the entry ►Random Variable.

Cross References
►Absolute Penalty Estimation
►Adaptive Linear Regression
►Analysis of Areal and Spatial Interaction Data
►Autocorrelation in Regression
►Best Linear Unbiased Estimation in Linear Models
►Estimation
►Gauss-Markov Theorem
►General Linear Models
►Least Absolute Residuals Procedure
►Linear Regression Models
►Nonlinear Models
►Optimum Experimental Design
►Partial Least Squares Regression Versus Other Methods
►Simple Linear Regression
►Statistics, History of
►Two-Stage Least Squares

References and Further Reading
Björk A () Numerical methods for least squares problems. SIAM, Philadelphia
Kariya T () Generalized least squares. Wiley, New York
Rao CR, Toutenberg H () Linear models: least squares and alternatives. Springer, New York
Van Huffel S, Vandewalle J () The total least squares problem: computational aspects and analysis. SIAM, Philadelphia
Wolberg J () Data analysis using the method of least squares: extracting the most information from experiments. Springer, New York

Lévy Processes

Mohamed Abdel-Hameed
Professor
United Arab Emirates University, Al Ain, United Arab Emirates

IntroductionLeacutevy processes have become increasingly popular in engi-neering (reliability dams telecommunication) andmathe-matical nanceeir applications in reliability stems fromthe fact that they provide a realistic model for the degrada-tion of devices while their applications in the mathemati-cal theory of dams as they provide a basis for describingthe water input of dams eir popularity in nance isbecause they describe the nancialmarkets in amore accu-rate way than the celebrated BlackndashScholes model elatter model assumes that the rate of returns on assetsare normally distributed thus the process describing theasset price over time is continuous process In reality theasset prices have jumps or spikes and the asset returnsexhibit fat tails and 7skewness which negates the nor-mality assumption inherited in the BlackndashScholes model


Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams.

In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {X_t, t ≥ 0} that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of X_{t+s} − X_t is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, X_{t+s} − X_t is independent of (X_u, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0: lim_{s→t} P(|X_t − X_s| > є) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

If Φ_t(z) is the characteristic function of a Lévy process, then its characteristic exponent φ(z) ≝ ln Φ_t(z)/t is of the form

iza − z²b/2 + ∫_R [exp(izx) − 1 − izx 1_{|x|<1}] ν(dx),

where a ∈ R, b ∈ R₊, and ν is a measure on R satisfying

ν({0}) = 0, ∫_R (1 ∧ x²) ν(dx) < ∞.

The measure ν characterizes the size and frequency of the jumps. If the measure is infinite, then the process has infinitely many jumps of very small sizes in any small interval. The constant a defined above is called the drift term of the process, and b is the variance (volatility) term.

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows:

Given any a ∈ R, b ∈ R₊, and measure ν on R satisfying ν({0}) = 0, ∫_R (1 ∧ x²) ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X⁽¹⁾, X⁽²⁾, and X⁽³⁾, where X⁽¹⁾ is a Brownian motion with drift a and volatility b (in the sense defined below), X⁽²⁾ is a compound Poisson process, and X⁽³⁾ is a square integrable martingale.

The characteristic exponents of X⁽¹⁾, X⁽²⁾, and X⁽³⁾ (denoted by φ⁽¹⁾(z), φ⁽²⁾(z), and φ⁽³⁾(z), respectively) are as follows:

φ⁽¹⁾(z) = iza − z²b/2,
φ⁽²⁾(z) = ∫_{|x|≥1} (exp(izx) − 1) ν(dx),
φ⁽³⁾(z) = ∫_{|x|<1} (exp(izx) − 1 − izx) ν(dx).

Examples of Lévy Processes

The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift µ and volatility rate σ² if a = µ, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift µ > 0 and volatility rate σ². For any x > 0, let T_x = inf{t : X_t > x}. Then {T_x} is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

ν(dx) = (1/√(2πσ²x³)) exp(−µ²x/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.
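A compound Poisson path is easy to simulate directly from this description; in the minimal sketch below the total jump rate and the exponential jump-size law are arbitrary illustrative choices, not values from the entry.

```python
import numpy as np

rng = np.random.default_rng(2)

# Compound Poisson path on [0, T]: finite Levy measure with total mass lam;
# jump sizes drawn from the normalized jump law (here exponential).
T, lam = 10.0, 3.0
n_jumps = rng.poisson(lam * T)
times = np.sort(rng.uniform(0.0, T, n_jumps))
jumps = rng.exponential(0.5, n_jumps)

def X(t):
    """Value of the compound Poisson process at time t."""
    return jumps[times <= t].sum()

print([round(X(t), 3) for t in (2.5, 5.0, 10.0)])
```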

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X where b = 0, a − ∫₀¹ x ν(dx) = 0, and its Lévy measure is given by

ν(dx) = (α/x) exp(−x/β) dx, x > 0,

where α, β > 0. It follows that the mean term E(X₁) and the variance term V(X₁) for the process are equal to αβ and αβ², respectively.

Figure 1 shows a simulated sample path of a gamma process.
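A path like the one in Fig. 1 can be generated from independent gamma-distributed increments, since X_{t+dt} − X_t has a Gamma(α·dt, β) distribution; the parameter values in this sketch are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Gamma process: independent increments ~ Gamma(shape alpha*dt, scale beta).
alpha, beta = 1.0, 1.0            # illustrative values
dt, n_steps = 0.01, 1000
increments = rng.gamma(shape=alpha * dt, scale=beta, size=n_steps)
path = np.concatenate([[0.0], np.cumsum(increments)])

# X_1 has mean alpha*beta and variance alpha*beta**2.
print(path[int(1.0 / dt)])
```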

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian motion subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian motion by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: µ, the drift term of the Brownian motion; σ, the volatility of the Brownian motion; and ν, the variance term of the gamma process.

[Lévy Processes, Fig. 1: simulated sample path of a gamma process (axes t, x)]

[Lévy Processes, Fig. 2: simulated sample paths of a Brownian motion and of a variance gamma process (axes t, x)]

Figure 2 shows two simulated sample paths, one for a Brownian motion and the other for a variance gamma process with the same values of the drift and volatility terms.
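The subordination representation gives a direct way to simulate such a variance gamma path: the gamma subordinator has unit mean rate and variance parameter ν, and the Brownian motion is evaluated at the subordinated time. The parameter values below are illustrative, not those used for the figure.

```python
import numpy as np

rng = np.random.default_rng(4)

# Variance gamma via Brownian subordination: dG ~ Gamma(dt/nu, nu) has
# mean dt (unit-mean-rate gamma time) and variance nu*dt.
mu, sigma, nu = 0.0, 0.2, 0.5     # illustrative parameters
dt, n_steps = 0.01, 1000

dG = rng.gamma(shape=dt / nu, scale=nu, size=n_steps)   # gamma time increments
dX = mu * dG + sigma * np.sqrt(dG) * rng.normal(size=n_steps)
vg_path = np.concatenate([[0.0], np.cumsum(dX)])
print(vg_path[-1])
```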

About the Author
Professor Mohamed Abdel-Hameed was the Chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Çınlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using P(M,λ,τ) policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS () The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P () Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC () The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM () A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (e_x) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

e_x = T_x / l_x,

where x is age, T_x is the number of person-years lived aged x and over, and l_x is the number of survivors at age x according to the life table.
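A minimal sketch of this computation from a survivorship column follows; the l_x values are invented, and the approximation L_x ≈ (l_x + l_{x+1})/2 for person-years lived in each interval is an assumption beyond the formula above.

```python
import numpy as np

# Invented survivorship column l_x with radix l_0 = 100000.
lx = np.array([100000, 99000, 98800, 98600, 98300, 97900], dtype=float)

# Person-years per age interval, with a crude closure for the last interval.
Lx = np.append((lx[:-1] + lx[1:]) / 2.0, lx[-1] / 2.0)
Tx = Lx[::-1].cumsum()[::-1]      # person-years lived at age x and over
ex = Tx / lx                      # life expectancy at each age x

print(ex[0])                      # e_0: life expectancy at birth
```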


Life expectancy can be calculated for combined sexes or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth (e₀) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e₀ = T₀ / l₀,

where T₀ is the total size of the population and l₀ is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e₀ < e₁ < e₂, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in the pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was below 30 years; between 1800 and 2001, life expectancy at birth rose from about 30 years to a global average of 67 years, and to more than 75 years in the richest countries (Riley). Furthermore, in most industrialized countries in the early twenty-first century, life expectancy averaged about 80 years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents, from about 45 years in some central African populations to 80 years and above in many European countries. The more developed regions have a substantially higher average life expectancy; a newborn in the less developed regions is expected to live, on average, more than 10 years less. The two continents that display the most extreme differences in life expectancies are North America and Africa, where, as of recently, the gap between life expectancies amounts to roughly 25 years (UN data).

Countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and, above all, Japan; Japanese men and women live an average of 79 and 86 years, respectively (WHO).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, the average life expectancy is 40 years and below. Some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries women live longer than men; the world's average gap in life expectancy at birth is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes (more than a decade less for men), whereas in Tonga life expectancy for males exceeds that for females (WHO).

Life expectancy is assumed to rise continuously. According to UN estimates, global life expectancy at birth is likely to rise to an average of 75 years by 2045–2050, while still varying considerably across countries. Long-range United Nations population projections predict further rises beyond that horizon, with the lowest national values projected for Liberia and the highest for Japan.

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz () and Preston et al. ().

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. WHO, Geneva


United Nations () World population prospects: the  revision, highlights. United Nations, New York
United Nations () World population to 2300. United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = dF(x)/dx, and define the force of mortality or death intensity at age x as the function

µ(x) = −(d/dx) ln[1 − F(x)] = f(x)/[1 − F(x)].

Heuristically, it is interpreted by the relation µ(x)dx = P{x < X < x + dx | X > x}. Conversely,

F(x) = 1 − exp{−∫₀ˣ µ(s) ds}.

The survivor function is defined as ℓ(x) = ℓ(0)[1 − F(x)], normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is

ₜpₓ = P{X > x + t | X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ µ(x + s) ds},

and the non-survival probability is (the converse) ₜqₓ = 1 − ₜpₓ. For t = 1, one writes qₓ = ₁qₓ and pₓ = ₁pₓ. In particular, we get ℓ(x + 1) = ℓ(x)pₓ. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of pₓ for all relevant x.

The life expectancy is e̊₀ = EX = ∫₀^∞ ℓ(x) dx / ℓ(0). (The subscript 0 in e̊₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript o indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e̊ₓ = E(X − x | X > x) = ∫₀^∞ ℓ(x + t) dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function µ(x) is piecewise constant, which means that we take it to equal some constant, say µⱼ, over each interval (xⱼ, xⱼ₊₁) for some partition {xⱼ} of the age axis. For a collection {Xᵢ} of independent observations of X, let Dⱼ be the number of Xᵢ that fall in the interval (xⱼ, xⱼ₊₁). In mortality applications, this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dⱼ is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xᵢ − xⱼ time units during the interval. Individuals who survive the interval will have lived for xⱼ₊₁ − xⱼ time units in the interval, and individuals who do not survive to age xⱼ will not have lived during this interval at all. When we aggregate the time units lived in (xⱼ, xⱼ₊₁) over all individuals, we get a total Rⱼ, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters µⱼ, the collection {Dⱼ, Rⱼ} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = Σⱼ{−µⱼRⱼ + Dⱼ ln µⱼ}, which is easily seen to be maximized by µ̂ⱼ = Dⱼ/Rⱼ. The latter fraction is therefore the maximum-likelihood estimator for µⱼ. (In some connections, an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function µₓ = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.
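The occurrence-exposure estimator µ̂ⱼ = Dⱼ/Rⱼ, and the single-year survival recursion built from it (described below), can be sketched as follows; the occurrence and exposure numbers are invented for illustration.

```python
import numpy as np

# Invented deaths D_j and exposures R_j for single-year age intervals.
D = np.array([30, 12, 8, 6, 5], dtype=float)           # occurrences
R = np.array([9900, 9870, 9850, 9835, 9825], dtype=float)  # person-years

mu_hat = D / R                    # maximum-likelihood death intensities
p_hat = np.exp(-mu_hat)           # single-year survival probabilities

l = [100000.0]                    # radix l(0)
for p in p_hat:                   # recursion l(x+1) = l(x) * p_hat(x)
    l.append(l[-1] * p)
print(np.round(l, 1))
```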

quantities suppose that the function micro(x) is piecewiseconstant which means that we take it to equal some con-stant say microj over each interval (xj xj+) for some partitionxj of the age axis For a collection Xi of independentobservations of X let Dj be the number of Xi that fall inthe interval (xj xj+) In mortality applications this is thenumber of deaths observed in the given interval For thecohort of the initially newborn Dj is the number of indi-viduals whodie in the interval (called the occurrences in theinterval) If individual i dies in the interval he or she willof course have lived for Xi minus xj time units during the inter-val Individuals who survive the interval will have lived forxj+ minus xj time units in the interval and individuals whodo not survive to age xj will not have lived during thisinterval at all When we aggregate the time units lived in(xj xj+) over all individuals we get a total Rj which iscalled the exposures for the interval the idea being thatindividuals are exposed to the risk of death for as long asthey live in the interval In the simple case where there areno relations between the individual parameters microj the col-lection DjRj constitutes a statistically sucient set ofobservations with a likelihood Λ that satises the relationln Λ = sumjminusmicrojRj+Dj ln microjwhich is easily seen to bemax-imized by microj = DjRje latter fraction is therefore themaximum-likelihood estimator for microj (In some connec-tions an age schedule of mortality will be specied such asthe classical GompertzndashMakeham function microx = a + bcxwhich does represent a relationship between the intensityvalues at the dierent ages x normally for single-year agegroupsMaximum likelihood estimators can then be foundby plugging this functional specication of the intensitiesinto the likelihood function nding the values a b andc that maximize Λ and using a + bcx for the intensity inthe rest of the life table computations Methods that do notamount tomaximum likelihood estimationwill sometimesbe used because they involve simpler computations Withsome luck they provide starting values for the iterative pro-cess that must usually be applied to produce the maximumlikelihood estimators For an example see Forseacuten ())is whole schema can be extended trivially to cover cen-soring (withdrawals) provided the censoringmechanism isunrelated to the mortality process


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂ₓ in this interval, then p̂ₓ = e^{−µ̂ₓ} is an estimator of the single-year survival probability pₓ. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ(x + 1) = ℓ(x)p̂ₓ for x = 0, 1, 2, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a {pₓ} schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the {pₓ} values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology. See, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of Rᵏ and the variable y is assumed to take values in a subset of Rⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings, where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θʸ(1 − θ)ⁿ⁻ʸ, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θʸ(1 − θ)ⁿ⁻ʸ,


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which records the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(yᵢ − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y₁, …, yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
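For the binomial example above, the likelihood, its maximizing value θ̂ = y/n, and a relative-likelihood interval can be computed directly; the values of n and y in this sketch are invented.

```python
import numpy as np

# Binomial log-likelihood l(theta) = y log(theta) + (n - y) log(1 - theta).
n, y = 20, 7
theta = np.linspace(0.001, 0.999, 999)

loglik = y * np.log(theta) + (n - y) * np.log(1.0 - theta)
rel_lik = np.exp(loglik - loglik.max())   # L(theta) / L(theta_hat)

theta_hat = theta[np.argmax(loglik)]
print(theta_hat)                          # approximately y/n = 0.35
print(theta[rel_lik > 0.1][[0, -1]])      # endpoints of a 10% likelihood region
```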

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section "7Nuisance Parameters" below.

In general, we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ | y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure and π(θ | y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials, and assuming a uniform prior π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function or, more usually, by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²ₖ random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y₁, …, Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →d χ²ₖ, (2)
U(θ)ᵀ i⁻¹(θ)U(θ) →d χ²ₖ, (3)
2{ℓ(θ̂) − ℓ(θ)} →d χ²ₖ, (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²ₖ is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²ₖ distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²ₖ₁.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
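The following sketch profiles out the mean in an N(µ, σ²) sample (for fixed σ², the constrained MLE of µ is ȳ) and recovers exactly the divisor-n estimator of σ² mentioned above; the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated N(mu, sigma^2) sample; mu is the nuisance parameter here.
y = rng.normal(loc=2.0, scale=1.5, size=30)
n, ybar = y.size, y.mean()
rss = np.sum((y - ybar) ** 2)

def profile_loglik(sigma2):
    """Profile log-likelihood of sigma^2, with mu profiled out at ybar."""
    return -0.5 * n * np.log(sigma2) - rss / (2.0 * sigma2)

sigma2_grid = np.linspace(0.5, 6.0, 2000)
sigma2_hat = sigma2_grid[np.argmax(profile_loglik(sigma2_grid))]
print(sigma2_hat, rss / n)        # profile MLE uses divisor n, not n - 1
```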

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

π_M(ψ | y) ∝ ∫ π(ψ, λ | y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ | y₁; λ), or
f(y; θ) = f₁(y₁ | y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorizations is given in Reid (). This literature is closely tied to that on higher-order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is that of the proportional hazards model of Cox (). But many other extensions have been suggested: empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; bootstrap likelihood (Davison et al. ); and many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower-dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher-order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables (with discussion). J R Stat Soc B
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci
Reid N () Likelihood. J Am Stat Assoc
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml
Varin C () On composite marginal likelihood. Adv Stat Anal


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, … is a sequence of i.i.d. random variables,

Sₙ = Σ_{j=1}^{n} Xⱼ,

and the expectation a = EX exists, then n⁻¹Sₙ → a almost surely (i.e., with probability 1). Thus the value na can be called the first-order approximation for the sums Sₙ. The Central Limit Theorem (CLT) gives a more precise approximation for Sₙ. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζₙ = (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζₙ < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second-order approximation for Sₙ.
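Both approximations are easy to check by simulation; the sketch below uses i.i.d. exponential variables, an arbitrary illustrative choice with a = σ = 1.

```python
import numpy as np

rng = np.random.default_rng(6)

# LLN and CLT for i.i.d. exponential variables (a = EX = 1, sigma = 1).
a, sigma, n, n_rep = 1.0, 1.0, 1000, 20000
X = rng.exponential(a, size=(n_rep, n))

means = X.mean(axis=1)                                 # n^{-1} S_n (LLN)
zeta = (X.sum(axis=1) - n * a) / (sigma * np.sqrt(n))  # standardized sums

print(means.mean())            # close to a = 1.0
print((zeta < 1.0).mean())     # close to Phi(1) = 0.8413 (CLT)
```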

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late seventeenth century (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P. S. Laplace and C. F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P. L. Chebyshev () and A. M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the Xⱼ's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = Xⱼₙ forming the sum Sₙ depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on the Xⱼ, under which the LLN and CLT still hold true, have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζₙ > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, Xₙ of a more complex nature, first of all multivariate random vectors. If Xⱼ ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xₙ) be a sample from a distribution F and F*ₙ(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*ₙ(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN, and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*ₙ of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN, since, as n → ∞,

a* = (1/n) Σ_{j=1}^{n} Xⱼ → a,
(σ²)* = (1/n) Σ_{j=1}^{n} (Xⱼ − a*)² = (1/n) Σ_{j=1}^{n} Xⱼ² − (a*)² → σ²

almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
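A sketch of such an asymptotic confidence interval for a, built from a* and (σ²)* and the normal approximation, is given below; the gamma sampling distribution is an invented example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented example: gamma-distributed sample with true mean a = 3.0.
X = rng.gamma(shape=2.0, scale=1.5, size=5000)

n = X.size
a_star = X.mean()
sigma2_star = X.var()                # (1/n) sum (X_j - a*)^2
half = 1.96 * np.sqrt(sigma2_star / n)
print(a_star - half, a_star + half)  # asymptotic 95% interval, covers a = 3.0
                                     # in roughly 95% of repeated samples
```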

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then, in a number of cases, one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(Sₖ − θk), under the assumption that EX = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{µ(X−θ)} = 1 has a solution µ₀ > 0, then, in the above example, one has

P(S > x) ∼ c e^{−µ₀x}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{µ(X−θ)} = ∞ for any µ > 0), then

P(S > x) ∼ (1/θ) ∫ₓ^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζₙ(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sₙ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {Sₖ}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*ₙ(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*ₙ(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M () Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yᵢ | bᵢ ∼ N(Xᵢβ + Zᵢbᵢ, Σᵢ), (1)
bᵢ ∼ N(0, D), (2)

where Xᵢ and Zᵢ are (nᵢ × p)- and (nᵢ × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is an (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ will not depend upon i. Finally, bᵢ is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific to the unit (random effects, bᵢ), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢᵀ + Σᵢ.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢᵀ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²Iₙᵢ. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yᵢ ∼ N(Xᵢβ, Vᵢ), Vᵢ = ZᵢDZᵢᵀ + Σᵢ,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vᵢ.
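For the random-intercept example, this particular parameterization of Vᵢ can be made explicit in a few lines; the values of the variance components below are invented.

```python
import numpy as np

# Marginal covariance V_i = Z_i D Z_i' + sigma^2 I for a random-intercept
# model: Z_i is a column of ones, D is the 1x1 intercept variance.
n_i, d11, sigma2 = 4, 2.0, 1.0     # invented values
Z = np.ones((n_i, 1))
D = np.array([[d11]])

V = Z @ D @ Z.T + sigma2 * np.eye(n_i)
print(V)   # compound symmetry: d11 + sigma2 on the diagonal, d11 off it
```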

With respect to the estimation of unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢᵀVᵢ⁻¹(α)(yᵢ − Xᵢβ), where α denotes the vector of unknown covariance parameters. Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile. It can easily be shown that

ŷᵢ = ΣᵢVᵢ⁻¹Xᵢβ̂ + (Iₙᵢ − ΣᵢVᵢ⁻¹)yᵢ,

which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and Iₙᵢ − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" of ΣᵢVᵢ⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vᵢ. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λᵀb̂ᵢ) ≤ var(λᵀbᵢ) = λᵀDλ.
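The EB estimate and the shrinkage identity above can be verified numerically; everything in this sketch (β̂, the variance components, the observed profile) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Continue the random-intercept example for a single unit.
n_i, d11, sigma2 = 4, 2.0, 1.0
X = np.column_stack([np.ones(n_i), np.arange(n_i)])
beta_hat = np.array([1.0, 0.5])          # invented "estimated" fixed effects
Z = np.ones((n_i, 1))
D = np.array([[d11]])
Sigma = sigma2 * np.eye(n_i)
V = Z @ D @ Z.T + Sigma

y = X @ beta_hat + rng.normal(size=n_i)  # toy observed profile
b_hat = D @ Z.T @ np.linalg.solve(V, y - X @ beta_hat)   # EB estimate

W = Sigma @ np.linalg.inv(V)             # weight on the average profile
y_pred = W @ (X @ beta_hat) + (np.eye(n_i) - W) @ y
print(b_hat, np.allclose(y_pred, X @ beta_hat + Z @ b_hat))  # identity holds
```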

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.
(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f, known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies ω₁ and ω₂, embedded in a white noise background:

y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε.

We write β₀ + β₁x₁ + β₂x₂ + ε, for β = (β₀, β₁, β₂), x = (1, x₁, x₂), x₁ ≡ x₁(t) = sin[ω₁t], x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y | x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model, one assumes that for different instances "i" of observation the random errors satisfy Eεᵢ ≡ 0, Eεᵢεₖ ≡ σ² > 0 for i = k ≤ n, and Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are yᵢ and the associated values of the independent variables x; that is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector {yᵢ, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose n rows have entries xᵢₖ = xₖ[tᵢ].
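A sketch of an ordinary LS fit of this sine-wave model follows; the frequencies and the true coefficients are invented illustrative values.

```python
import numpy as np

rng = np.random.default_rng(9)

# Design matrix for y = b0 + b1*sin(w1 t) + b2*sin(w2 t) + e.
w1, w2 = 1.0, 2.0                       # invented frequencies
t = np.linspace(0.0, 20.0, 200)
X = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
beta_true = np.array([0.3, 1.5, -0.8])  # invented coefficients
y = X @ beta_true + rng.normal(scale=0.4, size=t.size)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
print(beta_hat)                          # close to beta_true
```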

Terminology
"Independent variables," as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only weeks before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weights of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew, the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1885 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long-ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence, we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2004), Freedman (1991, 2009).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations (1), in which the random errors ε are assumed to satisfy Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n:

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ,  i ≤ n   (1)

The interpretation is that response yᵢ is observed for the ith sample in conjunction with numerical values (xᵢ₁, …, xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂ML = β̂LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε   (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂LS)ₖ = βₖ,  k ≤ p   (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)²  for every k ≤ p   (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
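The following minimal sketch (Python with NumPy; the frequencies, coefficients, and noise level are illustrative assumptions, not values from this entry) computes the least squares estimate (2) for the sine-wave signal model above and verifies the purely algebraic identity β̂LS = β + Mε:

import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(n, dtype=float)
w1, w2 = 0.6, 1.7                      # two fixed, known frequencies (assumed)
X = np.column_stack([np.ones(n), np.sin(w1 * t), np.sin(w2 * t)])
beta = np.array([1.0, 2.0, -0.5])      # true coefficients, unknown in practice
eps = rng.normal(scale=0.3, size=n)    # iid errors with E(eps) = 0
y = X @ beta + eps

# Ordinary least squares, equation (2): beta_LS = (x'x)^{-1} x'y = beta + M eps
M = np.linalg.inv(X.T @ X) @ X.T
beta_ls = M @ y
assert np.allclose(beta_ls, beta + M @ eps)
print(beta_ls)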

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε then β̂LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂LS)ₖ = βₖ + Mₖε and Mₖ(ε + c1) = Mₖε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂LS) = (1 − R²)s²(y), where s²(·) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂ML = (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹y = β + (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹ε   (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
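As a numerical sketch of the generalized least squares solution (5) (Python with NumPy; the AR(1)-type covariance and all parameter values are assumed purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([0.5, 2.0])

# Sigma known up to a constant multiple; here an AR(1)-type correlation matrix
Sigma = 0.8 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = X @ beta + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# Equation (5): beta_ML = (x' Sigma^{-1} x)^{-1} x' Sigma^{-1} y
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(beta_gls)   # retains the Gauss-Markov properties (3), (4)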

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σᵢ βᵢwᵢ(t), t ∈ T,

where the wᵢ(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y|x) = Ey + E((y − Ey)|x) = Ey + xβ for every x,

and the discrepancies y − E(y|x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (2009) compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xᵢ₁, …, xᵢₚ, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εᵢ are simply the residual discrepancies y − xβ̂LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear; going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y|X > c) < E(X|X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here. These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

EY 1(X > c) = E(Y 1(Y > c) + YJ) = EX 1(X > c) + E YJ
< EX 1(X > c) + c EJ = EX 1(X > c),

yielding E(Y|X > c) < E(X|X > c). ◻
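A quick Monte Carlo check of the proposition (a sketch, assuming a bivariate normal pair with identical N(0, 1) margins and correlation 0.5, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(2)
n, rho, c = 10**6, 0.5, 1.0
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)  # Y has the same N(0,1) law as X

sel = x > c
print(x[sel].mean(), y[sel].mean())   # E(X|X>c) exceeds E(Y|X>c): reversion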

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Freedman D (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21:291–313
Freedman D (2009) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1:292–298
Galton F (1877) Typical laws of heredity. Nature 15:492–495, 512–514, 532–533
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain Ireland 15:246–263
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A (1961) The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal 60
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32:951–989
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat 45:344–346
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, vol 3988. Springer, Berlin, pp 452–463

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xₙ) is a sample from a stochastic process x = {x₁, x₂, …}. Let pₙ(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n), θₙ)/pₙ(x(n), θ)], where θₙ = θ + n^(−1/2)h and h is a (k × 1) vector. The joint density pₙ(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^(−1/2) hᵗSₙ(θ) − (1/2) n^(−1) hᵗJₙ(θ)h + oₚ(1)   (1)

where Sₙ(θ) = d ln pₙ(x(n), θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n), θ)/dθ dθᵗ, and

(i) n^(−1/2) Sₙ(θ) →d Nₖ(0, F(θ)),  (ii) n^(−1) Jₙ(θ) →p F(θ)   (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random; see LeCam and Yang (2000) for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal and efficient estimator of θ, with

√n(θ̂ₙ − θ) →d Nₖ(0, F⁻¹(θ))   (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jₙ^(1/2)(θ) to get the limiting normal distribution, viz.,

Jₙ^(1/2)(θ)(θ̂ₙ − θ) →d N(0, I)   (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose that, conditionally on V = v, (x₁, x₂, …, xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n), θ) ∝ [1 + (1/2) Σᵢ₌₁ⁿ (xᵢ − θ)²]^(−(n/2+1))

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process. Consider a first-order autoregressive process {xₜ} defined by xₜ = θxₜ₋₁ + eₜ, t = 1, 2, …, with x₀ = 0, where the {eₜ} are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n), θ) ∝ exp[−(1/2) Σₜ₌₁ⁿ (xₜ − θxₜ₋₁)²]

For the stationary case, |θ| < 1, this model belongs to the LAN family. However, for |θ| > 1 the model belongs to the LAMN family. For any θ, the ML estimator is θ̂ₙ = (Σₜ₌₁ⁿ xₜxₜ₋₁)(Σₜ₌₁ⁿ xₜ₋₁²)⁻¹. One can verify that √n(θ̂ₙ − θ) →d N(0, 1 − θ²) for |θ| < 1, and (θ² − 1)⁻¹θⁿ(θ̂ₙ − θ) →d Cauchy for |θ| > 1.
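The following simulation sketch contrasts the two cases of Example 2 (Python with NumPy); the values of θ, n, and the random seed are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(3)

def ar1_ml_estimate(theta, n):
    """ML estimator theta_hat = sum(x_t x_{t-1}) / sum(x_{t-1}^2), with x_0 = 0."""
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + rng.normal()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

n = 200
print(ar1_ml_estimate(0.5, n))  # |theta| < 1 (LAN): sqrt(n)(theta_hat - theta) ~ N(0, 1 - theta^2)
print(ar1_ml_estimate(1.2, n))  # |theta| > 1 (LAMN): theta^n(theta_hat - theta)/(theta^2 - 1) ~ Cauchy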

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (2000) Asymptotics in statistics, 2nd edn. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x | a, b) = Pr(X ≤ x | a, b) = F((x − a)/b),  a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x | a, b) = dF_X(x | a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x | a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean µ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b µ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

● Arc-sine distribution
● Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U–shaped or the power–function distributions
● Cauchy and half–Cauchy distributions
● Special cases of the χ–distribution, like the half–normal, the Rayleigh and the Maxwell–Boltzmann distributions
● Ordinary and raised cosine distributions
● Exponential and reflected exponential distributions
● Extreme value distribution of the maximum and the minimum, each of type I
● Hyperbolic secant distribution
● Laplace distribution
● Logistic and half–logistic distributions
● Normal and half–normal distributions
● Parabolic distributions
● Rectangular or uniform distribution
● Semi–elliptical distribution
● Symmetric triangular distribution
● Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
● V–shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample data point on the abscissa is called the plotting position; for its choice see Barnett (1975, 1976), Blom (1958), and Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ … ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′,  1 = (1, 1, …, 1)′,  α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,
ε = (ε₁, ε₂, …, εₙ)′,  θ = (a, b)′,  A = (1  α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B = (E(Y_{i:n}Y_{j:n}) − α_{i:n}α_{j:n}) is the variance–covariance matrix of the reduced order statistics. With Ω = B⁻¹, the GLS estimator of θ is

θ̂ = (A′ΩA)⁻¹ A′Ω x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ΩA)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (2010). Maximum likelihood estimation for location–scale distributions is treated by Mi (2006).
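As a simple numerical sketch of the least-squares line on normal probability paper (Python with NumPy/SciPy): Blom's approximate plotting positions stand in for the exact means α_{i:n}, and an ordinary least-squares line replaces Lloyd's full GLS with the matrix B; the true a and b are assumed for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
a, b, n = 10.0, 2.0, 50
x = np.sort(rng.normal(a, b, size=n))          # order statistics X_{i:n}

i = np.arange(1, n + 1)
alpha = norm.ppf((i - 0.375) / (n + 0.25))     # Blom's approximation to E(Y_{i:n})

# Fit X_{i:n} = a + b*alpha_{i:n} + eps_i by ordinary least squares
A = np.column_stack([np.ones(n), alpha])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, x, rcond=None)
print(a_hat, b_hat)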

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University, worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and joined Volkswagen AG to do operations research before being appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was later Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat 24:95–108
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat 25:47–50
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc 55:546–560
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika 39:88–95
Mi J (2006) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Plann Inference
Rinne H (2010) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1986). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1982).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses rᵢ out of mᵢ trials (i = 1, …, n), the response probabilities Pᵢ are assumed to have a logistic-normal distribution with logit(Pᵢ) = log(Pᵢ/(1 − Pᵢ)) ∼ N(µᵢ, σ²), where µᵢ is modelled as a linear function of explanatory variables x₁, …, xₚ. The resulting model can be summarized as

Rᵢ | Pᵢ ∼ Binomial(mᵢ, Pᵢ)
logit(Pᵢ) | Z = ηᵢ + σZ = β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ + σZ
Z ∼ N(0, 1)

This is a simple extension of the basic logit model, with the inclusion of a single normally distributed random effect in the linear predictor: an example of a generalized linear mixed model, see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[Rᵢ] = mᵢπᵢ and Var(Rᵢ) = mᵢπᵢ(1 − πᵢ)[1 + σ²(mᵢ − 1)πᵢ(1 − πᵢ)],

where logit(πᵢ) = ηᵢ. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mᵢ = 1) it is not possible to have overdispersion arising in this way.
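A small simulation sketch of the logit-normal binomial model (Python with NumPy), comparing the empirical variance of R with the pure binomial variance and with Williams' approximation; m, η, and σ are assumed values for illustration:

import numpy as np

rng = np.random.default_rng(5)
m, eta, sigma, reps = 20, 0.0, 0.5, 200_000

z = rng.normal(size=reps)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))   # logit(P) ~ N(eta, sigma^2)
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
print(r.var())                                               # simulated Var(R)
print(m * pi * (1 - pi))                                     # binomial variance
print(m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi)))  # Williams' approximation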

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44:139–177
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63:1–12
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67:261–272
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat 31:144–148

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yᵢ; πᵢ) = πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ)   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods. In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yᵢ; θᵢ, ϕ) = exp{(yᵢθᵢ − b(θᵢ))/α(ϕ) + c(yᵢ; ϕ)}   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model. We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yᵢ; πᵢ) = exp{yᵢ ln(πᵢ/(1 − πᵢ)) + ln(1 − πᵢ)}   (3)

The link function is therefore ln(π/(1 − π)) and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π; the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model. Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, µ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(µᵢ; yᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(µᵢ/(1 − µᵢ)) + ln(1 − µᵢ)}   (4)

or

L(µᵢ; yᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(µᵢ) + (1 − yᵢ) ln(1 − µᵢ)}   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σᵢ₌₁ⁿ {yᵢ ln(yᵢ/µᵢ) + (1 − yᵢ) ln((1 − yᵢ)/(1 − µᵢ))}   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of µ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit ŷ is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(µᵢ/(1 − µᵢ)). We have, therefore,

x′ᵢβ = ln(µᵢ/(1 − µᵢ)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ   (7)

The value of µᵢ for each observation in the logistic model is calculated as

µᵢ = 1/(1 + exp(−x′ᵢβ)) = exp(x′ᵢβ)/(1 + exp(x′ᵢβ))   (8)

The functions to the right of µ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, µ is a probability. When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized to x′β rather than µ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σᵢ₌₁ⁿ (yᵢ − µᵢ)xᵢ   (9)

∂²L(β)/∂β∂β′ = −Σᵢ₌₁ⁿ xᵢx′ᵢ µᵢ(1 − µᵢ)   (10)
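The following sketch implements a plain Newton–Raphson iteration built directly from the gradient (9) and Hessian (10) (Python with NumPy); the simulated design and true coefficients are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))           # inverse link, equation (8)
    grad = X.T @ (y - mu)                          # gradient, equation (9)
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X   # Hessian, equation (10)
    step = np.linalg.solve(hess, grad)
    beta -= step                                   # Newton step: beta - H^{-1} g
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))        # standard errors from the Hessian
print(beta, se)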

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of µ, ln(µ/(1 − µ)), where the odds are understood as the success of µ over its failure, 1 − µ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults    Total
no             52       765      817
yes            57       442      499
Total         109      1207     1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships.
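The same arithmetic in a few lines of Python (using the counts tabulated above):

surv_adult, died_adult = 442, 765
surv_child, died_child = 57, 52

odds_adult = surv_adult / died_adult        # odds of survival, adults
odds_child = surv_child / died_child        # odds of survival, children
odds_ratio = odds_adult / odds_child
print(odds_adult, odds_child, odds_ratio, 1 / odds_ratio)  # ~0.58, ~1.10, ~0.53, ~1.9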

Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age            0.527          …        …      …              …

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had (∼1.9) some 90% – or


nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating µ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, y, indicating the number of successes for a specific covariate pattern, and the denominator, m, the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yᵢ; πᵢ, mᵢ) = (mᵢ choose yᵢ) πᵢ^yᵢ (1 − πᵢ)^(mᵢ−yᵢ)   (11)

with a corresponding log-likelihood function expressed as

L(µᵢ; yᵢ, mᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(µᵢ/(1 − µᵢ)) + mᵢ ln(1 − µᵢ) + ln(mᵢ choose yᵢ)}   (12)

Taking derivatives of the cumulant, −mᵢ ln(1 − µᵢ), as we did for the binary response model, produces a mean of µᵢ = mᵢπᵢ and variance µᵢ(1 − µᵢ/mᵢ). Consider the data below:

y    cases    x1    x2    x3
…

having the specic covariate patterne rst observationin the table informs us that there are three cases havingpredictor values of x = x = and x = Of thosethree cases one has a value of y equal to the other twohave values of All current commercial soware appli-cations estimate this type of logistic model using GLMmethodology

y     Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1        …              …           …      …              …
x2        …              …           …      …              …
x3        …              …           …      …              …

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many thousands of observations grouped into a table of only a handful of rows as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and, with Robert Muenchen, coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, Boca Raton
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − µ)/τ} / [1 + exp{−(x − µ)/τ}]²,  −∞ < x < ∞   (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − µ)/τ}],  −∞ < x < ∞.

The distribution is symmetric about the mean µ and has variance τ²π²/3, so that when comparing the standard logistic distribution (µ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − µ)/τ}],  −∞ < x < ∞,

h(x) = (1/τ) / [1 + exp{−(x − µ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for µ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as

heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (µ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which, in turn, by elementary calculus implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + µ.
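A minimal sketch of this inversion sampler (Python with NumPy; the values of µ and τ are illustrative):

import numpy as np

rng = np.random.default_rng(7)
mu, tau = 2.0, 1.5
u = rng.uniform(size=100_000)
x = mu + tau * np.log(u / (1 - u))       # X = log_e[U/(1-U)] is standard logistic

print(x.mean())                          # ~ mu
print(x.var(), tau**2 * np.pi**2 / 3)    # ~ tau^2 pi^2 / 3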

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters µ = −β₀/β₁ and τ = 1/|β₁|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square. Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912).

Lorenz Curve. Fig. 1 Lorenz curves LX(p) and LY(p) with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves L1(p) and L2(p); by the Gini index, one curve has greater inequality than the other

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
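For empirical data, the Lorenz curve and the Gini index can be computed directly from the definitions above. A minimal sketch, assuming Python with NumPy (the lognormal sample is purely illustrative):

import numpy as np

def lorenz_points(income):
    # Empirical Lorenz curve: population share p versus income share L(p)
    x = np.sort(np.asarray(income, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    L = np.cumsum(x) / x.sum()
    return np.concatenate([[0.0], p]), np.concatenate([[0.0], L])

def gini(income):
    # Twice the area between the diagonal and the Lorenz curve
    p, L = lorenz_points(income)
    return 1.0 - 2.0 * np.trapz(L, p)

incomes = np.random.default_rng(2).lognormal(mean=0.0, sigma=0.8, size=10_000)
print(round(gini(incomes), 3))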

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem. Let u(x) be a continuous monotone increasing function, and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, namely that u(x)/x crosses the level μY/μX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance. A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations," and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative, Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space, and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff (). In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, . . . , yn, observed at design points t1, . . . , tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes ∑_{i=1}^n ℓ(r_i^f) over f ∈ F, where r_i^f = y_i − f(t_i) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
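The losses discussed above are straightforward to implement. A minimal sketch, assuming Python with NumPy and SciPy; the tuning constants and the toy data are illustrative choices, not taken from the entry:

import numpy as np
from scipy.optimize import minimize_scalar

def square(t):
    return t ** 2

def absolute(t):
    return np.abs(t)

def huber(t, gamma=1.345):
    # quadratic near zero, linear in the tails (gamma = 1.345 is a common choice)
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):
    # asymmetric LinEx loss: exponential on one side, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1.0)

# M-estimation of a location parameter: minimize the summed loss of residuals
y = np.array([0.1, -0.4, 0.3, 0.2, 8.0])        # one gross outlier
for name, loss in [("square", square), ("absolute", absolute), ("huber", huber)]:
    fit = minimize_scalar(lambda m: loss(y - m).sum())
    print(name, round(fit.x, 3))   # square -> mean; absolute -> median; Huber between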

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam, pp 195–208


Lévy Processes

Because of the deficiencies in the Black–Scholes model, researchers in mathematical finance have been trying to find more suitable models for asset prices. Certain types of Lévy processes have been found to provide a good model for creep of concrete, fatigue crack growth, corroded steel gates, and chloride ingress into concrete. Furthermore, certain types of Lévy processes have been used to model the water input in dams. In this entry we will review Lévy processes, give important examples of such processes, and state some references to their applications.

Lévy Processes
A stochastic process X = {Xt, t ≥ 0} that has right-continuous sample paths with left limits is said to be a Lévy process if the following hold:

1. X has stationary increments, i.e., for every s, t ≥ 0, the distribution of Xt+s − Xt is independent of t.
2. X has independent increments, i.e., for every t, s ≥ 0, Xt+s − Xt is independent of σ(Xu, u ≤ t).
3. X is stochastically continuous, i.e., for every t ≥ 0 and є > 0, lim_{s→t} P(∣Xt − Xs∣ > є) = 0.

That is to say, a Lévy process is a stochastically continuous process with stationary and independent increments whose sample paths are right continuous with left-hand limits.

then its characteristic component φ(z) def= ln Φ(z)t is of the

form

iza minus zb+ int

R[exp(izx) minus minus izxI∣x∣lt]ν(dx)

where a isin R b isin R+ and ν is a measure on R satisfyingν() = intR( and x

)ν(dx) ltinfine measure ν characterizes the size and frequency of

the jumps If the measure is innite then the process hasinnitelymany jumps of very small sizes in any small inter-vale constant a dened above is called the dri term ofthe process and b is the variance (volatility) term

The Lévy–Itô decomposition identifies any Lévy process as the sum of three independent processes; it is stated as follows. Given any a ∈ R, b ∈ R+, and measure ν on R satisfying ν({0}) = 0 and ∫_R (1 ∧ x²) ν(dx) < ∞, there exists a probability space (Ω, F, P) on which a Lévy process X is defined. The process X is the sum of three independent processes X(1), X(2), and X(3), where X(1) is a Brownian motion with drift a and volatility b (in the sense defined below), X(2) is a compound Poisson process, and X(3) is a square integrable martingale. The characteristic exponents of X(1), X(2), and X(3) (denoted by φ(1)(z), φ(2)(z), and φ(3)(z), respectively) are as follows:

φ(1)(z) = iza − z²b/2,
φ(2)(z) = ∫_{∣x∣≥1} (exp(izx) − 1) ν(dx),
φ(3)(z) = ∫_{∣x∣<1} (exp(izx) − 1 − izx) ν(dx).

Examples of the Lévy Processes
The Brownian Motion
A Lévy process is said to be a Brownian motion (see 7Brownian Motion and Diffusions) with drift μ and volatility rate σ² if a = μ, b = σ², and ν(R) = 0. Brownian motion is the only nondeterministic Lévy process with continuous sample paths.

The Inverse Brownian Process
Let X be a Brownian motion with drift μ > 0 and volatility rate σ². For any x > 0, let Tx = inf{t : Xt > x}. Then {Tx, x ≥ 0} is an increasing Lévy process (called the inverse Brownian motion); its Lévy measure is given by

ν(dx) = 1/√(2πσ²x³) exp(−xμ²/(2σ²)) dx.

The Compound Poisson Process
The compound Poisson process (see 7Poisson Processes) is a Lévy process where b = 0 and ν is a finite measure.

The Gamma Process
The gamma process is a nonnegative increasing Lévy process X, where b = 0, a − ∫₀¹ x ν(dx) = 0, and its Lévy measure is given by

ν(dx) = αx⁻¹ exp(−x/β) dx, x > 0,

where α, β > 0. It follows that the mean term (E(X1)) and the variance term (V(X1)) for the process are equal to αβ and αβ², respectively.

The following is a simulated sample path of a gamma process (Fig. 1).
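Such a path is easy to simulate from the independent, stationary gamma increments. A minimal sketch, assuming Python with NumPy (the grid size and parameter values are illustrative):

import numpy as np

def gamma_process_path(alpha, beta, T=10.0, n_steps=1000, rng=None):
    # Increments over dt are Gamma(shape=alpha*dt, scale=beta), so that
    # E(X_1) = alpha*beta and Var(X_1) = alpha*beta^2, as in the entry
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    inc = rng.gamma(shape=alpha * dt, scale=beta, size=n_steps)
    return np.concatenate([[0.0], np.cumsum(inc)])

path = gamma_process_path(alpha=1.0, beta=1.0)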

The Variance Gamma Process
The variance gamma process is a Lévy process that can be represented either as the difference between two independent gamma processes or as a Brownian process subordinated by a gamma process. The latter is accomplished by a random time change, replacing the time of the Brownian process by a gamma process with a mean term equal to 1. The variance gamma process has three parameters: μ, the drift term of the Brownian process; σ, the volatility of the Brownian process; and ν, the variance term of the gamma process.

Lévy Processes. Fig. 1 Gamma process

Lévy Processes. Fig. 2 Brownian motion and variance gamma sample paths

The following are two simulated sample paths, one for a Brownian motion and the other for a variance gamma process with the same values of the drift and volatility terms (Fig. 2).
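A variance gamma path can be simulated by subordination, as described above: replace time by a gamma process with unit mean rate. A minimal sketch, assuming Python with NumPy and continuing the conventions of the previous sketch (the parameter values are illustrative):

def variance_gamma_path(mu, sigma, nu, T=10.0, n_steps=1000, rng=None):
    # Gamma time increments with mean dt and variance nu*dt subordinate
    # a Brownian motion with drift mu and volatility sigma
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    dG = rng.gamma(shape=dt / nu, scale=nu, size=n_steps)
    dX = mu * dG + sigma * np.sqrt(dG) * rng.standard_normal(n_steps)
    return np.concatenate([[0.0], np.cumsum(dX)])

vg = variance_gamma_path(mu=0.02, sigma=0.05, nu=1.0)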

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Çinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies, and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using PMλ,τ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS (1973) The pricing of options and corporate liabilities. J Pol Econ 81:637–654
Cont R, Tankov P (2004) Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC (1998) The variance gamma process and option pricing. Eur Finance Rev 2:79–105
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM (2009) A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety 94:2–21

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (ex) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

ex = Tx / lx,

where x is age, Tx is the number of person-years lived aged x and over, and lx is the number of survivors at age x according to the life table.


Life expectancy can be calculated for combined sexes or separately for males and females; there can be significant differences between sexes. Life expectancy at birth (e0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e0 = T0 / l0,

where T0 is the total size of the population and l0 is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e0 < e1 < e2 and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mortality in a population. It is a frequently used indicator of health standards and socio-economic living standards, and one of the most commonly used indicators of social development. The indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In ancient Greece and Rome the average life expectancy was very low by modern standards; over the past two centuries life expectancy at birth has risen dramatically, and it is highest in the richest countries (Riley 2001). Furthermore, in most industrialized countries in the early twenty-first century life expectancy averaged about 80 years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

Life expectancy varies significantly across regions and continents, from low life expectancies in some central African populations to high ones in many European countries. The more developed regions have a markedly higher average life expectancy at birth than the less developed regions; the two continents that display the most extreme differences in life expectancies are North America and Africa (UN data).

Countries with the highest life expectancies in the world include Australia, Iceland, Italy, Switzerland, and Japan; Japanese men and women have among the highest average life expectancies (WHO). In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, average life expectancy is far lower; some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe. In nearly all countries women live longer than men.

The world's average life expectancy at birth is about five years higher for females than for males. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes, whereas Tonga is among the few countries where life expectancy for males exceeds that for females (WHO).

According to estimation by the UN global life expectancyat birth is likely to rise to an average years by ndashBy life expectancy is expected to vary across countriesfrom to years Long-range United Nations popula-tion projections predict that by on average peoplecan expect to live more than years from (Liberia) upto years (Japan)For more details on the calculation of life expectancy

including continuous notation see Keytz ( )and Preston et al ()

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N (1968) Introduction to the mathematics of population. Addison-Wesley, Reading
Keyfitz N (1985) Applied mathematical demography, 2nd edn. Springer, New York
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC (2001) Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT (2003) Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA (2004) The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. Geneva
World Population Prospects: The Revision, Highlights. United Nations, New York
World Population to 2300 (2004) United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

μ(x) = −(d/dx) ln[1 − F(x)] = f(x)/[1 − F(x)].

Heuristically, it is interpreted by the relation μ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely, F(x) = 1 − exp{−∫₀ˣ μ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0)[1 − F(x)], normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of ℓ(0) newborn babies. The survival probability is

tpx = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ μ(x + s)ds},

and the non-survival probability is (the converse) tqx = 1 − tpx. For t = 1, one writes qx = 1qx and px = 1px. In particular we get ℓ(x + 1) = ℓ(x)px. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of px for all relevant x.

The life expectancy is eo0 = EX = ∫₀∞ ℓ(x)dx / ℓ(0). (The subscript 0 in eo0 indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript o indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

eox = E(X − x ∣ X > x) = ∫₀∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function μ(x) is piecewise constant, which means that we take it to equal some constant, say μj, over each interval (xj, xj+1) for some partition {xj} of the age axis. For a collection {Xi} of independent observations of X, let Dj be the number of Xi that fall in the interval (xj, xj+1). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dj is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xi − xj time units during the interval. Individuals who survive the interval will have lived for xj+1 − xj time units in the interval, and individuals who do not survive to age xj will not have lived during this interval at all. When we aggregate the time units lived in (xj, xj+1) over all individuals, we get a total Rj, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters μj, the collection {Dj, Rj} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = ∑j{−μjRj + Dj ln μj}, which is easily seen to be maximized by μ̂j = Dj/Rj. The latter fraction is therefore the maximum-likelihood estimator for μj. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function μx = a + bc^x, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values of a, b, and c that maximize Λ, and using a + bc^x for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.

L Likelihood

If the force of mortality is constant over a single-yearage interval (x x + ) say and is estimated by microx in thisinterval then px = eminusmicrox is an estimator of the single-yearsurvival probability pxis allows us to estimate the sur-vival function recursively for all corresponding ages usingℓ(x + ) = ℓ(x)px for x = and the rest of thelife table computations follow suit Life table constructionconsists in the estimation of the parameters and the tab-ulation of functions like those above from empirical datae data can be for age at death for individuals as in theexample indicated above but they can also be observa-tions of duration until recovery from an illness of intervalsbetween births of time until breakdown of some piece ofmachinery or of any other positive duration variableSo far we have argued as if the life table is computed for

a group of mutually independent individuals who have allbeen observed in parallel essentially a cohort that is fol-lowed from a signicant common starting point (namelyfrom birth in our mortality example) and which is dimin-ished over time due to decrements (attrition) caused bythe risk in question and also subject to reduction due tocensoring (withdrawals)e corresponding table is thencalled a cohort life table It is more common however toestimate a px schedule from data collected for the mem-bers of a population during a limited time period and touse the mechanics of life-table construction to produce aperiod life table from the px valuesLife table techniques are described in detail in most

introductory textbooks in actuarial statistics7biostatistics7demography and epidemiology See eg Chiang ()Elandt-Johnson and Johnson () Manton and Stallard() Preston et al () For the history of thetopic consult Seal () Smith and Keytz () andDupacircquier ()

About the Author
For biography, see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL (1984) The life table and its applications. Krieger, Malabar
Dupâquier J (1996) L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL (1980) Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E (1984) Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M (2001) Demography: measuring and modeling populations. Blackwell, Oxford
Seal H (1977) Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika 64(3):429–439
Smith D, Keyfitz N (1977) Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of R^k, and the variable y is assumed to take values in a subset of R^n; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (1997). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θ^y (1 − θ)^{n−y}, y = 0, 1, . . . , n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y},


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−(n/2) log σ² − (1/(2σ²)) Σ(yi − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y1, . . . , yn, and it is not difficult to show that this likelihood function only depends on the sample through the sample mean ȳ = n⁻¹Σyi and sample variance s² = (n − 1)⁻¹Σ(yi − ȳ)², or equivalently through Σyi and Σyi². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall (1997) and Edwards (1972), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of θ that is of interest; see section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials, and assuming the flat prior π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.

This is a very brief description of Bayesian inference, in which probability statements refer to the distribution propagated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y1, . . . , Yn) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise. Assuming that θ̂ is consistent for θ and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

has sucient regularity the follow asymptotic results canbe established

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →d χ²k, (2)

U(θ)ᵀ i⁻¹(θ) U(θ) →d χ²k, (3)

2{ℓ(θ̂) − ℓ(θ)} →d χ²k, (4)

where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²k is the 7chi-square distribution with k degrees of freedom.
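Result (4) calibrates likelihood intervals. Continuing the binomial sketch above (theta and loglik as computed there; SciPy assumed), an approximate 95% confidence set keeps all θ whose log-likelihood is within half the χ²₁ quantile of its maximum:

from scipy.stats import chi2

cut = chi2.ppf(0.95, df=1) / 2.0               # about 1.92 on the log scale
ci = theta[loglik.max() - loglik <= cut]
print(ci.min(), ci.max())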

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²k distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters. Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k1-dimensional parameter of interest (k1 will often be 1). The profile log-likelihood function of ψ is

ℓP(ψ) = ℓ(ψ, λ̂ψ),

where λ̂ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers. It can be verified, under suitable smoothness conditions, that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

that results similar to those at ( ndash ) hold as well for theprole log-likelihood function in particular

2{ℓP(ψ̂) − ℓP(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂ψ)} →d χ²k1.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yi − ŷi)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
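A minimal sketch of a profile log-likelihood, assuming Python with NumPy: for a normal mean μ with nuisance variance, the constrained MLE σ̂²(μ) has a closed form, giving ℓP(μ) = −(n/2) log σ̂²(μ) − n/2 (the simulated sample is illustrative):

import numpy as np

def profile_loglik_mu(mu_grid, y):
    n = y.size
    s2_hat = np.array([np.mean((y - mu) ** 2) for mu in mu_grid])  # sigma^2(mu)
    return -0.5 * n * np.log(s2_hat) - 0.5 * n

y = np.random.default_rng(3).normal(1.0, 2.0, size=50)
grid = np.linspace(-1.0, 3.0, 401)
lp = profile_loglik_mu(grid, y)
print(grid[np.argmax(lp)], y.mean())           # the profile MLE is y-bar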

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

πM(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f1(y1; ψ) f2(y2 ∣ y1; λ), or
f(y; θ) = f1(y1 ∣ y2; ψ) f2(y2; λ).

A discussion of conditional inference and density factorisations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser 2003). There is recent interest, for multi-dimensional responses Yi, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (1972), Azzalini (1996), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison, and Reid (2007). A short review paper is Reid (2000). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is an Associate Editor of Statistics and Computing, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A (1996) Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc 57:269–306
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables. J R Stat Soc B 34:187–220 (with discussion)
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika 79:113–130
Edwards AF (1972) Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A 222:309–368
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat 11:59–67
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat 25:1471–1509
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75:237–249
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci 10:138–157
Reid N (2000) Likelihood. J Am Stat Assoc 95:1335–1340
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS (2003) Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B 65:391–404
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml. Accessed Aug
Varin C (2008) On composite marginal likelihood. Adv Stat Anal 92:1–28


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X1, X2, . . . is a sequence of i.i.d. random variables,

Sn = ∑_{j=1}^n Xj,

and the expectation a = EX exists, then n⁻¹Sn → a almost surely, i.e., with probability 1. Thus the value na can be called the first order approximation for the sums Sn. The Central Limit Theorem (CLT) gives one a more precise approximation for Sn. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζn = (Sn − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζn < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sn.
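Both approximations are easy to see in simulation. A minimal sketch, assuming Python with NumPy (the exponential distribution, for which a = σ = 1, is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 5000
x = rng.exponential(scale=1.0, size=(reps, n))   # a = EX = 1, sigma = 1

means = x.mean(axis=1)                   # LLN: concentrates near a = 1
zeta = (x.sum(axis=1) - n) / np.sqrt(n)  # standardized sums
print(means.mean(), means.std())         # ~1 and ~1/sqrt(n)
print(np.mean(zeta < 1.0))               # CLT: close to Phi(1) = 0.8413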

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both the methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1900).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sn/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the Xj's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xj = Xj,n forming the sum Sn depend not only on j but on n as well. In this case the class of all limit laws for the distribution of Sn (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

Relaxing the assumption of independence of the $X_j$'s. Several types of "weak dependence" assumptions on the $X_j$ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see Ergodic Theorem) for a wide class of random sequences and processes.

Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence $\mathbf{P}(\zeta_n < x) - \Phi(x) \to 0$ and asymptotic expansions for this difference (in powers of $n^{-1/2}$ in the case of i.i.d. summands) have been obtained under broad assumptions.

Studying large deviation probabilities for the sums $S_n$ (theorems on rare events). If $x \to \infty$ together with $n$, then the CLT can only assert that $\mathbf{P}(\zeta_n > x) \to 0$. Theorems on large deviation probabilities aim to find a function $P(x, n)$ such that

$$\mathbf{P}(\zeta_n > x)/P(x, n) \to 1 \quad \text{as } n \to \infty,\ x \to \infty.$$

The nature of the function $P(x, n)$ essentially depends on the rate of decay of $\mathbf{P}(X > x)$ as $x \to \infty$ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio $x/\sqrt{n}$ as $n \to \infty$.

Considering observations $X_1, \ldots, X_n$ of a more complex nature – first of all, multivariate random vectors. If $X_j \in \mathbf{R}^d$, then the role of the limit law in the CLT is played by a $d$-dimensional normal (Gaussian) distribution with covariance matrix $\mathbf{E}(X - \mathbf{E}X)(X - \mathbf{E}X)^T$.
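As a numerical check of the Poisson limit theorem mentioned in the second direction above, the following sketch (Python; the values n = 500 and λ = 3 are illustrative assumptions) compares Binomial(n, λ/n) probabilities with their Poisson(λ) limits.

```python
from math import comb, exp, factorial

n, lam = 500, 3.0
p = lam / n            # probability of the "rare event" on each of n trials

for k in range(6):
    binom_pk = comb(n, k) * p**k * (1 - p)**(n - k)   # exact binomial probability
    poisson_pk = exp(-lam) * lam**k / factorial(k)    # Poisson limit
    print(k, round(binom_pk, 5), round(poisson_pk, 5))
```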

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let $X^*_n = (X_1, \ldots, X_n)$ be a sample from a distribution $F$ and $F^*_n(u)$ the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see Glivenko–Cantelli Theorems), stating that $\sup_u |F^*_n(u) - F(u)| \to 0$ almost surely as $n \to \infty$, is of the same nature as the LLN and basically means that the unknown distribution $F$ can be estimated arbitrarily well from a random sample $X^*_n$ of a large enough size $n$.

The existence of consistent estimators for the unknown parameters $a = \mathbf{E}X$ and $\sigma^2 = \mathbf{E}(X - a)^2$ also follows from the LLN since, as $n \to \infty$,

$$a^* = \frac{1}{n}\sum_{j=1}^{n} X_j \to a, \qquad (\sigma^2)^* = \frac{1}{n}\sum_{j=1}^{n}(X_j - a^*)^2 = \frac{1}{n}\sum_{j=1}^{n} X_j^2 - (a^*)^2 \to \sigma^2$$

almost surely.

Under additional moment assumptions on the distribution $F$, one can also construct asymptotic confidence intervals for the parameters $a$ and $\sigma^2$, as the distributions of the quantities $\sqrt{n}(a^* - a)$ and $\sqrt{n}((\sigma^2)^* - \sigma^2)$ converge, as $n \to \infty$, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution $F$.
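The sketch below (Python; the Gamma-distributed sample and the 95% level are illustrative assumptions) computes the estimators $a^*$ and $(\sigma^2)^*$ and the standard CLT-based confidence interval $a^* \pm z_{0.975}\sqrt{(\sigma^2)^*/n}$ for the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=400)   # sample from F (true mean a = 3.0)
n = x.size

a_star = x.mean()                  # a* = n^{-1} sum X_j
sigma2_star = x.var()              # (sigma^2)* = n^{-1} sum (X_j - a*)^2
z = 1.959964                       # 0.975 quantile of N(0, 1)

half_width = z * np.sqrt(sigma2_star / n)
print(f"a* = {a_star:.3f}, 95% CI = ({a_star - half_width:.3f}, {a_star + half_width:.3f})")
```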

The theorem on the asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov 1998). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for the objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution $F_\theta$ depends on a parameter $\theta$ such that, as $\theta$ approaches some "critical" value $\theta_0$, the distributions $F_\theta$ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for $F_\theta$ which is valid for the values of $\theta$ that are close to $\theta_0$. For instance, in actuarial studies, queueing theory, and some other applications, one of the main problems concerns the distribution of $S = \sup_{k \ge 0}(S_k - \theta k)$ under the assumption that $\mathbf{E}X = 0$. If $\theta > 0$ then $S$ is a proper random variable. If, however, $\theta \to 0$, then $S \to \infty$ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if $\sigma^2 = \operatorname{Var}(X) < \infty$, then there exists the limit

$$\lim_{\theta \downarrow 0} \mathbf{P}(\theta S > x) = e^{-2x/\sigma^2}, \qquad x > 0.$$

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of $S$ in situations where $\theta$ is small (a simulation appears after the second approach below).

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation $\mathbf{E}e^{\mu(X-\theta)} = 1$ has a solution $\mu > 0$, then, in the above example, one has

$$\mathbf{P}(S > x) \sim c\,e^{-\mu x}, \qquad x \to \infty,$$

where $c$ is a known constant. If the distribution $F$ of $X$ is subexponential (in this case $\mathbf{E}e^{\mu(X-\theta)} = \infty$ for any $\mu > 0$), then

$$\mathbf{P}(S > x) \sim \frac{1}{\theta}\int_x^{\infty}(1 - F(t))\,dt, \qquad x \to \infty.$$

This LT enables one to find approximations for $\mathbf{P}(S > x)$ for large $x$.

For both approaches, the obtained approximations can be refined.
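The Kingman–Prokhorov limit in the first approach lends itself to a quick simulation (a minimal sketch in Python; the normal increments, the drift θ = 0.05, the replication count, and the finite truncation horizon for the supremum are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, horizon, reps = 0.05, 1.0, 20000, 2000
x = 1.0

hits = 0
for _ in range(reps):
    increments = rng.normal(0.0, sigma, size=horizon) - theta  # X_k - theta, with EX = 0
    walk = np.cumsum(increments)                               # S_k - theta*k
    S = max(0.0, walk.max())                                   # sup over the truncated horizon
    hits += theta * S > x

print("P(theta*S > 1) ~", hits / reps,
      "   limit e^{-2x/sigma^2} =", np.exp(-2 * x / sigma**2))
```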

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as $n \to \infty$, the processes $\zeta_n(t) = (S_{\lfloor nt \rfloor} - ant)/(\sigma\sqrt{n})$, $t \in [0,1]$, converge in distribution to the standard Wiener process $w(t)$, $t \in [0,1]$. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (random walk) $S_1, \ldots, S_n$ can also be classified as LTs for stochastic processes.

The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary $\varepsilon > 0$, the (centered) random walk $\{S_k\}_{k=1}^{\infty}$ crosses the boundary $(1-\varepsilon)\sigma\sqrt{2k\ln\ln k}$ infinitely many times but crosses the boundary $(1+\varepsilon)\sigma\sqrt{2k\ln\ln k}$ only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes $w(t)$, $t \in [0,1]$, and $w(t)$, $t \in [0,\infty)$.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" $\sqrt{n}(F^*_n(u) - F(u))$, $u \in (-\infty,\infty)$, converges in distribution to $w^0(F(u))$, $u \in (-\infty,\infty)$, where $w^0(t) = w(t) - t\,w(1)$ is the Brownian bridge process. This LT implies the asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function $F^*_n(u)$.
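The following sketch (Python; the uniform sample and the use of the first terms of the Kolmogorov series are illustrative choices) simulates $\sqrt{n}\,\sup_u|F^*_n(u) - F(u)|$ and compares its tail with that of the supremum of the Brownian bridge, $\mathbf{P}(\sup|w^0| > x) = 2\sum_{k\ge 1}(-1)^{k-1}e^{-2k^2x^2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, x = 500, 5000, 1.36      # 1.36 is near the 95% point of the Kolmogorov law

stats = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))                   # sample from F = Uniform(0, 1)
    i = np.arange(1, n + 1)
    d = np.maximum(i / n - u, u - (i - 1) / n).max()   # sup_u |F*_n(u) - F(u)|
    stats[r] = np.sqrt(n) * d

k = np.arange(1, 101)
kolmogorov_tail = 2 * np.sum((-1) ** (k - 1) * np.exp(-2 * k**2 * x**2))
print("simulated P(sqrt(n) D_n > x):", (stats > x).mean(), "   limit:", kolmogorov_tail)
```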

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
Almost Sure Convergence of Random Variables
Approximations to Distributions
Asymptotic Normality
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Central Limit Theorems
Empirical Processes
Ergodic Theorem
Glivenko–Cantelli Theorems
Large Deviations and Applications
Laws of Large Numbers
Martingale Central Limit Theorem
Probability Theory: An Outline
Random Matrix Theory
Strong Approximations in Probability and Statistics
Weak Convergence of Probability Measures

References and Further Reading
Billingsley P (1968) Convergence of probability measures. Wiley, New York
Borovkov AA (1998) Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (1977–1978) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

$$y_i \mid b_i \sim N(X_i\beta + Z_i b_i, \Sigma_i), \qquad (1)$$
$$b_i \sim N(0, D), \qquad (2)$$

where $X_i$ and $Z_i$ are $(n_i \times p)$ and $(n_i \times q)$ dimensional matrices of known covariates, $\beta$ is a $p$-dimensional vector of regression parameters, called the fixed effects, $D$ is a general $(q \times q)$ covariance matrix, and $\Sigma_i$ is a $(n_i \times n_i)$ covariance matrix which depends on $i$ only through its dimension $n_i$, i.e., the set of unknown parameters in $\Sigma_i$ will not depend upon $i$. Finally, $b_i$ is a vector of subject-specific, or random, effects.

The above model can be interpreted as a linear regression model (see Linear Regression Models) for the vector $y_i$ of repeated measurements for each unit separately, where some of the regression parameters are specific to the unit (random effects $b_i$), while others are not (fixed effects $\beta$). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, $E(b_i) = 0$ implies that the mean of $y_i$ still equals $X_i\beta$, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, $y_i$ also follows a normal distribution with mean vector $X_i\beta$ and with covariance matrix $V_i = Z_i D Z_i' + \Sigma_i$.

Note that the random effects in (1) implicitly imply the marginal covariance matrix $V_i$ of $y_i$ to be of the very specific form $V_i = Z_i D Z_i' + \Sigma_i$. Let us consider two examples under the assumption of conditional independence, i.e., assuming $\Sigma_i = \sigma^2 I_{n_i}$. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates $Z_i$ which are $n_i$-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

$$y_i \sim N(X_i\beta, V_i), \qquad V_i = Z_i D Z_i' + \Sigma_i,$$

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix $V_i$.
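For the random-intercept example just described, the implied marginal covariance matrix has compound-symmetry form $V_i = d\,J_{n_i} + \sigma^2 I_{n_i}$, with $d = \operatorname{Var}(b_i)$ and $J_{n_i}$ a matrix of ones. A minimal sketch (Python; the values d = 2, σ² = 1, and nᵢ = 4 are illustrative assumptions):

```python
import numpy as np

d, sigma2, n_i = 2.0, 1.0, 4          # Var(b_i), residual variance, measurements per unit

Z = np.ones((n_i, 1))                  # random-intercept design: a column of ones
D = np.array([[d]])                    # 1x1 covariance matrix of the random effect
Sigma = sigma2 * np.eye(n_i)           # conditional independence: Sigma_i = sigma^2 I

V = Z @ D @ Z.T + Sigma                # marginal covariance V_i = Z D Z' + Sigma_i
print(V)
# Every off-diagonal entry equals d, giving within-unit correlation d / (d + sigma^2)
print("intraclass correlation:", d / (d + sigma2))
```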

With respect to the estimation of the unit-specific parameters $b_i$, the posterior distribution of $b_i$, given the observed data $y_i$, can be shown to be (multivariate) normal with mean vector equal to $D Z_i' V_i^{-1}(\alpha)(y_i - X_i\beta)$. Replacing $\beta$ and $\alpha$ by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates $\hat{b}_i$ for the $b_i$. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction $\hat{y}_i \equiv X_i\hat{\beta} + Z_i\hat{b}_i$ of the $i$th profile. It can easily be shown that

$$\hat{y}_i = \Sigma_i V_i^{-1} X_i\hat{\beta} + \left(I_{n_i} - \Sigma_i V_i^{-1}\right) y_i,$$

which can be interpreted as a weighted average of the population-averaged profile $X_i\hat{\beta}$ and the observed data $y_i$, with weights $\Sigma_i V_i^{-1}$ and $I_{n_i} - \Sigma_i V_i^{-1}$, respectively. Note that the "numerator" of $\Sigma_i V_i^{-1}$ represents within-unit variability and the "denominator" is the overall covariance matrix $V_i$. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile $X_i\hat{\beta}$. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination $\lambda$ of the random effects,

$$\operatorname{var}(\lambda'\hat{b}_i) \le \operatorname{var}(\lambda' b_i) = \lambda' D \lambda.$$
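The shrinkage weights can be computed directly from the matrices above. The sketch below (Python; it reuses the illustrative random-intercept quantities d, σ², nᵢ from the previous block together with a hypothetical fitted mean profile and observed responses) evaluates the EB prediction as the stated weighted average.

```python
import numpy as np

d, sigma2, n_i = 2.0, 1.0, 4
Z = np.ones((n_i, 1))
D = np.array([[d]])
Sigma = sigma2 * np.eye(n_i)
V = Z @ D @ Z.T + Sigma

# Hypothetical fitted mean profile X_i beta-hat and observed responses y_i
mean_profile = np.array([1.0, 1.5, 2.0, 2.5])
y_i = np.array([2.0, 2.6, 3.1, 3.4])

W = Sigma @ np.linalg.inv(V)                     # weight on the population profile
y_hat = W @ mean_profile + (np.eye(n_i) - W) @ y_i

b_hat = D @ Z.T @ np.linalg.inv(V) @ (y_i - mean_profile)   # EB estimate of b_i
print("EB random intercept:", b_hat.ravel())
print("shrunken prediction:", y_hat)
```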

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
Best Linear Unbiased Estimation in Linear Models
General Linear Models
Testing Variance Components in Mixed Linear Models
Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction."
(F. Galton)

The linear regression model of statistics is any functional relationship $y = f(x, \beta, \varepsilon)$ involving a dependent real-valued variable $y$, independent variables $x$, model parameters $\beta$, and random variables $\varepsilon$, such that a measure of central tendency for $y$ in relation to $x$, termed the regression function, is linear in $\beta$. Possible regression functions include the conditional mean $E(y \mid x, \beta)$ (as when $\beta$ is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps $y$ is corn yield from a given plot of earth, and variables $x$ include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on $y$. The form of this linkage is specified by a function $f$ known to the experimenter, one that depends upon parameters $\beta$ whose values are not known, and also upon unseen random errors $\varepsilon$ about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in $\beta$. Draper and Smith (1998) is a plainly written elementary introduction to linear regression models; Rao (1973) is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies, embedded in a white noise background: $y[t] = \beta_1 + \beta_2 \sin[t] + \beta_3 \sin[2t] + \varepsilon$. We write $\beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ for $\beta = (\beta_1, \beta_2, \beta_3)$, $x = (x_1, x_2, x_3)$, $x_1 \equiv 1$, $x_2(t) = \sin[t]$, $x_3(t) = \sin[2t]$, $t \ge 0$. A natural choice of regression function is $m(x, \beta) = E(y \mid x, \beta) = \beta_1 + \beta_2 x_2 + \beta_3 x_3$, provided $E\varepsilon \equiv 0$. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy $E\varepsilon_i \equiv 0$, $E\varepsilon_i\varepsilon_k \equiv \sigma^2 > 0$ for $i = k \le n$, and $E\varepsilon_i\varepsilon_k \equiv 0$ otherwise. Errors in linear regression models typically depend upon the instances $i$ at which we select observations, and may in some formulations depend also on the values of $x$ associated with instance $i$ (perhaps the errors are correlated, and that correlation depends upon the $x$ values). What we observe are $y_i$ and associated values of the independent variables $x$. That is, we observe $(y_i, \sin[t_i], \sin[2t_i])$, $i \le n$. The linear model on data may be expressed $y = x\beta + \varepsilon$, with $y$ = column vector $(y_i,\ i \le n)$, likewise for $\varepsilon$, and matrix $x$ (the design matrix) whose $n \times 3$ entries are $x_{ik} = x_k[t_i]$.
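A minimal sketch of this setup (Python; the reconstructed frequencies sin t and sin 2t, the true coefficients, and the noise level are illustrative assumptions) builds the design matrix and recovers the coefficients by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
t = np.linspace(0.0, 10.0, n)

beta_true = np.array([0.5, 2.0, -1.0])                        # assumed (beta_1, beta_2, beta_3)
X = np.column_stack([np.ones(n), np.sin(t), np.sin(2 * t)])   # n x 3 design matrix
y = X @ beta_true + rng.normal(0.0, 0.3, size=n)              # white-noise errors

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)               # least-squares fit
print("beta_LS:", beta_ls)
```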

Terminology
"Independent variables," as employed in this context, is misleading: it derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If $y$ is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either $x$, $\beta$, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (i.i.d.) additive normal errors in linear regression, where least squares has particular advantages (connections with multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following its discovery only a short time before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: $w$ = standard score of weight of parental sweet pea seed(s), $y$ = standard score of seed weight(s) of their immediate issue. (Each sweet pea seed has but one parent, and the distributions of $w$ and $y$ are the same.) Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight $w$, median filial seed weight $m(w)$). Since for this data $s_y \sim s_w$, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was $r\,s_y/s_w = r$ (the correlation). Medians $m(w)$ being essentially equal to means of $y$ for each $w$ greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights $\pm 1$, $\pm 2$, $\pm 3$ standard deviations from the mean of $w$. Galton gave the name co-relation (later correlation) to the slope (approximately 0.33) of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if $0 < r < 1$, then for a value $w > Ew$ the relation $Ew < m(w) < w$ follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points $(w, y)$ departed on each vertical from the regression line by statistically independent $N(0, \theta^2)$ random residuals whose variance $\theta^2 > 0$ was the same for all $w$. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations $(w, y)$ (Galton 1886).

Echoes of those long ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model $y = f(x, \beta)$, and through the device of incorporating random errors we have now an established paradigm for fitting models to data $(x, y)$ by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response $y$ to particular given inputs $x$, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function $f$ and the other model components differing among them. Two statistical models may disagree substantially in structure and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of $y$ vis-à-vis $x$. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of $y$, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2003), Freedman (1991), Freedman (2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed $y = x\beta + \varepsilon$, an abbreviated matrix formulation of the system of equations

$$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \varepsilon_i, \quad i \le n, \qquad (1)$$

in which the random errors are assumed to satisfy $E\varepsilon_i \equiv 0$, $E\varepsilon_i^2 \equiv \sigma^2 > 0$, $i \le n$, and $E\varepsilon_i\varepsilon_k \equiv 0$ for $i \ne k$. The interpretation is that response $y_i$ is observed for the $i$th sample in conjunction with numerical values $(x_{i1}, \ldots, x_{ip})$ of the independent variables. If these errors $\varepsilon_i$ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix $x^{tr}x$ is non-singular, then the maximum likelihood (ML) estimates of the model coefficients $\beta_k$, $k \le p$, are produced by ordinary least squares (LS) as follows:

$$\hat{\beta}_{ML} = \hat{\beta}_{LS} = (x^{tr}x)^{-1}x^{tr}y = \beta + M\varepsilon \qquad (2)$$

for $M = (x^{tr}x)^{-1}x^{tr}$, with $x^{tr}$ denoting the matrix transpose of $x$ and $(x^{tr}x)^{-1}$ the matrix inverse. These coefficient estimates $\hat{\beta}_{LS}$ are linear functions in $y$ and satisfy the Gauss–Markov properties (3), (4):

$$E(\hat{\beta}_{LS})_k = \beta_k, \quad k \le p, \qquad (3)$$

and, among all unbiased estimators $\beta^*_k$ (of $\beta_k$) that are linear in $y$,

$$E\left((\hat{\beta}_{LS})_k - \beta_k\right)^2 \le E\left(\beta^*_k - \beta_k\right)^2 \quad \text{for every } k \le p. \qquad (4)$$

The least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If $y = x\beta + \varepsilon$, then $\hat{\beta}_{LS} = My = M(x\beta + \varepsilon) = \beta + M\varepsilon$, as in (2). That is, the least squares estimate of model coefficients acts on $x\beta + \varepsilon$, returning $\beta$ plus the result of applying least squares to $\varepsilon$. This has nothing to do with the model being correct or $\varepsilon$ being error, but is purely algebraic. If $\varepsilon$ itself has the form $xb + e$, then $M\varepsilon = b + Me$. Another useful observation is that if $x$ has first column identically one, as would typically be the case for a model with constant term, then each row $M_k$, $k > 1$, of $M$ satisfies $M_k \mathbf{1} = 0$ (i.e., $M_k$ is a contrast), so $(\hat{\beta}_{LS})_k = \beta_k + M_k\varepsilon$ and $M_k(\varepsilon + c\mathbf{1}) = M_k\varepsilon$; so $\varepsilon$ may as well be assumed to be centered, for $k > 1$. There are many of these interesting algebraic properties, such as $s^2(y - x\hat{\beta}_{LS}) = (1 - R^2)\,s^2(y)$, where $s^2(\cdot)$ denotes the sample variance and $R$ is the multiple correlation, defined as the correlation between $y$ and the fitted values $x\hat{\beta}_{LS}$. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors $\varepsilon$ and contrasts $v$, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of $v(\hat{\beta}_{LS} - \beta)$, conditional on the order statistics of $\varepsilon$ (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors $\varepsilon$ are $N(0, \Sigma)$ distributed for a covariance matrix $\Sigma$ known up to a constant multiple, then the maximum likelihood estimates of the coefficients $\beta$ are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of $\Sigma$ will produce the same result), given by

$$\hat{\beta}_{ML} = (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}y = \beta + (x^{tr}\Sigma^{-1}x)^{-1}x^{tr}\Sigma^{-1}\varepsilon. \qquad (5)$$

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed (a numerical sketch appears below). It must not be confused with generalized linear models, which refers to models equating moments of $y$ to nonlinear functions of $x\beta$.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis, and their consequent distribution theory.
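A minimal numerical sketch of the generalized least squares solution (5) (Python; the AR(1)-style covariance and all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.05])

# Correlated errors: Sigma_{ij} = rho^{|i-j|}, known up to a constant multiple
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
eps = rng.multivariate_normal(np.zeros(n), Sigma)
y = X @ beta + eps

Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)    # equation (5)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # ordinary LS for comparison
print("GLS:", beta_gls, "  OLS:", beta_ols)
```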

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function $\varepsilon = \varepsilon(t)$, $t \in T$, has zero means $E\varepsilon(t) \equiv 0$ and a covariance function $K(s, t) = E\varepsilon(s)\varepsilon(t)$ that is assumed known up to some positive constant multiple. In this formulation,

$$y(t) = m(t) + \varepsilon(t), \quad t \in T,$$
$$m(t) = Ey(t) = \Sigma_i \beta_i w_i(t), \quad t \in T,$$

where the $w_i(\cdot)$ are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) $H(K)$ of the kernel $K$. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining $m$ is convergent in $H(K)$. Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients $\beta$, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of $(x, y)$ as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of $y$ relative to $x$ is then, for some $\beta$,

$$E(y \mid x) = Ey + E((y - Ey) \mid x) = Ey + x\beta \quad \text{for every } x,$$

and the discrepancies $y - E(y \mid x)$ are for each $x$ distributed $N(0, \sigma^2)$, for a fixed $\sigma^2$, independent for different $x$.

Comparing Two Basic Linear Regression Models
Freedman compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.

Sampling model: Data $(x_{i1}, \ldots, x_{ip}, y_i)$, $i \le n$, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the $\varepsilon_i$ are simply the residual discrepancies $y - x\hat{\beta}_{LS}$ of a least squares fit of the linear model $x\beta$ to the population. Galton's seed study is an example of this, if we regard his data $(w, y)$ as resulting from equal-probability without-replacement random sampling of a population of pairs $(w, y)$, with $w$ restricted to be at $\pm 1$, $\pm 2$, $\pm 3$ standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict $y$ from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of $y$ from $w$ owing to the modest correlation, was arguably best for his bivariate normal data $(w, y)$. Tossing in another independent variable, such as $w^2$ for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of $x$ accounts for enough of the variation of $y$ in relation to the variables of interest to be useful, and if its fewer coefficients $\beta$ are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of $y$ (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., $X > c > EX$, will not average so highly on another scoring $Y$ as they do on $X$, i.e., $E(Y \mid X > c) < E(X \mid X > c)$. Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when $X, Y$ have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables $X, Y$ be identically distributed with finite mean $EX$, and fix any $c > \max(0, EX)$. If $P(X > c \text{ and } Y > c) < P(Y > c)$, then there is reversion to mediocrity or beyond for that $c$.

Proof. For any given $c > \max(0, EX)$, define the difference $J$ of indicator random variables, $J = 1(X > c) - 1(Y > c)$. $J$ is zero unless one indicator is 1 and the other 0; $YJ$ is less than or equal to $cJ = c$ on $J = 1$ (i.e., on $X > c$, $Y \le c$), and $YJ$ is strictly less than $cJ = -c$ on $J = -1$ (i.e., on $X \le c$, $Y > c$). Since the event $J = -1$ has positive probability by assumption, the previous implies $E\,YJ < c\,EJ$, and so

$$E\,Y\,1(X > c) = E(Y\,1(Y > c) + YJ) = E\,X\,1(X > c) + E\,YJ < E\,X\,1(X > c) + c\,EJ = E\,X\,1(X > c),$$

yielding $E(Y \mid X > c) < E(X \mid X > c)$. ◻

Cross References
Adaptive Linear Regression
Analysis of Areal and Spatial Interaction Data
Business Forecasting Methods
Gauss–Markov Theorem
General Linear Models
Heteroscedasticity
Least Squares
Optimum Experimental Design
Partial Least Squares Regression Versus Other Methods
Regression Diagnostics
Regression Models with Increasing Numbers of Unknown Parameters
Ridge and Surrogate Ridge Regressions
Simple Linear Regression
Two-Stage Least Squares

References and Further Reading
Berk RA (2003) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat
Freedman D (1981) Bootstrapping regression models. Ann Stat
Freedman D (1991) Statistical models and shoe leather. Sociol Methodol
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F (1877) Typical laws of heredity. Nature
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A (1961) The advanced theory of statistics, volume 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose $x(n) = (x_1, \ldots, x_n)$ is a sample from a stochastic process $x = \{x_1, x_2, \ldots\}$. Let $p_n(x(n), \theta)$ denote the joint density function of $x(n)$, where $\theta \in \Omega \subset R^k$ is a parameter. Define the log-likelihood ratio $\Lambda_n = \ln\!\left[\frac{p_n(x(n), \theta_n)}{p_n(x(n), \theta)}\right]$, where $\theta_n = \theta + n^{-1/2}h$ and $h$ is a $(k \times 1)$ vector. The joint density $p_n(x(n), \theta)$ belongs to a local asymptotic normal (LAN) family if $\Lambda_n$ satisfies

$$\Lambda_n = n^{-1/2}\,h^t S_n(\theta) - \frac{1}{2n}\,h^t J_n(\theta)\,h + o_p(1), \qquad (1)$$

where $S_n(\theta) = \dfrac{d \ln p_n(x(n), \theta)}{d\theta}$, $J_n(\theta) = -\dfrac{d^2 \ln p_n(x(n), \theta)}{d\theta\, d\theta^t}$, and

$$\text{(i)}\ n^{-1/2} S_n(\theta) \xrightarrow{d} N_k(0, F(\theta)), \qquad \text{(ii)}\ n^{-1} J_n(\theta) \xrightarrow{p} F(\theta), \qquad (2)$$

$F(\theta)$ being the limiting Fisher information matrix. Here $F(\theta)$ is assumed to be non-random. See LeCam and Yang (2000) for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator $\hat{\theta}_n$ is a consistent, asymptotically normal, and efficient estimator of $\theta$, with

$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N_k(0, F^{-1}(\theta)). \qquad (3)$$

A large class of models involving classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family. If the limiting Fisher information matrix $F(\theta)$ is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If $\Lambda_n$ satisfies (1) and (2) with $F(\theta)$ random, the density $p_n(x(n), \theta)$ belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm $\sqrt{n}$ by a random norm $J_n^{1/2}(\theta)$ to get the limiting normal distribution, viz.,

$$J_n^{1/2}(\theta)\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, I), \qquad (4)$$

where $I$ is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals.

Suppose, conditionally on $V = v$, $(x_1, x_2, \ldots, x_n)$ are i.i.d. $N(\theta, v^{-1})$ random variables, and $V$ is an exponential random variable with mean 1. The marginal joint density of $x(n)$ is then given by

$$p_n(x(n), \theta) \propto \left[1 + \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right]^{-\left(\frac{n}{2} + 1\right)}.$$

It can be verified that $F(\theta)$ is an exponential random variable with mean 1. The ML estimator is $\hat{\theta}_n = \bar{x}$, and $\sqrt{n}(\bar{x} - \theta) \xrightarrow{d} t(2)$, a $t$-distribution with two degrees of freedom. It is interesting to note that the variance of the limit distribution of $\bar{x}$ is $\infty$.

Example 2: Autoregressive process.

Consider a first-order autoregressive process $\{x_t\}$ defined by $x_t = \theta x_{t-1} + e_t$, $t = 1, 2, \ldots$, with $x_0 = 0$, where the $e_t$ are assumed to be i.i.d. $N(0, 1)$ random variables. We then have

$$p_n(x(n), \theta) \propto \exp\left[-\frac{1}{2}\sum_{t=1}^{n}(x_t - \theta x_{t-1})^2\right].$$

For the stationary case, $|\theta| < 1$, this model belongs to the LAN family. However, for $|\theta| > 1$ the model belongs to the LAMN family. For any $\theta$, the ML estimator is

$$\hat{\theta}_n = \left(\sum_{t=1}^{n} x_t x_{t-1}\right)\left(\sum_{t=1}^{n} x_{t-1}^2\right)^{-1}.$$

One can verify that $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, 1 - \theta^2)$ for $|\theta| < 1$, and $(\theta^2 - 1)^{-1}\theta^n(\hat{\theta}_n - \theta) \xrightarrow{d}$ Cauchy for $|\theta| > 1$.
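The contrast between the two regimes can be seen numerically. The sketch below (Python; the sample size, seed, and the two θ values are illustrative assumptions) computes the ML estimator $\hat{\theta}_n = \sum x_t x_{t-1} / \sum x_{t-1}^2$ for simulated stationary and explosive paths.

```python
import numpy as np

def ar1_mle(theta, n, rng):
    """Simulate x_t = theta*x_{t-1} + e_t (x_0 = 0) and return the ML estimator."""
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return (x[1:] * x[:-1]).sum() / (x[:-1] ** 2).sum()

rng = np.random.default_rng(6)
for theta in (0.5, 1.2):          # |theta| < 1: LAN regime; |theta| > 1: LAMN regime
    est = [ar1_mle(theta, 200, rng) for _ in range(2000)]
    print(theta, "mean of ML estimates:", np.mean(est))
```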

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
Asymptotic Normality
Optimal Statistical Inference in Financial Engineering
Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (2000) Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable $X$ with realization $x$ belongs to the location-scale family when its cumulative distribution is a function only of $(x - a)/b$:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x - a}{b}\right), \quad a \in R,\ b > 0,$$

where $F(\cdot)$ is a distribution having no other parameters. Different $F(\cdot)$'s correspond to different members of the family. $(a, b)$ is called the location–scale parameter, $a$ being the location parameter and $b$ being the scale parameter. For fixed $b = 1$ we have a subfamily which is a location family with parameter $a$, and for fixed $a = 0$ we have a scale family with parameter $b$. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable. It has $a = 0$ and $b = 1$. If the distribution of $X$ is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},$$

then $(a, b)$ is a location–scale parameter for the distribution of $X$ if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\,f\!\left(\frac{x - a}{b}\right)$$

for some density $f(\cdot)$, called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When $Y$ has mean $\mu_Y$ and standard deviation $\sigma_Y$, then the mean of $X$ is $E(X) = a + b\,\mu_Y$ and the standard deviation of $X$ is

$$\sqrt{\operatorname{Var}(X)} = b\,\sigma_Y.$$

The location parameter $a$, $a \in R$, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of $a$ causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter $b$, $b > 0$, is responsible for the dispersion or variation of the variate $X$. Increasing (decreasing) $b$ results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; $b$ may be the standard deviation, the full or half length of the support, or the length of a central $(1 - \alpha)$-interval.

The location-scale family has a great number of members:
Arc-sine distribution
Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
Cauchy and half-Cauchy distributions
Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
Ordinary and raised cosine distributions
Exponential and reflected exponential distributions
Extreme value distribution of the maximum and the minimum, each of type I
Hyperbolic secant distribution
Laplace distribution
Logistic and half-logistic distributions
Normal and half-normal distributions
Parabolic distributions
Rectangular or uniform distribution
Semi-elliptical distribution
Symmetric triangular distribution
Teissier distribution, with reduced density $f(y) = [\exp(y) - 1]\exp[1 + y - \exp(y)]$, $y \ge 0$
V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (1975, 1976), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for $a$ and $b$ as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of $a$ and $b$ will be the parameters of this line.

The latter approach takes the order statistics $X_{i:n}$, $X_{1:n} \le X_{2:n} \le \cdots \le X_{n:n}$, as regressand and the means of the reduced order statistics $\alpha_{i:n} = E(Y_{i:n})$ as regressor, which under these circumstances act as plotting positions. The regression model reads:

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where $\varepsilon_i$ is a random variable expressing the difference between $X_{i:n}$ and its mean $E(X_{i:n}) = a + b\,\alpha_{i:n}$. As the order statistics $X_{i:n}$ and, as a consequence, the disturbance terms $\varepsilon_i$ are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (1952), the generalized least-squares method to find best linear unbiased estimators of $a$ and $b$. Introducing the following vectors and matrices:

$$x = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \alpha = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \theta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad A = (\mathbf{1}\ \ \alpha),$$

the regression model now reads

$$x = A\,\theta + \varepsilon,$$

with variance–covariance matrix

$$\operatorname{Var}(x) = b^2\,B.$$

The GLS estimator of $\theta$ is

$$\hat{\theta} = (A'B^{-1}A)^{-1}A'B^{-1}x,$$

and its variance–covariance matrix reads

$$\operatorname{Var}(\hat{\theta}) = b^2\,(A'B^{-1}A)^{-1}.$$

The vector $\alpha$ and the matrix $B$ are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining $E(Y_{i:n})$ and $E(Y_{i:n}Y_{j:n})$. For more details on linear estimation and probability plotting for location–scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (2010). Maximum likelihood estimation for location–scale distributions is treated by Mi (2006).
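For the exponential case the closed forms are $\alpha_{i:n} = \sum_{k=n-i+1}^{n} k^{-1}$ and $\operatorname{Cov}(Y_{i:n}, Y_{j:n}) = \sum_{k=n-\min(i,j)+1}^{n} k^{-2}$, so Lloyd's estimator can be written out directly. A minimal sketch (Python; the simulated sample with a = 10, b = 2 is an illustrative assumption):

```python
import numpy as np

def exponential_blue(x_sorted):
    """Lloyd's BLUE of (a, b) for the two-parameter exponential from order statistics."""
    n = len(x_sorted)
    i = np.arange(1, n + 1)
    # alpha_i = E(Y_{i:n}); B_{ij} = Cov(Y_{i:n}, Y_{j:n}) for the reduced exponential
    harmonic = np.cumsum(1.0 / np.arange(n, 0, -1))        # alpha_1, ..., alpha_n
    sq = np.cumsum(1.0 / np.arange(n, 0, -1) ** 2)         # partial sums of k^{-2}
    B = sq[np.minimum.outer(i, i) - 1]
    A = np.column_stack([np.ones(n), harmonic])
    Binv_A = np.linalg.solve(B, A)                         # B^{-1} A
    theta = np.linalg.solve(A.T @ Binv_A, Binv_A.T @ x_sorted)
    return theta                                           # (a_hat, b_hat)

rng = np.random.default_rng(7)
sample = np.sort(10.0 + 2.0 * rng.exponential(size=50))    # a = 10, b = 2
print(exponential_blue(sample))
```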

About the Author
Dr. Horst Rinne is Professor Emeritus of Statistics and Econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J (2006) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H (2010) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case, this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1986). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1982).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1, \ldots, n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution with $\operatorname{logit}(P_i) = \log(P_i/(1 - P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1, \ldots, x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \text{Binomial}(m_i, P_i),$$
$$\operatorname{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if $\sigma^2$ is small, then, as derived in Williams (1982),

$$E[R_i] = m_i\pi_i \quad \text{and} \quad \operatorname{Var}(R_i) = m_i\pi_i(1 - \pi_i)\left[1 + \sigma^2(m_i - 1)\pi_i(1 - \pi_i)\right],$$

where $\operatorname{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect $Z$ allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
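Since the logit-normal density has no closed-form moments, Gauss–Hermite quadrature is the standard numerical device. The sketch below (Python; μ = 0.5, σ = 1, m = 10 trials, and 40 quadrature nodes are illustrative choices) computes E(P) and Var(P) for P = logit⁻¹(μ + σZ), and the implied overdispersed moments of R.

```python
import numpy as np

mu, sigma, m = 0.5, 1.0, 10

# Gauss-Hermite nodes/weights integrate against exp(-z^2); the change of
# variables z = sqrt(2)*sigma*t + mu turns this into an N(mu, sigma^2) average
t, w = np.polynomial.hermite.hermgauss(40)
p = 1.0 / (1.0 + np.exp(-(mu + np.sqrt(2.0) * sigma * t)))
w = w / np.sqrt(np.pi)

EP = np.sum(w * p)
VarP = np.sum(w * p**2) - EP**2
print("E(P) =", EP, "  Var(P) =", VarP)

# Moments of R given m trials: E(R) = m E(P); Var(R) = m EP(1-EP) + m(m-1) Var(P)
ER = m * EP
VarR = m * EP * (1 - EP) + m * (m - 1) * VarP
print("E(R) =", ER, "  Var(R) =", VarR)
```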

About the Author
For biography see the entry Logistic Distribution.

Cross References
Logistic Distribution
Mixed Membership Models
Multivariate Normal Distributions
Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized, linear, and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, $\sigma^2$, is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

$$f(y_i; \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonicalform of the Bernoulli distribution e Bernoulli pdf isa member of the exponential family of probability distri-butions which has properties allowing for a much easier

estimation of its parameters than traditional NewtonndashRaphson-based maximum likelihood estimation (MLE)methodsIn Nelder and Wedderbrun discovered that it

was possible to construct a single algorithm for estimat-ing models based on the exponential family of distri-butions e algorithm was termed 7Generalized linearmodels (GLM) and became a standard method to esti-mate binary response models such as logistic probit andcomplimentary-loglog regression count response mod-els such as Poisson and negative binomial regression andcontinuous response models such as gamma and inverseGaussian regressione standard normalmodel or Gaus-sian regression is also a generalized linear model andmaybe estimated under its algorithme formof the exponen-tial distribution appropriate for generalized linear modelsmay be expressed as

f(y_i; θ_i, ϕ) = exp{(y_iθ_i − b(θ_i))/α(ϕ) + c(y_i; ϕ)}    (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}    (3)

The link function is therefore ln(π/(1 − π)) and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, is based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation


across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, µ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(µ_i; y_i) = Σ_{i=1}^n {y_i ln(µ_i/(1 − µ_i)) + ln(1 − µ_i)}    (4)

or

L(µ_i; y_i) = Σ_{i=1}^n {y_i ln(µ_i) + (1 − y_i) ln(1 − µ_i)}    (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/µ_i) + (1 − y_i) ln((1 − y_i)/(1 − µ_i))}    (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of µ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(µ_i/(1 − µ_i)). We have, therefore,

x′_iβ = ln(µ_i/(1 − µ_i)) = β_0 + β_1x_1 + β_2x_2 + ⋯ + β_nx_n    (7)

The value of µ_i for each observation in the logistic model is calculated as

µ_i = 1/(1 + exp(−x′_iβ)) = exp(x′_iβ)/(1 + exp(x′_iβ))    (8)

The functions to the right of µ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, µ is a probability.

When logistic regression is estimated using a Newton–

Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than µ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − µ_i)x_i    (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n x_ix′_i µ_i(1 − µ_i)    (10)
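As a concrete illustration of (8)–(10), the following is a minimal Python sketch of a Newton–Raphson fit for the binary logistic model; the function name and data layout are ours, not part of the entry.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson estimation of a binary logistic model.

    X : (n, p) design matrix, including a column of ones for the intercept.
    y : (n,) vector of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                       # linear predictor x'beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit link, Eq. (8)
        grad = X.T @ (y - mu)                # gradient (score), Eq. (9)
        W = mu * (1.0 - mu)
        hess = -(X * W[:, None]).T @ X       # Hessian, Eq. (10)
        step = np.linalg.solve(hess, -grad)  # beta_new = beta - H^{-1} grad
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Standard errors of the estimates would then come from the square roots of the diagonal elements of the inverse of the negative Hessian, as the entry notes.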

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of µ, ln(µ/(1 − µ)), where the odds are understood as the success of µ over its failure, 1 − µ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as

Age (child vs adult)

 Survived    child    adults    Total
 no
 yes
 Total

The odds of survival for adult passengers is the number of adult survivors divided by the number of adult deaths; the odds of survival for children is obtained likewise. The ratio of the odds of survival for adults to the odds of survival for children is the odds ratio, i.e., the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

 survived    Odds Ratio    Std. Err.    z    P>|z|    [ Conf. Interval]
 age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
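To make the arithmetic concrete, this short Python sketch computes an odds ratio from a 2 × 2 table; the counts are hypothetical stand-ins, not the Titanic figures used in the entry.

```python
import math

# Hypothetical counts: rows are survived no/yes, columns child/adult.
died_child, died_adult = 50, 750
surv_child, surv_adult = 55, 440

odds_adult = surv_adult / died_adult   # odds of survival for adults
odds_child = surv_child / died_child   # odds of survival for children
odds_ratio = odds_adult / odds_child   # adults relative to children

# Large-sample (Woolf) standard error of ln(OR) and a 95% interval:
se = math.sqrt(1/died_child + 1/died_adult + 1/surv_child + 1/surv_adult)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```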

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating µ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i−y_i}    (11)

with a corresponding log-likelihood function expressed as

L(µ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(µ_i/(1 − µ_i)) + m_i ln(1 − µ_i) + ln C(m_i, y_i)}    (12)

Taking derivatives of the cumulant, −m_i ln(1 − µ_i), as we did for the binary response model, produces a mean of µ_i = m_iπ_i and variance µ_i(1 − µ_i/m_i).

Consider the data below:

 y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the same pattern of predictor values for x1, x2, and x3; of those three cases, one has a value of y equal to 1, and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

 y    Odds ratio    OIM std. err.    z    P>|z|    [ conf. interval]
 x1
 x2
 x3
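A hedged sketch of fitting a grouped (binomial) logistic model of the form (11)–(12), using the GLM machinery in the Python package statsmodels; the counts and covariate patterns below are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: y successes out of `cases` trials for each
# distinct covariate pattern of the binary predictors x1, x2, x3.
y     = np.array([1, 2, 1, 3, 2, 4])
cases = np.array([3, 4, 2, 5, 6, 7])
x1 = np.array([1, 0, 1, 0, 1, 0])
x2 = np.array([0, 1, 1, 0, 0, 1])
x3 = np.array([1, 1, 0, 0, 1, 0])
X = np.column_stack([np.ones(6), x1, x2, x3])    # intercept + predictors

endog = np.column_stack([y, cases - y])          # (successes, failures)
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(np.exp(fit.params))                        # exponentiated: odds ratios
```

Expanding each row into `cases` individual 0/1 observations and refitting would, as the text states, reproduce the same parameter estimates.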

The data in the grouped table above may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, with the same logic as described; modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, thousands of observations being grouped into a table of only a few rows or covariate patterns, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some

of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution: the test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across fixed levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recom-

mended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently

been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for


modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − µ)/τ} / [1 + exp{−(x − µ)/τ}]²,   −∞ < x < ∞,    (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − µ)/τ}]^{−1},   −∞ < x < ∞.

The distribution is symmetric about the mean µ and has variance τ²π²/3, so that when comparing the standard logistic distribution (µ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − µ)/τ}]^{−1},   −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − µ)/τ}]^{−1}.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for µ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α],   t ≥ 0;  θ, α > 0,

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (µ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn by elementary calculus implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x,    (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + µ.
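A small Python sketch of this simulation recipe (the values of µ and τ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau = 2.0, 0.5                     # illustrative location and scale

u = rng.uniform(size=100_000)
x = tau * np.log(u / (1 - u)) + mu     # X = log_e[U/(1-U)], then tau*X + mu

print(x.mean(), x.var())               # compare with mu and tau^2 * pi^2 / 3
print(mu, tau**2 * np.pi**2 / 3)
```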

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model:

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β_0 + β_1d,

which, by the identity (2), implies a logistic tolerance distribution with parameters µ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().
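A sketch of such a maximum likelihood fit for r-out-of-n dose–response data, using a binomial GLM with logit link from the Python package statsmodels; the dose levels and counts are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical quantal bioassay data: r responders out of n subjects
# at each dose level d.
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = np.array([40, 40, 40, 40, 40])
r = np.array([4, 10, 21, 30, 37])

X = sm.add_constant(d)                        # columns: 1, d
fit = sm.GLM(np.column_stack([r, n - r]), X,
             family=sm.families.Binomial()).fit()
b0, b1 = fit.params
print("tolerance location mu =", -b0 / b1)    # mu = -beta0/beta1
print("tolerance scale tau  =", 1 / abs(b1))  # tau = 1/|beta1|
```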

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θ, x), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio


Lorenz Curve. Fig. 1  Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2  Two intersecting Lorenz curves; using the Gini index, L_1(p) has greater inequality than L_2(p)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
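A minimal Python sketch of the empirical Lorenz curve and the resulting Gini index for a sample of incomes (the helper function and the test values are ours):

```python
import numpy as np

def gini(incomes):
    """Gini index from the empirical Lorenz curve of a sample."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    p = np.arange(n + 1) / n                             # population shares
    L = np.concatenate([[0.0], np.cumsum(x)]) / x.sum()  # income shares L(p)
    area = ((L[1:] + L[:-1]) / 2 * np.diff(p)).sum()     # area under L(p)
    return 1.0 - 2.0 * area                              # G = 1 - 2 * area

print(gini([1, 1, 1, 1]))      # equal incomes -> 0
print(gini([0, 0, 0, 100.0]))  # extreme inequality -> 0.75 here; -> 1 as n grows
```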

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ).

Theorem. Let u(x) be a continuous monotone increasing function, and assume that µ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the µ_Y/µ_X level once from above.

If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations," and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is then zero; otherwise, the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information from data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative, so that Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0   if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1   if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ,  and  ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems, Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ). Then ℓ can be chosen as above.

In theoretical works, the assumed properties of loss

functions can be quite different. Classically, it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, ..., y_n, observed at design points t_1, ..., t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation,"

i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F,

where r_i^f = y_i − f(t_i) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².
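A short Python sketch collecting the losses described above, and showing how the choice of ℓ changes the "best" constant fit; the data and the tuning constants are illustrative.

```python
import numpy as np

def square(t):               # classical square loss, l(t) = t^2
    return t**2

def absolute(t):             # robust L1 loss, l(t) = |t|
    return np.abs(t)

def huber(t, gamma=1.0):     # Huber loss: quadratic near 0, linear in the tails
    return np.where(np.abs(t) <= gamma, t**2, 2*gamma*np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):  # Varian's asymmetric LinEx loss
    return b * (np.exp(a*t) - a*t - 1)

# Empirical risk of a constant fit f(t_i) = c to data y, for each loss:
y = np.array([0.9, 1.1, 1.0, 5.0])    # one outlier
grid = np.linspace(0, 5, 501)
for name, loss in [("square", square), ("L1", absolute),
                   ("Huber", huber), ("LinEx", linex)]:
    risks = [loss(y - c).sum() for c in grid]
    print(name, "minimizer ~", grid[int(np.argmin(risks))])
```

The square loss is minimized by the mean (pulled toward the outlier), the L1 loss by the median, with the Huber minimizer in between, illustrating the robustness ordering described above.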

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Lévy Processes. Fig.  Gamma process (sample path, x against t)

Lévy Processes. Fig.  Brownian motion and variance gamma sample paths (x against t)

Brownian process drift term, σ – the volatility of the Brownian process, and ν – the variance term of the gamma process.

The following are two simulated sample paths: one for a Brownian motion with given drift term µ and volatility term σ, and the other for a variance gamma process with the same values for the drift and volatility terms and a given variance term ν (see the figure above).
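A hedged Python sketch of such a simulation, generating a variance gamma path as Brownian motion with drift evaluated at a gamma subordinator, next to a plain Brownian path; the parameter values are arbitrary, since the entry's values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10.0, 1000
dt = T / n
mu, sigma, nu = 0.0, 0.2, 0.5    # illustrative drift, volatility, gamma variance

# Gamma subordinator increments: mean dt, variance nu * dt.
dG = rng.gamma(shape=dt / nu, scale=nu, size=n)
# Variance gamma = Brownian motion with drift run on the gamma time scale.
dX = mu * dG + sigma * rng.standard_normal(n) * np.sqrt(dG)
X = np.concatenate([[0.0], np.cumsum(dX)])

# Plain Brownian motion with the same drift and volatility, for comparison.
dB = mu * dt + sigma * rng.standard_normal(n) * np.sqrt(dt)
B = np.concatenate([[0.0], np.cumsum(dB)])
```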

About the Author
Professor Mohamed Abdel-Hameed was the chairman of the Department of Statistics and Operations Research, Kuwait University. He was on the editorial board of Applied Stochastic Models and Data Analysis. He was the Editor (jointly with E. Cinlar and J. Quinn) of the text Survival Models, Maintenance, Replacement Policies and Accelerated Life Testing (Academic Press). He is an elected member of the International Statistical Institute. He has published numerous papers in statistics and operations research.

Cross References
7Non-Uniform Random Variate Generations
7Poisson Processes
7Statistical Modeling of Financial Markets
7Stochastic Processes
7Stochastic Processes: Classification

References and Further Reading
Abdel-Hameed MS () Life distribution of devices subject to a Lévy wear process. Math Oper Res
Abdel-Hameed M () Optimal control of a dam using PMλτ policies and penalty cost when the input process is a compound Poisson process with a positive drift. J Appl Prob
Black F, Scholes MS () The pricing of options and corporate liabilities. J Pol Econ
Cont R, Tankov P () Financial modeling with jump processes. Chapman & Hall/CRC Press, Boca Raton
Madan DB, Carr PP, Chang EC () The variance gamma process and option pricing. Eur Finance Rev
Patel A, Kosko B () Stochastic resonance in continuous and spiking neuron models with Lévy noise. IEEE Trans Neural Netw
van Noortwijk JM () A survey of the applications of gamma processes in maintenance. Reliab Eng Syst Safety

Life Expectancy

Maja Biljan-August
Full Professor
University of Rijeka, Rijeka, Croatia

Life expectancy is defined as the average number of years a person is expected to live from age x, as determined by statistics.

Statistics on life expectancy are derived from a mathematical model known as the 7life table. In order to calculate this indicator, the mortality rate at each age is assumed to be constant. Life expectancy (e_x) can be evaluated at any age and, in a hypothetical stationary population, can be written in discrete form as

e_x = T_x / l_x,

where x is age, T_x is the number of person-years lived at age x and over, and l_x is the number of survivors at age x, according to the life table.
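A toy Python computation of e_x = T_x/l_x from a survivorship column, approximating person-years by mid-interval averaging; the l_x values are invented, not real mortality data.

```python
import numpy as np

# Toy survivorship column l(x) for ages x = 0..5; the final 0 closes
# the table (everyone has died by the last age).
lx = np.array([100_000, 98_000, 97_500, 97_200, 96_800, 0], dtype=float)

# Person-years lived in each age interval, taking deaths at mid-interval:
Lx = (lx[:-1] + lx[1:]) / 2
Tx = np.cumsum(Lx[::-1])[::-1]   # T(x): person-years lived at ages >= x
ex = Tx / lx[:-1]                # e(x) = T(x) / l(x)
print(np.round(ex, 2))
```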


Life expectancy can be calculated for combined sexes, or separately for males and females; there can be significant differences between the sexes.

Life expectancy at birth (e_0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e_0 = T_0 / l_0,

where T_0 is the total size of the population and l_0 is the number of births (the original number of persons in the birth cohort).

Life expectancy declines with age. Life expectancy at birth is highly influenced by the infant mortality rate. The paradox of the life table refers to a situation where life expectancy at birth increases for several years after birth (e_0 < e_1 < e_2, and even beyond). The paradox reflects the higher rates of infant and child mortality in populations in pre-transition and middle stages of the demographic transition.

Life expectancy at birth is a summary measure of mor-

tality in a population. It is a frequently used indicator of health standards and socio-economic living standards. Life expectancy is also one of the most commonly used indicators of social development. This indicator is easily comparable through time and between areas, including countries. Inequalities in life expectancy usually indicate inequalities in health and socio-economic development.

Life expectancy rose throughout human history. In

ancient Greece and Rome the average life expectancy was low; over the intervening centuries, life expectancy at birth rose substantially, both as a global average and, above all, in the richest countries (Riley ). Furthermore, in most industrialized countries in the early twenty-first century, life expectancy averaged about  years (WHO). These changes, called the "health transition," are essentially the result of improvements in public health, medicine, and nutrition.

continents from life expectancies of about years insome central African populations to life expectancies of years and above in many European countries emore developed regions have an average life expectancy of years while the population of less developed regionsis at birth expected to live an average years less etwo continents that display the most extreme dierencesin life expectancies are North America ( years) andAfrica ( years) where as of recently the gap betweenlife expectancies amounts to years (UN ndash data)

Countries with the highest life expectancies in the world are Australia, Iceland, Italy, Switzerland, and Japan; Japanese men and women both live longer, on average, than their counterparts elsewhere (WHO ). In countries with a high rate of HIV infection, prin-

cipally in Sub-Saharan Africa, the average life expectancy is markedly lower. Some of the world's lowest life expectancies are in Sierra Leone, Afghanistan, Lesotho, and Zimbabwe.

In nearly all countries, women live longer than men.

The world's average life expectancy at birth differs between males and females; the gap is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes (several years less for men), whereas in Tonga life expectancy for males exceeds that for females (WHO ).

Life expectancy is assumed to rise continuously.

According to estimates by the UN, global life expectancy at birth is likely to keep rising, while continuing to vary across countries. Long-range United Nations population projections predict that, on average, people will eventually expect to live substantially longer, with the lowest projected national values (Liberia) still far below the highest (Japan).

For more details on the calculation of life expectancy,

including continuous notation, see Keyfitz (, ) and Preston et al. ().

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. Geneva


World population prospects: the  revision, highlights. United Nations, New York
World population to  (). United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality, or death intensity, at age x as the function

µ(x) = −(d/dx) ln{1 − F(x)} = f(x) / {1 − F(x)}.

Heuristically, it is interpreted by the relation µ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely, F(x) = 1 − exp{−∫_0^x µ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of 100,000 newborn babies. The survival probability is

_t p_x = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫_0^t µ(x + s)ds},

and the non-survival probability is (the converse) _t q_x = 1 − _t p_x. For t = 1 one writes q_x = _1 q_x and p_x = _1 p_x. In particular, we get ℓ(x + 1) = ℓ(x)p_x. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of p_x for all relevant x.

The life expectancy is e°_0 = EX = ∫_0^∞ ℓ(x)dx / ℓ(0). (The subscript 0 in e°_0 indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°_x = E(X − x ∣ X > x) = ∫_0^∞ ℓ(x + t)dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various

quantities, suppose that the function µ(x) is piecewise constant, which means that we take it to equal some constant, say µ_j, over each interval (x_j, x_{j+1}) for some partition {x_j} of the age axis. For a collection {X_i} of independent observations of X, let D_j be the number of X_i that fall in the interval (x_j, x_{j+1}). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, D_j is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for X_i − x_j time units during the interval. Individuals who survive the interval will have lived for x_{j+1} − x_j time units in the interval, and individuals who do not survive to age x_j will not have lived during this interval at all. When we aggregate the time units lived in (x_j, x_{j+1}) over all individuals, we get a total R_j, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters µ_j, the collection {D_j, R_j} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = Σ_j{−µ_jR_j + D_j ln µ_j}, which is easily seen to be maximized by µ̂_j = D_j/R_j. The latter fraction is therefore the maximum-likelihood estimator for µ_j. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function µ_x = a + bc^x, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉ^x for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.
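The estimation schema above is easy to sketch in Python; the occurrence and exposure numbers below are hypothetical, and the survival recursion used at the end is the one described in the next paragraph.

```python
import numpy as np

# Hypothetical occurrences D_j (deaths) and exposures R_j (person-years)
# for single-year age intervals [x, x+1), x = 0..4.
D = np.array([120, 15, 10, 9, 11])
R = np.array([99000.0, 98900.0, 98800.0, 98700.0, 98600.0])

mu_hat = D / R                 # maximum-likelihood death rates, mu_j = D_j/R_j
p_hat = np.exp(-mu_hat)        # single-year survival probabilities
l = np.empty(6)
l[0] = 100_000                 # conventional radix
for x in range(5):             # recursion l(x+1) = l(x) * p_hat(x)
    l[x + 1] = l[x] * p_hat[x]
print(l.round(1))
```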


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂_x in this interval, then p̂_x = e^{−µ̂_x} is an estimator of the single-year survival probability p_x. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂_x for x = 0, 1, 2, ..., and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a p_x schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the p_x values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), and Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses Universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika ()
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of R^k, and the variable y is assumed to take values in a subset of R^n; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ),    (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the

likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = C(n, y) θ^y (1 − θ)^{n−y}, y = 0, 1, ..., n,

θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y (1 − θ)^{n−y},


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which records the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(y_i − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y_1, ..., y_n, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n^{−1}Σy_i and sample variance s² = (n − 1)^{−1}Σ(y_i − ȳ)², or equivalently through Σy_i and Σy_i². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
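For the binomial model above, a short Python sketch evaluates the log-likelihood over a grid and confirms that the maximizer is θ̂ = y/n; the data values are arbitrary.

```python
import numpy as np

n, y = 10, 7                       # hypothetical data: 7 successes in 10 trials
theta = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)

theta_hat = y / n                  # the maximizer y/n
# Relative likelihood, standardized by the maximum value, for plotting:
rel = np.exp(loglik - loglik.max())
print(theta[np.argmax(rel)], theta_hat)   # both approximately 0.7
```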

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference

for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ y)π(θ)intθL(θ y)π(θ)dθ

where π(θ) is the density for the prior measure and π(θ ∣y) provides a probabilistic assessment of θ aer observingY = y in the model f (y θ) We could then make con-clusions of the form ldquohaving observed successes in trials and assuming π(θ) = the posterior probabilitythat θ gt is rdquo and so on
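A minimal numerical sketch of this Bayesian computation, for the binomial model above with a flat prior (the data values are hypothetical):

```python
import numpy as np

y, n = 7, 10                              # hypothetical data: y successes in n trials
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
L = theta**y * (1 - theta)**(n - y)       # likelihood
post = L * 1.0                            # flat prior pi(theta) = 1 on [0, 1]
post /= post.sum() * dtheta               # normalize by the integral of L * pi

# Posterior probability of an event such as {theta > 0.5}:
print(np.sum(post[theta > 0.5]) * dtheta)
```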

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see Central Limit Theorems) for the score function

U(θ; Y) = ∂ log L(θ; Y)/∂θ.

If Y = (Y_1, …, Y_n) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz (2006) for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ)(θ̂ − θ) →d χ²_k,   (2)

U(θ)^T i^{−1}(θ)U(θ) →d χ²_k,   (3)

2{ℓ(θ̂) − ℓ(θ)} →d χ²_k,   (4)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k denotes the chi-square distribution with k degrees of freedom.
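The three statistics (2)-(4) are easy to compute in closed form for the binomial model above; the following sketch (with a hypothetical data set and null value) shows the Wald, score, and likelihood ratio statistics side by side:

```python
import numpy as np

def loglik(theta, y, n):
    return y * np.log(theta) + (n - y) * np.log(1 - theta)

def score(theta, y, n):           # U(theta) = dl/dtheta
    return y / theta - (n - y) / (1 - theta)

def fisher_info(theta, n):        # i(theta) = E{-l''(theta)}
    return n / (theta * (1 - theta))

y, n, theta0 = 7, 10, 0.5         # hypothetical data and null value
theta_hat = y / n

wald = (theta_hat - theta0)**2 * fisher_info(theta0, n)         # statistic (2)
scr  = score(theta0, y, n)**2 / fisher_info(theta0, n)          # statistic (3)
lrt  = 2 * (loglik(theta_hat, y, n) - loglik(theta0, y, n))     # statistic (4)
print(wald, scr, lrt)   # each approximately chi-squared with 1 df under theta0
```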

These results are all versions of a more general result: the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²_k distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions,

that results similar to those at (2)-(4) hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →d χ²_{k₁}.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
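A minimal sketch of this point, on simulated data: the profile log-likelihood for σ² in the normal linear model is maximized at the divisor-n estimator rather than at the usual divisor-(n − p) one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# For fixed sigma^2 the constrained MLE of beta is the least squares fit, so the
# profile log-likelihood is l_P(sigma^2) = -(n/2) log sigma^2 - RSS/(2 sigma^2).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

def profile_loglik(sigma2):
    return -n / 2 * np.log(sigma2) - rss / (2 * sigma2)

sigma2_profile = rss / n        # maximizer of l_P: divisor n
sigma2_usual = rss / (n - p)    # usual unbiased estimator: divisor n - p
print(sigma2_profile, sigma2_usual)
print(profile_loglik(sigma2_profile) > profile_loglik(sigma2_usual))  # True
```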

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

π_M(ψ | y) ∝ ∫ π(ψ, λ | y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ | y₁; λ),  or  f(y; θ) = f₁(y₁ | y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (1995). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)-(4); see, for example, Severini (2000) and Brazzale et al. (2007). The direct likelihood approach of Royall (1997) and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou (2003) present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. 1992); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox 1994; Fraser). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower-dimensional conditional or marginal distributions (Varin 2008). Reid (2000) concluded a review article on likelihood by stating:

"From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century."


Further Reading
The book by Cox and Hinkley (1974) gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (2006). There are several book-length treatments of likelihood inference, including Edwards (1972), Azzalini (1996), Pawitan (2001), and Severini (2000); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox (1994) and Brazzale, Davison, and Reid (2007). A short review paper is Reid (2000). An excellent overview of consistency results for maximum likelihood estimators is Scholz (2006); see also Lehmann and Casella (1998). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert (1988).

About the Author
Professor Reid is a Past President of the Statistical Society of Canada and has served as President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation, "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
Bayesian Analysis or Evidence Based Statistics
Bayesian Statistics
Bayesian Versus Frequentist Statistical Reasoning
Bayesian vs. Classical Point Estimation: A Comparative Overview
Chi-Square Test: Analysis of Contingency Tables
Empirical Likelihood Approach to Inference from Sample Survey Data
Estimation
Fiducial Inference
General Linear Models
Generalized Linear Models
Generalized Quasi-Likelihood (GQL) Inferences
Mixture Models
Philosophical Foundations of Statistics
Risk Analysis
Statistical Evidence
Statistical Inference
Statistical Inference: An Overview
Statistics: An Overview
Statistics: Nelder's View
Testing Variance Components in Mixed Linear Models
Uniform Distribution in Statistics

References and Further Reading
Azzalini A (1996) Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR (1994) Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R (1988) The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A (1962) On the foundations of statistical inference. J Am Stat Assoc 57
Brazzale AR, Davison AC, Reid N (2007) Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34
Cox DR (2006) Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B (1992) Bootstrap likelihoods. Biometrika 79
Edwards AF (1972) Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A 222
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat 11
Murphy SA, Van der Vaart A (1997) Semiparametric likelihood ratio inference. Ann Stat 25
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75
Pawitan Y (2001) In all likelihood. Oxford University Press, Oxford
Reid N (1995) The roles of conditioning in inference. Stat Sci 10
Reid N (2000) Likelihood. J Am Stat Assoc 95
Royall RM (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS (2003) Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B 65
Scholz F (2006) Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA (2000) Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. Lecture notes, available at http://www.stieltjes.org
Varin C (2008) On composite marginal likelihood. Adv Stat Anal 92


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X_1, X_2, … is a sequence of iid random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX_1 exists, then n^{−1}S_n → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that, if σ² = E(X_1 − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
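A minimal simulation sketch of the LLN and the CLT; the Exponential(1) choice of summand distribution, for which a = 1 and σ² = 1, is an assumption made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 20000
X = rng.exponential(scale=1.0, size=(reps, n))   # iid Exponential(1): a = 1, sigma^2 = 1
S = X.sum(axis=1)

print(np.mean(S / n))                 # LLN: the averages are close to a = 1
zeta = (S - n * 1.0) / np.sqrt(n)     # standardized sums (sigma = 1)
print(np.mean(zeta < 1.0))            # CLT: close to Phi(1) ~ 0.8413
```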

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1600s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre-Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev and A.M. Lyapunov.

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX_1² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X_1 > x) + P(X_1 < −x) is a function regularly varying at infinity and such that the limit lim_{x→∞} P(X_1 > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}), P^{−1} being the generalized inverse of the function P (and we assume that EX_1 = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on the X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of iid summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X_1 > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X_1, …, X_n of a more complex nature, first of all multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X_1 − EX_1)(X_1 − EX_1)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X_1, …, X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko-Cantelli theorem (see Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of large enough size n.

The existence of consistent estimators for the unknown parameters a = EX_1 and σ² = E(X_1 − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) Σ_{j=1}^n X_j → a almost surely,

(σ²)* = (1/n) Σ_{j=1}^n (X_j − a*)² = (1/n) Σ_{j=1}^n X_j² − (a*)² → σ² almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
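A minimal sketch of the consistent estimators a* and (σ²)* and the resulting asymptotic confidence interval; the Gamma choice of population is an assumption made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.0, size=5000)   # F with a = 2 and sigma^2 = 2

n = len(x)
a_star = x.mean()                          # a* -> a almost surely
sigma2_star = np.mean(x**2) - a_star**2    # (sigma^2)* -> sigma^2 almost surely

# sqrt(n)(a* - a) is asymptotically N(0, sigma^2), giving the usual interval:
half = 1.96 * np.sqrt(sigma2_star / n)
print(a_star - half, a_star + half)        # ~95% confidence interval for a
```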

The theorem on the asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov 1998). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables

is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question.

At least two possible approaches to this problem should be noted here:

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX_1 = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that, if σ² = Var(X_1) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman-Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small (a small simulation sketch follows this list).

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X_1−θ)} = 1 has a solution μ₀ > 0, then in the above example one has

P(S > x) ∼ c e^{−μ₀x}, x → ∞,

where c is a known constant. If the distribution F of X_1 is subexponential (in this case E e^{μ(X_1−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x.

For both approaches, the obtained approximations can be refined.
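The following minimal simulation sketch illustrates the Kingman-Prokhorov limit of the first approach; the N(0, 1) increments, the value of θ, and the truncation of the supremum to a finite horizon are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, horizon, reps = 0.05, 20000, 2000
sup_vals = np.empty(reps)
for r in range(reps):
    increments = rng.normal(size=horizon) - theta             # X_k - theta, sigma^2 = 1
    drifted = np.concatenate(([0.0], np.cumsum(increments)))  # S_k - theta*k, k = 0..horizon
    sup_vals[r] = drifted.max()                               # S = sup_k (S_k - theta*k)

x = 0.5
print(np.mean(theta * sup_vals > x), np.exp(-2 * x))  # both close to e^{-1} ~ 0.37
```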

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker-Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (random walk) S_1, …, S_n can also be classified as LTs for stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for the trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
Almost Sure Convergence of Random Variables
Approximations to Distributions
Asymptotic Normality
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Central Limit Theorems
Empirical Processes
Ergodic Theorem
Glivenko-Cantelli Theorems
Large Deviations and Applications
Laws of Large Numbers
Martingale Central Limit Theorem
Probability Theory: An Outline
Random Matrix Theory
Strong Approximations in Probability and Statistics
Weak Convergence of Probability Measures

References and Further Reading
Billingsley P (1968) Convergence of probability measures. Wiley, New York
Borovkov AA (1998) Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (1977-1978) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i | b_i ∼ N(X_i β + Z_i b_i, Σ_i),  (1)
b_i ∼ N(0, D),  (2)

where X_i and Z_i are (n_i × p) and (n_i × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples, under the assumption of conditional independence, i.e., assuming Σ_i = σ² I_{n_i}. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i), V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{−1}(α)(y_i − X_i β). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i^{−1} X_i β̂ + (I_{n_i} − Σ_i V_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i^{−1} and I_{n_i} − Σ_i V_i^{−1}, respectively. Note that the "numerator" Σ_i of Σ_i V_i^{−1} represents within-unit variability, and the "denominator" V_i is the overall covariance matrix. Hence much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′ b̂_i) ≤ var(λ′ b_i) = λ′ D λ.
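A minimal numerical sketch of these formulas for the random-intercept case (Z_i a column of ones, Σ_i = σ²I); all parameter values and data below are hypothetical.

```python
import numpy as np

n_i = 5                      # number of measurements on unit i
sigma2, d = 1.0, 2.0         # within-unit variance and random-intercept variance D
Z = np.ones((n_i, 1))
Sigma = sigma2 * np.eye(n_i)
V = Z @ (d * np.ones((1, 1))) @ Z.T + Sigma            # V_i = Z D Z' + Sigma

Xbeta = np.full(n_i, 10.0)                             # hypothetical marginal mean X_i beta
y = np.array([12.0, 13.0, 11.5, 12.5, 13.5])           # hypothetical observed profile

b_hat = d * Z.T @ np.linalg.solve(V, y - Xbeta)        # EB estimate D Z' V^{-1}(y - X beta)
W = Sigma @ np.linalg.inv(V)                           # weight on the average profile
y_pred = W @ Xbeta + (np.eye(n_i) - W) @ y             # shrinkage toward X beta
print(b_hat, y_pred)

# Shrinkage of variability: var(b_hat) = D Z' V^{-1} Z D <= D
print((d * Z.T @ np.linalg.solve(V, Z) * d).item(), d)
```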

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
Best Linear Unbiased Estimation in Linear Models
General Linear Models
Testing Variance Components in Mixed Linear Models
Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction." (F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y | x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith (1998) is a plainly written elementary introduction to linear regression models; Rao (1973) is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ν₁, ν₂, embedded in white noise background: y[t] = β₀ + β₁ sin[ν₁t] + β₂ sin[ν₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (1, x₁, x₂), x₁ ≡ x₁(t) = sin[ν₁t], x₂ ≡ x₂(t) = sin[ν₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y | x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 for i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ν₁t_i], sin[ν₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector (y_i, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where least squares has particular advantages (connections with multivariate normal distributions are discussed below). In that setup least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following its discovery only weeks before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights placed at whole standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope (about 1/3) of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1 then, for a value w > Ew, the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886). Echoes of those long-ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2003), Freedman (1991), and Freedman (2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations (1), in which random errors ε are assumed to satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n:

y_i = x_{i1}β_1 + ⋯ + x_{ip}β_p + ε_i, i ≤ n.  (1)

The interpretation is that response y_i is observed for the ith sample in conjunction with numerical values (x_{i1}, …, x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (x^tr x)^{−1} x^tr y = β + Mε,  (2)

for M = (x^tr x)^{−1} x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)^{−1} the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss-Markov properties (3), (4):

E(β̂_LS)_k = β_k, k ≤ p,  (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p.  (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or more generally as limit distributions when data are suitably enriched.
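A minimal sketch, on simulated data, of the least squares estimate (2) and of the purely algebraic identity β̂_LS = β + Mε discussed next:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta = np.array([1.0, 0.5, -2.0])
eps = rng.normal(scale=0.3, size=n)                             # iid errors
y = x @ beta + eps

M = np.linalg.inv(x.T @ x) @ x.T          # M = (x'x)^{-1} x'
beta_LS = M @ y
print(beta_LS)                            # close to beta
print(np.allclose(beta_LS, beta + M @ eps))   # the algebraic identity (2) holds exactly
```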

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M has entries summing to zero (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_kε and M_k(ε + c) = M_kε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s²(·) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the order statistics of ε (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on a permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε.  (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis, and their consequent distribution theory.
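A minimal sketch of the generalized least squares solution (5); the AR(1)-type correlation structure chosen for Σ is an assumption made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
t = np.arange(n)
Sigma = 0.8 ** np.abs(np.subtract.outer(t, t))   # assumed correlation structure
x = np.column_stack([np.ones(n), t / n])
beta = np.array([2.0, 1.0])
L = np.linalg.cholesky(Sigma)
y = x @ beta + L @ rng.normal(size=n)            # errors distributed N(0, Sigma)

Si = np.linalg.inv(Sigma)
beta_GLS = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # equation (5)
print(beta_GLS)   # unbiased for beta; any positive multiple of Sigma gives the same
```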

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss-Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y | x) = Ey + E((y − Ey) | x) = Ey + xβ for every x,

and the discrepancies y − E(y | x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (1981) compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to values spaced at whole standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's

seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here. These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + c EJ = E X1(X > c),

yielding E(Y | X > c) < E(X | X > c). ∎

Cross References
Adaptive Linear Regression
Analysis of Areal and Spatial Interaction Data
Business Forecasting Methods
Gauss-Markov Theorem
General Linear Models
Heteroscedasticity
Least Squares
Optimum Experimental Design
Partial Least Squares Regression Versus Other Methods
Regression Diagnostics
Regression Models with Increasing Numbers of Unknown Parameters
Ridge and Surrogate Ridge Regressions
Simple Linear Regression
Two-Stage Least Squares

References and Further Reading
Berk RA (2003) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32
Freedman D (1981) Bootstrapping regression models. Ann Stat 9
Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1
Galton F (1877) Typical laws of heredity. Nature 15
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland 15
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat 45
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. In: Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x_1, …, x_n) is a sample from a stochastic process x = {x_1, x_2, …}. Let p_n(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n); θ_n)/p_n(x(n); θ)], where θ_n = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density p_n(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} h^t J_n(θ) h + o_p(1),  (1)

where S_n(θ) = d ln p_n(x(n); θ)/dθ, J_n(θ) = −d² ln p_n(x(n); θ)/dθ dθ^t, and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)), (ii) n^{−1} J_n(θ) →p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang (2000) for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well

known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F^{−1}(θ)).  (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),  (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose that, conditionally on V = v, (x_1, x_2, …, x_n) are iid N(θ, v^{−1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

p_n(x(n); θ) ∝ [1 + (1/2) Σ(x_i − θ)²]^{−(n/2 + 1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θ x_{t−1} + e_t, t = 1, 2, …, with x_0 = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

p_n(x(n), θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θ x_{t−1})²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is θ̂_n = (Σ_{t=1}^n x_t x_{t−1})(Σ_{t=1}^n x²_{t−1})^{−1}. One can verify that √n(θ̂_n − θ) →_d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1} θ^n(θ̂_n − θ) →_d Cauchy for ∣θ∣ > 1.
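To see the stationary case concretely, here is a minimal Python simulation sketch (illustrative code, not from the original entry) checking that n·Var(θ̂_n) approaches 1 − θ² when ∣θ∣ < 1:

import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n):
    # simulate x_t = theta*x_{t-1} + e_t with x_0 = 0 and return the
    # ML estimator (sum x_t x_{t-1}) / (sum x_{t-1}^2)
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

theta, n = 0.5, 2000
est = np.array([ar1_mle(theta, n) for _ in range(2000)])
print(n * est.var(), 1 - theta**2)   # both close to 0.75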

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department (–), Executive Editor of the Journal of Statistical Planning and Inference (–), on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b),  a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

– Arc-sine distribution
– Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
– Cauchy and half-Cauchy distributions
– Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
– Ordinary and raised cosine distributions
– Exponential and reflected exponential distributions
– Extreme value distribution of the maximum and the minimum, each of type I
– Hyperbolic secant distribution
– Laplace distribution
– Logistic and half-logistic distributions
– Normal and half-normal distributions
– Parabolic distributions
– Rectangular or uniform distribution
– Semi-elliptical distribution
– Symmetric triangular distribution
– Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
– V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample data point on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), and Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd () – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′,  1 = (1, 1, …, 1)′,  α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,  ε = (ε_1, ε_2, …, ε_n)′,  θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B is the variance–covariance matrix of the reduced order statistics, with elements Cov(Y_{i:n}, Y_{j:n}). The GLS estimator of θ is

θ̂ = (A′ B^{−1} A)^{−1} A′ B^{−1} x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ B^{−1} A)^{−1}.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
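As an illustration of this GLS estimator, the following minimal Python sketch treats the standard exponential case, one of the families with closed-form α and B (the moment formulas are the classical ones from the Rényi representation of exponential order statistics; the code is illustrative, not Rinne's):

import numpy as np

def lloyd_gls_exponential(x):
    # BLUE of (a, b) for a two-parameter exponential sample, using
    # alpha_i = sum_{j<=i} 1/(n-j+1) and
    # Cov(Y_{i:n}, Y_{j:n}) = sum_{k<=min(i,j)} 1/(n-k+1)^2
    x = np.sort(x)
    n = len(x)
    c = 1.0 / (n - np.arange(n))          # 1/n, 1/(n-1), ..., 1/1
    alpha = np.cumsum(c)                  # E(Y_{i:n})
    v = np.cumsum(c ** 2)                 # Var(Y_{i:n}), increasing in i
    B = np.minimum.outer(v, v)            # Cov(Y_{i:n}, Y_{j:n}) = v_{min(i,j)}
    A = np.column_stack([np.ones(n), alpha])
    Binv = np.linalg.inv(B)
    return np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x)   # (a_hat, b_hat)

rng = np.random.default_rng(1)
sample = 2.0 + 3.0 * rng.exponential(size=50)   # true a = 2, b = 3
print(lloyd_gls_exponential(sample))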

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and later joined Volkswagen AG to do operations research. He was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. httpgebuni-giessengebvolltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[R_i] = m_i π_i and
Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability. The random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
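A minimal simulation sketch (illustrative Python, not from the original entry) of the logit-normal binomial model, comparing the empirical variance with the Williams approximation above:

import numpy as np

rng = np.random.default_rng(0)

m, mu, sigma = 20, 0.0, 0.3
z = rng.normal(mu, sigma, size=200_000)     # logit(P) ~ N(mu, sigma^2)
p = 1.0 / (1.0 + np.exp(-z))                # logistic-normal probabilities
r = rng.binomial(m, p)                      # grouped binomial responses

pi = 1.0 / (1.0 + np.exp(-mu))              # here logit(pi) = eta = mu
williams = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), williams, m * pi * (1 - pi))   # last value: binomial variance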

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1−y_i}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{[y_i θ_i − b(θ_i)]/α(ϕ) + c(y_i; ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}.  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)},  (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i))}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.  (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)).  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i,  (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i).  (10)
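As a concrete illustration of (8)–(10), here is a minimal Python sketch of the Newton–Raphson iteration (an illustrative implementation, not the algorithm of any particular software package):

import numpy as np

def logistic_newton_raphson(X, y, tol=1e-10, max_iter=50):
    # X should include a column of ones for the intercept
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))     # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                    # gradient, Eq. (9)
        hess = -(X.T * (mu * (1.0 - mu))) @ X    # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                       # Newton update
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))  # standard errors
    return beta, se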

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as a table of survival status (no/yes) by age group (child vs. adult), with marginal totals.

The odds of survival for adult passengers is the number of adult survivors divided by the number of adult deaths; the odds of survival for children is defined likewise. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces an estimated odds ratio (with its standard error, z statistic, p-value, and confidence interval) of about one half.

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
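The odds-ratio arithmetic can be made concrete with a small sketch; the counts below are hypothetical, not the Titanic figures:

# hypothetical 2x2 survival counts (illustrative only)
adults_survived, adults_died = 400, 800
children_survived, children_died = 60, 60

odds_adult = adults_survived / adults_died          # 0.5
odds_child = children_survived / children_died      # 1.0
odds_ratio = odds_adult / odds_child                # 0.5
print(odds_ratio, 1 / odds_ratio)                   # inversion reverses the comparison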

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator (y), indicating the number of successes for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},  (11)

where C(m_i, y_i) denotes the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln C(m_i, y_i)}.  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y | cases | x1 | x2 | x3
1 | 3     | 1  | 0  | 1
⋮ | ⋮     | ⋮  | ⋮  | ⋮

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of x1 = 1, x2 = 0, and x3 = 1. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

Fitting the model yields estimated odds ratios (with OIM standard errors, z statistics, p-values, and confidence intervals) for x1, x2, and x3.

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into only a handful of rows, or observations, as described above. Data in tables is nearly always expressed in grouped format.
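A small sketch of this restructuring; the first grouped row is the one described in the text, while the remaining rows are hypothetical stand-ins for the lost table entries:

import numpy as np

grouped = [(1, 3, 1, 0, 1),     # first row, from the text
           (2, 4, 0, 1, 1),     # hypothetical
           (1, 3, 1, 1, 0)]     # hypothetical

rows = []
for y, m, *x in grouped:
    rows += [[1, *x]] * y + [[0, *x]] * (m - y)   # y successes, m - y failures
data = np.array(rows)
print(data.shape)   # (10, 4): ten individual observations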

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − μ)/τ}],  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − μ)/τ}],  −∞ < x < ∞,

h(x) = (1/τ) / [1 + exp{−(x − μ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
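A minimal sketch of this simulation recipe in Python:

import numpy as np

rng = np.random.default_rng(42)

def rlogistic(n, mu=0.0, tau=1.0):
    u = rng.uniform(size=n)
    return mu + tau * np.log(u / (1 - u))   # X = mu + tau*log(U/(1-U))

x = rlogistic(1_000_000, mu=2.0, tau=0.5)
print(x.mean(), x.var(), 0.5**2 * np.pi**2 / 3)   # variance is tau^2 pi^2 / 3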

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β_0 + β_1 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().
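For illustration, converting hypothetical fitted logit coefficients into the tolerance-distribution parameters given above:

# hypothetical fitted dose-response coefficients: logit(P(d)) = b0 + b1*d
b0, b1 = -3.2, 1.6

mu = -b0 / b1          # tolerance location: dose with response probability 0.5
tau = 1 / abs(b1)      # tolerance scale
print(mu, tau)         # 2.0 and 0.625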

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common tool of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at point (1, 1), with the following additional properties: (I) L(p) is monotone increasing, (II) L(p) ≤ p, (III) L(p) is convex, (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θ x), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini ().

Lorenz Curve. Fig. 1  Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2  Two intersecting Lorenz curves. Using the Gini index, L_1(p) has greater inequality (G = ) than L_2(p) (G = )

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
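A minimal Python sketch (illustrative only) computing the empirical Lorenz curve and the Gini index from a sample of incomes, with the area under L(p) obtained by the trapezoidal rule:

import numpy as np

def lorenz_gini(income):
    x = np.sort(income)
    n = len(x)
    p = np.r_[0.0, np.arange(1, n + 1) / n]    # population shares
    L = np.r_[0.0, np.cumsum(x) / x.sum()]     # Lorenz ordinates L(p)
    area = np.sum((L[1:] + L[:-1]) / 2 * np.diff(p))   # area under L(p)
    gini = 1 - 2 * area                        # ratio definition of G
    return p, L, gini

rng = np.random.default_rng(7)
income = rng.lognormal(sigma=0.8, size=10_000)
print(lorenz_gini(income)[2])   # roughly 0.43 for a lognormal with sigma = 0.8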

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman; Jakobsson; Kakwani):

Theorem 1. Let u(x) be a continuous monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance. A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
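A quick numerical illustration of case (I) of Theorem 1, using the concave transformation u(x) = x^0.8, for which u(x)/x = x^(−0.2) is monotone decreasing:

import numpy as np

rng = np.random.default_rng(3)

def lorenz(income):
    x = np.sort(income)
    return np.cumsum(x) / x.sum()     # Lorenz ordinates at p = 1/n, ..., 1

x = rng.lognormal(sigma=1.0, size=50_000)
y = x ** 0.8                          # post-transfer incomes

Lx, Ly = lorenz(x), lorenz(y)
print(np.all(Ly >= Lx - 1e-12))       # True: Y Lorenz-dominates X, case (I)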

redistributions is the monograph by Lambert ()

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics since then. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise, the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L: Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L: Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration. Next we describe examples of loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative: Θ = Θ_0 + Θ_1. The usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1 if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space, and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ: R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses:

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses:

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here, underestimating is judged exponentially and overestimating linearly.
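These losses are easy to state in code; a minimal Python sketch:

import numpy as np

def square_loss(t):
    return t ** 2                     # p = 2

def l1_loss(t):
    return np.abs(t)                  # p = 1

def huber_loss(t, gamma=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    # asymmetric: exponential on one side of zero, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-3, 3, 7)
print(huber_loss(t))
print(linex_loss(t))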

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimation of the unknown regression function by the 'best approximation,' i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual at the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Life expectancy can be calculated for combined sexes or separately for males and females. There can be significant differences between sexes.

Life expectancy at birth (e_0) is the average number of years a newborn child can expect to live if current mortality trends remain constant:

e_0 = T_0 / l_0,

where T_0 is the total size of the population and l_0 is the number of births (the original number of persons in the birth cohort).
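A minimal sketch of this computation from a life table, assuming the standard identity T_0 = Σ L_x (the person-years column below is made up for illustration):

import numpy as np

# hypothetical person-years column L_x of an abridged life table
L = np.array([99_000, 395_000, 490_000, 480_000, 450_000, 300_000, 90_000])
l0 = 100_000                 # radix: original size of the birth cohort

T0 = L.sum()                 # total person-years lived by the cohort
e0 = T0 / l0                 # life expectancy at birth
print(e0)                    # about 23 years for these made-up numbers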

birth is highly inuenced by infant mortality rate eparadox of the life table refers to a situation where lifeexpectancy at birth increases for several years aer birth(e lt e lt e and even beyond)e paradox reects thehigher rates of infant and child mortality in populationsin pre-transition and middle stages of the demographictransitionLife expectancy at birth is a summary measure of mor-

tality in a population It is a frequently used indicatorof health standards and socio-economic living standardsLife expectancy is also one of the most commonly usedindicators of social development is indicator is easilycomparable through time and between areas includingcountries Inequalities in life expectancy usually indicateinequalities in health and socio-economic developmentLife expectancy rose throughout human history In

ancient Greece and Rome the average life expectancywas below years between the years and life expectancy at birth rose from about years to aglobal average of years and to more than years inthe richest countries (Riley ) Furthermore in mostindustrialized countries in the early twenty-rst centurylife expectancy averaged at about years (WHO)esechanges called the ldquohealth transitionrdquo are essentially theresult of improvements in public health medicine andnutritionLife expectancy varies signicantly across regions and

Life expectancy varies significantly across regions and continents, from life expectancies of about  years in some central African populations to  years and above in many European countries. The more developed regions have an average life expectancy of  years, while the population of the less developed regions is expected at birth to live an average of  years less. The two continents that display the most extreme differences in life expectancy are North America ( years) and Africa ( years), where as of recently the gap between life expectancies amounts to  years (UN – data).

Countries with the highest life expectancies in the world ( years) are Australia, Iceland, Italy, Switzerland, and Japan ( years). Japanese men and women live an average of  and  years, respectively (WHO ).

In countries with a high rate of HIV infection, principally in Sub-Saharan Africa, the average life expectancy is  years and below. Some of the world's lowest life expectancies are in Sierra Leone ( years), Afghanistan ( years), Lesotho ( years), and Zimbabwe ( years).

In nearly all countries women live longer than men. The world's average life expectancy at birth is  years for males and  years for females; the gap is about five years. The female-to-male gap is expected to narrow in the more developed regions and widen in the less developed regions. The Russian Federation has the greatest difference in life expectancies between the sexes ( years less for men), whereas in Tonga life expectancy for males exceeds that for females by  years (WHO ).

Life expectancy is assumed to rise continuously. According to estimates by the UN, global life expectancy at birth is likely to rise to an average of  years by –. By , life expectancy is expected to vary across countries from  to  years. Long-range United Nations population projections predict that by , on average, people can expect to live more than  years, from  years (Liberia) up to  years (Japan).

For more details on the calculation of life expectancy, including continuous notation, see Keyfitz ( ) and Preston et al. ( ).

Cross References
7Biopharmaceutical Research, Statistics in
7Biostatistics
7Demography
7Life Table
7Mean Residual Life
7Measurement of Economic Progress

References and Further Reading
Keyfitz N () Introduction to the mathematics of population. Addison-Wesley, Massachusetts
Keyfitz N () Applied mathematical demography. Springer, New York
Preston SH, Heuveline P, Guillot M () Demography: measuring and modelling population processes. Blackwell, Oxford
Riley JC () Rising life expectancy: a global history. Cambridge University Press, Cambridge
Rowland DT () Demographic methods and concepts. Oxford University Press, New York
Siegel JS, Swanson DA () The methods and materials of demography, 2nd edn. Elsevier Academic Press, Amsterdam
World Health Organization () World health statistics. WHO, Geneva


World population prospects: the  revision. Highlights. United Nations, New York
World population to  (). United Nations, New York

Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows.

Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

μ(x) = −(d/dx) ln{1 − F(x)} = f(x)/{1 − F(x)}.

Heuristically it is interpreted by the relation μ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely,

F(x) = 1 − exp{−∫₀ˣ μ(s) ds}.

The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of ℓ(0) newborn babies. The survival probability is

ₜpₓ = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ μ(x + s) ds},

and the non-survival probability is (the converse) ₜqₓ = 1 − ₜpₓ. For t = 1 one writes qₓ = ₁qₓ and pₓ = ₁pₓ. In particular we get ℓ(x + 1) = ℓ(x)pₓ. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of pₓ for all relevant x.
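A minimal sketch of this recursion in Python, using hypothetical single-year survival probabilities pₓ:

# Survivor function from the recursion l(x+1) = l(x) * p_x, with l(0) = 100,000.
# The p_x values below are hypothetical.
px = [0.992, 0.9995, 0.9994, 0.9993]      # p_0, p_1, p_2, p_3

l = [100_000.0]                           # l(0)
for p in px:
    l.append(l[-1] * p)                   # l(x+1) = l(x) p_x

for x, lx in enumerate(l):
    print(f"l({x}) = {lx:.0f}")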

The life expectancy is e°₀ = EX = ∫₀^∞ ℓ(x) dx / ℓ(0). (The subscript 0 in e°₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript ° indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°ₓ = E(X − x ∣ X > x) = ∫₀^∞ ℓ(x + t) dt / ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

To turn to the statistical estimation of these various quantities, suppose that the function μ(x) is piecewise constant, which means that we take it to equal some constant, say μⱼ, over each interval (xⱼ, xⱼ₊₁) for some partition {xⱼ} of the age axis. For a collection {Xᵢ} of independent observations of X, let Dⱼ be the number of the Xᵢ that fall in the interval (xⱼ, xⱼ₊₁). In mortality applications this is the number of deaths observed in the given interval. For the cohort of the initially newborn, Dⱼ is the number of individuals who die in the interval (called the occurrences in the interval). If individual i dies in the interval, he or she will of course have lived for Xᵢ − xⱼ time units during the interval. Individuals who survive the interval will have lived for xⱼ₊₁ − xⱼ time units in the interval, and individuals who do not survive to age xⱼ will not have lived during this interval at all. When we aggregate the time units lived in (xⱼ, xⱼ₊₁) over all individuals, we get a total Rⱼ, which is called the exposures for the interval, the idea being that individuals are exposed to the risk of death for as long as they live in the interval. In the simple case where there are no relations between the individual parameters μⱼ, the collection {Dⱼ, Rⱼ} constitutes a statistically sufficient set of observations, with a likelihood Λ that satisfies the relation ln Λ = Σⱼ{−μⱼRⱼ + Dⱼ ln μⱼ}, which is easily seen to be maximized by μ̂ⱼ = Dⱼ/Rⱼ. The latter fraction is therefore the maximum-likelihood estimator for μⱼ. (In some connections an age schedule of mortality will be specified, such as the classical Gompertz–Makeham function μₓ = a + bcˣ, which does represent a relationship between the intensity values at the different ages x, normally for single-year age groups. Maximum likelihood estimators can then be found by plugging this functional specification of the intensities into the likelihood function, finding the values â, b̂, and ĉ that maximize Λ, and using â + b̂ĉˣ for the intensity in the rest of the life table computations. Methods that do not amount to maximum likelihood estimation will sometimes be used because they involve simpler computations. With some luck they provide starting values for the iterative process that must usually be applied to produce the maximum likelihood estimators. For an example, see Forsén ().) This whole schema can be extended trivially to cover censoring (withdrawals), provided the censoring mechanism is unrelated to the mortality process.
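A sketch of this occurrence-exposure estimation, with invented counts (the age partition, deaths, and exposures are hypothetical):

import math

# Occurrences D_j (deaths) and exposures R_j (person-years) per age interval;
# all numbers are hypothetical.
D = [12, 3, 2, 5]
R = [990.0, 4_940.0, 4_910.0, 4_870.0]

# Maximum-likelihood estimates mu_j = D_j / R_j of the piecewise-constant
# force of mortality, and the implied survival probability exp(-mu_j) for a
# single-year interval with constant intensity.
mu_hat = [d / r for d, r in zip(D, R)]
p_hat = [math.exp(-m) for m in mu_hat]
print(mu_hat)
print(p_hat)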


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by μ̂ₓ in this interval, then p̂ₓ = e^(−μ̂ₓ) is an estimator of the single-year survival probability pₓ. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂ₓ for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely, from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and is also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a pₓ schedule from data collected for the members of a population during a limited time period and to use the mechanics of life-table construction to produce a period life table from the pₓ values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology; see, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), and Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J –
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika ():–
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of ℝᵏ and the variable y is assumed to take values in a subset of ℝⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θʸ(1 − θ)ⁿ⁻ʸ, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θʸ(1 − θ)ⁿ⁻ʸ,


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(yᵢ − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y₁, …, yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
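As a sketch of these ideas, the following code evaluates the binomial likelihood for hypothetical data (y = 7 successes in n = 10 trials), standardizes it by its maximum, and confirms that it is maximized at θ̂ = y/n:

import numpy as np

y, n = 7, 10                                   # hypothetical data
theta = np.linspace(0.001, 0.999, 999)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
rel_lik = np.exp(loglik - loglik.max())        # L(theta) / L(theta_hat)

theta_hat = y / n                              # the maximizing value
print(theta_hat)
print(theta[np.argmax(rel_lik)])               # agrees to grid accuracy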

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ)/L(θ̂) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see the section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed  successes in  trials, and assuming π(θ) = 1, the posterior probability that θ >  is ," and so on.
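A small sketch of such a posterior computation for the binomial model, with a flat prior π(θ) = 1 and hypothetical data (7 successes in 10 trials); analytically this posterior is a Beta(y + 1, n − y + 1) density:

import numpy as np

y, n = 7, 10                                    # hypothetical data
theta = np.linspace(0.0, 1.0, 10_001)
post = theta**y * (1.0 - theta)**(n - y)        # L(theta; y) * pi(theta)
post /= np.trapz(post, theta)                   # normalize to a density

keep = theta > 0.5
print(np.trapz(post[keep], theta[keep]))        # P(theta > 0.5 | y), about 0.89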

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior, through the likelihood, to the posterior.

A major difficulty with this approach is the choice of the prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distributions of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²ₖ random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y₁, …, Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →d χ²ₖ, (2)

U(θ)ᵀ i⁻¹(θ) U(θ) →d χ²ₖ, (3)

2{ℓ(θ̂) − ℓ(θ)} →d χ²ₖ, (4)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²ₖ denotes the 7chi-square distribution with k degrees of freedom.
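These limiting distributions are easy to check by simulation. The sketch below, for a Bernoulli model (k = 1), compares the log-likelihood ratio statistic in (4) with the χ²₁ quantile 3.84; all settings are illustrative:

import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 0.3, 200, 5_000               # illustrative settings

def loglik(theta, s, n):
    # Bernoulli log-likelihood with s successes out of n trials
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

w = []
for _ in range(reps):
    s = rng.binomial(n, theta0)
    th = s / n                                   # maximum likelihood estimate
    if 0 < th < 1:
        w.append(2 * (loglik(th, s, n) - loglik(theta0, s, n)))

# Under (4), w is approximately chi-squared with 1 degree of freedom:
print(np.mean(np.array(w) < 3.84))               # close to 0.95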

These results are all versions of a more general result: that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions.


There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²ₖ distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k₁-dimensional parameter of interest (k₁ will often be 1). The profile log-likelihood function of ψ is

ℓP(ψ) = ℓ(ψ, λ̂ψ),

where λ̂ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified under suitable smoothness conditions that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓP(ψ̂) − ℓP(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂ψ)} →d χ²ₖ₁.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
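The following sketch illustrates this point numerically for simulated regression data (the design, coefficients, and noise level are arbitrary): maximizing the profile log-likelihood of σ² gives the divisor-n estimator, not the divisor-(n − p) one:

import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # maximizes over beta
rss = np.sum((y - X @ beta_hat) ** 2)

print(rss / n)          # profile maximum-likelihood estimate of sigma^2
print(rss / (n - p))    # the usually preferred, unbiased estimator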

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

πM(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters.

For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ ∣ y₁; λ) or f(y; θ) = f₁(y₁ ∣ y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; bootstrap likelihood (Davison et al. ); and many modifications of profile likelihood (Barndorff-Nielsen and Cox , Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During – she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada () and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–), and Metrika (–).

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York. doi. Accessed Aug
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html. Accessed Aug
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments; the "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, … is a sequence of i.i.d. random variables,

Sₙ = Σⱼ₌₁ⁿ Xⱼ,

and the expectation a = EX exists, then n⁻¹Sₙ → a

almost surely, i.e., with probability 1. Thus the value na can be called the first order approximation for the sums Sₙ. The Central Limit Theorem (CLT) gives one a more precise approximation for Sₙ. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζₙ = (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law; that is, for all x,

P(ζₙ < x) → Φ(x) = (1/√(2π)) ∫₋∞ˣ e^(−t²/2) dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sₙ.
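Both statements are easily visualized by simulation; here is a minimal sketch for i.i.d. exponential variables (a = 1, σ² = 1), with illustrative sample sizes:

import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_000, 5_000
X = rng.exponential(1.0, size=(reps, n))        # a = 1, sigma^2 = 1

means = X.mean(axis=1)                          # LLN: concentrated near a = 1
zeta = (X.sum(axis=1) - n * 1.0) / np.sqrt(n)   # CLT: approximately N(0, 1)

print(means.mean(), means.std())                # ~1.0, ~1/sqrt(n) = 0.032
print((zeta < 1.96).mean())                     # ~Phi(1.96) = 0.975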

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are the following:

• Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

• Relaxing the assumption that the Xⱼ's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = Xⱼₙ forming the sum Sₙ depend not only on j but on n as well. In this case the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

• Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on the Xⱼ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

• Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and asymptotic expansions for this difference (in powers of n⁻¹/², in the case of i.i.d. summands) have been obtained under broad assumptions.



• Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζₙ > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.

• Considering observations X₁, …, Xₙ of a more complex nature, first of all multivariate random vectors. If Xⱼ ∈ ℝᵈ, then the role of the limit law in the CLT is played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X − EX)(X − EX)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xₙ) be a sample from a distribution F and F*ₙ(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u ∣F*ₙ(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*ₙ of large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) Σⱼ₌₁ⁿ Xⱼ → a  and  (σ²)* = (1/n) Σⱼ₌₁ⁿ (Xⱼ − a*)² = (1/n) Σⱼ₌₁ⁿ Xⱼ² − (a*)² → σ²,

almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory. Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution Fθ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions Fθ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for Fθ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(Sₖ − θk) under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞ then there exists the limit

lim_{θ↓0} P(θS > x) = e^(−2x/σ²), x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small. (A simulation sketch of this approximation is given after the second approach below.)

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation Ee^(μ(X−θ)) = 1 has a solution μ > 0, then in the above example one has

P(S > x) ∼ c e^(−μx), x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case Ee^(μ(X−θ)) = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫ₓ^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.
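As an illustration of the first approach, the following sketch simulates S = sup_k(Sₖ − θk) and compares the tail P(θS > x) with the Kingman–Prokhorov limit e^(−2x/σ²); the Gaussian increments, drift θ, horizon, and replication count are all assumptions of the sketch:

import numpy as np

rng = np.random.default_rng(3)
theta, sigma2, x = 0.1, 1.0, 1.0
n_steps, reps = 20_000, 2_000                   # illustrative settings

hits = 0
for _ in range(reps):
    steps = rng.normal(0.0, 1.0, size=n_steps)  # E X = 0, Var X = 1
    walk = np.cumsum(steps) - theta * np.arange(1, n_steps + 1)
    S = max(walk.max(), 0.0)                    # sup over k >= 0 of S_k - theta k
    if theta * S > x:
        hits += 1

print(hits / reps)                              # empirical P(theta S > x)
print(np.exp(-2.0 * x / sigma2))                # limit value, 0.1353...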

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζₙ(t) = (S_⌊nt⌋ − ant)/(σ√n)}, t ∈ [0, 1], converge in distribution to the standard Wiener process {w(t)}, t ∈ [0, 1]. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sₙ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {Sₖ}, k ≥ 1, crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes {w(t)}, t ∈ [0, 1], and {w(t)}, t ∈ [0, ∞).

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*ₙ(u) − F(u))}, u ∈ (−∞, ∞), converges in distribution to {w⁰(F(u))}, u ∈ (−∞, ∞), where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*ₙ(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov has been Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), since , and Head of the Probability and Statistics Chair at the Novosibirsk University since . He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences (), and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken.


Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yᵢ ∣ bᵢ ∼ N(Xᵢβ + Zᵢbᵢ, Σᵢ), (1)
bᵢ ∼ N(0, D), (2)

where Xᵢ and Zᵢ are (nᵢ × p)- and (nᵢ × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is an (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ does not depend upon i. Finally, bᵢ is a vector of subject-specific, or random, effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bᵢ) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢ′ + Σᵢ.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢ′ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²I_{nᵢ}. First, consider the case where the random effects are univariate and represent unit-specific intercepts; this corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yᵢ ∼ N(Xᵢβ, Vᵢ), Vᵢ = ZᵢDZᵢ′ + Σᵢ,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vᵢ.

With respect to the estimation of the unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢ′Vᵢ⁻¹(α)(yᵢ − Xᵢβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile. It can easily be shown that

ŷᵢ = ΣᵢVᵢ⁻¹Xᵢβ̂ + (I_{nᵢ} − ΣᵢVᵢ⁻¹) yᵢ,

which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and I_{nᵢ} − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" ΣᵢVᵢ⁻¹ represents within-unit variability, and the "denominator" Vᵢ is the overall covariance matrix. Hence much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than is actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂ᵢ) ≤ var(λ′bᵢ) = λ′Dλ.
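The shrinkage property can be seen directly in a small simulation. The sketch below uses a random-intercept model with known variance components and known β, purely for illustration; in practice these would be replaced by their maximum likelihood estimates:

import numpy as np

rng = np.random.default_rng(4)
m, ni = 200, 5                      # units and measurements per unit
d, sigma2 = 2.0, 1.0                # D = [d] (random intercept), Sigma_i = sigma2 I
beta = np.array([1.0])

X = np.ones((ni, 1))                # fixed-effects design: intercept only
Z = np.ones((ni, 1))                # random-intercept design
V = d * (Z @ Z.T) + sigma2 * np.eye(ni)       # V_i = Z D Z' + Sigma_i
Vinv = np.linalg.inv(V)

b_true, b_eb = [], []
for _ in range(m):
    bi = rng.normal(0.0, np.sqrt(d))
    yi = (X @ beta) + bi + rng.normal(0.0, np.sqrt(sigma2), size=ni)
    b_true.append(bi)
    b_eb.append((d * Z.T @ Vinv @ (yi - X @ beta)).item())  # D Z' V^{-1} (y_i - X_i beta)

print(np.var(b_eb), np.var(b_true))  # EB estimates are less variable than the b_i (~ d)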

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references, at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω₁ and ω₂, embedded in a white noise background: y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (1, x₁, x₂), x₁ ≡ x₁(t) = sin[ω₁t], x₂ ≡ x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0, Eεᵢεₖ ≡ σ² > 0 for i = k ≤ n, and Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations and may, in some formulations, depend also on the values of x associated with instance i

Linear Regression Models L

L

(perhaps the errors are correlated and that correlation depends upon the x values). What we observe are yᵢ and associated values of the independent variables x. That is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {yᵢ, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are xᵢₖ = xₖ[tᵢ].
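A sketch of this setup in Python, with hypothetical frequencies and amplitudes standing in for the unknown quantities above:

import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 500)
w1, w2 = 2 * np.pi * 5, 2 * np.pi * 9          # hypothetical known frequencies
beta_true = np.array([0.5, 2.0, -1.0])          # hypothetical amplitudes

# Design matrix with entries x_ik = x_k[t_i]: constant, sin(w1 t), sin(w2 t).
x = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])
y = x @ beta_true + rng.normal(size=t.size)     # white-noise background

beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None) # least squares fit
print(beta_ls)                                  # close to beta_true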

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as .

In  he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only  days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields; see Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal) and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope (∼ ) of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1 then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by  he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in our many statistical models, "driven," as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-a-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time, based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations in which the random errors ε are assumed to satisfy Eεᵢ ≡ 0, Eε²ᵢ ≡ σ² > 0, i ≤ n:

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ, i ≤ n. (1)

The interpretation is that response yᵢ is observed for the ith sample in conjunction with numerical values (xᵢ₁, …, xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂ML = β̂LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε, (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂LS)ₖ = βₖ, k ≤ p, (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε then β̂LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂LS)ₖ = βₖ + Mₖε and Mₖ(ε + c) = Mₖε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s(y − xβ̂LS) = (1 − R²)^(1/2) s(y), where s(⋅) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂LS.


an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on the permutation bootstrap of residuals as a descriptive method.
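In code, the resampling scheme just described amounts to refitting after permuting residuals. The following is a simplified numpy sketch (the function name and data layout are ours, and it is a stand-in for, not a transcription of, the procedure analyzed by LePage and Podgorski):

    import numpy as np

    def permutation_bootstrap(x, y, v, B=999, seed=None):
        # Fit LS, then repeatedly permute the residuals, refit, and record the
        # contrast v'(beta* - beta_hat) to approximate its sampling distribution.
        rng = np.random.default_rng(seed)
        M = np.linalg.inv(x.T @ x) @ x.T
        beta_hat = M @ y
        resid = y - x @ beta_hat
        out = np.empty(B)
        for b in range(B):
            y_star = x @ beta_hat + rng.permutation(resid)
            out[b] = v @ (M @ y_star - beta_hat)
        return out  # e.g., v = np.array([0.0, 1.0, 0.0]) targets the second coefficient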

Generalized Least Squares

If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution, retaining properties (3)–(4) (any positive multiple of Σ will produce the same result), given by

β_ML = (x′Σ⁻¹x)⁻¹x′Σ⁻¹y = β + (x′Σ⁻¹x)⁻¹x′Σ⁻¹ε.   (5)

The generalized least squares solution (5) retains properties (3)–(4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
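A numerical sketch of (5) (numpy assumed; the AR(1)-type Σ and all parameter values are our illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
    beta = np.array([0.5, 0.2])
    Sigma = 0.8 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # known up to a multiple
    y = x @ beta + rng.multivariate_normal(np.zeros(n), Sigma)

    Si = np.linalg.inv(Sigma)
    beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # solution (5)
    # Any positive multiple of Sigma yields the same estimate:
    Si2 = np.linalg.inv(2.0 * Sigma)
    assert np.allclose(beta_gls, np.linalg.solve(x.T @ Si2 @ x, x.T @ Si2 @ y))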

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model

Parzen ( , Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t),   t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t),   t ∈ T,

where the w_i(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3)–(4). The RKHS setup

has been examined from an on-line learning perspective (Vovk ).

Jointly Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares

For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ   for every x,

and the discrepancies y − E(y∣x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models

Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), 1 ≤ i ≤ n, represent

a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual

discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, to give estimates of the likely proximity of our sample least squares fit to the population fit, and to estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's

Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
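The two bootstrap schemes can be sketched side by side (numpy assumed; function names are ours). Resampling residuals matches the errors model; resampling whole rows matches the sampling model:

    import numpy as np

    def residual_bootstrap(x, y, B=999, seed=None):
        # Errors-model bootstrap: resample residuals with replacement.
        rng = np.random.default_rng(seed)
        M = np.linalg.inv(x.T @ x) @ x.T
        b_hat = M @ y
        r = y - x @ b_hat
        return np.array([M @ (x @ b_hat + rng.choice(r, size=len(r))) for _ in range(B)])

    def pairs_bootstrap(x, y, B=999, seed=None):
        # Sampling-model bootstrap: resample whole observations (x_i, y_i).
        rng = np.random.default_rng(seed)
        n = len(y)
        reps = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            xb, yb = x[idx], y[idx]
            reps.append(np.linalg.solve(xb.T @ xb, xb.T @ yb))
        return np.array(reps)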

Balancing Good Fit Against Reproducibility

A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called overfitting. Models overfit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have overfit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is under-

stood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting owing to the limited number of observations available.

One possible resolution to this tradeoff between reli-

able estimation of a few model coefficients versus the risk that, by doing so, too much model-related material is left in the error term is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggest are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond

Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion

to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < c E[J], and so

E[Y 1(X > c)] = E[Y 1(Y > c) + YJ] = E[X 1(X > c)] + E[YJ]
  < E[X 1(X > c)] + c E[J] = E[X 1(X > c)],

yielding E(Y∣X > c) < E(X∣X > c). ◻
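A quick Monte Carlo illustration of the Proposition (numpy assumed), with X and Y independent standard normal, hence identically distributed:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=1_000_000)
    y = rng.normal(size=1_000_000)       # same distribution as x
    c = 1.0                              # any c > max(0, EX) = 0
    sel = x > c
    print(y[sel].mean(), x[sel].mean())  # E(Y|X>c) ~ 0 < E(X|X>c) ~ 1.53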

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat ():–
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat ():–
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat ():–
Freedman D () Bootstrapping regression models. Ann Stat :–
Freedman D () Statistical models and shoe leather. Sociol Methodol :–
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat :–
Galton F () Typical laws of heredity. Nature :–, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland :–
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal ():–
Parzen E () An approach to time series analysis. Ann Math Stat ():–
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean more universal than regression toward the mean. Am Stat :–
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x^(n) = (x_1, …, x_n) is a sample from a stochastic process x = {x_1, x_2, …}. Let p_n(x^(n), θ) denote the joint density function of x^(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x^(n), θ_n)/p_n(x^(n), θ)], where θ_n = θ + n^{−1/2} h and h is a (k × 1) vector. The joint density p_n(x^(n), θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h′ S_n(θ) − (1/2) n^{−1} h′ J_n(θ) h + o_p(1),   (1)

where S_n(θ) = d ln p_n(x^(n), θ)/dθ, J_n(θ) = −d² ln p_n(x^(n), θ)/(dθ dθ′), and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)),   (ii) n^{−1} J_n(θ) →p F(θ),   (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well

known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F⁻¹(θ)).   (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-

degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x^(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm

√n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),   (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose that, conditionally on V = v, (x_1, x_2, …, x_n) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x^(n) is then given by

p_n(x^(n), θ) ∝ [1 + (1/2) Σ_{i=1}^n (x_i − θ)²]^{−(n/2 + 1)}.

It can be verified that F(θ) is

an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
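A quick numerical check of Example 1 (a sketch with numpy, taking θ = 0), comparing empirical quantiles of the mixture with the closed-form t(2) quantile function F⁻¹(p) = u √(2/(1 − u²)), u = 2p − 1:

    import numpy as np

    rng = np.random.default_rng(3)
    v = rng.exponential(1.0, size=200_000)    # V ~ exponential with mean 1
    x = rng.normal(0.0, 1.0 / np.sqrt(v))     # x | V=v ~ N(theta, 1/v), theta = 0

    p = np.array([0.5, 0.9, 0.99])
    u = 2 * p - 1
    print(np.quantile(x, p))                  # empirical quantiles of the mixture
    print(u * np.sqrt(2 / (1 - u**2)))        # t(2) quantiles: 0.0, 1.886, 6.965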

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θ x_{t−1} + e_t, t = 1, 2, …, with x_0 = 0, where the {e_t} are assumed to be iid N(0, 1) random variables.


We then have p_n(x^(n), θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θ x_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂_n = (Σ_{t=1}^n x_t x_{t−1}) (Σ_{t=1}^n x²_{t−1})⁻¹.

One can verify that

√n(θ̂_n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and

(θ² − 1)⁻¹ θⁿ (θ̂_n − θ) →d Cauchy for ∣θ∣ > 1.
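A small simulation sketch of the explosive case (numpy assumed; θ and n are illustrative). Across replications, the normalized error (θ² − 1)⁻¹ θⁿ (θ̂_n − θ) should exhibit Cauchy-like behavior:

    import numpy as np

    def ar1_mle(theta, n, rng):
        # x_t = theta * x_{t-1} + e_t, x_0 = 0, e_t iid N(0, 1); return the ML estimator.
        x = np.zeros(n + 1)
        for t in range(1, n + 1):
            x[t] = theta * x[t - 1] + rng.normal()
        return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

    rng = np.random.default_rng(4)
    theta, n = 1.05, 300                      # explosive case, |theta| > 1
    z = np.array([(theta**n / (theta**2 - 1)) * (ar1_mle(theta, n, rng) - theta)
                  for _ in range(2000)])
    # Cauchy-like: quartiles are stable, but the sample variance is erratic.
    print(np.percentile(z, [25, 50, 75]))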

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored a large number of publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b),   a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is

√Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample data point on the abscissa is called the plotting position; for its choice see Barnett ( ), Blom ( ), Kimball ( ). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances acts as plotting position. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the general least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′,   1 = (1, 1, …, 1)′,
α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,   ε = (ε_1, ε_2, …, ε_n)′,
θ = (a, b)′,   A = (1  α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′ΩA)⁻¹ A′Ω x,   with Ω = B⁻¹,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ΩA)⁻¹.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
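As a small illustration of the probability-plotting idea (numpy assumed; all values are hypothetical, and ordinary least squares with approximate plotting positions stands in for Lloyd's full GLS):

    import numpy as np

    rng = np.random.default_rng(5)
    a_true, b_true, n = 10.0, 2.0, 400
    u = rng.uniform(size=n)
    x = np.sort(a_true + b_true * np.log(u / (1 - u)))   # ordered logistic(a, b) sample

    p = (np.arange(1, n + 1) - 0.5) / n                  # simple plotting positions
    alpha = np.log(p / (1 - p))                          # reduced logistic percentiles

    b_hat, a_hat = np.polyfit(alpha, x, 1)               # straight-line fit X_(i) ~ a + b*alpha_i
    print(a_hat, b_hat)                                  # close to 10 and 2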

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat :–
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat :–
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc :–
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika :–
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference :–
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with

responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (),

E[R_i] = m_i π_i and
Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, the random effect Z allowing for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
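A brief simulation of this overdispersion (numpy assumed; μ, σ, and the group structure are invented), comparing the empirical variance of R with the small-σ² approximation above:

    import numpy as np

    rng = np.random.default_rng(6)
    n_groups, m, mu, sigma = 4000, 20, -0.5, 0.4
    z = rng.normal(size=n_groups)
    p = 1 / (1 + np.exp(-(mu + sigma * z)))   # logit(P) ~ N(mu, sigma^2)
    r = rng.binomial(m, p)

    pi = 1 / (1 + np.exp(-mu))                # logit(pi) = eta
    approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
    print(r.var(), approx)                    # both well above binomial m*pi*(1-pi)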

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B :–
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika :–
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika :–
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat :–

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1−y_i}.   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier

estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it

was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{[y_i θ_i − b(θ_i)]/α(ϕ) + c(y_i; ϕ)},   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into

exponential family form (2) as

f(y_i; π_i) = exp{y_i ln[π_i/(1 − π_i)] + ln(1 − π_i)}.   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant − ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algo-

rithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation


across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln[μ_i/(1 − μ_i)] + ln(1 − μ_i)},   (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln[(1 − y_i)/(1 − μ_i)]}.   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln[μ_i/(1 − μ_i)]. We have, therefore,

x′_i β = ln[μ_i/(1 − μ_i)] = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.   (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/[1 + exp(−x′_i β)] = exp(x′_i β)/[1 + exp(x′_i β)].   (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–

Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score, function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are

derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i,   (9)

∂²L(β)/(∂β ∂β′) = − Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i).   (10)
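The Newton–Raphson iteration built from the gradient (9) and Hessian (10) fits in a few lines. Below is a minimal sketch (numpy assumed, data simulated; there are no safeguards against non-convergence or perfectly separated data):

    import numpy as np

    def logistic_newton_raphson(x, y, tol=1e-10, max_iter=50):
        beta = np.zeros(x.shape[1])
        for _ in range(max_iter):
            mu = 1 / (1 + np.exp(-x @ beta))              # inverse link (8)
            grad = x.T @ (y - mu)                         # gradient (9)
            hess = -(x * (mu * (1 - mu))[:, None]).T @ x  # Hessian (10)
            step = np.linalg.solve(hess, grad)
            beta = beta - step
            if np.max(np.abs(step)) < tol:
                break
        return beta

    rng = np.random.default_rng(9)
    x = np.column_stack([np.ones(200), rng.normal(size=200)])
    y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 + 1.2 * x[:, 1])))).astype(float)
    print(np.exp(logistic_newton_raphson(x, y)))          # exponentiated estimates: odds ratios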

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln[μ/(1 − μ)], where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing

the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of adult survivors divided by the number of adult deaths; the odds of survival for children is computed in the same way. The ratio of the odds of survival for adults to the odds of survival for children is the odds ratio, the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had


nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor

value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 Theoddsof surviving isoneandahalfpercentgreater for each

increasing year of age

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or

proportional data. For these models, the response consists of a numerator, indicating the number of successes (s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = (m_i choose y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},   (11)

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln[μ_i/(1 − μ_i)] + m_i ln(1 − μ_i) + ln(m_i choose y_i)}.   (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations

having the specific covariate pattern. The first observation in the table informs us that there are three cases having the first pattern of predictor values of x1, x2, and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many observations being grouped into a far smaller number of rows, or covariate patterns, as above described. Data in tables is nearly always expressed in grouped format.
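Grouped responses of this kind can be fit with any GLM routine. A sketch using the statsmodels package (the data here are invented, not the table above):

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical grouped data: y successes out of `cases` trials per covariate pattern.
    y = np.array([1, 2, 0, 1])
    cases = np.array([3, 3, 2, 2])
    X = sm.add_constant(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]))

    fit = sm.GLM(np.column_stack([y, cases - y]), X,
                 family=sm.families.Binomial()).fit()
    print(np.exp(fit.params))   # exponentiated coefficients are odds ratios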

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Cri-

terion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recom-

mended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include cat-

egorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently

been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and an Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the university's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for


modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,   −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,   −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic

distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,   −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α],   t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x) [1 − F(x)],

which in turn by elementary calculus implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x,   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
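In Python with numpy this is a one-liner (μ = 10 and τ = 2 are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(7)
    u = rng.uniform(size=100_000)
    x = np.log(u / (1 - u))          # standard logistic via the identity (2)
    y = 2.0 * x + 10.0               # general logistic: tau * X + mu
    print(y.mean(), y.var(), 2.0**2 * np.pi**2 / 3)   # variance near tau^2 * pi^2 / 3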

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β_0 + β_1 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc :–
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A :–

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common tool of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If

L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio

Lorenz Curve Fig. 1  Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve Fig. 2  Two intersecting Lorenz curves. Using the Gini index, L_1(p) has greater inequality (G = ) than L_2(p) (G = )

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
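The empirical Lorenz curve and Gini index are easy to compute from a sample of incomes. A minimal sketch (numpy assumed; lognormal incomes are chosen only as a typical skewed example):

    import numpy as np

    def lorenz_gini(income):
        x = np.sort(np.asarray(income, dtype=float))
        L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])  # Lorenz ordinates at p = i/n
        n = len(x)
        area = np.sum(L[1:] + L[:-1]) / (2 * n)              # trapezoid rule: area under L
        return L, 1 - 2 * area                               # G = 1 - 2 * (area under L)

    rng = np.random.default_rng(8)
    incomes = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)
    print(lorenz_gini(incomes)[1])                           # Gini index, roughly 0.43 here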

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ).

Theorem: Let u(x) be a continuous monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance,

which is that u(x)/x crosses the μ_Y/μ_X level once from above.

If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and

redistributions is the monograph by Lambert ().

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica :–
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ :–
Jakobsson U () On the measurement of the degree of progression. J Publ Econ :–
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica :–
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser :–

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt-Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define

and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set

of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ and

defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First,

let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1 describing the null hypothesis and the alternative set, Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1 if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t²  if ∣t∣ ≤ γ,  and  ℓ(t) = 2γ∣t∣ − γ²  if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
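To fix ideas, here is a minimal Python sketch (NumPy assumed; the function names and default constants are our illustrative choices, not part of the literature) of the loss functions ℓ just discussed:

import numpy as np

def squared_loss(t):
    # classical square loss, l(t) = t^2 (p = 2)
    return np.asarray(t, dtype=float) ** 2

def l1_loss(t):
    # robust L1-loss, l(t) = |t| (p = 1)
    return np.abs(t)

def huber_loss(t, gamma=1.0):
    # Huber loss: quadratic for |t| <= gamma, linear beyond; continuous at |t| = gamma
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

def linex_loss(t, a=1.0, b=1.0):
    # LinEx loss (Varian): exponential on one side of zero, roughly linear on the other
    t = np.asarray(t, dtype=float)
    return b * (np.exp(a * t) - a * t - 1)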

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss

functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined so as not to produce counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation',

i.e., the function f ∈ F that minimizes Σᵢ₌₁ⁿ ℓ(rᵢ(f)),

where rᵢ(f) = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
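In this spirit, here is a minimal Python sketch (NumPy and SciPy assumed; the linear class F = {f(t) = c0 + c1·t} and the simulated data are hypothetical choices of ours) that fits a regression function by minimizing the summed loss of the residuals:

import numpy as np
from scipy.optimize import minimize

def fit_by_loss(t, y, loss):
    # choose f in F = {c0 + c1*t} minimizing sum_i loss(y_i - f(t_i))
    def objective(c):
        return np.sum(loss(y - (c[0] + c[1] * t)))
    # Nelder-Mead copes with non-smooth losses such as the L1-loss
    return minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x

t = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * t + np.random.default_rng(0).normal(0.0, 0.1, 50)
c_ls = fit_by_loss(t, y, lambda r: r**2)   # least squares, l(t) = t^2
c_l1 = fit_by_loss(t, y, np.abs)           # robust L1 fit, l(t) = |t|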

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –


Life Table

Jan M. Hoem
Professor, Director Emeritus
Max Planck Institute for Demographic Research, Rostock, Germany

The life table is a classical tabular representation of central features of the distribution function F of a positive variable, say X, which normally is taken to represent the lifetime of a newborn individual. The life table was introduced well before modern conventions concerning statistical distributions were developed, and it comes with some special terminology and notation, as follows. Suppose that F has a density f(x) = (d/dx)F(x), and define the force of mortality or death intensity at age x as the function

μ(x) = −(d/dx) ln{1 − F(x)} = f(x)/{1 − F(x)}.

Heuristically it is interpreted by the relation μ(x)dx = P{x < X < x + dx ∣ X > x}. Conversely, F(x) = 1 − exp{−∫₀ˣ μ(s)ds}. The survivor function is defined as ℓ(x) = ℓ(0){1 − F(x)}, normally with ℓ(0) = 100,000. In mortality applications, ℓ(x) is the expected number of survivors to exact age x out of an original cohort of ℓ(0) newborn babies. The survival probability is

ₜpₓ = P{X > x + t ∣ X > x} = ℓ(x + t)/ℓ(x) = exp{−∫₀ᵗ μ(x + s)ds},

and the non-survival probability is (the converse) ₜqₓ = 1 − ₜpₓ. For t = 1 one writes qₓ = ₁qₓ and pₓ = ₁pₓ. In particular we get ℓ(x + 1) = ℓ(x)pₓ. This is a practical recursion formula that permits us to compute all values of ℓ(x) once we know the values of pₓ for all relevant x.

The life expectancy is e°₀ = EX = ∫₀^∞ ℓ(x)dx/ℓ(0). (The subscript 0 in e°₀ indicates that the expected value is computed at age 0, i.e., for newborn individuals, and the superscript o indicates that the computation is made in the continuous mode.) The remaining life expectancy at age x is

e°ₓ = E(X − x ∣ X > x) = ∫₀^∞ ℓ(x + t)dt/ℓ(x),

i.e., it is the expected lifetime remaining to someone who has attained age x.

quantities suppose that the function micro(x) is piecewiseconstant which means that we take it to equal some con-stant say microj over each interval (xj xj+) for some partitionxj of the age axis For a collection Xi of independentobservations of X let Dj be the number of Xi that fall inthe interval (xj xj+) In mortality applications this is thenumber of deaths observed in the given interval For thecohort of the initially newborn Dj is the number of indi-viduals whodie in the interval (called the occurrences in theinterval) If individual i dies in the interval he or she willof course have lived for Xi minus xj time units during the inter-val Individuals who survive the interval will have lived forxj+ minus xj time units in the interval and individuals whodo not survive to age xj will not have lived during thisinterval at all When we aggregate the time units lived in(xj xj+) over all individuals we get a total Rj which iscalled the exposures for the interval the idea being thatindividuals are exposed to the risk of death for as long asthey live in the interval In the simple case where there areno relations between the individual parameters microj the col-lection DjRj constitutes a statistically sucient set ofobservations with a likelihood Λ that satises the relationln Λ = sumjminusmicrojRj+Dj ln microjwhich is easily seen to bemax-imized by microj = DjRje latter fraction is therefore themaximum-likelihood estimator for microj (In some connec-tions an age schedule of mortality will be specied such asthe classical GompertzndashMakeham function microx = a + bcxwhich does represent a relationship between the intensityvalues at the dierent ages x normally for single-year agegroupsMaximum likelihood estimators can then be foundby plugging this functional specication of the intensitiesinto the likelihood function nding the values a b andc that maximize Λ and using a + bcx for the intensity inthe rest of the life table computations Methods that do notamount tomaximum likelihood estimationwill sometimesbe used because they involve simpler computations Withsome luck they provide starting values for the iterative pro-cess that must usually be applied to produce the maximumlikelihood estimators For an example see Forseacuten ())is whole schema can be extended trivially to cover cen-soring (withdrawals) provided the censoringmechanism isunrelated to the mortality process


If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by μ̂x in this interval, then p̂x = e^(−μ̂x) is an estimator of the single-year survival probability px. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ̂(x + 1) = ℓ̂(x)p̂x for x = 0, 1, …, and the rest of the life table computations follow suit. Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.

So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a px schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the px values.

Life table techniques are described in detail in most introductory textbooks in actuarial statistics, 7biostatistics, 7demography, and epidemiology. See, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography, see the entry 7Demography.

Cross References
7Demographic Analysis: A Stochastic Approach
7Demography
7Event History Analysis
7Kaplan-Meier Estimator
7Population Projections
7Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J –
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika ()–
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of ℝᵏ, and the variable y is assumed to take values in a subset of ℝⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = cf(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.

Within the context of the given parametric model, the likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.

If our model is f(y; θ) = (n choose y) θʸ(1 − θ)ⁿ⁻ʸ, y = 0, 1, …, n, θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θʸ(1 − θ)ⁿ⁻ʸ,

and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean μ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(yᵢ − μ)²},

where θ = (μ, σ²). This could also be plotted as a function of μ and σ² for a given sample y₁, …, yₙ, and it is not difficult to show that this likelihood function only depends on the sample through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ̂)/L(θ) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of θ that is of interest; see section "7Nuisance Parameters" below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ)dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed y successes in n trials, and assuming π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.
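Continuing the binomial example, a short Python sketch (NumPy assumed; the observed counts and the threshold are hypothetical choices of ours) of this posterior computation with the uniform prior π(θ) = 1:

import numpy as np

n, y = 10, 7                              # hypothetical data
theta = np.linspace(0.0005, 0.9995, 1000)
dt = theta[1] - theta[0]
post = theta**y * (1 - theta)**(n - y)    # L(theta; y) * pi(theta), with pi(theta) = 1
post /= post.sum() * dt                   # normalize by the integral in Bayes' rule
prob = post[theta > 0.5].sum() * dt       # posterior probability that theta > 0.5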

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²ₖ random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y₁, …, Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)ᵀ i(θ)(θ̂ − θ) →d χ²ₖ, (2)

U(θ)ᵀ i⁻¹(θ)U(θ) →d χ²ₖ, (3)

2{ℓ(θ̂) − ℓ(θ)} →d χ²ₖ, (4)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²ₖ is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result: the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.
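To connect the results (2)-(4) with the binomial example, the following Python sketch (NumPy assumed; the data and null value are hypothetical) computes the Wald, score, and likelihood ratio statistics, each approximately χ²₁ for large n:

import numpy as np

n, y, theta0 = 100, 62, 0.5               # hypothetical data and null value
theta_hat = y / n

def loglik(t):
    # binomial log-likelihood, constant term dropped
    return y * np.log(t) + (n - y) * np.log(1 - t)

score = y / theta0 - (n - y) / (1 - theta0)      # score function U(theta0)
info = n / (theta0 * (1 - theta0))               # expected Fisher information i(theta0)

wald = (theta_hat - theta0)**2 * n / (theta_hat * (1 - theta_hat))  # version of (2), i at theta_hat
score_stat = score**2 / info                                         # (3)
lr = 2 * (loglik(theta_hat) - loglik(theta0))                        # (4)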

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²ₖ distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k-dimensional parameter of interest (k will often be 1). The profile log-likelihood function of ψ is

ℓP(ψ) = ℓ(ψ, λ̂ψ),

where λ̂ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers. It can be verified, under suitable smoothness conditions,

that results similar to those at (2)–(4) hold as well for the profile log-likelihood function: in particular,

2{ℓP(ψ̂) − ℓP(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂ψ)} →d χ²ₖ.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters: in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

πM(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y)dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ)f₂(y₂ ∣ y₁; λ), or
f(y; θ) = f₁(y₁ ∣ y₂; ψ)f₂(y₂; λ).

A discussion of conditional inference and density factorizations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and to many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During (–) she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada () and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–), and Metrika (–).

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, … is a sequence of i.i.d. random variables,

Sₙ = Σⱼ₌₁ⁿ Xⱼ,

and the expectation a = EX exists, then n⁻¹Sₙ → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums Sₙ. The Central Limit Theorem (CLT) gives one a more precise approximation for Sₙ. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζₙ = (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζₙ < x) → Φ(x) = (1/√(2π)) ∫₋∞ˣ e^(−t²/2) dt.

The quantity nEX + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sₙ.
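Both statements are easy to check by simulation; a minimal Python sketch (NumPy assumed; the exponential distribution, with a = σ = 1, is an arbitrary choice of ours):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 5000
x = rng.exponential(scale=1.0, size=(reps, n))   # i.i.d. samples with a = EX = 1, sigma = 1

means = x.mean(axis=1)                           # LLN: concentrated near a = 1
zeta = (x.sum(axis=1) - n * 1.0) / np.sqrt(n)    # standardized sums (sigma = 1)
print(means.std(), (zeta < 0).mean())            # CLT: P(zeta < 0) is close to Phi(0) = 0.5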

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

1. Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity and such that the limit limₓ→∞ P(X > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

2. Relaxing the assumption that the Xⱼ's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = Xⱼₙ forming the sum Sₙ depend not only on j but on n as well. In this case the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

3. Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on Xⱼ, under which the LLN and CLT still hold true, have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

4. Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^(−1/2) in the case of i.i.d. summands) have been obtained under broad assumptions.

5. Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζₙ > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

6. Considering observations X₁, …, Xₙ of a more complex nature: first of all, multivariate random vectors. If Xⱼ ∈ ℝᵈ, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xₙ) be a sample from a distribution F and F*ₙ(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that supᵤ ∣F*ₙ(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*ₙ of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) Σⱼ₌₁ⁿ Xⱼ → a  (almost surely),

(σ²)* = (1/n) Σⱼ₌₁ⁿ (Xⱼ − a*)² = (1/n) Σⱼ₌₁ⁿ Xⱼ² − (a*)² → σ²  (almost surely).
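In code these estimators are one-liners; a quick Python check on simulated data (NumPy assumed; the normal distribution and its parameter values are arbitrary choices of ours):

import numpy as np

x = np.random.default_rng(2).normal(3.0, 2.0, 100000)  # simulated sample: a = 3, sigma^2 = 4
a_star = x.mean()                           # consistent for a = EX
sigma2_star = np.mean((x - a_star)**2)      # consistent for sigma^2 (divisor n)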

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here:

1. Suppose that the unknown distribution Fθ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions Fθ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for Fθ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(Sₖ − θk), under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^(−2x/σ²), x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation Ee^(μ(X−θ)) = 1 has a solution μ > 0, then in the above example one has

P(S > x) ∼ ce^(−μx), x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case Ee^(μ(X−θ)) = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t))dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζₙ(t) = (S_⌊nt⌋ − ant)/(σ√n)}, t ∈ [0, 1], converge in distribution to the standard Wiener process {w(t)}, t ∈ [0, 1]. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sₙ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {Sₖ}, k = 1, 2, …, crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of Wiener processes {w(t)}, t ∈ [0, 1], and {w(t)}, t ∈ [0, ∞).

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*ₙ(u) − F(u))}, u ∈ (−∞, ∞), converges in distribution to {w⁰(F(u))}, u ∈ (−∞, ∞), where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*ₙ(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences (), and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yᵢ ∣ bᵢ ∼ N(Xᵢβ + Zᵢbᵢ, Σᵢ), (1)
bᵢ ∼ N(0, D), (2)

where Xᵢ and Zᵢ are (nᵢ × p)- and (nᵢ × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is a (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ will not depend upon i. Finally, bᵢ is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bᵢ) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢ′ + Σᵢ.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢ′ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²Iₙᵢ. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yᵢ ∼ N(Xᵢβ, Vᵢ), Vᵢ = ZᵢDZᵢ′ + Σᵢ,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vᵢ.

With respect to the estimation of unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢ′Vᵢ⁻¹(α)(yᵢ − Xᵢβ), where α denotes the vector of covariance parameters in Vᵢ. Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile. It can easily be shown that

ŷᵢ = ΣᵢVᵢ⁻¹Xᵢβ̂ + (Iₙᵢ − ΣᵢVᵢ⁻¹)yᵢ,

which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and Iₙᵢ − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" of ΣᵢVᵢ⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vᵢ. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂ᵢ) ≤ var(λ′bᵢ) = λ′Dλ.
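A minimal numerical sketch of these formulas for a random-intercept model (Python with NumPy assumed; D, σ², the design, and the data are hypothetical, and β is treated as known rather than estimated):

import numpy as np

ni = 5
Xi = np.column_stack([np.ones(ni), np.arange(ni)])  # (ni x p) fixed-effects design
Zi = np.ones((ni, 1))                               # random intercept: (ni x q), q = 1
beta = np.array([1.0, 0.5])
D = np.array([[2.0]])                               # random-effects covariance
sigma2 = 1.0
Sigma_i = sigma2 * np.eye(ni)                       # conditional independence

yi = np.array([1.2, 2.4, 2.1, 3.9, 3.2])            # hypothetical observed profile
Vi = Zi @ D @ Zi.T + Sigma_i                        # marginal covariance V_i
bi_hat = D @ Zi.T @ np.linalg.solve(Vi, yi - Xi @ beta)  # empirical Bayes estimate

# shrinkage: weighted average of population profile and observed data
W = Sigma_i @ np.linalg.inv(Vi)
yi_pred = W @ (Xi @ beta) + (np.eye(ni) - W) @ yi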

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer series in statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y ∣ x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies, embedded in a white noise background: y[t] = β₁ + β₂ sin[ω₁t] + β₃ sin[ω₂t] + ε. We write β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (1, x₂, x₃), x₂ ≡ x₂(t) = sin[ω₁t], x₃(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y ∣ x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0, Eεᵢεₖ ≡ σ² > 0 for i = k ≤ n, and Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are yᵢ and the associated values of the independent variables x. That is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = the column vector {yᵢ, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose (n × 3) entries are xᵢₖ = xₖ[tᵢ].
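A short Python sketch of this design-matrix formulation (NumPy assumed; the frequencies, time points, and coefficients are hypothetical stand-ins for the fixed known quantities of the example):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
w1, w2 = 2.0, 5.0                       # two fixed, known frequencies (hypothetical)
X = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])  # design matrix
beta_true = np.array([0.5, 1.5, -1.0])
y = X @ beta_true + rng.normal(0, 0.3, t.size)   # white-noise errors

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares estimate of beta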

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where least squares has particular advantages (connections with multivariate normal distributions are discussed below). In that setup least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only about 40 days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data sy ∼ sw, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r sy/sw = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope (∼0.33) of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long-ago events reverberate today in our many statistical models, "driven", as we now proclaim, by random errors subject to ever-broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2004), Freedman (1991), Freedman (2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as y = xβ + ε, an abbreviated matrix formulation of the system of equations (1), in which the random errors ε are assumed to satisfy

Eεi ≡ 0,  Eεi² ≡ σ² > 0,  i ≤ n,

yi = xi1β1 + ⋯ + xipβp + εi,  i ≤ n.   (1)

The interpretation is that response yi is observed for the ith sample, in conjunction with numerical values (xi1, …, xip) of the independent variables. If these errors εi are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βk, k ≤ p, are produced by ordinary least squares (LS) as follows:

βML = βLS = (x^tr x)⁻¹ x^tr y = β + Mε,   (2)

for M = (x^tr x)⁻¹ x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)⁻¹ the matrix inverse. These coefficient estimates βLS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(βLS)k = βk,  k ≤ p,   (3)

and, among all unbiased estimators β*k (of βk) that are linear in y,

E((βLS)k − βk)² ≤ E(β*k − βk)²  for every k ≤ p.   (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then βLS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mk, k > 1, of M satisfies Mk 1 = 0 (i.e., Mk is a contrast), so (βLS)k = βk + Mkε and Mk(ε + c1) = Mkε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβLS) = (1 − R²) s²(y), where s(·) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβLS. Yet another algebraic identity, this one involving


an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(βLS − β), conditional on the order statistics of ε (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

βML = (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ y = β + (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ ε.   (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, analysis of variance, principal component analysis, and their consequent distribution theory.
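A minimal numerical sketch of the generalized least squares solution (5), assuming a hypothetical AR(1)-type error covariance Σ known up to a constant multiple:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.0])

# Hypothetical error covariance: Sigma_ij = rho^|i-j| (AR(1) structure).
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = x @ beta + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

# beta_ML = (x' Sigma^-1 x)^-1 x' Sigma^-1 y, as in (5).
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)
print(beta_gls)   # approximately (2.0, 1.0)
```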

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t),  t ∈ T,
m(t) = Ey(t) = Σi βi wi(t),  t ∈ T,

where the wi(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y | x) = Ey + E((y − Ey) | x) = Ey + xβ  for every x,

and the discrepancies y − E(y | x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (1981) compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εi are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation (∼0.33), was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > EX. If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > EX, define the difference J of indicator random variables, J = (X > c) − (Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < c E[J], and so

E[Y(X > c)] = E[Y(Y > c) + YJ] = E[X(X > c)] + E[YJ]
            < E[X(X > c)] + c E[J] = E[X(X > c)]

(note that E[J] = P(X > c) − P(Y > c) = 0, since X and Y are identically distributed), yielding E(Y | X > c) < E(X | X > c). ◻
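A small simulation illustrating the proposition for identically distributed, positively correlated scores (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 1_000_000, 0.5

# X, Y identically distributed (standard normal) and positively correlated.
X = rng.normal(size=n)
Y = r * X + np.sqrt(1 - r**2) * rng.normal(size=n)

c = 1.0                              # any threshold above EX = 0
sel = X > c
print(X[sel].mean(), Y[sel].mean())  # E(Y|X>c) < E(X|X>c): reversion
```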

Cross References
Adaptive Linear Regression
Analysis of Areal and Spatial Interaction Data
Business Forecasting Methods
Gauss-Markov Theorem
General Linear Models
Heteroscedasticity
Least Squares
Optimum Experimental Design
Partial Least Squares Regression Versus Other Methods
Regression Diagnostics
Regression Models with Increasing Numbers of Unknown Parameters
Ridge and Surrogate Ridge Regressions
Simple Linear Regression
Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat
Freedman D (1981) Bootstrapping regression models. Ann Stat


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F (1877) Typical laws of heredity. Nature
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, …, xn) is a sample from a stochastic process x = {x1, x2, …}. Let pn(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n), θn)/pn(x(n), θ)], where θn = θ + n^(−1/2) h and h is a (k × 1) vector. The joint density pn(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^(−1/2) h^t Sn(θ) − (1/2) n^(−1) h^t Jn(θ) h + op(1),   (1)

where Sn(θ) = d ln pn(x(n), θ)/dθ, Jn(θ) = −d² ln pn(x(n), θ)/(dθ dθ^t), and

(i) n^(−1/2) Sn(θ) →d Nk(0, F(θ)),  (ii) n^(−1) Jn(θ) →p F(θ),   (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal and efficient estimator of θ, with

√n (θ̂n − θ) →d Nk(0, F⁻¹(θ)).   (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jn^(1/2)(θ) to get the limiting normal distribution, viz.

Jn^(1/2)(θ) (θ̂n − θ) →d N(0, I),   (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose, conditionally on V = v, (x1, x2, …, xn) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n), θ) ∝ [1 + (1/2) Σ_{i=1}^{n} (xi − θ)²]^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n (x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process. Consider a first-order autoregressive process {xt} defined by xt = θ x_{t−1} + et, t = 1, 2, …, with x0 = 0, where the {et} are assumed to be iid N(0, 1) random variables. We then have

pn(x(n), θ) ∝ exp[−(1/2) Σ_{t=1}^{n} (xt − θ x_{t−1})²].

For the stationary case |θ| < 1 this model belongs to the LAN family. However, for |θ| > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂n = (Σ_{t=1}^{n} xt x_{t−1}) (Σ_{t=1}^{n} x²_{t−1})⁻¹.

One can verify that √n (θ̂n − θ) →d N(0, 1 − θ²) for |θ| < 1, and (θ² − 1)⁻¹ θⁿ (θ̂n − θ) →d Cauchy for |θ| > 1.
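A short simulation sketch of Example 2 in the stationary case, checking the N(0, 1 − θ²) limit for the ML estimator (parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1_mle(theta, n):
    """Simulate x_t = theta*x_{t-1} + e_t (x_0 = 0); return the ML estimator."""
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

theta, n = 0.5, 2000
est = np.array([ar1_mle(theta, n) for _ in range(500)])
# sqrt(n)(theta_hat - theta) should have variance close to 1 - theta^2 = 0.75.
print(np.var(np.sqrt(n) * (est - theta)))
```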

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
Asymptotic Normality
Optimal Statistical Inference in Financial Engineering
Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x | a, b) = Pr(X ≤ x | a, b) = F((x − a)/b),  a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family; (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

fX(x | a, b) = dFX(x | a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

fX(x | a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μY and standard deviation σY, then the mean of X is E(X) = a + b μY, and the standard deviation of X is √Var(X) = b σY.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (1975, 1976), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics Xi:n, X1:n ≤ X2:n ≤ ⋯ ≤ Xn:n, as regressand and the means of the reduced order statistics, αi:n = E(Yi:n), as regressor, which under these circumstances act as plotting positions. The regression model reads:

Xi:n = a + b αi:n + εi,

where εi is a random variable expressing the difference between Xi:n and its mean E(Xi:n) = a + b αi:n. As the order statistics Xi:n and – as a consequence – the disturbance terms εi are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X1:n, X2:n, …, Xn:n)′,   1 = (1, 1, …, 1)′,   α = (α1:n, α2:n, …, αn:n)′,

ε = (ε1, ε2, …, εn)′,   θ = (a, b)′,   A = (1  α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹ A′B⁻¹ x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yi:n) and E(Yi:n Yj:n). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
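For the standard exponential, for instance, the closed forms are E(Yi:n) = Σ_{k=1}^{i} 1/(n − k + 1) and Cov(Yi:n, Yj:n) = Σ_{k=1}^{min(i,j)} 1/(n − k + 1)², so Lloyd's GLS estimator can be sketched directly (illustrative code; the location and scale values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Reduced (standard exponential) order statistics: Y_{i:n} = sum_{k<=i} Z_k/(n-k+1).
w = 1.0 / (n - np.arange(n))       # 1/(n-k+1) for k = 1..n
alpha = np.cumsum(w)               # alpha_{i:n} = E(Y_{i:n})
v = np.cumsum(w ** 2)              # Var(Y_{i:n})
B = np.minimum.outer(v, v)         # Cov(Y_{i:n}, Y_{j:n}): variance of smaller index

# Ordered sample from a two-parameter exponential (hypothetical a and b).
a, b = 3.0, 2.0
x = np.sort(a + b * rng.exponential(size=n))

# Lloyd's estimator: theta_hat = (A'B^-1 A)^-1 A'B^-1 x, theta = (a, b)'.
A = np.column_stack([np.ones(n), alpha])
Bi = np.linalg.inv(B)
theta_hat = np.linalg.solve(A.T @ Bi @ A, A.T @ Bi @ x)
print(theta_hat)                   # approximately (3.0, 2.0)
```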

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and joined Volkswagen AG to do operations research, before being appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri | Pi ∼ Binomial(mi, Pi),
logit(Pi) | Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model, with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[Ri] = mi πi  and  Var(Ri) = mi πi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, the random effect Z allowing for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
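A simulation sketch of this overdispersion, comparing the empirical variance of logit-normal binomial counts with the small-σ approximation above (parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

m, sigma, eta = 20, 0.5, 0.0           # hypothetical size, random-effect sd, logit
n = 200_000

p = 1 / (1 + np.exp(-(eta + sigma * rng.normal(size=n))))
r = rng.binomial(m, p)                 # logit-normal binomial counts

pi = 1 / (1 + np.exp(-eta))            # here 0.5
approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx)                 # both well above the binomial value m*pi*(1-pi)
```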

About the Author
For biography see the entry Logistic Distribution.

Cross References
Logistic Distribution
Mixed Membership Models
Multivariate Normal Distributions
Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest, for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yi; πi) = πi^yi (1 − πi)^(1−yi).   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{(yi θi − b(θi))/α(ϕ) + c(yi; ϕ)},   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}.   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μi; yi) = Σ_{i=1}^{n} { yi ln(μi/(1 − μi)) + ln(1 − μi) },   (4)

or

L(μi; yi) = Σ_{i=1}^{n} { yi ln(μi) + (1 − yi) ln(1 − μi) }.   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model the deviance is expressed as

D = 2 Σ_{i=1}^{n} { yi ln(yi/μi) + (1 − yi) ln((1 − yi)/(1 − μi)) }.   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore,

x′iβ = ln(μi/(1 − μi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn.   (7)

The value of μi for each observation in the logistic model is calculated as

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ)).   (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^{n} (yi − μi) xi,   (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^{n} xi x′i μi(1 − μi).   (10)
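A compact sketch of the Newton–Raphson iteration built directly from the gradient (9) and Hessian (10), on simulated data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0, 0.7])
y = rng.binomial(1, 1 / (1 + np.exp(-x @ beta_true)))

beta = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-x @ beta))               # inverse link (8)
    grad = x.T @ (y - mu)                          # gradient (9)
    hess = -(x * (mu * (1 - mu))[:, None]).T @ x   # Hessian (10)
    step = np.linalg.solve(hess, grad)
    beta = beta - step                             # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)        # ML estimates; SEs come from sqrt(diag(-hess^-1))
```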

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

[Table: cross-tabulation of survival (no/yes) by age (child vs adult), with row and column totals.]

The odds of survival for adult passengers is the number of adults surviving divided by the number of adults not surviving; the odds of survival for children is computed in the same way. The ratio of the odds of survival for adults to the odds of survival for children is the odds ratio, the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

[Table: logistic regression output for survived on age, showing the estimated odds ratio, standard error, z statistic, p-value, and 95% confidence interval.]

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = C(mi, yi) πi^yi (1 − πi)^(mi − yi),   (11)

where C(mi, yi) denotes the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μi; yi, mi) = Σ_{i=1}^{n} { yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln C(mi, yi) }.   (12)

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = mi πi and variance μi(1 − μi/mi). Consider the data below:

[Table: grouped data with columns y, cases, x1, x2, x3.]

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of x1 = 1, x2 = 0, and x3 = 1. Of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology:

[Table: GLM logistic regression output for y on x1, x2, x3, on the odds-ratio scale with standard errors, z statistics, p-values, and confidence intervals.]

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many observations being grouped into a far smaller number of covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.
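This equivalence is easy to check with standard GLM software; a sketch using Python's statsmodels, with hypothetical grouped data (not the table above):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: y successes out of `cases` per covariate pattern.
y     = np.array([1, 3, 0, 2])
cases = np.array([3, 4, 1, 2])
X = sm.add_constant(np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float))

# Grouped (binomial) fit: the response is (successes, failures).
grouped = sm.GLM(np.column_stack([y, cases - y]), X,
                 family=sm.families.Binomial()).fit()

# The same data expanded to individual (Bernoulli) observation format.
rows = np.repeat(np.arange(4), cases)
y_ind = np.concatenate([[1] * s + [0] * (c - s) for s, c in zip(y, cases)])
individual = sm.GLM(y_ind, X[rows], family=sm.families.Binomial()).fit()

print(grouped.params)      # identical to individual.params
print(individual.params)
```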

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criteria (see Akaike's Information Criterion; Akaike's Information Criterion: Background, Derivation, Properties and Refinements) (AIC) and Bayesian Information Criteria (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
Case-Control Studies
Categorical Data Analysis
Generalized Linear Models
Multivariate Data Analysis: An Overview
Probit Analysis
Recursive Partitioning
Regression Models with Symmetrical Errors
Robust Regression Estimation in Generalized Linear Models
Statistics: An Overview
Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,  −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ) (t/θ)^(α−1) / [1 + (t/θ)^α],  t ≥ 0, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x) [1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = loge[F(x)/(1 − F(x))] = x,   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = loge[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = loge[P(d)/(1 − P(d))] = β1 + β2 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β1/β2 and τ = 1/|β2|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of generalized linear models. A comprehensive treatment of logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Bivariate Distributions
Location-Scale Distributions
Logistic Regression
Multivariate Statistical Distributions
Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

A unique Lorenz curve corresponds to every distribution. The contrary does not hold: every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.
Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini ().

[Fig. 1: Lorenz curves LX(p) and LY(p) with Lorenz ordering, that is, LX(p) ≥ LY(p).]

[Fig. 2: Two intersecting Lorenz curves, L1(p) and L2(p), compared via their Gini indices.]

This index is the ratio

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
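A short Python sketch (illustrative, with toy data) of how the empirical Lorenz curve and the Gini index are computed from a sample of incomes, using a trapezoidal approximation of the area under L(p):

```python
# Empirical Lorenz curve L(p) and Gini index G = 1 - 2 * area under L.
import numpy as np

def lorenz_gini(income):
    x = np.sort(np.asarray(income, dtype=float))
    p = np.arange(1, x.size + 1) / x.size      # cumulative population shares
    L = np.cumsum(x) / x.sum()                 # cumulative income shares
    pp = np.concatenate(([0.0], p))            # include the origin (0, 0)
    LL = np.concatenate(([0.0], L))
    area = np.sum((LL[1:] + LL[:-1]) / 2.0 * np.diff(pp))  # trapezoid rule
    return p, L, 1.0 - 2.0 * area

incomes = [10, 20, 20, 40, 110]                # skewed toy data
p, L, G = lorenz_gini(incomes)
print(G)   # 0 for perfect equality, approaching 1 for extreme inequality
```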

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function and assume that µY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the µY/µX level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.
A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
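The following toy Python illustration of the theorem uses hypothetical tax rules: a rule with u(x)/x decreasing lowers the Gini index, while a flat tax leaves it unchanged:

```python
# Effect of income transformations u(x) on the Gini index (toy illustration).
import numpy as np

def gini(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Closed form: G = 2*sum(i * x_(i)) / (n * sum(x)) - (n + 1)/n
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n

income = np.array([10.0, 20.0, 20.0, 40.0, 110.0])   # hypothetical incomes
progressive = income ** 0.8    # u(x) = x^0.8, so u(x)/x = x^-0.2 is decreasing
flat = 0.7 * income            # u(x) = 0.7x, so u(x)/x is constant

print(gini(income), gini(progressive), gini(flat))
# gini(progressive) < gini(income), while gini(flat) == gini(income)
```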

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) in , with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics (–) and has been a scientist at Folkhälsan Institute of Genetics since . He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
Econometrics
Economic Statistics
Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see Decision Theory: An Introduction, and Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.
Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.
To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision; thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.
Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁ describing the null hypothesis and the alternative, Θ = Θ₀ + Θ₁. Then the usual loss function is given by

L(θ, ϑ) = 0, if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
L(θ, ϑ) = 1, if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥², θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a

scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().
In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.
In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, ..., yₙ, observed at design points t₁, ..., tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes ∑ᵢ₌₁ⁿ ℓ(rᵢᶠ), f ∈ F, where rᵢᶠ = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
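As an illustration (not from the original entry), the Python sketch below implements several of the loss functions above and uses them as ℓ in the best-approximation problem for the simplest class F, the constant functions; all names and data are hypothetical:

```python
# Square, L1, Huber, and LinEx losses, each used to fit a constant to data.
import numpy as np
from scipy.optimize import minimize

def huber(t, gamma=1.0):
    # ell(t) = t^2 for |t| <= gamma, and 2*gamma*|t| - gamma^2 otherwise
    return np.where(np.abs(t) <= gamma, t**2, 2*gamma*np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):
    # asymmetric: one sign of t is penalized exponentially, the other linearly
    return b * (np.exp(a * t) - a * t - 1)

y = np.array([0.0, 0.1, 0.2, 0.3, 5.0])      # toy data with one outlier
losses = {"square": lambda t: t**2, "L1": np.abs, "Huber": huber, "LinEx": linex}
for name, ell in losses.items():
    res = minimize(lambda c: np.sum(ell(y - c[0])), x0=[0.1], method="Nelder-Mead")
    print(name, res.x[0])
# Square loss gives the mean (pulled by the outlier), L1 the median,
# Huber a compromise; LinEx judges over- and under-estimation asymmetrically.
```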

Cross References
Advantages of Bayesian Structuring: Estimating Ranks and Histograms
Bayesian Statistics
Decision Theory: An Introduction
Decision Theory: An Overview
Entropy and Cross Entropy as Diversity and Distance Measures
Sequential Sampling
Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



If the force of mortality is constant over a single-year age interval (x, x + 1), say, and is estimated by µ̂ₓ in this interval, then p̂ₓ = e^(−µ̂ₓ) is an estimator of the single-year survival probability pₓ. This allows us to estimate the survival function recursively for all corresponding ages, using ℓ(x + 1) = ℓ(x)p̂ₓ for x = 0, 1, ..., and the rest of the life table computations follow suit.
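A minimal Python sketch of this recursion, with hypothetical hazard values µ̂ₓ:

```python
# Survival-function recursion l(x+1) = l(x) * exp(-mu_hat_x) for a
# piecewise-constant force of mortality; the hazard values are hypothetical.
import numpy as np

mu_hat = np.array([0.01, 0.012, 0.015, 0.02, 0.03])  # single-year hazards
px = np.exp(-mu_hat)                                 # survival probabilities p_x

l = [1.0]                                            # radix l(0) = 1
for p in px:
    l.append(l[-1] * p)                              # l(x+1) = l(x) * p_x
print(np.round(l, 4))
```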

Life table construction consists in the estimation of the parameters and the tabulation of functions like those above from empirical data. The data can be for age at death for individuals, as in the example indicated above, but they can also be observations of duration until recovery from an illness, of intervals between births, of time until breakdown of some piece of machinery, or of any other positive duration variable.
So far we have argued as if the life table is computed for a group of mutually independent individuals who have all been observed in parallel, essentially a cohort that is followed from a significant common starting point (namely from birth, in our mortality example) and which is diminished over time due to decrements (attrition) caused by the risk in question, and also subject to reduction due to censoring (withdrawals). The corresponding table is then called a cohort life table. It is more common, however, to estimate a pₓ schedule from data collected for the members of a population during a limited time period, and to use the mechanics of life-table construction to produce a period life table from the pₓ values.
Life table techniques are described in detail in most

introductory textbooks in actuarial statistics, biostatistics, demography, and epidemiology; see, e.g., Chiang (), Elandt-Johnson and Johnson (), Manton and Stallard (), Preston et al. (). For the history of the topic, consult Seal (), Smith and Keyfitz (), and Dupâquier ().

About the Author
For biography, see the entry Demography.

Cross References
Demographic Analysis: A Stochastic Approach
Demography
Event History Analysis
Kaplan-Meier Estimator
Population Projections
Statistics: An Overview

References and Further Reading
Chiang CL () The life table and its applications. Krieger, Malabar
Dupâquier J () L'invention de la table de mortalité. Presses universitaires de France, Paris
Elandt-Johnson RC, Johnson NL () Survival models and data analysis. Wiley, New York
Forsén L () The efficiency of selected moment methods in Gompertz–Makeham graduation of mortality. Scand Actuarial J –
Manton KG, Stallard E () Recent trends in mortality analysis. Academic Press, Orlando
Preston SH, Heuveline P, Guillot M () Demography: measuring and modeling populations. Blackwell, Oxford
Seal H () Studies in history of probability and statistics: multiple decrements or competing risks. Biometrika ()–
Smith D, Keyfitz N () Mathematical demography: selected papers. Springer, Heidelberg

Likelihood

Nancy Reid
Professor
University of Toronto, Toronto, ON, Canada

Introduction
The likelihood function in a statistical model is proportional to the density function for the random variable to be observed in the model. Most often in applications of likelihood we have a parametric model f(y; θ), where the parameter θ is assumed to take values in a subset of Rᵏ and the variable y is assumed to take values in a subset of Rⁿ; the likelihood function is defined by

L(θ) = L(θ; y) = c f(y; θ), (1)

where c can depend on y but not on θ. In more general settings where the model is semi-parametric or non-parametric, the explicit definition is more difficult, because the density needs to be defined relative to a dominating measure, which may not exist; see Van der Vaart () and Murphy and Van der Vaart (). This article will consider only finite-dimensional parametric models.
Within the context of the given parametric model, the

likelihood function measures the relative plausibility of various values of θ for a given observed data point y. Values of the likelihood function are only meaningful relative to each other, and for this reason are sometimes standardized by the maximum value of the likelihood function, although other reference points might be of interest depending on the context.
If our model is f(y; θ) = \binom{n}{y} θ^y(1 − θ)^{n−y}, y = 0, 1, ..., n,

θ ∈ [0, 1], then the likelihood function is (any function proportional to)

L(θ; y) = θ^y(1 − θ)^{n−y},


and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²:

L(θ; y) = exp{−n log σ − (1/(2σ²)) Σ(yᵢ − µ)²},

where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y₁, ..., yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.
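For instance (an illustrative Python sketch, with arbitrary data), the binomial log-likelihood can be computed on a grid and standardized by its maximum at θ̂ = y/n:

```python
# Binomial log-likelihood for y successes out of n, standardized by its maximum.
import numpy as np

n, y = 10, 7
theta = np.linspace(0.01, 0.99, 99)
loglik = y * np.log(theta) + (n - y) * np.log(1 - theta)
rel = np.exp(loglik - loglik.max())   # relative likelihood L(theta)/L(theta_hat)

print(theta[np.argmax(loglik)])       # grid maximum, ~ y/n = 0.7
print(theta[rel > 0.15])              # one version of a "plausible" range
```

The last line anticipates the L(θ)/L(θ̂) > k rule discussed in the next section.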

Inference
The likelihood function was defined in a seminal paper of Fisher () and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule, such as L(θ)/L(θ̂) > k, to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that depends only on a component of θ that is of interest; see section "Nuisance Parameters" below.
In general we would wish to calibrate our inference

for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude

π(θ ∣ y) = L(θ; y)π(θ) / ∫ L(θ; y)π(θ) dθ,

where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed this many successes in this many trials, and assuming the flat prior π(θ) = 1, the posterior probability that θ exceeds a given value is such-and-such," and so on.
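As an illustrative sketch (the numbers are hypothetical, not from the text): with the flat prior π(θ) = 1 and a binomial likelihood, the posterior is the Beta(y + 1, n − y + 1) distribution, so such posterior probabilities are simple to compute:

```python
# Posterior for a binomial likelihood under a uniform prior: Beta(y+1, n-y+1).
from scipy.stats import beta

n, y = 10, 7                          # hypothetical data
posterior = beta(y + 1, n - y + 1)
print(posterior.sf(0.5))              # posterior probability that theta > 0.5
```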

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior, through the likelihood, to the posterior. A major difficulty with this approach is the choice of prior probability function. In some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.
Inference based on the likelihood function can also be

calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or, more usually, by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²_k random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see Central Limit Theorems) for the score function

U(θ; Y) = (∂/∂θ) log L(θ; Y).

If Y = (Y₁, ..., Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.
Assuming that θ̂ is consistent for θ and that L(θ; Y)

has sufficient regularity, the following asymptotic results can be established:

(θ̂ − θ)^T i(θ)(θ̂ − θ) →_d χ²_k, (2)

U(θ)^T i⁻¹(θ)U(θ) →_d χ²_k, (3)

2{ℓ(θ̂) − ℓ(θ)} →_d χ²_k, (4)

where i(θ) = E{−ℓ″(θ; Y)} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²_k is the chi-square distribution with k degrees of freedom.
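A small simulation sketch (illustrative, with arbitrary parameter values) of result (4) for Bernoulli samples, checking that the statistic 2{ℓ(θ̂) − ℓ(θ)} behaves like χ²₁:

```python
# Monte Carlo check of the chi-square(1) limit of the log-likelihood ratio.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
theta0, n, reps = 0.4, 200, 5000
stats = []
for _ in range(reps):
    y = rng.binomial(n, theta0)
    th = y / n                                # MLE
    if 0 < th < 1:                            # guard against boundary samples
        ll = lambda t: y * np.log(t) + (n - y) * np.log(1 - t)
        stats.append(2 * (ll(th) - ll(theta0)))

# Empirical exceedance of the 95% chi-square(1) quantile should be near 0.05.
print(np.mean(np.array(stats) > chi2.ppf(0.95, df=1)))
```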

These results are all versions of a more general result: the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see Multivariate Normal Distributions) under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically


normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.
Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ

is a k-dimensional parameter of interest (where k will often be 1). The profile log-likelihood function of ψ is

ℓP(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.
It can be verified, under suitable smoothness conditions,

that results similar to those at (2)–(4) hold as well for the profile log-likelihood function; in particular,

2{ℓP(ψ̂) − ℓP(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} →_d χ²_k.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
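An illustrative Python sketch of this point, assuming simulated data: the profile log-likelihood of σ², ℓP(σ²) = −(n/2) log σ² − RSS/(2σ²), is maximized at RSS/n rather than at the usual estimate with divisor n − p:

```python
# Profile log-likelihood of sigma^2 in a normal linear model (simulated data).
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2 = np.linspace(0.5, 6.0, 500)
lp = -0.5 * n * np.log(sigma2) - rss / (2 * sigma2)  # profile log-likelihood
print(sigma2[np.argmax(lp)], rss / n)                # both are ~ RSS/n
```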

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ:

πM(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian

inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ ∣ y₁; λ), or
f(y; θ) = f₁(y₁ ∣ y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (2)–(4); see, for example, Severini () and Brazzale et al. (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al. ); and many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan () and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During (–) she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada () and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–) and Metrika (–).

Cross References
Bayesian Analysis or Evidence Based Statistics
Bayesian Statistics
Bayesian Versus Frequentist Statistical Reasoning
Bayesian vs. Classical Point Estimation: A Comparative Overview
Chi-Square Test: Analysis of Contingency Tables
Empirical Likelihood Approach to Inference from Sample Survey Data
Estimation
Fiducial Inference
General Linear Models
Generalized Linear Models
Generalized Quasi-Likelihood (GQL) Inferences
Mixture Models
Philosophical Foundations of Statistics
Risk Analysis
Statistical Evidence
Statistical Inference
Statistical Inference: An Overview
Statistics: An Overview
Statistics: Nelder's View
Testing Variance Components in Mixed Linear Models
Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York. doi esspub. Accessed Aug
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml. Accessed Aug
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory
Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.
By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X, X₁, X₂, ... is a sequence of i.i.d. random variables,

S_n = ∑_{j=1}^n X_j,

and the expectation a = EX exists, then n⁻¹S_n → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
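A simulation sketch (illustrative values only) of the first and second order approximations for i.i.d. exponential summands, where a = σ = 1:

```python
# Monte Carlo illustration of the LLN and CLT for exponential summands.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps = 500, 10_000
X = rng.exponential(scale=1.0, size=(reps, n))   # i.i.d. with a = 1, sigma = 1
Sn = X.sum(axis=1)

print(np.mean(Sn / n))                 # LLN: close to a = 1
zeta = (Sn - n) / np.sqrt(n)           # standardized sums (a = sigma = 1)
print(np.mean(zeta < 1.0), norm.cdf(1.0))   # CLT: close to Phi(1)
```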

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}) (P^{−1} being the generalized inverse of the function P, and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The characteristic functions of these laws have simple closed-form representations.

Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see Ergodic Theorem) for a wide class of random sequences and processes.

Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and


asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

Considering observations X₁, ..., X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, ..., X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see Glivenko-Cantelli Theorems), stating that sup_u ∣F*_n(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) ∑_{j=1}^n X_j → a (a.s.),

(σ²)* = (1/n) ∑_{j=1}^n (X_j − a*)² = (1/n) ∑_{j=1}^n X_j² − (a*)² → σ² (a.s.).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in

estimation theory and hypotheses testing, one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.
It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.
Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question.
At least two possible approaches to this problem should be noted here.

Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then, in a number of cases, one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, queueing theory and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
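An illustrative Monte Carlo sketch of this limit (all parameter values are arbitrary, and the random walk must be truncated at a finite horizon):

```python
# Transient-phenomena check: for small theta, P(theta*S > x) ~ exp(-2x/sigma^2),
# with S = sup_k (S_k - theta*k) and EX = 0. Values are illustrative only.
import numpy as np

rng = np.random.default_rng(4)
theta, sigma2 = 0.05, 1.0
reps, horizon, x = 1000, 20_000, 1.0

hits = 0
for _ in range(reps):
    # random walk of X_k - theta, truncated at `horizon` steps
    path = np.cumsum(rng.normal(0.0, np.sqrt(sigma2), size=horizon) - theta)
    S = max(0.0, path.max())
    hits += (theta * S > x)

print(hits / reps, np.exp(-2 * x / sigma2))   # roughly comparable
```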

Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{µ(X−θ)} = 1 has a solution µ > 0, then, in the above example, one has

P(S > x) ∼ c e^{−µx}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{µ(X−θ)} = ∞ for any µ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x.
For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (random walk) S₁, ..., S_n can also be classified as LTs for stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k} crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.
In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies the asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov has been Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), since , and Head of the Probability and Statistics Chair at the Novosibirsk University since . He is Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences () and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
Almost Sure Convergence of Random Variables
Approximations to Distributions
Asymptotic Normality
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Central Limit Theorems
Empirical Processes
Ergodic Theorem
Glivenko-Cantelli Theorems
Large Deviations and Applications
Laws of Large Numbers
Martingale Central Limit Theorem
Probability Theory: An Outline
Random Matrix Theory
Strong Approximations in Probability and Statistics
Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one


or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i ∣ b_i ∼ N(X_i β + Z_i b_i, Σ_i), (1)
b_i ∼ N(0, D), (2)

where X_i and Z_i are (n_i × p) and (n_i × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_i β, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_i β and with covariance matrix V_i = Z_i D Z_i′ + Σ_i.
Note that the random effects in (1) implicitly imply the

marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_i D Z_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_i β, V_i), V_i = Z_i D Z_i′ + Σ_i,

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i, given the observed data y_i, can be shown to be (multivariate) normal with mean vector equal to D Z_i′ V_i^{−1}(α)(y_i − X_i β̂). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_i β̂ + Z_i b̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_i V_i^{−1} X_i β̂ + (I_{n_i} − Σ_i V_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_i β̂ and the observed data y_i, with weights Σ_i V_i^{−1} and I_{n_i} − Σ_i V_i^{−1}, respectively. Note that the "numerator" of Σ_i V_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_i β̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
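A small numerical sketch (hypothetical values) of the empirical Bayes calculation for a random-intercept model, where Z_i is a column of ones and Σ_i = σ²I:

```python
# Empirical Bayes estimate b_i = D Z_i' V_i^{-1} (y_i - X_i beta-hat)
# for a random-intercept model; all values are hypothetical.
import numpy as np

sigma2, d, n_i = 4.0, 2.0, 5                 # within- and between-unit variances
Zi = np.ones((n_i, 1))
Vi = d * (Zi @ Zi.T) + sigma2 * np.eye(n_i)  # V_i = Z_i D Z_i' + Sigma_i

Xb = np.full(n_i, 10.0)                      # population-averaged profile X_i beta-hat
yi = np.array([14.0, 13.0, 15.0, 12.0, 16.0])

bi = (d * Zi.T @ np.linalg.solve(Vi, yi - Xb)).item()
print(bi)   # shrunken toward 0

# For this model, the EB estimate equals the unit mean residual shrunken by
# the factor n_i*d / (n_i*d + sigma2):
print(n_i * d / (n_i * d + sigma2) * (yi - Xb).mean())   # same value
```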

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
Best Linear Unbiased Estimation in Linear Models
General Linear Models
Testing Variance Components in Mixed Linear Models
Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction."

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f, known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω₁, ω₂, embedded in white noise background: y[t] = β₁ + β₂ sin[ω₁t] + β₃ sin[ω₂t] + ε. We write y = β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (x₁, x₂, x₃), x₁ ≡ 1, x₂(t) = sin[ω₁t], x₃(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0, Eεᵢεₖ ≡ σ² > 0 for i = k ≤ n, and Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are yᵢ and associated values of the independent variables x. That is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector (yᵢ, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are xᵢₖ = xₖ[tᵢ].
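A small Python sketch of this setup, building the design matrix from the two sine regressors and fitting by least squares; the frequencies, coefficients, and noise level are hypothetical placeholders:

```python
import numpy as np

# Sketch of the design matrix for the superposed-sine model.
rng = np.random.default_rng(1)
w1, w2, beta = 0.3, 1.6, np.array([1.0, 2.0, -1.0])   # hypothetical values
t = np.linspace(0.0, 20.0, 200)

x = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])  # design matrix
y = x @ beta + rng.normal(scale=0.5, size=t.size)                       # y = x beta + eps

beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)  # least squares fit
print(beta_ls)  # close to beta
```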

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where ▶least squares has particular advantages (connections with ▶multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (▶Markov chain Monte Carlo having enabled their calculation); non-parametric methods (good performance relative to more relaxed assumptions about errors); and iteratively reweighted least squares (having under some conditions behavior like maximum likelihood estimators, without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r·s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and in time he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations (1), in which the random errors ε are assumed to satisfy

Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n,

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ, i ≤ n.  (1)

The interpretation is that response yᵢ is observed for the ith sample, in conjunction with numerical values (xᵢ₁, . . . , xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε,  (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂_LS)ₖ = βₖ, k ≤ p,  (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂_LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)² for every k ≤ p.  (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed), or more generally as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂_LS)ₖ = βₖ + Mₖε and Mₖ(ε + c1) = Mₖε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂_LS) = (1 − R²)s²(y), where s(⋅) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the ▶order statistics of ε (see LePage and Podgorski ). Freedman and Lane in () advocated tests based on permutation bootstrap of residuals as a descriptive method.
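A quick numerical check of the purely algebraic identity β̂_LS = β + Mε and of the contrast property of the rows of M; everything below is simulated, with arbitrary values:

```python
import numpy as np

# Numeric check of beta_LS = beta + M eps for an arbitrary eps (values hypothetical).
rng = np.random.default_rng(2)
n = 50
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
eps = rng.standard_t(df=3, size=n)      # need not be normal, or even "error"

M = np.linalg.inv(x.T @ x) @ x.T        # M = (x'x)^{-1} x'
beta_ls = M @ (x @ beta + eps)
print(np.allclose(beta_ls, beta + M @ eps))  # True: holds for any eps

# With a constant first column, each later row of M is a contrast: M_k 1 = 0.
print(np.allclose(M[1:] @ np.ones(n), 0.0))  # True
```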

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a positive constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹y = β + (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹ε.  (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with ▶generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, ▶analysis of variance, principal component analysis, and their consequent distribution theory.
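A minimal sketch of generalized least squares for a known Σ, computed by whitening with the Cholesky factor of Σ and then applying ordinary least squares, which is algebraically equivalent to (5); the covariance structure below is an arbitrary choice:

```python
import numpy as np

# GLS sketch with a known covariance (all values hypothetical).
rng = np.random.default_rng(3)
n = 40
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.5])
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1)-type
L = np.linalg.cholesky(Sigma)
y = x @ beta + L @ rng.normal(size=n)   # errors ~ N(0, Sigma)

# Whiten with L^{-1} and apply ordinary least squares: equivalent to (5).
xw = np.linalg.solve(L, x)
yw = np.linalg.solve(L, y)
beta_gls, *_ = np.linalg.lstsq(xw, yw, rcond=None)
print(beta_gls)
```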

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σᵢβᵢwᵢ(t), t ∈ T,

where the wᵢ(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xᵢ₁, . . . , xᵢₚ, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εᵢ are simply the residual discrepancies y − xβ̂_LS of a least squares fit of linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Bootstrap to each of the two models above but under dif-ferent assumptions His results for the sampling model area textbook application of Bootstrap since a description ofthe sampling theory of least squares estimates for the sam-pling model has complexities largely as had been said outof the way when the Bootstrap approach is used It wouldbe an interesting exercise to examine data such as Galtonrsquos

L Linear Regression Models

seed data analyzing it by the two dierent models obtain-ing condence intervals for the estimated coecients of astraight line t in each case to see how closely they agree
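The two bootstrap schemes can be sketched side by side: resampling residuals (natural under the errors model) versus resampling (x, y) pairs (natural under the sampling model). All data below are simulated, with arbitrary values:

```python
import numpy as np

# Residual bootstrap vs. pairs bootstrap on simulated straight-line data.
rng = np.random.default_rng(4)
n, B = 100, 2000
w = rng.normal(size=n)
y = 0.33 * w + rng.normal(size=n)
X = np.column_stack([np.ones(n), w])

def ls(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

b = ls(X, y)
resid = y - X @ b
slopes_err, slopes_pair = [], []
for _ in range(B):
    e_star = rng.choice(resid, size=n, replace=True)   # residual (errors-model) bootstrap
    slopes_err.append(ls(X, X @ b + e_star)[1])
    idx = rng.integers(0, n, size=n)                   # pairs (sampling-model) bootstrap
    slopes_pair.append(ls(X[idx], y[idx])[1])

print(np.percentile(slopes_err, [2.5, 97.5]))
print(np.percentile(slopes_pair, [2.5, 97.5]))  # similar here; they differ under model misfit
```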

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

EY·1(X > c) = E(Y·1(Y > c) + YJ) = EX·1(X > c) + E YJ
< EX·1(X > c) + cEJ = EX·1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻

Cross References
▶Adaptive Linear Regression
▶Analysis of Areal and Spatial Interaction Data
▶Business Forecasting Methods
▶Gauss–Markov Theorem
▶General Linear Models
▶Heteroscedasticity
▶Least Squares
▶Optimum Experimental Design
▶Partial Least Squares Regression Versus Other Methods
▶Regression Diagnostics
▶Regression Models with Increasing Numbers of Unknown Parameters
▶Ridge and Surrogate Ridge Regressions
▶Simple Linear Regression
▶Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropological Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, volume 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, . . . , xₙ) is a sample from a stochastic process x = {x₁, x₂, . . .}. Let pₙ(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n), θₙ)/pₙ(x(n), θ)], where θₙ = θ + n^(−1/2)h and h is a (k × 1) vector. The joint density pₙ(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^(−1/2) hᵗSₙ(θ) − (1/2) n⁻¹ hᵗJₙ(θ)h + oₚ(1),  (1)

where Sₙ(θ) = d ln pₙ(x(n), θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n), θ)/dθ dθᵗ, and

(i) n^(−1/2) Sₙ(θ) →_d Nₖ(0, F(θ)),  (ii) n⁻¹ Jₙ(θ) →_p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂ₙ − θ) →_d Nₖ(0, F⁻¹(θ)).  (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and ▶Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family, for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jₙ^(1/2)(θ) to get the limiting normal distribution, viz.,

Jₙ^(1/2)(θ)(θ̂ₙ − θ) →_d N(0, I),  (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals
Suppose, conditionally on V = v, (x₁, x₂, . . . , xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n), θ) ∝ [1 + (1/2) Σᵢ₌₁ⁿ (xᵢ − θ)²]^(−(n/2+1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →_d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process
Consider a first-order autoregressive process {xₜ} defined by xₜ = θxₜ₋₁ + eₜ, t = 1, 2, . . . , with x₀ = 0, where the {eₜ} are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n), θ) ∝ exp[−(1/2) Σₜ₌₁ⁿ (xₜ − θxₜ₋₁)²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂ₙ = (Σₜ₌₁ⁿ xₜxₜ₋₁)(Σₜ₌₁ⁿ xₜ₋₁²)⁻¹.

One can verify that √n(θ̂ₙ − θ) →_d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹θⁿ(θ̂ₙ − θ) →_d Cauchy for ∣θ∣ > 1.
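Both limit regimes can be checked by simulation; the following sketch normalizes the ML estimator as in the two displays above (sample size, replication count, and θ values are arbitrary choices):

```python
import numpy as np

# Simulation sketch of the two regimes for the AR(1) ML estimator.
rng = np.random.default_rng(6)

def ml_est(theta, n):
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + rng.normal()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

n, reps = 200, 4000
for theta in (0.5, 1.2):
    est = np.array([ml_est(theta, n) for _ in range(reps)])
    if abs(theta) < 1:
        z = np.sqrt(n) * (est - theta)                  # approx N(0, 1 - theta^2)
    else:
        z = theta**n * (est - theta) / (theta**2 - 1)   # approx Cauchy: heavy tails
    print(theta, np.percentile(z, [5, 50, 95]))
```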

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored many publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▶Asymptotic Normality
▶Optimal Statistical Inference in Financial Engineering
▶Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.
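Sampling from any member of a location-scale family therefore reduces to sampling the reduced variable Y and setting X = a + bY. A Python sketch using the reduced Laplace distribution via its quantile function (a and b are arbitrary illustrative values):

```python
import numpy as np

# Location-scale sampling: X = a + b * Y for a reduced variate Y.
rng = np.random.default_rng(7)
a, b = 3.0, 2.0
u = rng.uniform(size=100_000)
# Quantile function of the reduced (standard) Laplace distribution.
y = np.where(u < 0.5, np.log(2 * u), -np.log(2 * (1 - u)))
x = a + b * y
print(np.median(x), x.std())  # median = a; sd = b * sigma_Y = 2 * sqrt(2)
```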

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample data point on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics Xᵢ:ₙ, X₁:ₙ ≤ X₂:ₙ ≤ ⋯ ≤ Xₙ:ₙ, as regressand and the means of the reduced order statistics, αᵢ:ₙ = E(Yᵢ:ₙ), as regressor, which under these circumstances acts as plotting position. The regression model reads

Xᵢ:ₙ = a + b αᵢ:ₙ + εᵢ,

where εᵢ is a random variable expressing the difference between Xᵢ:ₙ and its mean E(Xᵢ:ₙ) = a + b αᵢ:ₙ. As the order statistics Xᵢ:ₙ and, as a consequence, the disturbance terms εᵢ are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the general least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X₁:ₙ, X₂:ₙ, . . . , Xₙ:ₙ)′,  1 = (1, 1, . . . , 1)′,  α = (α₁:ₙ, α₂:ₙ, . . . , αₙ:ₙ)′,
ε = (ε₁, ε₂, . . . , εₙ)′,  θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹A′B⁻¹x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yᵢ:ₙ) and E(Yᵢ:ₙYⱼ:ₙ). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
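For the exponential member of the family, α and B have simple closed forms, E(Yᵢ:ₙ) = Σₖ₌₁ⁱ 1/(n − k + 1) and Cov(Yᵢ:ₙ, Yⱼ:ₙ) = Σₖ₌₁^min(i,j) 1/(n − k + 1)², so Lloyd's estimator can be sketched directly; the sample size and parameter values below are arbitrary:

```python
import numpy as np

# Lloyd's GLS/BLUE from order statistics for the two-parameter exponential.
rng = np.random.default_rng(8)
n, a, b = 20, 5.0, 2.0
x = np.sort(a + b * rng.exponential(size=n))    # ordered sample

inv = 1.0 / (n - np.arange(1, n + 1) + 1.0)
alpha = np.cumsum(inv)                          # alpha_i = E(Y_{i:n})
v = np.cumsum(inv**2)
B = np.minimum.outer(v, v)                      # Cov(Y_{i:n}, Y_{j:n}) = v_{min(i,j)}

A = np.column_stack([np.ones(n), alpha])
Binv = np.linalg.inv(B)
theta = np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x)
print(theta)  # BLUEs of (a, b)
```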

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and then joined Volkswagen AG to do operations research. He was later appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
▶Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ▶beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses rᵢ out of mᵢ trials (i = 1, . . . , n), the response probabilities Pᵢ are assumed to have a logistic-normal distribution with logit(Pᵢ) = log(Pᵢ/(1 − Pᵢ)) ∼ N(μᵢ, σ²), where μᵢ is modelled as a linear function of explanatory variables x₁, . . . , xₚ. The resulting model can be summarized as

Rᵢ ∣ Pᵢ ∼ Binomial(mᵢ, Pᵢ),
logit(Pᵢ) ∣ Z = ηᵢ + σZ = β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[Rᵢ] = mᵢπᵢ and
Var(Rᵢ) = mᵢπᵢ(1 − πᵢ)[1 + σ²(mᵢ − 1)πᵢ(1 − πᵢ)],

where logit(πᵢ) = ηᵢ. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mᵢ = 1) it is not possible to have overdispersion arising in this way.
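A simulation sketch of this overdispersion, comparing the empirical variance of R with the plain binomial variance and with the Williams approximation above (η, σ, and m are arbitrary illustrative values):

```python
import numpy as np

# Binomial overdispersion under a logit-normal random effect.
rng = np.random.default_rng(9)
eta, sigma, m, reps = 0.0, 0.5, 20, 200_000

p = 1 / (1 + np.exp(-(eta + sigma * rng.normal(size=reps))))
r = rng.binomial(m, p)

pi = 1 / (1 + np.exp(-eta))
binom_var = m * pi * (1 - pi)
approx = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))  # Williams approximation
print(r.var(), binom_var, approx)  # empirical variance exceeds the binomial one
```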

About the Author
For biography see the entry ▶Logistic Distribution.

Cross References
▶Logistic Distribution
▶Mixed Membership Models
▶Multivariate Normal Distributions
▶Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 0/1, with 1 generally indicating a success and 0 a failure. However, the actual values that 0 and 1 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yᵢ; πᵢ) = πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ).  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ▶generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yᵢ; θᵢ, ϕ) = exp{(yᵢθᵢ − b(θᵢ))/α(ϕ) + c(yᵢ, ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y, ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yᵢ; πᵢ) = exp{yᵢ ln(πᵢ/(1 − πᵢ)) + ln(1 − πᵢ)}.  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μᵢ; yᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + ln(1 − μᵢ)},  (4)

or

L(μᵢ; yᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(μᵢ) + (1 − yᵢ) ln(1 − μᵢ)}.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σᵢ₌₁ⁿ {yᵢ ln(yᵢ/μᵢ) + (1 − yᵢ) ln((1 − yᵢ)/(1 − μᵢ))}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μᵢ/(1 − μᵢ)). We have, therefore,

x′ᵢβ = ln(μᵢ/(1 − μᵢ)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ.  (7)

The value of μᵢ for each observation in the logistic model is calculated as

μᵢ = 1/(1 + exp(−x′ᵢβ)) = exp(x′ᵢβ)/(1 + exp(x′ᵢβ)).  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized to x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σᵢ₌₁ⁿ (yᵢ − μᵢ)xᵢ,  (9)

∂²L(β)/∂β∂β′ = −Σᵢ₌₁ⁿ xᵢx′ᵢ μᵢ(1 − μᵢ).  (10)
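A minimal Newton–Raphson sketch built directly from the gradient (9) and Hessian (10), using simulated data (all values hypothetical):

```python
import numpy as np

# Newton-Raphson for binary logistic regression from (9) and (10).
rng = np.random.default_rng(10)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - mu)                          # gradient (9)
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X   # Hessian (10)
    step = np.linalg.solve(hess, grad)
    beta -= step                                   # beta_new = beta - H^{-1} g
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)  # ML estimates; exp(beta[1]) is an odds ratio
```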

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived    child    adults    Total
no             52       765      817
yes            57       442      499
Total         109      1207     1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age             0.5271

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

"The odds of an adult surviving were about half the odds of a child surviving."
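The arithmetic behind the estimate is just the ratio of the two component odds, as in the following sketch based on the counts tabulated above:

```python
# Odds-ratio computation from the 2x2 tabulation above.
surv_adult, died_adult = 442, 765
surv_child, died_child = 57, 52

odds_adult = surv_adult / died_adult     # ~0.578
odds_child = surv_child / died_child     # ~1.096
print(odds_adult / odds_child)           # ~0.53: adults had about half the odds
```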

By inverting the estimated odds ratio above, we may conclude that children had (∼1.9) some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data, and the odds ratio was calculated as 1.015, we would interpret the relationship as:

"The odds of surviving is one and a half percent greater for each increasing year of age."

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yᵢ; πᵢ, mᵢ) = C(mᵢ, yᵢ) πᵢ^yᵢ (1 − πᵢ)^(mᵢ−yᵢ),  (11)

where C(mᵢ, yᵢ) is the binomial coefficient,

with a corresponding log-likelihood function expressed as

L(μᵢ; yᵢ, mᵢ) = Σᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + mᵢ ln(1 − μᵢ) + ln C(mᵢ, yᵢ)}.  (12)

Taking derivatives of the cumulant, −mᵢ ln(1 − μᵢ), as we did for the binary response model, produces a mean of μᵢ = mᵢπᵢ and variance μᵢ(1 − μᵢ/mᵢ).

Consider grouped data arranged in columns y, cases, x1, x2, x3, where y indicates the number of times a specific pattern of covariates is successful and cases is the number of observations having that specific covariate pattern. The first observation of such a table might inform us that there are three cases having a particular pattern of values for x1, x2, and x3; of those three cases, one has a value of y equal to 1, and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.
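A sketch of the grouped-to-individual restructuring just described; the covariate patterns and counts are hypothetical, chosen to total ten observations:

```python
import numpy as np

# Expand grouped (y successes out of `cases`) records into Bernoulli rows;
# fitting either format gives identical parameter estimates.
grouped = [  # (y, cases, x1, x2, x3) -- hypothetical covariate patterns
    (1, 3, 1, 0, 1),
    (2, 4, 0, 1, 0),
    (3, 3, 1, 1, 0),
]
rows = []
for y, cases, *xs in grouped:
    rows += [(1, *xs)] * y + [(0, *xs)] * (cases - y)
data = np.array(rows)   # column 0: binary response; columns 1..3: predictors
print(data)             # ten individual observations
```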

The software output for such a fit lists, for each predictor x1, x2, and x3, an odds ratio together with its OIM standard error, z statistic, p-value, and confidence interval.

The data in such a table may be restructured so that it is in individual observation format rather than grouped; with ten cases in all, the new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, many thousands of observations being grouped into only a handful of covariate-pattern rows, as above described. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see ▶Akaike's Information Criterion and ▶Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▶Case-Control Studies
▶Categorical Data Analysis
▶Generalized Linear Models
▶Multivariate Data Analysis: An Overview
▶Probit Analysis
▶Recursive Partitioning
▶Regression Models with Symmetrical Errors
▶Robust Regression Estimation in Generalized Linear Models
▶Statistics: An Overview
▶Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.
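These moment statements are easy to verify numerically; a quick check using scipy's logistic distribution, whose scale parameter plays the role of τ:

```python
import math
from scipy.stats import logistic, norm

# Check variance tau^2 * pi^2 / 3 and kurtosis 4.2 for the standard logistic.
m, v, s, k = logistic.stats(loc=0, scale=1, moments='mvsk')
print(float(v), math.pi**2 / 3)             # variance = pi^2 / 3
print(float(k) + 3)                         # kurtosis = 1.2 + 3 = 4.2
print(float(norm.stats(moments='k')) + 3)   # 3.0 for the normal
```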

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,  −∞ < x < ∞,

h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]⁻¹.

e hazard function has the same logistic form and ismonotonically increasing so themodel is only appropriatefor ageing systemswith an increasing failure rate over timeIn modelling the dependence of failure times on explana-tory variables if we use a linear regression model for microthen the tted model has an accelerated failure time inter-pretation for the eect of the variables Fitting of thismodelto right- and le-censored data is described in Aitkin et al()One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is
$$h(t) = \frac{(\alpha/\theta)\,(t/\theta)^{\alpha-1}}{1+(t/\theta)^{\alpha}}, \qquad t \ge 0,\ \theta > 0,\ \alpha > 0,$$

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (µ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity
$$f(x) = F(x)\,[1 - F(x)],$$
which in turn, by elementary calculus, implies that
$$\operatorname{logit}(F(x)) = \log_e\left[\frac{F(x)}{1-F(x)}\right] = x, \qquad (2)$$
and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + µ.
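A minimal sketch of this simulation recipe in Python (the NumPy dependency and function name are ours, not part of the original article); the variance check against τ²π²/3 is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def rlogistic(n, mu=0.0, tau=1.0):
    """Simulate n draws from the logistic(mu, tau) distribution
    via the inverse-CDF identity X = log(U / (1 - U))."""
    u = rng.uniform(size=n)
    x = np.log(u / (1.0 - u))   # standard logistic
    return tau * x + mu         # general location-scale case

sample = rlogistic(100_000, mu=2.0, tau=1.5)
print(sample.mean())                          # close to mu = 2.0
print(sample.var(), 1.5**2 * np.pi**2 / 3)    # close to tau^2 * pi^2 / 3
```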

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response P(d) at a particular dose level d is modelled by a linear logit model:
$$\operatorname{logit}(P(d)) = \log_e\left[\frac{P(d)}{1-P(d)}\right] = \beta_0 + \beta_1 d,$$
which, by the identity (2), implies a logistic tolerance distribution with parameters µ = −β₀/β₁ and τ = 1/|β₁|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().
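A minimal sketch of such a fit, assuming Python with statsmodels; the dose levels and r-out-of-n counts below are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical quantal bio-assay data: r responders of n subjects per dose d.
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r = np.array([2, 6, 11, 16, 19])
n = np.array([20, 20, 20, 20, 20])

X = sm.add_constant(d)                      # columns: 1, d
fit = sm.GLM(np.column_stack([r, n - r]),   # (successes, failures)
             X, family=sm.families.Binomial()).fit()
b0, b1 = fit.params
print(fit.summary())
# Implied logistic tolerance distribution: mu = -b0/b1, tau = 1/|b1|.
print(-b0 / b1, 1.0 / abs(b1))
```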

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association of Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less the inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with Lorenz ordering.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912).

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L2(p) has greater inequality than L1(p)

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
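A minimal sketch, in Python with NumPy, of computing an empirical Lorenz curve and the Gini index from a small invented income sample (the function name is ours):

```python
import numpy as np

def lorenz_gini(incomes):
    """Empirical Lorenz curve and Gini index for a sample of incomes."""
    x = np.sort(np.asarray(incomes, dtype=float))   # poorest to richest
    n = x.size
    p = np.arange(0, n + 1) / n                     # population shares, 0 to 1
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))  # L(0)=0, L(1)=1
    # Area under the Lorenz curve by the trapezoidal rule; the Gini index is
    # (area between diagonal and curve) / (area under diagonal) = 1 - 2*area.
    area = np.sum((L[1:] + L[:-1]) / 2) / n
    return p, L, 1.0 - 2.0 * area

p, L, G = lorenz_gini([10, 20, 30, 40, 100])
print(G)   # 0 for perfect equality; larger G means greater inequality
```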

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that µY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level µY/µX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
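The effect of the theorem can be illustrated numerically. In the sketch below (Python/NumPy, with invented incomes and hypothetical tax rules), a flat tax leaves the Gini index unchanged, while a rule with u(x)/x monotone decreasing lowers it:

```python
import numpy as np

def gini(incomes):
    """Gini index via the mean-absolute-difference formula."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    return np.abs(x[:, None] - x[None, :]).sum() / (2.0 * n * n * x.mean())

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])   # hypothetical pre-tax incomes
print(gini(x))            # pre-tax inequality
print(gini(0.7 * x))      # flat tax u(x) = 0.7x: u(x)/x constant, Gini unchanged
print(gini(x ** 0.8))     # progressive rule u(x) = x^0.8: u(x)/x decreasing, Gini falls
```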

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations," and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica –
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ –
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered because the decision is unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,
$$L : \Theta \times \Theta \to [0, \infty).$$
The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property
$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative, Θ = Θ₀ + Θ₁. Then the usual loss function is given by
$$L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0,\ \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1,\ \vartheta \in \Theta_0. \end{cases}$$

For point estimation problems we assume that Θ is a normed linear space, and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,
$$\ell(t) = t^{2} \ \text{ if } |t| \le \gamma, \qquad \ell(t) = 2\gamma|t| - \gamma^{2} \ \text{ if } |t| > \gamma,$$

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses,
$$\ell(t) = b\left(\exp(at) - at - 1\right),$$
where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimation for the unknown regression function by the "best approximation," i.e., the function in F that minimizes $\sum_{i=1}^{n} \ell\left(r_i^f\right)$, f ∈ F, where $r_i^f = y_i - f(t_i)$ is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².
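A minimal sketch of the loss functions named above (square, L1, Huber, LinEx) in Python with NumPy; the parameter defaults are illustrative only:

```python
import numpy as np

def squared(t):
    return t ** 2                              # classical square loss, p = 2

def l1(t):
    return np.abs(t)                           # robust L1 loss, p = 1

def huber(t, gamma=1.0):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t ** 2,                            # quadratic near zero
                    2 * gamma * np.abs(t) - gamma ** 2)  # linear in the tails

def linex(t, a=1.0, b=1.0):
    return b * (np.exp(a * t) - a * t - 1)     # asymmetric: exponential vs. linear

t = np.linspace(-3, 3, 7)
print(squared(t), l1(t), huber(t), linex(t), sep="\n")
```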

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J Savage. North-Holland, Amsterdam, pp –


Likelihood

and can be plotted as a function of θ for any fixed value of y. The likelihood function is maximized at θ̂ = y/n. This model might be appropriate for a sampling scheme which recorded the number of successes among n independent trials that result in success or failure, each trial having the same probability of success θ. Another example is the likelihood function for the mean and variance parameters when sampling from a normal distribution with mean µ and variance σ²,
$$L(\theta; y) = \exp\left\{-n \log \sigma - \frac{1}{2\sigma^{2}} \sum (y_i - \mu)^{2}\right\},$$
where θ = (µ, σ²). This could also be plotted as a function of µ and σ² for a given sample y₁, …, yₙ, and it is not difficult to show that this likelihood function depends on the sample only through the sample mean ȳ = n⁻¹Σyᵢ and sample variance s² = (n − 1)⁻¹Σ(yᵢ − ȳ)², or equivalently through Σyᵢ and Σyᵢ². It is a general property of likelihood functions that they depend on the data only through the minimal sufficient statistic.

Inference
The likelihood function was defined in a seminal paper of Fisher (1922) and has since become the basis for most methods of statistical inference. One version of likelihood inference, suggested by Fisher, is to use some rule such as L(θ̂)/L(θ) > k to define a range of "likely" or "plausible" values of θ. Many authors, including Royall () and Edwards (), have promoted the use of plots of the likelihood function, along with interval estimates of plausible values. This approach is somewhat limited, however, as it requires that θ have dimension 1, or possibly 2, or that a likelihood function can be constructed that only depends on a component of θ that is of interest; see section 7Nuisance Parameters below.

In general we would wish to calibrate our inference for θ by referring to the probabilistic properties of the inferential method. One way to do this is to introduce a probability measure on the unknown parameter θ, typically called a prior distribution, and use Bayes' rule for conditional probabilities to conclude
$$\pi(\theta \mid y) = \frac{L(\theta; y)\,\pi(\theta)}{\int L(\theta; y)\,\pi(\theta)\,d\theta},$$
where π(θ) is the density for the prior measure and π(θ ∣ y) provides a probabilistic assessment of θ after observing Y = y in the model f(y; θ). We could then make conclusions of the form "having observed … successes in … trials, and assuming π(θ) = 1, the posterior probability that θ > … is …," and so on.
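For the binomial example with a uniform prior π(θ) = 1, the posterior is Beta(y + 1, n − y + 1), so such statements are easy to compute. A minimal sketch assuming SciPy, with invented counts and threshold:

```python
from scipy.stats import beta

y, n = 7, 10                          # hypothetical: y successes in n trials
posterior = beta(y + 1, n - y + 1)    # uniform prior => Beta posterior
print(posterior.sf(0.5))              # posterior probability that theta > 0.5
```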

This is a very brief description of Bayesian inference, in which probability statements refer to those generated from the prior through the likelihood to the posterior. A major difficulty with this approach is the choice of prior probability function: in some applications there may be an accumulation of previous data that can be incorporated into a probability distribution, but in general there is not, and some rather ad hoc choices are often made. Another difficulty is the meaning to be attached to probability statements about the parameter.

Inference based on the likelihood function can also be calibrated with reference to the probability model f(y; θ), by examining the distribution of L(θ; Y) as a random function, or more usually by examining the distribution of various derived quantities. This is the basis for likelihood inference from a frequentist point of view. In particular, it can be shown that 2 log{L(θ̂; Y)/L(θ; Y)}, where θ̂ = θ̂(Y) is the value of θ at which L(θ; Y) is maximized, is approximately distributed as a χ²ₖ random variable, where k is the dimension of θ. To make this precise requires an asymptotic theory for likelihood, which is based on a central limit theorem (see 7Central Limit Theorems) for the score function
$$U(\theta; Y) = \frac{\partial}{\partial \theta} \log L(\theta; Y).$$

If Y = (Y₁, …, Yₙ) has independent components, then U(θ) is a sum of n independent components, which under mild regularity conditions will be asymptotically normal. To obtain the χ² result quoted above, it is also necessary to investigate the convergence of θ̂ to the true value governing the model f(y; θ). Showing this convergence, usually in probability but sometimes almost surely, can be difficult; see Scholz () for a summary of some of the issues that arise.

Assuming that θ̂ is consistent for θ, and that L(θ; Y) has sufficient regularity, the following asymptotic results can be established:
$$(\hat\theta - \theta)^{T} i(\theta) (\hat\theta - \theta) \xrightarrow{d} \chi^{2}_{k}, \qquad (1)$$
$$U(\theta)^{T} i^{-1}(\theta)\, U(\theta) \xrightarrow{d} \chi^{2}_{k}, \qquad (2)$$
$$2\{\ell(\hat\theta) - \ell(\theta)\} \xrightarrow{d} \chi^{2}_{k}, \qquad (3)$$
where i(θ) = E{−ℓ″(θ; Y); θ} is the expected Fisher information function, ℓ(θ) = log L(θ) is the log-likelihood function, and χ²ₖ is the 7chi-square distribution with k degrees of freedom.

These results are all versions of a more general result that the log-likelihood function converges to the quadratic form corresponding to a multivariate normal distribution (see 7Multivariate Normal Distributions), under suitably stated limiting conditions. There is a similar asymptotic result showing that the posterior density is asymptotically normal, and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., have integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂, or the χ²ₖ distribution of the log-likelihood ratio, doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.

Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k-dimensional parameter of interest (which will often be 1). The profile log-likelihood function of ψ is
$$\ell_P(\psi) = \ell(\psi, \hat\lambda_\psi),$$
where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.

It can be verified, under suitable smoothness conditions, that results similar to those at (1)–(3) hold as well for the profile log-likelihood function; in particular,
$$2\{\ell_P(\hat\psi) - \ell_P(\psi)\} = 2\{\ell(\hat\psi, \hat\lambda) - \ell(\psi, \hat\lambda_\psi)\} \xrightarrow{d} \chi^{2}_{k}.$$
This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see 7Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(yᵢ − ŷᵢ)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
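A minimal simulation sketch (Python/NumPy, invented data) contrasting the profile (maximum likelihood) estimator of σ², with divisor n, against the usual estimator with divisor n − p:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_profile = resid @ resid / n          # profile-likelihood (ML) estimator
sigma2_unbiased = resid @ resid / (n - p)   # usual estimator, divisor n - p
print(sigma2_profile, sigma2_unbiased)      # true value is 4.0
```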

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference, such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,
$$\pi_M(\psi \mid y) \propto \int \pi(\psi, \lambda \mid y)\, d\lambda,$$
but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:
$$f(y; \theta) = f_1(y_1; \psi)\, f_2(y_2 \mid y_1; \lambda) \quad \text{or} \quad f(y; \theta) = f_1(y_1 \mid y_2; \psi)\, f_2(y_2; \lambda).$$
A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of (1)–(3); see, for example, Severini () and Brazzale et al (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (1972). But many other extensions have been suggested: to empirical likelihood (Owen 1988), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh 1983), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al ); and many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Yᵢ, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

7 From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century.


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan () and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada. She also served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation, "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's view
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B –
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P (1983) Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, X₃, … is a sequence of i.i.d. random variables,
$$S_n = \sum_{j=1}^{n} X_j,$$
and the expectation a = EX₁ exists, then n⁻¹Sₙ → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums Sₙ. The Central Limit Theorem (CLT) gives one a more precise approximation for Sₙ. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζₙ = (Sₙ − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,
$$P(\zeta_n < x) \to \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt.$$
The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sₙ.
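A minimal Monte Carlo sketch of the CLT approximation (Python/NumPy; the exponential summands and sample sizes are our own illustrative choices):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(123)

# Exponential(1) summands: a = 1, sigma = 1.
n, reps = 200, 100_000
S = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
zeta = (S - n * 1.0) / (1.0 * np.sqrt(n))    # standardized sums

x = 1.0
Phi = 0.5 * (1 + erf(x / sqrt(2)))           # standard normal cdf at x
print((zeta < x).mean(), Phi)                # empirical vs. limiting probability
```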

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1600s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and the applicability range of the CLT were achieved by P.L. Chebyshev and A.M. Lyapunov.

The main directions in which the two aforementioned main LTs have been extended and refined since then are:

- Relaxing the assumption EX₁² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X₁ > x) + P(X₁ < −x) is a function regularly varying at infinity, such that the limit lim_{x→∞} P(X₁ > x)/P(x) exists. Then the distribution of the normalized sum Sₙ/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX₁ = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.
- Relaxing the assumption that the Xⱼ's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xⱼ = X_{j,n} forming the sum Sₙ depend not only on j but on n as well. In this case the class of all limit laws for the distribution of Sₙ (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.
- Relaxing the assumption of independence of the Xⱼ's. Several types of "weak dependence" assumptions on the Xⱼ under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.
- Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζₙ < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.
- Studying large deviation probabilities for the sums Sₙ (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζₙ > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that
  $$\frac{P(\zeta_n > x)}{P(x, n)} \to 1 \quad \text{as } n \to \infty,\ x \to \infty.$$
  The nature of the function P(x, n) essentially depends on the rate of decay of P(X₁ > x) as x → ∞, and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.
- Considering observations X₁, …, Xₙ of a more complex nature – first of all, multivariate random vectors. If Xⱼ ∈ ℝ^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X₁ − EX₁)(X₁ − EX₁)ᵀ.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xₙ) be a sample from a distribution F, and F*ₙ(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*ₙ(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*ₙ of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX₁ and σ² = E(X₁ − a)² also follows from the LLN since, as n → ∞,
$$a^{*} = \frac{1}{n}\sum_{j=1}^{n} X_j \to a, \qquad (\sigma^{2})^{*} = \frac{1}{n}\sum_{j=1}^{n} (X_j - a^{*})^{2} = \frac{1}{n}\sum_{j=1}^{n} X_j^{2} - (a^{*})^{2} \to \sigma^{2}$$
almost surely. Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study, and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX₁ = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X₁) < ∞, then there exists the limit
$$\lim_{\theta \downarrow 0} P(\theta S > x) = e^{-2x/\sigma^{2}}, \qquad x > 0.$$
This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation Ee^{µ(X₁−θ)} = 1 has a solution µ₀ > 0, then in the above example one has
$$P(S > x) \sim c\, e^{-\mu_0 x}, \qquad x \to \infty,$$
where c is a known constant. If the distribution F of X₁ is subexponential (in this case Ee^{µ(X₁−θ)} = ∞ for any µ > 0), then
$$P(S > x) \sim \frac{1}{\theta} \int_{x}^{\infty} (1 - F(t))\, dt, \qquad x \to \infty.$$
This LT enables one to find approximations for P(S > x) for large x.

For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζₙ(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sₙ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k} crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*ₙ(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*ₙ(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is Councillor of the Russian Academy of Sciences and full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:
$$y_i \mid b_i \sim N(X_i\beta + Z_i b_i, \Sigma_i), \qquad (1)$$
$$b_i \sim N(0, D), \qquad (2)$$
where Xᵢ and Zᵢ are (nᵢ × p)- and (nᵢ × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σᵢ is an (nᵢ × nᵢ) covariance matrix which depends on i only through its dimension nᵢ, i.e., the set of unknown parameters in Σᵢ will not depend upon i. Finally, bᵢ is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yᵢ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bᵢ) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bᵢ) = 0 implies that the mean of yᵢ still equals Xᵢβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally: not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yᵢ also follows a normal distribution with mean vector Xᵢβ and with covariance matrix Vᵢ = ZᵢDZᵢ′ + Σᵢ.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vᵢ of yᵢ to be of the very specific form Vᵢ = ZᵢDZᵢ′ + Σᵢ. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σᵢ = σ²Iₙᵢ. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zᵢ which are nᵢ-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is
$$y_i \sim N(X_i\beta, V_i), \qquad V_i = Z_i D Z_i' + \Sigma_i,$$
which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vᵢ.

With respect to the estimation of the unit-specific parameters bᵢ, the posterior distribution of bᵢ given the observed data yᵢ can be shown to be (multivariate) normal with mean vector equal to DZᵢ′Vᵢ⁻¹(α)(yᵢ − Xᵢβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂ᵢ for the bᵢ. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷᵢ ≡ Xᵢβ̂ + Zᵢb̂ᵢ of the ith profile. It can easily be shown that
$$\hat{y}_i = \Sigma_i V_i^{-1} X_i \hat\beta + \left(I_{n_i} - \Sigma_i V_i^{-1}\right) y_i,$$
which can be interpreted as a weighted average of the population-averaged profile Xᵢβ̂ and the observed data yᵢ, with weights ΣᵢVᵢ⁻¹ and Iₙᵢ − ΣᵢVᵢ⁻¹, respectively. Note that the "numerator" of ΣᵢVᵢ⁻¹ represents the within-unit variability and the "denominator" is the overall covariance matrix Vᵢ. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xᵢβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than is actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,
$$\operatorname{var}(\lambda' \hat{b}_i) \le \operatorname{var}(\lambda' b_i) = \lambda' D \lambda.$$
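A minimal sketch, in Python with NumPy, of the random-intercept case: it builds Vᵢ = ZᵢDZᵢ′ + Σᵢ, forms the empirical Bayes estimate DZᵢ′Vᵢ⁻¹(yᵢ − Xᵢβ) and the shrunken prediction. Here β, D and σ² are treated as known; in practice they would be replaced by (restricted) maximum likelihood estimates:

```python
import numpy as np

# Random-intercept model for one unit: Z_i is a column of ones,
# Sigma_i = sigma2 * I, and D = d (a scalar).
ni, sigma2, d = 5, 1.0, 4.0
Xi = np.column_stack([np.ones(ni), np.arange(ni)])
beta = np.array([1.0, 0.5])
Zi = np.ones((ni, 1))

rng = np.random.default_rng(7)
yi = (Xi @ beta
      + Zi @ rng.normal(scale=np.sqrt(d), size=1)       # random intercept
      + rng.normal(scale=np.sqrt(sigma2), size=ni))     # measurement error

Vi = Zi * d @ Zi.T + sigma2 * np.eye(ni)                 # V = Z D Z' + Sigma
bi_hat = d * Zi.T @ np.linalg.solve(Vi, yi - Xi @ beta)  # EB estimate D Z' V^{-1}(y - X beta)
yhat = Xi @ beta + Zi @ bi_hat                           # shrunken prediction
print(bi_hat, yhat)
```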

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies ω₁ and ω₂, embedded in a white noise background: y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (1, x₁, x₂), x₁ ≡ x₁(t) = sin[ω₁t], x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0; Eεᵢεₖ ≡ σ² > 0 if i = k ≤ n; Eεᵢεₖ ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated and that correlation depends upon the x values). What we observe are the yᵢ and associated values of the independent variables x. That is, we observe (yᵢ, sin[ω₁tᵢ], sin[ω₂tᵢ]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector (yᵢ, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[tᵢ].
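
The design matrix construction just described is easy to make concrete. The following sketch uses hypothetical frequencies ω₁ = 2 and ω₂ = 3 and invented coefficient values; it builds the matrix x for the two-frequency signal model and recovers the amplitudes by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 10.0, n)
w1, w2 = 2.0, 3.0                      # hypothetical known frequencies
beta = np.array([1.0, 0.5, -0.7])      # (beta0, beta1, beta2), unknown in practice

# design matrix: constant column plus the two known sinusoids
X = np.column_stack([np.ones(n), np.sin(w1 * t), np.sin(w2 * t)])
y = X @ beta + rng.normal(0.0, 0.3, n)  # white-noise background

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)   # close to (1.0, 0.5, -0.7)
```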

Terminology
The term independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as the Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov chain Monte Carlo having enabled their calculation); non-parametric methods (good performance relative to more relaxed assumptions about errors); and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following its discovery only weeks before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w ∼ r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ) random residuals whose variance θ > 0 was the same for all w. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long-ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we now have an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x; providing margins of error (and prediction error) for various quantities being estimated or predicted; tests of hypotheses; and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust, and can agree upon, the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

yᵢ = x_{i1}β₁ + ⋯ + x_{ip}β_p + εᵢ, i ≤ n, (1)

in which the random errors ε are assumed to satisfy

Eεᵢ ≡ 0, Eε²ᵢ ≡ σ² > 0, i ≤ n.

The interpretation is that response yᵢ is observed for the ith sample, in conjunction with numerical values (x_{i1}, ..., x_{ip}) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂_ML = β̂_LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε, (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂_LS are linear functions in y and satisfy the Gauss–Markov properties (3)–(4):

E(β̂_LS)_k = β_k, k ≤ p, (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β̂_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3)–(4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k1 = 0 (i.e., M_k is a contrast), so (β̂_LS)_k = β_k + M_kε and M_k(ε + c1) = M_kε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s(y − xβ̂_LS) = √(1 − R²) s(y), where s(⋅) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on the permutation bootstrap of residuals as a descriptive method.
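
These purely algebraic facts are easy to confirm numerically. The sketch below (all values hypothetical) checks that least squares applied to xβ + ε returns β + Mε exactly, and that the non-constant rows of M are contrasts, so that shifting ε by a constant changes nothing beyond the intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([2.0, -1.0, 0.5])
eps = rng.normal(size=n)

M = np.linalg.inv(x.T @ x) @ x.T        # the matrix M of equation (2)
y = x @ beta + eps

# beta_LS = beta + M @ eps, purely algebraically
print(np.allclose(M @ y, beta + M @ eps))            # True

# rows M_k, k > 1, are contrasts: they sum to zero ...
print(np.allclose(M[1:] @ np.ones(n), 0.0))          # True
# ... so shifting eps by a constant leaves (beta_LS)_k, k > 1, unchanged
print(np.allclose(M[1:] @ (eps + 7.0), M[1:] @ eps)) # True
```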

Generalized Least Squares
If the errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3)–(4) (any positive multiple of Σ will produce the same result), given by

β̂_ML = (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹y = β + (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹ε. (5)

Generalized least squares solution (5) retains properties (3)–(4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
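A minimal sketch of (5), assuming a hypothetical AR(1)-type covariance known up to a constant multiple; as noted above, any positive multiple of Σ yields the same estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# hypothetical AR(1)-style covariance, known up to the multiple sigma2
rho, sigma2 = 0.8, 2.0
Sigma = sigma2 * rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
y = x @ beta + L @ rng.normal(size=n)   # correlated N(0, Sigma) errors

Si = np.linalg.inv(Sigma)               # any positive multiple of Sigma works
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # equation (5)
print(beta_gls)
```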

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σᵢβᵢwᵢ(t), t ∈ T,

where the wᵢ(⋅) are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3)–(4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models.
Errors model: Model (1) above.
Sampling model: Data (x_{i1}, ..., x_{ip}, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the εᵢ are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, to give estimates of the likely proximity of our sample least squares fit to the population fit, and to estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
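
In that spirit, the sketch below (with synthetic stand-in data, since Galton's seed measurements are not reproduced here) contrasts interval estimates for a slope from the pairs bootstrap, natural under the sampling model, with those from the residual bootstrap, natural under the errors model.

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 100, 2000
x = rng.normal(size=n)
y = 1.0 + 0.33 * x + rng.normal(scale=0.5, size=n)   # synthetic stand-in data
X = np.column_stack([np.ones(n), x])

def ls(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

b_hat = ls(X, y)
resid = y - X @ b_hat

slopes_pairs, slopes_resid = [], []
for _ in range(B):
    idx = rng.integers(0, n, n)
    # pairs bootstrap (natural for the sampling model)
    slopes_pairs.append(ls(X[idx], y[idx])[1])
    # residual bootstrap (natural for the errors model)
    y_star = X @ b_hat + resid[rng.integers(0, n, n)]
    slopes_resid.append(ls(X, y_star)[1])

for s in (slopes_pairs, slopes_resid):
    lo, hi = np.percentile(s, [2.5, 97.5])
    print(f"95% interval for the slope: ({lo:.3f}, {hi:.3f})")
```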

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data, and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution of this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so (using that X and Y are identically distributed, which also gives EJ = P(X > c) − P(Y > c) = 0)

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + cEJ = E X1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻
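
A quick Monte Carlo illustration of the proposition, with X and Y identically distributed N(0, 1) but positively correlated (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10**6
# identically distributed but positively correlated pair (X, Y)
z = rng.normal(size=n)
x = 0.6 * z + 0.8 * rng.normal(size=n)
y = 0.6 * z + 0.8 * rng.normal(size=n)   # same N(0, 1) marginal as x

c = 1.0                                   # any c > max(0, EX)
sel = x > c
print(y[sel].mean(), x[sel].mean())       # E(Y|X>c) < E(X|X>c)
```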

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F (1877) Typical laws of heredity. Nature
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, ..., xₙ) is a sample from a stochastic process x = {x₁, x₂, ...}. Let pₙ(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n), θₙ)/pₙ(x(n), θ)], where θₙ = θ + n^(−1/2)h and h is a (k × 1) vector. The joint density pₙ(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^(−1/2) hᵗSₙ(θ) − n^(−1)(1/2) hᵗJₙ(θ)h + oₚ(1), (1)

where Sₙ(θ) = d ln pₙ(x(n), θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n), θ)/dθ dθᵗ, and

(i) n^(−1/2)Sₙ(θ) →d N_k(0, F(θ)), (ii) n^(−1)Jₙ(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See Le Cam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂ₙ − θ) →d N_k(0, F⁻¹(θ)). (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jₙ^(1/2)(θ) to get the limiting normal distribution, viz.

Jₙ^(1/2)(θ)(θ̂ₙ − θ) →d N(0, I), (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose, conditionally on V = v, (x₁, x₂, ..., xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n), θ) ∝ [1 + (1/2)∑ᵢ₌₁ⁿ(xᵢ − θ)²]^(−(n/2+1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
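
A short simulation sketch of Example 1 (sample sizes and seeds hypothetical): drawing V from its exponential distribution and x̄ from its conditional normal law, the normalized estimator matches t(2) quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
theta, n, reps = 0.0, 400, 20000

v = rng.exponential(1.0, reps)                        # V ~ Exp(mean 1)
xbar = theta + rng.normal(0.0, 1.0, reps) / np.sqrt(n * v)  # xbar | V=v ~ N(theta, 1/(n v))
z = np.sqrt(n) * (xbar - theta)

# compare simulated quantiles with the t(2) limit
print(np.quantile(z, [0.75, 0.95]))
print(stats.t(df=2).ppf([0.75, 0.95]))
```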

Example 2: Autoregressive process. Consider a first-order autoregressive process {xₜ} defined by xₜ = θx_{t−1} + eₜ, t = 1, 2, ..., with x₀ = 0, where the eₜ are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n), θ) ∝ exp[−(1/2)∑ₜ₌₁ⁿ(xₜ − θx_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂ₙ = (∑ᵢ₌₁ⁿ xᵢx_{i−1})(∑ᵢ₌₁ⁿ x²_{i−1})⁻¹.

One can verify that √n(θ̂ₙ − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹θⁿ(θ̂ₙ − θ) →d Cauchy for ∣θ∣ > 1.
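
Example 2's explosive case can likewise be checked by simulation; the sketch below (with a hypothetical θ = 1.05) compares the normalized estimation error with Cauchy quantiles.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 1.05, 200, 5000
normed = []

for _ in range(reps):
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):          # x_t = theta * x_{t-1} + e_t, x_0 = 0
        x[t] = theta * x[t - 1] + e[t - 1]
    theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    normed.append(theta**n * (theta_hat - theta) / (theta**2 - 1))

print(np.quantile(normed, [0.75, 0.95]))   # standard Cauchy: 1.0 and ~6.31
```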

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
Le Cam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:
- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distributions of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa, or difference on the abscissa, for certain percentiles. A more objective method is to fit a least-squares line to the data; the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, ..., X_{n:n})′,
1 = (1, 1, ..., 1)′,
α = (α_{1:n}, α_{2:n}, ..., α_{n:n})′,
ε = (ε₁, ε₂, ..., εₙ)′,
θ = (a, b)′,
A = (1, α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B is the variance–covariance matrix of the reduced order statistics Y_{i:n}. The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹A′B⁻¹x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n}Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
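
As a simplified illustration of probability plotting (not Lloyd's full GLS, which requires the matrix B), the sketch below fits an ordinary least-squares line of the ordered sample on approximate normal plotting positions, here Blom's approximation to α_{i:n}; the intercept and slope estimate a and b. All numerical values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
a_true, b_true, n = 10.0, 2.0, 50
x = np.sort(rng.normal(a_true, b_true, n))  # ordered sample X_{1:n} <= ... <= X_{n:n}

# Blom's approximation to alpha_{i:n} = E(Y_{i:n}) for the normal case
i = np.arange(1, n + 1)
alpha = stats.norm.ppf((i - 0.375) / (n + 0.25))

# ordinary least-squares line on "probability paper": X_{i:n} ~ a + b alpha_{i:n}
b_hat, a_hat = np.polyfit(alpha, x, 1)
print(a_hat, b_hat)    # estimates of location a and scale b
```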

About the Author
Dr. Horst Rinne is Professor Emeritus of Statistics and Econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, then joined Volkswagen AG to do operations research, and was subsequently appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he later was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economic statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses rᵢ out of mᵢ trials (i = 1, ..., n), the response probabilities Pᵢ are assumed to have a logistic-normal distribution, with logit(Pᵢ) = log(Pᵢ/(1 − Pᵢ)) ∼ N(μᵢ, σ²), where μᵢ is modelled as a linear function of explanatory variables x₁, ..., x_p. The resulting model can be summarized as

Rᵢ ∣ Pᵢ ∼ Binomial(mᵢ, Pᵢ),
logit(Pᵢ) ∣ Z = ηᵢ + σZ = β₀ + β₁x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[Rᵢ] = mᵢπᵢ and Var(Rᵢ) = mᵢπᵢ(1 − πᵢ)[1 + σ²(mᵢ − 1)πᵢ(1 − πᵢ)],

where logit(πᵢ) = ηᵢ. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, the random effect Z allowing for unexplained variation across the grouped observations. However, note that for binary data (mᵢ = 1) it is not possible to have overdispersion arising in this way.
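
A small simulation sketch (with hypothetical parameter values) of the overdispersion formula above: the variance of R under the logit-normal mixture exceeds the binomial variance and is close to Williams' approximation when σ² is moderate.

```python
import numpy as np

rng = np.random.default_rng(9)
m, eta, sigma = 20, 0.5, 0.4          # hypothetical group size and parameters
reps = 200000

p = 1.0 / (1.0 + np.exp(-(eta + sigma * rng.normal(size=reps))))
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
binom_var = m * pi * (1 - pi)
approx_var = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), binom_var, approx_var)  # simulated variance exceeds the binomial
```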

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yᵢ; πᵢ) = πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ). (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods. In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yᵢ; θᵢ, ϕ) = exp{(yᵢθᵢ − b(θᵢ))/α(ϕ) + c(yᵢ; ϕ)}, (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yᵢ; πᵢ) = exp{yᵢ ln(πᵢ/(1 − πᵢ)) + ln(1 − πᵢ)}. (3)

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, is based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation


across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μᵢ; yᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + ln(1 − μᵢ)}, (4)

or

L(μᵢ; yᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ) + (1 − yᵢ) ln(1 − μᵢ)}. (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model the deviance is expressed as

D = 2∑ᵢ₌₁ⁿ {yᵢ ln(yᵢ/μᵢ) + (1 − yᵢ) ln((1 − yᵢ)/(1 − μᵢ))}. (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μᵢ/(1 − μᵢ)). We have, therefore,

x′ᵢβ = ln(μᵢ/(1 − μᵢ)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ. (7)

The value of μᵢ for each observation in the logistic model is calculated as

μᵢ = 1/(1 + exp(−x′ᵢβ)) = exp(x′ᵢβ)/(1 + exp(x′ᵢβ)). (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑ᵢ₌₁ⁿ (yᵢ − μᵢ)xᵢ (9)

∂²L(β)/∂β∂β′ = −∑ᵢ₌₁ⁿ xᵢx′ᵢ μᵢ(1 − μᵢ). (10)
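
The following sketch implements this Newton–Raphson iteration directly from the gradient (9) and Hessian (10), on simulated data with hypothetical coefficients; standard errors come from the inverse of the negative Hessian.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one predictor
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):                                   # Newton-Raphson iterations
    mu = 1.0 / (1.0 + np.exp(-X @ beta))              # equation (8)
    grad = X.T @ (y - mu)                             # equation (9)
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X      # equation (10)
    step = np.linalg.solve(hess, grad)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))           # SEs from the Hessian
print(beta, se, np.exp(beta[1]))                      # coefficients and odds ratio
```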

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

(2 × 2 tabulation of survived, no/yes, by age, child vs adult, with row and column totals.)

The odds of survival for adult passengers are given by the number of adults surviving divided by the number who did not; likewise for children. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

(Logistic regression output for survived: the estimated odds ratio for age with its standard error, z statistic, p-value, and 95% confidence interval.)

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.
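
Because the cell counts of the tabulation are not reproduced here, the sketch below uses hypothetical counts simply to display the arithmetic connecting a 2 × 2 table, the two component odds, and the odds ratio reported by a logistic regression of survival on age.

```python
import numpy as np

# hypothetical 2 x 2 counts: rows = survived (no, yes), cols = (child, adult)
died_child, died_adult = 50, 1400
surv_child, surv_adult = 55, 650

odds_adult = surv_adult / died_adult      # odds of survival, adults
odds_child = surv_child / died_child      # odds of survival, children
oratio = odds_adult / odds_child
print(oratio, 1 / oratio)                 # odds ratio and its inverse

# the same quantity is exp(b1) in logit(P(survive)) = b0 + b1*(age == adult)
b1 = np.log(oratio)
print(np.exp(b1))
```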


For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age had been recorded as a continuous predictor in the Titanic data and the odds ratio calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yᵢ; πᵢ, mᵢ) = (mᵢ choose yᵢ) πᵢ^yᵢ (1 − πᵢ)^(mᵢ−yᵢ), (11)

with a corresponding log-likelihood function expressed as

L(μᵢ; yᵢ, mᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + mᵢ ln(1 − μᵢ) + ln(mᵢ choose yᵢ)}. (12)

Taking derivatives of the cumulant, −mᵢ ln(1 − μᵢ), as we did for the binary response model, produces a mean of μᵢ = mᵢπᵢ and variance μᵢ(1 − μᵢ/mᵢ). Consider the data below:

(Grouped data table with columns y, cases, x₁, x₂, x₃: one row per covariate pattern.)

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases sharing the first pattern of values of x₁, x₂, and x₃; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

(GLM output for the grouped data: odds ratios for x₁, x₂, and x₃, with OIM standard errors, z statistics, p-values, and 95% confidence intervals.)

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described; modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of very many observations being grouped into only a handful of covariate-pattern rows, as above described. Data in tables is nearly always expressed in grouped format.
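
A sketch of grouped (binomial) logistic estimation via GLM software, here statsmodels, with a small invented grouped data set standing in for the table above:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical grouped data: y successes out of `cases` trials per covariate pattern
y     = np.array([1, 3, 0, 2])
cases = np.array([3, 4, 2, 5])
x1    = np.array([1, 0, 1, 0])
x2    = np.array([0, 1, 1, 0])

X = sm.add_constant(np.column_stack([x1, x2]))
# binomial response supplied as (successes, failures)
fit = sm.GLM(np.column_stack([y, cases - y]), X,
             family=sm.families.Binomial()).fit()
print(np.exp(fit.params[1:]))   # odds ratios for x1 and x2
```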

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]², −∞ < x < ∞, (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − μ)/τ}], −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − μ)/τ}], −∞ < x < ∞,
h(x) = (1/τ) ⋅ 1/[1 + exp{−(x − μ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (). One obvious extension for modelling failure times T

is to assume a logistic model for logT giving a log-logisticmodel for T analagous to the lognormal modele result-ing hazard function based on the logistic distribution in() is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α], t ≥ 0, θ, α > 0,

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.
For the standard logistic distribution (µ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x, (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + µ.
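As an illustration of this inverse-CDF identity, here is a minimal Python sketch (assuming NumPy and SciPy are available); the seed and the values of µ and τ are arbitrary choices. It checks the sample mean against µ and the sample variance against τ²π²/3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Inverse-CDF simulation: X = log[U/(1 - U)] is standard logistic
u = rng.uniform(size=100_000)
x = np.log(u / (1.0 - u))

# General logistic in (1): location mu, scale tau, via tau*X + mu
mu, tau = 2.0, 1.5
y = tau * x + mu

print(y.mean())                                  # ~ mu = 2.0
print(y.var(), tau**2 * np.pi**2 / 3)            # both ~ 7.40
print(stats.logistic(loc=mu, scale=tau).var())   # same closed form from scipy
```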

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters µ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti () and Hilbe ().
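A sketch of such a GLM fit, using the statsmodels library (assumed available); the dose levels and r-out-of-n counts below are invented for illustration, and the recovery of the tolerance parameters follows the identity µ = −β₀/β₁, τ = 1/∣β₁∣ given above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical r-out-of-n quantal bio-assay data at five dose levels
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n    = np.array([40, 40, 40, 40, 40])      # subjects per group
r    = np.array([4, 10, 21, 30, 37])       # responders per group

X = sm.add_constant(dose)                  # columns: intercept, dose
# Binomial response given as (successes, failures); the logit link is the default
model = sm.GLM(np.column_stack([r, n - r]), X, family=sm.families.Binomial())
fit = model.fit()                          # iteratively reweighted least squares
b0, b1 = fit.params

# Implied logistic tolerance distribution
print("mu =", -b0 / b1, " tau =", 1.0 / abs(b1))
```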

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association (–), the Statistical Modelling Society (–) and the European Regional Section of the International Association of Statistical Computing (–). He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region (). He is an Elected member of the International Statistical Institute (). He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common tool of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

"Plot along one axis accumulated percentages of the population from poorest to richest, and along the other, wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population: L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).
The Lorenz curve satisfies the general rule:

"A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant."

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.
Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.
The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio

[Lorenz Curve. Fig. 1: Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p); both axes run from 0 to 1]
[Lorenz Curve. Fig. 2: Two intersecting Lorenz curves; using the Gini index, L₁(p) has greater inequality than L₂(p)]

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function and assume that µ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists and
(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the µ_Y/µ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.
A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
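A quick numerical check of cases (I) and (II) of the theorem, under assumed lognormal incomes and two invented transfer rules: a flat tax u(x) = 0.7x (constant u(x)/x) and a progressive rule u(x) = x^0.8 (decreasing u(x)/x). The Gini computation uses the standard mean-difference formula.

```python
import numpy as np

def gini(x):
    # Gini index via the formula G = 2*sum(i*x_(i)) / (n*sum(x)) - (n+1)/n
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return 2.0 * np.sum(i * x) / (n * np.sum(x)) - (n + 1.0) / n

rng = np.random.default_rng(2)
pre_tax = rng.lognormal(10.0, 1.0, size=50_000)

flat        = 0.7 * pre_tax       # u(x)/x constant: case (II), Gini unchanged
progressive = pre_tax ** 0.8      # u(x)/x = x^(-0.2) decreasing: case (I), Gini falls

print(gini(pre_tax))              # ~0.52
print(gini(flat))                 # unchanged, ~0.52
print(gini(progressive))          # reduced, ~0.43
```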

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics (–) and has been a scientist at the Folkhälsan Institute of Genetics. He is member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.
Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.
To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.
Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative set: Θ = Θ₀ + Θ₁. Then the usual loss function is given by

L(θ, ϑ) = 0, if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
L(θ, ϑ) = 1, if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
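For concreteness, the following Python sketch implements the loss functions just named; the parameter values and the evaluation points are arbitrary, and t denotes the estimation error (decision minus truth).

```python
import numpy as np

def square(t):
    return np.asarray(t, dtype=float) ** 2          # classical square loss

def absolute(t):
    return np.abs(np.asarray(t, dtype=float))       # robust L1-loss

def huber(t, gamma=1.0):
    # t^2 near zero, linear growth beyond gamma (robust to outliers)
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):
    # asymmetric: exponential on one side of zero, roughly linear on the other
    t = np.asarray(t, dtype=float)
    return b * (np.exp(a * t) - a * t - 1.0)

errors = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(huber(errors, gamma=1.0))    # [3. 1. 0. 1. 3.]
print(linex(errors, a=1.0))        # penalizes t = +2 far more than t = -2
```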

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see ►Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().
In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.
In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, ..., y_n, observed at design points t₁, ..., t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes

Σ_{i=1}^n ℓ(r_i^f), f ∈ F,

where r_i^f = y_i − f(t_i) is the residual in the i-th design point. Here ℓ is also called loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
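A small illustration of this "best approximation" view of regression: fitting a straight line under the square loss and the robust L₁-loss. The data, the outliers, and the use of a general-purpose optimizer (rather than the closed-form least-squares solution) are choices made for this sketch only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * t + rng.normal(0.0, 0.1, t.size)
y[::10] += 3.0                         # a few gross outliers

def fit(loss):
    # F = straight lines f(t) = c0 + c1*t; minimize sum_i loss(y_i - f(t_i))
    obj = lambda c: np.sum(loss(y - (c[0] + c[1] * t)))
    return minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x

sq = fit(lambda r: r ** 2)             # least squares
l1 = fit(lambda r: np.abs(r))          # robust L1

print(sq)    # pulled toward the outliers
print(l1)    # close to (1, 2)
```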

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –


Likelihood

normal and in fact asymptotically free of the prior distribution, although this result requires that the prior distribution be a proper probability density, i.e., has integral over the parameter space equal to 1.

Nuisance Parameters
In models where the dimension of θ is large, plotting the likelihood function is not possible, and inference based on the multivariate normal distribution for θ̂ or the χ²_k distribution of the log-likelihood ratio doesn't lead easily to interval estimates for components of θ. However, it is possible to use the likelihood function to construct inference for parameters of interest, using various methods that have been proposed to eliminate nuisance parameters.
Suppose in the model f(y; θ) that θ = (ψ, λ), where ψ is a k-dimensional parameter of interest (often k = 1). The profile log-likelihood function of ψ is

ℓ_P(ψ) = ℓ(ψ, λ̂_ψ),

where λ̂_ψ is the constrained maximum likelihood estimate: it maximizes the likelihood function L(ψ, λ) when ψ is held fixed. The profile log-likelihood function is also called the concentrated log-likelihood function, especially in econometrics. If the parameter of interest is not expressed explicitly as a subvector of θ, then the constrained maximum likelihood estimate is found using Lagrange multipliers.
It can be verified, under suitable smoothness conditions, that results similar to those described earlier hold as well for the profile log-likelihood function; in particular,

2{ℓ_P(ψ̂) − ℓ_P(ψ)} = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} → χ²_k in distribution.

This method of eliminating nuisance parameters is not completely satisfactory, especially when there are many nuisance parameters; in particular, it doesn't allow for errors in the estimation of λ. For example, the profile likelihood approach to estimation of σ² in the linear regression model (see ►Linear Regression Models) y ∼ N(Xβ, σ²I) will lead to the estimator σ̂² = Σ(y_i − ŷ_i)²/n, whereas the estimator usually preferred has divisor n − p, where p is the dimension of β.
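A short numerical illustration of this point, under invented data and dimensions: for a simulated linear model, the maximizer of the profile log-likelihood of σ² is the divisor-n estimator, not the divisor-(n − p) one.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(0.0, 1.5, size=n)

# Profiling out beta: for fixed sigma^2 the likelihood is maximized by OLS
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

def profile_loglik(sigma2):
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

grid = np.linspace(0.5, 6.0, 2000)
print(grid[np.argmax(profile_loglik(grid))])   # ~ rss / n (profile maximizer)
print(rss / n, rss / (n - p))                  # MLE vs. the usual estimator
```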

Thus a large literature has developed on improvements to the profile log-likelihood. For Bayesian inference such improvements are "automatically" included in the formulation of the marginal posterior density for ψ,

π_M(ψ ∣ y) ∝ ∫ π(ψ, λ ∣ y) dλ,

but it is typically quite difficult to specify priors for possibly high-dimensional nuisance parameters. For non-Bayesian inference, most modifications to the profile log-likelihood are derived by considering conditional or marginal inference in models that admit factorizations, at least approximately, like the following:

f(y; θ) = f₁(y₁; ψ) f₂(y₂ ∣ y₁; λ) or f(y; θ) = f₁(y₁ ∣ y₂; ψ) f₂(y₂; λ).

A discussion of conditional inference and density factorisations is given in Reid (). This literature is closely tied to that on higher order asymptotic theory for likelihood. The latter theory builds on saddlepoint and Laplace expansions to derive more accurate versions of the results above; see, for example, Severini () and Brazzale et al (). The direct likelihood approach of Royall () and others does not generalize very well to the nuisance parameter setting, although Royall and Tsou () present some results in this direction.

Extensions to Likelihood
The likelihood function is such an important aspect of inference based on models that it has been extended to "likelihood-like" functions for more complex data settings. Examples include nonparametric and semi-parametric likelihoods: the most famous semi-parametric likelihood is the proportional hazards model of Cox (). But many other extensions have been suggested: to empirical likelihood (Owen ), which is a type of nonparametric likelihood supported on the observed sample; to quasi-likelihood (McCullagh ), which starts from the score function U(θ) and works backwards to an inference function; to bootstrap likelihood (Davison et al ); and many modifications of profile likelihood (Barndorff-Nielsen and Cox ; Fraser ). There is recent interest, for multi-dimensional responses Y_i, in composite likelihoods, which are products of lower dimensional conditional or marginal distributions (Varin ). Reid () concluded a review article on likelihood by stating:

"From either a Bayesian or frequentist perspective, the likelihood function plays an essential role in inference. The maximum likelihood estimator, once regarded on an equal footing among competing point estimators, is now typically the estimator of choice, although some refinement is needed in problems with large numbers of nuisance parameters. The likelihood ratio statistic is the basis for most tests of hypotheses and interval estimates. The emergence of the centrality of the likelihood function for inference, partly due to the large increase in computing power, is one of the central developments in the theory of statistics during the latter half of the twentieth century."


Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada (–). During – she served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation () "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics, and complex social surveys." She was awarded the Gold Medal, Statistical Society of Canada () and the Florence Nightingale David Award, Committee of Presidents of Statistical Societies (). She is Associate Editor of Statistical Science (–), Bernoulli (–), and Metrika (–).

Cross References
►Bayesian Analysis or Evidence Based Statistics
►Bayesian Statistics
►Bayesian Versus Frequentist Statistical Reasoning
►Bayesian vs. Classical Point Estimation: A Comparative Overview
►Chi-Square Test: Analysis of Contingency Tables
►Empirical Likelihood Approach to Inference from Sample Survey Data
►Estimation
►Fiducial Inference
►General Linear Models
►Generalized Linear Models
►Generalized Quasi-Likelihood (GQL) Inferences
►Mixture Models
►Philosophical Foundations of Statistics
►Risk Analysis
►Statistical Evidence
►Statistical Inference
►Statistical Inference: An Overview
►Statistics: An Overview
►Statistics: Nelder's View
►Testing Variance Components in Mixed Linear Models
►Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc –
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B – (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika –
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A –
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat –
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat –
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika –
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci –
Reid N () Likelihood. J Am Stat Assoc –
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. httpwwwstieltjesorgarchiefbiennialframenodehtml. Accessed Aug
Varin C () On composite marginal likelihood. Adv Stat Anal –


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter.
By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, X₃, ... is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX₁ exists, then n⁻¹S_n → a (almost surely, i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives one a more precise approximation for S_n. It says that if σ² = E(X₁ − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
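The convergence P(ζ_n < x) → Φ(x) is easy to watch numerically; the following Python sketch (NumPy/SciPy assumed) standardizes sums of exponential random variables, an arbitrary choice of summand distribution with a = σ = 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 400, 20_000

# i.i.d. exponential(1) summands: a = 1, sigma = 1
X = rng.exponential(1.0, size=(reps, n))
S = X.sum(axis=1)
zeta = (S - n * 1.0) / (1.0 * np.sqrt(n))      # standardized sums

# Empirical P(zeta < x) should be close to Phi(x)
for x in (-1.0, 0.0, 1.0):
    print(x, (zeta < x).mean(), stats.norm.cdf(x))
```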

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are:
Relaxing the assumption EX₁² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X₁ > x) + P(X₁ < −x) is a function regularly varying at infinity and such that the limit lim_{x→∞} P(X₁ > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX₁ = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The ►characteristic functions of these laws have simple closed-form representations.
Relaxing the assumption that the X_j's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{jn} forming the sum S_n depend not only on j but on n as well. In this case the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on the convergence in distribution of the number of occurrences of rare events to a Poisson law.
Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see ►Ergodic Theorem) for a wide class of random sequences and processes.

Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.
Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X₁ > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.

Considering observations X₁, ..., X_n of a more complex nature – first of all, multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X₁ − EX₁)(X₁ − EX₁)^T.
The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, ..., X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see ►Glivenko–Cantelli Theorems), stating that sup_u ∣F*_n(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from the random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX₁ and σ² = E(X₁ − a)² also follows from the LLN, since, as n → ∞,

a* = (1/n) Σ_{j=1}^n X_j → a,

(σ²)* = (1/n) Σ_{j=1}^n (X_j − a*)² = (1/n) Σ_{j=1}^n X_j² − (a*)² → σ² (almost surely).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.

The theorem on the ►asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.
It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.
Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, ►queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX₁ = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X₁) < ∞, then there exists the limit

lim_{θ↓0} P(2θS > x) = e^{−x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.
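A simulation sketch of this transient-phenomena limit, under the reconstruction of the Kingman–Prokhorov statement given above (the constant in the exponent should be checked against the original source); the drift θ, the normal increments, and the truncation of the supremum to a finite horizon are all choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, sigma, n_steps, reps = 0.05, 1.0, 5_000, 2_000

# S = sup_k (S_k - theta*k) with EX = 0, Var X = sigma^2 (normal increments here)
increments = rng.normal(-theta, sigma, size=(reps, n_steps))
S = np.maximum(np.cumsum(increments, axis=1).max(axis=1), 0.0)

# Approximation for small theta: 2*theta*S is roughly Exp(1/sigma^2)
z = 2 * theta * S
for x in (0.5, 1.0, 2.0):
    print(x, (z > x).mean(), np.exp(-x / sigma**2))
```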

Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{µ(X₁−θ)} = 1 has a solution µ₀ > 0, then in the above example one has

P(S > x) ∼ c e^{−µ₀x}, x → ∞,

where c is a known constant. If the distribution F of X₁ is subexponential (in this case E e^{µ(X₁−θ)} = ∞ for any µ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x. For both approaches the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (►random walk) S₁, ..., S_n can also be classified as LTs for ►stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.
In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t·w(1) is the Brownian bridge process. This LT implies ►asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).
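This limit underlies the Kolmogorov–Smirnov test; the sketch below compares the tail of sup_u ∣√n(F*_n(u) − F(u))∣ for uniform samples with the tail of sup_t ∣w⁰(t)∣, which SciPy exposes as the kstwobign distribution. The sample sizes are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 500, 5_000

# KS statistic D_n = sup_u |F*_n(u) - F(u)| for U(0,1) samples
u = np.sort(rng.uniform(size=(reps, n)), axis=1)
grid = np.arange(1, n + 1) / n
D = np.maximum(np.abs(grid - u), np.abs(grid - 1.0 / n - u)).max(axis=1)

# Limit of sqrt(n)*D_n: the distribution of sup |w0(t)| (Kolmogorov distribution)
x = 1.0
print((np.sqrt(n) * D > x).mean())   # empirical tail
print(stats.kstwobign.sf(x))         # limiting tail, ~0.27
```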

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for ►martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences). He is Head of the Probability and Statistics Chair at the Novosibirsk University. He is Councillor of the Russian Academy of Sciences and full member of the Russian Academy of Sciences (Academician) (). He was awarded the State Prize of the USSR (), the Markov Prize of the Russian Academy of Sciences () and the Government Prize in Education (). Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
►Almost Sure Convergence of Random Variables
►Approximations to Distributions
►Asymptotic Normality
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Central Limit Theorems
►Empirical Processes
►Ergodic Theorem
►Glivenko–Cantelli Theorems
►Large Deviations and Applications
►Laws of Large Numbers
►Martingale Central Limit Theorem
►Probability Theory: An Outline
►Random Matrix Theory
►Strong Approximations in Probability and Statistics
►Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i ∣ b_i ∼ N(X_iβ + Z_ib_i, Σ_i), (1)
b_i ∼ N(0, D), (2)

where X_i and Z_i are (n_i × p) and (n_i × q) dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see ►Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_iβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_iβ and with covariance matrix V_i = Z_iDZ_i′ + Σ_i.
Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_iDZ_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

marginal covariance matrix Vi of yi to be of the very spe-cic form Vi = ZiDZprimei + Σi Let us consider two examplesunder the assumption of conditional independence ieassuming Σi = σ Ini First consider the case where therandom eects are univariate and represent unit-specicintercepts is corresponds to covariates Zi which areni-dimensional vectors containing only ones

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_iβ, V_i), V_i = Z_iDZ_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of unit-specific parameters b_i, the posterior distribution of b_i given the observed data y_i can be shown to be (multivariate) normal with mean vector equal to DZ_i′V_i⁻¹(α)(y_i − X_iβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_iβ̂ + Z_ib̂_i of the i-th profile. It can easily be shown that

ŷ_i = Σ_iV_i⁻¹X_iβ̂ + (I_{n_i} − Σ_iV_i⁻¹) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_iβ̂ and the observed data y_i, with weights Σ_iV_i⁻¹ and I_{n_i} − Σ_iV_i⁻¹, respectively. Note that the "numerator" of Σ_iV_i⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_iβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
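A minimal numerical illustration of this shrinkage for the simplest case above (random intercepts, Z_i a vector of ones, variance components treated as known, and the true β used for simplicity); with these choices the EB estimate reduces to a shrunken unit mean. All parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n_i = 200, 5                  # units, measurements per unit
sigma2, d = 1.0, 2.0             # error variance, random-intercept variance D = d
beta = 3.0                       # single fixed intercept

b = rng.normal(0.0, np.sqrt(d), size=m)                        # true random intercepts
y = beta + b[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(m, n_i))

# For V_i = d*J + sigma2*I the EB estimate is a shrunken residual mean:
# b_hat_i = (n_i*d / (n_i*d + sigma2)) * (ybar_i - beta)
w = n_i * d / (n_i * d + sigma2)
b_hat = w * (y.mean(axis=1) - beta)

print(b.var(), b_hat.var())      # EB estimates are less variable: var ~ w*d < d
```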

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–). Currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.

L Linear Regression Models

Cross References
►Best Linear Unbiased Estimation in Linear Models
►General Linear Models
►Testing Variance Components in Mixed Linear Models
►Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction." (F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies ω₁ and ω₂, embedded in a white noise background: y[t] = β₀ + β₁ sin[ω₁t] + β₂ sin[ω₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (x₀, x₁, x₂), x₀ ≡ 1, x₁ ≡ x₁(t) = sin[ω₁t], x₂(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 if i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ω₁t_i], sin[ω₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].
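A sketch of this model and its least-squares fit; the frequency symbols ω₁, ω₂ stand in for the fixed known frequencies of the example, and all numerical values are invented.

```python
import numpy as np

rng = np.random.default_rng(9)

# Known frequencies and (to-be-recovered) unknown amplitudes
w1, w2 = 2.0, 5.0
beta_true = np.array([1.0, 3.0, -2.0])

t = np.linspace(0.0, 10.0, 200)
X = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])  # design matrix
y = X @ beta_true + rng.normal(0.0, 0.5, size=t.size)                   # white noise errors

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)    # close to (1, 3, -2)
```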

Terminology
"Independent variables," as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (i.i.d.) additive normal errors in linear regression, where ►least squares has particular advantages (connections with ►multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (►Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal) and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and in time he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2004), Freedman (1991), Freedman (2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

yᵢ = xᵢ₁β₁ + ⋯ + xᵢₚβₚ + εᵢ, i ≤ n, (1)

in which the random errors ε are assumed to satisfy

Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n.

The interpretation is that response yᵢ is observed for the ith sample in conjunction with numerical values (xᵢ₁, …, xᵢₚ) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated) and if the matrix xᵗʳx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βₖ, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂ML = β̂LS = (xᵗʳx)⁻¹xᵗʳy = β + Mε, (2)

for M = (xᵗʳx)⁻¹xᵗʳ, with xᵗʳ denoting the matrix transpose of x and (xᵗʳx)⁻¹ the matrix inverse. These coefficient estimates β̂LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂LS)ₖ = βₖ, k ≤ p, (3)

and, among all unbiased estimators β*ₖ (of βₖ) that are linear in y,

E((β̂LS)ₖ − βₖ)² ≤ E(β*ₖ − βₖ)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
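A numerical illustration may help; the following Python sketch (added here, not part of the original entry) computes the least squares estimate (2) on simulated data and checks the algebraic identity β̂LS = β + Mε. All names and values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta = np.array([1.0, 2.0, -0.5])                               # true coefficients
eps = rng.normal(scale=0.3, size=n)                             # iid errors, E(eps) = 0
y = x @ beta + eps

M = np.linalg.inv(x.T @ x) @ x.T   # M = (x'x)^{-1} x'
beta_ls = M @ y                    # least squares estimate, equation (2)
print(beta_ls)
print(beta + M @ eps)              # equal up to rounding, by pure algebra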

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β̂LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mₖ, k > 1, of M satisfies Mₖ1 = 0 (i.e., Mₖ is a contrast), so (β̂LS)ₖ = βₖ + Mₖε, and Mₖ(ε + c1) = Mₖε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂LS) = (1 − R²)s²(y), where s(·) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski 1997). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂ML = (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹y = β + (xᵗʳΣ⁻¹x)⁻¹xᵗʳΣ⁻¹ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
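The generalized least squares solution (5) is equally short to sketch in code; in the following added illustration the error covariance Σ is taken, purely for demonstration, to be of AR(1) type and known exactly.

import numpy as np

rng = np.random.default_rng(1)
n = 60
x = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([0.5, 2.0])
rho = 0.6
# Sigma_{ij} = rho^{|i-j|}, known up to a constant multiple
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = x @ beta + rng.multivariate_normal(np.zeros(n), Sigma)

Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)   # equation (5)
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]          # ordinary LS, ignores Sigma
print(beta_gls, beta_ols)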

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = ∑ᵢ βᵢwᵢ(t), t ∈ T,

where wᵢ(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk 2006).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + xβ for every x,

and the discrepancies y − E(y ∣ x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xᵢ₁, …, xᵢₚ, yᵢ), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εᵢ are simply the residual discrepancies y − xβ̂LS of a least squares fit of linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as had been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less or equal cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

EY 1(X > c) = E(Y 1(Y > c) + YJ) = EX 1(X > c) + E YJ
< EX 1(X > c) + c EJ = EX 1(X > c),

using E(Y 1(Y > c)) = E(X 1(X > c)) and EJ = 0, both because X and Y are identically distributed, yielding E(Y ∣ X > c) < E(X ∣ X > c). ◻
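A quick simulation (an added illustration, not Samuels') makes the proposition concrete: with X and Y independent standard normal they are identically distributed and satisfy the hypothesis, so E(Y ∣ X > c) < E(X ∣ X > c).

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)   # independence gives P(X>c, Y>c) < P(Y>c)
c = 1.0                          # any c > max(0, EX) = 0 will do
sel = x > c
print(y[sel].mean(), x[sel].mean())   # about 0.0 versus about 1.53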

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35:2313–2351
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499
Freedman D (1981) Bootstrapping regression models. Ann Stat 9:1218–1228
Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21:291–313
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1:292–298
Galton F (1877) Typical laws of heredity. Nature 15
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland 15:246–263
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K (1997) Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32:951–989
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat 45
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xₙ) is a sample from a stochastic process x = {x₁, x₂, …}. Let pₙ(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λₙ = ln[pₙ(x(n), θₙ)/pₙ(x(n), θ)], where θₙ = θ + n^(−1/2)h, and h is a (k × 1) vector. The joint density pₙ(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λₙ satisfies

Λₙ = n^(−1/2)hᵗSₙ(θ) − (1/2)n^(−1)(hᵗJₙ(θ)h) + oₚ(1), (1)

where Sₙ(θ) = d ln pₙ(x(n), θ)/dθ, Jₙ(θ) = −d² ln pₙ(x(n), θ)/dθ dθᵗ, and

(i) n^(−1/2)Sₙ(θ) →d Nₖ(0, F(θ)), (ii) n^(−1)Jₙ(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang (2000) for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂ₙ is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂ₙ − θ) →d Nₖ(0, F⁻¹(θ)). (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λₙ satisfies (1) and (2) with F(θ) random, the density pₙ(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jₙ^(1/2)(θ) to get the limiting normal distribution, viz.,

Jₙ^(1/2)(θ)(θ̂ₙ − θ) →d N(0, I), (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose, conditionally on V = v, (x₁, x₂, …, xₙ) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pₙ(x(n), θ) ∝ [1 + (1/2)∑(xᵢ − θ)²]^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂ₙ = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
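The t(2) limit can be checked by simulation; the sketch below (an added illustration, not from the entry) draws V ∼ Exp(1) and uses the exact conditional law x̄ ∣ V ∼ N(θ, 1/(nV)).

import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.0, 200, 20000
v = rng.exponential(1.0, size=reps)              # random limiting information
xbar = rng.normal(theta, 1 / np.sqrt(n * v))     # xbar | V ~ N(theta, 1/(nV))
z = np.sqrt(n) * (xbar - theta)                  # a variance mixture of normals
t2 = rng.standard_t(df=2, size=reps)
print(np.quantile(z, [0.1, 0.5, 0.9]))
print(np.quantile(t2, [0.1, 0.5, 0.9]))          # the quantiles agree closely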

Example 2: Autoregressive process

Consider a first-order autoregressive process {xₜ} defined by xₜ = θxₜ₋₁ + eₜ, t = 1, 2, …, with x₀ = 0, where the {eₜ} are assumed to be iid N(0, 1) random variables. We then have

pₙ(x(n), θ) ∝ exp[−(1/2)∑₁ⁿ(xₜ − θxₜ₋₁)²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is θ̂ₙ = (∑xₜxₜ₋₁)(∑xₜ₋₁²)⁻¹. One can verify that √n(θ̂ₙ − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹θⁿ(θ̂ₙ − θ) →d Cauchy for ∣θ∣ > 1.
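Both regimes are easy to verify numerically; the sketch below (an added illustration, with arbitrary values θ = 0.5 and θ = 1.05) simulates the ML estimator above in the stationary and the explosive case.

import numpy as np

rng = np.random.default_rng(4)

def ar1_mle(theta, n):
    # simulate x_t = theta*x_{t-1} + e_t, x_0 = 0, and return the ML estimator
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

reps = 2000
th, n = 0.5, 500     # stationary: sqrt(n)(theta_hat - theta) -> N(0, 1 - theta^2)
z = np.array([np.sqrt(n) * (ar1_mle(th, n) - th) for _ in range(reps)])
print(z.std(), np.sqrt(1 - th ** 2))

th, n = 1.05, 300    # explosive: theta^n (theta_hat - theta)/(theta^2 - 1) -> Cauchy
w = np.array([th ** n * (ar1_mle(th, n) - th) / (th ** 2 - 1) for _ in range(reps)])
print(np.median(np.abs(w)))   # about 1, the median of |standard Cauchy|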

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (2000) Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

Arc-sine distribution
Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U–shaped, or the power–function distributions
Cauchy and half–Cauchy distributions
Special cases of the χ–distribution, like the half–normal, the Rayleigh, and the Maxwell–Boltzmann distributions
Ordinary and raised cosine distributions
Exponential and reflected exponential distributions
Extreme value distribution of the maximum and the minimum, each of type I
Hyperbolic secant distribution
Laplace distribution
Logistic and half–logistic distributions
Normal and half–normal distributions
Parabolic distributions
Rectangular or uniform distribution
Semi–elliptical distribution
Symmetric triangular distribution
Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
V–shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (1975, 1976), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n} as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + εᵢ,

where εᵢ is a random variable expressing the difference between X_{i:n} and its mean, E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms εᵢ are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the general least-squares method to find best linear unbiased estimators of a and b. Introducing the vectors and matrices

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′, 1 = (1, 1, …, 1)′, α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,
ε = (ε₁, ε₂, …, εₙ)′, θ = (a, b)′, A = (1, α),

the regression model now reads

x = Aθ + ε,

with variance–covariance matrix

Var(x) = b²B.

The GLS estimator of θ is

θ̂ = (A′ΩA)⁻¹A′Ωx, with Ω = B⁻¹,

and its variance–covariance matrix reads

Var(θ̂) = b²(A′ΩA)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
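For the exponential distribution the quantities α and B are available in closed form (the standard exponential order-statistic results E(Y_{i:n}) = ∑_{k=1}^{i} 1/(n − k + 1) and Cov(Y_{i:n}, Y_{j:n}) = ∑_{k=1}^{min(i,j)} 1/(n − k + 1)²), so Lloyd's estimator can be sketched directly. The following Python sketch is an added illustration with arbitrary true values a = 2, b = 3.

import numpy as np

rng = np.random.default_rng(5)
n, a, b = 50, 2.0, 3.0
x = np.sort(a + b * rng.exponential(size=n))   # ordered sample, the regressand

w = 1.0 / (n - np.arange(n))                   # 1/(n-k+1) for k = 1..n
alpha = np.cumsum(w)                           # alpha_{i:n} = E(Y_{i:n})
cs = np.cumsum(w ** 2)
B = np.minimum.outer(cs, cs)                   # Cov(Y_{i:n}, Y_{j:n})

A = np.column_stack([np.ones(n), alpha])
Om = np.linalg.inv(B)                          # Omega = B^{-1}
theta = np.linalg.solve(A.T @ Om @ A, A.T @ Om @ x)   # BLUE of (a, b)
print(theta)                                   # close to (2, 3)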

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University, and worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover before joining Volkswagen AG to do operations research. He was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute. He is Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat 24:95–108
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat 25:47–50
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc 55:546–560
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika 39:88–95
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte/

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002) and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses rᵢ out of mᵢ trials (i = 1, …, n), the response probabilities Pᵢ are assumed to have a logistic-normal distribution with logit(Pᵢ) = log(Pᵢ/(1 − Pᵢ)) ∼ N(μᵢ, σ²), where μᵢ is modelled as a linear function of explanatory variables x₁, …, xₚ. The resulting model can be summarized as

Rᵢ ∣ Pᵢ ∼ Binomial(mᵢ, Pᵢ),
logit(Pᵢ) ∣ Z = ηᵢ + σZ = β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (1982),

E[Rᵢ] = mᵢπᵢ and Var(Rᵢ) = mᵢπᵢ(1 − πᵢ)[1 + σ²(mᵢ − 1)πᵢ(1 − πᵢ)],

where logit(πᵢ) = ηᵢ. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mᵢ = 1) it is not possible to have overdispersion arising in this way.
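A short simulation sketch (an added illustration with arbitrary values η = 0, σ = 0.5, m = 20) confirms the overdispersion formula.

import numpy as np

rng = np.random.default_rng(6)
eta, sigma, m, reps = 0.0, 0.5, 20, 200_000
p = 1 / (1 + np.exp(-(eta + sigma * rng.normal(size=reps))))   # logit-normal P
r = rng.binomial(m, p)                                         # R | P ~ Binomial(m, P)

pi = 1 / (1 + np.exp(-eta))
approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx)   # both well above the binomial value m*pi*(1-pi) = 5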

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44:139–177
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63:1–12
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67:261–272
McCulloch CE, Searle SR (2001) Generalized, linear, and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat 31:144–148

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yᵢ; πᵢ) = πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ). (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yᵢ; θᵢ, ϕ) = exp{(yᵢθᵢ − b(θᵢ))/α(ϕ) + c(yᵢ; ϕ)}, (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yᵢ; πᵢ) = exp{yᵢ ln(πᵢ/(1 − πᵢ)) + ln(1 − πᵢ)}. (3)

The link function is therefore ln(π/(1 − π)), and the cumulant − ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π; the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μᵢ; yᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + ln(1 − μᵢ)}, (4)

or

L(μᵢ; yᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ) + (1 − yᵢ) ln(1 − μᵢ)}. (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2∑ᵢ₌₁ⁿ {yᵢ ln(yᵢ/μᵢ) + (1 − yᵢ) ln((1 − yᵢ)/(1 − μᵢ))}. (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μᵢ/(1 − μᵢ)). We have, therefore,

x′ᵢβ = ln(μᵢ/(1 − μᵢ)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ. (7)

The value of μᵢ for each observation in the logistic model is calculated as

μᵢ = 1/(1 + exp(−x′ᵢβ)) = exp(x′ᵢβ)/(1 + exp(x′ᵢβ)). (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized to x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑ᵢ₌₁ⁿ (yᵢ − μᵢ)xᵢ, (9)

∂²L(β)/∂β∂β′ = −∑ᵢ₌₁ⁿ xᵢx′ᵢ μᵢ(1 − μᵢ). (10)
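The following Python sketch (an added illustration, not Hilbe's code) implements the Newton–Raphson iteration directly from the gradient (9) and Hessian (10), with data simulated for the purpose.

import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))                 # inverse link, equation (8)
    grad = X.T @ (y - mu)                            # gradient, equation (9)
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X     # Hessian, equation (10)
    step = np.linalg.solve(hess, grad)
    beta = beta - step                               # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # standard errors from the Hessian
print(beta, se, np.exp(beta))                 # exponentiated estimates: odds ratios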

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

ing the odds of survival for adult passengers to childrenA tabulation of the data is given as

Age (Child vs Adult)

Survived    child    adults    Total
no             52       765      817
yes            57       442      499
Total         109      1207     1316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>∣z∣    [95% Conf. Interval]
age              0.527

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above (1/0.527 ∼ 1.9), we may conclude that children had some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

dictor value indicates the change in odds expressed bythe displayed odds ratio For example if age was recordedas a continuous predictor in the Titanic data and theodds ratio was calculated as we would interpret therelationship as

7 The odds of surviving is one and a half percent greater for each increasing year of age.
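The odds-ratio arithmetic itself takes only a few lines of Python; this added sketch reuses the survival counts from the tabulation above.

surv_adult, died_adult = 442, 765
surv_child, died_child = 57, 52

odds_adult = surv_adult / died_adult     # about 0.578
odds_child = surv_child / died_child     # about 1.096
oratio = odds_adult / odds_child         # about 0.527
print(oratio, 1 / oratio)                # inverted: children vs adults, about 1.9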

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, y, indicating the number of successes for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yᵢ; πᵢ, mᵢ) = (mᵢ choose yᵢ) πᵢ^yᵢ (1 − πᵢ)^(mᵢ−yᵢ), (11)

with a corresponding log-likelihood function expressed as

L(μᵢ; yᵢ, mᵢ) = ∑ᵢ₌₁ⁿ {yᵢ ln(μᵢ/(1 − μᵢ)) + mᵢ ln(1 − μᵢ) + ln(mᵢ choose yᵢ)}. (12)

Taking derivatives of the cumulant, −mᵢ ln(1 − μᵢ), as we did for the binary response model, produces a mean of μᵢ = mᵢπᵢ and variance μᵢ(1 − μᵢ/mᵢ). Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having a common pattern of values for x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>∣z∣    [95% conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, many thousands of observations being grouped into a small number of rows or observations as above described. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ}/[1 + exp{−(x − μ)/τ}]², −∞ < x < ∞, (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹, −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹, −∞ < x < ∞,
h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^(α−1)/[1 + (t/θ)^α], t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which, in turn, by elementary calculus, implies that

logit(F(x)) = loge[F(x)/(1 − F(x))] = x, (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = loge[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
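A minimal simulation sketch of this inversion method (an added illustration with arbitrary values μ = 2, τ = 1.5), checking the mean and the variance τ²π²/3:

import numpy as np

rng = np.random.default_rng(8)
mu, tau = 2.0, 1.5
u = rng.uniform(size=1_000_000)
x = mu + tau * np.log(u / (1 - u))   # tau*X + mu with X standard logistic

print(x.mean(), mu)                       # mean is mu
print(x.var(), tau**2 * np.pi**2 / 3)     # variance is tau^2 * pi^2 / 3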

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = loge[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association of Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with Lorenz ordering.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; by the Gini index, L₁(p) has greater inequality than L₂(p)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
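The following minimal sketch (not from the entry itself; the lognormal sample is purely illustrative) computes an empirical Lorenz curve and the Gini index as one minus twice the area under the curve:

```python
import numpy as np

def lorenz_curve(incomes):
    """Empirical Lorenz curve: population share p vs cumulative income share L(p)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))   # L(0) = 0, L(1) = 1
    p = np.linspace(0.0, 1.0, len(L))
    return p, L

def gini(incomes):
    """Gini index G = 1 - 2 * (area under the Lorenz curve), so 0 <= G <= 1."""
    p, L = lorenz_curve(incomes)
    return 1.0 - 2.0 * np.trapz(L, p)

# Illustrative skewed income distribution (lognormal).
incomes = np.random.default_rng(0).lognormal(mean=0.0, sigma=0.8, size=10_000)
print(f"Gini index: {gini(incomes):.3f}")
```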

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous, monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
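A short numerical check of cases (I) and (II) of the Theorem, under the assumption of a lognormal income distribution and two hypothetical rules u(x) = x^0.8 (progressive, u(x)/x decreasing) and u(x) = 0.7x (flat tax, u(x)/x constant):

```python
import numpy as np

def gini(x):
    """Gini index from the identity G = 2*sum(i*x_(i)) / (n*sum(x)) - (n+1)/n."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return 2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1) / n

rng = np.random.default_rng(1)
pre_tax = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)

progressive = pre_tax ** 0.8   # u(x)/x = x^{-0.2} decreasing: case (I)
flat = 0.7 * pre_tax           # u(x)/x = 0.7 constant: case (II)

print(f"pre-tax Gini:     {gini(pre_tax):.4f}")
print(f"progressive Gini: {gini(progressive):.4f}")  # reduced
print(f"flat-tax Gini:    {gini(flat):.4f}")         # unchanged
```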

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica :–
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ :–
Jakobsson U () On the measurement of the degree of progression. J Publ Econ :–
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica :–
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser :–

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise the value gives the loss which is suffered by the decision being unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative, Θ = Θ₀ + Θ₁. The usual loss function is then given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
L(θ, ϑ) = 1 if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if |t| ≤ γ, and ℓ(t) = 2γ|t| − γ² if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a ≠ 0 and b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, ..., y_n, observed at design points t₁, ..., t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation', i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual in the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
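As a sketch of how these losses look in code (the data and the choice γ = 1, a = b = 1 are illustrative), the following compares the squared-loss location fit (the sample mean) with fits under the Huber and LinEx losses:

```python
import numpy as np
from scipy.optimize import minimize

def huber(t, gamma=1.0):
    """Huber loss: t^2 for |t| <= gamma, 2*gamma*|t| - gamma^2 beyond."""
    a = np.abs(t)
    return np.where(a <= gamma, t**2, 2 * gamma * a - gamma**2)

def linex(t, a=1.0, b=1.0):
    """LinEx loss b*(exp(a*t) - a*t - 1): exponential on one side, linear on the other."""
    return b * (np.exp(a * t) - a * t - 1.0)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 95), 20.0 + rng.normal(0.0, 1.0, 5)])

mean_fit = y.mean()                                              # squared loss
huber_fit = minimize(lambda m: huber(y - m).sum(), x0=0.0).x[0]  # robust to outliers
linex_fit = minimize(lambda m: linex(y - m).sum(), x0=0.0).x[0]  # asymmetric
print(f"square: {mean_fit:.2f}  Huber: {huber_fit:.2f}  LinEx: {linex_fit:.2f}")
```

The Huber fit stays near the bulk of the data despite the outliers, illustrating its robustness relative to the squared loss.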

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis :–
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –


Likelihood

Further Reading
The book by Cox and Hinkley () gives a detailed account of likelihood inference and principles of statistical inference; see also Cox (). There are several book-length treatments of likelihood inference, including Edwards (), Azzalini (), Pawitan (), and Severini (); this last discusses higher-order asymptotic theory in detail, as do Barndorff-Nielsen and Cox () and Brazzale, Davison and Reid (). A short review paper is Reid (). An excellent overview of consistency results for maximum likelihood estimators is Scholz (); see also Lehmann and Casella (). Foundational issues surrounding likelihood inference are discussed in Berger and Wolpert ().

About the Author
Professor Reid is a Past President of the Statistical Society of Canada, and she has served as the President of the Institute of Mathematical Statistics. Among many awards, she received the Emanuel and Carol Parzen Prize for Statistical Innovation "for leadership in statistical science, for outstanding research in theoretical statistics and highly accurate inference from the likelihood function, and for influential contributions to statistical methods in biology, environmental science, high energy physics and complex social surveys." She was awarded the Gold Medal of the Statistical Society of Canada and the Florence Nightingale David Award of the Committee of Presidents of Statistical Societies. She is Associate Editor of Statistical Science, Bernoulli, and Metrika.

Cross References
7Bayesian Analysis or Evidence Based Statistics?
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Bayesian vs. Classical Point Estimation: A Comparative Overview
7Chi-Square Test: Analysis of Contingency Tables
7Empirical Likelihood Approach to Inference from Sample Survey Data
7Estimation
7Fiducial Inference
7General Linear Models
7Generalized Linear Models
7Generalized Quasi-Likelihood (GQL) Inferences
7Mixture Models
7Philosophical Foundations of Statistics
7Risk Analysis
7Statistical Evidence
7Statistical Inference
7Statistical Inference: An Overview
7Statistics: An Overview
7Statistics: Nelder's View
7Testing Variance Components in Mixed Linear Models
7Uniform Distribution in Statistics

References and Further Reading
Azzalini A () Statistical inference. Chapman and Hall, London
Barndorff-Nielsen OE, Cox DR () Inference and asymptotics. Chapman and Hall, London
Berger JO, Wolpert R () The likelihood principle. Institute of Mathematical Statistics, Hayward
Birnbaum A () On the foundations of statistical inference. J Am Stat Assoc :–
Brazzale AR, Davison AC, Reid N () Applied asymptotics. Cambridge University Press, Cambridge
Cox DR () Regression models and life tables. J R Stat Soc B :– (with discussion)
Cox DR () Principles of statistical inference. Cambridge University Press, Cambridge
Cox DR, Hinkley DV () Theoretical statistics. Chapman and Hall, London
Davison AC, Hinkley DV, Worton B () Bootstrap likelihoods. Biometrika :–
Edwards AF () Likelihood. Oxford University Press, Oxford
Fisher RA () On the mathematical foundations of theoretical statistics. Phil Trans R Soc A :–
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat :–
Murphy SA, Van der Vaart A () Semiparametric likelihood ratio inference. Ann Stat :–
Owen A () Empirical likelihood ratio confidence intervals for a single functional. Biometrika :–
Pawitan Y () In all likelihood. Oxford University Press, Oxford
Reid N () The roles of conditioning in inference. Stat Sci :–
Reid N () Likelihood. J Am Stat Assoc :–
Royall RM () Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall RM, Tsou TS () Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. J R Stat Soc B
Scholz F () Maximum likelihood estimation. In: Encyclopedia of statistical sciences. Wiley, New York
Severini TA () Likelihood methods in statistics. Oxford University Press, Oxford
Van der Vaart AW () Infinite-dimensional likelihood methods in statistics. http://www.stieltjes.org/archief/biennial/frame/node.html
Varin C () On composite marginal likelihood. Adv Stat Anal :–

Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory, which at the same time has the greatest impact on the numerous applications of the latter.

By its very nature, Probability Theory is concerned with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, ... is a sequence of i.i.d. random variables,

S_n = Σ_{j=1}^n X_j,

and the expectation a = EX exists, then n⁻¹S_n → a (almost surely, i.e., with probability 1). Thus the value na can be called the first order approximation for the sums S_n. The Central Limit Theorem (CLT) gives a more precise approximation for S_n. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζ_n = (S_n − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζ_n < x) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.

The quantity nEX + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for S_n.
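A small simulation illustrating the CLT statement (the choice of Exp(1) summands, sample size, and replication count is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Check P(zeta_n < x) ~ Phi(x) for standardized sums of i.i.d. Exp(1) variables.
rng = np.random.default_rng(42)
n, reps = 200, 20_000
a = sigma = 1.0                       # mean and standard deviation of Exp(1)

S_n = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
zeta_n = (S_n - n * a) / (sigma * np.sqrt(n))

for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.0f}: empirical {np.mean(zeta_n < x):.4f} vs Phi {norm.cdf(x):.4f}")
```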

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late 1690s (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular for the theory of errors of observations), while later in that century further breakthroughs in both the methodology and the applicability range of the CLT were achieved by P.L. Chebyshev () and A.M. Lyapunov ().

The main directions in which the two aforementioned main LTs have been extended and refined since then are the following:

• Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity, such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum S_n/σ(n), where σ(n) = P^{−1}(n^{−1}), P^{−1} being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

• Relaxing the assumption that the X_j's are identically distributed and proceeding to study the so-called triangular array scheme, where the distributions of the summands X_j = X_{j,n} forming the sum S_n depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of S_n (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on convergence in distribution of the number of occurrences of rare events to a Poisson law.

• Relaxing the assumption of independence of the X_j's. Several types of "weak dependence" assumptions on the X_j under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

• Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζ_n < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

• Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

• Considering observations X₁, ..., X_n of a more complex nature, first of all multivariate random vectors. If X_j ∈ ℝ^d, then the role of the limit law in the CLT is played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X₁, ..., X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko-Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN, since, as n → ∞,

a* = (1/n) Σ_{j=1}^n X_j → a,
(σ²)* = (1/n) Σ_{j=1}^n (X_j − a*)² = (1/n) Σ_{j=1}^n X_j² − (a*)² → σ²

(almost surely).

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
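For instance, a CLT-based asymptotic 95% confidence interval for a = EX can be sketched as follows (the Gamma sample stands in for the unknown F; all choices are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=1.5, size=1_000)   # sample from an "unknown" F

a_star = x.mean()                  # a* from the display above
sigma_star = x.std(ddof=0)         # square root of (sigma^2)*
half = norm.ppf(0.975) * sigma_star / np.sqrt(len(x))
print(f"asymptotic 95% CI for a = EX: [{a_star - half:.3f}, {a_star + half:.3f}]")
```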

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing, one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory. Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for the objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

1. Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk), under the assumption that EX = 0. If θ > 0, then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ > 0, then, in the above example, one has

P(S > x) ∼ c e^{−μx}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.

This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes {ζ_n(t) = (S_⌊nt⌋ − ant)/(σ√n)}_{t∈[0,1]} converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, ..., S_n can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^∞ crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times, but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*_n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal "Siberian Advances in Mathematics" and Associate Editor of the journals "Theory of Probability and its Applications," "Siberian Mathematical Journal," "Mathematical Methods of Statistics," and "Electronic Journal of Probability."

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i | b_i ∼ N(X_iβ + Z_ib_i, Σ_i),   (1)
b_i ∼ N(0, D),   (2)

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is an (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_iβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_iβ and with covariance matrix V_i = Z_iDZ_i′ + Σ_i.

Note that the random effects in (1) implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_iDZ_i′ + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

y_i ∼ N(X_iβ, V_i), V_i = Z_iDZ_i′ + Σ_i,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of the unit-specific parameters b_i, the posterior distribution of b_i, given the observed data y_i, can be shown to be (multivariate) normal with mean vector equal to DZ_i′V_i^{−1}(α)(y_i − X_iβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes (EB) estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_iβ̂ + Z_ib̂_i of the i-th profile. It can easily be shown that

ŷ_i = Σ_iV_i^{−1}X_iβ̂ + (I_{n_i} − Σ_iV_i^{−1})y_i,

which can be interpreted as a weighted average of the population-averaged profile X_iβ̂ and the observed data y_i, with weights Σ_iV_i^{−1} and I_{n_i} − Σ_iV_i^{−1}, respectively. Note that the "numerator" Σ_i of Σ_iV_i^{−1} represents within-unit variability and the "denominator" V_i is the overall covariance matrix. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_iβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
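A minimal numerical sketch of the EB formula and its shrinkage, assuming a random-intercept model with D, σ², and β treated as known (in practice they would be replaced by their maximum likelihood estimates):

```python
import numpy as np

# Random-intercept model y_i = X_i beta + Z_i b_i + eps_i, with D, sigma^2, beta known.
rng = np.random.default_rng(3)
n_i, D, sigma2 = 5, 4.0, 1.0
beta = np.array([2.0])
X_i = np.ones((n_i, 1))               # intercept-only fixed effects
Z_i = np.ones((n_i, 1))               # unit-specific random intercept

b_i = rng.normal(0.0, np.sqrt(D))
y_i = X_i @ beta + Z_i[:, 0] * b_i + rng.normal(0.0, np.sqrt(sigma2), n_i)

V_i = D * (Z_i @ Z_i.T) + sigma2 * np.eye(n_i)                # V_i = Z D Z' + Sigma_i
b_hat = (D * Z_i.T) @ np.linalg.solve(V_i, y_i - X_i @ beta)  # D Z' V^{-1} (y - X beta)
print(f"true b_i = {b_i:.3f}, EB estimate = {b_hat[0]:.3f} (shrunk toward 0)")
```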

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at the Joint Statistical Meetings.

Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction. (F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y|x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f, known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies, embedded in a white noise background: y[t] = β₁ + β₂ sin[ω₁t] + β₃ sin[ω₂t] + ε. We write β₁ + β₂x₂ + β₃x₃ + ε for β = (β₁, β₂, β₃), x = (x₁, x₂, x₃), x₁ ≡ 1, x₂(t) = sin[ω₁t], x₃(t) = sin[ω₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y|x, β) = β₁ + β₂x₂ + β₃x₃, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and the associated values of the independent variables x; that is, we observe (y_i, sin[ω₁t_i], sin[ω₂t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = the column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose n × 3 entries are x_{ik} = x_k[t_i].
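In code, the design matrix and least squares fit for this signal model might look as follows (the frequencies ω₁, ω₂, the amplitudes, and the noise level are all hypothetical):

```python
import numpy as np

# Hypothetical frequencies, amplitudes, and noise level for the signal example.
w1, w2 = 1.0, 2.0
beta_true = np.array([0.5, 2.0, -1.0])

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
x = np.column_stack([np.ones_like(t), np.sin(w1 * t), np.sin(w2 * t)])  # design matrix
y = x @ beta_true + rng.normal(0.0, 0.3, t.size)                        # white noise

beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)
print(np.round(beta_ls, 3))   # close to beta_true
```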

Terminology
"Independent variables," as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as the Lasso or other constrained optimizations that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun, after it had been lost to view shortly following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields; see Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of the weight of parental sweet pea seed(s), y = standard score of the seed weights of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ∼ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and in later work he identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.

Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence, we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, testing hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as y = xβ + ε, an abbreviated matrix formulation of the system of equations

y_i = x_{i1}β₁ + ⋯ + x_{ip}β_p + ε_i, i ≤ n, (1)

in which the random errors ε are assumed to satisfy Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n. The interpretation is that response y_i is observed for the i-th sample in conjunction with numerical values (x_{i1}, ..., x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated) and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β_ML = β_LS = (x^tr x)^{−1} x^tr y = β + Mε, (2)

for M = (x^tr x)^{−1} x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)^{−1} the matrix inverse. These coefficient estimates β_LS are linear functions in y and satisfy the Gauss–Markov properties (3)–(4):

E(β_LS)_k = β_k, k ≤ p, (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3)–(4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
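A short numerical check of the identity β_LS = β + Mε in (2), with a purely illustrative simulated design and errors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(0.0, 1.0, n)
y = x @ beta + eps

M = np.linalg.inv(x.T @ x) @ x.T              # M = (x'x)^{-1} x'
beta_ls = M @ y
assert np.allclose(beta_ls, beta + M @ eps)   # the purely algebraic identity in (2)
print(np.round(beta_ls, 3))
```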

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β_LS)_k = β_k + M_kε and M_k(ε + c) = M_kε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ_LS) = (1 − R²)s²(y), where s(·) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβ_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ)-distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3)–(4) (any positive multiple of Σ will produce the same result), given by

β_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε. (5)

Generalized least squares solution (5) retains properties (3)–(4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
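A sketch of (5) in code, assuming an AR(1)-type error covariance known up to a multiple (the design and all parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.5])

rho = 0.8
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1)-type
y = x @ beta + np.linalg.cholesky(Sigma) @ rng.normal(size=n)

Si_x = np.linalg.solve(Sigma, x)                       # Sigma^{-1} x
beta_gls = np.linalg.solve(x.T @ Si_x, Si_x.T @ y)     # (x' S^{-1} x)^{-1} x' S^{-1} y
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
print("GLS:", np.round(beta_gls, 3), " OLS:", np.round(beta_ols, 3))
```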

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with the singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3)–(4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Jointly Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares

For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

$$E(y \mid x) = Ey + E\big((y - Ey) \mid x\big) = Ey + x\beta \quad \text{for every } x,$$

and the discrepancies y − E(y | x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.
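The linearity of this conditional expectation is easy to check by simulation; a rough sketch, with all numerical settings invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Bivariate normal (x, y): E(y | x) is linear in x, with slope
# cov(x, y)/var(x), the population least squares coefficient.
mean = [2.0, 1.0]
cov = [[1.0, 0.6], [0.6, 2.0]]
x, y = rng.multivariate_normal(mean, cov, size=200_000).T

beta = cov[0][1] / cov[0][0]                      # population slope
edges = np.linspace(0.5, 3.5, 7)
for lo, hi in zip(edges[:-1], edges[1:]):
    sel = (x >= lo) & (x < hi)
    xm = x[sel].mean()
    pred = mean[1] + beta * (xm - mean[0])        # linear prediction at bin center
    print(f"x~{xm:4.2f}  binned E(y|x)={y[sel].mean():6.3f}  linear={pred:6.3f}")
```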

Comparing Two Basic Linear Regression Models

Freedman compares the analysis of two superficially similar but differing models.

Errors model: the model above.

Sampling model: data (x_{i1}, ..., x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual discrepancies y − xβ̂_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to lie at certain fixed numbers of standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
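Such an exercise is easy to set up. Here is a rough sketch of the two bootstrap schemes for a straight-line fit: residual resampling for the errors model, and resampling of whole observations for the sampling model. The scheme names and the function are ours, not Freedman's own code.

```python
import numpy as np

def bootstrap_slopes(x, y, scheme="pairs", n_boot=2000, rng=None):
    """Bootstrap the straight-line LS slope under two schemes:
    'residuals' (errors model) resamples centered residuals onto the fit;
    'pairs' (sampling model) resamples whole observations (x_i, y_i)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    resid = resid - resid.mean()
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        if scheme == "residuals":
            X_star = X
            y_star = X @ beta + rng.choice(resid, size=n, replace=True)
        else:  # pairs
            idx = rng.integers(0, n, size=n)
            X_star, y_star = X[idx], y[idx]
        slopes[b] = np.linalg.lstsq(X_star, y_star, rcond=None)[0][1]
    return slopes

# Percentile confidence intervals from either scheme can then be compared.
```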

Balancing Good Fit Against Reproducibility

A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk). New results on data compression (Candès and Tao) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond

Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels, this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here. These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1{X > c} − 1{Y > c}. J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < cE[J], and so

$$E\big[Y\,1\{X > c\}\big] = E\big[Y\,1\{Y > c\} + YJ\big] = E\big[X\,1\{X > c\}\big] + E[YJ] < E\big[X\,1\{X > c\}\big] + cE[J] = E\big[X\,1\{X > c\}\big]$$

(the last equality using E[J] = 0, since X and Y are identically distributed), yielding E(Y | X > c) < E(X | X > c). ∎
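A simulation makes the proposition concrete; a small sketch with X, Y standard normal and correlation 0.5, an invented example.

```python
import numpy as np

rng = np.random.default_rng(3)

# X and Y identically distributed (standard normal) with correlation 0.5.
n = 1_000_000
w = rng.normal(size=n)
x = np.sqrt(0.5) * w + np.sqrt(0.5) * rng.normal(size=n)
y = np.sqrt(0.5) * w + np.sqrt(0.5) * rng.normal(size=n)

c = 1.0                                   # any c > max(0, EX) = 0
sel = x > c
print("E(X | X > c) =", x[sel].mean())    # about 1.53 for the standard normal
print("E(Y | X > c) =", y[sel].mean())    # strictly smaller: reversion
```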

Cross References
►Adaptive Linear Regression
►Analysis of Areal and Spatial Interaction Data
►Business Forecasting Methods
►Gauss-Markov Theorem
►General Linear Models
►Heteroscedasticity
►Least Squares
►Optimum Experimental Design
►Partial Least Squares Regression Versus Other Methods
►Regression Diagnostics
►Regression Models with Increasing Numbers of Unknown Parameters
►Ridge and Surrogate Ridge Regressions
►Simple Linear Regression
►Two-Stage Least Squares

References and Further Reading
Berk RA. Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B. Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat
Freedman D. Bootstrapping regression models. Ann Stat
Freedman D. Statistical models and shoe leather. Sociol Methodol
Freedman D. Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D. A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F. Typical laws of heredity. Nature
Galton F. Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O. Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A. The advanced theory of statistics: inference and relationship. Charles Griffin, London
LePage R, Podgórski K. Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E. An approach to time series analysis. Ann Math Stat
Rao CR. Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A. Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML. Statistical reversion toward the mean more universal than regression toward the mean. Am Stat
Stigler S. The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V. On-line regression competitive with reproducing kernel Hilbert spaces. Lecture Notes in Computer Science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor, University of Georgia, Athens, GA, USA

Suppose x^{(n)} = (x_1, ..., x_n) is a sample from a stochastic process x = {x_1, x_2, ...}. Let p_n(x^{(n)}; θ) denote the joint density function of x^{(n)}, where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio

$$\Lambda_n = \ln\left[\frac{p_n(x^{(n)}; \theta_n)}{p_n(x^{(n)}; \theta)}\right],$$

where θ_n = θ + n^{-1/2}h and h is a (k × 1) vector. The joint density p_n(x^{(n)}; θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

$$\Lambda_n = n^{-1/2}h^t S_n(\theta) - \tfrac{1}{2}h^t\big[n^{-1}J_n(\theta)\big]h + o_p(1), \qquad (1)$$

where

$$S_n(\theta) = \frac{d\ln p_n(x^{(n)}; \theta)}{d\theta}, \qquad J_n(\theta) = -\frac{d^2\ln p_n(x^{(n)}; \theta)}{d\theta\, d\theta^t},$$

and

$$(i)\ n^{-1/2}S_n(\theta) \xrightarrow{d} N_k(0, F(\theta)), \qquad (ii)\ n^{-1}J_n(\theta) \xrightarrow{p} F(\theta), \qquad (2)$$

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} N_k\big(0, F^{-1}(\theta)\big). \qquad (3)$$

A large class of models involving classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x^{(n)}; θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

$$J_n^{1/2}(\theta)\,(\hat\theta_n - \theta) \xrightarrow{d} N(0, I), \qquad (4)$$

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose that, conditionally on V = v, (x_1, x_2, ..., x_n) are i.i.d. N(θ, v^{-1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x^{(n)} is then given by

$$p_n(x^{(n)}; \theta) \propto \left[1 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\right]^{-(\frac{n}{2}+1)}.$$

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2), a Student t distribution with 2 degrees of freedom. It is interesting to note that the variance of the limit distribution of x̄ is ∞.
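This example is easy to check by simulation; a sketch with invented settings, using the fact that, given V = v, the sample mean is N(θ, 1/(nv)).

```python
import numpy as np

rng = np.random.default_rng(4)

# Conditionally on V = v, observations are N(theta, 1/v); V ~ Exp(mean 1).
# Check that sqrt(n)(xbar - theta) behaves like Student t with 2 df.
theta, n, reps = 0.0, 200, 50_000
v = rng.exponential(1.0, size=reps)
xbar = rng.normal(theta, 1.0 / np.sqrt(n * v))   # xbar | V=v ~ N(theta, 1/(n v))
stat = np.sqrt(n) * (xbar - theta)               # should be ~ t(2)

t2 = rng.standard_t(df=2, size=reps)
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(stat, q), np.quantile(t2, q))
```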

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, ..., with x_0 = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

$$p_n(x^{(n)}; \theta) \propto \exp\left[-\tfrac{1}{2}\sum_{t=1}^n (x_t - \theta x_{t-1})^2\right].$$

For the stationary case |θ| < 1, this model belongs to the LAN family. However, for |θ| > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

$$\hat\theta_n = \left(\sum_{t=1}^n x_t x_{t-1}\right)\left(\sum_{t=1}^n x_{t-1}^2\right)^{-1}.$$

One can verify that √n(θ̂_n − θ) →d N(0, 1 − θ²) for |θ| < 1, and that (θ² − 1)^{-1}θ^n(θ̂_n − θ) →d Cauchy for |θ| > 1.
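A sketch checking the stationary-case limit by simulation (all settings invented); the explosive case is noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(5)

def ar1_mle(theta, n, rng):
    """Simulate x_t = theta*x_{t-1} + e_t (x_0 = 0, e_t iid N(0,1)) and return
    the ML estimator sum(x_t x_{t-1}) / sum(x_{t-1}^2)."""
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

# Stationary case: sqrt(n)(theta_hat - theta) is approximately N(0, 1 - theta^2).
theta, n = 0.5, 500
est = np.array([ar1_mle(theta, n, rng) for _ in range(2000)])
print(np.var(np.sqrt(n) * (est - theta)), 1 - theta**2)

# Explosive case |theta| > 1: theta^n (theta_hat - theta)/(theta^2 - 1) has a
# heavy-tailed Cauchy-type limit, so its sample variance never settles down.
```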

About the Author

Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
►Asymptotic Normality
►Optimal Statistical Inference in Financial Engineering
►Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ. Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL. Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics, Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x-a}{b}\right), \quad a \in \mathbb{R},\ b > 0,$$

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},$$

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\, f\!\left(\frac{x-a}{b}\right)$$

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + bμ_Y and the standard deviation of X is √Var(X) = bσ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett, Blom, and Kimball. When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data; the estimates of a and b will then be the parameters of this line.
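A sketch of the probability-plotting estimates for the normal member of the family, using Blom-type plotting positions as an approximation to the expected reduced order statistics; the plotting-position constants are an assumption of this illustration, not a prescription of the entry.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

# Sample from a normal location-scale member with a = 10, b = 2.
a_true, b_true, n = 10.0, 2.0, 50
sample = np.sort(rng.normal(a_true, b_true, size=n))

# Blom-type plotting positions approximate E(Y_{i:n}) for the reduced normal.
i = np.arange(1, n + 1)
alpha = norm.ppf((i - 0.375) / (n + 0.25))

# Least-squares line through (alpha_i, X_{i:n}): intercept ~ a, slope ~ b.
b_hat, a_hat = np.polyfit(alpha, sample, 1)
print(a_hat, b_hat)
```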

This latter approach takes the order statistics X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n} as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n}, and, as a consequence, the disturbance terms ε_i, are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd, the generalized least squares method to find best linear unbiased estimators of a and b. Introducing the vectors and matrices

$$x = (X_{1:n}, \ldots, X_{n:n})^t, \quad \mathbf{1} = (1, \ldots, 1)^t, \quad \alpha = (\alpha_{1:n}, \ldots, \alpha_{n:n})^t, \quad \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^t,$$
$$\theta = (a, b)^t, \quad A = (\mathbf{1}\ \ \alpha),$$

the regression model now reads

$$x = A\theta + \varepsilon,$$

with variance–covariance matrix

$$\mathrm{Var}(x) = b^2 B,$$

where B is the matrix of covariances of the reduced order statistics. The GLS estimator of θ is

$$\hat\theta = (A^t B^{-1} A)^{-1} A^t B^{-1} x,$$

and its variance–covariance matrix reads

$$\mathrm{Var}(\hat\theta) = b^2 (A^t B^{-1} A)^{-1}.$$

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n}Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne. Maximum likelihood estimation for location–scale distributions is treated by Mi.
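For the exponential member, where closed forms are available, Lloyd's estimator can be sketched directly; the moment formulas in the comments are the standard exponential order-statistic moments, and the numerical settings are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

# Lloyd's GLS/BLUE of (a, b) for the exponential member
# F(x) = 1 - exp(-(x - a)/b), using the closed-form reduced moments:
#   E(Y_{i:n})          = sum_{r=1}^{i} 1/(n - r + 1)
#   Cov(Y_{i:n},Y_{j:n}) = sum_{r=1}^{min(i,j)} 1/(n - r + 1)^2
n = 40
r = np.arange(1, n + 1)
alpha = np.cumsum(1.0 / (n - r + 1))
v = np.cumsum(1.0 / (n - r + 1) ** 2)
B = np.minimum.outer(v, v)                 # Cov matrix: v[min(i, j)]
A = np.column_stack([np.ones(n), alpha])
Bi = np.linalg.inv(B)

a_true, b_true = 5.0, 3.0
x = np.sort(a_true + b_true * rng.exponential(size=n))
theta_hat = np.linalg.solve(A.T @ Bi @ A, A.T @ Bi @ x)
print(theta_hat)                           # close to (5, 3)
```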

About the Author

Dr. Horst Rinne is Professor Emeritus of Statistics and Econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and joined Volkswagen AG to do operations research, before being appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he later served as Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis, and is co-founder of the Econometric Board of the German Society of Economics. He is an Elected member of the International Statistical Institute and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability, among them The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
►Statistical Distributions: An Overview

References and Further Reading
Barnett V. Probability plotting methods and order statistics. Appl Stat
Barnett V. Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G. Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF. On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH. Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J. MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H. Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics, National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen. In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti), and different formulations may be appropriate in particular applications (Aitchison). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution, with logit(P_i) = log(P_i/(1 − P_i)) ~ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

$$R_i \mid P_i \sim \text{Binomial}(m_i, P_i),$$
$$\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle. Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams,

$$E[R_i] = m_i\pi_i \quad \text{and} \quad \mathrm{Var}(R_i) = m_i\pi_i(1-\pi_i)\big[1 + \sigma^2(m_i - 1)\pi_i(1-\pi_i)\big],$$

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, as the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
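A simulation sketch comparing the empirical variance of R with the small-σ approximation above (all settings invented).

```python
import numpy as np

rng = np.random.default_rng(8)

# Grouped binary data with a logit-normal random effect.
m, eta, sigma, reps = 20, 0.4, 0.3, 200_000
p = 1.0 / (1.0 + np.exp(-(eta + sigma * rng.normal(size=reps))))
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx, m * pi * (1 - pi))   # empirical, approximation, pure binomial
```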

About the Author

For biography, see the entry Logistic Distribution.

Cross References
►Logistic Distribution
►Mixed Membership Models
►Multivariate Normal Distributions
►Normal Distribution, Univariate

References and Further Reading
Agresti A. Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J. The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J. The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB. Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM. Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR. Generalized linear and mixed models. Wiley, New York
Williams D. Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods. Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{\alpha(\phi)} + c(y_i; \phi)\right\}, \qquad (2)$$

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\left\{y_i \ln\!\left(\frac{\pi_i}{1 - \pi_i}\right) + \ln(1 - \pi_i)\right\}. \qquad (3)$$

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^n \left\{y_i \ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) + \ln(1 - \mu_i)\right\}, \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^n \left\{y_i \ln(\mu_i) + (1 - y_i)\ln(1 - \mu_i)\right\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^n \left\{y_i \ln\!\left(\frac{y_i}{\mu_i}\right) + (1 - y_i)\ln\!\left(\frac{1 - y_i}{1 - \mu_i}\right)\right\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

$$x_i'\beta = \ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (7)$$

The value of μ_i for each observation in the logistic model is calculated as

$$\mu_i = \frac{1}{1 + \exp(-x_i'\beta)} = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \qquad (8)$$

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial\beta} = \sum_{i=1}^n (y_i - \mu_i)\,x_i, \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^n x_i x_i'\,\mu_i(1 - \mu_i). \qquad (10)$$
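These two quantities are all that is needed for a Newton–Raphson fit. A compact, illustrative Python sketch follows (not production code; the function name and data are ours).

```python
import numpy as np

def logistic_fit(X, y, tol=1e-10, max_iter=25):
    """Newton-Raphson for the binary logit model, using the gradient
    X'(y - mu) and Hessian -X'WX with W = diag(mu*(1 - mu))."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X
        step = np.linalg.solve(hess, -grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # SEs from observed information
    return beta, se

# Usage with invented data; exp(beta) gives the odds ratios discussed below.
rng = np.random.default_rng(9)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0])))))
print(logistic_fit(X, y))
```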

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs. adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of adults who survived divided by the number who did not; the odds of survival for children is defined likewise. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

► The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

► The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (1s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

$$f(y_i; \pi_i, m_i) = \binom{m_i}{y_i}\pi_i^{y_i}(1 - \pi_i)^{m_i - y_i}, \qquad (11)$$

with a corresponding log-likelihood function expressed as

$$L(\mu_i; y_i, m_i) = \sum_{i=1}^n \left\{y_i \ln\!\left(\frac{\mu_i}{1 - \mu_i}\right) + m_i \ln(1 - \mu_i) + \ln\binom{m_i}{y_i}\right\}. \qquad (12)$$

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_iπ_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the indicated values of x1, x2, and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology:

y     Odds ratio    OIM std. err.    z    P>|z|    [conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described, and modeling it would result in identical parameter estimates. It is not uncommon to find a large individual-based data set grouped into a small number of covariate-pattern rows as described above; data in tables is nearly always expressed in grouped format.
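The restructuring can be sketched as follows, with a small invented grouped data set of ten individual observations (the counts and covariate values here are hypothetical, not the lost entries of the table above).

```python
import numpy as np

# Hypothetical grouped records: (y successes, m cases, covariate pattern).
grouped = [
    (1, 3, [1, 0, 1]),
    (3, 4, [0, 1, 0]),
    (2, 3, [1, 1, 1]),
]

# Expand each covariate pattern into m individual 0/1 responses; a binary
# logistic fit to the expanded data gives the same parameter estimates as
# a binomial (grouped) fit to the original rows.
rows, resp = [], []
for y, m, xrow in grouped:
    for j in range(m):
        rows.append(xrow)
        resp.append(1 if j < y else 0)
X_ind = np.array(rows)
y_ind = np.array(resp)
print(X_ind.shape, y_ind.sum())   # 10 individual observations, 6 successes
```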

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe and the Hosmer and Lemeshow references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author

Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and an Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and of two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D. Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ. Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM. Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM. Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S. Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG. Logistic regression: a self-teaching guide. Springer, New York
Long JS. Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J. Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics, National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al.

The probability density function is

$$f(x) = \frac{1}{\tau}\cdot\frac{\exp\{-(x-\mu)/\tau\}}{\big[1 + \exp\{-(x-\mu)/\tau\}\big]^2}, \quad -\infty < x < \infty, \qquad (1)$$

and the cumulative distribution function is

$$F(x) = \big[1 + \exp\{-(x-\mu)/\tau\}\big]^{-1}, \quad -\infty < x < \infty.$$

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

$$S(x) = \big[1 + \exp\{(x-\mu)/\tau\}\big]^{-1}, \quad -\infty < x < \infty,$$
$$h(x) = \frac{1}{\tau}\,\big[1 + \exp\{-(x-\mu)/\tau\}\big]^{-1}.$$

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al.

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function based on the logistic distribution in (1) is

$$h(t) = \frac{(\alpha/\theta)(t/\theta)^{\alpha-1}}{1 + (t/\theta)^{\alpha}}, \quad t, \theta, \alpha > 0,$$

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard, associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

$$f(x) = F(x)\,\big[1 - F(x)\big],$$

which, in turn, by elementary calculus, implies that

$$\mathrm{logit}(F(x)) = \log_e\!\left[\frac{F(x)}{1 - F(x)}\right] = x, \qquad (2)$$

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ~ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
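A sketch of this simulation recipe, with invented settings, which also checks the mean and the variance τ²π²/3 stated above.

```python
import numpy as np

rng = np.random.default_rng(10)

# Simulate from the logistic distribution via X = log(U/(1-U)), U ~ U(0,1),
# then scale and shift by tau and mu for the general case.
mu, tau, n = 2.0, 0.5, 1_000_000
u = rng.uniform(size=n)
x = mu + tau * np.log(u / (1 - u))

print(x.mean(), mu)                        # mean is mu
print(x.var(), tau**2 * np.pi**2 / 3)      # variance is tau^2 pi^2 / 3
```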

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson, who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on probit analysis, and the same methods mapped across to logit analysis; see Finney for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

$$\mathrm{logit}(P(d)) = \log_e\!\left[\frac{P(d)}{1 - P(d)}\right] = \beta_0 + \beta_1 d,$$

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/|β_1|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn, although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of generalized linear models. A comprehensive treatment of logistic regression, including models and applications, is given in Agresti and in Hilbe.

About the Author

John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A. Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R. Statistical modelling in R. Oxford University Press, Oxford
Berkson J. Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ. Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM. Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus, Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

► Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

► A unique Lorenz curve corresponds to every distribution. The contrary does not hold: every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini. This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying 0 ≤ G ≤ 1.

Lorenz Curve. Fig. 1: Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2: Two intersecting Lorenz curves; by the Gini index, L_1(p) has greater inequality than L_2(p)

The higher the G value, the greater the inequality in the income distribution.
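An empirical Lorenz curve and Gini index can be computed in a few lines; a sketch with an invented lognormal income sample, function names ours.

```python
import numpy as np

def lorenz_points(income):
    """Empirical Lorenz curve: cumulative population share p_i versus
    cumulative income share L(p_i), sorting from poorest to richest."""
    x = np.sort(np.asarray(income, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    L = np.cumsum(x) / x.sum()
    return np.concatenate([[0.0], p]), np.concatenate([[0.0], L])

def gini(income):
    """Gini index: area between diagonal and Lorenz curve over the
    whole area under the diagonal."""
    p, L = lorenz_points(income)
    return 1.0 - 2.0 * np.trapz(L, p)

rng = np.random.default_rng(11)
print(gini(rng.lognormal(0.0, 1.0, size=100_000)))  # about 0.52 for sigma = 1
```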

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman; Jakobsson; Kakwani):

Theorem. Let u(x) be a continuous, monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds, and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert.
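The effect of conditions (I) and (II) is easy to see numerically; a sketch with an invented power tax u(x) = x^0.8 and a flat tax, using the mean-absolute-difference form of the Gini index.

```python
import numpy as np

rng = np.random.default_rng(12)

def gini(x):
    # Gini index via G = sum_i (2i - n - 1) x_(i) / (n^2 * mean(x)).
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * n * x.mean())

income = rng.lognormal(0.0, 1.0, size=100_000)

# Progressive tax: u(x)/x = x^{-0.2} decreasing (condition I), Gini falls.
post_tax = income ** 0.8
# Flat tax: u(x)/x constant (condition II), Gini unchanged.
flat_tax = 0.7 * income

print(gini(income), gini(post_tax), gini(flat_tax))
```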

About the Author

Dr. Johan Fellman is Professor Emeritus in Statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J. The effect of transformations on Lorenz curves. Econometrica
Gini C. Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ. Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U. On the measurement of the degree of progression. J Publ Econ
Kakwani NC. Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ. The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO. Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography, Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see Decision Theory: An Introduction, and Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline; the loss function is its essential component. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, so the value of the loss function is then zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

$$L : \Theta \times \Theta \to [0, \infty).$$

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative, Θ = Θ_0 + Θ_1. The usual loss function is then given by

$$L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0. \end{cases}$$

For point estimation problems we assume that Θ is a normed linear space and let |·| be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = |θ − ϑ|, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,

$$\ell(t) = \tfrac{1}{2}t^2 \ \text{ if } |t| \le \gamma, \qquad \ell(t) = \gamma|t| - \tfrac{1}{2}\gamma^2 \ \text{ if } |t| > \gamma,$$

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian introduced the LinEx losses,

$$\ell(t) = b\big(\exp(at) - at - 1\big),$$

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
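The losses named above can be written down directly; a sketch follows (the default γ below is a conventional robustness tuning choice, an assumption of this illustration rather than part of the entry).

```python
import numpy as np

def squared_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma=1.345):
    """Huber loss: quadratic near zero, linear in the tails."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    0.5 * t ** 2,
                    gamma * np.abs(t) - 0.5 * gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    """LinEx loss: asymmetric, exponential on one side of zero and
    approximately linear on the other."""
    return b * (np.exp(a * t) - a * t - 1.0)
```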

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

functions can be quite dierent Classically it was assumedthat the loss is convex (see 7RaondashBlackwell eorem)If the space Θ is not bounded then it seems to be moreconvenient in practice to assume that the loss is boundedwhich is also assumed in some branches of statistics Incase the loss is not continuous then it must be carefullydened to get no counter intuitive results in practice seeBischo ()In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss. In regression, however, the loss is used in a different

way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation,"

i.e., the function in F that minimizes ∑_{i=1}^n ℓ(r_{f,i}), f ∈ F,

where r_{f,i} = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J Savage. North-Holland, Amsterdam


Limit Theorems of Probability Theory

Alexandr Alekseevich Borovkov
Professor, Head of Probability and Statistics Chair at the Novosibirsk University
Novosibirsk University, Novosibirsk, Russia

Limit Theorems of Probability Theory is a broad name referring to the most essential and extensive research area in Probability Theory which, at the same time, has the greatest impact on the numerous applications of the latter. By its very nature, Probability Theory is concerned

with asymptotic (limiting) laws that emerge in a long series of observations on random events. Because of this, in the early twentieth century even the very definition of the probability of an event was given by a group of specialists (R. von Mises and some others) as the limit of the relative frequency of the occurrence of this event in a long row of independent random experiments. The "stability" of this frequency (i.e., that such a limit always exists) was postulated. After the 1930s, Kolmogorov's axiomatic construction of probability theory has prevailed. One of the main assertions in this axiomatic theory is the Law of Large Numbers (LLN) on the convergence of the averages of large numbers of random variables to their expectation. This law implies the aforementioned stability of the relative frequencies and their convergence to the probability of the corresponding event.

The LLN is the simplest limit theorem (LT) of probability theory, elucidating the physical meaning of probability. The LLN is stated as follows: if X₁, X₂, … is a sequence of i.i.d. random variables,

Sn = ∑_{j=1}^n Xj,

and the expectation a = EX exists, then n⁻¹Sn → a almost surely (i.e., with probability 1). Thus the value na can be called the first order approximation for the sums Sn. The Central Limit Theorem (CLT) gives one a more precise approximation for Sn. It says that if σ² = E(X − a)² < ∞, then the distribution of the standardized sum ζn = (Sn − na)/(σ√n) converges, as n → ∞, to the standard normal (Gaussian) law. That is, for all x,

P(ζn < x) → Φ(x) = (2π)^{−1/2} ∫_{−∞}^x e^{−t²/2} dt.

The quantity na + ζσ√n, where ζ is a standard normal random variable (so that P(ζ < x) = Φ(x)), can be called the second order approximation for Sn.
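As a small numerical illustration (a simulation sketch added here, with arbitrary parameter choices), one can check both approximations for Bernoulli sums:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 1000, 20000
    a, sigma = 0.5, 0.5                       # mean and sd of a fair Bernoulli variable
    S = rng.binomial(1, 0.5, size=(reps, n)).sum(axis=1)

    print(S.mean() / n)                       # LLN: close to a = 0.5
    zeta = (S - n * a) / (sigma * np.sqrt(n))
    print((zeta < 1.0).mean())                # CLT: close to Phi(1) = 0.8413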

The first LLN (for the Bernoulli scheme) was proved by Jacob Bernoulli in the late seventeenth century (published posthumously in 1713). The first CLT (also for the Bernoulli scheme) was established by A. de Moivre (first published in 1733 and referred to nowadays as the de Moivre–Laplace theorem). In the beginning of the nineteenth century, P.S. Laplace and C.F. Gauss contributed to the generalization of these assertions and to the appreciation of their enormous applied importance (in particular, for the theory of errors of observations), while later in that century further breakthroughs in both methodology and the applicability range of the CLT were achieved by P.L. Chebyshev (1887) and A.M. Lyapunov (1901).

The main directions in which the two aforementioned main LTs have been extended and refined since then are the following:

– Relaxing the assumption EX² < ∞. When the second moment is infinite, one needs to assume that the "tail" P(x) = P(X > x) + P(X < −x) is a function regularly varying at infinity such that the limit lim_{x→∞} P(X > x)/P(x) exists. Then the distribution of the normalized sum Sn/σ(n), where σ(n) = P⁻¹(n⁻¹), P⁻¹ being the generalized inverse of the function P (and we assume that EX = 0 when the expectation is finite), converges to one of the so-called stable laws as n → ∞. The 7characteristic functions of these laws have simple closed-form representations.

– Relaxing the assumption that the Xj's are identically distributed, and proceeding to study the so-called triangular array scheme, where the distributions of the summands Xj = Xj,n forming the sum Sn depend not only on j but on n as well. In this case, the class of all limit laws for the distribution of Sn (under suitable normalization) is substantially wider: it coincides with the class of the so-called infinitely divisible distributions. An important special case here is the Poisson limit theorem on the convergence in distribution of the number of occurrences of rare events to a Poisson law.

– Relaxing the assumption of independence of the Xj's. Several types of "weak dependence" assumptions on Xj under which the LLN and CLT still hold true have been suggested and investigated. One should also mention here the so-called ergodic theorems (see 7Ergodic Theorem) for a wide class of random sequences and processes.

– Refinement of the main LTs and derivation of asymptotic expansions. For instance, in the CLT, bounds on the rate of convergence P(ζn < x) − Φ(x) → 0 and asymptotic expansions for this difference (in the powers of n^{−1/2} in the case of i.i.d. summands) have been obtained under broad assumptions.

– Studying large deviation probabilities for the sums Sn (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζn > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζn > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/n as n → ∞.

– Considering observations X₁, …, Xn of a more complex nature, first of all multivariate random vectors. If Xj ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with the covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*ₙ = (X₁, …, Xn) be a sample from a distribution F and F*n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see 7Glivenko–Cantelli Theorems), stating that sup_u ∣F*n(u) − F(u)∣ → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*ₙ of a large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) ∑_{j=1}^n Xj → a,

(σ²)* = (1/n) ∑_{j=1}^n (Xj − a*)² = (1/n) ∑_{j=1}^n Xj² − (a*)² → σ²

almost surely.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
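For instance (an illustrative sketch added here, not from the original entry; the exponential sample is an arbitrary choice of F), the asymptotic 95% confidence interval a* ± 1.96 √((σ²)*/n) for a = EX takes a few lines:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    x = rng.exponential(scale=2.0, size=n)       # a sample from some F with EX = 2
    a_star = x.sum() / n                         # consistent for a = EX
    var_star = (x**2).sum() / n - a_star**2      # consistent for sigma^2
    half = 1.96 * np.sqrt(var_star / n)          # 1.96 ~ the 97.5% normal quantile
    print(a_star - half, a_star + half)          # asymptotic 95% CI for a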

The theorem on the 7asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ()). Furthermore, in

estimation theory and hypothesis testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications. It is worth noting that summation of random variables

is by no means the only situation in which LTs appear in Probability Theory. Generally speaking, the main objective of Probability

Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should

be noted here.

1. Suppose that the unknown distribution Fθ depends on a parameter θ such that, as θ approaches some "critical" value θ₀, the distributions Fθ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for Fθ which is valid for the values of θ that are close to θ₀. For instance, in actuarial studies, 7queueing theory, and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(Sk − θk) under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ almost surely. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞, then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²}, x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation E e^{μ(X−θ)} = 1 has a solution μ₀ > 0, then in the above example one has

P(S > x) ∼ c e^{−μ₀x}, x → ∞,

where c is a known constant. If the distribution F of X is subexponential (in this case E e^{μ(X−θ)} = ∞ for any μ > 0), then

P(S > x) ∼ (1/θ) ∫_x^∞ (1 − F(t)) dt, x → ∞.


This LT enables one to find approximations for P(S > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes ζn(t) = (S_{⌊nt⌋} − ant)/(σ√n), t ∈ [0, 1], converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S₁, …, Sn can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the Iterated Logarithm, which states that, for an arbitrary ε > 0, the random walk {Sk} (with a = 0) crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) only finitely many times, with probability 1. Similar results hold true for the trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" {√n(F*n(u) − F(u))}_{u∈(−∞,∞)} converges in distribution to {w⁰(F(u))}_{u∈(−∞,∞)}, where w⁰(t) = w(t) − t w(1) is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*n(u).

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at the Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal Siberian Advances in Mathematics and Associate Editor of the journals Theory of Probability and its Applications, Siberian Mathematical Journal, Mathematical Methods of Statistics, and Electronic Journal of Probability.

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko–Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M () Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one


or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yi ∣ bi ∼ N(Xiβ + Zibi, Σi), (1)
bi ∼ N(0, D), (2)

where Xi and Zi are (ni × p)- and (ni × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters called the fixed effects, D is a general (q × q) covariance matrix, and Σi is a (ni × ni) covariance matrix which depends on i only through its dimension ni, i.e., the set of unknown parameters in Σi will not depend upon i. Finally, bi is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yi of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bi), while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bi) = 0 implies that the mean of yi still equals Xiβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yi also follows a normal distribution with mean vector Xiβ and with covariance matrix Vi = ZiDZi′ + Σi. Note that the random effects in (1) implicitly imply the

marginal covariance matrix Vi of yi to be of the very specific form Vi = ZiDZi′ + Σi. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σi = σ²I_{ni}. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zi which are ni-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yi ∼ N(Xiβ, Vi), Vi = ZiDZi′ + Σi,

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix Vi.

With respect to the estimation of unit-specific parameters bi, the posterior distribution of bi given the observed data yi can be shown to be (multivariate) normal with mean vector equal to DZi′Vi⁻¹(α)(yi − Xiβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂i for the bi. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷi ≡ Xiβ̂ + Zib̂i of the ith profile. It can easily be shown that

ŷi = ΣiVi⁻¹Xiβ̂ + (I_{ni} − ΣiVi⁻¹) yi,

which can be interpreted as a weighted average of the population-averaged profile Xiβ̂ and the observed data yi, with weights ΣiVi⁻¹ and I_{ni} − ΣiVi⁻¹, respectively. Note that the "numerator" of ΣiVi⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vi. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xiβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂i) ≤ var(λ′bi) = λ′Dλ.
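A small numpy sketch (added here for illustration; the random-intercept setup and all variance values are hypothetical) computes the EB estimate and verifies the weighted-average form of the prediction:

    import numpy as np

    ni, sigma2, d = 5, 1.0, 2.0                          # illustrative variances
    Xi = np.column_stack([np.ones(ni), np.arange(ni)])   # (ni x p) fixed-effects design
    beta_hat = np.array([1.0, 0.5])                      # stand-in for the ML estimate of beta
    Zi = np.ones((ni, 1))                                # random intercept: Z_i a column of ones
    D = np.array([[d]])
    Sigma_i = sigma2 * np.eye(ni)
    Vi = Zi @ D @ Zi.T + Sigma_i                         # marginal covariance of y_i

    rng = np.random.default_rng(2)
    yi = rng.multivariate_normal(Xi @ beta_hat, Vi)

    Vinv = np.linalg.inv(Vi)
    b_hat = D @ Zi.T @ Vinv @ (yi - Xi @ beta_hat)       # empirical Bayes estimate of b_i
    y_pred = Xi @ beta_hat + (Zi @ b_hat).ravel()        # prediction of the i-th profile
    y_alt = Sigma_i @ Vinv @ (Xi @ beta_hat) + (np.eye(ni) - Sigma_i @ Vinv) @ yi
    print(np.allclose(y_pred, y_alt))                    # the shrinkage identity holds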

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics, and is currently Co-Editor of Biostatistics. He was President of the International Biometric Society, and received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

"I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction."

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y ∣ x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and the variables x include levels of water, sunlight, and fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ν₁, ν₂, embedded in white noise background: y[t] = β₀ + β₁ sin[ν₁t] + β₂ sin[ν₂t] + ε. We write β₀ + β₁x₁ + β₂x₂ + ε for β = (β₀, β₁, β₂), x = (x₀, x₁, x₂), x₀ ≡ 1, x₁(t) = sin[ν₁t], x₂(t) = sin[ν₂t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y ∣ x, β) = β₀ + β₁x₁ + β₂x₂, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0 for i ≤ n, and Eεᵢεₖ ≡ 0 for i ≠ k ≤ n. Errors in linear regression models typically depend upon the instances i at which we select observations and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated and that correlation depends upon the x values). What we observe are yi and the associated values of the independent variables x. That is, we observe (yi, sin[ν₁ti], sin[ν₂ti]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector (yi, i ≤ n), likewise for ε, and x (the design matrix) the matrix whose entries are xik = xk[ti].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view following its discovery only weeks before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast, Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of the weight of parental sweet pea seed(s), y = standard score of the seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found to his astonishment a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data sy ∼ sw, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r sy/sw = r (the correlation). Medians m(w) being essentially equal to the means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (2008), Berk (2004), Freedman (1991, 2005).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations in which the random errors ε are assumed to satisfy Eεᵢ ≡ 0, Eεᵢ² ≡ σ² > 0, i ≤ n:

yi = xi1β1 + ⋯ + xipβp + εi, i ≤ n. (1)

The interpretation is that response yi is observed for the ith sample in conjunction with numerical values (xi1, …, xip) of the independent variables. If these errors εᵢ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βk, k ≤ p, are produced by ordinary least squares (LS) as follows:

βML = βLS = (x^tr x)⁻¹ x^tr y = β + Mε (2)

for M = (x^tr x)⁻¹ x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)⁻¹ the matrix inverse. These coefficient estimates βLS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(βLS)k = βk, k ≤ p, (3)

and, among all unbiased estimators β*k (of βk) that are linear in y,

E((βLS)k − βk)² ≤ E(β*k − βk)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
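As an illustration (a simulation sketch added here; the frequencies, amplitudes, and noise level are arbitrary assumptions), the two-frequency sine model of the opening section can be fit by formula (2):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    t = np.linspace(0.0, 10.0, n)
    nu1, nu2 = 1.0, 2.5                          # hypothetical known frequencies
    x = np.column_stack([np.ones(n), np.sin(nu1 * t), np.sin(nu2 * t)])  # design matrix
    beta = np.array([0.3, 2.0, -1.0])            # true coefficients, for the simulation
    y = x @ beta + rng.normal(0.0, 0.5, n)       # white-noise errors

    M = np.linalg.inv(x.T @ x) @ x.T             # M = (x^tr x)^{-1} x^tr
    beta_ls = M @ y                              # equation (2): beta + M eps
    print(beta_ls)                               # close to beta, unbiased by (3)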

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then βLS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mk, k > 1, of M sums to zero (i.e., Mk is a contrast), so (βLS)k = βk + Mkε and Mk(ε + c) = Mkε for any constant c; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβLS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβLS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(βLS − β), conditional on the 7order statistics of ε (see LePage and Podgorski). Freedman and Lane (1983) advocated tests based on a permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

βML = (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ y = β + (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
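A brief numerical sketch of (5) (added here; the AR(1)-type correlation structure and all constants are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100
    x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
    beta = np.array([1.0, 0.1])
    rho = 0.6                                    # Sigma known up to a constant multiple
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    y = x @ beta + rng.multivariate_normal(np.zeros(n), Sigma)

    Si = np.linalg.inv(Sigma)
    beta_gls = np.linalg.inv(x.T @ Si @ x) @ x.T @ Si @ y   # equation (5)
    print(beta_gls)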

Reproducing Kernel Generalized Linear Regression Model
Parzen (1961) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = {ε(t), t ∈ T} has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σᵢ βᵢ wᵢ(t), t ∈ T,

where the wᵢ(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with the singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + (x − Ex)β for every x,

and the discrepancies y − E(y ∣ x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman (1981) compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εi are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman (1981) established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.
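The two bootstraps are easy to contrast numerically. The sketch below (added for illustration; the data-generating line and all constants are arbitrary) resamples residuals for the errors model and resamples (w, y) pairs for the sampling model:

    import numpy as np

    def ls_line(w, y):
        X = np.column_stack([np.ones_like(w), w])
        return np.linalg.lstsq(X, y, rcond=None)[0]   # (intercept, slope)

    rng = np.random.default_rng(5)
    n = 100
    w = rng.normal(size=n)
    y = 0.33 * w + rng.normal(scale=0.9, size=n)      # "low correlation" data
    a, b = ls_line(w, y)
    resid = y - (a + b * w)

    slopes_err, slopes_pairs = [], []
    for _ in range(2000):
        e = rng.choice(resid, size=n, replace=True)   # residual (errors-model) bootstrap
        slopes_err.append(ls_line(w, a + b * w + e)[1])
        idx = rng.integers(0, n, size=n)              # pairs (sampling-model) bootstrap
        slopes_pairs.append(ls_line(w[idx], y[idx])[1])

    print(np.percentile(slopes_err, [2.5, 97.5]))     # two 95% intervals for the slope
    print(np.percentile(slopes_pairs, [2.5, 97.5]))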

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear; going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(EX, 0). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(EX, 0), define the difference J = 1(X > c) − 1(Y > c) of indicator random variables. J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + c EJ = E X1(X > c),

yielding E(Y ∣ X > c) < E(X ∣ X > c). ◻
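A two-line simulation makes the proposition concrete (an added sketch; exponential variables and the cutoff c = 2 are arbitrary choices satisfying c > max(EX, 0) = 1):

    import numpy as np

    rng = np.random.default_rng(6)
    N = 10**6
    X = rng.exponential(size=N)          # X, Y identically distributed (here independent)
    Y = rng.exponential(size=N)
    c = 2.0
    sel = X > c
    print(Y[sel].mean(), X[sel].mean())  # E(Y|X>c) < E(X|X>c): reversion to mediocrity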

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat
Freedman D (1981) Bootstrapping regression models. Ann Stat
Freedman D (1991) Statistical models and shoe leather. Sociol Methodol
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F (1877) Typical laws of heredity. Nature
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E (1961) An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x₁, …, xn) is a sample from a stochastic process x = {x₁, x₂, …}. Let pn(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n), θn)/pn(x(n), θ)], where θn = θ + n^{−1/2}h and h is a (k × 1) vector. The joint density pn(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^{−1/2} h^t Sn(θ) − (1/2) n^{−1} h^t Jn(θ) h + op(1), (1)

where Sn(θ) = d ln pn(x(n), θ)/dθ, Jn(θ) = −d² ln pn(x(n), θ)/(dθ dθ^t), and

(i) n^{−1/2} Sn(θ) →d Nk(0, F(θ)), (ii) n^{−1} Jn(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random; see LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂n − θ) →d Nk(0, F⁻¹(θ)). (3)

A large class of models involving classical iid (independent and identically distributed) observations is covered by the LAN framework. Many time series models and 7Markov processes are also included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family. For the LAMN family, one can replace the norm √n by a random norm Jn^{1/2}(θ) to get the limiting normal distribution, viz.,

Jn^{1/2}(θ)(θ̂n − θ) →d N(0, I), (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given

below.

Example 1: Variance mixture of normals
Suppose, conditionally on V = v, the variables (x₁, x₂, …, xn) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n), θ) ∝ [1 + (1/2) ∑_{i=1}^n (xi − θ)²]^{−(n/2+1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process
Consider a first-order autoregressive process {xt} defined by xt = θx_{t−1} + et, t = 1, 2, …, with x₀ = 0, where the et are assumed to be iid N(0, 1) random variables. We then have

pn(x(n), θ) ∝ exp[−(1/2) ∑_{t=1}^n (xt − θx_{t−1})²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂n = (∑_{t=1}^n xt x_{t−1}) (∑_{t=1}^n x²_{t−1})⁻¹.

One can verify that √n(θ̂n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹ θ^n (θ̂n − θ) →d Cauchy for ∣θ∣ > 1.
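Both regimes of Example 2 are easy to simulate (an added sketch; the sample size, replication count, and parameter values are arbitrary, and it only illustrates the contrast between the normal and the heavy-tailed Cauchy-type limits):

    import numpy as np

    def ar1_mle(theta, n, rng):
        x = np.zeros(n + 1)
        for t in range(1, n + 1):
            x[t] = theta * x[t - 1] + rng.normal()
        return (x[1:] * x[:-1]).sum() / (x[:-1]**2).sum()

    rng = np.random.default_rng(7)
    n, reps = 200, 1000
    est = np.array([ar1_mle(0.5, n, rng) for _ in range(reps)])
    print((np.sqrt(n) * (est - 0.5)).std())        # ~ sqrt(1 - 0.5**2) = 0.866

    theta = 1.05                                    # explosive case |theta| > 1
    est = np.array([ar1_mle(theta, n, rng) for _ in range(reps)])
    z = theta**n * (est - theta) / (theta**2 - 1)   # Cauchy-type limit: heavy tails
    print(np.median(np.abs(z)))                     # ~ 1, the median of |Cauchy|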

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an elected member of the International Statistical Institute. He has co-authored two books, co-edited eight proceedings/monographs/special issues of journals, and authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = X minus ab

is called the reduced or standardized variable It has a = and b = If the distribution of X is absolutely continuouswith density function

fX(x ∣ a b) =dFX(x ∣ a b)

d xthen (a b) is a location scale-parameter for the distribu-tion of X if (and only if)

fX(x ∣ a b) =bf(x minus a

b)

for some density f (sdot) called the reduced density All distri-butions in a given family have the same shape ie the sameskewness and the same kurtosisWhenY hasmean microY andstandard deviation σY then the mean of X is E(X) = a +b microY and the standard deviation of X is

radicVar(X) = b σY
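As a small numerical illustration of these relations (a sketch; the standard logistic is an arbitrary choice of reduced distribution F):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 10.0, 2.0                      # location and scale, chosen for illustration

    u = rng.uniform(size=100_000)
    y = np.log(u / (1 - u))               # reduced variable: standard logistic
    x = a + b * y                         # member of the location-scale family

    # E(X) = a + b*mu_Y and sd(X) = b*sigma_Y; for the logistic, mu_Y = 0,
    # sigma_Y = pi/sqrt(3):
    print(x.mean(), a + b * y.mean())
    print(x.std(), b * np.pi / np.sqrt(3))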

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n}, and as a consequence the disturbance terms ε_i, are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, ..., X_{n:n})′,   1 = (1, 1, ..., 1)′,
α = (α_{1:n}, α_{2:n}, ..., α_{n:n})′,   ε = (ε_1, ε_2, ..., ε_n)′,
θ = (a, b)′,   A = (1  α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B is the variance–covariance matrix of the reduced order statistics Y_{i:n}. The GLS estimator of θ is

θ̂ = (A′ B⁻¹ A)⁻¹ A′ B⁻¹ x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ B⁻¹ A)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
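For the exponential case, α and B are available in closed form (for the standard exponential, α_{i:n} = Σ_{k≤i} 1/(n−k+1) and B_{ij} = Σ_{k≤min(i,j)} 1/(n−k+1)²; these are textbook results, not taken from the entry itself), so the GLS computation can be sketched as follows. This is an illustration, not code from Rinne's monograph:

    import numpy as np

    rng = np.random.default_rng(3)
    n, a_true, b_true = 50, 2.0, 3.0

    # Means and covariances of standard-exponential reduced order statistics:
    c = 1.0 / (n - np.arange(1, n + 1) + 1.0)
    alpha = np.cumsum(c)                 # alpha_{i:n}
    v = np.cumsum(c**2)                  # Var(Y_{i:n})
    B = np.minimum.outer(v, v)           # Cov(Y_{i:n}, Y_{j:n}) = v[min(i, j)]

    x = np.sort(a_true + b_true * rng.exponential(size=n))   # ordered sample
    A = np.column_stack([np.ones(n), alpha])

    Binv = np.linalg.inv(B)
    theta_hat = np.linalg.solve(A.T @ Binv @ A, A.T @ Binv @ x)
    print(theta_hat)                     # approximately (a_true, b_true)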

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, then joined Volkswagen AG to do operations research, and was subsequently appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was later Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
▶Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics, National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ▶beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

R_i | P_i ∼ Binomial(m_i, P_i),
logit(P_i) | Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ is small, then, as derived in Williams (),

E[R_i] = m_i π_i  and  Var(R_i) = m_i π_i (1 − π_i) [1 + σ² (m_i − 1) π_i (1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, the random effect Z allowing for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
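A small simulation sketch of this overdispersion effect (the parameter values are illustrative only):

    import numpy as np

    rng = np.random.default_rng(4)
    m, sigma, eta, reps = 20, 0.5, 0.3, 200_000

    # Draw P from the logit-normal, then R | P ~ Binomial(m, P).
    p = 1.0 / (1.0 + np.exp(-(eta + sigma * rng.normal(size=reps))))
    r = rng.binomial(m, p)

    pi = 1.0 / (1.0 + np.exp(-eta))
    binom_var = m * pi * (1 - pi)
    williams = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
    # The empirical variance exceeds the binomial variance and is close to
    # Williams' small-sigma approximation:
    print(r.var(), williams, binom_var)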

About the Author
For biography, see the entry ▶Logistic Distribution.

Cross References
▶Logistic Distribution
▶Mixed Membership Models
▶Multivariate Normal Distributions
▶Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1 − y_i}.   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ▶Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{ [y_i θ_i − b(θ_i)] / α(ϕ) + c(y_i; ϕ) },   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{ y_i ln(π_i/(1 − π_i)) + ln(1 − π_i) }.   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^{n} { y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i) },   (4)

or

L(μ_i; y_i) = Σ_{i=1}^{n} { y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i) }.   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^{n} { y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i)) }.   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.   (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)).   (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^{n} (y_i − μ_i) x_i,   (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^{n} x_i x′_i μ_i (1 − μ_i).   (10)
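Equations (9) and (10) translate directly into a Newton–Raphson iteration. The following sketch uses simulated data and is not tied to any particular software package:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + predictor
    beta_true = np.array([-0.5, 1.2])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta = np.zeros(2)
    for _ in range(25):                        # Newton-Raphson iterations
        mu = 1 / (1 + np.exp(-X @ beta))       # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                  # gradient, Eq. (9)
        hess = -(X.T * (mu * (1 - mu))) @ X    # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                     # beta_new = beta - H^{-1} g
        if np.max(np.abs(step)) < 1e-10:
            break

    se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # standard errors from the Hessian
    print(beta, se)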

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)
Survived | child | adults | Total
no       |       |        |
yes      |       |        |
Total    |       |        |

The odds of survival for adult passengers is the number of adults surviving divided by the number not surviving; the odds of survival for children is obtained in the same way. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived | Odds ratio | Std. err. | z | P>|z| | [95% conf. interval]
age      |            |           |   |       |

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

▶ The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

▶ The odds of surviving is one and a half percent greater for each increasing year of age.
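The arithmetic of a 2 by 2 odds ratio can be sketched as follows; the counts used here are purely hypothetical, not the Titanic figures:

    import numpy as np

    # Hypothetical 2x2 table of survival by age group (NOT the Titanic counts):
    #                 survived  died
    adults   = np.array([300, 700])
    children = np.array([ 60,  40])

    odds_adult = adults[0] / adults[1]        # about 0.43
    odds_child = children[0] / children[1]    # 1.50
    odds_ratio = odds_adult / odds_child
    print(odds_ratio)                         # about 0.29 for these made-up counts

    # Wald confidence interval via the log odds ratio and its standard error:
    se = np.sqrt((1 / np.array([*adults, *children])).sum())
    ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se)
    print(ci)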

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},   (11)

where C(m_i, y_i) is the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^{n} { y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln C(m_i, y_i) }.   (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i (1 − μ_i/m_i).

Consider the data below:

y | cases | x1 | x2 | x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having that specific covariate pattern. The first observation in the table informs us that there are three cases sharing a common pattern of predictor values; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y  | Odds ratio | OIM std. err. | z | P>|z| | [95% conf. interval]
x1 |            |               |   |       |
x2 |            |               |   |       |
x3 |            |               |   |       |

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described, and modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a small number of covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution: the test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC) (see ▶Akaike's Information Criterion and ▶Akaike's Information Criterion: Background, Derivation, Properties and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and of two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for several years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▶Case-Control Studies
▶Categorical Data Analysis
▶Generalized Linear Models
▶Multivariate Data Analysis: An Overview
▶Probit Analysis
▶Recursive Partitioning
▶Regression Models with Symmetrical Errors
▶Robust Regression Estimation in Generalized Linear Models
▶Statistics: An Overview
▶Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics, National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,   −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,   −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,   −∞ < x < ∞,

h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α],   t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x) [1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x,   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e [U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
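A sketch of this inverse-CDF simulation, checking the variance τ²π²/3 stated above (the location and scale values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    mu, tau = 1.0, 0.8                      # illustrative location and scale

    u = rng.uniform(size=200_000)
    x = mu + tau * np.log(u / (1 - u))      # Eq. (2): inverse-CDF sampling

    print(x.mean(), mu)                     # both ~1.0
    print(x.var(), tau**2 * np.pi**2 / 3)   # both ~2.1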

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ▶probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β_0 + β_1 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/|β_1|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ▶generalized linear models. A comprehensive treatment of ▶logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, of the Statistical Modelling Society, and of the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▶Asymptotic Relative Efficiency in Estimation
▶Asymptotic Relative Efficiency in Testing
▶Bivariate Distributions
▶Location-Scale Distributions
▶Logistic Regression
▶Multivariate Statistical Distributions
▶Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus, Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

▶ Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

▶ A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini ().

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves L_1(p) and L_2(p); using the Gini index, one of them has greater inequality than the other

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
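For an empirical sample, the Lorenz ordinates and the Gini index can be computed as in the following sketch (the lognormal incomes are illustrative):

    import numpy as np

    def lorenz_gini(income):
        # Sorted incomes; p = cumulative population shares, L = cumulative income shares.
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        p = np.arange(1, n + 1) / n
        L = np.cumsum(x) / x.sum()
        # Standard sample Gini index: G = 2*sum(i*x_(i))/(n*sum(x)) - (n+1)/n
        i = np.arange(1, n + 1)
        G = 2 * np.sum(i * x) / (n * x.sum()) - (n + 1) / n
        return p, L, G

    incomes = np.random.default_rng(7).lognormal(mean=0.0, sigma=0.8, size=10_000)
    p, L, G = lorenz_gini(incomes)
    print(G)   # for lognormal incomes, G = 2*Phi(sigma/sqrt(2)) - 1, about 0.43 here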

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem 1. Let u(x) be a continuous monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
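A numerical illustration of case (I) of the theorem, under a hypothetical progressive tax schedule u(x) (the rates and threshold are invented for the example):

    import numpy as np

    def gini(income):
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        i = np.arange(1, n + 1)
        return 2 * np.sum(i * x) / (n * x.sum()) - (n + 1) / n

    rng = np.random.default_rng(8)
    x = rng.lognormal(0.0, 0.8, size=10_000)      # pre-tax incomes (illustrative)

    # Hypothetical progressive rule: 30% marginal tax on income above 1.0,
    # so u(x)/x is monotone decreasing and condition (I) applies.
    post = np.where(x > 1.0, 1.0 + 0.7 * (x - 1.0), x)

    print(gini(x), gini(post))   # the post-tax Gini index is smaller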

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▶Econometrics
▶Economic Statistics
▶Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography, Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▶Decision Theory: An Introduction and ▶Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem: Θ is divided into two disjoint subsets Θ_0 and Θ_1 describing the null hypothesis and the alternative, Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0  if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1  if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space, and let |·| be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = |θ − ϑ|, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t² if |t| ≤ γ,  and  ℓ(t) = 2γ|t| − γ² if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses,

ℓ(t) = b (exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see ▶Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, ..., y_n, observed at design points t_1, ..., t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function in F that minimizes Σ_{i=1}^{n} ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
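The losses above, and their use as estimation criteria, can be sketched as follows (the grid search over a location parameter is purely illustrative):

    import numpy as np

    def squared(t):
        return t**2

    def absolute(t):
        return np.abs(t)

    def huber(t, gamma=1.0):
        # t^2 for |t| <= gamma, linear growth 2*gamma*|t| - gamma^2 beyond
        return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

    def linex(t, a=1.0, b=1.0):
        # asymmetric: exponential penalty on one side, roughly linear on the other
        return b * (np.exp(a * t) - a * t - 1)

    # Location estimation: minimize the total loss over a grid of candidates.
    rng = np.random.default_rng(9)
    y = rng.standard_t(df=3, size=500) + 5.0      # heavy-tailed sample centered at 5
    grid = np.linspace(3, 7, 2001)
    for loss in (squared, absolute, huber, linex):
        total = np.array([loss(y - c).sum() for c in grid])
        print(loss.__name__, round(grid[np.argmin(total)], 3))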

Cross References
▶Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▶Bayesian Statistics
▶Decision Theory: An Introduction
▶Decision Theory: An Overview
▶Entropy and Cross Entropy as Diversity and Distance Measures
▶Sequential Sampling
▶Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam


Limit Theorems of Probability Theory

asymptotic expansions for this difference (in powers of n^{−1/2} in the case of iid summands) have been obtained under broad assumptions.

Studying large deviation probabilities for the sums S_n (theorems on rare events). If x → ∞ together with n, then the CLT can only assert that P(ζ_n > x) → 0. Theorems on large deviation probabilities aim to find a function P(x, n) such that

P(ζ_n > x)/P(x, n) → 1 as n → ∞, x → ∞.

The nature of the function P(x, n) essentially depends on the rate of decay of P(X > x) as x → ∞ and on the "deviation zone," i.e., on the asymptotic behavior of the ratio x/√n as n → ∞.

Considering observations X_1, ..., X_n of a more complex nature, first of all multivariate random vectors. If X_j ∈ R^d, then the role of the limit law in the CLT will be played by a d-dimensional normal (Gaussian) distribution with covariance matrix E(X − EX)(X − EX)^T.

The variety of application areas of the LLN and CLT is enormous. Thus, Mathematical Statistics is based on these LTs. Let X*_n = (X_1, ..., X_n) be a sample from a distribution F and F*_n(u) the corresponding empirical distribution function. The fundamental Glivenko–Cantelli theorem (see ▶Glivenko–Cantelli Theorems), stating that sup_u |F*_n(u) − F(u)| → 0 almost surely as n → ∞, is of the same nature as the LLN and basically means that the unknown distribution F can be estimated arbitrarily well from a random sample X*_n of large enough size n.

The existence of consistent estimators for the unknown parameters a = EX and σ² = E(X − a)² also follows from the LLN since, as n → ∞,

a* = (1/n) Σ_{j=1}^{n} X_j → a  a.s.,

(σ²)* = (1/n) Σ_{j=1}^{n} (X_j − a*)² = (1/n) Σ_{j=1}^{n} X_j² − (a*)² → σ²  a.s.

Under additional moment assumptions on the distribution F, one can also construct asymptotic confidence intervals for the parameters a and σ², as the distributions of the quantities √n(a* − a) and √n((σ²)* − σ²) converge, as n → ∞, to normal ones. The same can also be said about other parameters that are "smooth" enough functionals of the unknown distribution F.
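A quick Monte Carlo sketch of such a CLT-based confidence interval for a (the Exp(1) population is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(10)
    n, reps = 400, 10_000
    a_true = 1.0                                   # mean of Exp(1)

    covered = 0
    for _ in range(reps):
        x = rng.exponential(a_true, size=n)
        a_star = x.mean()
        half = 1.96 * x.std(ddof=0) / np.sqrt(n)   # normal-approximation half-width
        covered += (a_star - half <= a_true <= a_star + half)
    print(covered / reps)                          # close to the nominal 0.95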

The theorem on the ▶asymptotic normality and asymptotic efficiency of maximum likelihood estimators is another classical example of LTs' applications in mathematical statistics (see, e.g., Borovkov ). Furthermore, in estimation theory and hypotheses testing one also needs theorems on large deviation probabilities for the respective statistics, as it is statistical procedures with small error probabilities that are often required in applications.

It is worth noting that summation of random variables is by no means the only situation in which LTs appear in Probability Theory.

Generally speaking, the main objective of Probability Theory in applications is finding appropriate stochastic models for objects under study and then determining the distributions or parameters one is interested in. As a rule, the explicit form of these distributions and parameters is not known. LTs can be used to find suitable approximations to the characteristics in question. At least two possible approaches to this problem should be noted here.

be noted here

Suppose that the unknown distribution F_θ depends on a parameter θ such that, as θ approaches some "critical" value θ_0, the distributions F_θ become "degenerate" in one sense or another. Then in a number of cases one can find an approximation for F_θ which is valid for the values of θ that are close to θ_0. For instance, in actuarial studies, ▶queueing theory and some other applications, one of the main problems is concerned with the distribution of S = sup_{k≥0}(S_k − θk) under the assumption that EX = 0. If θ > 0 then S is a proper random variable. If, however, θ → 0, then S → ∞ a.s. Here we deal with the so-called "transient phenomena." It turns out that if σ² = Var(X) < ∞ then there exists the limit

lim_{θ↓0} P(θS > x) = e^{−2x/σ²},   x > 0.

This (Kingman–Prokhorov) LT enables one to find approximations for the distribution of S in situations where θ is small.

2. Sometimes one can estimate the "tails" of the unknown distributions, i.e., their asymptotic behavior at infinity. This is of importance in those applications where one needs to evaluate the probabilities of rare events. If the equation $\mathrm{E}\,e^{\mu(X-\theta)} = 1$ has a solution $\mu > 0$, then in the above example one has

$$\mathrm{P}(S > x) \sim c\,e^{-\mu x}, \qquad x \to \infty,$$

where $c$ is a known constant. If the distribution $F$ of $X$ is subexponential (in this case $\mathrm{E}\,e^{\mu(X-\theta)} = \infty$ for any $\mu > 0$), then

$$\mathrm{P}(S > x) \sim \frac{1}{\theta}\int_{x}^{\infty}(1 - F(t))\,dt, \qquad x \to \infty.$$


This LT enables one to find approximations for $\mathrm{P}(S > x)$ for large $x$.

For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (aka the Donsker–Prokhorov invariance principle), which states that, as $n \to \infty$, the processes

$$\zeta_n(t) = \frac{S_{\lfloor nt \rfloor} - ant}{\sigma\sqrt{n}}, \qquad t \in [0,1],$$

converge in distribution to the standard Wiener process $\{w(t)\}_{t\in[0,1]}$. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) $S_1,\dots,S_n$ can also be classified as LTs for 7stochastic processes. The same can be said about the Law of the iterated logarithm, which states that, for an arbitrary $\varepsilon > 0$, the random walk $\{S_k\}_{k=1}^{\infty}$ crosses the boundary $(1-\varepsilon)\sigma\sqrt{2k\ln\ln k}$ infinitely many times, but crosses the boundary $(1+\varepsilon)\sigma\sqrt{2k\ln\ln k}$ only finitely many times, with probability 1. Similar results hold true for the trajectories of Wiener processes $\{w(t)\}_{t\in[0,1]}$ and $\{w(t)\}_{t\in[0,\infty)}$.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" $\{\sqrt{n}\,(F^{*}_{n}(u) - F(u))\}_{u\in(-\infty,\infty)}$ converges in distribution to $\{w^{0}(F(u))\}_{u\in(-\infty,\infty)}$, where $w^{0}(t) = w(t) - t\,w(1)$ is the Brownian bridge process. This LT implies the 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function $F^{*}_{n}(u)$.
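The empirical-process statement can be checked by simulation. A minimal sketch (my own illustration; sample sizes and grid are arbitrary) comparing the Kolmogorov statistic $\sqrt{n}\sup_u|F^{*}_n(u) - F(u)|$ with the supremum of a simulated Brownian bridge:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, reps = 1000, 2000, 2000

ks = np.empty(reps)   # sqrt(n) * sup |F_n - F| for Uniform(0,1) samples
bb = np.empty(reps)   # sup |w0(t)| for a Brownian bridge on an m-point grid
for r in range(reps):
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    ks[r] = np.sqrt(n) * np.max(np.maximum(i / n - u, u - (i - 1) / n))

    w = np.cumsum(rng.normal(0.0, np.sqrt(1.0 / m), m))
    t = np.arange(1, m + 1) / m
    bb[r] = np.abs(w - t * w[-1]).max()

print(np.quantile(ks, [0.5, 0.9, 0.95]))   # the two sets of quantiles
print(np.quantile(bb, [0.5, 0.9, 0.95]))   # should nearly coincide
```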

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of a branching process and conditional (given non-extinction) LTs on the number of particles, etc.

About the Author
Professor Borovkov has been Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and the Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal Siberian Advances in Mathematics, and Associate Editor of the journals Theory of Probability and its Applications, Siberian Mathematical Journal, Mathematical Methods of Statistics, and Electronic Journal of Probability.

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko-Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P () Convergence of probability measures. Wiley, New York
Borovkov AA () Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN () Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P () Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris
Loève M (–) Probability theory, vols I and II. Springer, New York
Petrov VV () Limit theorems of probability theory: sequences of independent random variables. Clarendon/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

$$y_i \mid b_i \sim N(X_i\beta + Z_i b_i,\ \Sigma_i), \qquad (1)$$
$$b_i \sim N(0, D), \qquad (2)$$

where $X_i$ and $Z_i$ are $(n_i \times p)$- and $(n_i \times q)$-dimensional matrices of known covariates, $\beta$ is a $p$-dimensional vector of regression parameters called the fixed effects, $D$ is a general $(q \times q)$ covariance matrix, and $\Sigma_i$ is a $(n_i \times n_i)$ covariance matrix which depends on $i$ only through its dimension $n_i$, i.e., the set of unknown parameters in $\Sigma_i$ will not depend upon $i$. Finally, $b_i$ is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector $y_i$ of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, $b_i$) while others are not (fixed effects, $\beta$). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, $\mathrm{E}(b_i) = 0$ implies that the mean of $y_i$ still equals $X_i\beta$, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, $y_i$ also follows a normal distribution with mean vector $X_i\beta$ and with covariance matrix $V_i = Z_i D Z_i' + \Sigma_i$.

Note that the random effects in (1) implicitly imply the marginal covariance matrix $V_i$ of $y_i$ to be of the very specific form $V_i = Z_i D Z_i' + \Sigma_i$. Let us consider two examples under the assumption of conditional independence, i.e., assuming $\Sigma_i = \sigma^2 I_{n_i}$. First, consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates $Z_i$ which are $n_i$-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

$$y_i \sim N(X_i\beta, V_i), \qquad V_i = Z_i D Z_i' + \Sigma_i,$$

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix $V_i$.
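For the random-intercept example above, the implied marginal covariance has a compound-symmetry structure. A minimal NumPy sketch (illustrative values of my own, not from the original text):

```python
import numpy as np

n_i = 4                   # number of measurements on unit i
sigma2, d = 1.0, 2.0      # Sigma_i = sigma2 * I; D = [[d]] (random-intercept variance)

Z = np.ones((n_i, 1))     # random-intercept design: a column of ones
D = np.array([[d]])
V = Z @ D @ Z.T + sigma2 * np.eye(n_i)
print(V)   # d + sigma2 on the diagonal, d off the diagonal (compound symmetry)
```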

With respect to the estimation of unit-specific parameters $b_i$, the posterior distribution of $b_i$ given the observed data $y_i$ can be shown to be (multivariate) normal, with mean vector equal to $D Z_i' V_i^{-1}(\alpha)(y_i - X_i\beta)$. Replacing $\beta$ and $\alpha$ by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates $\hat b_i$ for the $b_i$. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction $\hat y_i \equiv X_i\hat\beta + Z_i\hat b_i$ of the $i$th profile. It can easily be shown that

$$\hat y_i = \Sigma_i V_i^{-1} X_i\hat\beta + (I_{n_i} - \Sigma_i V_i^{-1})\,y_i,$$

which can be interpreted as a weighted average of the population-averaged profile $X_i\hat\beta$ and the observed data $y_i$, with weights $\Sigma_i V_i^{-1}$ and $I_{n_i} - \Sigma_i V_i^{-1}$, respectively. Note that the "numerator" of $\Sigma_i V_i^{-1}$ represents within-unit variability and the "denominator" is the overall covariance matrix $V_i$. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile $X_i\hat\beta$. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination $\lambda$ of the random effects,

$$\mathrm{var}(\lambda'\hat b_i) \le \mathrm{var}(\lambda' b_i) = \lambda' D \lambda.$$
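A small sketch (my own illustration, using the true $\beta$ in place of its ML estimate for simplicity) that computes the EB estimate and verifies numerically that the two expressions for $\hat y_i$ coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
n_i = 5
X = np.column_stack([np.ones(n_i), np.arange(n_i)])
Z = np.ones((n_i, 1))
beta = np.array([1.0, 0.5])
D, Sigma = np.array([[2.0]]), np.eye(n_i)
V = Z @ D @ Z.T + Sigma

y = X @ beta + Z @ rng.normal(0, np.sqrt(2.0), 1) + rng.normal(size=n_i)

b_hat = D @ Z.T @ np.linalg.solve(V, y - X @ beta)   # posterior mean of b_i
fit1 = X @ beta + (Z @ b_hat).ravel()                # X beta + Z b_hat
W = Sigma @ np.linalg.inv(V)
fit2 = W @ (X @ beta) + (np.eye(n_i) - W) @ y        # weighted-average form
print(np.allclose(fit1, fit2))                       # True: the identity holds
```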

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at the Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship $y = f(x, \beta, \varepsilon)$ involving a dependent real-valued variable $y$, independent variables $x$, model parameters $\beta$, and random variables $\varepsilon$, such that a measure of central tendency for $y$ in relation to $x$, termed the regression function, is linear in $\beta$. Possible regression functions include the conditional mean $\mathrm{E}(y \mid x, \beta)$ (as when $\beta$ is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps $y$ is corn yield from a given plot of earth and the variables $x$ include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on $y$. The form of this linkage is specified by a function $f$ known to the experimenter, one that depends upon parameters $\beta$ whose values are not known, and also upon unseen random errors $\varepsilon$ about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in $\beta$. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references, at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies, embedded in white noise background:

$$y[t] = \beta_0 + \beta_1 \sin[\nu_1 t] + \beta_2 \sin[\nu_2 t] + \varepsilon.$$

We write $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ for $\beta = (\beta_0, \beta_1, \beta_2)$, $x = (x_1, x_2)$, $x_1 \equiv x_1(t) = \sin[\nu_1 t]$, $x_2(t) = \sin[\nu_2 t]$, $t \ge 0$. A natural choice of regression function is $m(x, \beta) = \mathrm{E}(y \mid x, \beta) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, provided $\mathrm{E}\varepsilon \equiv 0$. In the classical linear regression model one assumes, for different instances $i$ of observation, that the random errors satisfy $\mathrm{E}\varepsilon_i \equiv 0$, $\mathrm{E}\varepsilon_i\varepsilon_k \equiv \sigma^2 > 0$ for $i = k \le n$, and $\mathrm{E}\varepsilon_i\varepsilon_k \equiv 0$ otherwise. Errors in linear regression models typically depend upon the instances $i$ at which we select observations and may, in some formulations, depend also on the values of $x$ associated with instance $i$ (perhaps the errors are correlated, and that correlation depends upon the $x$ values). What we observe are $y_i$ and the associated values of the independent variables; that is, we observe $(y_i, \sin[\nu_1 t_i], \sin[\nu_2 t_i])$, $i \le n$. The linear model on data may be expressed $y = x\beta + \varepsilon$, with $y$ the column vector $(y_i, i \le n)$, likewise for $\varepsilon$, and the matrix $x$ (the design matrix) whose entries are $x_{ik} = x_k[t_i]$.
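A brief sketch of this setup (my own illustration; the frequencies, amplitudes, and noise level are arbitrary) that builds the design matrix and fits the model by least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
nu1, nu2 = 2.0, 5.0                      # hypothetical known frequencies
beta_true = np.array([1.0, 0.8, -0.5])   # intercept and the two amplitudes

t = np.linspace(0.0, 10.0, 200)
x = np.column_stack([np.ones_like(t), np.sin(nu1 * t), np.sin(nu2 * t)])  # design matrix
y = x @ beta_true + rng.normal(0.0, 0.3, t.size)

beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_ls)    # close to beta_true
```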

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If $y$ is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either $x$, $\beta$, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications such as the Lasso or other constrained optimizations that achieve reduced sampling variation of coefficient estimators while introducing bias (Efron et al.). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov Chain Monte Carlo having enabled their calculation); non-parametric methods (good performance relative to more relaxed assumptions about errors); and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao).

Background
C.F. Gauss may have used least squares very early on. He was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view shortly following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: $w$ = standard score of weight of parental sweet pea seed(s), $y$ = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of $w$ and $y$ are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight $w$, median filial seed weight $m(w)$). Since for this data $s_y \sim s_w$, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was $r s_y/s_w = r$ (the correlation). The medians $m(w)$ being essentially equal to the means of $y$ for each $w$ greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at fixed standard-deviation offsets from the mean of $w$. Galton gave the name co-relation (later correlation) to the slope of this line and, for a brief time, thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if $0 < r < 1$, then for a value $w > \mathrm{E}w$ the relation $\mathrm{E}w < m(w) < w$ follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points $(w, y)$ departed on each vertical from the regression line by statistically independent $N(0, \theta)$ random residuals whose variance $\theta > 0$ was the same for all $w$. Of course this likewise amazed him, and in later work he identified all these properties as a consequence of bivariate normal observations $(w, y)$ (Galton).

our many statistical models ldquodrivenrdquo as we now proclaimby random errors subject to ever broadening statisticalmodeling In the beginning it was very close to truth

L Linear Regression Models

Aspects of Linear Regression ModelsData of the real world seldomconformexactly to any deter-ministic mathematical model y = f (x β) and through thedevice of incorporating random errors we have now anestablished paradigm for tting models to data (x y) bystatistically estimating model parameters In consequencewe obtain methods for such purposes as predicting whatwill be the average response y to particular given inputsx providing margins of error (and prediction error) forvarious quantities being estimated or predicted tests ofhypotheses and the like It is important to note in all thisthat more than one statistical model may apply to a givenproblem the function f and the other model componentsdiering among them Two statistical modelsmay disagreesubstantially in structure and yet neither either or bothmay produce useful results In this respect statistical mod-eling is more a matter of how much we gain from usinga statistical model and whether we trust and can agreeupon the assumptions placed on the model at least as apractical expedient In some cases the regression functionconforms precisely to underlying mathematical relation-ships but that does not reect the majority of statisticalpractice It may be that a given statistical model althoughfar from being an underlying truth confers advantage bycapturing some portion of the variation of y vis-a-vis xemethod principle componentswhich seeks to nd relativelysmall numbers of linear combinations of the independentvariables that together account for most of the variationof y illustrates this point well In one application elec-tromagnetic theory was used to generate by computer anelaborate database of theoretical responses of an inducedelectromagnetic eld near a metal surface to various com-binations of aws in the metale role of principle com-ponents and linear modeling was to establish a simplemodel reecting those ndings so that a portable devicecould be devised to make detections in real time based onthe modelIf there is any weakness to the statistical approach

it lies in the fact that margins of error statistical testsand the like can be seriously incorrect even if the predic-tions aorded by a model have apparent value Refer toHinkelmann and Kempthorne () Berk ()Freedman () Freedman ()

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as $y = x\beta + \varepsilon$, an abbreviated matrix formulation of the system of equations

$$y_i = x_{i1}\beta_1 + \dots + x_{ip}\beta_p + \varepsilon_i, \qquad i \le n, \qquad (1)$$

in which the random errors $\varepsilon$ are assumed to satisfy $\mathrm{E}\varepsilon_i \equiv 0$, $\mathrm{E}\varepsilon_i^2 \equiv \sigma^2 > 0$, $i \le n$.

The interpretation is that response $y_i$ is observed for the $i$th sample in conjunction with numerical values $(x_{i1},\dots,x_{ip})$ of the independent variables. If these errors $\varepsilon_i$ are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix $x^{\mathrm{tr}}x$ is non-singular, then the maximum likelihood (ML) estimates of the model coefficients $\beta_k$, $k \le p$, are produced by ordinary least squares (LS) as follows:

$$\hat\beta_{\mathrm{ML}} = \hat\beta_{\mathrm{LS}} = (x^{\mathrm{tr}}x)^{-1}x^{\mathrm{tr}}y = \beta + M\varepsilon \qquad (2)$$

for $M = (x^{\mathrm{tr}}x)^{-1}x^{\mathrm{tr}}$, with $x^{\mathrm{tr}}$ denoting the matrix transpose of $x$ and $(x^{\mathrm{tr}}x)^{-1}$ the matrix inverse. These coefficient estimates $\hat\beta_{\mathrm{LS}}$ are linear functions in $y$ and satisfy the Gauss–Markov properties (3), (4):

$$\mathrm{E}(\hat\beta_{\mathrm{LS}})_k = \beta_k, \qquad k \le p, \qquad (3)$$

and, among all unbiased estimators $\beta^{*}_k$ (of $\beta_k$) that are linear in $y$,

$$\mathrm{E}\big((\hat\beta_{\mathrm{LS}})_k - \beta_k\big)^2 \le \mathrm{E}\big(\beta^{*}_k - \beta_k\big)^2 \quad \text{for every } k \le p. \qquad (4)$$

The least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If $y = x\beta + \varepsilon$, then $\hat\beta_{\mathrm{LS}} = My = M(x\beta + \varepsilon) = \beta + M\varepsilon$, as in (2). That is, the least squares estimate of the model coefficients acts on $x\beta + \varepsilon$, returning $\beta$ plus the result of applying least squares to $\varepsilon$. This has nothing to do with the model being correct or $\varepsilon$ being error, but is purely algebraic. If $\varepsilon$ itself has the form $xb + e$, then $M\varepsilon = b + Me$. Another useful observation is that if $x$ has first column identically one, as would typically be the case for a model with constant term, then each row $M_k$, $k > 1$, of $M$ satisfies $M_k\mathbf{1} = 0$ (i.e., $M_k$ is a contrast), so $(\hat\beta_{\mathrm{LS}})_k = \beta_k + M_k\varepsilon$ and $M_k(\varepsilon + c\mathbf{1}) = M_k\varepsilon$, so $\varepsilon$ may as well be assumed to be centered, for $k > 1$. There are many of these interesting algebraic properties, such as $s^2(y - x\hat\beta_{\mathrm{LS}}) = (1 - R^2)\,s^2(y)$, where $s^2(\cdot)$ denotes the sample variance and $R$ is the multiple correlation, defined as the correlation between $y$ and the fitted values $x\hat\beta_{\mathrm{LS}}$. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors $\varepsilon$ and contrasts $v$, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of $v(\hat\beta_{\mathrm{LS}} - \beta)$, conditional on the 7order statistics of $\varepsilon$ (see LePage and Podgorski). Freedman and Lane advocated tests based on the permutation bootstrap of residuals as a descriptive method.
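These identities are easy to confirm numerically; a short sketch of my own, with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
eps = rng.normal(size=n)
y = x @ beta + eps

M = np.linalg.inv(x.T @ x) @ x.T
beta_ls = M @ y
print(np.allclose(beta_ls, beta + M @ eps))        # beta_LS = beta + M eps, exactly
print(np.allclose(M[1:] @ np.ones(n), 0.0))        # rows k > 1 of M are contrasts

resid = y - x @ beta_ls
R = np.corrcoef(y, x @ beta_ls)[0, 1]              # multiple correlation
print(np.allclose(resid.var(), (1 - R**2) * y.var()))
```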

Generalized Least Squares
If the errors $\varepsilon$ are $N(0, \Sigma)$ distributed for a covariance matrix $\Sigma$ known up to a constant multiple, then the maximum likelihood estimates of the coefficients $\beta$ are produced by the generalized least squares solution, retaining properties (3), (4) (any positive multiple of $\Sigma$ will produce the same result), given by

$$\hat\beta_{\mathrm{ML}} = (x^{\mathrm{tr}}\Sigma^{-1}x)^{-1}x^{\mathrm{tr}}\Sigma^{-1}y = \beta + (x^{\mathrm{tr}}\Sigma^{-1}x)^{-1}x^{\mathrm{tr}}\Sigma^{-1}\varepsilon. \qquad (5)$$

The generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of $y$ to nonlinear functions of $x\beta$.

Reproducing Kernel Generalized LinearRegression ModelParzen ( Sect ) developed the reproducing kernelframework extending generalized least squares to spacesof arbitrary nite or innite dimension when the ran-dom error function ε = ε(t) t isin T has zero meansEε(t) equiv and a covariance function K(s t) = Eε(s)ε(t)that is assumed known up to some positive constant mul-tiple In this formulation

y(t) = m(t) + ε(t) t isin Tm(t) = Ey(t) = Σiβiwi(t) t isin T

where wi() are known linearly independent functions inthe reproducing kernel Hilbert (RKHS) space H(K) of thekernel K For reasons having to do with singularity ofGaussian measures it is assumed that the series deningm is convergent in H(K) Parzen extends to that contextand solves the problem of best linear unbiased estima-tion of the model coecients β and more generally ofestimable linear functions of them developing condenceregions prediction intervals exact or approximate distri-butions tests and other matters of interest and establish-ing the GaussndashMarkov properties ()()e RKHS setup

has been examined from an on-line learning perspective(Vovk )

Joint Normal Distributed (x y) asMotivation for the Linear RegressionModel and Least SquaresFor the moment think of (x y) as following a multivariatenormal distribution as might be the case under processcontrol or in natural systems e (regular) conditionalexpectation of y relative to x is then for some β

E(y∣x) = Ey + E((y minus Ey)∣x) = Ey + xβ for every x

and the discrepancies y minus E(y∣x) are for each x distributedN( σ ) for a xed σ independent for dierent x

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.

Sampling model: Data $(x_{i1},\dots,x_{ip}, y_i)$, $i \le n$, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the $\varepsilon_i$ are simply the residual discrepancies $y - x\hat\beta_{\mathrm{LS}}$ of a least squares fit of the linear model $x\beta$ to the population. Galton's seed study is an example of this if we regard his data $(w, y)$ as resulting from equal-probability without-replacement random sampling of a population of pairs $(w, y)$, with $w$ restricted to fixed standard-deviation offsets from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict $y$ from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of $y$ from $w$ owing to the modest correlation, was arguably best for his bivariate normal data $(w, y)$. Tossing in another independent variable, such as $w^2$ for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of $x$ accounts for enough of the variation of $y$ in relation to the variables of interest to be useful, and if its fewer coefficients $\beta$ are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of $y$ (Raudenbush and Bryk). New results on data compression (Candès and Tao) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., $X > c > \mathrm{E}X$, will not average so highly on another scoring $Y$ as they do on $X$, i.e., $\mathrm{E}(Y \mid X > c) < \mathrm{E}(X \mid X > c)$. Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when $X, Y$ have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables $X, Y$ be identically distributed with finite mean $\mathrm{E}X$, and fix any $c > \max(0, \mathrm{E}X)$. If $\mathrm{P}(X > c \text{ and } Y > c) < \mathrm{P}(Y > c)$, then there is reversion to mediocrity or beyond for that $c$.

Proof: For any given $c > \max(0, \mathrm{E}X)$, define the difference $J$ of indicator random variables, $J = \mathbf{1}(X > c) - \mathbf{1}(Y > c)$. $J$ is zero unless one indicator is $1$ and the other $0$; $YJ$ is less than or equal to $cJ = c$ on $J = 1$ (i.e., on $X > c$, $Y \le c$), and $YJ$ is strictly less than $cJ = -c$ on $J = -1$ (i.e., on $X \le c$, $Y > c$). Since the event $J = -1$ has positive probability by assumption, the previous implies $\mathrm{E}\,YJ < c\,\mathrm{E}J$, and so

$$\mathrm{E}\,Y\mathbf{1}(X > c) = \mathrm{E}\big(Y\mathbf{1}(Y > c) + YJ\big) = \mathrm{E}\,X\mathbf{1}(X > c) + \mathrm{E}\,YJ < \mathrm{E}\,X\mathbf{1}(X > c) + c\,\mathrm{E}J = \mathrm{E}\,X\mathbf{1}(X > c),$$

yielding $\mathrm{E}(Y \mid X > c) < \mathrm{E}(X \mid X > c)$. ◻
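A quick Monte Carlo illustration of the proposition (my own sketch; the bivariate normal choice and the cutoff are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho, c = 1_000_000, 0.5, 1.0

# X, Y identically distributed (standard normal), imperfectly correlated
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

sel = x > c
print(y[sel].mean(), x[sel].mean())   # E(Y | X>c) < E(X | X>c): reversion
```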

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat
Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal, Elsevier
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose $x(n) = (x_1,\dots,x_n)$ is a sample from a stochastic process $x = \{x_1, x_2, \dots\}$. Let $p_n(x(n); \theta)$ denote the joint density function of $x(n)$, where $\theta \in \Omega \subset \mathbb{R}^k$ is a parameter. Define the log-likelihood ratio $\Lambda_n = \ln\!\left[\frac{p_n(x(n);\,\theta_n)}{p_n(x(n);\,\theta)}\right]$, where $\theta_n = \theta + n^{-1/2}h$ and $h$ is a $(k \times 1)$ vector. The joint density $p_n(x(n); \theta)$ belongs to a local asymptotic normal (LAN) family if $\Lambda_n$ satisfies

$$\Lambda_n = n^{-1/2}\,h^{t}S_n(\theta) - \tfrac{1}{2}\,n^{-1}\big(h^{t}J_n(\theta)h\big) + o_p(1), \qquad (1)$$

where $S_n(\theta) = \dfrac{d\,\ln p_n(x(n);\theta)}{d\theta}$, $J_n(\theta) = -\dfrac{d^{2}\ln p_n(x(n);\theta)}{d\theta\,d\theta^{t}}$, and

$$\text{(i)}\ \ n^{-1/2}S_n(\theta) \xrightarrow{d} N_k(0, F(\theta)), \qquad \text{(ii)}\ \ n^{-1}J_n(\theta) \xrightarrow{p} F(\theta), \qquad (2)$$

$F(\theta)$ being the limiting Fisher information matrix. Here $F(\theta)$ is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator $\hat\theta_n$ is a consistent, asymptotically normal, and efficient estimator of $\theta$, with

$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N_k(0, F^{-1}(\theta)). \qquad (3)$$

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.

If the limiting Fisher information matrix $F(\theta)$ is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If $\Lambda_n$ satisfies (1) and (2) with $F(\theta)$ random, the density $p_n(x(n); \theta)$ belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm $\sqrt{n}$ by a random norm $J_n^{1/2}(\theta)$ to get the limiting normal distribution, viz.,

$$J_n^{1/2}(\theta)\,(\hat\theta_n - \theta) \xrightarrow{d} N(0, I), \qquad (4)$$

where $I$ is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals
Suppose, conditionally on $V = v$, $(x_1, x_2, \dots, x_n)$ are iid $N(\theta, v^{-1})$ random variables, and $V$ is an exponential random variable with mean $1$. The marginal joint density of $x(n)$ is then given by

$$p_n(x(n); \theta) \propto \left[1 + \tfrac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^{2}\right]^{-\left(\frac{n}{2}+1\right)}.$$

It can be verified that $F(\theta)$ is an exponential random variable with mean $1$. The ML estimator is $\hat\theta_n = \bar x$, and $\sqrt{n}\,(\bar x - \theta) \xrightarrow{d} t(2)$. It is interesting to note that the variance of the limit distribution of $\bar x$ is $\infty$.
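A short simulation check of this example (my own sketch; the sample size and replication count are arbitrary), comparing the quantiles of $\sqrt{n}(\bar x - \theta)$ with those of $t(2)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, reps, theta = 50, 20_000, 0.0

z = np.empty(reps)
for r in range(reps):
    v = rng.exponential(1.0)                     # V ~ exponential with mean 1
    x = rng.normal(theta, np.sqrt(1.0 / v), n)   # x_i | V=v ~ N(theta, 1/v)
    z[r] = np.sqrt(n) * (x.mean() - theta)

print(np.quantile(z, [0.9, 0.95, 0.99]))
print(stats.t(df=2).ppf([0.9, 0.95, 0.99]))      # close to the t(2) quantiles
```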

Example 2: Autoregressive process
Consider a first-order autoregressive process $\{x_t\}$ defined by $x_t = \theta x_{t-1} + e_t$, $t = 1, 2, \dots$, with $x_0 = 0$, where the $e_t$ are assumed to be iid $N(0, 1)$ random variables. We then have

$$p_n(x(n); \theta) \propto \exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(x_t - \theta x_{t-1})^{2}\right].$$

For the stationary case, $|\theta| < 1$, this model belongs to the LAN family. However, for $|\theta| > 1$, the model belongs to the LAMN family. For any $\theta$, the ML estimator is

$$\hat\theta_n = \left(\sum_{t=1}^{n} x_t x_{t-1}\right)\left(\sum_{t=1}^{n} x_{t-1}^{2}\right)^{-1}.$$

One can verify that $\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N(0, 1-\theta^{2})$ for $|\theta| < 1$, and $(\theta^{2}-1)^{-1}\theta^{n}\,(\hat\theta_n - \theta) \xrightarrow{d}$ Cauchy for $|\theta| > 1$.
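A sketch simulating the stationary case of Example 2 (my own illustration; parameter values arbitrary), checking that $\sqrt{n}(\hat\theta_n - \theta)$ has standard deviation near $\sqrt{1-\theta^2}$:

```python
import numpy as np

rng = np.random.default_rng(9)

def ar1_mle(theta, n):
    """Simulate an AR(1) path and return the ML estimator of theta."""
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + rng.normal()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

theta, n = 0.5, 2000
est = np.array([ar1_mle(theta, n) for _ in range(500)])
print(np.sqrt(n) * est.std(), np.sqrt(1 - theta**2))   # the two should be close
```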

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable $X$ with realization $x$ belongs to the location-scale family when its cumulative distribution is a function only of $(x-a)/b$:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x-a}{b}\right), \qquad a \in \mathbb{R},\ b > 0,$$

where $F(\cdot)$ is a distribution having no other parameters. Different $F(\cdot)$'s correspond to different members of the family. $(a, b)$ is called the location–scale parameter, $a$ being the location parameter and $b$ being the scale parameter. For fixed $b = 1$ we have a subfamily which is a location family with parameter $a$, and for fixed $a = 0$ we have a scale family with parameter $b$. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable. It has $a = 0$ and $b = 1$. If the distribution of $X$ is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{d F_X(x \mid a, b)}{d x},$$

then $(a, b)$ is a location–scale parameter for the distribution of $X$ if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\, f\!\left(\frac{x-a}{b}\right)$$

for some density $f(\cdot)$, called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When $Y$ has mean $\mu_Y$ and standard deviation $\sigma_Y$, then the mean of $X$ is $\mathrm{E}(X) = a + b\,\mu_Y$, and the standard deviation of $X$ is $\sqrt{\mathrm{Var}(X)} = b\,\sigma_Y$.

The location parameter $a$, $a \in \mathbb{R}$, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of $a$ causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or it is an upper or lower threshold parameter. The scale parameter $b$, $b > 0$, is responsible for the dispersion or variation of the variate $X$. Increasing (decreasing) $b$ results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; $b$ may be the standard deviation, the full or half length of the support, or the length of a central $(1-\alpha)$-interval.

The location-scale family has a great number of members:

- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
- Cauchy and half-Cauchy distributions
- Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distributions of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half-logistic distributions
- Normal and half-normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi-elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density $f(y) = [\exp(y) - 1]\exp[1 + y - \exp(y)]$, $y \ge 0$
- V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for $a$ and $b$ as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of $a$ and $b$ will be the parameters of this line.
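A minimal sketch of this least-squares-line approach on exponential probability paper (my own illustration; the plotting position $(i-0.3)/(n+0.4)$ is one common choice among several discussed in the references):

```python
import numpy as np

rng = np.random.default_rng(10)
a, b, n = 5.0, 2.0, 200
x = np.sort(a + b * rng.exponential(size=n))   # ordered sample, location a, scale b

i = np.arange(1, n + 1)
p = (i - 0.3) / (n + 0.4)                      # plotting positions
q = -np.log(1.0 - p)                           # reduced exponential percentiles

slope, intercept = np.polyfit(q, x, 1)         # least-squares line on the paper
print(intercept, slope)                        # estimates of a and b
```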

The latter approach takes the order statistics $X_{1:n} \le X_{2:n} \le \dots \le X_{n:n}$ as regressand and the means of the reduced order statistics, $\alpha_{i:n} = \mathrm{E}(Y_{i:n})$, as regressor, which under these circumstances act as plotting positions. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where $\varepsilon_i$ is a random variable expressing the difference between $X_{i:n}$ and its mean $\mathrm{E}(X_{i:n}) = a + b\,\alpha_{i:n}$. As the order statistics $X_{i:n}$ and, as a consequence, the disturbance terms $\varepsilon_i$ are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized-least-squares method to find best linear unbiased estimators of $a$ and $b$. Introducing the following vectors and matrices,

$$x = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad
\mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad
\alpha = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad
\theta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad
A = (\mathbf{1}\ \ \alpha),$$

the regression model now reads

$$x = A\,\theta + \varepsilon,$$

with variance–covariance matrix

$$\mathrm{Var}(x) = b^{2}\,B.$$

The GLS estimator of $\theta$ is

$$\hat\theta = \left(A' B^{-1} A\right)^{-1} A' B^{-1} x,$$

and its variance–covariance matrix reads

$$\mathrm{Var}(\hat\theta) = b^{2}\left(A' B^{-1} A\right)^{-1}.$$

The vector $\alpha$ and the matrix $B$ are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining $\mathrm{E}(Y_{i:n})$ and $\mathrm{E}(Y_{i:n} Y_{j:n})$. For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
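For the exponential case the closed forms are classical: $\alpha_{i:n} = \sum_{m=n-i+1}^{n} 1/m$ and $\mathrm{Cov}(Y_{i:n}, Y_{j:n}) = \sum_{m=n-i+1}^{n} 1/m^{2}$ for $i \le j$. A sketch of Lloyd's BLUE built from these (my own illustration, not code from the source):

```python
import numpy as np

rng = np.random.default_rng(11)
n, a, b = 20, 5.0, 2.0
x = np.sort(a + b * rng.exponential(size=n))      # ordered sample

m = np.arange(1, n + 1)
alpha = np.array([(1.0 / m[n - i:]).sum() for i in range(1, n + 1)])    # E(Y_{i:n})
g = np.array([(1.0 / m[n - i:] ** 2).sum() for i in range(1, n + 1)])   # Var(Y_{i:n})
B = np.minimum.outer(g, g)   # Cov(Y_{i:n}, Y_{j:n}) = g_{min(i,j)} (g is increasing)

A = np.column_stack([np.ones(n), alpha])
Bi = np.linalg.inv(B)
theta = np.linalg.solve(A.T @ Bi @ A, A.T @ Bi @ x)   # BLUE of (a, b)
print(theta)
```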

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was subsequently appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute. He is Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on $(0, 1)$ that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ()), and different formulations may be appropriate in particular applications (Aitchison). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1,\dots,n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution, with $\mathrm{logit}(P_i) = \log(P_i/(1-P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1,\dots,x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \mathrm{Binomial}(m_i, P_i),$$
$$\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if $\sigma^2$ is small, then, as derived in Williams (),

$$\mathrm{E}[R_i] = m_i\pi_i \quad\text{and}\quad \mathrm{Var}(R_i) = m_i\pi_i(1-\pi_i)\left[1 + \sigma^2(m_i - 1)\pi_i(1-\pi_i)\right],$$

where $\mathrm{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability, and the random effect $Z$ allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
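A quick simulation check of Williams' small-$\sigma^2$ variance approximation (my own sketch; parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12)
m, sigma, eta, reps = 20, 0.3, 0.5, 200_000

z = rng.normal(size=reps)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))   # logistic-normal probabilities
r = rng.binomial(m, p)                         # R | P ~ Binomial(m, P)

pi = 1.0 / (1.0 + np.exp(-eta))
approx = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx)                         # close, since sigma^2 is small
```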

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor
University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics
Arizona State University, Tempe, AZ, USA
Solar System Ambassador
California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can represent vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, $\sigma^2$, is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\{(y_i\theta_i - b(\theta_i))/\alpha(\phi) + c(y_i; \phi)\}, \qquad (2)$$

with $\theta$ representing the link function, $\alpha(\phi)$ the scale parameter, $b(\theta)$ the cumulant, and $c(y; \phi)$ the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to $\theta$, $b'(\theta)$; the variance is given as the second derivative, $b''(\theta)$. Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\{y_i \ln(\pi_i/(1-\pi_i)) + \ln(1-\pi_i)\}. \qquad (3)$$

The link function is therefore $\ln(\pi/(1-\pi))$, and the cumulant $-\ln(1-\pi)$, or $\ln(1/(1-\pi))$. For the Bernoulli, $\pi$ is defined as the probability of success. The first derivative of the cumulant is $\pi$, the second derivative $\pi(1-\pi)$. These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate $\pi$, for example, rather than $y$. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, $\mu$, is typically substituted for $\pi$ when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\left\{y_i \ln\!\left(\frac{\mu_i}{1-\mu_i}\right) + \ln(1-\mu_i)\right\}, \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\left\{y_i \ln(\mu_i) + (1-y_i)\ln(1-\mu_i)\right\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^{n}\left\{y_i \ln\!\left(\frac{y_i}{\mu_i}\right) + (1-y_i)\ln\!\left(\frac{1-y_i}{1-\mu_i}\right)\right\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of $\mu$ for each observation in the model is calculated on the basis of the linear predictor, $x'\beta$. For the normal model, the predicted fit, $\hat y$, is identical to $x'\beta$, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, $\ln(\mu_i/(1-\mu_i))$. We have, therefore,

$$x_i'\beta = \ln\!\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n. \qquad (7)$$

The value of $\mu_i$ for each observation in the logistic model is calculated as

$$\mu_i = \frac{1}{1 + \exp(-x_i'\beta)} = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \qquad (8)$$

The functions to the right of $\mu_i$ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, $\mu$ is a probability.

Raphson type ofMLE algorithm the log-likelihood func-tion as parameterized to xprimeβ rather than microe estimatedt is then determined by taking the rst derivative of thelog-likelihood function with respect to β setting it to zeroand solvinge rst derivative of the log-likelihood func-tion is commonly referred to as the gradient or scorefunctione second derivative of the log-likelihood withrespect to β produces the Hessian matrix from which thestandard errors of the predictor parameter estimates are

derived e logistic gradient and hessian functions aregiven as

partL(β)partβ

=n

sumi=

(yi minus microi)xi ()

partL(β)partβpartβprime

= minusn

sumi=

xixprimei microi( minus microi) ()
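A compact Newton–Raphson implementation built directly on (8)–(10) (a sketch of my own, not code from the article; the simulated data are arbitrary):

```python
import numpy as np

def logistic_mle(x, y, n_iter=25):
    """Newton-Raphson for binary logistic regression via the gradient (9)
    and Hessian (10) of the Bernoulli-logistic log-likelihood."""
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-x @ beta))           # inverse link, eq. (8)
        grad = x.T @ (y - mu)                          # eq. (9)
        hess = -(x * (mu * (1 - mu))[:, None]).T @ x   # eq. (10)
        beta -= np.linalg.solve(hess, grad)
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))        # SEs from the information matrix
    return beta, se

rng = np.random.default_rng(13)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ np.array([-0.5, 1.0]))))
beta_hat, se = logistic_mle(x, y)
print(beta_hat, np.exp(beta_hat))   # coefficients and the corresponding odds ratios
```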

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of $\mu$, $\ln(\mu/(1-\mu))$, where the odds are understood as the success of $\mu$ over its failure, $1-\mu$. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)
survived | child | adults | Total
no       |       |        |
yes      |       |        |
Total    |       |        |

e odds of survival for adult passengers is ore odds of survival for children is or e ratio of the odds of survival for adults to the odds ofsurvival for children is ()() or is value is referred to as the odds ratio or ratio of the twocomponent odds relationships Using a logistic regressionprocedure to estimate the odds ratio of age produces thefollowing results

survived    Odds ratio    Std. err.    z    P>|z|    [95% conf. interval]
age         –             –            –    –        –

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

The odds of an adult surviving were about half the odds of a child surviving.
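To make the arithmetic concrete, a small Python sketch computing the component odds and their ratio from a 2×2 table (the counts below are hypothetical placeholders, not the original Titanic tabulation):

```python
# Hypothetical 2x2 counts: survival status by age group
survived = {"child": 60, "adult": 400}
died = {"child": 40, "adult": 600}

odds_adult = survived["adult"] / died["adult"]   # odds of an adult surviving
odds_child = survived["child"] / died["child"]   # odds of a child surviving

# Odds ratio (adult vs child); exp(beta_age) from a logistic fit equals this
print(odds_adult / odds_child)                   # about 0.44, roughly half
```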

By inverting the estimated odds ratio above, we may conclude that children had – nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data and the odds ratio were calculated as 1.015, we would interpret the relationship as:

The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response, y/m, is binomially distributed as

f(y_i; π_i, m_i) = (m_i choose y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i} ()

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = ∑_{i=1}^{n} { y_i ln[μ_i/(1 − μ_i)] + m_i ln(1 − μ_i) + ln(m_i choose y_i) } ()

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_iπ_i and variance μ_i(1 − μ_i/m_i).
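A corresponding Python sketch of this grouped log-likelihood (assuming NumPy and SciPy; the arrays y, m, and mu – successes, denominators, and fitted probabilities per covariate pattern – are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def binomial_loglik(y, m, mu):
    # ln C(m, y) computed via log-gamma functions
    ln_choose = gammaln(m + 1) - gammaln(y + 1) - gammaln(m - y + 1)
    return np.sum(y * np.log(mu / (1 - mu)) + m * np.log(1 - mu) + ln_choose)
```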

Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases sharing a common set of values on x1, x2, and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1    –             –                –    –        –
x2    –             –                –    –        –
x3    –             –                –    –        –

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find a large individual-based data set grouped into a table of only a few covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once widely used, is now applied only with caution: the test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall/CRC), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track and field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]², −∞ < x < ∞, ()

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]^{−1}, −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]^{−1}, −∞ < x < ∞,

h(x) = 1 / (τ [1 + exp{−(x − μ)/τ}]).

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (), is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^{α}], t ≥ 0, θ, α > 0,

where θ = e^{μ} and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x, ()

and uniquely characterizes the standard logistic distribution. Equation () provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in () we take τX + μ.
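In code, the recipe is one line per step (a Python sketch; the location and scale values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)

x = np.log(u / (1 - u))        # standard logistic via X = ln(U/(1-U))

mu, tau = 2.0, 0.5             # arbitrary location and scale
y = tau * x + mu               # general logistic sample

# Sample variance should be close to the theoretical tau^2 * pi^2 / 3
print(y.var(), tau**2 * np.pi**2 / 3)
```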

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β_0 + β_1 d,

which by the identity () implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing, (II) L(p) ≤ p, (III) L(p) is convex, (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less the inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.

The inequality can be of a different type, the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912). This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; by the Gini index, L_1(p) has greater inequality than L_2(p)
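As an empirical sketch (Python with NumPy; the income vector is hypothetical), the Lorenz curve and the Gini index computed as the ratio of areas:

```python
import numpy as np

def lorenz_points(income):
    # Cumulative population share vs cumulative income share, poorest first
    x = np.sort(income)
    p = np.arange(1, len(x) + 1) / len(x)
    L = np.cumsum(x) / x.sum()
    return np.insert(p, 0, 0.0), np.insert(L, 0, 0.0)

def gini(income):
    p, L = lorenz_points(income)
    area_under_L = np.trapz(L, p)   # area under the Lorenz curve
    return 1 - 2 * area_under_L     # (area between diagonal and curve) / (1/2)

incomes = np.array([12_000, 18_000, 25_000, 40_000, 95_000])  # hypothetical
print(gini(incomes))
```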

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing
(II) L_Y(p) = L_X(p) if u(x)/x is constant
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).

About the Author
Dr Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ, defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

dened on the set of all possible data Since the true value θis unknown the loss function L has to be dened on ΘtimesΘie

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative: Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1, and L(θ, ϑ) = 1 if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems, Varian (1975) introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
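For illustration, Python versions of the loss functions just described (gamma, a, and b are the fixed constants of the Huber and LinEx families; the particular values below are arbitrary):

```python
import numpy as np

def lp_loss(t, p=2):
    # |t|^p: p=2 gives the classical square loss, p=1 the robust L1-loss
    return np.abs(t) ** p

def huber_loss(t, gamma=1.345):
    # Quadratic near zero, linear in the tails
    t = np.abs(t)
    return np.where(t <= gamma, t**2, 2 * gamma * t - gamma**2)

def linex_loss(t, a=1.0, b=1.0):
    # Asymmetric: exponential on one side of zero, linear on the other
    return b * (np.exp(a * t) - a * t - 1)
```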

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works, the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimation for the unknown regression function by the "best approximation", i.e., the function in F that minimizes ∑_{i=1}^{n} ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam


This LT enables one to find approximations for P(S_n > x) for large x. For both approaches, the obtained approximations can be refined.

An important part of Probability Theory is concerned with LTs for random processes. Their main objective is to find conditions under which random processes converge, in some sense, to some limit process. An extension of the CLT to that context is the so-called Functional CLT (a.k.a. the Donsker–Prokhorov invariance principle), which states that, as n → ∞, the processes ζ_n(t) = (S_{⌊nt⌋} − ant)/(σ√n), t ∈ [0, 1], converge in distribution to the standard Wiener process {w(t)}_{t∈[0,1]}. The LTs (including large deviation theorems) for a broad class of functionals of the sequence (7random walk) S_1, …, S_n can also be classified as LTs for 7stochastic processes. The same can be said about the law of the iterated logarithm, which states that, for an arbitrary ε > 0, the random walk {S_k}_{k=1}^{∞} crosses the boundary (1 − ε)σ√(2k ln ln k) infinitely many times but crosses the boundary (1 + ε)σ√(2k ln ln k) finitely many times, with probability 1. Similar results hold true for trajectories of the Wiener processes {w(t)}_{t∈[0,1]} and {w(t)}_{t∈[0,∞)}.

In mathematical statistics, a result closely related to the functional CLT says that the so-called "empirical process" √n(F*_n(u) − F(u)), u ∈ (−∞, ∞), converges in distribution to {w_0(F(u))}_{u∈(−∞,∞)}, where w_0(t) = w(t) − tw(1) is the Brownian bridge process. This LT implies 7asymptotic normality of a great many estimators that can be represented as smooth functionals of the empirical distribution function F*_n(u).
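A quick Monte Carlo sketch (Python with NumPy/SciPy; F taken to be the standard normal and u a fixed evaluation point, both arbitrary choices) checking that the empirical process at u has the Brownian-bridge variance F(u)(1 − F(u)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps, u = 2_000, 5_000, 0.5
Fu = norm.cdf(u)

# sqrt(n) * (F*_n(u) - F(u)), replicated `reps` times
samples = rng.standard_normal((reps, n))
emp = np.sqrt(n) * ((samples <= u).mean(axis=1) - Fu)

print(emp.var(), Fu * (1 - Fu))   # sample vs limiting variance
```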

There are many other areas in Probability Theory and its applications where various LTs appear and are extensively used: for instance, convergence theorems for 7martingales, asymptotics of the extinction probability of branching processes and conditional (under the non-extinction condition) LTs on the number of particles, etc.

About the Author
Professor Borovkov is Head of the Probability and Statistics Department of the Sobolev Institute of Mathematics, Novosibirsk (Russian Academy of Sciences), and Head of the Probability and Statistics Chair at Novosibirsk University. He is a Councillor of the Russian Academy of Sciences and a full member of the Russian Academy of Sciences (Academician). He was awarded the State Prize of the USSR, the Markov Prize of the Russian Academy of Sciences, and a Government Prize in Education. Professor Borovkov is Editor-in-Chief of the journal Siberian Advances in Mathematics and Associate Editor of the journals Theory of Probability and its Applications, Siberian Mathematical Journal, Mathematical Methods of Statistics, and Electronic Journal of Probability.

Cross References
7Almost Sure Convergence of Random Variables
7Approximations to Distributions
7Asymptotic Normality
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Central Limit Theorems
7Empirical Processes
7Ergodic Theorem
7Glivenko–Cantelli Theorems
7Large Deviations and Applications
7Laws of Large Numbers
7Martingale Central Limit Theorem
7Probability Theory: An Outline
7Random Matrix Theory
7Strong Approximations in Probability and Statistics
7Weak Convergence of Probability Measures

References and Further Reading
Billingsley P (1968) Convergence of probability measures. Wiley, New York
Borovkov AA (1998) Mathematical statistics. Gordon & Breach, Amsterdam
Gnedenko BV, Kolmogorov AN (1954) Limit distributions for sums of independent random variables. Addison-Wesley, Cambridge
Lévy P (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
Loève M (1977–1978) Probability theory, vols I and II, 4th edn. Springer, New York
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Clarendon Press/Oxford University Press, New York

Linear Mixed Models

Geert Molenberghs
Professor
Universiteit Hasselt & Katholieke Universiteit Leuven, Leuven, Belgium

In observational studies, repeated measurements may be taken at almost arbitrary time points, resulting in an extremely large number of time points at which only one or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other, more parsimonious, models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

y_i ∣ b_i ∼ N(X_iβ + Z_ib_i, Σ_i), ()
b_i ∼ N(0, D), ()

where X_i and Z_i are (n_i × p)- and (n_i × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σ_i is a (n_i × n_i) covariance matrix which depends on i only through its dimension n_i, i.e., the set of unknown parameters in Σ_i will not depend upon i. Finally, b_i is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector y_i of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, b_i), while others are not (fixed effects, β). The distributional assumptions in () with respect to the random effects can be motivated as follows. First, E(b_i) = 0 implies that the mean of y_i still equals X_iβ, such that the fixed effects in the random-effects model () can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, y_i also follows a normal distribution with mean vector X_iβ and with covariance matrix V_i = Z_iDZ′_i + Σ_i.

Note that the random effects in () implicitly imply the marginal covariance matrix V_i of y_i to be of the very specific form V_i = Z_iDZ′_i + Σ_i. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σ_i = σ²I_{n_i}. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Z_i which are n_i-dimensional vectors containing only ones.

The marginal model implied by expressions () and () is

y_i ∼ N(X_iβ, V_i), V_i = Z_iDZ′_i + Σ_i,

which can be viewed as a multivariate linear regression model, with a very particular parameterization of the covariance matrix V_i.

With respect to the estimation of the unit-specific parameters b_i, the posterior distribution of b_i, given the observed data y_i, can be shown to be (multivariate) normal, with mean vector equal to DZ′_iV_i^{−1}(α)(y_i − X_iβ). Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂_i for the b_i. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷ_i ≡ X_iβ̂ + Z_ib̂_i of the ith profile. It can easily be shown that

ŷ_i = Σ_iV̂_i^{−1} X_iβ̂ + (I_{n_i} − Σ_iV̂_i^{−1}) y_i,

which can be interpreted as a weighted average of the population-averaged profile X_iβ̂ and the observed data y_i, with weights Σ_iV̂_i^{−1} and I_{n_i} − Σ_iV̂_i^{−1}, respectively. Note that the "numerator" of Σ_iV̂_i^{−1} represents within-unit variability and the "denominator" is the overall covariance matrix V_i. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile X_iβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂_i) ≤ var(λ′b_i) = λ′Dλ.
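As a small numerical illustration, a NumPy sketch for a hypothetical random-intercept model with known variance components d² (random-intercept variance) and s² (residual variance), in which β comes from generalized least squares and b̂_i from the posterior-mean formula above:

```python
import numpy as np

def eb_random_intercept(y_list, X_list, d2, s2):
    # d2: random-intercept variance (D); s2: residual variance (Sigma_i = s2 * I)
    XVX, XVy, Vinv = 0.0, 0.0, []
    for y, X in zip(y_list, X_list):
        n = len(y)
        V = d2 * np.ones((n, n)) + s2 * np.eye(n)  # V_i = Z D Z' + Sigma_i, Z = 1
        Vi = np.linalg.inv(V)
        Vinv.append(Vi)
        XVX = XVX + X.T @ Vi @ X
        XVy = XVy + X.T @ Vi @ y
    beta = np.linalg.solve(XVX, XVy)               # GLS estimate of fixed effects
    # Empirical Bayes: b_i = D Z' V_i^{-1} (y_i - X_i beta), shrunk toward zero
    b = [float(d2 * np.ones(len(y)) @ Vi @ (y - X @ beta))
         for (y, X), Vi in zip(zip(y_list, X_list), Vinv)]
    return beta, b
```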

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics and Co-Editor of Biometrics; currently he is Co-Editor of Biostatistics. He was President of the International Biometric Society, received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse, and Marc Aerts, he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association, for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth, and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith (1998) is a plainly written elementary introduction to linear regression models; Rao (1973) is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ν_1 and ν_2, embedded in a white noise background: y[t] = β_1 + β_2 sin[ν_1t] + β_3 sin[ν_2t] + ε. We write β_1 + β_2x_2 + β_3x_3 + ε for β = (β_1, β_2, β_3), x = (x_1, x_2, x_3), x_1 ≡ 1, x_2(t) = sin[ν_1t], x_3(t) = sin[ν_2t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β_1 + β_2x_2 + β_3x_3, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eε_i ≡ 0, Eε_iε_k ≡ σ² > 0 for i = k ≤ n, and Eε_iε_k ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x. That is, we observe (y_i, sin[ν_1t_i], sin[ν_2t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = column vector {y_i, i ≤ n}, likewise for ε, and matrix x (the design matrix) whose entries are x_ik = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading. The term derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions, either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (i.i.d.) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. 2004). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao 2007).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which the asteroid Ceres would re-appear from behind the sun after it had been lost to view shortly following its discovery. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler (1986).

By contrast Galton (1877), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking the points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ≈ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that the points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and by 1886 he had identified all these properties as a consequence of bivariate normal observations (w, y) (Galton 1886).

Echoes of those long-ago events reverberate today in our many statistical models, "driven", as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we now have an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time, based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne, Berk, and Freedman.

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations in which random errors ε are assumed to satisfy

Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n:

y_i = x_i1β_1 + ⋯ + x_ipβ_p + ε_i, i ≤ n ()

The interpretation is that response y_i is observed for the ith sample, in conjunction with numerical values (x_i1, …, x_ip) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xᵀx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β_ML = β_LS = (xᵀx)⁻¹xᵀy = β + Mε ()

for M = (xᵀx)⁻¹xᵀ, with xᵀ denoting the matrix transpose of x and (xᵀx)⁻¹ the matrix inverse. These coefficient estimates β_LS are linear functions in y and satisfy the Gauss–Markov properties () and ():

E(β_LS)_k = β_k, k ≤ p, ()

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β_LS)_k − β_k)² ≤ E(β*_k − β_k)² for every k ≤ p. ()

Least squares estimator () is frequently employed without the assumption of normality, owing to the fact that properties () and () must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
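A minimal NumPy sketch of the least squares computation (the design matrix x and response y are simulated, hypothetical data; lstsq is used instead of forming (xᵀx)⁻¹ explicitly, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta = np.array([1.0, 2.0, -0.5])
y = x @ beta + rng.normal(scale=0.3, size=n)                    # y = x beta + eps

beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)                 # = (x'x)^{-1} x'y
print(beta_ls)
```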

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β_LS = My = M(xβ + ε) = β + Mε, as in (). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k1 = 0 (i.e., M_k is a contrast), so (β_LS)_k = β_k + M_kε, and M_k(ε + c1) = M_kε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ_LS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ_LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β_LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski). Freedman and Lane in 1983 advocated tests based on the permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ)-distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3)–(4) (any positive multiple of Σ will produce the same result), given by

β_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε.  (5)

The generalized least squares solution (5) retains properties (3)–(4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
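As a numerical sketch of (5) (again with simulated, hypothetical data), including a check that rescaling Σ leaves the estimate unchanged:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))          # hypothetical known error covariance
    eps = rng.multivariate_normal(np.zeros(n), Sigma)
    y = x @ np.array([2.0, 1.0]) + eps

    Si = np.linalg.inv(Sigma)
    beta_gls = np.linalg.solve(x.T @ Si @ x, x.T @ Si @ y)  # (x' Sigma^-1 x)^-1 x' Sigma^-1 y
    print(beta_gls)

    # multiplying Sigma by any positive constant leaves the estimate unchanged
    beta_scaled = np.linalg.solve(x.T @ (Si / 3) @ x, x.T @ (Si / 3) @ y)
    assert np.allclose(beta_gls, beta_scaled)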

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3)–(4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Jointly Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + xβ for every x,

and the discrepancies y − E(y ∣ x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (x_i1, ..., x_ip, y_i), 1 ≤ i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the ε_i are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
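The sampling-model half of that exercise can be carried out along the following lines; this sketch uses hypothetical data standing in for Galton's, and bootstraps by resampling (w, y) cases with replacement, in the spirit of Freedman's results:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    w = rng.normal(size=n)                       # stand-in for Galton's parent seed sizes
    y = 0.6 * w + rng.normal(scale=0.8, size=n)  # stand-in offspring values

    def slope(w, y):
        # least squares slope of the straight-line fit y = a + b w
        return np.cov(w, y, bias=True)[0, 1] / np.var(w)

    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=n)         # resample cases, as in the sampling model
        boot.append(slope(w[idx], y[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"slope = {slope(w, y):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")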

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example, to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(EX, 0). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(EX, 0), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < cE[J], and so, since X and Y are identically distributed (which gives E[Y 1(Y > c)] = E[X 1(X > c)] and E[J] = 0),

E[Y 1(X > c)] = E[Y 1(Y > c) + YJ] = E[X 1(X > c)] + E[YJ]
< E[X 1(X > c)] + cE[J] = E[X 1(X > c)],

yielding E(Y ∣ X > c) < E(X ∣ X > c). ◻
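A quick Monte Carlo check of the proposition (a sketch, with an arbitrary skewed distribution chosen for illustration): X and Y are independent draws from the same distribution, so the hypothesis of the proposition holds, and the conditional means show reversion.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 1_000_000
    X = rng.exponential(size=N)   # any common distribution with finite mean will do
    Y = rng.exponential(size=N)   # independent copy, so P(X>c and Y>c) < P(Y>c)
    c = 2.0                       # c > max(EX, 0) = 1

    sel = X > c
    print("E(X | X>c) =", X[sel].mean())  # noticeably above c
    print("E(Y | X>c) =", Y[sel].mean())  # reverts to the overall mean EY = 1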

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat
Freedman D () Bootstrapping regression models. Ann Stat


Freedman D () Statistical models and shoe leather. Sociol Methodol
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat
Galton F () Typical laws of heredity. Nature
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, volume 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal
Parzen E () An approach to time series analysis. Ann Math Stat
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x_1, ..., x_n) is a sample from a stochastic process x = {x_1, x_2, ...}. Let p_n(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n); θ_n)/p_n(x(n); θ)], where θ_n = θ + n^{−1/2} h and h is a (k × 1) vector. The joint density p_n(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} h^t J_n(θ) h + o_p(1),  (1)

where S_n(θ) = d ln p_n(x(n); θ)/dθ, J_n(θ) = −d² ln p_n(x(n); θ)/dθ dθ^t, and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)),  (ii) n^{−1} J_n(θ) →p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F^{−1}(θ)).  (3)

A large class of models involving the classical i.i.d. (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),  (4)

where I is the identity matrix.

where I is the identity matrixTwo examples belonging to the LAMN family are given

below

Example Variance mixture of normals

Suppose conditionally on V = v (x x xn)are iid N(θ vminus) random variables and V is an expo-nential random variable with mean e marginaljoint density of x(n) is then given by pn(x(n) θ) prop

[ +

n

sum(xi minus θ)]

minus( n +) It can be veried that F(θ) is

an exponential random variable with mean e ML esti-mator θn = x and

radicn(x minus θ) dETHrarr t() It is interesting to

note that the variance of the limit distribution of x isinfin

Example 2: Autoregressive process. Consider a first-order autoregressive process {x_t} defined by x_t = θx_{t−1} + e_t, t = 1, 2, ..., with x_0 = 0, where the e_t are assumed to be i.i.d. N(0, 1) random variables. We then have

p_n(x(n); θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θx_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is θ̂_n = (Σ_{t=1}^n x_t x_{t−1})(Σ_{t=1}^n x²_{t−1})^{−1}. One can verify that √n(θ̂_n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1} θ^n (θ̂_n − θ) →d Cauchy for ∣θ∣ > 1.
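A small simulation sketch of Example 2 (settings illustrative, not from the original entry): for an explosive AR(1) with ∣θ∣ > 1, the normalized ML estimator behaves like a Cauchy variable across replications, with the heavy tails that implies.

    import numpy as np

    rng = np.random.default_rng(4)

    def ar1_mle(theta, n):
        # simulate x_t = theta x_{t-1} + e_t, x_0 = 0, e_t ~ N(0,1), and return the ML estimator
        e = rng.normal(size=n)
        x = np.zeros(n + 1)
        for t in range(1, n + 1):
            x[t] = theta * x[t - 1] + e[t - 1]
        return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

    theta, n = 1.05, 200
    z = np.array([(theta**n / (theta**2 - 1)) * (ar1_mle(theta, n) - theta)
                  for _ in range(2000)])
    # Cauchy-like heavy tails: the interquartile range is stable but the sample variance is not
    print("IQR:", np.percentile(z, 75) - np.percentile(z, 25))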

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample data point on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), and Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} – and, as a consequence, the disturbance terms ε_i – are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X_{1:n}, X_{2:n}, ..., X_{n:n})′,  1 = (1, 1, ..., 1)′,
α = (α_{1:n}, α_{2:n}, ..., α_{n:n})′,  ε = (ε_1, ε_2, ..., ε_n)′,
θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

Writing Ω = B^{−1}, the GLS estimator of θ is

θ̂ = (A′ΩA)^{−1} A′Ω x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ΩA)^{−1}.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
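A sketch of the simplest version of this idea: an ordinary least-squares line on normal probability paper, using approximate plotting positions (Blom's choice for the normal) in place of the exact means α_{i:n}; a full GLS fit would additionally need the matrix B.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n = 50
    sample = np.sort(rng.normal(loc=10.0, scale=2.0, size=n))   # hypothetical data, a = 10, b = 2

    i = np.arange(1, n + 1)
    alpha = norm.ppf((i - 0.375) / (n + 0.25))   # Blom's plotting positions approximating E(Y_{i:n})

    # ordinary least squares line X_{i:n} ≈ a + b * alpha_{i:n}
    b_hat, a_hat = np.polyfit(alpha, sample, 1)
    print(f"location a ≈ {a_hat:.2f}, scale b ≈ {b_hat:.2f}")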

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, then joined Volkswagen AG to do operations research, and was subsequently appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics and an Elected member of the International Statistical Institute, and is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).

Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, ..., n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, ..., x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_i1 + ⋯ + β_p x_ip + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model, with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[R_i] = m_i π_i and
Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
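Because the logistic-normal has no closed-form moments, quantities such as the marginal response probability E(P_i) are typically computed by Gauss–Hermite quadrature, as in this sketch (μ and σ are illustrative values):

    import numpy as np

    def logit_normal_mean(mu, sigma, k=40):
        # E[expit(mu + sigma Z)], Z ~ N(0,1), via Gauss-Hermite quadrature
        nodes, weights = np.polynomial.hermite.hermgauss(k)
        z = np.sqrt(2.0) * nodes                 # change of variables for the N(0,1) density
        p = 1.0 / (1.0 + np.exp(-(mu + sigma * z)))
        return np.sum(weights * p) / np.sqrt(np.pi)

    print(logit_normal_mean(0.5, 1.0))   # not expit(0.5): the mean is pulled toward 1/2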

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1−y_i}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{(y_i θ_i − b(θ_i))/α(ϕ) + c(y_i; ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}.  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π); these two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)},  (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i))}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.  (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)).  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized to x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i,  (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i).  (10)
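The gradient (9) and Hessian (10) lead directly to a Newton–Raphson iteration; here is a compact sketch on simulated data (all names and settings hypothetical):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([-0.5, 1.2])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta = np.zeros(2)
    for _ in range(25):
        mu = 1 / (1 + np.exp(-X @ beta))                 # equation (8)
        grad = X.T @ (y - mu)                            # equation (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X     # equation (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                               # Newton-Raphson update
        if np.max(np.abs(step)) < 1e-10:
            break
    print("estimates:", beta)
    print("odds ratios:", np.exp(beta))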

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived     child     adults     Total
no              52        765       817
yes             57        442       499
Total          109       1207      1316

The odds of survival for adult passengers is 442/765, or 0.58. The odds of survival for children is 57/52, or 1.10. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.53. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived     Odds Ratio     Std. Err.     z     P>∣z∣     [95% Conf. Interval]
age            0.53

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.
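The arithmetic behind this odds ratio is easy to reproduce directly from the 2×2 tabulation above (a sketch, using the counts as tabulated):

    # counts from the tabulation: survived / died, for adults and for children
    adult_yes, adult_no = 442, 765
    child_yes, child_no = 57, 52

    odds_adult = adult_yes / adult_no    # ≈ 0.58
    odds_child = child_yes / child_no    # ≈ 1.10
    print("odds ratio (adult vs child):", odds_adult / odds_child)  # ≈ 0.53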

By inverting the estimated odds ratio above (1/0.53 ≈ 1.9), we may conclude that children had some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age were recorded as a continuous predictor in the Titanic data, and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, y, indicating the number of successes for a specific covariate pattern, and the denominator, m, the number of observations having that specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i},  (11)

where C(m_i, y_i) is the binomial coefficient, with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln C(m_i, y_i)}.  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y     cases     x1     x2     x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the predictor values shown for x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y     Odds ratio     OIM std. err.     z     P>∣z∣     [95% conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a much smaller number of covariate-pattern rows or observations as above described. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]^{−1},  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]^{−1},  −∞ < x < ∞,
h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]^{−1}.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard, associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β_0 + β_1 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini ().

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves. Using the Gini index, L_1(p) has greater inequality (G = ) than L_2(p) (G = )

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
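For a sample of incomes, the empirical Lorenz curve and Gini index can be computed directly from the definitions, as in this sketch (the lognormal sample is purely illustrative):

    import numpy as np

    def lorenz(incomes):
        # L(p) at p = 0, 1/n, ..., 1: cumulative income share of the poorest fraction p
        s = np.sort(incomes)
        L = np.concatenate([[0.0], np.cumsum(s) / s.sum()])
        p = np.linspace(0, 1, len(L))
        return p, L

    def gini(incomes):
        # ratio of the area between diagonal and Lorenz curve to the area under the diagonal (1/2)
        p, L = lorenz(incomes)
        return 1 - 2 * np.trapz(L, p)

    rng = np.random.default_rng(8)
    incomes = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)  # skewed, as income data typically are
    print("Gini =", gini(incomes))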

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income, and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
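Theorem 1 can be illustrated numerically: a post-tax income function u(x) with u(x)/x decreasing lowers the Gini index, while a flat tax leaves it unchanged. The tax functions below are hypothetical, and gini() is the helper sketched in the previous code block.

    import numpy as np

    rng = np.random.default_rng(9)
    x = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)

    flat = 0.7 * x                  # u(x)/x constant: case (II), Gini unchanged
    progressive = x**0.8            # u(x)/x = x^(-0.2) decreasing: case (I), Gini reduced

    print("pre-tax     :", gini(x))
    print("flat tax    :", gini(flat))         # equal to the pre-tax Gini
    print("progressive :", gini(progressive))  # strictly smaller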

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, then there is no loss; therefore the value of the loss function is zero then. Otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L: Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L: Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative set, Θ = Θ_0 + Θ_1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating of the true value θ has to be judged differently than overestimating. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly. For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ). Then ℓ can be chosen as above.
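As an illustration, the losses above may be sketched in Python as follows (a minimal sketch; the constants γ = 1.345 and a = b = 1 are arbitrary illustrative choices):

```python
import numpy as np

def square_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma=1.345):
    # quadratic near zero, linear in the tails; continuous at |t| = gamma
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    # asymmetric: exponential penalty on one side of zero, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-3.0, 3.0, 7)
print(huber_loss(t))
print(linex_loss(t))
```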

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to get no counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real valued functions. Given n observations (data) y1, …, yn observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimation for the unknown regression function by the "best approximation", i.e., the function in F that minimizes ∑ⁿᵢ₌₁ ℓ(rᵢᶠ), f ∈ F, where rᵢᶠ = yᵢ − f(tᵢ) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimation is obtained if ℓ(t) = t².

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J Savage. North Holland, Amsterdam, pp –

Linear Mixed Models

or only a few measurements have been taken. Many of the parametric covariance models described so far may then contain too many parameters to make them useful in practice, while other more parsimonious models may be based on assumptions which are too simplistic to be realistic. A general and very flexible class of parametric models for continuous longitudinal data is formulated as follows:

yi ∣ bi ∼ N(Xiβ + Zibi, Σi), (1)
bi ∼ N(0, D), (2)

where Xi and Zi are (ni × p)- and (ni × q)-dimensional matrices of known covariates, β is a p-dimensional vector of regression parameters, called the fixed effects, D is a general (q × q) covariance matrix, and Σi is a (ni × ni) covariance matrix which depends on i only through its dimension ni, i.e., the set of unknown parameters in Σi will not depend upon i. Finally, bi is a vector of subject-specific or random effects.

The above model can be interpreted as a linear regression model (see 7Linear Regression Models) for the vector yi of repeated measurements for each unit separately, where some of the regression parameters are specific (random effects, bi) while others are not (fixed effects, β). The distributional assumptions in (2) with respect to the random effects can be motivated as follows. First, E(bi) = 0 implies that the mean of yi still equals Xiβ, such that the fixed effects in the random-effects model (1) can also be interpreted marginally. Not only do they reflect the effect of changing covariates within specific units, they also measure the marginal effect in the population of changing the same covariates. Second, the normality assumption immediately implies that, marginally, yi also follows a normal distribution with mean vector Xiβ and with covariance matrix Vi = ZiDZi′ + Σi.

Note that the random effects in (1) implicitly imply the marginal covariance matrix Vi of yi to be of the very specific form Vi = ZiDZi′ + Σi. Let us consider two examples under the assumption of conditional independence, i.e., assuming Σi = σ²Ini. First consider the case where the random effects are univariate and represent unit-specific intercepts. This corresponds to covariates Zi which are ni-dimensional vectors containing only ones.

The marginal model implied by expressions (1) and (2) is

yi ∼ N(Xiβ, Vi), Vi = ZiDZi′ + Σi,

which can be viewed as a multivariate linear regression model with a very particular parameterization of the covariance matrix Vi.

With respect to the estimation of the unit-specific parameters bi, the posterior distribution of bi given the observed data yi can be shown to be (multivariate) normal with mean vector equal to DZi′Vi(α)⁻¹(yi − Xiβ), where α denotes the vector of unknown variance parameters in Vi. Replacing β and α by their maximum likelihood estimates, we obtain the so-called empirical Bayes estimates b̂i for the bi. A key property of these EB estimates is shrinkage, which is best illustrated by considering the prediction ŷi ≡ Xiβ̂ + Zib̂i of the ith profile. It can easily be shown that

ŷi = ΣiVi⁻¹Xiβ̂ + (Ini − ΣiVi⁻¹)yi,

which can be interpreted as a weighted average of the population-averaged profile Xiβ̂ and the observed data yi, with weights ΣiVi⁻¹ and Ini − ΣiVi⁻¹, respectively. Note that the "numerator" of ΣiVi⁻¹ represents within-unit variability and the "denominator" is the overall covariance matrix Vi. Hence, much weight will be given to the overall average profile if the within-unit variability is large in comparison to the between-unit variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true. This phenomenon is referred to as shrinkage toward the average profile Xiβ̂. An immediate consequence of shrinkage is that the EB estimates show less variability than actually present in the random-effects distribution, i.e., for any linear combination λ of the random effects,

var(λ′b̂i) ≤ var(λ′bi) = λ′Dλ.
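A minimal numerical sketch of this shrinkage property for a single unit, assuming a random-intercept model with Σi = σ²I and, for simplicity, plugging in the true β rather than its ML estimate (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# random-intercept model: y = X beta + Z b + e, with Sigma = sigma2 * I
n_i, sigma2, d = 5, 1.0, 4.0                    # within-unit and random-intercept variances
X = np.column_stack([np.ones(n_i), np.arange(n_i)])
Z = np.ones((n_i, 1))                           # unit-specific intercept
beta = np.array([2.0, 0.5])

b_true = rng.normal(0.0, np.sqrt(d))
y = X @ beta + Z[:, 0] * b_true + rng.normal(0.0, np.sqrt(sigma2), n_i)

# marginal covariance V = Z D Z' + Sigma
V = d * (Z @ Z.T) + sigma2 * np.eye(n_i)
V_inv = np.linalg.inv(V)

# empirical Bayes estimate b_hat = D Z' V^{-1} (y - X beta)
b_hat = d * Z.T @ V_inv @ (y - X @ beta)

# predicted profile as the weighted average given in the text
W = sigma2 * V_inv                              # weight on the population-averaged profile
y_hat = W @ (X @ beta) + (np.eye(n_i) - W) @ y
print(b_hat, np.allclose(y_hat, X @ beta + Z[:, 0] * b_hat[0]))
```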

About the Author
Geert Molenberghs is Professor of Biostatistics at Universiteit Hasselt and Katholieke Universiteit Leuven in Belgium. He was Joint Editor of Applied Statistics (–) and Co-Editor of Biometrics (–); currently he is Co-Editor of Biostatistics (–). He was President of the International Biometric Society (–), received the Guy Medal in Bronze from the Royal Statistical Society and the Myrto Lefkopoulou award from the Harvard School of Public Health. Geert Molenberghs is founding director of the Center for Statistics. He is also the director of the Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat). Jointly with Geert Verbeke, Mike Kenward, Tomasz Burzykowski, Marc Buyse and Marc Aerts he authored books on longitudinal and incomplete data and on surrogate marker evaluation. Geert Molenberghs received several Excellence in Continuing Education Awards of the American Statistical Association for courses at Joint Statistical Meetings.


Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation

References and Further Reading
Brown H, Prescott R () Applied mixed models in medicine. Wiley, New York
Crowder MJ, Hand DJ () Analysis of repeated measures. Chapman & Hall, London
Davidian M, Giltinan DM () Nonlinear models for repeated measurement data. Chapman & Hall, London
Davis CS () Statistical methods for the analysis of repeated measurements. Springer, New York
Demidenko E () Mixed models: theory and applications. Wiley, New York
Diggle PJ, Heagerty PJ, Liang KY, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fahrmeir L, Tutz G () Multivariate statistical modelling based on generalized linear models, 2nd edn. Springer, New York
Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs G () Longitudinal data analysis: handbook. Wiley, Hoboken
Goldstein H () Multilevel statistical models. Edward Arnold, London
Hand DJ, Crowder MJ () Practical longitudinal data analysis. Chapman & Hall, London
Hedeker D, Gibbons RD () Longitudinal data analysis. Wiley, New York
Kshirsagar AM, Smith WB () Growth curves. Marcel Dekker, New York
Leyland AH, Goldstein H () Multilevel modelling of health statistics. Wiley, Chichester
Lindsey JK () Models for repeated measurements. Oxford University Press, Oxford
Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O () SAS for mixed models, 2nd edn. SAS Press, Cary
Longford NT () Random coefficient models. Oxford University Press, Oxford
Molenberghs G, Verbeke G () Models for discrete longitudinal data. Springer, New York
Pinheiro JC, Bates DM () Mixed effects models in S and S-Plus. Springer, New York
Searle SR, Casella G, McCulloch CE () Variance components. Wiley, New York
Verbeke G, Molenberghs G () Linear mixed models for longitudinal data. Springer Series in Statistics. Springer, New York
Vonesh EF, Chinchilli VM () Linear and non-linear models for the analysis of repeated measurements. Marcel Dekker, Basel
Weiss RE () Modeling longitudinal data. Springer, New York
West BT, Welch KB, Gałecki AT () Linear mixed models: a practical guide using statistical software. Chapman & Hall/CRC, Boca Raton
Wu H, Zhang J-T () Nonparametric regression methods for longitudinal data analysis. Wiley, New York
Wu L () Mixed effects models for complex data. Chapman & Hall/CRC Press, Boca Raton

Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction. (F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and the variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed, known frequencies, embedded in a white noise background: y[t] = β0 + β1 sin[ω1 t] + β2 sin[ω2 t] + ε. We write β0 + β1x1 + β2x2 + ε for β = (β0, β1, β2), x = (1, x1, x2), x1 ≡ x1(t) = sin[ω1 t], x2(t) = sin[ω2 t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β0 + β1x1 + β2x2, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεi ≡ 0; Eεiεk ≡ σ² > 0 for i = k ≤ n; Eεiεk ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the errors are correlated, and that correlation depends upon the x values). What we observe are yi and associated values of the independent variables x. That is, we observe (yi, sin[ω1 ti], sin[ω2 ti]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y the column vector (yi, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are xik = xk[ti].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov chain Monte Carlo having enabled their calculation); non-parametric methods (good performance relative to more relaxed assumptions about errors); and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known, by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data sy ∼ sw, this was the least squares line (also the regression line, since the data was bivariate normal) and its slope was r sy/sw = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1 then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and he later identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in our many statistical models, "driven" as we now proclaim by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure and yet neither, either, or both may produce useful results. In this respect statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (, ).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations

yi = xi1β1 + ⋯ + xipβp + εi, i ≤ n, (1)

in which the random errors ε are assumed to satisfy Eεi ≡ 0, Eε²i ≡ σ² > 0, i ≤ n.

The interpretation is that response yi is observed for the ith sample in conjunction with numerical values (xi1, …, xip) of the independent variables. If these errors εi are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated) and if the matrix xᵀx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βk, k ≤ p, are produced by ordinary least squares (LS) as follows:

β̂ML = β̂LS = (xᵀx)⁻¹xᵀy = β + Mε (2)

for M = (xᵀx)⁻¹xᵀ, with xᵀ denoting the matrix transpose of x and (xᵀx)⁻¹ the matrix inverse. These coefficient estimates β̂LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β̂LS)k = βk, k ≤ p, (3)

and, among all unbiased estimators β*k (of βk) that are linear in y,

E((β̂LS)k − βk)² ≤ E(β*k − βk)² for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
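A minimal sketch of (2) on simulated data of the sinusoidal kind described above (the frequencies 1.3 and 2.7 and the coefficient values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# design matrix with a constant term and two sinusoidal regressors
n = 200
t = np.linspace(0.0, 10.0, n)
x = np.column_stack([np.ones(n), np.sin(1.3 * t), np.sin(2.7 * t)])
beta = np.array([1.0, 2.0, -0.5])
y = x @ beta + rng.normal(0.0, 0.3, n)

# beta_LS = (x'x)^{-1} x'y, computed via a numerically stable least-squares solver
beta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_ls)  # close to (1.0, 2.0, -0.5)
```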

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε then β̂LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mk, k > 1, of M satisfies Mk1 = 0 (i.e., Mk is a contrast), so (β̂LS)k = βk + Mkε and Mk(ε + c) = Mkε, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ̂LS) = (1 − R²) s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ̂LS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β̂ML = (xᵀΣ⁻¹x)⁻¹xᵀΣ⁻¹y = β + (xᵀΣ⁻¹x)⁻¹xᵀΣ⁻¹ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σi βi wi(t), t ∈ T,

where the wi(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + (x − Ex)β for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:

Errors model: Model (1) above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εi are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as had been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear; going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX) define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

EY 1(X > c) = E(Y 1(Y > c) + YJ) = EX 1(X > c) + E YJ < EX 1(X > c) + c EJ = EX 1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻
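A quick numerical illustration of the proposition, for identically distributed but imperfectly correlated (X, Y) (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# identically distributed, positively but imperfectly correlated (X, Y)
n, r = 1_000_000, 0.5
x = rng.normal(size=n)
y = r * x + np.sqrt(1 - r ** 2) * rng.normal(size=n)  # same N(0, 1) marginal as x

c = 1.0
sel = x > c
print(y[sel].mean(), x[sel].mean())  # E(Y | X > c) is well below E(X | X > c)
```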

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat –
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat –
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat –
Freedman D () Bootstrapping regression models. Ann Stat –
Freedman D () Statistical models and shoe leather. Sociol Methodol –
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat –
Galton F () Typical laws of heredity. Nature –, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland –
Hinkelmann K, Kempthorne O () Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal –
Parzen E () An approach to time series analysis. Ann Math Stat –
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat –
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture Notes in Computer Science. Springer, Berlin, pp –

Local Asymptotic Mixed NormalFamily

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, …, xn) is a sample from a stochastic process x = {x1, x2, …}. Let pn(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n); θn)/pn(x(n); θ)], where θn = θ + n^(−1/2)h and h is a (k × 1) vector. The joint density pn(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^(−1/2) hᵗSn(θ) − (1/2) n⁻¹ (hᵗJn(θ)h) + op(1), (1)

where Sn(θ) = d ln pn(x(n); θ)/dθ, Jn(θ) = −d² ln pn(x(n); θ)/dθ dθᵗ, and

(i) n^(−1/2) Sn(θ) →d Nk(0, F(θ)), (ii) n⁻¹ Jn(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal and efficient estimator of θ, with

√n(θ̂n − θ) →d Nk(0, F⁻¹(θ)). (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

radicn by

a random norm Jn (θ) to get the limiting normal distribu-

tion vizJn (θ)(θn minus θ) dETHrarr N( I) ()

where I is the identity matrixTwo examples belonging to the LAMN family are given

below

Example 1: Variance mixture of normals.

Suppose that, conditionally on V = v, (x1, x2, …, xn) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n); θ) ∝ [1 + (1/2) ∑ⁿᵢ₌₁ (xᵢ − θ)²]^(−(n/2+1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
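A small simulation sketch of this example (sample size and replication count arbitrary), checking the t(2) limit against its two-sided 5% critical value 4.303:

```python
import numpy as np

rng = np.random.default_rng(8)

# conditionally on V ~ Exp(1), the sample mean satisfies xbar | V = v ~ N(theta, 1/(n v))
n, reps, theta = 400, 20000, 1.0
v = rng.exponential(size=reps)
xbar = theta + rng.normal(size=reps) / np.sqrt(n * v)
stat = np.sqrt(n) * (xbar - theta)          # distributed as N(0, 1)/sqrt(V), i.e., t(2)
print(np.mean(np.abs(stat) > 4.303))        # approximately 0.05
```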

Example 2: Autoregressive process.

Consider a first-order autoregressive process xt defined by xt = θxt−1 + et, t = 1, 2, …, with x0 = 0, where the et are assumed to be iid N(0, 1) random variables. We then have

pn(x(n); θ) ∝ exp[−(1/2) ∑ⁿₜ₌₁ (xt − θxt−1)²].

For the stationary case ∣θ∣ < 1 this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is θ̂n = (∑ⁿₜ₌₁ xt xt−1)(∑ⁿₜ₌₁ x²t−1)⁻¹. One can verify that √n(θ̂n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹θⁿ(θ̂n − θ) →d Cauchy for ∣θ∣ > 1.
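A simulation sketch of the stationary case (parameter values arbitrary), checking that the variance of √n(θ̂n − θ) is near 1 − θ²:

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1_mle(theta, n):
    """Simulate x_t = theta * x_{t-1} + e_t with x_0 = 0 and return the ML estimator."""
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

theta, n = 0.5, 500
est = [ar1_mle(theta, n) for _ in range(2000)]
print(n * np.var(est), 1 - theta ** 2)      # both close to 0.75
```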

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution function having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable; it has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

fX(x ∣ a, b) = dFX(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

fX(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μY and standard deviation σY, then the mean of X is E(X) = a + b μY and the standard deviation of X is √Var(X) = b σY.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:
Arc-sine distribution
Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped or the power-function distributions
Cauchy and half-Cauchy distributions
Special cases of the χ-distribution, like the half-normal, the Rayleigh and the Maxwell–Boltzmann distributions
Ordinary and raised cosine distributions
Exponential and reflected exponential distributions
Extreme value distribution of the maximum and the minimum, each of type I
Hyperbolic secant distribution
Laplace distribution
Logistic and half-logistic distributions
Normal and half-normal distributions
Parabolic distributions
Rectangular or uniform distribution
Semi-elliptical distribution
Symmetric triangular distribution
Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics Xi:n, X1:n ≤ X2:n ≤ ⋯ ≤ Xn:n, as regressand and the means of the reduced order statistics, αi:n = E(Yi:n), as regressor, which under these circumstances act as plotting positions. The regression model reads

Xi:n = a + b αi:n + εi,

where εi is a random variable expressing the difference between Xi:n and its mean E(Xi:n) = a + b αi:n. As the order statistics Xi:n, and as a consequence the disturbance terms εi, are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X1:n, X2:n, …, Xn:n)′, 1 = (1, 1, …, 1)′, α = (α1:n, α2:n, …, αn:n)′,
ε = (ε1, ε2, …, εn)′, θ = (a, b)′, A = (1 α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B is the variance–covariance matrix of the reduced order statistics Yi:n.

The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹A′B⁻¹x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. For only a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yi:n) and E(Yi:n Yj:n). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
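For the exponential case the closed forms follow from the Rényi representation Yi:n = Σ(r = 1..i) Zr/(n − r + 1), with the Zr iid standard exponential; the following Python sketch (illustrative values) uses them to compute Lloyd's estimators:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
idx = np.arange(1, n + 1)

# reduced exponential order statistics: E(Y_i:n) and Cov(Y_i:n, Y_j:n) via Renyi's representation
w = 1.0 / (n - idx + 1)                                   # weights 1/(n - r + 1), r = 1..n
alpha = np.cumsum(w)                                      # E(Y_i:n)
B = np.array([[np.sum(w[:min(i, j)] ** 2) for j in idx] for i in idx])

# sample from a shifted and scaled exponential, then apply the GLS (BLUE) estimator
a_true, b_true = 2.0, 3.0
x = np.sort(a_true + b_true * rng.exponential(size=n))
A = np.column_stack([np.ones(n), alpha])
Bi = np.linalg.inv(B)
theta_hat = np.linalg.solve(A.T @ Bi @ A, A.T @ Bi @ x)
print(theta_hat)  # close to (2.0, 3.0)
```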

About the Author
Dr. Horst Rinne is Professor Emeritus of Statistics and Econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University, and worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover before joining Volkswagen AG to do operations research. He was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is Co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat –
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat –
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc –
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika –
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference –
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ) and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution, with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi),
logit(Pi) ∣ Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (),

E[Ri] = mi πi and Var(Ri) = mi πi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability: the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
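A simulation sketch of this overdispersion (all parameter values arbitrary), comparing the empirical variance of Ri with the binomial variance and with Williams' approximation:

```python
import numpy as np

rng = np.random.default_rng(5)

# grouped binary data with a logit-normal random effect
n_groups, m, eta, sigma = 20000, 10, -0.5, 0.4
z = rng.normal(size=n_groups)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))   # logit(P_i) = eta + sigma * Z
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
var_binomial = m * pi * (1 - pi)
var_williams = var_binomial * (1 + sigma ** 2 * (m - 1) * pi * (1 - pi))
print(r.var(), var_binomial, var_williams)     # empirical variance exceeds the binomial value
```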

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B –
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika –
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika –
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat –

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used tomodel binary response data When the response is binaryit typically takes the form of with generally indicat-ing a success and a failureHowever the actual values that and can take vary widely depending on the purpose ofthe study For example for a study of the odds of failure in aschool setting may have the value of fail and of not-failor pass e important point is that indicates the fore-most subject of interest forwhich a binary response study isdesigned Modeling a binary response variable using nor-mal linear regression introduces substantial bias into theparameter estimates e standard linear model assumesthat the response and error terms are normally or Gaus-sian distributed that the variance σ is constant acrossobservations and that observations in the model are inde-pendent When a binary variable is modeled using thismethod the rst two of the above assumptions are violatedAnalogical to the normal regression model being basedon the Gaussian probability distribution function (pdf )a binary response model is derived from a Bernoulli dis-tribution which is a subset of the binomial pdf with thebinomial denominator taking the value of e Bernoullipdf may be expressed as

f(yi; πi) = πi^yi (1 − πi)^(1−yi)    (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{(yiθi − b(θi))/α(ϕ) + c(yi; ϕ)}    (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}    (3)

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation


across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μi; yi) = ∑_{i=1}^{n} [yi ln(μi/(1 − μi)) + ln(1 − μi)]    (4)

or

L(μi; yi) = ∑_{i=1}^{n} [yi ln(μi) + (1 − yi) ln(1 − μi)]    (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model the deviance is expressed as

D = 2 ∑_{i=1}^{n} [yi ln(yi/μi) + (1 − yi) ln((1 − yi)/(1 − μi))]    (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore,

x′iβ = ln(μi/(1 − μi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn    (7)

The value of μi for each observation in the logistic model is calculated as

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ))    (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑_{i=1}^{n} (yi − μi)xi    (9)

∂²L(β)/∂β∂β′ = −∑_{i=1}^{n} xix′i μi(1 − μi)    (10)
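Expressed as code, this Newton–Raphson iteration is only a few lines. The sketch below (function and variable names are illustrative, not from the entry) applies the gradient (9) and Hessian (10) exactly as displayed:

```python
import numpy as np

def logistic_newton(X, y, n_iter=25, tol=1e-10):
    # X: n x p design matrix (first column of ones); y: n-vector of 0/1
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))           # fitted values, Eq. (8)
        grad = X.T @ (y - mu)                          # gradient, Eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X   # Hessian, Eq. (10)
        step = np.linalg.solve(hess, -grad)            # Newton step: -H^{-1} g
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    # standard errors come from the inverse of the negative Hessian
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    return beta, se
```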

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of adult survivors divided by the number of adult deaths; the odds of survival for children is computed likewise. The ratio of the odds of survival for adults to the odds of survival for children is the value referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
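The arithmetic of these odds comparisons is simple enough to sketch. The 2×2 counts below are assumed for illustration only; the entry's actual tabulated counts are not reproduced above.

```python
# Assumed illustrative counts: survivors and deaths by age group.
surv_adult, died_adult = 700, 1400
surv_child, died_child = 60, 58

odds_adult = surv_adult / died_adult      # 0.50
odds_child = surv_child / died_child      # about 1.03
print(odds_adult / odds_child)            # odds ratio near 0.48: about half
print(odds_child / odds_adult)            # inverted: nearly twice the odds
```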

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped, or proportional, data. For these models the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi − yi)    (11)

with a corresponding log-likelihood function expressed as

L(μi; yi, mi) = ∑_{i=1}^{n} [yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln(mi choose yi)]    (12)

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = mi πi and variance μi(1 − μi/mi).

Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases with the given values of the predictors x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1
x2
x3
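This grouped (binomial) estimation can be sketched by maximizing the log-likelihood (12) numerically. The small data set below is purely illustrative; the values in the entry's own table did not survive extraction.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb

# Illustrative grouped data: y successes out of `cases` per covariate pattern
y     = np.array([1.0, 2.0, 1.0, 4.0])
cases = np.array([3.0, 4.0, 2.0, 5.0])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])          # columns: intercept, x1, x2

def negloglik(beta):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    # binomial-logistic log-likelihood, as in Eq. (12)
    ll = (y * np.log(mu) + (cases - y) * np.log(1.0 - mu)
          + np.log(comb(cases, y)))
    return -np.sum(ll)

fit = minimize(negloglik, np.zeros(3), method="BFGS")
print(fit.x, np.exp(fit.x))   # coefficients and their odds ratios
```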

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many observations being grouped into a much smaller number of rows, or observations, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that both AIC and BIC are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analyses are most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association (–). Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for a number of years. He was founding editor of the Stata Technical Bulletin () and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,    −∞ < x < ∞,    (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,    −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,    −∞ < x < ∞,

h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],    t ≥ 0;  θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn by elementary calculus implies that

logit(F(x)) = loge [F(x)/(1 − F(x))] = x,    (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = loge[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
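A minimal simulation sketch of this construction (the values of μ and τ below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2024)
u = rng.uniform(size=200_000)            # U ~ U(0, 1)
x = np.log(u / (1.0 - u))                # standard logistic, via Eq. (2)

mu, tau = 2.0, 1.5                       # illustrative location and scale
sample = tau * x + mu                    # general logistic, as in (1)

print(sample.mean())                     # close to mu = 2.0
print(sample.var(), tau**2 * np.pi**2 / 3.0)   # both near 7.4
```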

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = loge [P(d)/(1 − P(d))] = β0 + β1d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β0/β1 and τ = 1/∣β1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().
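Such a fit can be sketched with the statsmodels binomial GLM and its default logit link; the r-out-of-n dose–response data below are assumed for illustration, and the tolerance-distribution parameters μ = −β0/β1 and τ = 1/∣β1∣ are recovered afterwards:

```python
import numpy as np
import statsmodels.api as sm

# Assumed quantal bio-assay data: r responders out of n at each dose d
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = np.array([40, 40, 40, 40, 40])
r = np.array([4, 10, 21, 30, 37])

X = sm.add_constant(d)                                  # columns: 1, d
fit = sm.GLM(np.column_stack([r, n - r]), X,
             family=sm.families.Binomial()).fit()       # logit link by default
b0, b1 = fit.params
print(-b0 / b1, 1.0 / abs(b1))   # mu and tau of the logistic tolerance law
```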

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association (–), the Statistical Modelling Society (–) and the European Regional Section of the International Association for Statistical Computing (–). He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region (). He is an Elected Member of the International Statistical Institute (). He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell; Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common tool of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds; this case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (). This index is the ratio


Lorenz Curve. Fig. 1  Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2  Two intersecting Lorenz curves; using the Gini index, one curve has greater inequality (G = ) than the other (G = )

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
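These definitions translate directly into a small computational sketch (function and variable names are illustrative): the empirical Lorenz curve is the cumulative income share at cumulative population share p, and the Gini index follows from the area between the curve and the diagonal.

```python
import numpy as np

def lorenz_gini(income):
    x = np.sort(np.asarray(income, dtype=float))
    L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])   # L(p) at p = k/n
    p = np.linspace(0.0, 1.0, L.size)
    gini = 1.0 - 2.0 * np.trapz(L, p)    # area-ratio definition of G
    return p, L, gini

incomes = np.random.default_rng(1).lognormal(0.0, 0.8, size=100_000)
p, L, G = lorenz_gini(incomes)
print(G)   # about 0.43 for this lognormal population
```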

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μY/μX level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
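Case (I) of the theorem is easy to check numerically. In the sketch below, a stylized progressive rule (illustrative, not from the text) makes u(x)/x monotone nonincreasing, and the Gini index falls:

```python
import numpy as np

def gini(v):
    v = np.sort(np.asarray(v, dtype=float))
    n = v.size
    i = np.arange(1, n + 1)
    # standard closed form for the sample Gini index
    return 2.0 * np.sum(i * v) / (n * v.sum()) - (n + 1.0) / n

rng = np.random.default_rng(7)
x = rng.lognormal(0.0, 0.8, size=100_000)           # pre-tax incomes
u = x - 0.3 * np.maximum(x - np.median(x), 0.0)     # 30% tax above the median

print(gini(x))   # pre-tax inequality
print(gini(u))   # smaller: u(x)/x is nonincreasing, so case (I) applies
```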

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics (–) and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: the loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and therefore the value of the loss function is zero then; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

dened on the set of all possible data Since the true value θis unknown the loss function L has to be dened on ΘtimesΘie

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration. Next we describe examples of loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative set: Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems, we assume that Θ is a normed linear space, and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
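The losses named above translate directly into code; a sketch, with illustrative default constants not taken from the text:

```python
import numpy as np

def power_loss(t, p=2):          # |t|^p: p = 2 square loss, p = 1 robust L1 loss
    return np.abs(t) ** p

def huber(t, gamma=1.0):         # quadratic near zero, linear in the tails
    return np.where(np.abs(t) <= gamma,
                    t ** 2,
                    2.0 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):      # asymmetric: exponential one side, linear the other
    return b * (np.exp(a * t) - a * t - 1.0)
```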

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a


scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn, observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes ∑_{i=1}^{n} ℓ(r_i^f), f ∈ F, where r_i^f = yi − f(ti) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
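A sketch of this regression use of a loss function, fitting a straight line by minimizing the summed loss of the residuals (synthetic data; names illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * t + rng.normal(scale=0.1, size=t.size)
y[::12] += 3.0                      # a few gross outliers

def fit(loss):
    # minimize sum_i loss(y_i - f(t_i)) over straight lines f
    obj = lambda b: np.sum(loss(y - (b[0] + b[1] * t)))
    return minimize(obj, np.zeros(2), method="Nelder-Mead").x

print(fit(lambda r: r ** 2))        # least squares: dragged by the outliers
print(fit(np.abs))                  # L1 loss: far more robust here
```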

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



Cross References
7Best Linear Unbiased Estimation in Linear Models
7General Linear Models
7Testing Variance Components in Mixed Linear Models
7Trend Estimation


Linear Regression Models

Raoul LePage
Professor
Michigan State University, East Lansing, MI, USA

7 I did not want proof, because the theoretical exigencies of the problem would afford that. What I wanted was to be started in the right direction.

(F. Galton)

The linear regression model of statistics is any functional relationship y = f(x, β, ε) involving a dependent real-valued variable y, independent variables x, model parameters β, and random variables ε, such that a measure of central tendency for y in relation to x, termed the regression function, is linear in β. Possible regression functions include the conditional mean E(y∣x, β) (as when β is itself random, as in Bayesian approaches), conditional medians, quantiles, or other forms. Perhaps y is corn yield from a given plot of earth and variables x include levels of water, sunlight, fertilization, discrete variables identifying the genetic variety of seed, and combinations of these intended to model interactive effects they may have on y. The form of this linkage is specified by a function f known to the experimenter, one that depends upon parameters β whose values are not known, and also upon unseen random errors ε about which statistical assumptions are made. These models prove surprisingly flexible, as when localized linear regression models are knit together to estimate a regression function nonlinear in β. Draper and Smith () is a plainly written elementary introduction to linear regression models; Rao () is one of many established general references at the calculus level.

Aspects of Data, Model, and Notation
Suppose a time-varying sound signal is the superposition of sine waves of unknown amplitudes at two fixed known frequencies ω1, ω2, embedded in white noise background: y[t] = β1 + β2 sin[ω1t] + β3 sin[ω2t] + ε. We write β1 + β2x2 + β3x3 + ε for β = (β1, β2, β3), x = (x1, x2, x3), x1 ≡ 1, x2(t) = sin[ω1t], x3(t) = sin[ω2t], t ≥ 0. A natural choice of regression function is m(x, β) = E(y∣x, β) = β1 + β2x2 + β3x3, provided Eε ≡ 0. In the classical linear regression model one assumes, for different instances "i" of observation, that the random errors satisfy Eεi ≡ 0, Eεiεk ≡ σ² > 0 for i = k ≤ n, and Eεiεk ≡ 0 otherwise. Errors in linear regression models typically depend upon the instances i at which we select observations, and may in some formulations depend also on the values of x associated with instance i (perhaps the


errors are correlated, and that correlation depends upon the x values). What we observe are yi and associated values of the independent variables x; that is, we observe (yi, sin[ω1ti], sin[ω2ti]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = the column vector (yi, i ≤ n), likewise for ε, and matrix x (the design matrix) whose entries are xik = xk[ti].

Terminology
Independent variables, as employed in this context, is misleading. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction, such as Bayesian methods (7Markov chain Monte Carlo having enabled their calculation), non-parametric methods (good performance relative to more relaxed assumptions about errors), and iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun, after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's Laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre () published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data sy ∼ sw, this was the least squares line (also the regression line, since the data was bivariate normal) and its slope was r·sy/sw = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ) random residuals whose variance θ > 0 was the same for all w. Of course this likewise amazed him, and he later identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long-ago events reverberate today in our many statistical models, "driven", as we now proclaim, by random errors subject to ever-broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we have now an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (), Freedman ().

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations in which the random errors ε are assumed to satisfy

Eεi ≡ 0, Eεi² ≡ σ² > 0, i ≤ n,

yi = xi1β1 + ⋯ + xipβp + εi, i ≤ n    (1)

The interpretation is that response yi is observed for the ith sample in conjunction with numerical values (xi1, …, xip) of the independent variables. If these errors εi are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix xtrx is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βk, k ≤ p, are produced by ordinary least squares (LS) as follows:

βML = βLS = (xtrx)⁻¹xtry = β + Mε    (2)

for M = (xtrx)⁻¹xtr, with xtr denoting the matrix transpose of x and (xtrx)⁻¹ the matrix inverse. These coefficient estimates βLS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(βLS)k = βk, k ≤ p    (3)

and, among all unbiased estimators β*k (of βk) that are linear in y,

E((βLS)k − βk)² ≤ E(β*k − βk)² for every k ≤ p    (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions, F, t, chi-square, have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
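A numerical sketch of (1)–(2) for the sound-signal example above (the frequencies and all other values are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 10.0, n)
w1, w2 = 1.0, 2.0                     # two assumed known frequencies
X = np.column_stack([np.ones(n), np.sin(w1 * t), np.sin(w2 * t)])
beta = np.array([1.0, 2.0, -1.0])     # "true" coefficients for the demo
y = X @ beta + rng.normal(scale=0.5, size=n)

# beta_LS = (x'x)^{-1} x'y as in (2); lstsq is the stable way to compute it
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)
```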

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then βLS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mk, k > 1, of M sums to zero (i.e., Mk is a contrast), so (βLS)k = βk + Mkε and Mk(ε + c) = Mkε for any constant c, so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβLS) = (1 − R²)s²(y), where s(·) denotes the sample standard deviation and R is the multiple correlation, defined as the correlation between y and the fitted values xβLS. Yet another algebraic identity, this one involving


an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(βLS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane in 1983 advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

βML = (xtrΣ⁻¹x)⁻¹xtrΣ⁻¹y = β + (xtrΣ⁻¹x)⁻¹xtrΣ⁻¹ε    (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
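A matching sketch of the generalized least squares solution (5), with an assumed AR(1)-type covariance known up to a multiple:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
idx = np.arange(n)
Sigma = 0.6 ** np.abs(np.subtract.outer(idx, idx))    # assumed covariance
eps = np.linalg.cholesky(Sigma) @ rng.normal(size=n)  # correlated errors
y = X @ np.array([1.0, 2.0]) + eps

Si = np.linalg.inv(Sigma)
# beta_GLS = (x' Sigma^{-1} x)^{-1} x' Sigma^{-1} y, as in (5)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(beta_gls)
```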

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T;  m(t) = Ey(t) = ∑i βi wi(t), t ∈ T,

where the wi(·) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normally Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y ∣ x) = Ey + E((y − Ey) ∣ x) = Ey + xβ, for every x,

and the discrepancies y − E(y ∣ x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.
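A quick Monte Carlo sketch of this motivation, using an assumed illustrative 2 × 2 covariance matrix: for jointly normal (x, y), the population regression of y on x is linear with slope Σxy/Σxx, and the discrepancies are homoscedastic:

import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 3.0]])              # cov of (x, y), chosen for illustration
xy = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
x, y = xy[:, 0], xy[:, 1]

beta_pop = Sigma[0, 1] / Sigma[0, 0]        # population slope, 0.6
beta_hat = np.polyfit(x, y, 1)[0]           # sample least squares slope
print(beta_pop, beta_hat)                   # agree to Monte Carlo error

# discrepancies have the same variance whatever the x-band (homoscedastic):
band = (x > 1.0) & (x < 1.2)
print(np.var(y[band] - beta_pop * x[band]),
      Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0])   # both near 2.28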

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models.
Errors model: Model () above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the εi are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from i.i.d. normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).
Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.
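Such an exercise might look as follows: a sketch (synthetic data standing in for Galton's) of the two bootstrap schemes, resampling residuals for the errors model and resampling (w, y) pairs for the sampling model:

import numpy as np

rng = np.random.default_rng(3)
n = 60
w = rng.normal(size=n)
y = 1.0 + 0.5 * w + rng.standard_t(df=5, size=n)   # the fitted line need not be "true"

def slope(w, y):
    return np.polyfit(w, y, 1)[0]

fit = np.polyfit(w, y, 1)
resid = y - np.polyval(fit, w)

boot_err, boot_samp = [], []
for _ in range(2000):
    # errors model: keep the design fixed, resample residuals
    y_star = np.polyval(fit, w) + rng.choice(resid, size=n, replace=True)
    boot_err.append(slope(w, y_star))
    # sampling model: resample (w, y) pairs
    idx = rng.integers(0, n, size=n)
    boot_samp.append(slope(w[idx], y[idx]))

for name, bs in [("errors", boot_err), ("sampling", boot_samp)]:
    lo, hi = np.percentile(bs, [2.5, 97.5])
    print(f"{name} model 95% CI for slope: ({lo:.3f}, {hi:.3f})")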

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.
A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.
One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk). New results on data compression (Candès and Tao) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.
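A small simulation sketch of the over-fitting phenomenon (sizes and polynomial degrees are arbitrary illustrative choices): the elaborate model hugs the sample but predicts fresh data worse:

import numpy as np

rng = np.random.default_rng(4)
n = 20
w = rng.uniform(-2, 2, size=n)
truth = lambda w: 1.0 + 0.5 * w
y = truth(w) + rng.normal(scale=1.0, size=n)

w_new = rng.uniform(-2, 2, size=1000)               # fresh data, same design
y_new = truth(w_new) + rng.normal(scale=1.0, size=1000)

for degree in (1, 9):
    coefs = np.polyfit(w, y, degree)
    mse_fit = np.mean((y - np.polyval(coefs, w)) ** 2)
    mse_new = np.mean((y_new - np.polyval(coefs, w_new)) ** 2)
    print(f"degree {degree}: train MSE {mse_fit:.2f}, fresh-data MSE {mse_new:.2f}")
# The degree-9 fit has a smaller train MSE but a larger fresh-data MSE.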

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y ∣ X > c) < E(X ∣ X > c). Termed reversion to mediocrity, or beyond, by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y 1(X > c) = E(Y 1(Y > c) + YJ) = E X 1(X > c) + E YJ
             < E X 1(X > c) + c EJ = E X 1(X > c),

yielding E(Y ∣ X > c) < E(X ∣ X > c). ◻
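A Monte Carlo sketch of the proposition, taking X, Y exchangeable bivariate normal (an assumed example; the proposition itself requires much less):

import numpy as np

rng = np.random.default_rng(5)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]
X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

c = 1.0                                    # c > max(0, EX)
top = X > c
print(Y[top].mean(), X[top].mean())        # E(Y|X>c) < E(X|X>c): about 0.92 vs 1.53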

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat ():–
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat ():–
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat ():–
Freedman D () Bootstrapping regression models. Ann Stat :–


Freedman D () Statistical models and shoe leather. Sociol Methodol :–
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat :–
Galton F () Typical laws of heredity. Nature :–, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland :–
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol , 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal ():–
Parzen E () An approach to time series analysis. Ann Math Stat ():–
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat :–
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, vol . Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, …, xn) is a sample from a stochastic process x = {x1, x2, …}. Let pn(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ Rᵏ is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n), θn)/pn(x(n), θ)], where θn = θ + n^(−1/2) h and h is a (k × 1) vector. The joint density pn(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^(−1/2) hᵗ Sn(θ) − (1/2) n^(−1) hᵗ Jn(θ) h + op(1),  ()

where Sn(θ) = ∂ ln pn(x(n), θ)/∂θ, Jn(θ) = −∂² ln pn(x(n), θ)/∂θ ∂θᵗ, and

(i) n^(−1/2) Sn(θ) →d Nk(0, F(θ)),  (ii) n^(−1) Jn(θ) →p F(θ),  ()

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.
For the LAN family defined by () and (), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂n − θ) →d Nk(0, F⁻¹(θ)).  ()

A large class of models involving classical i.i.d. (independent and identically distributed) observations is covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.
If the limiting Fisher information matrix F(θ) is non-degenerate and random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in () will be a mixture of normals (and hence non-normal). If Λn satisfies () and () with F(θ) random, the density pn(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.
For the LAMN family, one can replace the norm √n by a random norm Jn^(1/2)(θ) to get a limiting normal distribution, viz.

Jn^(1/2)(θ)(θ̂n − θ) →d N(0, I),  ()

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1. Variance mixture of normals.
Suppose, conditionally on V = v, (x1, x2, …, xn) are i.i.d. N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n), θ) ∝ [1 + (1/2) Σᵢ (xi − θ)²]^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2. Autoregressive process.
Consider a first-order autoregressive process xt defined by xt = θ x(t−1) + et, t = 1, 2, …, with x0 = 0, where the et are assumed to be i.i.d. N(0, 1) random variables. We then have

pn(x(n), θ) ∝ exp[−(1/2) Σt (xt − θ x(t−1))²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂n = (Σt xt x(t−1)) (Σt x(t−1)²)⁻¹.

One can verify that √n(θ̂n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹ θⁿ (θ̂n − θ) →d Cauchy for ∣θ∣ > 1.
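A simulation sketch of Example 2 (sample sizes and θ values are illustrative), contrasting the stationary and explosive cases:

import numpy as np

rng = np.random.default_rng(6)

def ar1_mle(theta, n):
    x = np.zeros(n + 1)
    for t in range(1, n + 1):                       # x_t = theta x_{t-1} + e_t
        x[t] = theta * x[t - 1] + rng.normal()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

n, reps = 200, 5000
for theta in (0.5, 1.2):
    est = np.array([ar1_mle(theta, n) for _ in range(reps)])
    if abs(theta) < 1:
        z = np.sqrt(n) * (est - theta)              # approx N(0, 1 - theta^2)
        print(theta, z.std(), np.sqrt(1 - theta**2))
    else:
        z = theta**n * (est - theta) / (theta**2 - 1)   # approx standard Cauchy
        print(theta, np.median(np.abs(z)))          # median |Cauchy| = 1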

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

fX(x ∣ a, b) = dFX(x ∣ a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

fX(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μY and standard deviation σY, then the mean of X is E(X) = a + b μY, and the standard deviation of X is √Var(X) = b σY.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

● Arc-sine distribution
● Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U–shaped or the power–function distributions
● Cauchy and half–Cauchy distributions
● Special cases of the χ–distribution, like the half–normal, the Rayleigh and the Maxwell–Boltzmann distributions
● Ordinary and raised cosine distributions
● Exponential and reflected exponential distributions
● Extreme value distribution of the maximum and the minimum, each of type I
● Hyperbolic secant distribution
● Laplace distribution
● Logistic and half–logistic distributions
● Normal and half–normal distributions
● Parabolic distributions
● Rectangular or uniform distribution
● Semi–elliptical distribution
● Symmetric triangular distribution
● Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
● V–shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice, see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics Xi:n, X1:n ≤ X2:n ≤ ⋯ ≤ Xn:n, as regressand and the means of the reduced order statistics, αi:n = E(Yi:n), as regressor, which under these circumstances act as plotting positions. The regression model reads:

Xi:n = a + b αi:n + εi,

where εi is a random variable expressing the difference between Xi:n and its mean E(Xi:n) = a + b αi:n. As the order statistics Xi:n and – as a consequence – the disturbance terms εi are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd () – the generalized least squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X1:n, X2:n, …, Xn:n)′,  1 = (1, 1, …, 1)′,  α = (α1:n, α2:n, …, αn:n)′,
ε = (ε1, ε2, …, εn)′,  θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

The GLS estimator of θ, with Ω = B⁻¹, is

θ̂ = (A′ΩA)⁻¹ A′Ω x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ΩA)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yi:n) and E(Yi:n Yj:n). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
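As an illustration of the probability-plotting approach, here is a sketch for the normal family (assuming SciPy is available) that uses Blom's approximate plotting positions and ordinary least squares in place of the exact α and B, which would require the integrals mentioned above:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
a_true, b_true = 10.0, 2.0
x = np.sort(rng.normal(a_true, b_true, size=100))    # order statistics X_{i:n}

n = len(x)
i = np.arange(1, n + 1)
p = (i - 0.375) / (n + 0.25)           # Blom plotting positions for the normal
alpha = norm.ppf(p)                    # approximates E(Y_{i:n})

# least squares of X_{i:n} on alpha_{i:n}: intercept ~ a, slope ~ b
slope, intercept = np.polyfit(alpha, x, 1)
print(intercept, slope)                # near 10 and 2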

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat :–
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat :–
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc :–
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika :–
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference :–
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte/

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ()), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi),
logit(Pi) ∣ Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata and Genstat. Approximate moment-based estimation methods make use of the fact that if σ² is small then, as derived in Williams (),

E[Ri] = mi πi and
Var(Ri) = mi πi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
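A Monte Carlo sketch of this overdispersion, with assumed illustrative values of μ, σ and m, comparing the empirical variance of R with Williams' approximation and the plain binomial variance:

import numpy as np

rng = np.random.default_rng(8)
mu, sigma, m = -0.5, 0.4, 25
P = 1.0 / (1.0 + np.exp(-(mu + sigma * rng.normal(size=1_000_000))))
R = rng.binomial(m, P)

pi = 1.0 / (1.0 + np.exp(-mu))
williams = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(R.var(), williams, m * pi * (1 - pi))   # empirical, Williams, binomial
# The empirical variance exceeds the plain binomial variance: overdispersion.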

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B :–
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika :–


Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika :–
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat :–

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

f(yi; πi) = πi^yi (1 − πi)^(1−yi).  ()

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.
In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{(yi θi − b(θi))/α(ϕ) + c(yi; ϕ)},  ()

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.
We may structure the Bernoulli distribution () into exponential family form () as:

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}.  ()

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π; the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form directly derived from the pdf, the values expressed in () and the values we gave for the mean and variance are the values for the logistic model.
Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation


across observations during the estimation process, rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as:

L(μi; yi) = Σᵢ₌₁ⁿ {yi ln(μi/(1 − μi)) + ln(1 − μi)},  ()

or

L(μi; yi) = Σᵢ₌₁ⁿ {yi ln(μi) + (1 − yi) ln(1 − μi)}.  ()

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as:

D = 2 Σᵢ₌₁ⁿ {yi ln(yi/μi) + (1 − yi) ln((1 − yi)/(1 − μi))}.  ()

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (). However, for logistic models, the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore:

x′iβ = ln(μi/(1 − μi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn.  ()

The value of μi for each observation in the logistic model is calculated as:

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ)).  ()

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.
When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized as x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as:

∂L(β)/∂β = Σᵢ₌₁ⁿ (yi − μi)xi,  ()

∂²L(β)/∂β ∂β′ = −Σᵢ₌₁ⁿ xi x′i μi(1 − μi).  ()
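The gradient () and Hessian () lead directly to a Newton–Raphson fitting routine. A minimal sketch on synthetic data (illustrative only; not any package's implementation):

import numpy as np

rng = np.random.default_rng(9)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(3)
for _ in range(25):                       # Newton-Raphson iterations
    mu = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - mu)                 # sum_i (y_i - mu_i) x_i
    hess = -(X * (mu * (1 - mu))[:, None]).T @ X   # -sum_i x_i x_i' mu_i(1-mu_i)
    step = np.linalg.solve(hess, grad)
    beta = beta - step                    # beta_new = beta - H^{-1} g
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # from the observed information
print(beta, se)                           # estimates near beta_true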

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.
We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

              Age (child vs adult)
Survived      child      adults     Total
no                52       1438      1490
yes               57        654       711
Total            109       2092      2201

The odds of survival for adult passengers is 654/1438, or 0.455. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (654/1438)/(57/52), or 0.415. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived      Odds ratio     Std. err.     z     P>∣z∣     [95% conf. interval]
age                .415

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.
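The component odds and their ratio can be verified directly from the tabulated counts; a small hand-check (not software output from the entry):

died  = {"child": 52, "adult": 1438}
alive = {"child": 57, "adult": 654}

odds_adult = alive["adult"] / died["adult"]     # ~0.455
odds_child = alive["child"] / died["child"]     # ~1.096
print(odds_adult / odds_child)                  # odds ratio ~0.415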

By inverting the estimated odds ratio above, we may conclude that children had (∼1/0.41) some 2.4 – or nearly two and a half times – greater odds of surviving than did adults.
For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.
Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as:

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi − yi),  ()

with a corresponding log-likelihood function expressed as:

L(μi; yi, mi) = Σᵢ₌₁ⁿ {yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln(mi choose yi)}.  ()

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = mi πi and variance μi(1 − μi/mi).
Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the pattern of predictor values shown for x1, x2, and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology:

y       Odds ratio     OIM std. err.     z     P>∣z∣     [95% conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped; the new table would have ten observations, with the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a much smaller table of covariate patterns as described above. Data in tables is nearly always expressed in grouped format.
Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.
Information criteria tests, e.g., the Akaike Information Criterion (AIC; see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.
Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.
Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for


modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]², −∞ < x < ∞,  ()

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹, −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.
In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹, −∞ < x < ∞,
h(x) = (1/τ) [1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().
One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (), is

h(t) = (α/θ) (t/θ)^(α−1) / [1 + (t/θ)^α], t, θ, α > 0,

+ (tθ)α t θ α gt

where θ = emicro and α = τ For α le the hazard ismonotone decreasing and for α gt it has a single max-imum as for the lognormal distribution hazards of thisform may be appropriate in the analysis of data such as

heart transplant survival ndash there may be an initial periodof increasing hazard associated with rejection followed bydecreasing hazard as the patient survives the procedureand the transplanted organ is acceptedFor the standard logistic distribution (micro = τ = )

the probability density and the cumulative distributionfunctions are related through the very simple identity

f (x) = F(x) [ minus F(x)]

which, in turn, by elementary calculus implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  ()

and uniquely characterizes the standard logistic distribution. Equation () provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in () we take τX + μ.
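A sketch of this inversion method of simulation (μ and τ are illustrative):

import numpy as np

rng = np.random.default_rng(10)
mu, tau = 2.0, 0.5

U = rng.uniform(size=1_000_000)
X = mu + tau * np.log(U / (1 - U))     # general logistic via the identity above

print(X.mean(), mu)                    # mean ~ mu
print(X.var(), tau**2 * np.pi**2 / 3)  # variance ~ tau^2 pi^2 / 3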

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model:

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β0 + β1 d,

which, by the identity (), implies a logistic tolerance distribution with parameters μ = −β0/β1 and τ = 1/∣β1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc :–
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A :–

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.
Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.
The inequality can be of a different type, the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (). This index is the ratio


Lorenz Curve, Fig. 1: Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p). [Figure: two Lorenz curves, LX(p) and LY(p), below the diagonal of the unit square.]

Lorenz Curve, Fig. 2: Two intersecting Lorenz curves. Using the Gini index, L1(p) has greater inequality (G = ) than L2(p) (G = ). [Figure: two intersecting Lorenz curves, L1(p) and L2(p).]

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
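A minimal sketch computing empirical Lorenz ordinates and the Gini index from income data (lognormal incomes assumed purely for illustration):

import numpy as np

def lorenz_gini(incomes):
    x = np.sort(np.asarray(incomes, dtype=float))
    n = len(x)
    p = np.arange(1, n + 1) / n                    # population shares
    L = np.cumsum(x) / x.sum()                     # income shares: Lorenz ordinates
    i = np.arange(1, n + 1)
    gini = 2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1.0) / n
    return p, L, gini

rng = np.random.default_rng(11)
incomes = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)   # skewed, as is typical
p, L, G = lorenz_gini(incomes)
print(G)    # for a lognormal, G = 2*Phi(sigma/sqrt(2)) - 1, about 0.43 here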

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman; Jakobsson; Kakwani):

Theorem. Let u(x) be a continuous, monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μY/μX level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.
A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
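A small numerical sketch of the theorem, with an assumed illustrative progressive tax u(x) = x − 0.3 max(x − median, 0) (so u(x)/x is decreasing) and a flat tax (u(x)/x constant):

import numpy as np

rng = np.random.default_rng(12)
x = rng.lognormal(sigma=0.8, size=100_000)         # pre-tax incomes

def gini(v):
    v = np.sort(v)
    n = len(v)
    i = np.arange(1, n + 1)
    return 2 * np.sum(i * v) / (n * v.sum()) - (n + 1) / n

progressive = x - 0.3 * np.maximum(x - np.median(x), 0)   # tax 30% above the median
flat = 0.7 * x                                            # u(x)/x constant

print(gini(x), gini(progressive), gini(flat))
# gini(progressive) < gini(x) = gini(flat)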

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics since then. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica :–
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ :–
Jakobsson U () On the measurement of the degree of progression. J Publ Econ :–
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica :–
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser :–

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction, and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.
Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise, the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.
To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.
Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative set, Θ = Θ0 + Θ1. The usual loss function is then given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1, or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems, we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses:

ℓ(t) = t²/2 if ∣t∣ ≤ γ, and ℓ(t) = γ∣t∣ − γ²/2 if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems, Varian () introduced the LinEx losses:

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here, underestimating is judged exponentially and overestimating linearly.
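Sketches of the loss functions just described (γ, a, b are illustrative choices):

import numpy as np

def square(t):
    return t**2

def l1(t):
    return np.abs(t)

def huber(t, gamma=1.0):
    # quadratic near zero, linear in the tails (robust to outliers)
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t**2 / 2.0,
                    gamma * np.abs(t) - gamma**2 / 2.0)

def linex(t, a=1.0, b=1.0):
    # asymmetric: exponential on one side, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1.0)

t = np.linspace(-3, 3, 7)
print(square(t), l1(t), huber(t), linex(t))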

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a


scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to get no counter-intuitive results in practice; see Bischoff ().
In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.
In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn, observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation",

ie the function in F that minimizes sumni= ℓ (rfi ) f isin F

where r fi = yi minus f (ti) is the residual in the ith design pointHere ℓ is also called loss function and can be chosen asdescribed above For instance the least squares estimationis obtained if ℓ(t) = t
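The choice of ℓ matters: minimizing the summed square loss reproduces the sample mean, while the summed absolute loss leads to the sample median. A minimal grid-search sketch (Python, with a small hypothetical sample) makes this visible:

    import numpy as np

    y = np.array([0.8, 0.9, 1.1, 1.3, 5.0])     # hypothetical sample with one outlier
    grid = np.linspace(y.min(), y.max(), 4201)

    m_sq = grid[np.argmin([np.sum((y - m) ** 2) for m in grid])]
    m_l1 = grid[np.argmin([np.sum(np.abs(y - m)) for m in grid])]

    print(m_sq, np.mean(y))     # square-loss minimizer ~ mean (pulled by the outlier)
    print(m_l1, np.median(y))   # absolute-loss minimizer ~ median (robust)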

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis : –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



errors are correlated and that correlation depends upon the x values). What we observe are y_i and associated values of the independent variables x; that is, we observe (y_i, sin[t_i], sin[2t_i]), i ≤ n. The linear model on data may be expressed y = xβ + ε, with y = the column vector of y_i, i ≤ n, likewise for ε, and matrix x (the design matrix) whose entries are x_{ik} = x_k[t_i].

Terminology
Independent variables, as employed in this context, is misleading terminology. It derives from language used in connection with mathematical equations and does not refer to statistically independent random variables. Independent variables may be of any dimension, in some applications functions or surfaces. If y is not scalar-valued, the model is instead a multivariate linear regression. In some versions either x, β, or both may also be random and subject to statistical models. Do not confuse multivariate linear regression with multiple linear regression, which refers to a model having more than one non-constant independent variable.

General Remarks on Fitting Linear Regression Models to Data
Early work (the classical linear model) emphasized independent, identically distributed (iid) additive normal errors in linear regression, where 7least squares has particular advantages (connections with 7multivariate normal distributions are discussed below). In that setup, least squares would arguably be a principal method of fitting linear regression models to data, perhaps with modifications, such as Lasso or other constrained optimizations, that achieve reduced sampling variations of coefficient estimators while introducing bias (Efron et al. ). Absent a breakthrough enlarging the applicability of the classical linear model, other methods gain traction: Bayesian methods (7Markov chain Monte Carlo having enabled their calculation); non-parametric methods (good performance relative to more relaxed assumptions about errors); iteratively reweighted least squares (having, under some conditions, behavior like maximum likelihood estimators without knowing the precise form of the likelihood). The Dantzig selector is good news for dealing with far fewer observations than independent variables, when a relatively small fraction of them matter (Candès and Tao ).

Background
C.F. Gauss may have used least squares as early as 1795. In 1801 he was able to predict the apparent position at which asteroid Ceres would re-appear from behind the sun after it had been lost to view following discovery only days before. Gauss' prediction was well removed from all others, and he soon followed up with numerous other high-caliber successes, each achieved by fitting relatively simple models motivated by Kepler's laws, work at which he was very adept and quick. These were fits to imperfect, sometimes limited, yet fairly precise data. Legendre (1805) published a substantive account of least squares, following which the method became widely adopted in astronomy and other fields. See Stigler ().

By contrast, Galton (), working with what might today be described as "low correlation" data, discovered deep truths not already known by fitting a straight line. No theoretical model previously available had prepared Galton for these discoveries, which were made in a study of his own data: w = standard score of weight of parental sweet pea seed(s), y = standard score of seed weight(s) of their immediate issue. Each sweet pea seed has but one parent, and the distributions of w and y are the same. Working at a time when correlation and its role in regression were yet unknown, Galton found, to his astonishment, a nearly perfect straight line tracking points (parental seed weight w, median filial seed weight m(w)). Since for this data s_y ≈ s_w, this was the least squares line (also the regression line, since the data was bivariate normal), and its slope was r s_y/s_w = r (the correlation). Medians m(w) being essentially equal to means of y for each w greatly facilitated calculations, owing to his choice to select equal numbers of parent seeds at weights ±1, ±2, ±3 standard deviations from the mean of w. Galton gave the name co-relation (later correlation) to the slope of this line, and for a brief time thought it might be a universal constant. Although the correlation was small, this slope nonetheless gave measure to the previously vague principle of reversion (later regression, as when larger parental examples beget offspring typically not quite so large). Galton deduced the general principle that if 0 < r < 1, then for a value w > Ew the relation Ew < m(w) < w follows. Having sensibly selected equal numbers of parental seeds at intervals may have helped him observe that points (w, y) departed on each vertical from the regression line by statistically independent N(0, θ²) random residuals whose variance θ² > 0 was the same for all w. Of course this likewise amazed him, and in later work he identified all these properties as a consequence of bivariate normal observations (w, y) (Galton ).

Echoes of those long ago events reverberate today in our many statistical models, "driven", as we now proclaim, by random errors subject to ever broadening statistical modeling. In the beginning it was very close to truth.


Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we now have an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model, and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (, ).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed y = xβ + ε, an abbreviated matrix formulation of the system of equations, in which random errors ε are assumed to satisfy

Eε_i ≡ 0, Eε_i² ≡ σ² > 0, i ≤ n,

y_i = x_{i1}β_1 + ⋯ + x_{ip}β_p + ε_i, i ≤ n. (1)

The interpretation is that response y_i is observed for the ith sample in conjunction with numerical values (x_{i1}, …, x_{ip}) of the independent variables. If these errors ε_i are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients β_k, k ≤ p, are produced by ordinary least squares (LS) as follows:

β_ML = β_LS = (x^tr x)^{−1} x^tr y = β + Mε, (2)

for M = (x^tr x)^{−1} x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)^{−1} the matrix inverse. These coefficient estimates β_LS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(β_LS)_k = β_k, k ≤ p, (3)

and, among all unbiased estimators β*_k (of β_k) that are linear in y,

E((β_LS)_k − β_k)² ≤ E(β*_k − β_k)², for every k ≤ p. (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or more generally as limit distributions, when data are suitably enriched.
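A minimal numerical illustration of estimator (2), assuming simulated data (all names and constants below are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 3
    x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix, constant first column
    beta = np.array([2.0, 1.0, -0.5])
    eps = rng.normal(scale=0.3, size=n)
    y = x @ beta + eps

    M = np.linalg.inv(x.T @ x) @ x.T   # M = (x^tr x)^(-1) x^tr
    beta_ls = M @ y                    # equation (2): beta_LS = beta + M eps
    print(beta_ls)                     # close to (2.0, 1.0, -0.5)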

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then β_LS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct, or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row M_k, k > 1, of M satisfies M_k 1 = 0 (i.e., M_k is a contrast), so (β_LS)_k = β_k + M_k ε, and M_k(ε + c1) = M_k ε for any constant c; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβ_LS) = (1 − R²) s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβ_LS. Yet another algebraic identity, this one involving


an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β_LS − β), conditional on the 7order statistics of ε (see LePage and Podgórski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

β_ML = (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} y = β + (x^tr Σ^{−1} x)^{−1} x^tr Σ^{−1} ε. (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
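A companion sketch of solution (5), assuming for illustration that Σ has a simple autoregressive-type correlation structure (again, simulated data and illustrative constants):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta = np.array([1.0, 2.0])

    idx = np.arange(n)
    Sigma = 0.6 ** np.abs(np.subtract.outer(idx, idx))   # assumed correlation structure
    eps = np.linalg.cholesky(Sigma) @ rng.normal(size=n)
    y = x @ beta + eps

    Si = np.linalg.inv(Sigma)
    beta_gls = np.linalg.inv(x.T @ Si @ x) @ x.T @ Si @ y   # equation (5)
    print(beta_gls)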

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σ_i β_i w_i(t), t ∈ T,

where the w_i(⋅) are known, linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context, and solves, the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + (x − Ex)β for every x,

and the discrepancies y − E(y∣x) are for each x distributed N(0, σ²), for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:
Errors model: Model (1) above.
Sampling model: Data (x_{i1}, …, x_{ip}, y_i), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the ε_i are simply the residual discrepancies y − xβ_LS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this, if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).
Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables, in order to assure a close fit of the model to the data, is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.
A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy, by complicating the model, risked overfitting owing to the limited number of observations available.
One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + c EJ = E X1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss–Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat (): –
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat (): –
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat (): –
Freedman D () Bootstrapping regression models. Ann Stat : –
Freedman D () Statistical models and shoe leather. Sociol Methodol : –
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat : –
Galton F () Typical laws of heredity. Nature : –, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropological Inst Great Britain and Ireland : –
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgórski K () Resampling permutations in regression without second moments. J Multivariate Anal (): –
Parzen E () An approach to time series analysis. Ann Math Stat (): –
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean more universal than regression toward the mean. Am Stat : –
Stigler S () The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, vol . Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x_1, …, x_n) is a sample from a stochastic process x = {x_1, x_2, …}. Let p_n(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λ_n = ln[p_n(x(n), θ_n)/p_n(x(n), θ)], where θ_n = θ + n^{−1/2} h and h is a (k × 1) vector. The joint density p_n(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} h^t J_n(θ) h + o_p(1), (1)

where S_n(θ) = d ln p_n(x(n), θ)/dθ, J_n(θ) = −d² ln p_n(x(n), θ)/dθ dθ^t, and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)), (ii) n^{−1} J_n(θ) →p F(θ), (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F^{−1}(θ)). (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and 7Markov processes also are included in the LAN family.
If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.
For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I), (4)

where I is the identity matrix. Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose that, conditionally on V = v, (x_1, x_2, …, x_n) are iid N(θ, v^{−1}) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

p_n(x(n), θ) ∝ [1 + (1/2) Σ_{i=1}^n (x_i − θ)²]^{−(n/2 + 1)}.

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂_n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.
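The t(2) limit is easy to check by simulation, exploiting the fact that, given V, x̄ is N(θ, 1/(nV)). In the sketch below (Python, illustrative constants), the tail frequency of √n(x̄ − θ) is compared with the t(2) tail probability:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.0, 50, 200000
    V = rng.exponential(scale=1.0, size=reps)                # V ~ Exp(1)
    xbar = theta + rng.normal(size=reps) / np.sqrt(n * V)    # xbar | V ~ N(theta, 1/(nV))
    z = np.sqrt(n) * (xbar - theta)
    print(np.mean(np.abs(z) > 4.303))   # ~ 0.05; 4.303 is the two-sided 5% point of t(2)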

Example 2: Autoregressive process

Consider a first-order autoregressive process {x_t} defined by x_t = θ x_{t−1} + e_t, t = 1, 2, …, with x_0 = 0, where the e_t are assumed to be iid N(0, 1) random variables. We then have

p_n(x(n), θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θ x_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1 the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂_n = (Σ_t x_t x_{t−1}) (Σ_t x_{t−1}²)^{−1}.

One can verify that √n(θ̂_n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1} θ^n (θ̂_n − θ) →d Cauchy for ∣θ∣ > 1.
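A brief simulation sketch of the stationary case (Python, illustrative constants) computes the ML estimator above from a generated path:

    import numpy as np

    rng = np.random.default_rng(0)
    n, theta = 1000, 0.8                 # |theta| < 1: stationary (LAN) case
    e = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = theta * x[t - 1] + e[t]   # x_t = theta x_{t-1} + e_t, x_0 = 0

    theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    print(theta_hat)                     # near 0.8; error of order sqrt((1 - theta^2)/n)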

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected Member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
7Asymptotic Normality
7Optimal Statistical Inference in Financial Engineering
7Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1, we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:
- Arc-sine distribution
- Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U–shaped, or the power–function distributions
- Cauchy and half–Cauchy distributions
- Special cases of the χ–distribution, like the half–normal, the Rayleigh, and the Maxwell–Boltzmann distributions
- Ordinary and raised cosine distributions
- Exponential and reflected exponential distributions
- Extreme value distribution of the maximum and the minimum, each of type I
- Hyperbolic secant distribution
- Laplace distribution
- Logistic and half–logistic distributions
- Normal and half–normal distributions
- Parabolic distributions
- Rectangular or uniform distribution
- Semi–elliptical distribution
- Symmetric triangular distribution
- Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
- V–shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (, ), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n} as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean, E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′,  1 = (1, 1, …, 1)′,
α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,  ε = (ε_1, ε_2, …, ε_n)′,
θ = (a, b)′,  A = (1, α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B.

Writing Ω = B^{−1}, the GLS estimator of θ is

θ̂ = (A′ Ω A)^{−1} A′ Ω x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ Ω A)^{−1}.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
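As an illustration, for the rectangular (uniform) case the required moments have well-known closed forms, α_{i:n} = i/(n + 1) and Cov(Y_{i:n}, Y_{j:n}) = i(n + 1 − j)/((n + 1)²(n + 2)) for i ≤ j, so the GLS estimator above can be coded directly. The sketch below (Python, simulated sample with a = 2, b = 3; all names are illustrative) is one such implementation:

    import numpy as np

    def lloyd_blue_uniform(x_sorted):
        # BLUE of (a, b) for X = a + b*Y, Y ~ Uniform(0, 1)
        n = len(x_sorted)
        i = np.arange(1, n + 1)
        alpha = i / (n + 1)                               # E(Y_{i:n})
        I, J = np.meshgrid(i, i, indexing="ij")
        lo, hi = np.minimum(I, J), np.maximum(I, J)
        B = lo * (n + 1 - hi) / ((n + 1) ** 2 * (n + 2))  # Cov(Y_{i:n}, Y_{j:n})
        A = np.column_stack([np.ones(n), alpha])
        Om = np.linalg.inv(B)
        return np.linalg.solve(A.T @ Om @ A, A.T @ Om @ x_sorted)  # (a_hat, b_hat)

    rng = np.random.default_rng(0)
    sample = np.sort(2.0 + 3.0 * rng.uniform(size=20))    # a = 2, b = 3
    print(lloyd_blue_uniform(sample))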

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and then joined Volkswagen AG to do operations research. He was later appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected Member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
7Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat : –
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat : –
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc : –
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika : –
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference : –
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the 7beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i),
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ,
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (),

E[R_i] = m_i π_i and Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability. The random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
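A short simulation sketch (Python; the values m = 20, σ = 0.5, η = 0 are illustrative) shows the overdispersion and checks Williams' variance approximation:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, sigma, eta = 20, 100000, 0.5, 0.0
    logit_p = eta + sigma * rng.normal(size=n)     # logit(P_i) ~ N(eta, sigma^2)
    p = 1 / (1 + np.exp(-logit_p))
    r = rng.binomial(m, p)

    pi = 1 / (1 + np.exp(-eta))
    v_binom = m * pi * (1 - pi)                    # plain binomial variance
    v_approx = v_binom * (1 + sigma**2 * (m - 1) * pi * (1 - pi))  # Williams' approximation
    print(r.var(), v_binom, v_approx)              # empirical variance exceeds the binomial value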

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B : –
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika : –
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika : –
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat : –

Logistic Regression

Joseph M. Hilbe
Emeritus Professor
University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics
Arizona State University, Tempe, AZ, USA
Solar System Ambassador
California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1−y_i}. (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.
In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{(y_i θ_i − b(θ_i))/α(ϕ) + c(y_i; ϕ)}, (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b′′(θ). Taken together, the above four terms define a specific GLM model.
We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}. (3)

The link function is therefore ln(π/(1 − π)) and the cumulant −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.
Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)}, (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}. (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln((1 − y_i)/(1 − μ_i))}. (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n. (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)). (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.
When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized as x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i, (9)

∂²L(β)/∂β ∂β′ = −Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i). (10)
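Equations (8)–(10) translate directly into a Newton–Raphson iteration. The sketch below (Python/NumPy, simulated data; all names and constants are illustrative, and this is not the estimation code of any particular package) is a minimal implementation:

    import numpy as np

    def logistic_newton_raphson(x, y, tol=1e-8, max_iter=25):
        # Solve the score equation (9), using the Hessian (10); x includes a constant column
        beta = np.zeros(x.shape[1])
        for _ in range(max_iter):
            mu = 1 / (1 + np.exp(-x @ beta))          # inverse link, equation (8)
            grad = x.T @ (y - mu)                     # gradient, equation (9)
            hess = -(x.T * (mu * (1 - mu))) @ x       # Hessian, equation (10)
            step = np.linalg.solve(hess, grad)
            beta -= step                              # Newton update
            if np.max(np.abs(step)) < tol:
                break
        return beta

    rng = np.random.default_rng(0)
    n = 2000
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    true_beta = np.array([-0.5, 1.2])
    y = rng.binomial(1, 1 / (1 + np.exp(-x @ true_beta)))
    print(logistic_newton_raphson(x, y))              # near (-0.5, 1.2)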

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio. We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived child adults Total

no

yes

Total

The odds of survival for adult passengers is the number of surviving adults divided by the number of adults who did not survive; the odds of survival for children is computed in the same way. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.
For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
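The odds-ratio arithmetic itself is elementary; in the sketch below (Python) the 2 × 2 counts are purely hypothetical, not the Titanic figures:

    # hypothetical 2 x 2 table: rows = survived yes/no, columns = adult/child
    surv_adult, died_adult = 650, 1450
    surv_child, died_child = 60, 50

    odds_adult = surv_adult / died_adult      # odds of survival for adults
    odds_child = surv_child / died_child      # odds of survival for children
    print(odds_adult / odds_child)            # odds ratio, adults vs children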

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.
Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (1s) for a specific covariate pattern, and the denominator, m, the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = C(m_i, y_i) π_i^{y_i} (1 − π_i)^{m_i − y_i}, (11)

where C(m_i, y_i) denotes the binomial coefficient,

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln C(m_i, y_i)}. (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).
Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having a given pattern of values of x1, x2, and x3. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described, and modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many observations being grouped into a far smaller number of covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.
Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.
Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with


appropriate caveats. However, it appears well established that m-asymptotic residual analyses are most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.
Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.
Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track and field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▶Case-Control Studies
▶Categorical Data Analysis
▶Generalized Linear Models
▶Multivariate Data Analysis: An Overview
▶Probit Analysis
▶Recursive Partitioning
▶Regression Models with Symmetrical Errors
▶Robust Regression Estimation in Generalized Linear Models
▶Statistics: An Overview
▶Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for


modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic

distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]⁻¹,  −∞ < x < ∞,

h(x) = 1 / (τ [1 + exp{−(x − μ)/τ}]).

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],  t ≥ 0,  θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e[F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
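A minimal NumPy sketch of this inverse-cdf simulation (the function name and defaults are our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def rlogistic(n, mu=0.0, tau=1.0):
    """Simulate from the logistic(mu, tau) distribution by inverting
    the cdf: X = mu + tau * log(U / (1 - U)) with U ~ U(0, 1)."""
    u = rng.uniform(size=n)
    return mu + tau * np.log(u / (1.0 - u))

x = rlogistic(100_000)
print(x.mean(), x.var())   # approximately 0 and pi**2 / 3 = 3.29
```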

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ▶probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/∣β₁∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ▶generalized linear models. A comprehensive treatment of ▶logistic regression, including models and applications, is given in Agresti () and Hilbe ().
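To make the algorithm concrete, here is a minimal NumPy sketch of iteratively reweighted least squares (Fisher scoring) for the grouped binomial logit model; the function name and convergence settings are our own choices.

```python
import numpy as np

def logit_irls(X, y, m, n_iter=25, tol=1e-10):
    """Fit a binomial logit model by iteratively reweighted least
    squares: y successes out of m trials, design matrix X."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        w = m * pi * (1.0 - pi)              # iterative weights
        z = eta + (y - m * pi) / w           # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X),
                                   X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```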

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▶Asymptotic Relative Efficiency in Estimation
▶Asymptotic Relative Efficiency in Testing
▶Bivariate Distributions
▶Location-Scale Distributions
▶Logistic Regression
▶Multivariate Statistical Distributions
▶Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

"Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

"A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant."

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type, the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio


Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves. Using the Gini index, L1(p) has greater inequality (G = …) than L2(p) (G = …)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
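For a sample of incomes, the empirical Lorenz curve and Gini index are easy to compute; a minimal NumPy sketch (function name and the lognormal example are our own choices):

```python
import numpy as np

def lorenz_gini(income):
    """Empirical Lorenz curve ordinates and Gini index of a sample."""
    x = np.sort(np.asarray(income, dtype=float))
    n = x.size
    p = np.arange(n + 1) / n                            # population shares
    L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])  # income shares
    i = np.arange(1, n + 1)
    # Standard empirical Gini: twice the area between diagonal and curve
    gini = 2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1) / n
    return p, L, gini

rng = np.random.default_rng(0)
income = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
_, _, g = lorenz_gini(income)
print(g)   # close to 0.52, the theoretical value for this lognormal
```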

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem: Let u(x) be a continuous monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μY/μX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at Folkhälsan Institute of Genetics since then. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▶Econometrics
▶Economic Statistics
▶Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt-Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▶Decision Theory: An Introduction and ▶Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and therefore the value of the loss function is zero; otherwise, the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative set, Θ = Θ₀ + Θ₁. Then the usual loss function is given by

L(θ, ϑ) = 0  if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
L(θ, ϑ) = 1  if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∣⋅∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
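A compact sketch of the losses just described (the default values of γ, a, and b are illustrative choices, not from the text):

```python
import numpy as np

def squared(t):              # p = 2: classical square loss
    return t**2

def absolute(t):             # p = 1: the robust L1 loss
    return np.abs(t)

def huber(t, gamma=1.345):   # gamma is a conventional robust default
    return np.where(np.abs(t) <= gamma,
                    t**2,
                    2 * gamma * np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):  # asymmetric: exponential on one side,
    return b * (np.exp(a * t) - a * t - 1)   # linear on the other
```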

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ▶Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes

Σ_{i=1}^{n} ℓ(r_i^f),  f ∈ F,

where r_i^f = yi − f(ti) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, least squares estimation is obtained if ℓ(t) = t².

Cross References
▶Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▶Bayesian Statistics
▶Decision Theory: An Introduction
▶Decision Theory: An Overview
▶Entropy and Cross Entropy as Diversity and Distance Measures
▶Sequential Sampling
▶Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



Aspects of Linear Regression Models
Data of the real world seldom conform exactly to any deterministic mathematical model y = f(x, β), and through the device of incorporating random errors we now have an established paradigm for fitting models to data (x, y) by statistically estimating model parameters. In consequence we obtain methods for such purposes as predicting what will be the average response y to particular given inputs x, providing margins of error (and prediction error) for various quantities being estimated or predicted, tests of hypotheses, and the like. It is important to note in all this that more than one statistical model may apply to a given problem, the function f and the other model components differing among them. Two statistical models may disagree substantially in structure, and yet neither, either, or both may produce useful results. In this respect, statistical modeling is more a matter of how much we gain from using a statistical model and whether we trust and can agree upon the assumptions placed on the model, at least as a practical expedient. In some cases the regression function conforms precisely to underlying mathematical relationships, but that does not reflect the majority of statistical practice. It may be that a given statistical model, although far from being an underlying truth, confers advantage by capturing some portion of the variation of y vis-à-vis x. The method of principal components, which seeks to find relatively small numbers of linear combinations of the independent variables that together account for most of the variation of y, illustrates this point well. In one application, electromagnetic theory was used to generate by computer an elaborate database of theoretical responses of an induced electromagnetic field near a metal surface to various combinations of flaws in the metal. The role of principal components and linear modeling was to establish a simple model reflecting those findings, so that a portable device could be devised to make detections in real time based on the model.

If there is any weakness to the statistical approach, it lies in the fact that margins of error, statistical tests, and the like can be seriously incorrect even if the predictions afforded by a model have apparent value. Refer to Hinkelmann and Kempthorne (), Berk (), Freedman (, ).

Classical Linear Regression Model and Least Squares
The classical linear regression model may be expressed as y = xβ + ε, an abbreviated matrix formulation of the system of equations

yi = xi1 β1 + ⋯ + xip βp + εi,  i ≤ n,  (1)

in which the random errors ε are assumed to satisfy

Eεi ≡ 0,  Eεi² ≡ σ² > 0,  i ≤ n.

The interpretation is that response yi is observed for the ith sample in conjunction with numerical values (xi1, …, xip) of the independent variables. If these errors εi are jointly normally distributed (and therefore statistically independent, having been assumed to be uncorrelated), and if the matrix x^tr x is non-singular, then the maximum likelihood (ML) estimates of the model coefficients βk, k ≤ p, are produced by ordinary least squares (LS) as follows:

βML = βLS = (x^tr x)⁻¹ x^tr y = β + Mε  (2)

for M = (x^tr x)⁻¹ x^tr, with x^tr denoting the matrix transpose of x and (x^tr x)⁻¹ the matrix inverse. These coefficient estimates βLS are linear functions in y and satisfy the Gauss–Markov properties (3), (4):

E(βLS)k = βk,  k ≤ p,  (3)

and, among all unbiased estimators β*k (of βk) that are linear in y,

E((βLS)k − βk)² ≤ E(β*k − βk)² for every k ≤ p.  (4)

Least squares estimator (2) is frequently employed without the assumption of normality, owing to the fact that properties (3), (4) must hold in that case as well. Many statistical distributions (F, t, chi-square) have important roles in connection with model (1), either as exact distributions for quantities of interest (normality assumed) or, more generally, as limit distributions when data are suitably enriched.
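As a concrete illustration of (2), a short NumPy sketch on simulated data (all names and values are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for model (1): y = x beta + eps with iid errors.
n, p = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = x @ beta + rng.normal(scale=0.7, size=n)

# Least squares estimate (2): beta_LS = (x'x)^{-1} x'y,
# computed by solving the normal equations.
beta_ls = np.linalg.solve(x.T @ x, x.T @ y)
print(beta_ls)   # close to (1.0, 2.0, -0.5)
```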

Algebraic Properties of Least Squares
Setting all randomness assumptions aside, we may examine the algebraic properties of least squares. If y = xβ + ε, then βLS = My = M(xβ + ε) = β + Mε, as in (2). That is, the least squares estimate of model coefficients acts on xβ + ε, returning β plus the result of applying least squares to ε. This has nothing to do with the model being correct or ε being error, but is purely algebraic. If ε itself has the form xb + e, then Mε = b + Me. Another useful observation is that if x has first column identically one, as would typically be the case for a model with constant term, then each row Mk, k > 1, of M satisfies Mk 1 = 0 (i.e., Mk is a contrast), so (βLS)k = βk + Mkε and Mk(ε + c1) = Mkε; so ε may as well be assumed to be centered, for k > 1. There are many of these interesting algebraic properties, such as s²(y − xβLS) = (1 − R²)s²(y), where s²(⋅) denotes the sample variance and R is the multiple correlation, defined as the correlation between y and the fitted values xβLS. Yet another algebraic identity, this one involving an interplay of permutations with projections, is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(βLS − β), conditional on the ▶order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If errors ε are N(0, Σ) distributed for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of coefficients β are produced by a generalized least squares solution retaining properties (3), (4) (any positive multiple of Σ will produce the same result), given by

βML = (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ y = β + (x^tr Σ⁻¹ x)⁻¹ x^tr Σ⁻¹ ε.  (5)

Generalized least squares solution (5) retains properties (3), (4) even if normality is not assumed. It must not be confused with ▶generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.
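A sketch of (5) via Cholesky whitening, a standard numerically stable route (our implementation choice, not prescribed by the text):

```python
import numpy as np

def gls(x, y, sigma):
    """Generalized least squares (5): whiten by a Cholesky factor of
    Sigma, then apply ordinary least squares to the transformed data."""
    L = np.linalg.cholesky(sigma)
    xw = np.linalg.solve(L, x)      # L^{-1} x
    yw = np.linalg.solve(L, y)      # L^{-1} y
    # (xw' xw)^{-1} xw' yw equals (x' Sigma^{-1} x)^{-1} x' Sigma^{-1} y
    return np.linalg.solve(xw.T @ xw, xw.T @ yw)
```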

A very large body of work has been devoted to linear regression models and the closely related subject areas of experimental design, ▶analysis of variance, principal component analysis, and their consequent distribution theory.

Reproducing Kernel Generalized Linear Regression Model
Parzen () developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means, Eε(t) ≡ 0, and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t),  t ∈ T,
m(t) = Ey(t) = Σi βi wi(t),  t ∈ T,

where the wi(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties (3), (4). The RKHS setup has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models.

Errors model: Model (1) above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent a random sample from a finite population (e.g., an actual physical population).

In the sampling model, the εi are simply the residual discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±1, ±2, ±3 standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example, to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w, owing to the modest correlation, was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear. Going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff, between reliable estimation of a few model coefficients versus the risk that by doing so too much model-related material is left in the error term, is to include all of several hierarchically ordered layers of independent variables, more than may be needed, then remove those that the data suggests are not required to explain the greater share of the variation of y (Raudenbush and Bryk ). New results on data compression (Candès and Tao ) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition: Let random variables X, Y be identically distributed with finite mean EX, and fix any c > EX. If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof: For any given c > EX, define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y1(X > c) = E(Y1(Y > c) + YJ) = E X1(X > c) + E YJ < E X1(X > c) + c EJ = E X1(X > c),

yielding E(Y∣X > c) < E(X∣X > c). ◻
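A quick Monte Carlo illustration of the proposition, with X and Y independent standard normals (our choice of distribution and cutoff):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)   # same distribution as x
c = 1.0                          # cutoff above the mean

sel = x > c
print(x[sel].mean())   # E(X | X > c): about 1.53
print(y[sel].mean())   # E(Y | X > c): about 0, reversion beyond mediocrity
```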

Cross References
▶Adaptive Linear Regression
▶Analysis of Areal and Spatial Interaction Data
▶Business Forecasting Methods
▶Gauss–Markov Theorem
▶General Linear Models
▶Heteroscedasticity
▶Least Squares
▶Optimum Experimental Design
▶Partial Least Squares Regression Versus Other Methods
▶Regression Diagnostics
▶Regression Models with Increasing Numbers of Unknown Parameters
▶Ridge and Surrogate Ridge Regressions
▶Simple Linear Regression
▶Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat ():–
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat ():–
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat ():–
Freedman D () Bootstrapping regression models. Ann Stat :–


Freedman D () Statistical models and shoe leather. Sociol Methodol :–
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat :–
Galton F () Typical laws of heredity. Nature :–, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland :–
Hinkelmann K, Kempthorne O () Design and analysis of experiments, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, vol : Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal ():–
Parzen E () An approach to time series analysis. Ann Math Stat ():–
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat :–
Stigler S () The history of statistics: the measurement of uncertainty. Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science. Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, …, xn) is a sample from a stochastic process x = {x1, x2, …}. Let pn(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n); θn)/pn(x(n); θ)], where θn = θ + n^(−1/2) h and h is a (k × 1) vector. The joint density pn(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^(−1/2) hᵗ Sn(θ) − (1/2) n⁻¹ hᵗ Jn(θ) h + op(1),  (1)

where Sn(θ) = d ln pn(x(n); θ)/dθ, Jn(θ) = −d² ln pn(x(n); θ)/(dθ dθᵗ), and

(i) n^(−1/2) Sn(θ) →d Nk(0, F(θ)),  (ii) n⁻¹ Jn(θ) →p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂n − θ) →d Nk(0, F⁻¹(θ)).  (3)

A large class of models involving classical iid (independent and identically distributed) observations is covered by the LAN framework. Many time series models and ▶Markov processes also are included in the LAN family. If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm Jn^(1/2)(θ) to get the limiting normal distribution, viz.,

Jn^(1/2)(θ)(θ̂n − θ) →d N(0, I),  (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals. Suppose, conditionally on V = v, (x1, x2, …, xn) are iid N(θ, v⁻¹) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n); θ) ∝ [1 + (1/2) Σ_{i=1}^{n} (xi − θ)²]^(−(n/2 + 1)).

It can be verified that F(θ) is an exponential random variable with mean 1. The ML estimator is θ̂n = x̄, and √n(x̄ − θ) →d t(2). It is interesting to note that the variance of the limit distribution of x̄ is ∞.

Example 2: Autoregressive process. Consider a first-order autoregressive process {xt} defined by xt = θx(t−1) + et, t = 1, 2, …, with x0 = 0, where the et are assumed to be iid N(0, 1) random variables. We then have

pn(x(n); θ) ∝ exp[−(1/2) Σ_{t=1}^{n} (xt − θx(t−1))²].

For the stationary case, ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is

θ̂n = (Σ_{t=1}^{n} xt x(t−1)) (Σ_{t=1}^{n} x(t−1)²)⁻¹.

One can verify that √n(θ̂n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)⁻¹ θⁿ (θ̂n − θ) →d Cauchy for ∣θ∣ > 1.
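A small simulation sketch of Example 2 (parameter values and sample sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n):
    """Simulate x_t = theta * x_{t-1} + e_t, x_0 = 0, e_t ~ N(0, 1),
    and return the ML estimator sum(x_t x_{t-1}) / sum(x_{t-1}**2)."""
    e = rng.normal(size=n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

# Stationary case: root-n rate, normal limit
print(ar1_mle(0.5, 10_000))   # close to 0.5
# Explosive case: theta^n rate, Cauchy (mixed normal) limit
print(ar1_mle(1.05, 200))     # extremely close to 1.05
```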

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics and was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▶Asymptotic Normality
▶Optimal Statistical Inference in Financial Engineering
▶Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

FX(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b),  a ∈ R, b > 0,

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

fX(x ∣ a, b) = dFX(x ∣ a, b)/dx,

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

fX(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μY and standard deviation σY, then the mean of X is E(X) = a + b μY, and the standard deviation of X is

√Var(X) = b σY.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)–interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distribution of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution
For each of the above mentioned distributions we candesign a special probability paper Conventionally theabscissa is for the realization of the variate and the ordinatecalled the probability axis displays the values of the cumu-lative distribution function but its underlying scaling isaccording to the percentile function e ordinate valuebelonging to a given sample data on the abscissa is calledplotting position for its choice see Barnett ( )Blom () Kimball ()When the sample comes fromthe probability paperrsquos distribution the plotted data willrandomly scatter around a straight line thus we have agraphical goodness-t-testWhenwe t the straight line byeye we may read o estimates for a and b as the abscissa ordierence on the abscissa for certain percentiles A moreobjective method is to t a least-squares line to the dataand the estimates of a and b will be the the parameters ofthis line

e latter approach takes the order statisticsXin Xn leXn le le Xnn as regressand and the mean of thereduced order statistics αin = E(Yin) as regressor whichunder these circumstances acts as plotting position eregression model reads

Xin = a + b αin + εi

where εi is a random variable expressing the dierencebetweenXin and its mean E(Xin) = a+b αin As the orderstatistics Xin and ndash as a consequence ndash the disturbanceterms εi are neither homoscedastic nor uncorrelated wehave to use ndash according to Lloyd () ndash the general-least-squares method to nd best linear unbiased estimators ofa and b Introducing the following vectors and matrices

x =

⎜⎜⎜⎜⎜⎜⎜⎜⎜

Xn

Xn

Xnn

⎟⎟⎟⎟⎟⎟⎟⎟⎟

=

⎜⎜⎜⎜⎜⎜⎜⎜⎜

⎟⎟⎟⎟⎟⎟⎟⎟⎟

α =

⎜⎜⎜⎜⎜⎜⎜⎜⎜

αn

αn

αnn

⎟⎟⎟⎟⎟⎟⎟⎟⎟

ε =

⎜⎜⎜⎜⎜⎜⎜⎜⎜

ε

ε

εn

⎟⎟⎟⎟⎟⎟⎟⎟⎟

θ =⎛

⎜⎜

a

b

⎟⎟

A = ( α)

the regression model now reads

x = A θ + ε

with variancendashcovariance matrix

Var(x) = b B

e GLS estimator of θ is

θ = (AprimeΩA)minusAprimeΩ x

and its variancendashcovariance matrix reads

Var( θ ) = b (AprimeΩA)minus

e vector α and the matrix B are not always easy to ndFor only a few locationndashscale distributions like the expo-nential the reected exponential the extreme value thelogistic and the rectangular distributions we have closed-form expressions in all other cases we have to evalu-ate the integrals dening E(Yin) and E(Yin Yjn) Formore details on linear estimation and probability plot-ting for location-scale distributions and for distributionswhich can be transformed to locationndashscale type see Rinne() Maximum likelihood estimation for locationndashscaledistributions is treated by Mi ()

About the AuthorDr Horst Rinne is Professor Emeritus (since ) ofstatistics and econometrics In he was awarded theVenia legendi in statistics and econometrics by the Facultyof economics of Berlin Technical University From to he worked as a part-time lecturer of statistics atBerlin Polytechnic and at the Technical University of Han-nover In he joined Volkswagen AG to do operationsresearch In he was appointed full professor of statis-tics and econometrics at the Justus-Liebig University inGiessen where later on he was Dean of the faculty of eco-nomics and management science for three years He gotfurther appointments to the universities of Tuumlbingen andCologne and to Berlin Polytechnic He was a member ofthe board of the German Statistical Society and in chargeof editing its journal Allgemeines Statistisches Archiv nowAStA ndash Advances in Statistical Analysis from to He is Co-founder of the Econometric Board of theGermanSociety of Economics Since he is Elected memberof the International Statistical Institute He is Associateeditor for Quality Technology and Quantitative Manage-ment His scientic interests are wide-spread ranging fromnancial mathematics business and economics statisticstechnical statistics to econometrics He has written numer-ous papers in these elds and is author of several textbooksand monographs in statistics econometrics time seriesanalysis multivariate statistics and statistical quality con-trol including process capability He is the author of thetexte Weibull Distribution A Handbook (Chapman andHallCRC )

L Logistic Normal Distribution

Cross References7Statistical Distributions An Overview

References and Further ReadingBarnett V () Probability plotting methods and order statistics

Appl Stat ndashBarnett V () Convenient probability plotting positions for the

normal distribution Appl Stat ndashBlom G () Statistical estimates and transformed beta variables

Almqvist and Wiksell StockholmKimball BF () On the choice of plotting positions on probability

paper J Am Stat Assoc ndashLloyd EH () Least-squares estimation of location and scale

parameters using order statistics Biometrika ndashMi J () MLE of parameters of locationndashscale distributions for

complete and partially grouped data J Stat Planning Inferencendash

Rinne H () Location-scale distributions ndash linear estimation andprobability plotting httpgebuni-giessengebvolltexte

Logistic Normal Distribution

JohnHindeProfessor of StatisticsNational University of Ireland Galway Ireland

e logistic-normal distribution arises by assuming thatthe logit (or logistic transformation) of a proportion hasa normal distribution with an obvious extension to a vec-tor of proportions through taking a logistic transformationof a multivariate normal distribution see Aitchison andShen () In the univariate case this provides a family ofdistributions on ( ) that is distinct from the 7beta dis-tribution while the multivariate version is an alternativeto the Dirichlet distribution Note that in the multivariatecase there is no unique way to dene the set of logits forthe multinomial proportions (just as in multinomial logitmodels see Agresti ) and dierent formulations maybe appropriate in particular applications (Aitchison )e univariate distribution has been used oen implicitlyin random eects models for binary data and the multi-variate version was pioneered by Aitchison for statisticaldiagnosisdiscrimination (Aitchison and Begg ) theBayesian analysis of contingency tables and the analysis ofcompositional data (Aitchison )

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(μi, σ²), where μi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi)
logit(Pi) ∣ Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ
Z ∼ N(0, 1)

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (),

E[Ri] = mi πi and Var(Ri) = mi πi(1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
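As a sketch (parameter values and the helper name are assumed for illustration), the exact mean and variance of R under this model can be computed in Python by Gauss–Hermite quadrature over the normal random effect and compared with the Williams approximation:

import numpy as np

def logit_normal_moments(mu, sigma, m, n_quad=40):
    # Exact mean and variance of R ~ Binomial(m, P), logit(P) ~ N(mu, sigma^2),
    # by Gauss-Hermite quadrature over the normal random effect.
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    p = 1.0 / (1.0 + np.exp(-(mu + np.sqrt(2.0) * sigma * nodes)))  # P at nodes
    w = weights / np.sqrt(np.pi)                     # normalized quadrature weights
    ep, ep2 = np.sum(w * p), np.sum(w * p ** 2)      # E[P], E[P^2]
    mean = m * ep
    var = m * (ep - ep2) + m ** 2 * (ep2 - ep ** 2)  # E[Var(R|P)] + Var(E[R|P])
    return mean, var

mu, sigma, m = 0.5, 0.3, 10                          # assumed illustrative values
mean, var = logit_normal_moments(mu, sigma, m)
pi = 1.0 / (1.0 + np.exp(-mu))                       # pi with logit(pi) = mu
williams = m * pi * (1 - pi) * (1 + sigma ** 2 * (m - 1) * pi * (1 - pi))
print(mean, var, williams)                           # exact vs. Williams approximation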

About the Author
For biography see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B –
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika –
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika –
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat –

Logistic Regression

Joseph M. Hilbe
Emeritus Professor
University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics
Arizona State University, Tempe, AZ, USA
Solar System Ambassador
California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yi; πi) = πi^yi (1 − πi)^(1−yi)   (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, ϕ) = exp{(yi θi − b(θi))/α(ϕ) + c(yi; ϕ)}   (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model. We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}   (3)

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μi; yi) = Σ_{i=1}^n {yi ln(μi/(1 − μi)) + ln(1 − μi)}   (4)

or

L(μi; yi) = Σ_{i=1}^n {yi ln(μi) + (1 − yi) ln(1 − μi)}   (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {yi ln(yi/μi) + (1 − yi) ln((1 − yi)/(1 − μi))}   (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μi/(1 − μi)). We have, therefore,

x′iβ = ln(μi/(1 − μi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn   (7)

The value of μi for each observation in the logistic model is calculated as

μi = 1/(1 + exp(−x′iβ)) = exp(x′iβ)/(1 + exp(x′iβ))   (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (yi − μi) xi   (9)

∂²L(β)/∂β∂β′ = −Σ_{i=1}^n xix′i μi(1 − μi)   (10)
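The gradient (9) and Hessian (10) translate directly into a Newton–Raphson fit; the following is a minimal Python sketch, not a reference implementation, with illustrative convergence settings:

import numpy as np

def logistic_mle(X, y, tol=1e-8, max_iter=25):
    # Newton-Raphson fit of the binary logistic model.
    # X: (n, p) design matrix (include a column of ones for the intercept);
    # y: (n,) vector of 0/1 responses.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))           # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                            # gradient, Eq. (9)
        hess = -(X * (mu * (1.0 - mu))[:, None]).T @ X   # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                               # Newton update
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))          # std. errors from the Hessian
    return beta, se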

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived     child     adults     Total
no
yes
Total

The odds of survival for adult passengers is , or . The odds of survival for children is , or . The ratio of the odds of survival for adults to the odds of survival for children is ()/(), or . This value is referred to as the odds ratio, or the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived     Odds Ratio     Std. Err.     z     P>∣z∣     [ Conf. Interval]

age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had some – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as , we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
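A short Python sketch shows how the tabulated odds ratio and the exponentiated age coefficient coincide; the 2×2 counts below are assumptions for illustration only, since the entry's counts are not reproduced here:

import numpy as np

# Assumed 2x2 survival counts (illustration only).
adult = {"yes": 650, "no": 1450}
child = {"yes": 60, "no": 50}

odds_adult = adult["yes"] / adult["no"]   # odds of survival, adults
odds_child = child["yes"] / child["no"]   # odds of survival, children
OR = odds_adult / odds_child              # odds ratio, adult vs. child
print(OR, 1.0 / OR)                       # and its inverse, child vs. adult

# A logistic regression of survival on age (1 = adult, 0 = child) returns
# the same quantity as the exponentiated age coefficient: exp(beta_age) = OR.
beta_age = np.log(OR)
print(np.exp(beta_age))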

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (1s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = (mi choose yi) πi^yi (1 − πi)^(mi−yi)   (11)

with a corresponding log-likelihood function expressed as

L(μi; yi, mi) = Σ_{i=1}^n {yi ln(μi/(1 − μi)) + mi ln(1 − μi) + ln(mi choose yi)}   (12)

Taking derivatives of the cumulant, −mi ln(1 − μi), as we did for the binary response model, produces a mean of μi = mi πi and variance μi(1 − μi/mi). Consider the data below:

y     cases     x1     x2     x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of x1 = , x2 = , and x3 = . Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.
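A sketch of such a fit, maximizing the binomial log-likelihood (12) by Fisher scoring, follows; the design matrix and response values are assumed for illustration, since the table's entries are not reproduced here:

import numpy as np

def grouped_logistic(X, y, m, tol=1e-8, max_iter=25):
    # Fisher scoring for the binomial (grouped) logistic model:
    # y successes out of m trials for each covariate pattern.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = m / (1.0 + np.exp(-(X @ beta)))   # fitted successes m_i * pi_i
        grad = X.T @ (y - mu)
        W = mu * (1.0 - mu / m)                # binomial variance mu(1 - mu/m)
        step = np.linalg.solve((X * W[:, None]).T @ X, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Assumed values (illustration only): intercept plus three binary predictors.
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1]], dtype=float)
y = np.array([1.0, 1.0, 2.0, 1.0])             # successes per pattern
m = np.array([3.0, 2.0, 3.0, 2.0])             # cases per pattern
print(grouped_logistic(X, y, m))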

y     Odds ratio     OIM std. err.     z     P>∣z∣     [ conf. interval]

x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example,  observations being grouped into – rows or observations as above described. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some

of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data are classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., Akaike Information Cri-

teria (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and Bayesian Information Criteria (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recom-

mended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include cat-

egorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently

been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association (–). Hilbe is author of Negative Binomial Regression (, Cambridge University Press) and Logistic Regression Models (, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (, , Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for  years, from –. He was founding editor of the Stata Technical Bulletin () and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, in , two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for


modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al ().

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,   (1)

and the cumulative distribution function is

F(x) = [1 + exp{−(x − μ)/τ}]^−1,  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic

distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = [1 + exp{(x − μ)/τ}]^−1,  −∞ < x < ∞,
h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]^−1.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al ().

One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],  t ≥ 0; θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as

heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1),

the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which, in turn, by elementary calculus implies that

logit(F(x)) = loge[F(x)/(1 − F(x))] = x,   (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = loge[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
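A small Python sketch of this simulation method, with illustrative parameter values:

import numpy as np

rng = np.random.default_rng(1)

def rlogistic(n, mu=0.0, tau=1.0):
    # Equation (2): X = log(U/(1-U)) is standard logistic for U ~ U(0,1);
    # mu + tau*X then follows the general logistic distribution in (1).
    u = rng.uniform(size=n)
    return mu + tau * np.log(u / (1.0 - u))

x = rlogistic(100_000, mu=2.0, tau=1.5)
print(x.mean())                                  # close to mu = 2.0
print(x.var(), 1.5 ** 2 * np.pi ** 2 / 3.0)      # sample vs. theoretical variance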

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = loge[P(d)/(1 − P(d))] = β0 + β1 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β0/β1 and τ = 1/∣β1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of


Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association (–), the Statistical Modelling Society (–), and the European Regional Section of the International Association of Statistical Computing (–). He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region (). He is an Elected member of the International Statistical Institute (). He has authored or coauthored over  papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, ). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θ, x), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If

LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (). This index is the ratio


Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves. By the Gini index, one curve has greater inequality (G = ) than the other (G = )

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
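These definitions translate directly into computation; the following Python sketch (the function name and lognormal test sample are illustrative assumptions) computes the empirical Lorenz curve and the Gini index:

import numpy as np

def lorenz_gini(income):
    # Empirical Lorenz curve and Gini index for a sample of incomes.
    x = np.sort(np.asarray(income, dtype=float))
    n = x.size
    p = np.concatenate(([0.0], np.arange(1, n + 1) / n))  # population shares
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))   # income shares
    # Gini: one minus twice the (trapezoidal) area under the Lorenz curve.
    area = np.sum((L[1:] + L[:-1]) * np.diff(p)) / 2.0
    return p, L, 1.0 - 2.0 * area

incomes = np.random.default_rng(0).lognormal(mean=0.0, sigma=0.8, size=10_000)
p, L, gini = lorenz_gini(incomes)
print(round(gini, 3))  # for a lognormal, G = 2*Phi(sigma/sqrt(2)) - 1, about 0.43 here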

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let u(x) be a continuous monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and:

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μY/μX level once from above.

If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) in , with the thesis On the Allocation of Linear Observations, and in addition he has published about  printed articles. He served as professor in statistics at Hanken School of Economics (–) and has been a scientist at Folkhälsan Institute of Genetics since . He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define

and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set

of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ and

defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next, we describe examples of loss functions. First,

let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative; set Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space, and let ∣ ⋅ ∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses:

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., losses with L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses:

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ). Then ℓ can be chosen as above.

In theoretical works the assumed properties for loss

functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a differ-

ent way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn, observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation,'

i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F,

where r_i^f = yi − f(ti) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
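A brief Python sketch of the losses above, and of a straight-line fit obtained by minimizing a summed Huber loss (data and constants are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize

# The loss functions discussed above (gamma, a, b as in the text).
def square(t):
    return t ** 2

def l1(t):
    return np.abs(t)

def huber(t, gamma=1.0):
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):
    return b * (np.exp(a * t) - a * t - 1)  # asymmetric: exponential vs. linear

# Fit a straight line f(t) = c0 + c1*t by minimizing the summed Huber loss.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * t + rng.normal(0.0, 0.1, t.size)
y[::10] += 3.0                                   # a few gross outliers

fit = minimize(lambda c: np.sum(huber(y - (c[0] + c[1] * t))), x0=np.zeros(2))
print(fit.x)                                     # near (1, 2) despite the outliers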

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –



an interplay of permutations with projections is exploited to help establish, for exchangeable errors ε and contrasts v, a permutation bootstrap of least squares residuals that consistently estimates the conditional sampling distribution of v(β̂LS − β), conditional on the 7order statistics of ε (see LePage and Podgorski ). Freedman and Lane () advocated tests based on permutation bootstrap of residuals as a descriptive method.

Generalized Least Squares
If the errors ε are N(0, Σ) distributed, for a covariance matrix Σ known up to a constant multiple, then the maximum likelihood estimates of the coefficients β are produced by a generalized least squares solution retaining properties ()–() (any positive multiple of Σ will produce the same result), given by

β̂ML = (x^tr Σ^−1 x)^−1 x^tr Σ^−1 y = β + (x^tr Σ^−1 x)^−1 x^tr Σ^−1 ε   ()

The generalized least squares solution () retains properties ()–() even if normality is not assumed. It must not be confused with 7generalized linear models, which refers to models equating moments of y to nonlinear functions of xβ.

A very large body of work has been devoted to linear

regression models and the closely related subject areas of experimental design, 7analysis of variance, principal component analysis, and their consequent distribution theory.
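As a sketch of the generalized least squares computation above (whitening by a Cholesky factor of Σ is one standard route; the data values are assumed):

import numpy as np

def gls(X, y, Sigma):
    # Generalized least squares by whitening: with Sigma = L L^tr (Cholesky),
    # ordinary least squares on (L^-1 X, L^-1 y) gives the GLS solution.
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
Sigma = np.diag(np.linspace(0.5, 3.0, n))        # known heteroscedastic covariance
y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), Sigma)
print(gls(X, y, Sigma))                          # approximately (1, 2)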

Reproducing Kernel Generalized Linear Regression Model
Parzen (, Sect. ) developed the reproducing kernel framework, extending generalized least squares to spaces of arbitrary finite or infinite dimension, when the random error function ε = ε(t), t ∈ T, has zero means Eε(t) ≡ 0 and a covariance function K(s, t) = Eε(s)ε(t) that is assumed known up to some positive constant multiple. In this formulation,

y(t) = m(t) + ε(t), t ∈ T,
m(t) = Ey(t) = Σi βi wi(t), t ∈ T,

where the wi(⋅) are known linearly independent functions in the reproducing kernel Hilbert space (RKHS) H(K) of the kernel K. For reasons having to do with singularity of Gaussian measures, it is assumed that the series defining m is convergent in H(K). Parzen extends to that context and solves the problem of best linear unbiased estimation of the model coefficients β, and more generally of estimable linear functions of them, developing confidence regions, prediction intervals, exact or approximate distributions, tests, and other matters of interest, and establishing the Gauss–Markov properties ()–(). The RKHS setup

has been examined from an on-line learning perspective (Vovk ).

Joint Normal Distributed (x, y) as Motivation for the Linear Regression Model and Least Squares
For the moment, think of (x, y) as following a multivariate normal distribution, as might be the case under process control or in natural systems. The (regular) conditional expectation of y relative to x is then, for some β,

E(y∣x) = Ey + E((y − Ey)∣x) = Ey + xβ for every x,

and the discrepancies y − E(y∣x) are, for each x, distributed N(0, σ²) for a fixed σ², independent for different x.

Comparing Two Basic Linear Regression Models
Freedman () compares the analysis of two superficially similar but differing models:
Errors model: Model () above.
Sampling model: Data (xi1, …, xip, yi), i ≤ n, represent

a random sample from a finite population (e.g., an actual physical population).
In the sampling model, the εi are simply the residual

discrepancies y − xβLS of a least squares fit of the linear model xβ to the population. Galton's seed study is an example of this if we regard his data (w, y) as resulting from equal-probability without-replacement random sampling of a population of pairs (w, y), with w restricted to be at ±, ±, ± standard deviations from the mean. Both with- and without-replacement equal-probability sampling are considered by Freedman. Unlike the errors model, there is no assumption in the sampling model that the population linear regression model is in any way correct, although least squares may not be recommended if the population residuals depart significantly from iid normal. Our only purpose is to estimate the coefficients of the population LS fit of the model using the LS fit of the model to our sample, give estimates of the likely proximity of our sample least squares fit to the population fit, and estimate the quality of the population fit (e.g., multiple correlation).

Freedman () established the applicability of Efron's

Bootstrap to each of the two models above, but under different assumptions. His results for the sampling model are a textbook application of the Bootstrap, since a description of the sampling theory of least squares estimates for the sampling model has complexities that are largely, as has been said, out of the way when the Bootstrap approach is used. It would be an interesting exercise to examine data such as Galton's


seed data, analyzing it by the two different models, obtaining confidence intervals for the estimated coefficients of a straight-line fit in each case, to see how closely they agree.

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation (∼ ), was arguably best for his bi-variate normal data (w, y). Tossing in another independent variable, such as w² for a parabolic fit, would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

stood that incorporating additional independent variableswill yield a better t to data and a model closer to truthHow could this be If the more parsimonious choice ofx accounts for enough of the variation of y in relation tothe variables of interest to be useful and if its fewer coe-cients β are estimated more reliably perhaps Intentionaluse of a simpler model might do a reasonably good jobof giving us the estimates we need but at the same timeviolate assumptions about the errors thereby invalidat-ing condence intervals and tests Gauss needed to comeclose to identifying the location at which Ceres wouldappear Going for too much accuracy by complicating themodel risked overtting owing to the limited number ofobservations availableOne possible resolution to this tradeo between reli-

able estimation of a few model coecients versus the riskthat by doing so too much model related material is lein the error term is to include all of several hierarchicallyordered layers of independent variables more thanmay beneeded then remove those that the data suggests are notrequired to explain the greater share of the variation of y(Raudenbush and Bryk ) New results on data com-pression (Candegraves and Tao ) may oer fresh ideas forreliably removing in some cases less relevant independentvariables without rst arranging them in a hierarchy

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y∣X > c) < E(X∣X > c). Termed reversion

to mediocrity or beyond by Samuels (), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J of indicator random variables, J = 1(X > c) − 1(Y > c). J is zero unless one indicator is 1 and the other 0. YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E[YJ] < c E[J], and so

E[Y 1(X > c)] = E[Y 1(Y > c) + YJ] = E[X 1(X > c)] + E[YJ]
< E[X 1(X > c)] + c E[J] = E[X 1(X > c)],

yielding, upon dividing both sides by P(X > c) = P(Y > c), E(Y∣X > c) < E(X∣X > c). ◻
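A quick numerical sketch of the Proposition, taking X, Y to be independent standard normals (an illustrative assumption satisfying the identical-distribution hypothesis):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)       # same distribution as x
c = 1.0                              # c > max(0, EX) = 0
top = x > c                          # the high-performing group on X
print(y[top].mean(), x[top].mean())  # E(Y|X>c) near 0, well below E(X|X>c) near 1.53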

Cross References
7Adaptive Linear Regression
7Analysis of Areal and Spatial Interaction Data
7Business Forecasting Methods
7Gauss-Markov Theorem
7General Linear Models
7Heteroscedasticity
7Least Squares
7Optimum Experimental Design
7Partial Least Squares Regression Versus Other Methods
7Regression Diagnostics
7Regression Models with Increasing Numbers of Unknown Parameters
7Ridge and Surrogate Ridge Regressions
7Simple Linear Regression
7Two-Stage Least Squares

References and Further Reading
Berk RA () Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T () The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat ():–
Efron B () Bootstrap methods: another look at the jackknife. Ann Stat ():–
Efron B, Hastie T, Johnstone I, Tibshirani R () Least angle regression. Ann Stat ():–
Freedman D () Bootstrapping regression models. Ann Stat :–
Freedman D () Statistical models and shoe leather. Sociol Methodol :–
Freedman D () Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D () A non-stochastic interpretation of reported significance levels. J Bus Econ Stat :–
Galton F () Typical laws of heredity. Nature :–, –, –
Galton F () Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland :–
Hinkelmann K, Kempthorne O () Design and analysis of experiments, vol , 2nd edn. Wiley, Hoboken
Kendall M, Stuart A () The advanced theory of statistics, volume : Inference and relationship. Charles Griffin, London
LePage R, Podgorski K () Resampling permutations in regression without second moments. J Multivariate Anal ():–
Parzen E () An approach to time series analysis. Ann Math Stat ():–
Rao CR () Linear statistical inference and its applications. Wiley, New York
Raudenbush S, Bryk A () Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML () Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat :–
Stigler S () The history of statistics: the measurement of uncertainty before . Belknap Press of Harvard University Press, Cambridge
Vovk V () On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, vol . Springer, Berlin, pp –

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x1, …, xn) is a sample from a stochastic process x = {x1, x2, …}. Let pn(x(n); θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio Λn = ln[pn(x(n); θn)/pn(x(n); θ)], where θn = θ + n^−1/2 h and h is a (k × 1) vector. The joint density pn(x(n); θ) belongs to a local asymptotic normal (LAN) family if Λn satisfies

Λn = n^−1/2 h^t Sn(θ) − (1/2) n^−1 h^t Jn(θ) h + op(1),   (1)

where Sn(θ) = ∂ ln pn(x(n); θ)/∂θ, Jn(θ) = −∂² ln pn(x(n); θ)/∂θ ∂θ^t, and

(i) n^−1/2 Sn(θ) →d Nk(0, F(θ)),  (ii) n^−1 Jn(θ) →p F(θ),   (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well

known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂n − θ) →d Nk(0, F^−1(θ))   (3)

A large class of models involving classical iid (independent and identically distributed) observations is covered by the LAN framework. Many time series models and 7Markov processes are also included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-

degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λn satisfies (1) and (2) with F(θ) random, the density pn(x(n); θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm

√n by a random norm Jn^1/2(θ) to get a limiting normal distribution, viz.,

Jn^1/2(θ)(θ̂n − θ) →d N(0, I),   (4)

where I is the identity matrix.

Two examples belonging to the LAMN family are given

below

Example 1: Variance mixture of normals.

Suppose, conditionally on V = v, (x1, x2, …, xn) are iid N(θ, v^−1) random variables, and V is an exponential random variable with mean 1. The marginal joint density of x(n) is then given by

pn(x(n); θ) ∝ [1 + (1/2) Σ_{i=1}^n (xi − θ)²]^−(n/2+1).

It can be verified that F(θ) is

an exponential random variable with mean e ML esti-mator θn = x and

radicn(x minus θ) dETHrarr t() It is interesting to

note that the variance of the limit distribution of x isinfin

Example 2: Autoregressive process

Consider a first-order autoregressive process $\{x_t\}$ defined by $x_t = \theta x_{t-1} + e_t$, $t = 1, 2, \dots$, with $x_0 = 0$, where the $\{e_t\}$ are assumed to be i.i.d. $N(0, 1)$ random variables.


We then have

$$p_n(x^{(n)}, \theta) \propto \exp\left[-\tfrac{1}{2}\sum_{t=1}^n (x_t - \theta x_{t-1})^2\right].$$

For the stationary case $|\theta| < 1$, this model belongs to the LAN family. However, for $|\theta| > 1$, the model belongs to the LAMN family. For any $\theta$, the ML estimator is

$$\hat{\theta}_n = \left(\sum_{t=1}^n x_t x_{t-1}\right)\left(\sum_{t=1}^n x_{t-1}^2\right)^{-1}.$$

One can verify that $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, 1 - \theta^2)$ for $|\theta| < 1$, and $(\theta^2 - 1)^{-1}\theta^n(\hat{\theta}_n - \theta) \xrightarrow{d}$ Cauchy for $|\theta| > 1$.
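A small simulation makes the two regimes concrete. The following sketch (the sample size, seed, and parameter values are arbitrary choices for illustration, not from the original entry) computes the ML estimator by the formula above and applies the two normalizations:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n, replications=2000):
    """Simulate x_t = theta*x_{t-1} + e_t with e_t ~ N(0,1), x_0 = 0, and
    return the ML estimates (sum x_t x_{t-1}) / (sum x_{t-1}^2)."""
    est = np.empty(replications)
    for r in range(replications):
        e = rng.standard_normal(n)
        x = np.empty(n)
        x[0] = e[0]
        for t in range(1, n):
            x[t] = theta * x[t - 1] + e[t]
        est[r] = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    return est

n = 200

# Stationary case (LAN): sqrt(n)*(mle - theta) is approximately N(0, 1 - theta^2).
theta = 0.5
z = np.sqrt(n) * (ar1_mle(theta, n) - theta)
print(z.std(), np.sqrt(1 - theta**2))          # the two should be close

# Explosive case (LAMN): theta^n*(mle - theta)/(theta^2 - 1) is approximately
# Cauchy, so the normalized estimates show the heavy tails of a Cauchy law.
theta = 1.05
c = theta**n * (ar1_mle(theta, n) - theta) / (theta**2 - 1)
print(np.percentile(np.abs(c), [50, 90, 99]))  # quantiles spread out Cauchy-like
```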

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial boards of Communications in Statistics and, currently, the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
Asymptotic Normality
Optimal Statistical Inference in Financial Engineering
Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x-a}{b}\right), \qquad a \in \mathbb{R},\ b > 0,$$

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},$$

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\, f\!\left(\frac{x-a}{b}\right)$$

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ ℝ, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:
● Arc-sine distribution
● Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
● Cauchy and half-Cauchy distributions
● Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
● Ordinary and raised cosine distributions
● Exponential and reflected exponential distributions
● Extreme value distributions of the maximum and the minimum, each of type I
● Hyperbolic secant distribution
● Laplace distribution
● Logistic and half-logistic distributions
● Normal and half-normal distributions
● Parabolic distributions
● Rectangular or uniform distribution
● Semi-elliptical distribution
● Symmetric triangular distribution
● Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
● V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (, ), Blom (), and Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.
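As an illustration of fitting the straight line on probability paper by least squares, the following sketch assumes a normal parent distribution and uses the common Blom-type plotting positions p_i = (i − 0.375)/(n + 0.25) as an approximation to the means of the reduced order statistics; the sample, parameter values, and seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

x = np.sort(rng.normal(loc=10.0, scale=2.0, size=50))  # sample with a = 10, b = 2
n = x.size
i = np.arange(1, n + 1)
alpha = norm.ppf((i - 0.375) / (n + 0.25))  # reduced plotting positions on the ordinate

# Least-squares straight line x = a + b*alpha: the intercept estimates the
# location a, the slope estimates the scale b.
b_hat, a_hat = np.polyfit(alpha, x, 1)
print(a_hat, b_hat)
```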

The latter approach takes the order statistics $X_{1:n} \le X_{2:n} \le \dots \le X_{n:n}$ as regressand and the means of the reduced order statistics, $\alpha_{i:n} = E(Y_{i:n})$, as regressor, which under these circumstances act as plotting positions. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where $\varepsilon_i$ is a random variable expressing the difference between $X_{i:n}$ and its mean $E(X_{i:n}) = a + b\,\alpha_{i:n}$. As the order statistics $X_{i:n}$ and – as a consequence – the disturbance terms $\varepsilon_i$ are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd () – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices:

$$x = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \alpha = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \theta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad A = (\mathbf{1}\ \ \alpha),$$

the regression model now reads

$$x = A\,\theta + \varepsilon$$

with variance–covariance matrix

$$\mathrm{Var}(x) = b^2\, B.$$

The GLS estimator of θ is

$$\hat{\theta} = (A' B^{-1} A)^{-1} A' B^{-1} x,$$

and its variance–covariance matrix reads

$$\mathrm{Var}(\hat{\theta}) = b^2\, (A' B^{-1} A)^{-1}.$$

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining $E(Y_{i:n})$ and $E(Y_{i:n} Y_{j:n})$. For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
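For the exponential distribution, α and B are available in closed form, so Lloyd's estimator can be coded directly. The sketch below (sample, parameter values, and seed chosen only for illustration) uses the standard-exponential order-statistic moments α_{i:n} = Σ_{k=1}^{i} (n − k + 1)^{-1} and B_{ij} = Σ_{k=1}^{min(i,j)} (n − k + 1)^{-2}:

```python
import numpy as np

rng = np.random.default_rng(2)

def lloyd_gls_exponential(x):
    """BLUE of (a, b) for a two-parameter exponential sample via Lloyd's GLS,
    using closed-form moments of standard-exponential order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    w = 1.0 / (n - np.arange(n))           # 1/n, 1/(n-1), ..., 1/1
    alpha = np.cumsum(w)                   # E(Y_{i:n})
    v = np.cumsum(w**2)                    # Var(Y_{i:n})
    B = np.minimum.outer(v, v)             # Cov(Y_{i:n}, Y_{j:n}) = v_{min(i,j)}
    A = np.column_stack([np.ones(n), alpha])
    Binv_A = np.linalg.solve(B, A)         # B^{-1} A
    theta = np.linalg.solve(A.T @ Binv_A, Binv_A.T @ x)
    return theta                           # (a_hat, b_hat)

sample = 5.0 + 2.0 * rng.exponential(size=40)   # a = 5, b = 2
print(lloyd_gls_exponential(sample))
```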

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, then joined Volkswagen AG to do operations research, and was later appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat –
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat –
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc –
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika –
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference –
Rinne H () Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1, \dots, n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution with $\mathrm{logit}(P_i) = \log(P_i/(1-P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1, \dots, x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \mathrm{Binomial}(m_i, P_i),$$
$$\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if $\sigma^2$ is small then, as derived in Williams (),

$$E[R_i] = m_i \pi_i \quad \text{and} \quad \mathrm{Var}(R_i) = m_i \pi_i (1 - \pi_i)\left[1 + \sigma^2 (m_i - 1)\pi_i(1 - \pi_i)\right],$$

where $\mathrm{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
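Williams' small-σ approximation is easy to check by simulation; in the following sketch the parameter values are illustrative assumptions, not taken from the original entry.

```python
import numpy as np

rng = np.random.default_rng(3)

# Binomial counts whose probabilities are logistic-normal:
# logit(P) ~ N(mu, sigma^2), R | P ~ Binomial(m, P).
mu, sigma, m, reps = -0.5, 0.3, 20, 200_000
p = 1.0 / (1.0 + np.exp(-(mu + sigma * rng.standard_normal(reps))))
r = rng.binomial(m, p)

# Compare the simulated variance with Williams' approximation.
pi = 1.0 / (1.0 + np.exp(-mu))
approx_var = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.var(), approx_var)   # both exceed the plain binomial variance m*pi*(1-pi)
```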

About the Author
For biography see the entry Logistic Distribution.

Cross References
Logistic Distribution
Mixed Membership Models
Multivariate Normal Distributions
Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B –
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika –
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika –
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat –

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\left\{\frac{y_i \theta_i - b(\theta_i)}{\alpha(\phi)} + c(y_i; \phi)\right\} \qquad (2)$$

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\left\{y_i \ln\left(\frac{\pi_i}{1 - \pi_i}\right) + \ln(1 - \pi_i)\right\}. \qquad (3)$$

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π; the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^n \left\{y_i \ln\left(\frac{\mu_i}{1 - \mu_i}\right) + \ln(1 - \mu_i)\right\} \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^n \left\{y_i \ln(\mu_i) + (1 - y_i)\ln(1 - \mu_i)\right\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^n \left\{y_i \ln\left(\frac{y_i}{\mu_i}\right) + (1 - y_i)\ln\left(\frac{1 - y_i}{1 - \mu_i}\right)\right\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μᵢ/(1 − μᵢ)). We have, therefore,

$$x_i'\beta = \ln\left(\frac{\mu_i}{1 - \mu_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (7)$$

The value of μᵢ for each observation in the logistic model is calculated as

$$\mu_i = \frac{1}{1 + \exp(-x_i'\beta)} = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \qquad (8)$$

The functions to the right of μᵢ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^n (y_i - \mu_i)\, x_i \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^n x_i x_i'\, \mu_i (1 - \mu_i) \qquad (10)$$
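As a minimal sketch of this estimation scheme, the following Python code iterates Newton–Raphson steps built from the gradient (9) and Hessian (10); the simulated data, true coefficients, and convergence settings are illustrative assumptions, not part of the original entry.

```python
import numpy as np

rng = np.random.default_rng(4)

def logistic_newton_raphson(X, y, iters=25, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson, using the gradient
    X'(y - mu) and Hessian -X'WX with W = diag(mu*(1-mu))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - mu)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X
        step = np.linalg.solve(hess, -grad)   # Newton step: -H^{-1} g
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))  # SEs from the observed information
    return beta, se

# Simulated data: intercept plus one predictor.
n = 1000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x))))
beta, se = logistic_newton_raphson(X, y)
print(beta, se)   # estimates near the true values (0.5, 1.2)
```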

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (Child vs Adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of surviving adults divided by the number who did not survive; the odds of survival for children is defined likewise. The ratio of the odds of survival for adults to the odds of survival for children is referred to as the odds ratio, or the ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [ Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

“The odds of an adult surviving were about half the odds of a child surviving.”

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

“The odds of surviving is one and a half percent greater for each increasing year of age.”
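The arithmetic behind an odds ratio is easy to reproduce directly. The sketch below uses hypothetical 2 × 2 counts (invented for illustration; they are not the Titanic figures) and adds the usual large-sample confidence interval for the log odds ratio:

```python
import numpy as np

# Hypothetical 2x2 survival table (counts made up for illustration).
#                        adult  child
survived_yes = np.array([420,    60])
survived_no  = np.array([760,    50])

odds = survived_yes / survived_no     # odds of survival within each group
odds_ratio = odds[0] / odds[1]        # adults relative to children

# Large-sample (Woolf) standard error of the log odds ratio: sqrt of the
# sum of reciprocal cell counts, giving a 95% confidence interval.
se = np.sqrt((1.0 / np.array([*survived_yes, *survived_no])).sum())
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se)
print(odds_ratio, ci)   # exp(beta) from a logit fit of survived on adult
                        # reproduces exactly this ratio
```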

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

$$f(y_i; \pi_i, m_i) = \binom{m_i}{y_i}\, \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i} \qquad (11)$$

with a corresponding log-likelihood function expressed as

$$L(\mu_i; y_i, m_i) = \sum_{i=1}^n \left\{y_i \ln\left(\frac{\mu_i}{1 - \mu_i}\right) + m_i \ln(1 - \mu_i) + \ln\binom{m_i}{y_i}\right\} \qquad (12)$$

Taking derivatives of the cumulant, $-m_i \ln(1 - \mu_i)$, as we did for the binary response model, produces a mean of $\mu_i = m_i \pi_i$ and variance $\mu_i(1 - \mu_i/m_i)$. Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having the same pattern of predictor values x1, x2, x3; of those three cases, one has a value of y equal to 1, and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>|z|    [ conf. interval]
x1
x2
x3

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a table with only a handful of covariate-pattern rows, as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across a fixed number of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see Akaike's Information Criterion and Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe () and Hosmer and Lemeshow () references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects models, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is co-author of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
Case-Control Studies
Categorical Data Analysis
Generalized Linear Models
Multivariate Data Analysis: An Overview
Probit Analysis
Recursive Partitioning
Regression Models with Symmetrical Errors
Robust Regression Estimation in Generalized Linear Models
Statistics: An Overview
Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

$$f(x) = \frac{1}{\tau}\, \frac{\exp\{-(x - \mu)/\tau\}}{\left[1 + \exp\{-(x - \mu)/\tau\}\right]^2}, \qquad -\infty < x < \infty, \qquad (1)$$

and the cumulative distribution function is

$$F(x) = \left[1 + \exp\{-(x - \mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty.$$

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

$$S(x) = \left[1 + \exp\{(x - \mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty,$$

$$h(x) = \frac{1}{\tau}\left[1 + \exp\{-(x - \mu)/\tau\}\right]^{-1}.$$

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

$$h(t) = \frac{(\alpha/\theta)\,(t/\theta)^{\alpha - 1}}{1 + (t/\theta)^{\alpha}}, \qquad t, \theta, \alpha > 0,$$

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

$$f(x) = F(x)\,[1 - F(x)],$$

which, in turn, by elementary calculus implies that

$$\mathrm{logit}(F(x)) = \log_e\left[\frac{F(x)}{1 - F(x)}\right] = x \qquad (2)$$

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ~ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
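A direct implementation of this simulation recipe (with parameter values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate logistic variates via the inverse-CDF identity X = ln(U/(1-U)).
mu, tau, n = 2.0, 1.5, 100_000
u = rng.uniform(size=n)
x = mu + tau * np.log(u / (1.0 - u))

print(x.mean(), mu)                        # sample mean ~ mu
print(x.var(), tau**2 * np.pi**2 / 3.0)    # sample variance ~ tau^2 * pi^2 / 3
```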

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

$$\mathrm{logit}(P(d)) = \log_e\left[\frac{P(d)}{1 - P(d)}\right] = \beta_0 + \beta_1 d,$$

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/|β₁|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This carries across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of generalized linear models. A comprehensive treatment of logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Bivariate Distributions
Location-Scale Distributions
Logistic Regression
Multivariate Statistical Distributions
Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc –
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A –

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties

It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

“Plot along one axis accumulated percentages of the population from poorest to richest, and along the other, wealth held by these percentages of the population.”

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing, (II) L(p) ≤ p, (III) L(p) is convex, (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the following general rules:

“A unique Lorenz curve corresponds to every distribution. The contrary does not hold: every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.”

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (). This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, L_1(p) has greater inequality than L_2(p)
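The following sketch computes the empirical Lorenz curve and the Gini index for a simulated income sample; the lognormal population and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def lorenz_gini(income):
    """Empirical Lorenz curve and Gini index: G is the area between the
    diagonal and the Lorenz curve divided by the area 1/2 under the
    diagonal, i.e. G = 1 - 2 * (area under the Lorenz curve)."""
    x = np.sort(np.asarray(income, dtype=float))
    p = np.concatenate(([0.0], np.arange(1, x.size + 1) / x.size))
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    gini = 1.0 - 2.0 * np.trapz(L, p)
    return p, L, gini

# A skewed "income" sample from a lognormal population.
income = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
_, _, g = lorenz_gini(income)
print(g)   # for a lognormal, G = 2*Phi(sigma/sqrt(2)) - 1, about 0.52 here
```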

Income Redistributions

It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem 1. Let u(x) be a continuous monotone increasing function and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

L Loss Function

Cross References
Econometrics
Economic Statistics
Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica –
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ –
Jakobsson U () On the measurement of the degree of progression. J Publ Econ –
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica –
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc New Ser –

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see Decision Theory: An Introduction and Decision Theory: An Overview) and to regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise, the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

$$L : \Theta \times \Theta \to [0, \infty).$$

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next, we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative: Θ = Θ₀ + Θ₁. The usual loss function is then given by

let us consider a test problem en Θ is divided in twodisjoint subsets Θ and Θ describing the null hypothesisand the alternative set Θ = Θ + Θen the usual lossfunction is given by

$$L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0,\ \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1,\ \vartheta \in \Theta_0. \end{cases}$$

For point estimation problems we assume that Θ is a normed linear space, and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L₁-loss. Another class of robust losses are the famous Huber losses,

$$\ell(t) = \tfrac{1}{2}t^2 \ \text{ if } |t| \le \gamma, \qquad \ell(t) = \gamma|t| - \tfrac{1}{2}\gamma^2 \ \text{ if } |t| > \gamma,$$

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems, Varian () introduced the LinEx losses,

$$\ell(t) = b\left(\exp(at) - at - 1\right),$$

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties of loss functions can be quite different. Classically, it was assumed that the loss is convex (see Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined so as to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) $y_1, \dots, y_n$, observed at design points $t_1, \dots, t_n$ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes $\sum_{i=1}^n \ell(r_i^f)$, $f \in F$, where $r_i^f = y_i - f(t_i)$ is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
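As a sketch of this use of loss functions in regression, the following code fits a straight line by directly minimizing the summed loss of the residuals for several of the losses above; the data, outliers, and tuning constants are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)

# The losses discussed above, applied elementwise to residuals.
def square(t): return t**2
def l1(t): return np.abs(t)
def huber(t, gamma=1.0):
    return np.where(np.abs(t) <= gamma, 0.5 * t**2,
                    gamma * np.abs(t) - 0.5 * gamma**2)
def linex(t, a=1.0, b=1.0): return b * (np.exp(a * t) - a * t - 1.0)

def fit_line(t, y, loss):
    """Best-approximating line f(x) = c0 + c1*x under the given loss."""
    obj = lambda c: np.sum(loss(y - (c[0] + c[1] * t)))
    return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

t = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * t + rng.standard_normal(60)
y[::10] += 8.0                         # a few gross outliers
for loss in (square, l1, huber, linex):
    print(loss.__name__, fit_line(t, y, loss))
# the robust L1 and Huber fits are pulled far less by the outliers
```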

Cross References
Advantages of Bayesian Structuring: Estimating Ranks and Histograms
Bayesian Statistics
Decision Theory: An Introduction
Decision Theory: An Overview
Entropy and Cross Entropy as Diversity and Distance Measures
Sequential Sampling
Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis –
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp –

Page 26: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

L Linear Regression Models

seed data analyzing it by the two dierent models obtain-ing condence intervals for the estimated coecients of astraight line t in each case to see how closely they agree

Balancing Good Fit Against Reproducibility
A balance in the linear regression model is necessary. Including too many independent variables in order to assure a close fit of the model to the data is called over-fitting. Models over-fit in this manner tend not to work with fresh data, for example, to predict y from a fresh choice of the values of the independent variables. Galton's regression line, although it did not afford very accurate predictions of y from w owing to the modest correlation, was arguably best for his bivariate normal data (w, y). Tossing in another independent variable, such as w², for a parabolic fit would have over-fit the data, possibly spoiling discovery of the principle of regression to mediocrity.

A model might well be used even when it is understood that incorporating additional independent variables will yield a better fit to data and a model closer to truth. How could this be? If the more parsimonious choice of x's accounts for enough of the variation of y in relation to the variables of interest to be useful, and if its fewer coefficients β are estimated more reliably, perhaps. Intentional use of a simpler model might do a reasonably good job of giving us the estimates we need, but at the same time violate assumptions about the errors, thereby invalidating confidence intervals and tests. Gauss needed to come close to identifying the location at which Ceres would appear; going for too much accuracy by complicating the model risked overfitting, owing to the limited number of observations available.

One possible resolution to this tradeoff between reliable estimation of a few model coefficients and the risk that, by doing so, too much model-related material is left in the error term is to include all of several hierarchically ordered layers of independent variables, more than may be needed, and then remove those that the data suggest are not required to explain the greater share of the variation of y (Raudenbush and Bryk 2002). New results on data compression (Candès and Tao 2007) may offer fresh ideas for reliably removing, in some cases, less relevant independent variables without first arranging them in a hierarchy.

Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
Regression (when applicable) is often used to prove that a high-performing group on some scoring, i.e., X > c > EX, will not average so highly on another scoring Y as they do on X, i.e., E(Y | X > c) < E(X | X > c). Termed reversion to mediocrity or beyond by Samuels (1991), this property is easily come by when X, Y have the same distribution. The following result and brief proof are Samuels', except for clarifications made here (italics). These comments are addressed only to the formal mathematical proof of the paper.

Proposition. Let random variables X, Y be identically distributed with finite mean EX, and fix any c > max(0, EX). If P(X > c and Y > c) < P(Y > c), then there is reversion to mediocrity or beyond for that c.

Proof. For any given c > max(0, EX), define the difference J = 1(X > c) − 1(Y > c) of indicator random variables. J is zero unless one indicator is 1 and the other 0; YJ is less than or equal to cJ = c on J = 1 (i.e., on X > c, Y ≤ c), and YJ is strictly less than cJ = −c on J = −1 (i.e., on X ≤ c, Y > c). Since the event J = −1 has positive probability by assumption, the previous implies E YJ < c EJ, and so

E Y 1(X > c) = E(Y 1(Y > c) + YJ) = E X 1(X > c) + E YJ < E X 1(X > c) + c EJ = E X 1(X > c),

yielding E(Y | X > c) < E(X | X > c). ◻
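Samuels' inequality is easy to see in simulation. The following minimal sketch (not part of the original entry) uses correlated standard normal X and Y, which satisfy the hypothesis of the Proposition; all names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, c = 10**6, 1.0

# X and Y are identically distributed (standard normal) and positively
# correlated; then P(X > c and Y > c) < P(Y > c), so the Proposition
# predicts reversion: E(Y | X > c) < E(X | X > c).
z = rng.standard_normal(n)
x = 0.6 * z + 0.8 * rng.standard_normal(n)
y = 0.6 * z + 0.8 * rng.standard_normal(n)

sel = x > c
print(x[sel].mean())  # E(X | X > c), about 1.52
print(y[sel].mean())  # E(Y | X > c), noticeably smaller, about 0.55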

Cross References
▸Adaptive Linear Regression
▸Analysis of Areal and Spatial Interaction Data
▸Business Forecasting Methods
▸Gauss–Markov Theorem
▸General Linear Models
▸Heteroscedasticity
▸Least Squares
▸Optimum Experimental Design
▸Partial Least Squares Regression Versus Other Methods
▸Regression Diagnostics
▸Regression Models with Increasing Numbers of Unknown Parameters
▸Ridge and Surrogate Ridge Regressions
▸Simple Linear Regression
▸Two-Stage Least Squares

References and Further Reading
Berk RA (2004) Regression analysis: a constructive critique. Sage, Newbury Park
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Freedman D (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228


Freedman D (1991) Statistical models and shoe leather. Sociol Methodol 21:291–313
Freedman D (2005) Statistical models: theory and practice. Cambridge University Press, New York
Freedman D, Lane D (1983) A non-stochastic interpretation of reported significance levels. J Bus Econ Stat 1:292–298
Galton F (1877) Typical laws of heredity. Nature 15:492–495, 512–514, 532–533
Galton F (1886) Regression towards mediocrity in hereditary stature. J Anthropol Inst Great Britain and Ireland 15:246–263
Hinkelmann K, Kempthorne O (2008) Design and analysis of experiments, vol 1, 2nd edn. Wiley, Hoboken
Kendall M, Stuart A (1961) The advanced theory of statistics, vol 2: Inference and relationship. Charles Griffin, London
LePage R, Podgórski K ( ) Resampling permutations in regression without second moments. J Multivariate Anal, Elsevier
Parzen E (1961) An approach to time series analysis. Ann Math Stat 32(4):951–989
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Raudenbush S, Bryk A (2002) Hierarchical linear models: applications and data analysis methods, 2nd edn. Sage, Thousand Oaks
Samuels ML (1991) Statistical reversion toward the mean: more universal than regression toward the mean. Am Stat 45:344–346
Stigler S (1986) The history of statistics: the measurement of uncertainty before 1900. Belknap Press of Harvard University Press, Cambridge
Vovk V (2006) On-line regression competitive with reproducing kernel Hilbert spaces. Lecture notes in computer science, vol 3959. Springer, Berlin, pp 452–463

Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose x(n) = (x_1, …, x_n) is a sample from a stochastic process x = {x_1, x_2, …}. Let p_n(x(n), θ) denote the joint density function of x(n), where θ ∈ Ω ⊂ R^k is a parameter. Define the log-likelihood ratio

Λ_n = ln[p_n(x(n), θ_n)/p_n(x(n), θ)],

where θ_n = θ + n^{−1/2} h and h is a (k × 1) vector. The joint density p_n(x(n), θ) belongs to a local asymptotic normal (LAN) family if Λ_n satisfies

Λ_n = n^{−1/2} h^t S_n(θ) − (1/2) n^{−1} h^t J_n(θ) h + o_p(1),  (1)

where S_n(θ) = d ln p_n(x(n), θ)/dθ, J_n(θ) = −d² ln p_n(x(n), θ)/dθ dθ^t, and

(i) n^{−1/2} S_n(θ) →d N_k(0, F(θ)),  (ii) n^{−1} J_n(θ) →p F(θ),  (2)

F(θ) being the limiting Fisher information matrix. Here F(θ) is assumed to be non-random. See LeCam and Yang (2000) for a review of the LAN family.

For the LAN family defined by (1) and (2) it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator θ̂_n is a consistent, asymptotically normal, and efficient estimator of θ, with

√n(θ̂_n − θ) →d N_k(0, F^{−1}(θ)).  (3)

A large class of models involving the classical iid (independent and identically distributed) observations are covered by the LAN framework. Many time series models and ▸Markov processes are also included in the LAN family.

If the limiting Fisher information matrix F(θ) is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If Λ_n satisfies (1) and (2) with F(θ) random, the density p_n(x(n), θ) belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott (1983) for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norm √n by a random norm J_n^{1/2}(θ) to get the limiting normal distribution, viz.,

J_n^{1/2}(θ)(θ̂_n − θ) →d N(0, I),  (4)

where I is the identity matrix.

where I is the identity matrixTwo examples belonging to the LAMN family are given

below

Example Variance mixture of normals

Suppose conditionally on V = v (x x xn)are iid N(θ vminus) random variables and V is an expo-nential random variable with mean e marginaljoint density of x(n) is then given by pn(x(n) θ) prop

[ +

n

sum(xi minus θ)]

minus( n +) It can be veried that F(θ) is

an exponential random variable with mean e ML esti-mator θn = x and

radicn(x minus θ) dETHrarr t() It is interesting to

note that the variance of the limit distribution of x isinfin
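The t(2) limit is easy to check numerically. A quick sketch (not from the original entry), using the representation √n(x̄ − θ) ∣ V ∼ N(0, 1/V) with V ∼ Exp(1):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 200_000

# Z / sqrt(V) with V ~ Exp(1) has exactly the Student t(2) law,
# matching the limit of sqrt(n)(xbar - theta) in Example 1.
v = rng.exponential(1.0, reps)
z = rng.standard_normal(reps)
stat = z / np.sqrt(v)

for q in (0.90, 0.95, 0.99):
    print(q, np.quantile(stat, q), stats.t(2).ppf(q))  # close agreement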

Example 2: Autoregressive process
Consider a first-order autoregressive process {x_t} defined by x_t = θ x_{t−1} + e_t, t = 1, 2, …, with x_0 = 0, where {e_t} are assumed to be iid N(0, 1) random variables. We then have

p_n(x(n), θ) ∝ exp[−(1/2) Σ_{t=1}^n (x_t − θ x_{t−1})²].

For the stationary case ∣θ∣ < 1, this model belongs to the LAN family. However, for ∣θ∣ > 1, the model belongs to the LAMN family. For any θ, the ML estimator is θ̂_n = (Σ_{t=1}^n x_t x_{t−1})(Σ_{t=1}^n x²_{t−1})^{−1}. One can verify that √n(θ̂_n − θ) →d N(0, 1 − θ²) for ∣θ∣ < 1, and (θ² − 1)^{−1} θ^n (θ̂_n − θ) →d Cauchy for ∣θ∣ > 1.
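Both limit results can be illustrated by simulation. The sketch below (not part of the original entry; sample sizes and seeds are arbitrary) computes the ML estimator above and checks the normal limit in the stationary case and the Cauchy limit in the explosive case:

import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n, reps):
    """Simulate AR(1) paths x_t = theta*x_{t-1} + e_t (x_0 = 0) and
    return the ML estimator sum(x_t x_{t-1}) / sum(x_{t-1}^2) per path."""
    est = np.empty(reps)
    for r in range(reps):
        x = np.zeros(n + 1)
        e = rng.standard_normal(n)
        for t in range(1, n + 1):
            x[t] = theta * x[t - 1] + e[t - 1]
        est[r] = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])
    return est

# Stationary case (LAN): sqrt(n)(theta_hat - theta) ~ N(0, 1 - theta^2).
theta, n = 0.5, 1000
z = np.sqrt(n) * (ar1_mle(theta, n, 1000) - theta)
print(z.std(), np.sqrt(1 - theta**2))          # both about 0.87

# Explosive case (LAMN): (theta^2 - 1)^(-1) theta^n (theta_hat - theta)
# is approximately standard Cauchy -- note the heavy-tailed quantiles.
theta, n = 1.05, 300
w = theta**n * (ar1_mle(theta, n, 1000) - theta) / (theta**2 - 1)
print(np.quantile(w, 0.95), np.tan(0.45 * np.pi))  # both about 6.3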

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, and on the editorial board of Communications in Statistics, and he currently serves on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▸Asymptotic Normality
▸Optimal Statistical Inference in Financial Engineering
▸Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL (2000) Asymptotics in statistics, 2nd edn. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

F_X(x ∣ a, b) = Pr(X ≤ x ∣ a, b) = F((x − a)/b), a ∈ R, b > 0,

where F(·) is a distribution having no other parameters. Different F(·)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

Y = (X − a)/b

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

f_X(x ∣ a, b) = dF_X(x ∣ a, b)/dx,

then (a, b) is a location-scale parameter for the distribution of X if (and only if)

f_X(x ∣ a, b) = (1/b) f((x − a)/b)

for some density f(·), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y and the standard deviation of X is √Var(X) = b σ_Y.

The location parameter a, a ∈ R, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distributions of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (1975, 1976), Blom (1958), and Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus, we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X_{i:n}, X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n}, as regressand and the means of the reduced order statistics, α_{i:n} = E(Y_{i:n}), as regressor, which under these circumstances act as plotting positions. The regression model reads

X_{i:n} = a + b α_{i:n} + ε_i,

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and, as a consequence, the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (1952), the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X_{1:n}, X_{2:n}, …, X_{n:n})′,  1 = (1, 1, …, 1)′,  α = (α_{1:n}, α_{2:n}, …, α_{n:n})′,
ε = (ε_1, ε_2, …, ε_n)′,  θ = (a, b)′,  A = (1  α),

the regression model now reads

x = A θ + ε,

with variance–covariance matrix

Var(x) = b² B,

where B = (Cov(Y_{i:n}, Y_{j:n})) is the variance–covariance matrix of the reduced order statistics. The GLS estimator of θ is

θ̂ = (A′ B^{−1} A)^{−1} A′ B^{−1} x,

and its variance–covariance matrix reads

Var(θ̂) = b² (A′ B^{−1} A)^{−1}.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (2010). Maximum likelihood estimation for location–scale distributions is treated by Mi.
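For the exponential case, where α and B have closed forms via the Rényi representation of exponential order statistics, Lloyd's estimator can be sketched as follows (not part of the original entry; function and variable names are illustrative):

import numpy as np

def exp_blue(x):
    """Lloyd's GLS (BLUE) estimates of location a and scale b for a
    two-parameter exponential sample, using closed forms for standard
    exponential order statistics:
      E(Y_{i:n})          = sum_{k=1}^{i} 1/(n-k+1)
      Cov(Y_{i:n},Y_{j:n}) = sum_{k=1}^{min(i,j)} 1/(n-k+1)^2
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    w = 1.0 / (n - np.arange(n))          # 1/n, 1/(n-1), ..., 1/1
    alpha = np.cumsum(w)                  # E(Y_{i:n})
    v = np.cumsum(w**2)                   # Var(Y_{i:n})
    B = np.minimum.outer(v, v)            # Cov = Var of the smaller index
    A = np.column_stack([np.ones(n), alpha])
    Binv_A = np.linalg.solve(B, A)
    Binv_x = np.linalg.solve(B, x)
    return np.linalg.solve(A.T @ Binv_A, A.T @ Binv_x)  # (a_hat, b_hat)

rng = np.random.default_rng(42)
sample = 10.0 + 2.0 * rng.exponential(size=50)   # true a = 10, b = 2
print(exp_blue(sample))                           # estimates near (10, 2)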

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of The Weibull Distribution: A Handbook (Chapman and Hall/CRC, 2008).


Cross References
▸Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat 24:95–108
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat 25:47–50
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc 55:546–560
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika 39:88–95
Mi J ( ) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Plann Inference
Rinne H (2010) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ▸beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

R_i ∣ P_i ∼ Binomial(m_i, P_i)
logit(P_i) ∣ Z = η_i + σZ = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip} + σZ
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (1982),

E[R_i] = m_i π_i and
Var(R_i) = m_i π_i(1 − π_i)[1 + σ²(m_i − 1)π_i(1 − π_i)],

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial; that is, it exhibits greater variability. The random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
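A minimal sketch of the quadrature computation (not from the original entry; names are illustrative): the marginal probability of r successes in one group integrates the binomial likelihood against the normal random-effect density, which Gauss–Hermite quadrature turns into a weighted sum.

import numpy as np
from scipy.special import comb

def marginal_loglik(r, m, eta, sigma, n_quad=30):
    """Log of P(R = r) for one group in the logit-normal binomial model:
    R | Z ~ Binomial(m, p(Z)), logit(p(Z)) = eta + sigma*Z, Z ~ N(0, 1).
    The normal integral is approximated by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)   # nodes for weight exp(-t^2)
    z = np.sqrt(2.0) * t                              # change of variable for N(0, 1)
    p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))
    like = comb(m, r) * p**r * (1.0 - p)**(m - r)
    return np.log(np.sum(w * like) / np.sqrt(np.pi))

# With sigma -> 0 this reproduces the plain binomial logit likelihood.
print(marginal_loglik(r=7, m=10, eta=0.5, sigma=1.0))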

About the Author
For biography, see the entry ▸Logistic Distribution.

Cross References
▸Logistic Distribution
▸Mixed Membership Models
▸Multivariate Normal Distributions
▸Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44:139–177
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63:1–12


Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67:261–272
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat 31:144–148

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(y_i; π_i) = π_i^{y_i}(1 − π_i)^{1−y_i}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ▸Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(y_i; θ_i, ϕ) = exp{[y_i θ_i − b(θ_i)]/α(ϕ) + c(y_i; ϕ)},  (2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b′′(θ). Taken together, the above four terms define a specific GLM model. We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(y_i; π_i) = exp{y_i ln(π_i/(1 − π_i)) + ln(1 − π_i)}.  (3)

The link function is therefore ln(π/(1 − π)), and the cumulant is −ln(1 − π), or ln(1/(1 − π)). For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + ln(1 − μ_i)}  (4)

or

L(μ_i; y_i) = Σ_{i=1}^n {y_i ln(μ_i) + (1 − y_i) ln(1 − μ_i)}.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 Σ_{i=1}^n {y_i ln(y_i/μ_i) + (1 − y_i) ln[(1 − y_i)/(1 − μ_i)]}.  (6)

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

x′_i β = ln(μ_i/(1 − μ_i)) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_n x_n.  (7)

The value of μ_i for each observation in the logistic model is calculated as

μ_i = 1/(1 + exp(−x′_i β)) = exp(x′_i β)/(1 + exp(x′_i β)).  (8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized as x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = Σ_{i=1}^n (y_i − μ_i) x_i  (9)

∂²L(β)/∂β ∂β′ = −Σ_{i=1}^n x_i x′_i μ_i(1 − μ_i).  (10)
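A compact sketch of Newton–Raphson estimation using exactly the gradient (9) and Hessian (10) (not from the original entry; the data-generating step and all names are illustrative):

import numpy as np

def logit_newton(X, y, tol=1e-10, max_iter=50):
    """Fit a binary logistic regression by Newton-Raphson.
    X: n x p design matrix (include a column of ones for the intercept).
    y: n-vector of 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))       # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                      # gradient, Eq. (9)
        hess = -(X.T * (mu * (1.0 - mu))) @ X      # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta -= step                               # Newton update
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))    # standard errors
    return beta, se

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.standard_normal(500)])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X[:, 1])))).astype(float)
beta, se = logit_newton(X, y)
print(beta, se)  # coefficient estimates near (0.5, 1.0)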

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived   child   adults   Total
no            52      765     817
yes           57      442     499
Total        109    1,207   1,316

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.527. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived   Odds Ratio   Std. Err.       z    P>|z|   [95% Conf. Interval]
age             0.527       0.106   −3.19    0.001         0.356   0.782

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

"The odds of an adult surviving were about half the odds of a child surviving."

By inverting the estimated odds ratio above, we may conclude that children had [1/0.527 ≈ 1.90] some 90% – or nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

"The odds of surviving is one and a half percent greater for each increasing year of age."
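The tabulated odds ratio and a large-sample confidence interval can be reproduced directly from the 2×2 table; a small sketch (not part of the original entry):

import numpy as np
from scipy import stats

# Titanic passengers: deaths and survivals by age group
died = {"child": 52, "adult": 765}
lived = {"child": 57, "adult": 442}

odds_adult = lived["adult"] / died["adult"]      # about 0.578
odds_child = lived["child"] / died["child"]      # about 1.096
or_hat = odds_adult / odds_child                 # about 0.527

# Large-sample CI for ln(OR): se^2 = sum of reciprocal cell counts
se_log = np.sqrt(1/lived["adult"] + 1/died["adult"]
                 + 1/lived["child"] + 1/died["child"])
z = stats.norm.ppf(0.975)
ci = np.exp(np.log(or_hat) + np.array([-z, z]) * se_log)
print(or_hat, ci)  # about 0.527 and (0.356, 0.782)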

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (y) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(y_i; π_i, m_i) = (m_i choose y_i) π_i^{y_i}(1 − π_i)^{m_i − y_i},  (11)

with a corresponding log-likelihood function expressed as

L(μ_i; y_i, m_i) = Σ_{i=1}^n {y_i ln(μ_i/(1 − μ_i)) + m_i ln(1 − μ_i) + ln(m_i choose y_i)}.  (12)

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i).

Consider the data below:

y   cases   x1   x2   x3
1       3    1    0    1
⋮

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of x1 = 1, x2 = 0, and x3 = 1. Of those three cases, one has a value of y equal to 1; the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio   OIM std. err.    z    P>|z|   [95% conf. interval]
x1        …             …          …       …           …
x2        …             …          …       …           …
x3        …             …          …       …           …

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, having the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of many thousands of observations grouped into a table of only a few covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see ▸Akaike's Information Criterion and ▸Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▸Case-Control Studies
▸Categorical Data Analysis
▸Generalized Linear Models
▸Multivariate Data Analysis: An Overview
▸Probit Analysis
▸Recursive Partitioning
▸Regression Models with Symmetrical Errors
▸Robust Regression Estimation in Generalized Linear Models
▸Statistics: An Overview
▸Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = 1 / [1 + exp{−(x − μ)/τ}],  −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1 / [1 + exp{(x − μ)/τ}],  −∞ < x < ∞,
h(x) = (1/τ) · 1 / [1 + exp{−(x − μ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ)(t/θ)^{α−1} / [1 + (t/θ)^α],  t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which, in turn, by elementary calculus, implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
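In code, the inversion is a one-liner; a tiny sketch (not part of the original entry):

import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)            # U ~ Uniform(0, 1)
x = np.log(u / (1.0 - u))          # standard logistic via Eq. (2)
mu, tau = 2.0, 0.5
y = tau * x + mu                   # general logistic as in Eq. (1)
print(x.var(), np.pi**2 / 3)       # sample variance near pi^2/3 ≈ 3.29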

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ▸probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β_0 + β_1 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/∣β_1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ▸generalized linear models. A comprehensive treatment of ▸logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis, and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▸Asymptotic Relative Efficiency in Estimation
▸Asymptotic Relative Efficiency in Testing
▸Bivariate Distributions
▸Location-Scale Distributions
▸Logistic Regression
▸Multivariate Statistical Distributions
▸Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

"Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at point (1, 1), with the following additional properties: (I) L(p) is monotone increasing, (II) L(p) ≤ p, (III) L(p) is convex, (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule:

"A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant."

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; by the Gini index, L_1(p) has greater inequality than L_2(p)

The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (1912). This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
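A small sketch (not from the original entry; names and the lognormal test sample are illustrative) computing the empirical Lorenz curve and Gini index:

import numpy as np

def lorenz_gini(income):
    """Empirical Lorenz curve points (p, L(p)) and Gini index."""
    x = np.sort(np.asarray(income, dtype=float))
    n = x.size
    p = np.arange(1, n + 1) / n          # cumulative population shares
    L = np.cumsum(x) / x.sum()           # cumulative income shares
    # Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule
    Lfull = np.concatenate([[0.0], L])
    area = np.sum(Lfull[1:] + Lfull[:-1]) / (2 * n)
    return p, L, 1.0 - 2.0 * area

rng = np.random.default_rng(7)
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=10_000)
p, L, gini = lorenz_gini(incomes)
print(gini)  # for lognormal(sigma=0.8): G = 2*Phi(0.8/sqrt(2)) - 1, about 0.43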

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
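Theorem 1 is easy to see numerically. In the sketch below (not from the original entry; the transformations and sample are illustrative), a concave rule u(x) = x^0.8 has u(x)/x decreasing and so lowers the Gini index, while a flat tax u(x) = 0.7x leaves it unchanged:

import numpy as np

def gini(income):
    """Gini index via the mean-difference formula
    G = sum((2i - n - 1) * x_(i)) / (n^2 * mean(x)), x sorted ascending."""
    x = np.sort(np.asarray(income, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (n * n * x.mean())

rng = np.random.default_rng(7)
x = rng.lognormal(10.0, 0.8, 100_000)

print(gini(x))           # pre-tax inequality, about 0.43
print(gini(x ** 0.8))    # progressive rule: u(x)/x decreasing -> smaller Gini
print(gini(0.7 * x))     # flat tax: u(x)/x constant -> Gini unchanged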

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has been a scientist at the Folkhälsan Institute of Genetics since then. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▸Econometrics
▸Economic Statistics
▸Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▸Decision Theory: An Introduction and ▸Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore, the value of the loss function is zero then. Otherwise, the value gives the loss which is suffered by the decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration. Next, we describe examples of loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1 describing the null hypothesis and the alternative, Θ = Θ_0 + Θ_1. The usual loss function is then given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ_0 or θ, ϑ ∈ Θ_1,
L(θ, ϑ) = 1 if θ ∈ Θ_0, ϑ ∈ Θ_1 or θ ∈ Θ_1, ϑ ∈ Θ_0.

For point estimation problems we assume that Θ is a normed linear space and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥², θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,

ℓ(t) = t² if ∣t∣ ≤ γ, and ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses,

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
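The losses above are easy to compare numerically; a small sketch (not from the original entry; parameter values are illustrative):

import numpy as np

def squared(t):
    return t**2

def l1(t):
    return np.abs(t)

def huber(t, gamma=1.0):
    # t^2 inside [-gamma, gamma], linear growth 2*gamma*|t| - gamma^2 outside
    return np.where(np.abs(t) <= gamma, t**2, 2*gamma*np.abs(t) - gamma**2)

def linex(t, a=1.0, b=1.0):
    # asymmetric: roughly exponential on one side of 0, linear on the other
    return b * (np.exp(a*t) - a*t - 1)

t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for loss in (squared, l1, huber, linex):
    print(loss.__name__, loss(t))   # every loss is zero at t = 0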

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ▸Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff.

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function as the "best approximation", i.e., the function in F that minimizes Σ_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².

Cross References
▸Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▸Bayesian Statistics
▸Decision Theory: An Introduction
▸Decision Theory: An Overview
▸Entropy and Cross Entropy as Diversity and Distance Measures
▸Sequential Sampling
▸Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W ( ) Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp 195–208



Local Asymptotic Mixed Normal Family

Ishwar V. Basawa
Professor
University of Georgia, Athens, GA, USA

Suppose $x(n) = (x_1, \ldots, x_n)$ is a sample from a stochastic process $x = \{x_1, x_2, \ldots\}$. Let $p_n(x(n), \theta)$ denote the joint density function of $x(n)$, where $\theta \in \Omega \subset \mathbb{R}^k$ is a parameter. Define the log-likelihood ratio $\Lambda_n = \ln\left[p_n(x(n), \theta_n)/p_n(x(n), \theta)\right]$, where $\theta_n = \theta + n^{-1/2}h$ and $h$ is a $(k \times 1)$ vector. The joint density $p_n(x(n), \theta)$ belongs to a local asymptotic normal (LAN) family if $\Lambda_n$ satisfies

$$\Lambda_n = n^{-1/2}\,h^t S_n(\theta) - \tfrac{1}{2}\,n^{-1}\,h^t J_n(\theta)\,h + o_p(1), \qquad (1)$$

where $S_n(\theta) = \dfrac{\partial \ln p_n(x(n), \theta)}{\partial \theta}$, $J_n(\theta) = -\dfrac{\partial^2 \ln p_n(x(n), \theta)}{\partial \theta\, \partial \theta^t}$, and

$$\text{(i)}\ \ n^{-1/2} S_n(\theta) \xrightarrow{d} N_k(0, F(\theta)), \qquad \text{(ii)}\ \ n^{-1} J_n(\theta) \xrightarrow{p} F(\theta), \qquad (2)$$

$F(\theta)$ being the limiting Fisher information matrix. Here $F(\theta)$ is assumed to be non-random. See LeCam and Yang () for a review of the LAN family.

For the LAN family defined by (1) and (2), it is well known that, under some regularity conditions, the maximum likelihood (ML) estimator $\hat{\theta}_n$ is a consistent, asymptotically normal, and efficient estimator of $\theta$, with

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N_k(0, F^{-1}(\theta)). \qquad (3)$$

A large class of models involving classical i.i.d. (independent and identically distributed) observations is covered by the LAN framework. Many time series models and ▸Markov processes are also included in the LAN family.

If the limiting Fisher information matrix $F(\theta)$ is non-degenerate random, we obtain a generalization of the LAN family for which the limit distribution of the ML estimator in (3) will be a mixture of normals (and hence non-normal). If $\Lambda_n$ satisfies (1) and (2) with $F(\theta)$ random, the density $p_n(x(n), \theta)$ belongs to a local asymptotic mixed normal (LAMN) family. See Basawa and Scott () for a discussion of the LAMN family and related asymptotic inference questions for this family.

For the LAMN family, one can replace the norming $\sqrt{n}$ by a random norming $J_n^{1/2}(\theta)$ to get the limiting normal distribution, viz.,

$$J_n^{1/2}(\theta)\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, I), \qquad (4)$$

where $I$ is the identity matrix.

Two examples belonging to the LAMN family are given below.

Example 1: Variance mixture of normals

Suppose, conditionally on $V = v$, $(x_1, x_2, \ldots, x_n)$ are i.i.d. $N(\theta, v^{-1})$ random variables, and $V$ is an exponential random variable with mean 1. The marginal joint density of $x(n)$ is then given by

$$p_n(x(n), \theta) \propto \left[\,1 + \tfrac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right]^{-\left(\frac{n}{2}+1\right)}.$$

It can be verified that $F(\theta)$ is an exponential random variable with mean 1. The ML estimator is $\hat{\theta}_n = \bar{x}$, and $\sqrt{n}\,(\bar{x} - \theta) \xrightarrow{d} t(2)$, a $t$-distribution with 2 degrees of freedom. It is interesting to note that the variance of the limit distribution of $\bar{x}$ is $\infty$.
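For completeness, the marginal density follows by integrating out $V$; a short standard derivation (my addition, not part of the original entry):

```latex
% Integrate out V ~ Exp(1): p_n(x(n), theta) = \int_0^\infty p(x(n)|v) e^{-v} dv
\begin{align*}
p_n(x(n),\theta)
  &= \int_0^{\infty} \Big[\prod_{i=1}^{n} \big(\tfrac{v}{2\pi}\big)^{1/2}
     e^{-\frac{v}{2}(x_i-\theta)^2}\Big]\, e^{-v}\, dv \\
  &\propto \int_0^{\infty} v^{n/2}
     \exp\Big[-v\Big(1+\tfrac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2\Big)\Big]\, dv
   = \Gamma\big(\tfrac{n}{2}+1\big)
     \Big[1+\tfrac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2\Big]^{-(\frac{n}{2}+1)},
\end{align*}
```

using $\int_0^\infty v^{n/2} e^{-cv}\,dv = \Gamma\!\left(\frac{n}{2}+1\right)c^{-(\frac{n}{2}+1)}$.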

Example 2: Autoregressive process

Consider a first-order autoregressive process $\{x_t\}$ defined by $x_t = \theta x_{t-1} + e_t$, $t = 1, 2, \ldots$, with $x_0 = 0$, where the $\{e_t\}$ are assumed to be i.i.d. $N(0, 1)$ random variables. We then have

$$p_n(x(n), \theta) \propto \exp\left[-\tfrac{1}{2}\sum_{t=1}^{n}(x_t - \theta x_{t-1})^2\right].$$

For the stationary case $|\theta| < 1$, this model belongs to the LAN family. However, for $|\theta| > 1$, the model belongs to the LAMN family. For any $\theta$, the ML estimator is

$$\hat{\theta}_n = \left(\sum_{t=1}^{n} x_t x_{t-1}\right)\left(\sum_{t=1}^{n} x_{t-1}^2\right)^{-1}.$$

One can verify that $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, 1 - \theta^2)$ for $|\theta| < 1$, and $(\theta^2 - 1)^{-1}\theta^n(\hat{\theta}_n - \theta) \xrightarrow{d}$ Cauchy for $|\theta| > 1$.
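The contrast between the two regimes is easy to see numerically. The sketch below is my own illustration (assuming NumPy; sample size, seed, and parameter values are arbitrary choices): for $|\theta| < 1$ the studentized estimator behaves normally, while for $|\theta| > 1$ the renormalized error has Cauchy-like heavy tails.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n):
    """Simulate x_t = theta*x_{t-1} + e_t (x_0 = 0, e_t ~ N(0,1)) and
    return the ML (least-squares) estimator of theta."""
    e = rng.standard_normal(n)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + e[t - 1]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

n, reps = 200, 2000

# LAN regime |theta| < 1: sqrt(n)(theta_hat - theta) ~ N(0, 1 - theta^2)
theta = 0.5
z = np.array([np.sqrt(n) * (ar1_mle(theta, n) - theta) for _ in range(reps)])
print(z.std(), np.sqrt(1 - theta**2))       # both approximately 0.87

# LAMN regime |theta| > 1: (theta^2 - 1)^{-1} theta^n (theta_hat - theta)
# is approximately standard Cauchy (heavy tails, no finite variance)
theta = 1.05
c = np.array([theta**n * (ar1_mle(theta, n) - theta) / (theta**2 - 1)
              for _ in range(reps)])
print(np.median(np.abs(c)))                 # ~1, the median of |Cauchy|
```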

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals. He has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
▸Asymptotic Normality
▸Optimal Statistical Inference in Financial Engineering
▸Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ () Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL () Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus-Liebig-University Giessen, Giessen, Germany

A random variable $X$ with realization $x$ belongs to the location-scale family when its cumulative distribution function is a function only of $(x - a)/b$:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x - a}{b}\right), \qquad a \in \mathbb{R},\ b > 0,$$

where $F(\cdot)$ is a distribution having no other parameters. Different $F(\cdot)$'s correspond to different members of the family. $(a, b)$ is called the location–scale parameter, $a$ being the location parameter and $b$ being the scale parameter. For fixed $b = 1$ we have a subfamily which is a location family with parameter $a$, and for fixed $a = 0$ we have a scale family with parameter $b$. The variable

$$Y = \frac{X - a}{b}$$

is called the reduced or standardized variable; it has $a = 0$ and $b = 1$. If the distribution of $X$ is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{dF_X(x \mid a, b)}{dx},$$

then $(a, b)$ is a location–scale parameter for the distribution of $X$ if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\, f\!\left(\frac{x - a}{b}\right)$$

for some density $f(\cdot)$, called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When $Y$ has mean $\mu_Y$ and standard deviation $\sigma_Y$, then the mean of $X$ is $E(X) = a + b\,\mu_Y$, and the standard deviation of $X$ is $\sqrt{\mathrm{Var}(X)} = b\,\sigma_Y$.

The location parameter $a$, $a \in \mathbb{R}$, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of $a$ causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, or mode, or it is an upper or lower threshold parameter. The scale parameter $b$, $b > 0$, is responsible for the dispersion or variation of the variate $X$. Increasing (decreasing) $b$ results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; $b$ may be the standard deviation, the full or half length of the support, or the length of a central $(1 - \alpha)$–interval.

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distributions of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density $f(y) = [\exp(y) - 1]\exp[1 + y - \exp(y)]$, $y \ge 0$
• V-shaped distribution

For each of the above-mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called the plotting position; for its choice see Barnett (), Blom (), Kimball (). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for $a$ and $b$ as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of $a$ and $b$ will be the parameters of this line.

The latter approach takes the order statistics $X_{1:n} \le X_{2:n} \le \cdots \le X_{n:n}$ as regressand and the means of the reduced order statistics, $\alpha_{i:n} = E(Y_{i:n})$, as regressor, which under these circumstances acts as plotting position. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where $\varepsilon_i$ is a random variable expressing the difference between $X_{i:n}$ and its mean $E(X_{i:n}) = a + b\,\alpha_{i:n}$. As the order statistics $X_{i:n}$ — and, as a consequence, the disturbance terms $\varepsilon_i$ — are neither homoscedastic nor uncorrelated, we have to use, according to Lloyd (), the generalized least-squares method to find best linear unbiased estimators of $a$ and $b$. Introducing the following vectors and matrices,

$$\mathbf{x} = \begin{pmatrix} X_{1:n} \\ X_{2:n} \\ \vdots \\ X_{n:n} \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad \boldsymbol{\alpha} = \begin{pmatrix} \alpha_{1:n} \\ \alpha_{2:n} \\ \vdots \\ \alpha_{n:n} \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \boldsymbol{\theta} = \begin{pmatrix} a \\ b \end{pmatrix}, \quad \mathbf{A} = (\mathbf{1}\ \ \boldsymbol{\alpha}),$$

the regression model now reads

$$\mathbf{x} = \mathbf{A}\,\boldsymbol{\theta} + \boldsymbol{\varepsilon},$$

with variance–covariance matrix

$$\mathrm{Var}(\mathbf{x}) = b^2\,\mathbf{B}.$$

The GLS estimator of $\boldsymbol{\theta}$ is

$$\hat{\boldsymbol{\theta}} = (\mathbf{A}'\,\Omega\,\mathbf{A})^{-1}\mathbf{A}'\,\Omega\,\mathbf{x}, \qquad \Omega = \mathbf{B}^{-1},$$

and its variance–covariance matrix reads

$$\mathrm{Var}(\hat{\boldsymbol{\theta}}) = b^2\,(\mathbf{A}'\,\Omega\,\mathbf{A})^{-1}.$$

The vector $\boldsymbol{\alpha}$ and the matrix $\mathbf{B}$ are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining $E(Y_{i:n})$ and $E(Y_{i:n} Y_{j:n})$. For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne (). Maximum likelihood estimation for location–scale distributions is treated by Mi ().
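To make the procedure concrete, here is a small numerical sketch (my own illustration, not from the entry; it assumes NumPy and uses the exponential distribution, for which the required moments have well-known closed forms: $E(Y_{i:n}) = \sum_{k=n-i+1}^{n} 1/k$ and $\mathrm{Cov}(Y_{i:n}, Y_{j:n}) = \sum_{k=n-i+1}^{n} 1/k^2$ for $i \le j$):

```python
import numpy as np

def lloyd_gls_exponential(x_sorted):
    """Lloyd's GLS (BLUE) estimator of (a, b) for a two-parameter
    exponential sample, using the closed-form moments of standard
    exponential order statistics."""
    n = len(x_sorted)
    k = np.arange(1, n + 1)
    alpha = np.cumsum(1.0 / k[::-1])          # alpha_i = E(Y_{i:n})
    v = np.cumsum(1.0 / k[::-1] ** 2)         # v_i = Var(Y_{i:n})
    B = np.minimum.outer(v, v)                # Cov(Y_i, Y_j) = v_{min(i,j)}
    A = np.column_stack([np.ones(n), alpha])
    Omega = np.linalg.inv(B)                  # Omega = B^{-1}
    return np.linalg.solve(A.T @ Omega @ A, A.T @ Omega @ x_sorted)

rng = np.random.default_rng(1)
a, b = 2.0, 3.0
x = np.sort(a + b * rng.standard_exponential(500))
print(lloyd_gls_exponential(x))               # approximately (2.0, 3.0)
```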

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. He worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, joined Volkswagen AG to do operations research, and was then appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
▸Statistical Distributions: An Overview

References and Further Reading
Barnett V () Probability plotting methods and order statistics. Appl Stat
Barnett V () Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G () Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF () On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH () Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J () MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H () Location-scale distributions – linear estimation and probability plotting. httpgebuni-giessengebvolltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (). In the univariate case this provides a family of distributions on $(0, 1)$ that is distinct from the ▸beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models; see Agresti ), and different formulations may be appropriate in particular applications (Aitchison ). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg ), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison ).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1, \ldots, n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution with $\mathrm{logit}(P_i) = \log(P_i/(1 - P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1, \ldots, x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \mathrm{Binomial}(m_i, P_i),$$
$$\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z,$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if $\sigma^2$ is small, then, as derived in Williams (),

$$E[R_i] = m_i \pi_i \quad \text{and} \quad \mathrm{Var}(R_i) = m_i \pi_i(1 - \pi_i)\left[1 + \sigma^2 (m_i - 1)\pi_i(1 - \pi_i)\right],$$

where $\mathrm{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect $Z$ allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
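Since the logit-normal density has no closed-form moments, numerical integration is the standard route. The sketch below (my own illustration, assuming NumPy/SciPy; the parameter values are arbitrary) computes $E[P]$ and $\mathrm{Var}(P)$ by Gauss–Hermite quadrature and checks Williams's small-$\sigma^2$ variance approximation:

```python
import numpy as np
from scipy.special import expit   # expit(x) = 1/(1 + exp(-x))

def logit_normal_moments(mu, sigma, n_points=80):
    """E[P] and Var(P) for logit(P) ~ N(mu, sigma^2), by Gauss-Hermite
    quadrature against the standard normal density."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    w = weights / np.sqrt(2 * np.pi)    # normalize to N(0,1) weights
    p = expit(mu + sigma * nodes)
    m1 = np.sum(w * p)
    return m1, np.sum(w * p**2) - m1**2

mu, sigma, m = 0.5, 0.3, 10
EP, VP = logit_normal_moments(mu, sigma)

# Exact mixture variance: Var(R) = m*EP*(1-EP) + m*(m-1)*VP
var_exact = m * EP * (1 - EP) + m * (m - 1) * VP

# Williams's approximation, with pi = expit(mu)
pi = expit(mu)
var_williams = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(var_exact, var_williams)          # close for small sigma^2
```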

About the Author
For biography see the entry ▸Logistic Distribution.

Cross References
▸Logistic Distribution
▸Mixed Membership Models
▸Multivariate Normal Distributions
▸Normal Distribution, Univariate

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J () The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J () The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB () Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM () Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR () Generalized linear and mixed models. Wiley, New York
Williams D () Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally, or Gaussian, distributed, that the variance, $\sigma^2$, is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ▸Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\left\{(y_i \theta_i - b(\theta_i))/\alpha(\phi) + c(y_i; \phi)\right\}, \qquad (2)$$

with $\theta$ representing the link function, $\alpha(\phi)$ the scale parameter, $b(\theta)$ the cumulant, and $c(y; \phi)$ the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to $\theta$, $b'(\theta)$; the variance is given as the second derivative, $b''(\theta)$. Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\left\{y_i \ln(\pi_i/(1 - \pi_i)) + \ln(1 - \pi_i)\right\}. \qquad (3)$$

The link function is therefore $\ln(\pi/(1 - \pi))$, and the cumulant $-\ln(1 - \pi)$, or $\ln(1/(1 - \pi))$. For the Bernoulli, $\pi$ is defined as the probability of success. The first derivative of the cumulant is $\pi$, the second derivative $\pi(1 - \pi)$. These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate $\pi$, for example, rather than $y$. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, $\mu$, is typically substituted for $\pi$ when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\left\{ y_i \ln(\mu_i/(1 - \mu_i)) + \ln(1 - \mu_i)\right\}, \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\left\{ y_i \ln(\mu_i) + (1 - y_i)\ln(1 - \mu_i)\right\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^{n}\left\{ y_i \ln(y_i/\mu_i) + (1 - y_i)\ln((1 - y_i)/(1 - \mu_i))\right\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of $\mu$ for each observation in the model is calculated on the basis of the linear predictor, $x'\beta$. For the normal model, the predicted fit, $\hat{y}$, is identical to $x'\beta$, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, $\ln(\mu_i/(1 - \mu_i))$. We have, therefore,

$$x_i'\beta = \ln(\mu_i/(1 - \mu_i)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (7)$$

The value of $\mu_i$ for each observation in the logistic model is calculated as

$$\mu_i = 1/\left(1 + \exp(-x_i'\beta)\right) = \exp(x_i'\beta)/\left(1 + \exp(x_i'\beta)\right). \qquad (8)$$

The functions to the right of $\mu$ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, $\mu$ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of $x'\beta$ rather than $\mu$. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to $\beta$, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to $\beta$ produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{n}(y_i - \mu_i)\,x_i, \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{n} x_i x_i'\,\mu_i(1 - \mu_i). \qquad (10)$$
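Equations (9) and (10) are all that is needed to fit the model by Newton–Raphson. The following is a minimal sketch (my own illustration, assuming NumPy; the synthetic data and coefficient values are made up), iterating $\beta \leftarrow \beta - H^{-1}g$ until the score is negligible:

```python
import numpy as np

def logistic_newton_raphson(X, y, tol=1e-10, max_iter=25):
    """Fit binary logistic regression using the gradient (9) and
    Hessian (10); X should contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))          # Eq. (8)
        grad = X.T @ (y - mu)                         # Eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X  # Eq. (10)
        beta -= np.linalg.solve(hess, grad)           # Newton step
        if np.max(np.abs(grad)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))       # standard errors
    return beta, se

# Synthetic data (made up): true beta = (0.5, 1.2)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.standard_normal(500)])
y = rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.2]))))
beta, se = logistic_newton_raphson(X, y.astype(float))
print(beta, np.exp(beta[1]))   # estimates, and the odds ratio for x1
```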

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of $\mu$, $\ln(\mu/(1 - \mu))$, where the odds are understood as the success of $\mu$ over its failure, $1 - \mu$. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Age (child vs adult)

Survived    child    adults    Total
no
yes
Total

The odds of survival for adult passengers is the number of adults surviving divided by the number not surviving, and the odds of survival for children is defined likewise. The ratio of the odds of survival for adults to the odds of survival for children is the value referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

survived    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
age

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

▸ The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

▸ The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating $\mu$.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes ($y$) for a specific covariate pattern, and the denominator ($m$), the number of observations having the specific covariate pattern. The response $y/m$ is binomially distributed as

$$f(y_i; \pi_i, m_i) = \binom{m_i}{y_i}\,\pi_i^{y_i}(1 - \pi_i)^{m_i - y_i}, \qquad (11)$$

with a corresponding log-likelihood function expressed as

$$L(\mu_i; y_i, m_i) = \sum_{i=1}^{n}\left\{ y_i \ln(\mu_i/(1 - \mu_i)) + m_i \ln(1 - \mu_i) + \ln\binom{m_i}{y_i}\right\}. \qquad (12)$$

Taking derivatives of the cumulant, $-m_i \ln(1 - \mu_i)$, as we did for the binary response model, produces a mean of $\mu_i = m_i \pi_i$ and variance $\mu_i(1 - \mu_i/m_i)$.

Consider the data below:

y    cases    x1    x2    x3

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having a particular pattern of predictor values $x_1$, $x_2$, $x_3$; of those three cases, one has a value of $y$ equal to 1, and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

y    Odds ratio    OIM std. err.    z    P>|z|    [95% conf. interval]
x1
x2
x3
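The equivalence discussed in the next paragraph — that grouped (binomial) and individual (Bernoulli) formats give identical estimates — can be verified directly. A minimal sketch (my own illustration, assuming NumPy; the y, cases, and x values are invented, echoing only the described first row with one success among three cases):

```python
import numpy as np

# Grouped data: (successes y, trials m, covariates); values invented
y = np.array([1.0, 3.0, 2.0])
m = np.array([3, 4, 5])
X = np.column_stack([np.ones(3), [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

def fit_binomial_logit(X, y, m, iters=30):
    """Newton-Raphson for the binomial-logistic model:
    gradient X'(y - m*mu), Hessian -X' diag(m*mu*(1-mu)) X."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = m * mu * (1.0 - mu)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - m * mu))
    return beta

# Expanding to one Bernoulli row per observation gives the same fit
Xe = np.repeat(X, m, axis=0)
ye = np.concatenate([[1.0] * int(s) + [0.0] * int(t - s)
                     for s, t in zip(y, m)])
print(fit_binomial_logit(X, y, m))
print(fit_binomial_logit(Xe, ye, np.ones(len(ye))))  # identical estimates
```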

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, following the same logic as described. Modeling it would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, thousands of observations being grouped into a far smaller number of covariate-pattern rows as described above. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Rather than comparing observed with expected probabilities across fixed levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see ▸Akaike's Information Criterion and ▸Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that both AIC and BIC are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

mended for logistic models e references below pro-vide extensive discussion of these methods together with

L Logistic Distribution

appropriate caveats However it appears well establishedthat m-asymptotic residual analyses is most appropri-ate for logistic models having no continuous predictorsm-asymptotics is based on grouping observations withthe same covariate pattern in a similar manner to thegrouped or binomial logistic regression discussed earliere Hilbe () and Hosmer and Lemeshow () ref-erences below provide guidance on how best to constructand interpret this type of residualLogistic models have been expanded to include cat-

egorical responses eg proportional odds models andmultinomial logistic regression ey have also beenenhanced to include the modeling of panel and corre-lated data eg generalized estimating equations xed andrandom eects and mixed eects logistic modelsFinally exact logistic regression models have recently

been developed to allow the modeling of perfectly pre-dicted data as well as small and unbalanced datasetsIn these cases logistic models which are estimated usingGLM or full maximum likelihood will not converge Exactmodels employ entirely dierent methods of estimationbased on large numbers of permutations

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
▸Case-Control Studies
▸Categorical Data Analysis
▸Generalized Linear Models
▸Multivariate Data Analysis: An Overview
▸Probit Analysis
▸Recursive Partitioning
▸Regression Models with Symmetrical Errors
▸Robust Regression Estimation in Generalized Linear Models
▸Statistics: An Overview
▸Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D () Modeling binary regression, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ () Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM () Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S () Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG () Logistic regression: a self-teaching guide. Springer, New York
Long JS () Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J () Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. ().

The probability density function is

$$f(x) = \frac{1}{\tau}\,\frac{\exp\{-(x - \mu)/\tau\}}{\left[1 + \exp\{-(x - \mu)/\tau\}\right]^2}, \qquad -\infty < x < \infty, \qquad (1)$$

and the cumulative distribution function is

$$F(x) = \left[1 + \exp\{-(x - \mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty.$$

The distribution is symmetric about the mean $\mu$ and has variance $\tau^2\pi^2/3$, so that when comparing the standard logistic distribution ($\mu = 0$, $\tau = 1$) with the standard normal distribution, $N(0, 1)$, it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

$$S(x) = \left[1 + \exp\{(x - \mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty,$$

$$h(x) = \frac{1}{\tau}\left[1 + \exp\{-(x - \mu)/\tau\}\right]^{-1}.$$

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for $\mu$, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. ().

One obvious extension for modelling failure times $T$ is to assume a logistic model for $\log T$, giving a log-logistic model for $T$, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

$$h(t) = \frac{(\alpha/\theta)\,(t/\theta)^{\alpha - 1}}{1 + (t/\theta)^{\alpha}}, \qquad t, \theta, \alpha > 0,$$

where $\theta = e^{\mu}$ and $\alpha = 1/\tau$. For $\alpha \le 1$ the hazard is monotone decreasing, and for $\alpha > 1$ it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution ($\mu = 0$, $\tau = 1$), the probability density and the cumulative distribution functions are related through the very simple identity

$$f(x) = F(x)\left[1 - F(x)\right],$$

which in turn, by elementary calculus, implies that

$$\mathrm{logit}(F(x)) = \log_e\left[\frac{F(x)}{1 - F(x)}\right] = x, \qquad (2)$$

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set $X = \log_e[U/(1 - U)]$, where $U \sim U(0, 1)$; for the general logistic distribution in (1) we take $\tau X + \mu$.
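This identity makes simulation a one-liner; a small check (my own sketch, assuming NumPy; the parameter values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, tau = 2.0, 1.5

# X = mu + tau * log(U/(1-U)) has the general logistic distribution (1)
u = rng.random(100_000)
x = mu + tau * np.log(u / (1 - u))

print(x.mean())                        # ~ mu = 2.0
print(x.var(), tau**2 * np.pi**2 / 3)  # both ~ 7.40
```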

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ▸probit analysis, and the same methods mapped across to the logit analysis; see Finney () for a historical treatment of this area. The probability of response, $P(d)$, at a particular dose level $d$ is modelled by a linear logit model,

$$\mathrm{logit}(P(d)) = \log_e\left[\frac{P(d)}{1 - P(d)}\right] = \beta_0 + \beta_1 d,$$

which, by the identity (2), implies a logistic tolerance distribution with parameters $\mu = -\beta_0/\beta_1$ and $\tau = 1/|\beta_1|$. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ▸generalized linear models. A comprehensive treatment of ▸logistic regression, including models and applications, is given in Agresti () and Hilbe ().

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
▸Asymptotic Relative Efficiency in Estimation
▸Asymptotic Relative Efficiency in Testing
▸Bivariate Distributions
▸Location-Scale Distributions
▸Logistic Regression
▸Multivariate Statistical Distributions
▸Statistical Distributions: An Overview

References and Further Reading
Agresti A () Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R () Statistical modelling in R. Oxford University Press, Oxford
Berkson J () Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ () Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM () Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N () Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM () Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz () developed it in order to analyze the distribution of income and wealth within populations, and described it in the following way:

▸ Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve $L(p)$ is defined as a function of the proportion $p$ of the population. $L(p)$ is a curve starting from the origin and ending at the point $(1, 1)$, with the following additional properties: (I) $L(p)$ is monotone increasing; (II) $L(p) \le p$; (III) $L(p)$ is convex; (IV) $L(0) = 0$ and $L(1) = 1$. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

▸ A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve $L(p)$ is a common curve for a whole class of distributions $F(\theta x)$, where $\theta$ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from $(0, 0)$ to $(1, 1)$. Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves $L_X(p)$ and $L_Y(p)$. If $L_X(p) \ge L_Y(p)$ for all $p$, then the distribution corresponding to $L_X(p)$ has lower inequality than the distribution corresponding to $L_Y(p)$, and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, $G$, introduced by Gini ().

Lorenz Curve. Fig. 1  Lorenz curves with Lorenz ordering, that is, $L_X(p) \ge L_Y(p)$

Lorenz Curve. Fig. 2  Two intersecting Lorenz curves; using the Gini index, $L_1(p)$ has greater inequality than $L_2(p)$

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality $0 \le G \le 1$. The higher the $G$ value, the greater the inequality in the income distribution.
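Empirically, the Lorenz curve and the Gini index are easily computed from a sample of incomes. A small sketch (my own illustration, assuming NumPy; the Pareto sample is an arbitrary choice, with known Gini index $1/(2\alpha - 1)$ for shape $\alpha$):

```python
import numpy as np

def lorenz_gini(income):
    """Empirical Lorenz curve points (p, L(p)) and Gini index for a
    sample of non-negative incomes."""
    x = np.sort(np.asarray(income))
    p = np.arange(1, x.size + 1) / x.size     # population shares
    L = np.cumsum(x) / x.sum()                # cumulative income shares
    # Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule
    pp = np.concatenate([[0.0], p])
    LL = np.concatenate([[0.0], L])
    gini = 1.0 - np.sum((LL[1:] + LL[:-1]) * np.diff(pp))
    return p, L, gini

rng = np.random.default_rng(4)
income = rng.pareto(3.0, 100_000) + 1.0       # Pareto, shape alpha = 3
print(lorenz_gini(income)[2])                 # ~ 1/(2*3 - 1) = 0.2
```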

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman ; Jakobsson ; Kakwani ):

Theorem. Let $u(x)$ be a continuous monotone increasing function, and assume that $\mu_Y = E(u(X))$ exists. Then the Lorenz curve $L_Y(p)$ for $Y = u(X)$ exists, and:

(I) $L_Y(p) \ge L_X(p)$ if $u(x)/x$ is monotone decreasing;
(II) $L_Y(p) = L_X(p)$ if $u(x)/x$ is constant;
(III) $L_Y(p) \le L_X(p)$ if $u(x)/x$ is monotone increasing.

For progressive taxation rules, $u(x)/x$ measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen () gave an alternative condition for the Lorenz dominance, which is that $u(x)/x$ crosses the level $\mu_Y/\mu_X$ once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio $u(x)/x$ is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
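The theorem can be checked numerically with the lorenz_gini helper and income sample from the previous sketch (again my own illustration): a concave rule such as $u(x) = x^{0.8}$ has $u(x)/x$ decreasing and lowers the Gini index (case I), while a flat tax $u(x) = 0.7x$ leaves it unchanged (case II).

```python
# Reuses lorenz_gini and income from the previous sketch
progressive = income ** 0.8      # u(x)/x = x^{-0.2}, monotone decreasing
flat = 0.7 * income              # u(x)/x constant

print(lorenz_gini(income)[2])       # baseline Gini, ~0.20
print(lorenz_gini(progressive)[2])  # smaller: Lorenz dominance, case (I)
print(lorenz_gini(flat)[2])         # unchanged: case (II)
```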

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has been a scientist at Folkhälsan Institute of Genetics. He is member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▸Econometrics
▸Economic Statistics
▸Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▸Decision Theory: An Introduction and ▸Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss which is suffered by a decision unequal to the truth; the larger the value, the larger the loss which is suffered.

To describe this more exactly, let $\Theta$ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values $\theta \in \Theta$ is the true value. Each $d \in \Theta$ is a possible decision. The decision $d$ is chosen according to a rule, more exactly, according to a function with values in $\Theta$ defined on the set of all possible data. Since the true value $\theta$ is unknown, the loss function $L$ has to be defined on $\Theta \times \Theta$, i.e.,

$$L: \Theta \times \Theta \rightarrow [0, \infty).$$

The first variable describes the true value, say, and the second one the decision. Thus $L(\theta, a)$ is the loss which is suffered if $\theta$ is the true value and $a$ is the decision. Therefore, each (up to technical conditions) function $L: \Theta \times \Theta \rightarrow [0, \infty)$ with the property

$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then $\Theta$ is divided into two disjoint subsets $\Theta_0$ and $\Theta_1$, describing the null hypothesis and the alternative: $\Theta = \Theta_0 + \Theta_1$. Then the usual loss function is given by

let us consider a test problem en Θ is divided in twodisjoint subsets Θ and Θ describing the null hypothesisand the alternative set Θ = Θ + Θen the usual lossfunction is given by

L(θ ϑ) =⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

if θ ϑ isin Θ or θ ϑ isin Θ

if θ isin Θ ϑ isin Θ or θ isin Θ ϑ isin Θ

For point estimation problems we assume that Θ is anormed linear space and let ∣ sdot ∣ be its norm Such a spaceis typical for estimating a location parameteren the lossL(θ ϑ) = ∣θ minus ϑ∣ θ ϑ isin Θ can be used Next let us con-sider the specic case Θ=Ren L(θ ϑ) = ℓ(θ minus ϑ) is atypical form for loss functions where ℓ IR rarr [infin) isnonincreasing on (minusinfin ] and nondecreasing on [infin)with ℓ() = ℓ is also called loss function An importantclass of such functions is given by choosing ℓ(t) = ∣t∣pwhere p gt is a xed constantere are two prominentcases for p = we get the classical square loss and for p = the robust L-loss Another class of robust losses are thefamous Huber losses

ℓ(t) = t if ∣t∣ le γ and ℓ(t) = γ∣t∣ minus γ if ∣t∣ gt γ

where $\gamma > 0$ is a fixed constant. Up to now we have shown symmetrical losses, i.e., $L(\theta, \vartheta) = L(\vartheta, \theta)$. There are many problems in which underestimating the true value $\theta$ has to be judged differently than overestimating it. For such problems Varian () introduced LinEx losses,

$$\ell(t) = b\left(\exp(at) - at - 1\right),$$

where $a, b > 0$ can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, consider the estimation of a scale parameter, and let $\Theta = (0, \infty)$. Then it is usual to consider losses of the form $L(\theta, \vartheta) = \ell(\vartheta/\theta)$, where $\ell$ must be chosen suitably. It is, however, more convenient to write $\ell(\ln\vartheta - \ln\theta)$; then $\ell$ can be chosen as above.
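These losses are simple to implement and compare numerically. The sketch below (my own illustration, assuming NumPy; the $\gamma$, $a$, $b$ values and the grid are arbitrary choices) evaluates the square, $L_1$, Huber, and LinEx losses, making the asymmetry of the LinEx loss visible:

```python
import numpy as np

def square(t):
    return t ** 2

def l1(t):
    return np.abs(t)

def huber(t, gamma=1.345):
    """Quadratic near zero, linear in the tails (robust)."""
    return np.where(np.abs(t) <= gamma, t ** 2,
                    2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):
    """LinEx: exponential on one side of zero, roughly linear on the other."""
    return b * (np.exp(a * t) - a * t - 1)

t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for loss in (square, l1, huber, linex):
    print(loss.__name__, np.round(loss(t), 3))
# The symmetric losses treat -2 and 2 alike; linex does not:
# linex(-2) ~ 1.135 but linex(2) ~ 4.389
```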

In theoretical works, the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ▸Rao–Blackwell Theorem). If the space $\Theta$ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class $F$ of real-valued functions. Given $n$ observations (data) $y_1, \ldots, y_n$, observed at design points $t_1, \ldots, t_n$ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation," i.e., the function $f \in F$ that minimizes $\sum_{i=1}^{n} \ell\big(r_i^f\big)$, where $r_i^f = y_i - f(t_i)$ is the residual at the $i$th design point. Here $\ell$ is also called a loss function and can be chosen as described above. For instance, least squares estimation is obtained if $\ell(t) = t^2$.

Cross References
▸Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▸Bayesian Statistics
▸Decision Theory: An Introduction
▸Decision Theory: An Overview
▸Entropy and Cross Entropy as Diversity and Distance Measures
▸Sequential Sampling
▸Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam

                                                        • Local Asymptotic Mixed Normal Family
                                                          • About the Author
                                                          • Cross References
                                                          • References and Further Reading
                                                            • Location-Scale Distributions
                                                              • About the Author
                                                              • Cross References
                                                              • References and Further Reading
                                                                • Logistic Normal Distribution
                                                                  • About the Author
                                                                  • Cross References
                                                                  • References and Further Reading
                                                                    • Logistic Regression
                                                                      • About the Author
                                                                      • Cross References
                                                                      • References and Further Reading
                                                                        • Logistic Distribution
                                                                          • About the Author
                                                                          • Cross References
                                                                          • References and Further Reading
                                                                            • Lorenz Curve
                                                                              • Definition and Properties
                                                                              • Income Redistributions
                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 28: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

L Location-Scale Distributions

We then have

$$p_n(x^{(n)}; \theta) \propto \exp\left[-\frac{1}{2}\sum_{t=1}^{n}(x_t - \theta x_{t-1})^2\right].$$

For the stationary case $|\theta| < 1$ this model belongs to the LAN family. However, for $|\theta| > 1$ the model belongs to the LAMN family. For any $\theta$, the ML estimator is

$$\hat{\theta}_n = \left(\sum_{i=1}^{n} x_i x_{i-1}\right)\left(\sum_{i=1}^{n} x_{i-1}^2\right)^{-1}.$$

One can verify that $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0,\, 1-\theta^2)$ for $|\theta| < 1$, and $(\theta^2 - 1)^{-1}\theta^n(\hat{\theta}_n - \theta) \xrightarrow{d} \text{Cauchy}$ for $|\theta| > 1$.
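For a quick numerical check of the stationary case, one can simulate the AR(1) model and compare the Monte Carlo spread of the ML estimator with the limiting variance 1 − θ²; the following Python sketch (parameter values and names are illustrative) does this with standard normal innovations.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_mle(theta, n, rng):
    """Simulate x_t = theta*x_{t-1} + e_t with standard normal errors
    and return the ML (conditional least-squares) estimator of theta."""
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = theta * x[t - 1] + rng.standard_normal()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

theta, n, reps = 0.6, 500, 400
est = np.array([ar1_mle(theta, n, rng) for _ in range(reps)])
# In the stationary case sqrt(n)*(theta_hat - theta) is approximately
# N(0, 1 - theta^2); compare the Monte Carlo variance with the theory.
print(np.var(np.sqrt(n) * (est - theta)), 1 - theta**2)
```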

About the Author
Dr. Ishwar Basawa is a Professor of Statistics at the University of Georgia, USA. He has served as interim head of the department, as Executive Editor of the Journal of Statistical Planning and Inference, on the editorial board of Communications in Statistics, and currently on that of the online Journal of Probability and Statistics. Professor Basawa is a Fellow of the Institute of Mathematical Statistics, and he was an Elected member of the International Statistical Institute. He has co-authored two books and co-edited eight Proceedings/Monographs/Special Issues of journals, and has authored numerous publications. His areas of research include inference for stochastic processes, time series, and asymptotic statistics.

Cross References
Asymptotic Normality
Optimal Statistical Inference in Financial Engineering
Sampling Problems for Stochastic Processes

References and Further Reading
Basawa IV, Scott DJ (1983) Asymptotic optimal inference for non-ergodic models. Springer, New York
LeCam L, Yang GL ( ) Asymptotics in statistics. Springer, New York

Location-Scale Distributions

Horst Rinne
Professor Emeritus for Statistics and Econometrics
Justus–Liebig–University Giessen, Giessen, Germany

A random variable X with realization x belongs to the location-scale family when its cumulative distribution is a function only of (x − a)/b:

$$F_X(x \mid a, b) = \Pr(X \le x \mid a, b) = F\!\left(\frac{x-a}{b}\right), \quad a \in \mathbb{R},\; b > 0,$$

where F(⋅) is a distribution having no other parameters. Different F(⋅)'s correspond to different members of the family. (a, b) is called the location–scale parameter, a being the location parameter and b being the scale parameter. For fixed b = 1 we have a subfamily which is a location family with parameter a, and for fixed a = 0 we have a scale family with parameter b. The variable

$$Y = \frac{X-a}{b}$$

is called the reduced or standardized variable. It has a = 0 and b = 1. If the distribution of X is absolutely continuous with density function

$$f_X(x \mid a, b) = \frac{\mathrm{d}F_X(x \mid a, b)}{\mathrm{d}x},$$

then (a, b) is a location–scale parameter for the distribution of X if (and only if)

$$f_X(x \mid a, b) = \frac{1}{b}\,f\!\left(\frac{x-a}{b}\right)$$

for some density f(⋅), called the reduced density. All distributions in a given family have the same shape, i.e., the same skewness and the same kurtosis. When Y has mean μ_Y and standard deviation σ_Y, then the mean of X is E(X) = a + b μ_Y, and the standard deviation of X is

$$\sqrt{\operatorname{Var}(X)} = b\,\sigma_Y.$$

The location parameter a, a ∈ ℝ, is responsible for the distribution's position on the abscissa. An enlargement (reduction) of a causes a movement of the distribution to the right (left). The location parameter is either a measure of central tendency, e.g., the mean, median, and mode, or it is an upper or lower threshold parameter. The scale parameter b, b > 0, is responsible for the dispersion or variation of the variate X. Increasing (decreasing) b results in an enlargement (reduction) of the spread and a corresponding reduction (enlargement) of the density; b may be the standard deviation, the full or half length of the support, or the length of a central (1 − α)-interval.
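These defining relations are easy to verify numerically. The following Python sketch (using the reduced logistic density as an arbitrary choice of f(⋅)) implements f_X(x | a, b) = (1/b) f((x − a)/b) and checks E(X) = a + b μ_Y by quadrature.

```python
import numpy as np

def reduced_logistic_pdf(y):
    """Reduced (a = 0, b = 1) logistic density f(y)."""
    e = np.exp(-y)
    return e / (1.0 + e) ** 2

def location_scale_pdf(x, a, b, f=reduced_logistic_pdf):
    """Density of the family member with location a and scale b:
    f_X(x | a, b) = (1/b) * f((x - a) / b)."""
    return f((x - a) / b) / b

# E(X) = a + b*mu_Y: check by numerical integration on a wide grid.
a, b = 2.0, 3.0
x = np.linspace(-60, 60, 20001)
pdf = location_scale_pdf(x, a, b)
print(np.trapz(x * pdf, x))  # approximately a + b*0 = 2.0, since mu_Y = 0
```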

The location-scale family has a great number of members:

• Arc-sine distribution
• Special cases of the beta distribution, like the rectangular, the asymmetric triangular, the U-shaped, or the power-function distributions
• Cauchy and half-Cauchy distributions
• Special cases of the χ-distribution, like the half-normal, the Rayleigh, and the Maxwell–Boltzmann distributions
• Ordinary and raised cosine distributions
• Exponential and reflected exponential distributions
• Extreme value distributions of the maximum and the minimum, each of type I
• Hyperbolic secant distribution
• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution, with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate, and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (1975, 1976), Blom (1958), and Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.
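A least-squares probability plot of this kind can be sketched in a few lines; here for the normal case, with Blom's plotting positions used as a common approximation to the means of the reduced order statistics (the choice of positions is exactly the issue discussed in the references above).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(loc=10.0, scale=2.0, size=50))

n = len(sample)
i = np.arange(1, n + 1)
# Blom's plotting positions, a common choice for the normal distribution.
p = (i - 0.375) / (n + 0.25)
alpha = norm.ppf(p)  # approximate means of the reduced order statistics

# Least-squares line: the intercept estimates a, the slope estimates b.
slope, intercept = np.polyfit(alpha, sample, 1)
print(intercept, slope)  # close to 10 and 2 if the model fits
```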

The latter approach takes the order statistics X_{1:n} ≤ X_{2:n} ≤ ⋯ ≤ X_{n:n} as regressand and the means of the reduced order statistics α_{i:n} = E(Y_{i:n}) as regressor, which under these circumstances act as plotting positions. The regression model reads

$$X_{i:n} = a + b\,\alpha_{i:n} + \varepsilon_i,$$

where ε_i is a random variable expressing the difference between X_{i:n} and its mean E(X_{i:n}) = a + b α_{i:n}. As the order statistics X_{i:n} and – as a consequence – the disturbance terms ε_i are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the vectors and matrices

$$\mathbf{x} = (X_{1:n}, \ldots, X_{n:n})', \qquad \mathbf{1} = (1, \ldots, 1)', \qquad \boldsymbol{\alpha} = (\alpha_{1:n}, \ldots, \alpha_{n:n})',$$
$$\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)', \qquad \boldsymbol{\theta} = (a, b)', \qquad \mathbf{A} = (\mathbf{1}, \boldsymbol{\alpha}),$$

the regression model now reads

$$\mathbf{x} = \mathbf{A}\,\boldsymbol{\theta} + \boldsymbol{\varepsilon}$$

with variance–covariance matrix

$$\operatorname{Var}(\mathbf{x}) = b^2\,\mathbf{B},$$

where B is the matrix of covariances of the reduced order statistics. Writing Ω = B^{−1}, the GLS estimator of θ is

$$\hat{\boldsymbol{\theta}} = (\mathbf{A}'\boldsymbol{\Omega}\mathbf{A})^{-1}\mathbf{A}'\boldsymbol{\Omega}\,\mathbf{x},$$

and its variance–covariance matrix reads

$$\operatorname{Var}(\hat{\boldsymbol{\theta}}) = b^2\,(\mathbf{A}'\boldsymbol{\Omega}\mathbf{A})^{-1}.$$

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, like the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Y_{i:n}) and E(Y_{i:n} Y_{j:n}). For more details on linear estimation and probability plotting for location-scale distributions, and for distributions which can be transformed to location–scale type, see Rinne ( ). Maximum likelihood estimation for location–scale distributions is treated by Mi ( ).
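The exponential case is one where α and B are available in closed form (via the Rényi representation of exponential order statistics: for i ≤ j, E(Y_{i:n}) = Σ_{k=n−i+1}^{n} 1/k and Cov(Y_{i:n}, Y_{j:n}) = Σ_{k=n−i+1}^{n} 1/k²), so Lloyd's estimator can be sketched directly. The following Python code is an illustration under that assumption.

```python
import numpy as np

def exp_alpha_B(n):
    """Means and covariances of standard-exponential order statistics."""
    inv = 1.0 / np.arange(1, n + 1)[::-1]   # 1/n, 1/(n-1), ..., 1
    alpha = np.cumsum(inv)                   # E(Y_i:n)
    v = np.cumsum(inv ** 2)                  # Var(Y_i:n)
    B = np.minimum.outer(v, v)               # Cov(i, j) = v[min(i, j)]
    return alpha, B

def lloyd_gls(sample):
    """Lloyd's BLUE of (a, b) for an exponential location-scale sample."""
    x = np.sort(sample)
    n = len(x)
    alpha, B = exp_alpha_B(n)
    A = np.column_stack([np.ones(n), alpha])
    Omega = np.linalg.inv(B)
    return np.linalg.solve(A.T @ Omega @ A, A.T @ Omega @ x)  # (a_hat, b_hat)

rng = np.random.default_rng(2)
data = 5.0 + 2.0 * rng.exponential(size=100)  # true a = 5, b = 2
print(lloyd_gls(data))
```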

About the Author
Dr. Horst Rinne is Professor Emeritus of statistics and econometrics. He was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University, worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover, and then joined Volkswagen AG to do operations research. He was later appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where he was Dean of the faculty of economics and management science for three years. He received further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal, Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, an Elected member of the International Statistical Institute, and Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is the author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of the text The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V (1976) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J ( ) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H ( ) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1986). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses r_i out of m_i trials (i = 1, …, n), the response probabilities P_i are assumed to have a logistic-normal distribution with logit(P_i) = log(P_i/(1 − P_i)) ∼ N(μ_i, σ²), where μ_i is modelled as a linear function of explanatory variables x_1, …, x_p. The resulting model can be summarized as

$$R_i \mid P_i \sim \text{Binomial}(m_i, P_i)$$
$$\operatorname{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z$$
$$Z \sim N(0, 1).$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods using Gaussian quadrature; routines now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (1982),

$$E[R_i] = m_i\pi_i \quad\text{and}\quad \operatorname{Var}(R_i) = m_i\pi_i(1-\pi_i)\left[1 + \sigma^2(m_i - 1)\pi_i(1-\pi_i)\right],$$

where logit(π_i) = η_i. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (m_i = 1) it is not possible to have overdispersion arising in this way.
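A short simulation makes Williams' small-σ approximation concrete; the following sketch (parameter values are illustrative) draws from the logit-normal binomial model and compares the empirical moments with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(3)

# logit(P) = eta + sigma*Z with Z ~ N(0,1), then R | P ~ Binomial(m, P).
eta, sigma, m, reps = 0.5, 0.3, 20, 200_000
z = rng.standard_normal(reps)
p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))
r = rng.binomial(m, p)

pi = 1.0 / (1.0 + np.exp(-eta))
approx_var = m * pi * (1 - pi) * (1 + sigma**2 * (m - 1) * pi * (1 - pi))
print(r.mean(), m * pi)      # empirical mean vs. m*pi
print(r.var(), approx_var)   # empirical variance vs. Williams' approximation
```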

About the Author
For biography, see the entry Logistic Distribution.

Cross References
Logistic Distribution
Mixed Membership Models
Multivariate Normal Distributions
Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i}. \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972, Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed Generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]/\alpha(\phi) + c(y_i, \phi)\}, \qquad (2)$$

with θ representing the link function, α(φ) the scale parameter, b(θ) the cumulant, and c(y; φ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\{y_i \ln(\pi_i/(1-\pi_i)) + \ln(1-\pi_i)\}. \qquad (3)$$

The link function is therefore ln(π/(1 − π)), and the cumulant −ln(1 − π) or ln(1/(1 − π)); for the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\{y_i \ln(\mu_i/(1-\mu_i)) + \ln(1-\mu_i)\}, \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^{n}\{y_i \ln(\mu_i) + (1-y_i)\ln(1-\mu_i)\}. \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^{n}\{y_i \ln(y_i/\mu_i) + (1-y_i)\ln((1-y_i)/(1-\mu_i))\}. \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit, ŷ, is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(μ_i/(1 − μ_i)). We have, therefore,

$$x_i'\beta = \ln(\mu_i/(1-\mu_i)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n. \qquad (7)$$

The value of μ_i for each observation in the logistic model is calculated as

$$\mu_i = 1/(1+\exp(-x_i'\beta)) = \exp(x_i'\beta)/(1+\exp(x_i'\beta)). \qquad (8)$$

The functions to the right of μ_i are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial\beta} = \sum_{i=1}^{n}(y_i - \mu_i)x_i \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n}x_i x_i'\,\mu_i(1-\mu_i). \qquad (10)$$
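Equations (8)–(10) translate directly into a Newton–Raphson iteration. The following Python sketch (data simulated for illustration; names are not from the entry) fits a binary logistic model and takes standard errors from the observed Hessian.

```python
import numpy as np

def logistic_newton(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson fit of a binary logistic model using the gradient (9)
    and Hessian (10); X should include a column of ones for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))           # inverse link (8)
        grad = X.T @ (y - mu)                           # equation (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X    # equation (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))         # standard errors
    return beta, se

# Simulated data for illustration.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1]))))
print(logistic_newton(X, y))
```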

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of μ, ln(μ/(1 − μ)), where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children, based on a cross-tabulation of survival (no/yes) against age (child vs. adult).

The odds of survival for adult passengers are given by the number of surviving adults divided by the number of adult deaths; the odds of survival for children are formed in the same way. The ratio of the odds of survival for adults to the odds of survival for children, here about 0.5, is referred to as the odds ratio, or ratio of the two component odds relationships. A logistic regression procedure estimating the effect of age on survival reports this odds ratio together with its standard error, z statistic, p-value, and confidence interval.

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as: "The odds of an adult surviving were about half the odds of a child surviving." By inverting the estimated odds ratio, we may conclude that children had nearly two times greater odds of surviving than did adults.
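The arithmetic of the odds ratio is easily reproduced; the counts below are purely illustrative stand-ins (the actual Titanic figures are tabulated in the references), chosen to give an odds ratio of roughly one half.

```python
import numpy as np

# Hypothetical 2x2 counts, illustrative only.
survived = {"child": 60, "adult": 440}
died     = {"child": 50, "adult": 760}

odds_child = survived["child"] / died["child"]
odds_adult = survived["adult"] / died["adult"]
odds_ratio = odds_adult / odds_child   # exponentiated age coefficient
print(odds_child, odds_adult, odds_ratio)

# The log of this quantity is the (non-exponentiated) logit-scale estimate.
print(np.log(odds_ratio))
```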

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as: "The odds of surviving are one and a half percent greater for each increasing year of age."

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator (y), indicating the number of successes for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

$$f(y_i; \pi_i, m_i) = \binom{m_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{m_i-y_i}, \qquad (11)$$

with a corresponding log-likelihood function expressed as

$$L(\mu_i; y_i, m_i) = \sum_{i=1}^{n}\left\{y_i \ln(\mu_i/(1-\mu_i)) + m_i \ln(1-\mu_i) + \ln\binom{m_i}{y_i}\right\}. \qquad (12)$$

Taking derivatives of the cumulant, −m_i ln(1 − μ_i), as we did for the binary response model, produces a mean of μ_i = m_i π_i and variance μ_i(1 − μ_i/m_i). Consider the data below.

(The table lists, for each covariate pattern, the number of successes y, the number of cases, and the values of three binary predictors x1, x2, x3.)

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table tells us that there are three cases sharing the same values of x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

(The fitted model reports odds ratios for x1, x2, and x3 together with their OIM standard errors, z statistics, p-values, and confidence intervals.)

e data in the above table may be restructured so thatit is in individual observation format rather than groupede new table would have ten observations having thesame logic as described Modeling would result in iden-tical parameter estimates It is not uncommon to nd anindividual-based data set of for example observa-tions being grouped into ndash rows or observations asabove described Data in tables is nearly always expressedin grouped formatLogisticmodels are subject to a variety of t tests Some

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once widely used, is now only used with caution: the test is heavily influenced by the manner in which tied data are classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see Akaike's Information Criterion, and Akaike's Information Criterion: Background, Derivation, Properties and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analyses are most appropriate for logistic models having no continuous predictors; m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press, 2007) and Logistic Regression Models (Chapman & Hall, 2009), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC, 2003) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer, 2010). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
Case-Control Studies
Categorical Data Analysis
Generalized Linear Models
Multivariate Data Analysis: An Overview
Probit Analysis
Recursive Partitioning
Regression Models with Symmetrical Errors
Robust Regression Estimation in Generalized Linear Models
Statistics: An Overview
Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, Boca Raton
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

$$f(x) = \frac{1}{\tau}\,\frac{\exp\{-(x-\mu)/\tau\}}{\left[1+\exp\{-(x-\mu)/\tau\}\right]^{2}}, \quad -\infty < x < \infty, \qquad (1)$$

and the cumulative distribution function is

$$F(x) = \left[1+\exp\{-(x-\mu)/\tau\}\right]^{-1}, \quad -\infty < x < \infty.$$

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution, N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

$$S(x) = \left[1+\exp\{(x-\mu)/\tau\}\right]^{-1}, \quad -\infty < x < \infty,$$

$$h(x) = \frac{1}{\tau}\left[1+\exp\{-(x-\mu)/\tau\}\right]^{-1}.$$

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

$$h(t) = \frac{(\alpha/\theta)\,(t/\theta)^{\alpha-1}}{1+(t/\theta)^{\alpha}}, \quad t \ge 0,\; \theta, \alpha > 0,$$

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum; as for the lognormal distribution, hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

$$f(x) = F(x)\,[1-F(x)],$$

which in turn, by elementary calculus, implies that

$$\operatorname{logit}(F(x)) = \log_e\left[\frac{F(x)}{1-F(x)}\right] = x, \qquad (2)$$

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
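This inversion method is a one-liner in practice; the following sketch simulates from logistic(μ, τ) and checks the mean and the variance τ²π²/3.

```python
import numpy as np

rng = np.random.default_rng(5)

def rlogistic(n, mu=0.0, tau=1.0, rng=rng):
    """Simulate from the logistic(mu, tau) distribution via equation (2):
    X = log(U/(1-U)) is standard logistic for U ~ U(0, 1)."""
    u = rng.uniform(size=n)
    return mu + tau * np.log(u / (1.0 - u))

x = rlogistic(1_000_000, mu=2.0, tau=0.5)
# Mean should be near mu, variance near tau^2 * pi^2 / 3.
print(x.mean(), x.var(), 0.5**2 * np.pi**2 / 3)
```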

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

$$\operatorname{logit}(P(d)) = \log_e\left[\frac{P(d)}{1-P(d)}\right] = \beta_0 + \beta_1 d,$$

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/|β_1|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of generalized linear models. A comprehensive treatment of logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).
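Such a logit fit to grouped dose-response data can be sketched with any GLM routine; the following Python example (hypothetical counts; assuming the statsmodels package is available) fits the linear logit model and recovers the implied tolerance parameters μ = −β₀/β₁ and τ = 1/|β₁|.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical quantal bio-assay data: r responders out of n at dose d.
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = np.array([40, 40, 40, 40, 40])
r = np.array([4, 10, 21, 30, 36])

# Binomial GLM with (default) logit link: logit(P(d)) = beta0 + beta1*d.
X = sm.add_constant(d)
fit = sm.GLM(np.column_stack([r, n - r]), X,
             family=sm.families.Binomial()).fit()
b0, b1 = fit.params

# Implied logistic tolerance distribution parameters.
mu, tau = -b0 / b1, 1.0 / abs(b1)
print(mu, tau)
```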

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
Asymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Testing
Bivariate Distributions
Location-Scale Distributions
Logistic Regression
Multivariate Statistical Distributions
Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way: "Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population."

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rule: a unique Lorenz curve corresponds to every distribution. The contrary does not hold; every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with Lorenz ordering.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index, G, introduced by Gini (1912). This index is the ratio

(Lorenz Curve. Fig. 1  Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p).)

(Lorenz Curve. Fig. 2  Two intersecting Lorenz curves; using the Gini index, L_1(p) has greater inequality than L_2(p).)

between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
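The empirical Lorenz curve and Gini index are straightforward to compute; the Python sketch below (sample data are illustrative) follows the area definition above.

```python
import numpy as np

def lorenz_and_gini(income):
    """Empirical Lorenz curve points and Gini index for a sample of incomes."""
    x = np.sort(np.asarray(income, dtype=float))
    n = len(x)
    p = np.arange(1, n + 1) / n            # population shares
    L = np.cumsum(x) / x.sum()             # income shares L(p)
    # Gini: area between diagonal and curve over area under diagonal (1/2).
    gini = 1.0 - 2.0 * np.trapz(np.r_[0.0, L], np.r_[0.0, p])
    return p, L, gini

rng = np.random.default_rng(6)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)
p, L, g = lorenz_and_gini(sample)
print(g)   # for a lognormal(sigma), Gini = 2*Phi(sigma/sqrt(2)) - 1, about 0.43 here
```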

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists and

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income, and is a monotone-decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the level μ_Y/μ_X once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
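Theorem 1 can be illustrated numerically: for a transformation with u(x)/x monotone decreasing, the Gini index of u(X) should fall below that of X. A sketch, with an arbitrary progressive rule u(x) = x^0.7:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(sigma=0.8, size=100_000)   # pre-tax incomes

def gini(v):
    """Gini index via the standard order-statistics formula."""
    v = np.sort(v)
    n = len(v)
    return 2 * np.sum(np.arange(1, n + 1) * v) / (n * v.sum()) - (n + 1) / n

u = x ** 0.7            # post-tax income; u(x)/x = x^{-0.3} is decreasing
print(gini(x), gini(u))  # the second value is smaller, as Theorem 1 predicts
```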

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
Econometrics
Economic Statistics
Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see Decision Theory: An Introduction, and Decision Theory: An Overview) and to regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; therefore the value of the loss function is zero then. Otherwise, the value gives the loss which is suffered by the decision being unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ, defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

$$L : \Theta \times \Theta \to [0, \infty).$$

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples for loss functions.

First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative, Θ = Θ_0 + Θ_1. The usual loss function is then given by

$$L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\ 1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0. \end{cases}$$

For point estimation problems we assume that Θ is a normed linear space, and let ∥⋅∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L_1-loss. Another class of robust losses are the famous Huber losses,

$$\ell(t) = t^2 \;\text{ if } |t| \le \gamma, \qquad \ell(t) = 2\gamma|t| - \gamma^2 \;\text{ if } |t| > \gamma,$$

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems, Varian (1975) introduced LinEx losses,

$$\ell(t) = b\left(\exp(at) - at - 1\right),$$

where a, b > 0 can be chosen suitably. Here, underestimating is judged exponentially and overestimating linearly.

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
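The losses named above are simple to state in code; a Python sketch (parameter names follow the text):

```python
import numpy as np

def squared_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma):
    """Huber loss as in the text: t^2 for |t| <= gamma,
    2*gamma*|t| - gamma^2 otherwise (continuous at |t| = gamma)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t ** 2,
                    2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a, b):
    """LinEx loss b*(exp(a*t) - a*t - 1): exponential on one side of zero,
    approximately linear on the other; zero at t = 0."""
    return b * (np.exp(a * t) - a * t - 1.0)

t = np.linspace(-2, 2, 5)
print(huber_loss(t, gamma=1.0))
print(linex_loss(t, a=1.0, b=1.0))
```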

In theoretical works, the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ( ).

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes

$$\sum_{i=1}^{n} \ell(r_{f,i}), \quad f \in F,$$

where r_{f,i} = y_i − f(t_i) is the residual at the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
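As an illustration of this use of a loss function in regression, the following sketch (simulated data; Huber's ℓ with an arbitrary γ) finds the "best approximation" within the class of straight lines by numerical minimization.

```python
import numpy as np
from scipy.optimize import minimize

# F is the class of straight lines f(t) = c0 + c1*t; we minimize the
# summed Huber loss of the residuals r_i = y_i - f(t_i).
rng = np.random.default_rng(8)
t = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * t + rng.normal(scale=0.1, size=50)
y[::10] += 3.0                      # contaminate with a few outliers

def huber(r, gamma=0.5):
    return np.where(np.abs(r) <= gamma,
                    r ** 2,
                    2 * gamma * np.abs(r) - gamma ** 2)

def objective(c):
    return np.sum(huber(y - (c[0] + c[1] * t)))

res = minimize(objective, x0=np.zeros(2))
print(res.x)   # roughly (1, 2); the Huber loss damps the outliers
```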

Cross References
Advantages of Bayesian Structuring: Estimating Ranks and Histograms
Bayesian Statistics
Decision Theory: An Introduction
Decision Theory: An Overview
Entropy and Cross Entropy as Diversity and Distance Measures
Sequential Sampling
Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W ( ) Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J Savage. North Holland, Amsterdam

                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 29: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

Location-Scale Distributions L

L

• Laplace distribution
• Logistic and half-logistic distributions
• Normal and half-normal distributions
• Parabolic distributions
• Rectangular or uniform distribution
• Semi-elliptical distribution
• Symmetric triangular distribution
• Teissier distribution with reduced density f(y) = [exp(y) − 1] exp[1 + y − exp(y)], y ≥ 0
• V-shaped distribution

For each of the above mentioned distributions we can design a special probability paper. Conventionally, the abscissa is for the realization of the variate and the ordinate, called the probability axis, displays the values of the cumulative distribution function, but its underlying scaling is according to the percentile function. The ordinate value belonging to a given sample datum on the abscissa is called its plotting position; for its choice see Barnett (1975), Blom (1958), Kimball (1960). When the sample comes from the probability paper's distribution, the plotted data will randomly scatter around a straight line; thus we have a graphical goodness-of-fit test. When we fit the straight line by eye, we may read off estimates for a and b as the abscissa or difference on the abscissa for certain percentiles. A more objective method is to fit a least-squares line to the data, and the estimates of a and b will be the parameters of this line.

The latter approach takes the order statistics X1:n ≤ X2:n ≤ ⋯ ≤ Xn:n as regressand and the means of the reduced order statistics αi:n = E(Yi:n) as regressor, which under these circumstances act as plotting positions. The regression model reads

Xi:n = a + b αi:n + εi,

where εi is a random variable expressing the difference between Xi:n and its mean E(Xi:n) = a + b αi:n. As the order statistics Xi:n and – as a consequence – the disturbance terms εi are neither homoscedastic nor uncorrelated, we have to use – according to Lloyd (1952) – the generalized least-squares method to find best linear unbiased estimators of a and b. Introducing the following vectors and matrices,

x = (X1:n, X2:n, …, Xn:n)′,   1 = (1, 1, …, 1)′,   α = (α1:n, α2:n, …, αn:n)′,
ε = (ε1, ε2, …, εn)′,   θ = (a, b)′,   A = (1, α),

the regression model now reads

x = A θ + ε

with variance–covariance matrix

Var(x) = b² B,

where B is the variance–covariance matrix of the reduced order statistics Yi:n. The GLS estimator of θ is

θ̂ = (A′B⁻¹A)⁻¹ A′B⁻¹ x

and its variance–covariance matrix reads

Var(θ̂) = b² (A′B⁻¹A)⁻¹.

The vector α and the matrix B are not always easy to find. Only for a few location–scale distributions, such as the exponential, the reflected exponential, the extreme value, the logistic, and the rectangular distributions, do we have closed-form expressions; in all other cases we have to evaluate the integrals defining E(Yi:n) and E(Yi:n Yj:n). For more details on linear estimation and probability plotting for location–scale distributions, and for distributions which can be transformed to location–scale type, see Rinne ( ). Maximum likelihood estimation for location–scale distributions is treated by Mi ( ).
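As a minimal Python sketch of this GLS computation (the function name is mine; the closed-form α and B used below are the standard expressions for the exponential distribution and serve only as an illustration):

import numpy as np

def lloyd_gls(x_sorted, alpha, B):
    """BLUEs of location a and scale b from an ordered sample (Lloyd's GLS).
    x_sorted: ordered observations; alpha: E(Y_i:n); B: Cov(Y_i:n, Y_j:n)."""
    A = np.column_stack([np.ones_like(alpha), alpha])
    Binv = np.linalg.inv(B)
    M = np.linalg.inv(A.T @ Binv @ A)          # Var(theta_hat) = b^2 * M
    a_hat, b_hat = M @ A.T @ Binv @ x_sorted
    return a_hat, b_hat, M

# Illustration for the exponential distribution, where alpha and B are known:
n = 5
i = np.arange(1, n + 1)
alpha = np.cumsum(1.0 / (n - i + 1))           # E(Y_i:n)
B = np.array([[np.sum(1.0 / (n - np.arange(1, min(r, s) + 1) + 1) ** 2)
               for s in i] for r in i])        # Cov(Y_i:n, Y_j:n)
x = np.sort(1.0 + np.random.default_rng(3).exponential(2.0, n))
print(lloyd_gls(x, alpha, B))                  # noisy estimates of a = 1, b = 2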

About the Author
Dr. Horst Rinne is Professor Emeritus (since ) of statistics and econometrics. In he was awarded the Venia legendi in statistics and econometrics by the Faculty of Economics of Berlin Technical University. From to he worked as a part-time lecturer of statistics at Berlin Polytechnic and at the Technical University of Hannover. In he joined Volkswagen AG to do operations research. In he was appointed full professor of statistics and econometrics at the Justus-Liebig University in Giessen, where later on he was Dean of the Faculty of Economics and Management Science for three years. He got further appointments to the universities of Tübingen and Cologne and to Berlin Polytechnic. He was a member of the board of the German Statistical Society and in charge of editing its journal Allgemeines Statistisches Archiv, now AStA – Advances in Statistical Analysis. He is co-founder of the Econometric Board of the German Society of Economics, and an Elected member of the International Statistical Institute. He is Associate Editor for Quality Technology and Quantitative Management. His scientific interests are wide-spread, ranging from financial mathematics, business and economics statistics, and technical statistics to econometrics. He has written numerous papers in these fields and is author of several textbooks and monographs in statistics, econometrics, time series analysis, multivariate statistics, and statistical quality control, including process capability. He is the author of The Weibull Distribution: A Handbook (Chapman and Hall/CRC).


Cross References
►Statistical Distributions: An Overview

References and Further Reading
Barnett V (1975) Probability plotting methods and order statistics. Appl Stat
Barnett V ( ) Convenient probability plotting positions for the normal distribution. Appl Stat
Blom G (1958) Statistical estimates and transformed beta variables. Almqvist and Wiksell, Stockholm
Kimball BF (1960) On the choice of plotting positions on probability paper. J Am Stat Assoc
Lloyd EH (1952) Least-squares estimation of location and scale parameters using order statistics. Biometrika
Mi J ( ) MLE of parameters of location–scale distributions for complete and partially grouped data. J Stat Planning Inference
Rinne H ( ) Location-scale distributions – linear estimation and probability plotting. http://geb.uni-giessen.de/geb/volltexte

Logistic Normal Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic-normal distribution arises by assuming that the logit (or logistic transformation) of a proportion has a normal distribution, with an obvious extension to a vector of proportions through taking a logistic transformation of a multivariate normal distribution; see Aitchison and Shen (1980). In the univariate case this provides a family of distributions on (0, 1) that is distinct from the ►beta distribution, while the multivariate version is an alternative to the Dirichlet distribution. Note that in the multivariate case there is no unique way to define the set of logits for the multinomial proportions (just as in multinomial logit models, see Agresti 2002), and different formulations may be appropriate in particular applications (Aitchison 1982). The univariate distribution has been used, often implicitly, in random effects models for binary data, and the multivariate version was pioneered by Aitchison for statistical diagnosis/discrimination (Aitchison and Begg 1976), the Bayesian analysis of contingency tables, and the analysis of compositional data (Aitchison 1986).

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses ri out of mi trials (i = 1, …, n), the response probabilities Pi are assumed to have a logistic-normal distribution with logit(Pi) = log(Pi/(1 − Pi)) ∼ N(µi, σ²), where µi is modelled as a linear function of explanatory variables x1, …, xp. The resulting model can be summarized as

Ri ∣ Pi ∼ Binomial(mi, Pi)
logit(Pi) ∣ Z = ηi + σZ = β0 + β1xi1 + ⋯ + βpxip + σZ
Z ∼ N(0, 1).

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that, if σ² is small, then, as derived in Williams (1982),

E[Ri] = mi πi  and  Var(Ri) = mi πi (1 − πi)[1 + σ²(mi − 1)πi(1 − πi)],

where logit(πi) = ηi. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect Z allows for unexplained variation across the grouped observations. However, note that for binary data (mi = 1) it is not possible to have overdispersion arising in this way.
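The normal integration involved can be illustrated with Gauss–Hermite quadrature. The following Python sketch (function name and values are illustrative, not from this entry) approximates the mean and variance of a proportion P with logit(P) ∼ N(µ, σ²):

import numpy as np
from scipy.special import expit   # inverse logit: expit(x) = 1/(1 + exp(-x))

def logit_normal_moments(mu, sigma, nodes=40):
    """E[P] and Var(P) for logit(P) ~ N(mu, sigma^2), by Gauss-Hermite."""
    x, w = np.polynomial.hermite.hermgauss(nodes)
    p = expit(mu + sigma * np.sqrt(2.0) * x)    # change of variables for N(0,1)
    m1 = (w * p).sum() / np.sqrt(np.pi)         # E[P]
    m2 = (w * p ** 2).sum() / np.sqrt(np.pi)    # E[P^2]
    return m1, m2 - m1 ** 2

print(logit_normal_moments(mu=0.0, sigma=1.0))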

About the Author
For biography see the entry ►Logistic Distribution.

Cross References
►Logistic Distribution
►Mixed Membership Models
►Multivariate Normal Distributions
►Normal Distribution: Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams D (1982) Extra-binomial variation in logistic linear models. Appl Stat

Logistic Regression

Joseph M. Hilbe
Emeritus Professor
University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics
Arizona State University, Tempe, AZ, USA
Solar System Ambassador
California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance σ² is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

f(yi; πi) = πi^{yi} (1 − πi)^{1−yi}.  (1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.
In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed ►Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary-loglog regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

f(yi; θi, φ) = exp{[yi θi − b(θi)]/α(φ) + c(yi; φ)},  (2)

with θ representing the link function, α(φ) the scale parameter, b(θ) the cumulant, and c(y; φ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and the explanatory predictors. The scale for binary and count models is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.
We may structure the Bernoulli distribution (1) into exponential family form (2) as

f(yi; πi) = exp{yi ln(πi/(1 − πi)) + ln(1 − πi)}.  (3)

The link function is therefore ln(π/(1 − π)) and the cumulant −ln(1 − π), or ln(1/(1 − π)); for the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.
Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, µ, is typically substituted for π when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

L(µi; yi) = ∑_{i=1}^n { yi ln(µi/(1 − µi)) + ln(1 − µi) },  (4)

or

L(µi; yi) = ∑_{i=1}^n { yi ln(µi) + (1 − yi) ln(1 − µi) }.  (5)

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance, rather than the log-likelihood function, as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and the model log-likelihood. For the logistic model, the deviance is expressed as

D = 2 ∑_{i=1}^n { yi ln(yi/µi) + (1 − yi) ln[(1 − yi)/(1 − µi)] }.  (6)
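As a sketch of how (5) and (6) translate into code (the function names are mine; fitted values at exactly 0 or 1 would need guarding in practice):

import numpy as np
from scipy.special import xlogy   # xlogy(a, b) = a*log(b), with 0 where a = 0

def bernoulli_loglik(y, mu):
    """Binary-logistic log-likelihood, Eq. (5)."""
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def binary_deviance(y, mu):
    """Deviance, Eq. (6); xlogy handles the y = 0 and y = 1 terms."""
    return 2 * np.sum(xlogy(y, y / mu) + xlogy(1 - y, (1 - y) / (1 - mu)))

y  = np.array([1, 0, 1, 1, 0])
mu = np.array([0.8, 0.3, 0.6, 0.9, 0.2])
print(bernoulli_loglik(y, mu), binary_deviance(y, mu))   # D = -2 * loglik here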

Whether estimated using maximum likelihood techniques or as GLM, the value of µ for each observation in the model is calculated on the basis of the linear predictor, x′β. For the normal model, the predicted fit ŷ is identical to x′β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, ln(µi/(1 − µi)). We have, therefore,

x′i β = ln(µi/(1 − µi)) = β0 + β1x1 + β2x2 + ⋯ + βnxn.  (7)

The value of µi for each observation in the logistic model is calculated as

µi = 1/(1 + exp(−x′i β)) = exp(x′i β)/(1 + exp(x′i β)).  (8)

The functions to the right of µ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, µ is a probability.
When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′β rather than µ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

∂L(β)/∂β = ∑_{i=1}^n (yi − µi) xi  (9)

∂²L(β)/∂β ∂β′ = −∑_{i=1}^n xi x′i µi (1 − µi).  (10)
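A minimal self-contained sketch of this Newton–Raphson scheme, built directly from (8)–(10); the function name and simulated data are illustrative:

import numpy as np

def logistic_newton_raphson(X, y, tol=1e-8, max_iter=25):
    """MLE for binary logistic regression via Newton-Raphson.
    X: (n, p) design matrix including an intercept column; y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))           # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                            # gradient/score, Eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X     # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)               # Newton step
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))          # SEs from the Hessian
    return beta, se

# Illustrative use with simulated data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 1]))))
print(logistic_newton_raphson(X, y))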

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of µ, ln(µ/(1 − µ)), where the odds are understood as the success of µ over its failure, 1 − µ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.
We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

Survived (no/yes) tabulated against age (child vs adult), with row and column totals.

The odds of survival for adult passengers is the number of adults surviving divided by the number not surviving; the odds of survival for children is computed likewise. The ratio of the odds of survival for adults to the odds of survival for children is the value referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces a table of results giving, for age, the estimated odds ratio together with its standard error, z statistic, P>∣z∣ value, and 95% confidence interval.

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

    The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.
For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age had been recorded as a continuous predictor in the Titanic data and the odds ratio calculated as 1.015, we would interpret the relationship as:

    The odds of surviving is one and a half percent greater for each increasing year of age.
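The odds-ratio arithmetic can be checked numerically; the 2 × 2 counts below are hypothetical stand-ins, not the entry's Titanic figures:

import numpy as np

#                 [survived, died] -- hypothetical counts for illustration
child = np.array([57, 52])
adult = np.array([654, 1438])

odds_child = child[0] / child[1]           # odds of a child surviving
odds_adult = adult[0] / adult[1]           # odds of an adult surviving
odds_ratio = odds_adult / odds_child       # adult-vs-child odds ratio
print(odds_child, odds_adult, odds_ratio)  # OR < 1: adults fared worse
print(1.0 / odds_ratio)                    # inverted: child-vs-adult odds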

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating µ.
Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator (y), indicating the number of successes for a specific covariate pattern, and a denominator (m), the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

f(yi; πi, mi) = C(mi, yi) πi^{yi} (1 − πi)^{mi − yi},  (11)

where C(mi, yi) denotes the binomial coefficient,

with a corresponding log-likelihood function expressed as

L(µi; yi, mi) = ∑_{i=1}^n { yi ln(µi/(1 − µi)) + mi ln(1 − µi) + ln C(mi, yi) }.  (12)

Taking derivatives of the cumulant, −mi ln(1 − µi), as we did for the binary response model, produces a mean of µi = mi πi and variance µi(1 − µi/mi).
Consider a small grouped data set with columns y, cases, x1, x2, and x3: y indicates the number of times a specific pattern of covariates is successful, and cases is the number of observations having that specific covariate pattern. The first observation in such a table informs us that there are three cases having a common pattern of values for x1, x2, and x3; of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.
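A sketch of fitting such a grouped model via GLM, here with Python's statsmodels (the small data set is hypothetical, standing in for the table's values):

import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: y successes out of `cases` trials
# for each covariate pattern (x1, x2, x3).
y     = np.array([1, 3, 1, 2, 1])
cases = np.array([3, 4, 2, 5, 2])
X = sm.add_constant(np.array([[1, 0, 1],
                              [1, 1, 0],
                              [0, 1, 1],
                              [1, 1, 1],
                              [0, 0, 0]], dtype=float))
# A (successes, failures) pair as endog encodes the binomial numerator/denominator
fit = sm.GLM(np.column_stack([y, cases - y]), X,
             family=sm.families.Binomial()).fit()
print(np.exp(fit.params))   # exponentiated coefficients are odds ratios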

Fitting the model produces, for each of x1, x2, and x3, an estimated odds ratio together with its OIM standard error, z statistic, P>∣z∣ value, and 95% confidence interval.

The data in such a table may be restructured into individual observation format rather than grouped: each covariate pattern is expanded into as many observations as it has cases. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set being grouped into a table of just a few rows or covariate patterns as described above. Data in tables is nearly always expressed in grouped format.
Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution, as the test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.
Information criteria tests, e.g., the Akaike Information Criterion (see ►Akaike's Information Criterion and ►Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.
Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.
Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving for years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2002) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG ( ) Logistic regression: a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = (1/τ) exp{−(x − µ)/τ} / [1 + exp{−(x − µ)/τ}]²,  −∞ < x < ∞,  (1)

and the cumulative distribution function is

F(x) = 1/[1 + exp{−(x − µ)/τ}],  −∞ < x < ∞.

The distribution is symmetric about the mean µ and has variance τ²π²/3, so that when comparing the standard logistic distribution (µ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.
In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = 1/[1 + exp{(x − µ)/τ}],  −∞ < x < ∞,

h(x) = (1/τ) / [1 + exp{−(x − µ)/τ}].

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for µ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).
One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = (α/θ) (t/θ)^{α−1} / [1 + (t/θ)^α],  t ≥ 0, θ > 0, α > 0,

where θ = e^µ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.
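A brief numerical sketch of the two hazard shapes just described (the function name and parameter values are arbitrary):

import numpy as np

def loglogistic_hazard(t, theta, alpha):
    """h(t) = (alpha/theta)(t/theta)^(alpha-1) / [1 + (t/theta)^alpha]."""
    u = t / theta
    return (alpha / theta) * u ** (alpha - 1) / (1.0 + u ** alpha)

t = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
print(loglogistic_hazard(t, theta=1.0, alpha=0.5))  # monotone decreasing
print(loglogistic_hazard(t, theta=1.0, alpha=2.0))  # rises, then falls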

For the standard logistic distribution (µ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

logit(F(x)) = log_e [F(x)/(1 − F(x))] = x,  (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution: set X = log_e[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + µ.
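In code, a minimal sketch (parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
u = rng.uniform(size=100_000)
x = np.log(u / (1.0 - u))       # standard logistic draws via Eq. (2)
mu, tau = 2.0, 0.5
samples = mu + tau * x          # general logistic distribution, Eq. (1)
print(samples.mean())                        # close to mu = 2.0
print(samples.var(), tau**2 * np.pi**2 / 3)  # sample vs theoretical variance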

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model

logit(P(d)) = log_e [P(d)/(1 − P(d))] = β0 + β1 d,

which, by the identity (2), implies a logistic tolerance distribution with parameters µ = −β0/β1 and τ = 1/∣β1∣. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of the British and Irish Region. He is an Elected member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

    Plot along one axis accumulated percentages of the population from poorest to richest, and along the other the wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

    A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.
Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p) and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.
The inequality can be of a different type: the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912).

Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; by the Gini index, L1(p) has greater inequality than L2(p)

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
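A sketch of the empirical Lorenz curve and Gini index for a sample of incomes (the function name and the lognormal sample are illustrative):

import numpy as np

def lorenz_gini(income):
    """Empirical Lorenz curve points (p, L(p)) and the Gini index."""
    x = np.sort(np.asarray(income, dtype=float))
    p = np.arange(1, x.size + 1) / x.size     # cumulative population shares
    L = np.cumsum(x) / x.sum()                # cumulative income shares L(p)
    # Gini = (area between diagonal and Lorenz curve) / (area under diagonal)
    gini = 1.0 - 2.0 * np.trapz(np.concatenate(([0.0], L)),
                                np.concatenate(([0.0], p)))
    return p, L, gini

rng = np.random.default_rng(7)
income = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # skewed incomes
_, _, g = lorenz_gini(income)
print(g)                                  # roughly 0.52 for this lognormal
_, _, g_scaled = lorenz_gini(0.8 * income)
print(g_scaled)                           # scaling all incomes equally leaves G unchanged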

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem. Let u(x) be a continuous monotone increasing function and assume that µY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the level µY/µX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the Theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.
A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics, and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.

Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.
Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.
To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L: Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L: Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.
Next we describe examples of loss functions. First let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1 describing the null hypothesis and the alternative, Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.

For point estimation problems we assume that Θ is a normed linear space and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣², θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = ℝ. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ: ℝ → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = ∣t∣^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if ∣t∣ ≤ γ,  and  ℓ(t) = 2γ∣t∣ − γ² if ∣t∣ > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.
In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see ►Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ( ).
In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.
In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation", i.e., the function in F that minimizes ∑_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = yi − f(ti) is the residual at the i-th design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
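A compact sketch of the losses just described (the function names are mine; γ, a, and b are arbitrary choices):

import numpy as np

def squared_loss(t):
    return t ** 2

def l1_loss(t):
    return np.abs(t)

def huber_loss(t, gamma=1.0):
    """Quadratic near zero, linear in the tails (continuous at |t| = gamma)."""
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    """Asymmetric: exponential penalty on one side, near-linear on the other."""
    return b * (np.exp(a * t) - a * t - 1.0)

t = np.linspace(-2.0, 2.0, 5)
print(huber_loss(t), linex_loss(t))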

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W ( ) Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam

  • L
    • Large Deviations and Applications
      • About the Author
      • Cross References
      • References and Further Reading
        • Laws of Large Numbers
          • About the Author
          • Cross References
          • References and Further Reading
            • Learning Statistics in a Foreign Language
              • Background
              • Language and Cultural Problems
              • My SQU Experience
              • What Can Be Done
              • Cross References
              • References and Further Reading
                • Least Absolute Residuals Procedure
                  • About the Author
                  • Cross References
                  • References and Further Reading
                    • Least Squares
                      • About the Author
                      • Cross References
                      • References and Further Reading
                        • Leacutevy Processes
                          • Introduction
                          • Leacutevy Processes
                          • Examples of the Leacutevy Processes
                            • The Brownian Motion
                            • The Inverse Brownian Process
                            • The Compound Poisson Process
                            • The Gamma Process
                            • The Variance Gamma Process
                              • About the Author
                              • Cross References
                              • References and Further Reading
                                • Life Expectancy
                                  • Cross References
                                  • References and Further Reading
                                    • Life Table
                                      • About the Author
                                      • Cross References
                                      • References and Further Reading
                                        • Likelihood
                                          • Introduction
                                          • Inference
                                          • Nuisance Parameters
                                          • Extensions to Likelihood
                                          • Further Reading
                                          • About the Author
                                          • Cross References
                                          • References and Further Reading
                                            • Limit Theorems of Probability Theory
                                              • About the Author
                                              • Cross References
                                              • References and Further Reading
                                                • Linear Mixed Models
                                                  • About the Author
                                                  • Cross References
                                                  • References and Further Reading
                                                    • Linear Regression Models
                                                      • Aspects of Data Model and Notation
                                                      • Terminology
                                                      • General Remarks on Fitting Linear Regression Models to Data
                                                      • Background
                                                      • Aspects of Linear Regression Models
                                                      • Classical Linear Regression Modeland Least Squares
                                                      • Algebraic Properties of Least Squares
                                                      • Generalized Least Squares
                                                      • Reproducing Kernel Generalized Linear Regression Model
                                                      • Joint Normal Distributed (x y) as Motivation for the Linear Regression Model and Least Squares
                                                      • Comparing Two Basic Linear Regression Models
                                                      • Balancing Good Fit Against Reproducibility
                                                      • Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
                                                      • Cross References
                                                      • References and Further Reading
                                                        • Local Asymptotic Mixed Normal Family
                                                          • About the Author
                                                          • Cross References
                                                          • References and Further Reading
                                                            • Location-Scale Distributions
                                                              • About the Author
                                                              • Cross References
                                                              • References and Further Reading
                                                                • Logistic Normal Distribution
                                                                  • About the Author
                                                                  • Cross References
                                                                  • References and Further Reading
                                                                    • Logistic Regression
                                                                      • About the Author
                                                                      • Cross References
                                                                      • References and Further Reading
                                                                        • Logistic Distribution
                                                                          • About the Author
                                                                          • Cross References
                                                                          • References and Further Reading
                                                                            • Lorenz Curve
                                                                              • Definition and Properties
                                                                              • Income Redistributions
                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 30: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

L Logistic Normal Distribution

Cross References7Statistical Distributions An Overview

References and Further ReadingBarnett V () Probability plotting methods and order statistics

Appl Stat ndashBarnett V () Convenient probability plotting positions for the

normal distribution Appl Stat ndashBlom G () Statistical estimates and transformed beta variables

Almqvist and Wiksell StockholmKimball BF () On the choice of plotting positions on probability

paper J Am Stat Assoc ndashLloyd EH () Least-squares estimation of location and scale

parameters using order statistics Biometrika ndashMi J () MLE of parameters of locationndashscale distributions for

complete and partially grouped data J Stat Planning Inferencendash

Rinne H () Location-scale distributions ndash linear estimation andprobability plotting httpgebuni-giessengebvolltexte

Logistic Normal Distribution

JohnHindeProfessor of StatisticsNational University of Ireland Galway Ireland

e logistic-normal distribution arises by assuming thatthe logit (or logistic transformation) of a proportion hasa normal distribution with an obvious extension to a vec-tor of proportions through taking a logistic transformationof a multivariate normal distribution see Aitchison andShen () In the univariate case this provides a family ofdistributions on ( ) that is distinct from the 7beta dis-tribution while the multivariate version is an alternativeto the Dirichlet distribution Note that in the multivariatecase there is no unique way to dene the set of logits forthe multinomial proportions (just as in multinomial logitmodels see Agresti ) and dierent formulations maybe appropriate in particular applications (Aitchison )e univariate distribution has been used oen implicitlyin random eects models for binary data and the multi-variate version was pioneered by Aitchison for statisticaldiagnosisdiscrimination (Aitchison and Begg ) theBayesian analysis of contingency tables and the analysis ofcompositional data (Aitchison )

The use of the logistic-normal distribution is most easily seen in the analysis of binary data, where the logit model (based on a logistic tolerance distribution) is extended to the logit-normal model. For grouped binary data with responses $r_i$ out of $m_i$ trials ($i = 1, \dots, n$), the response probabilities $P_i$ are assumed to have a logistic-normal distribution with $\mathrm{logit}(P_i) = \log(P_i/(1 - P_i)) \sim N(\mu_i, \sigma^2)$, where $\mu_i$ is modelled as a linear function of explanatory variables $x_1, \dots, x_p$. The resulting model can be summarized as

$$R_i \mid P_i \sim \mathrm{Binomial}(m_i, P_i)$$

$$\mathrm{logit}(P_i) \mid Z = \eta_i + \sigma Z = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \sigma Z$$

$$Z \sim N(0, 1)$$

This is a simple extension of the basic logit model with the inclusion of a single normally distributed random effect in the linear predictor, an example of a generalized linear mixed model; see McCulloch and Searle (2001). Maximum likelihood estimation for this model is complicated by the fact that the likelihood has no closed form and involves integration over the normal density, which requires numerical methods; routines using Gaussian quadrature now exist as part of generalized linear mixed model fitting in all major software packages, such as SAS, R, Stata, and Genstat. Approximate moment-based estimation methods make use of the fact that if $\sigma^2$ is small then, as derived in Williams (1982),

$$E[R_i] = m_i \pi_i \quad\text{and}\quad \mathrm{Var}(R_i) = m_i \pi_i (1 - \pi_i)\left[1 + \sigma^2 (m_i - 1)\pi_i(1 - \pi_i)\right]$$

where $\mathrm{logit}(\pi_i) = \eta_i$. The form of the variance function shows that this model is overdispersed compared to the binomial, that is, it exhibits greater variability; the random effect $Z$ allows for unexplained variation across the grouped observations. However, note that for binary data ($m_i = 1$) it is not possible to have overdispersion arising in this way.
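To make the overdispersion concrete, here is a small simulation (an added illustration with arbitrary parameter values, not from the original entry) that draws grouped responses under the logit-normal model and compares the empirical variance with the plain binomial variance and with Williams' small-$\sigma^2$ approximation.

    import numpy as np

    rng = np.random.default_rng(1)

    m, eta, sigma = 50, 0.0, 0.4   # arbitrary: group size, linear predictor, random-effect sd
    n_groups = 200_000

    # R_i | P_i ~ Binomial(m, P_i) with logit(P_i) = eta + sigma * Z, Z ~ N(0, 1)
    z = rng.normal(size=n_groups)
    p = 1.0 / (1.0 + np.exp(-(eta + sigma * z)))
    r = rng.binomial(m, p)

    pi = 1.0 / (1.0 + np.exp(-eta))
    binom_var = m * pi * (1 - pi)
    williams = binom_var * (1 + sigma**2 * (m - 1) * pi * (1 - pi))

    print("empirical variance:", r.var())   # clearly exceeds the binomial value
    print("binomial variance :", binom_var)
    print("Williams approx.  :", williams)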

About the Author
For biography, see the entry 7Logistic Distribution.

Cross References
7Logistic Distribution
7Mixed Membership Models
7Multivariate Normal Distributions
7Normal Distribution, Univariate

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B 44:139–177
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London
Aitchison J, Begg CB (1976) Statistical diagnosis when basic cases are not classified with certainty. Biometrika 63:1–12
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika 67:261–272
McCulloch CE, Searle SR (2001) Generalized linear and mixed models. Wiley, New York
Williams DA (1982) Extra-binomial variation in logistic linear models. Appl Stat 31:144–148

Logistic Regression

Joseph M. Hilbe
Emeritus Professor, University of Hawaii, Honolulu, HI, USA
Adjunct Professor of Statistics, Arizona State University, Tempe, AZ, USA
Solar System Ambassador, California Institute of Technology, Pasadena, CA, USA

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual values that 1 and 0 can take vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed. Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally or Gaussian distributed, that the variance $\sigma^2$ is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of the above assumptions are violated. Analogous to the normal regression model being based on the Gaussian probability distribution function (pdf), a binary response model is derived from a Bernoulli distribution, which is a subset of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as

$$f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \qquad (1)$$

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed 7Generalized linear models (GLM) and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model and may be estimated under its algorithm. The form of the exponential distribution appropriate for generalized linear models may be expressed as

$$f(y_i; \theta_i, \phi) = \exp\{(y_i \theta_i - b(\theta_i))/\alpha(\phi) + c(y_i, \phi)\} \qquad (2)$$

with $\theta$ representing the link function, $\alpha(\phi)$ the scale parameter, $b(\theta)$ the cumulant, and $c(y; \phi)$ the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to $\theta$, $b'(\theta)$; the variance is given as the second derivative, $b''(\theta)$. Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as

$$f(y_i; \pi_i) = \exp\{y_i \ln(\pi_i/(1 - \pi_i)) + \ln(1 - \pi_i)\} \qquad (3)$$

The link function is therefore $\ln(\pi/(1 - \pi))$, and the cumulant $-\ln(1 - \pi)$, or $\ln(1/(1 - \pi))$. For the Bernoulli, $\pi$ is defined as the probability of success. The first derivative of the cumulant is $\pi$, the second derivative $\pi(1 - \pi)$. These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.

Estimation of statistical models using the GLM algorithm, as well as MLE, are both based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf, which seeks to estimate $\pi$, for example, rather than $y$. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.
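As a quick symbolic check (an added illustration, not part of the original entry), the mean and variance functions quoted above follow from differentiating the Bernoulli cumulant $b(\theta) = \ln(1 + e^{\theta})$; a minimal sympy sketch:

    import sympy as sp

    theta = sp.symbols('theta')
    b = sp.ln(1 + sp.exp(theta))                  # Bernoulli cumulant
    pi = sp.exp(theta) / (1 + sp.exp(theta))      # inverse canonical link

    mean = sp.diff(b, theta)                      # b'(theta)
    var = sp.diff(b, theta, 2)                    # b''(theta)

    print(sp.simplify(mean - pi))                 # 0: the mean is pi
    print(sp.simplify(var - pi * (1 - pi)))       # 0: the variance is pi(1 - pi)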

The traditional GLM symbol for the mean, $\mu$, is typically substituted for $\pi$ when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as

$$L(\mu_i; y_i) = \sum_{i=1}^{n} \{y_i \ln(\mu_i/(1 - \mu_i)) + \ln(1 - \mu_i)\} \qquad (4)$$

or

$$L(\mu_i; y_i) = \sum_{i=1}^{n} \{y_i \ln(\mu_i) + (1 - y_i)\ln(1 - \mu_i)\} \qquad (5)$$

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2\sum_{i=1}^{n} \{y_i \ln(y_i/\mu_i) + (1 - y_i)\ln((1 - y_i)/(1 - \mu_i))\} \qquad (6)$$

Whether estimated using maximum likelihood techniques or as GLM, the value of $\mu$ for each observation in the model is calculated on the basis of the linear predictor $x'\beta$. For the normal model, the predicted fit $\hat{y}$ is identical to $x'\beta$, the right side of (7). However, for logistic models the response is expressed in terms of the link function $\ln(\mu_i/(1 - \mu_i))$. We have, therefore,

$$x_i'\beta = \ln(\mu_i/(1 - \mu_i)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (7)$$

The value of $\mu_i$ for each observation in the logistic model is calculated as

$$\mu_i = 1/(1 + \exp(-x_i'\beta)) = \exp(x_i'\beta)/(1 + \exp(x_i'\beta)) \qquad (8)$$

The functions to the right of $\mu$ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, $\mu$ is a probability.

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of $x'\beta$ rather than $\mu$. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to $\beta$, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to $\beta$ produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{n} (y_i - \mu_i)x_i \qquad (9)$$

$$\frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{n} x_i x_i' \,\mu_i(1 - \mu_i) \qquad (10)$$
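The following sketch (an added illustration using made-up data, not from the original entry) implements this Newton–Raphson iteration directly from the gradient (9) and Hessian (10); the standard errors come from the inverse of the negative Hessian.

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up data: an intercept plus one continuous predictor.
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([-0.5, 1.0])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta = np.zeros(2)
    for _ in range(25):                                  # Newton-Raphson iterations
        mu = 1 / (1 + np.exp(-X @ beta))                 # inverse link, Eq. (8)
        grad = X.T @ (y - mu)                            # gradient, Eq. (9)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X     # Hessian, Eq. (10)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < 1e-10:
            break

    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    print("estimates:", beta.round(3), " std errors:", se.round(3))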

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of $\mu$, $\ln(\mu/(1 - \mu))$, where the odds are understood as the success of $\mu$ over its failure, $1 - \mu$. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the Titanic accident, comparing the odds of survival for adult passengers to children. The data are tabulated as counts of survivors and non-survivors (age: child vs adult), cross-classifying survived (no/yes) against child and adult, with row and column totals.

The odds of survival for adult passengers is the number of adult survivors divided by the number of adult non-survivors; the odds of survival for children is obtained in the same way. The ratio of the odds of survival for adults to the odds of survival for children is the value referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

For the predictor age (with survived as response), the output lists the estimated odds ratio together with its standard error, $z$ statistic, $P > |z|$ value, and 95% confidence interval.

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

7 The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above, we may conclude that children had nearly two times greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as:

7 The odds of surviving is one and a half percent greater for each increasing year of age.
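A hand computation of an odds ratio and its Wald confidence interval from a 2x2 table can be sketched as follows (an added illustration; the counts are hypothetical, not the Titanic figures):

    import numpy as np

    # Hypothetical 2x2 table of survival by age group.
    surv_adult, died_adult = 450, 770
    surv_child, died_child = 55, 50

    odds_adult = surv_adult / died_adult
    odds_child = surv_child / died_child
    or_hat = odds_adult / odds_child
    print("odds ratio:", round(or_hat, 3))

    # Standard error of ln(OR) and a 95% Wald confidence interval.
    se_ln_or = np.sqrt(1/surv_adult + 1/died_adult + 1/surv_child + 1/died_child)
    lo, hi = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_ln_or)
    print("95% CI:", round(lo, 3), round(hi, 3))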

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating $\mu$.

Logistic regression may also be used for grouped or proportional data. For these models, the response consists of a numerator, indicating the number of successes (1s) for a specific covariate pattern, and the denominator ($m$), the number of observations having the specific covariate pattern. The response $y/m$ is binomially distributed as

$$f(y_i; \pi_i, m_i) = \binom{m_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i} \qquad (11)$$

with a corresponding log-likelihood function expressed as

$$L(\mu_i; y_i, m_i) = \sum_{i=1}^{n} \left\{ y_i \ln(\mu_i/(1 - \mu_i)) + m_i \ln(1 - \mu_i) + \ln\binom{m_i}{y_i} \right\} \qquad (12)$$

Taking derivatives of the cumulant, $-m_i \ln(1 - \mu_i)$, as we did for the binary response model, produces a mean of $\mu_i = m_i \pi_i$ and variance $\mu_i(1 - \mu_i/m_i)$.

Consider grouped data laid out in the following form:

y | cases | x1 | x2 | x3

Here $y$ indicates the number of times a specific pattern of covariates is successful, and cases is the number of observations having that specific covariate pattern. The first observation in the table informs us that there are three cases with its particular values of $x_1$, $x_2$, and $x_3$; of those three cases, one has a value of $y$ equal to 1, and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

The fitted model reports, for each of $x_1$, $x_2$, and $x_3$, an estimated odds ratio together with its OIM (observed information matrix) standard error, $z$ statistic, $P > |z|$ value, and 95% confidence interval.

The data in the above table may be restructured so that they are in individual observation format rather than grouped. The new table would have ten observations, following the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, many thousands of observations being grouped into only a handful of rows, or covariate patterns, as described above. Data in tables are nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data are classified. Rather than comparing observed with expected probabilities across a single set of levels, it is now preferred to construct tables of risk having different numbers of levels; if there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) (AIC) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for many years. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, Boca Raton
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions, and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

$$f(x) = \frac{1}{\tau} \cdot \frac{\exp\{-(x - \mu)/\tau\}}{[1 + \exp\{-(x - \mu)/\tau\}]^2}, \qquad -\infty < x < \infty \qquad (1)$$

and the cumulative distribution function is

$$F(x) = \frac{1}{1 + \exp\{-(x - \mu)/\tau\}}, \qquad -\infty < x < \infty$$

The distribution is symmetric about the mean $\mu$ and has variance $\tau^2 \pi^2/3$, so that when comparing the standard logistic distribution ($\mu = 0$, $\tau = 1$) with the standard normal distribution $N(0, 1)$ it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.
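The comparison is easy to see numerically (an added sketch using scipy's standard distributions): matching the normal's standard deviation to $\pi/\sqrt{3}$ makes the two densities nearly coincide except in the tails.

    import numpy as np
    from scipy.stats import logistic, norm

    # Standard logistic (mu = 0, tau = 1) has variance pi^2/3, so compare it
    # with a normal of standard deviation pi/sqrt(3).
    matched = norm(scale=np.pi / np.sqrt(3))
    x = np.linspace(-6, 6, 7)

    print(logistic.pdf(x).round(4))
    print(matched.pdf(x).round(4))                 # close, but lighter in the tails

    print(logistic.var(), matched.var())           # both ~3.29
    print(float(logistic.stats(moments='k')) + 3)  # kurtosis 4.2, versus 3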

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

$$S(x) = \frac{1}{1 + \exp\{(x - \mu)/\tau\}}, \qquad -\infty < x < \infty$$

$$h(x) = \frac{1}{\tau} \cdot \frac{1}{1 + \exp\{-(x - \mu)/\tau\}}$$

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for $\mu$, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times $T$ is to assume a logistic model for $\log T$, giving a log-logistic model for $T$, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

$$h(t) = \frac{(\alpha/\theta)(t/\theta)^{\alpha - 1}}{1 + (t/\theta)^{\alpha}}, \qquad t \geq 0,\ \theta, \alpha > 0$$

where $\theta = e^{\mu}$ and $\alpha = 1/\tau$. For $\alpha \leq 1$ the hazard is monotone decreasing, and for $\alpha > 1$ it has a single maximum, as for the lognormal distribution. Hazards of this form may be appropriate in the analysis of data such as heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.
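A quick numerical check of these two hazard shapes (an added sketch with arbitrary $\theta$ and $\alpha$):

    import numpy as np

    def loglogistic_hazard(t, theta, alpha):
        # h(t) = (alpha/theta) (t/theta)^(alpha-1) / (1 + (t/theta)^alpha)
        return (alpha / theta) * (t / theta)**(alpha - 1) / (1 + (t / theta)**alpha)

    t = np.linspace(0.01, 10, 1000)
    h_dec = loglogistic_hazard(t, theta=2.0, alpha=0.8)   # alpha <= 1: monotone decreasing
    h_hump = loglogistic_hazard(t, theta=2.0, alpha=3.0)  # alpha > 1: single interior maximum

    print(bool(np.all(np.diff(h_dec) < 0)))   # True
    print(float(t[np.argmax(h_hump)]))        # maximum away from the boundary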

For the standard logistic distribution ($\mu = 0$, $\tau = 1$), the probability density and the cumulative distribution functions are related through the very simple identity

$$f(x) = F(x)\,[1 - F(x)]$$

which in turn, by elementary calculus, implies that

$$\mathrm{logit}(F(x)) = \log_e\left[\frac{F(x)}{1 - F(x)}\right] = x \qquad (2)$$

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting $X = \log_e[U/(1 - U)]$, where $U \sim U(0, 1)$; for the general logistic distribution in (1) we take $\tau X + \mu$.

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, $P(d)$, at a particular dose level $d$ is modelled by a linear logit model

$$\mathrm{logit}(P(d)) = \log_e\left[\frac{P(d)}{1 - P(d)}\right] = \beta_0 + \beta_1 d$$

which, by the identity (2), implies a logistic tolerance distribution with parameters $\mu = -\beta_0/\beta_1$ and $\tau = 1/|\beta_1|$. The logit transformation is computationally convenient, and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected Member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the area of statistical modelling and statistical computing, and is coauthor of several books, including, most recently, Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve $L(p)$ is defined as a function of the proportion $p$ of the population. $L(p)$ is a curve starting from the origin and ending at the point $(1, 1)$, with the following additional properties: (I) $L(p)$ is monotone increasing; (II) $L(p) \leq p$; (III) $L(p)$ is convex; (IV) $L(0) = 0$ and $L(1) = 1$. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from $(0, 0)$ to $(1, 1)$. Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves $L_X(p)$ and $L_Y(p)$. If $L_X(p) \geq L_Y(p)$ for all $p$, then the distribution corresponding to $L_X(p)$ has lower inequality than the distribution corresponding to $L_Y(p)$, and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index $G$, introduced by Gini (1912).


Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, $L_X(p) \geq L_Y(p)$

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves; using the Gini index, $L_1(p)$ has greater inequality than $L_2(p)$

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality $0 \leq G \leq 1$. The higher the $G$ value, the greater the inequality in the income distribution.
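For a finite sample of incomes, the Lorenz curve and Gini index are easily computed (an added sketch; the lognormal parameters are arbitrary, and for a lognormal the exact value is $G = 2\Phi(\sigma/\sqrt{2}) - 1$):

    import numpy as np

    def lorenz_gini(income):
        # Empirical Lorenz curve points and Gini index for a sample of incomes.
        x = np.sort(np.asarray(income, dtype=float))
        p = np.arange(1, x.size + 1) / x.size        # population shares
        L = np.cumsum(x) / x.sum()                   # income shares L(p)
        # Gini: 1 - 2 * (area under the Lorenz curve), trapezoidal rule.
        area = np.trapz(np.concatenate([[0.0], L]), np.concatenate([[0.0], p]))
        return p, L, 1 - 2 * area

    rng = np.random.default_rng(4)
    incomes = rng.lognormal(mean=10, sigma=0.8, size=50_000)   # skewed, like income data
    p, L, gini = lorenz_gini(incomes)
    print(round(gini, 3))                            # ~0.43 for sigma = 0.8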

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem 1. Let $u(x)$ be a continuous monotone increasing function and assume that $\mu_Y = E(u(X))$ exists. Then the Lorenz curve $L_Y(p)$ for $Y = u(X)$ exists, and

(I) $L_Y(p) \geq L_X(p)$ if $u(x)/x$ is monotone decreasing;
(II) $L_Y(p) = L_X(p)$ if $u(x)/x$ is constant;
(III) $L_Y(p) \leq L_X(p)$ if $u(x)/x$ is monotone increasing.

For progressive taxation rules, $u(x)/x$ measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that $u(x)/x$ crosses the level $\mu_Y/\mu_X$ once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio $u(x)/x$ is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
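Case (I) of Theorem 1 is easy to demonstrate by simulation (an added sketch with a made-up progressive rule): with $u(x) = x^{0.8}$, the ratio $u(x)/x = x^{-0.2}$ is decreasing, so the Gini index falls.

    import numpy as np

    def gini(income):
        x = np.sort(np.asarray(income, dtype=float))
        n = x.size
        # Closed form: G = sum((2i - n - 1) x_i) / (n * sum(x_i)), x sorted ascending.
        return ((2 * np.arange(1, n + 1) - n - 1) @ x) / (n * x.sum())

    rng = np.random.default_rng(5)
    pre_tax = rng.lognormal(mean=10, sigma=0.8, size=50_000)
    post_tax = pre_tax**0.8                  # made-up progressive rule u(x) = x^0.8

    print(round(gini(pre_tax), 3), round(gini(post_tax), 3))   # inequality is reduced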

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics, of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods of measuring the concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory. The loss function judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss; the value of the loss function is then zero. Otherwise, the value gives the loss which is suffered by the decision unequal to the truth; the larger the value, the larger the loss which is suffered.

To describe this more exactly, let $\Theta$ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values $\theta \in \Theta$ is the true value. Each $d \in \Theta$ is a possible decision. The decision $d$ is chosen according to a rule, more exactly, according to a function with values in $\Theta$ defined on the set of all possible data. Since the true value $\theta$ is unknown, the loss function $L$ has to be defined on $\Theta \times \Theta$, i.e.,

$$L: \Theta \times \Theta \rightarrow [0, \infty)$$

The first variable describes the true value, say, and the second one the decision. Thus $L(\theta, a)$ is the loss which is suffered if $\theta$ is the true value and $a$ is the decision. Therefore, each (up to technical conditions) function $L: \Theta \times \Theta \rightarrow [0, \infty)$ with the property

$$L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta$$

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then $\Theta$ is divided into two disjoint subsets $\Theta_0$ and $\Theta_1$, describing the null hypothesis and the alternative: $\Theta = \Theta_0 + \Theta_1$. The usual loss function is then given by

$$L(\theta, \vartheta) = \begin{cases} 0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1 \\ 1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0 \end{cases}$$

For point estimation problems we assume that $\Theta$ is a normed linear space, and let $\|\cdot\|$ be its norm. Such a space is typical for estimating a location parameter. Then the loss $L(\theta, \vartheta) = \|\theta - \vartheta\|$, $\theta, \vartheta \in \Theta$, can be used. Next, let us consider the specific case $\Theta = \mathbb{R}$. Then $L(\theta, \vartheta) = \ell(\theta - \vartheta)$ is a typical form for loss functions, where $\ell: \mathbb{R} \rightarrow [0, \infty)$ is nonincreasing on $(-\infty, 0]$ and nondecreasing on $[0, \infty)$, with $\ell(0) = 0$; $\ell$ is also called a loss function. An important class of such functions is given by choosing $\ell(t) = |t|^p$, where $p > 0$ is a fixed constant. There are two prominent cases: for $p = 2$ we get the classical square loss, and for $p = 1$ the robust $L_1$-loss. Another class of robust losses are the famous Huber losses

$$\ell(t) = t^2 \ \text{ if } |t| \leq \gamma, \qquad \ell(t) = 2\gamma|t| - \gamma^2 \ \text{ if } |t| > \gamma,$$

where $\gamma > 0$ is a fixed constant. Up to now we have shown symmetrical losses, i.e., $L(\theta, \vartheta) = L(\vartheta, \theta)$. There are many problems in which underestimating the true value $\theta$ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses

$$\ell(t) = b(\exp(at) - at - 1)$$

where $a, b > 0$ can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
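The losses named above are straightforward to write down (an added sketch; the constants are arbitrary):

    import numpy as np

    # Loss functions l(t) of the error t.
    def square(t):
        return t**2

    def l1(t):
        return np.abs(t)

    def huber(t, gamma=1.0):
        # quadratic inside [-gamma, gamma], linear growth outside
        return np.where(np.abs(t) <= gamma, t**2, 2 * gamma * np.abs(t) - gamma**2)

    def linex(t, a=1.0, b=1.0):
        # asymmetric: exponential on one side of zero, roughly linear on the other
        return b * (np.exp(a * t) - a * t - 1)

    t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for loss in (square, l1, huber, linex):
        print(loss.__name__, loss(t).round(3))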

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let $\Theta = (0, \infty)$. Then it is usual to consider losses of the form $L(\theta, \vartheta) = \ell(\vartheta/\theta)$, where $\ell$ must be chosen suitably. It is, however, more convenient to write $\ell(\ln \vartheta - \ln \theta)$; then $\ell$ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space $\Theta$ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff.

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class $F$ of real-valued functions. Given $n$ observations (data) $y_1, \dots, y_n$, observed at design points $t_1, \dots, t_n$ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation', i.e., the function in $F$ that minimizes

$$\sum_{i=1}^{n} \ell(r_i^f), \qquad f \in F,$$

where $r_i^f = y_i - f(t_i)$ is the residual in the $i$th design point. Here $\ell$ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if $\ell(t) = t^2$.
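A minimal sketch of this use of loss functions in regression (an added illustration with made-up straight-line data; scipy's general-purpose minimizer stands in for a purpose-built fitting routine):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)

    # Made-up data: the class F is the straight lines f(t) = b0 + b1 t; one gross outlier.
    t = np.linspace(0, 1, 30)
    y = 1.0 + 2.0 * t + rng.normal(0, 0.1, size=30)
    y[-1] += 5.0

    def fit(loss):
        obj = lambda b: np.sum(loss(y - (b[0] + b[1] * t)))   # sum_i l(r_i^f)
        return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

    print("least squares:", fit(lambda r: r**2).round(2))   # pulled towards the outlier
    print("L1 loss      :", fit(np.abs).round(2))           # robust to the outlier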

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam, pp 195–208

Page 31: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

Logistic Regression L

L

Aitchison J Shen SM () Logistic-normal distributions someproperties and uses Biometrika ndash

McCulloch CE Searle SR () Generalized linear and mixedmodels Wiley New York

Williams D () Extra-binomial variation in logistic linearmodels Appl Stat ndash

Logistic Regression

JosephM HilbeEmeritus ProfessorUniversity of Hawaii Honolulu HI USAAdjunct Professor of StatisticsArizona State University Tempe AZ USASolar System AmbassadorCalifornia Institute of Technology PasadenaCA USA

Logistic regression is the most common method used tomodel binary response data When the response is binaryit typically takes the form of with generally indicat-ing a success and a failureHowever the actual values that and can take vary widely depending on the purpose ofthe study For example for a study of the odds of failure in aschool setting may have the value of fail and of not-failor pass e important point is that indicates the fore-most subject of interest forwhich a binary response study isdesigned Modeling a binary response variable using nor-mal linear regression introduces substantial bias into theparameter estimates e standard linear model assumesthat the response and error terms are normally or Gaus-sian distributed that the variance σ is constant acrossobservations and that observations in the model are inde-pendent When a binary variable is modeled using thismethod the rst two of the above assumptions are violatedAnalogical to the normal regression model being basedon the Gaussian probability distribution function (pdf )a binary response model is derived from a Bernoulli dis-tribution which is a subset of the binomial pdf with thebinomial denominator taking the value of e Bernoullipdf may be expressed as

f (yi πi) = π yii ( minus πi)minusyi ()

Binary logistic regression derives from the canonicalform of the Bernoulli distribution e Bernoulli pdf isa member of the exponential family of probability distri-butions which has properties allowing for a much easier

estimation of its parameters than traditional NewtonndashRaphson-based maximum likelihood estimation (MLE)methodsIn Nelder and Wedderbrun discovered that it

was possible to construct a single algorithm for estimat-ing models based on the exponential family of distri-butions e algorithm was termed 7Generalized linearmodels (GLM) and became a standard method to esti-mate binary response models such as logistic probit andcomplimentary-loglog regression count response mod-els such as Poisson and negative binomial regression andcontinuous response models such as gamma and inverseGaussian regressione standard normalmodel or Gaus-sian regression is also a generalized linear model andmaybe estimated under its algorithme formof the exponen-tial distribution appropriate for generalized linear modelsmay be expressed as

f (yi θ i ϕ) = exp(yiθ i minus b(θ i))α(ϕ) + c(yi ϕ) ()

with θ representing the link function α(ϕ) the scaleparameter b(θ) the cumulant and c(y ϕ) the normaliza-tion term which guarantees that the probability functionsums to e link a monotonically increasing func-tion linearizes the relationship of the expected mean andexplanatory predictors e scale for binary and countmodels is constrained to a value of and the cumulant isused to calculate the model mean and variance functionse mean is given as the rst derivative of the cumulantwith respect to θ bprime(θ) the variance is given as the secondderivative bprimeprime(θ) Taken together the above four termsdene a specic GLM modelWe may structure the Bernoulli distribution () into

exponential family form () as

f (yi πi) = expyi ln(πi( minus πi)) + ln( minus πi) ()

e link function is therefore ln(π(minusπ)) and cumu-lant minus ln( minus π) or ln(( minus π)) For the Bernoulli π isdened as the probability of successe rst derivative ofthe cumulant is π the second derivative π( minus π)esetwo values are respectively the mean and variance func-tions of the Bernoulli pdf Recalling that the logistic modelis the canonical form of the distribution meaning that it isthe form that is directly derived from the pdf the valuesexpressed in () and the values we gave for the mean andvariance are the values for the logistic modelEstimation of statistical models using the GLM algo-

rithm as well asMLE are both based on the log-likelihoodfunctione likelihood is simply a re-parameterization ofthe pdf which seeks to estimate π for example rather thanye log-likelihood is formed from the likelihood by tak-ing the natural log of the function allowing summation

L Logistic Regression

across observations during the estimation process ratherthan multiplication

e traditional GLM symbol for the mean micro is typ-ically substituted for π when GLM is used to estimate alogisticmodel In that form the log-likelihood function forthe binary-logistic model is given as

L(microi yi) =n

sumi=

yi ln(microi( minus microi)) + ln( minus microi) ()

or

L(microi yi) =n

sumi=

yi ln(microi) + ( minus yi) ln( minus microi) ()

e Bernoulli-logistic log-likelihood function is essen-tial to logistic regression When GLM is used to esti-mate logistic models many soware algorithms use thedeviance rather than the log-likelihood function as thebasis of convergencee deviance which can be used asa goodness-of-t statistic is dened as twice the dierenceof the saturated log-likelihood and model log-likelihoodFor logistic model the deviance is expressed as

D = n

sumi=

yi ln(yimicroi)+(minusyi) ln((minusyi)(minus microi)) ()

Whether estimated using maximum likelihood techniquesor asGLM the value of micro for each observation in themodelis calculated on the basis of the linear predictor xprimeβ For thenormal model the predicted t y is identical to xprimeβ theright side of () However for logistic models the responseis expressed in terms of the link function ln(microi( minus microi))We have therefore

xprimeiβ = ln(microi(minus microi)) = β + βx + βx +⋯+ βnxn ()

e value of microi for each observation in the logistic modelis calculated as

microi = ( + exp (minusxprimeiβ)) = exp (xprimeiβ) ( + exp (xprimeiβ)) ()

e functions to the right of micro are commonly used waysof expressing the logistic inverse link function which con-verts the linear predictor to the tted value For the logisticmodel micro is a probabilityWhen logistic regression is estimated using a Newton-

Raphson type ofMLE algorithm the log-likelihood func-tion as parameterized to xprimeβ rather than microe estimatedt is then determined by taking the rst derivative of thelog-likelihood function with respect to β setting it to zeroand solvinge rst derivative of the log-likelihood func-tion is commonly referred to as the gradient or scorefunctione second derivative of the log-likelihood withrespect to β produces the Hessian matrix from which thestandard errors of the predictor parameter estimates are

derived e logistic gradient and hessian functions aregiven as

partL(β)partβ

=n

sumi=

(yi minus microi)xi ()

partL(β)partβpartβprime

= minusn

sumi=

xixprimei microi( minus microi) ()

One of the primary values of using the logistic regressionmodel is the ability to interpret the exponentiated param-eter estimates as odds ratios Note that the link functionis the log of the odds of micro ln(micro( minus micro)) where the oddsare understood as the success of micro over its failure minus microe log-odds is commonly referred to as the logit functionAn example will help clarify the relationship as well as theinterpretation of the odds ratioWe use data from the Titanic accident compar-

ing the odds of survival for adult passengers to childrenA tabulation of the data is given as

Age (Child vs Adult)

Survived child adults Total

no

yes

Total

e odds of survival for adult passengers is ore odds of survival for children is or e ratio of the odds of survival for adults to the odds ofsurvival for children is ()() or is value is referred to as the odds ratio or ratio of the twocomponent odds relationships Using a logistic regressionprocedure to estimate the odds ratio of age produces thefollowing results

survivedOddsRatio

StdErr z Pgt ∣z∣

[ ConfInterval]

age minus

With = adult and = child the estimated odds ratiomay be interpreted as

7 The odds of an adult surviving were about half the odds of a

child surviving

By inverting the estimated odds ratio above we mayconclude that children had [sim ] some ndash or

Logistic Regression L

L

nearly two times ndash greater odds of surviving than didadultsFor continuous predictors a one-unit increase in a pre-

dictor value indicates the change in odds expressed bythe displayed odds ratio For example if age was recordedas a continuous predictor in the Titanic data and theodds ratio was calculated as we would interpret therelationship as

7 Theoddsof surviving isoneandahalfpercentgreater for each

increasing year of age

Non-exponentiated logistic regression parameter esti-mates are interpreted as log-odds relationships whichcarry little meaning in ordinary discourse Logistic mod-els are typically interpreted in terms of odds ratios unlessa researcher is interested in estimating predicted prob-abilities for given patterns of model covariates ie inestimating microLogistic regression may also be used for grouped or

proportional data For these models the response consistsof a numerator indicating the number of successes (s)for a specic covariate pattern and the denominator (m)the number of observations having the specic covariatepatterne response ym is binomially distributed as

f (yi πimi) = (miyi

)π yii ( minus πi)miminusyi ()

with a corresponding log-likelihood function expressed as

L(microi yimi) =n

sumi=

yi ln(microi( minus microi)) +mi ln( minus microi)

+ (miyi

) ()

Taking derivatives of the cumulant minusmi ln( minus microi) aswe did for the binary response model produces a mean ofmicroi = miπi and variance microi( minus microimi)Consider the data below

y cases x x x

y indicates the number of times a specic pattern of covari-ates is successful Cases is the number of observations

having the specic covariate patterne rst observationin the table informs us that there are three cases havingpredictor values of x = x = and x = Of thosethree cases one has a value of y equal to the other twohave values of All current commercial soware appli-cations estimate this type of logistic model using GLMmethodology

yOddsratio

OIMstd err z P gt ∣z∣

[ confinterval]

x

x minus

x minus

e data in the above table may be restructured so thatit is in individual observation format rather than groupede new table would have ten observations having thesame logic as described Modeling would result in iden-tical parameter estimates It is not uncommon to nd anindividual-based data set of for example observa-tions being grouped into ndash rows or observations asabove described Data in tables is nearly always expressedin grouped formatLogisticmodels are subject to a variety of t tests Some

of the more popular tests include the Hosmer-Lemeshowgoodness-of-t test ROC analysis various informationcriteria tests link tests and residual analysiseHosmerndashLemeshow test once well used is now only used withcaution e test is heavily inuenced by the manner inwhich tied data is classied Comparing observed withexpected probabilities across levels it is now preferred toconstruct tables of risk having dierent numbers of lev-els If there is consistency in results across tables then thestatistic is more trustworthyInformation criteria tests eg Akaike informationCri-

teria (see 7Akaikersquos Information Criterion and 7AkaikersquosInformation Criterion Background Derivation Proper-ties and Renements) (AIC) and Bayesian InformationCriteria (BIC) are the most used of this type of test Infor-mation tests are comparative with lower values indicatingthe preferred model Recent research indicates that AICand BIC both are biased when data is correlated to anydegree Statisticians have attempted to develop enhance-ments of these two tests but have not been entirely suc-cessfule best advice is to use several dierent types oftests aiming for consistency of resultsSeveral types of residual analyses are typically recom-

mended for logistic models e references below pro-vide extensive discussion of these methods together with

L Logistic Distribution

appropriate caveats However it appears well establishedthat m-asymptotic residual analyses is most appropri-ate for logistic models having no continuous predictorsm-asymptotics is based on grouping observations withthe same covariate pattern in a similar manner to thegrouped or binomial logistic regression discussed earliere Hilbe () and Hosmer and Lemeshow () ref-erences below provide guidance on how best to constructand interpret this type of residualLogistic models have been expanded to include cat-

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall/CRC), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2003, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and, with Robert Muenchen, is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software.

Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
►Case-Control Studies
►Categorical Data Analysis
►Generalized Linear Models
►Multivariate Data Analysis: An Overview
►Probit Analysis
►Recursive Partitioning
►Regression Models with Symmetrical Errors
►Robust Regression Estimation in Generalized Linear Models
►Statistics: An Overview
►Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution, but with somewhat heavier tails. The distribution has applications in reliability and survival analysis.


The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

f(x) = \frac{1}{\tau}\,\frac{\exp\{-(x-\mu)/\tau\}}{\left[1+\exp\{-(x-\mu)/\tau\}\right]^{2}}, \qquad -\infty < x < \infty \qquad (1)

and the cumulative distribution function is

F(x) = \left[1+\exp\{-(x-\mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1), it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.
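These moments are easy to confirm numerically; a small sketch with scipy (the π/√3 rescaling gives the normal the same variance as the standard logistic):

```python
import numpy as np
from scipy.stats import logistic, norm

# standard logistic: variance pi^2/3, excess kurtosis 1.2 (kurtosis 4.2)
print(logistic.var())                # ~3.2899 = pi**2 / 3
print(logistic.stats(moments='k'))   # 1.2 excess over the normal's 3

# densities after matching variances: very similar, logistic heavier in the tails
x = np.array([0.0, 2.0, 4.0, 6.0])
print(logistic.pdf(x))
print(norm.pdf(x, scale=np.pi / np.sqrt(3)))
```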

In survival analysis, one advantage of the logistic distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

S(x) = \left[1+\exp\{(x-\mu)/\tau\}\right]^{-1}, \qquad -\infty < x < \infty,

h(x) = \frac{1}{\tau}\,\left[1+\exp\{-(x-\mu)/\tau\}\right]^{-1}.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

h(t) = \frac{\alpha}{\theta}\,\frac{(t/\theta)^{\alpha-1}}{1+(t/\theta)^{\alpha}}, \qquad t \ge 0,\; \theta, \alpha > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as

heart transplant survival: there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.
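A quick sketch of the two hazard regimes (the helper function is ours, with θ = 1 chosen purely for illustration):

```python
import numpy as np

def loglogistic_hazard(t, theta, alpha):
    # h(t) = (alpha/theta) * (t/theta)**(alpha - 1) / (1 + (t/theta)**alpha)
    return (alpha / theta) * (t / theta) ** (alpha - 1) / (1 + (t / theta) ** alpha)

t = np.linspace(0.1, 5.0, 6)
print(loglogistic_hazard(t, theta=1.0, alpha=0.8))  # alpha <= 1: monotone decreasing
print(loglogistic_hazard(t, theta=1.0, alpha=2.0))  # alpha > 1: single maximum
```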

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

f(x) = F(x)\,[1-F(x)],

which in turn, by elementary calculus, implies that

\mathrm{logit}(F(x)) = \log_e\!\left[\frac{F(x)}{1-F(x)}\right] = x \qquad (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = log_e[U/(1 − U)], where U ~ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
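A minimal sketch of that simulator:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
x = np.log(u / (1 - u))            # standard logistic via the logit identity

mu, tau = 2.0, 0.5                 # illustrative location and scale
sample = tau * x + mu
print(sample.mean())               # ~mu
print(sample.var())                # ~tau**2 * pi**2 / 3
```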

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on ►probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response P(d) at a particular dose level d is modelled by a linear logit model

\mathrm{logit}(P(d)) = \log_e\!\left[\frac{P(d)}{1-P(d)}\right] = \beta_0 + \beta_1 d,

which by the identity (2) implies a logistic tolerance distribution with parameters μ = −β_0/β_1 and τ = 1/|β_1|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale, and for a two-level factor the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability;


this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of ►generalized linear models. A comprehensive treatment of ►logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).
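As a sketch of such a dose-response fit and the tolerance-distribution identities above (the bioassay counts below are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical quantal bioassay: r responders out of n subjects at each dose d
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = np.array([40, 40, 40, 40, 40])
r = np.array([4, 12, 21, 30, 36])

fit = sm.GLM(np.column_stack([r, n - r]), sm.add_constant(d),
             family=sm.families.Binomial()).fit()   # default link: logit
b0, b1 = fit.params
print(-b0 / b1)        # tolerance distribution location mu
print(1 / abs(b1))     # tolerance distribution scale tau
```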

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected Member of the International Statistical Institute. He has authored or coauthored numerous papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
►Asymptotic Relative Efficiency in Estimation
►Asymptotic Relative Efficiency in Testing
►Bivariate Distributions
►Location-Scale Distributions
►Logistic Regression
►Multivariate Statistical Distributions
►Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the Lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies, more wide-ranging tools have to be applied, the first and most common of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

► Plot along one axis accumulated percentages of the population from poorest to richest, and along the other, wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

► A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves L_X(p) and L_Y(p). If L_X(p) ≥ L_Y(p) for all p, then the distribution corresponding to L_X(p) has lower inequality than the distribution corresponding to L_Y(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves with this ordering.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances, alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912).


Lorenz Curve. Fig. 1 Lorenz curves with Lorenz ordering, that is, L_X(p) ≥ L_Y(p)

Lorenz Curve. Fig. 2 Two intersecting Lorenz curves. Using the Gini index, L_1(p) has greater inequality (a larger G) than L_2(p)

This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1.

The higher the G value, the greater the inequality in the income distribution.
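A small sketch computing an empirical Lorenz curve and the corresponding Gini index from a sample of incomes (the function and variable names are ours, not from the article):

```python
import numpy as np

def lorenz_gini(incomes):
    x = np.sort(np.asarray(incomes, dtype=float))
    p = np.arange(1, x.size + 1) / x.size      # cumulative population share
    L = np.cumsum(x) / x.sum()                 # cumulative income share L(p)
    # Gini = 1 - 2 * (area under the Lorenz curve), via the trapezoid rule
    p0, L0 = np.concatenate([[0.0], p]), np.concatenate([[0.0], L])
    gini = 1.0 - np.sum((p0[1:] - p0[:-1]) * (L0[1:] + L0[:-1]))
    return p, L, gini

p, L, G = lorenz_gini([1, 2, 3, 4, 10])
print(G)   # 0 under perfect equality; approaches 1 with extreme inequality
```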

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977).

Theorem. Let u(x) be a continuous, monotone increasing function, and assume that μ_Y = E(u(X)) exists. Then the Lorenz curve L_Y(p) for Y = u(X) exists, and:

(I) L_Y(p) ≥ L_X(p) if u(x)/x is monotone decreasing;
(II) L_Y(p) = L_X(p) if u(x)/x is constant;
(III) L_Y(p) ≤ L_X(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for Lorenz dominance, which is that u(x)/x crosses the μ_Y/μ_X level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance; a numerical illustration of the first two cases is sketched below. A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
►Econometrics
►Economic Statistics
►Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ►Decision Theory: An Introduction and ►Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise, the value gives the loss suffered by a decision unequal to the truth. The larger the value, the larger the loss suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information through data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ and

defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : \Theta \times \Theta \rightarrow [0, \infty).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, ϑ) is the loss suffered if θ is the true value and ϑ is the decision. Therefore, each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(\theta, \theta) = 0 \quad \text{for all } \theta \in \Theta

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ_0 and Θ_1, describing the null hypothesis and the alternative: Θ = Θ_0 + Θ_1. The usual loss function is then given by

L(\theta, \vartheta) =
\begin{cases}
0 & \text{if } \theta, \vartheta \in \Theta_0 \text{ or } \theta, \vartheta \in \Theta_1, \\
1 & \text{if } \theta \in \Theta_0, \vartheta \in \Theta_1 \text{ or } \theta \in \Theta_1, \vartheta \in \Theta_0.
\end{cases}

For point estimation problems we assume that Θ is a normed linear space, and let ∥·∥ be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞), with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical squared loss, and for p = 1 the robust L_1-loss. Another class of robust losses are the famous Huber losses

\ell(t) = t^{2} \;\text{if } |t| \le \gamma, \qquad \ell(t) = 2\gamma|t| - \gamma^{2} \;\text{if } |t| > \gamma,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems, Varian (1975) introduced LinEx losses

\ell(t) = b\,(\exp(at) - at - 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
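A compact sketch of the losses just named (the Huber cutoff γ = 1.345 is a conventional choice from robust statistics, not from this article):

```python
import numpy as np

def squared(t):                      # l(t) = |t|**p with p = 2
    return t ** 2

def l1(t):                           # p = 1: the robust L1-loss
    return np.abs(t)

def huber(t, gamma=1.345):           # t**2 near zero, linear in the tails
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma,
                    t ** 2,
                    2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):          # asymmetric: exponential vs linear
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-3, 3, 7)
print(huber(t))
print(linex(t))                      # note the asymmetry around zero
```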

For other estimation problems, corresponding losses are used. For instance, let us consider the estimation of a


scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically, it was assumed that the loss is convex (see ►Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counter-intuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y_1, …, y_n, observed at design points t_1, …, t_n of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation",

i.e., the function in F that minimizes ∑_{i=1}^n ℓ(r_i^f), f ∈ F,

where r_i^f = y_i − f(t_i) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
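As a sketch, minimizing this criterion numerically over a line f(t) = b_0 + b_1 t for two choices of ℓ (the data are simulated with heavy-tailed noise; everything here is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * t + 0.3 * rng.standard_t(df=2, size=t.size)  # heavy tails

def best_fit(loss):
    # choose f in F = {b0 + b1 * t} minimizing sum_i loss(y_i - f(t_i))
    objective = lambda b: np.sum(loss(y - (b[0] + b[1] * t)))
    return minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead").x

print(best_fit(lambda r: r ** 2))   # least squares
print(best_fit(np.abs))             # L1: less affected by the outliers
```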

Cross References
►Advantages of Bayesian Structuring: Estimating Ranks and Histograms
►Bayesian Statistics
►Decision Theory: An Introduction
►Decision Theory: An Overview
►Entropy and Cross Entropy as Diversity and Distance Measures
►Sequential Sampling
►Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam

                                    • Life Table
                                      • About the Author
                                      • Cross References
                                      • References and Further Reading
                                        • Likelihood
                                          • Introduction
                                          • Inference
                                          • Nuisance Parameters
                                          • Extensions to Likelihood
                                          • Further Reading
                                          • About the Author
                                          • Cross References
                                          • References and Further Reading
                                            • Limit Theorems of Probability Theory
                                              • About the Author
                                              • Cross References
                                              • References and Further Reading
                                                • Linear Mixed Models
                                                  • About the Author
                                                  • Cross References
                                                  • References and Further Reading
                                                    • Linear Regression Models
                                                      • Aspects of Data Model and Notation
                                                      • Terminology
                                                      • General Remarks on Fitting Linear Regression Models to Data
                                                      • Background
                                                      • Aspects of Linear Regression Models
                                                      • Classical Linear Regression Modeland Least Squares
                                                      • Algebraic Properties of Least Squares
                                                      • Generalized Least Squares
                                                      • Reproducing Kernel Generalized Linear Regression Model
                                                      • Joint Normal Distributed (x y) as Motivation for the Linear Regression Model and Least Squares
                                                      • Comparing Two Basic Linear Regression Models
                                                      • Balancing Good Fit Against Reproducibility
                                                      • Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
                                                      • Cross References
                                                      • References and Further Reading
                                                        • Local Asymptotic Mixed Normal Family
                                                          • About the Author
                                                          • Cross References
                                                          • References and Further Reading
                                                            • Location-Scale Distributions
                                                              • About the Author
                                                              • Cross References
                                                              • References and Further Reading
                                                                • Logistic Normal Distribution
                                                                  • About the Author
                                                                  • Cross References
                                                                  • References and Further Reading
                                                                    • Logistic Regression
                                                                      • About the Author
                                                                      • Cross References
                                                                      • References and Further Reading
                                                                        • Logistic Distribution
                                                                          • About the Author
                                                                          • Cross References
                                                                          • References and Further Reading
                                                                            • Lorenz Curve
                                                                              • Definition and Properties
                                                                              • Income Redistributions
                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 33: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

Logistic Regression L

L

nearly two times – greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data and the odds ratio was calculated as 1.015, we would interpret the relationship as

7 The odds of surviving is one and a half percent greater for each increasing year of age.

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates, i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator y, indicating the number of successes for a specific covariate pattern, and the denominator m, the number of observations having the specific covariate pattern. The response y/m is binomially distributed as

    f(yᵢ; πᵢ, mᵢ) = C(mᵢ, yᵢ) πᵢ^yᵢ (1 − πᵢ)^(mᵢ−yᵢ),        (1)

where C(mᵢ, yᵢ) denotes the binomial coefficient, with a corresponding log-likelihood function expressed as

    L(μᵢ; yᵢ, mᵢ) = Σᵢ₌₁ⁿ [yᵢ ln(μᵢ/(1 − μᵢ)) + mᵢ ln(1 − μᵢ) + ln C(mᵢ, yᵢ)]        (2)

Taking derivatives of the cumulant −mᵢ ln(1 − μᵢ), as we did for the binary response model, produces a mean of μᵢ = mᵢπᵢ and variance μᵢ(1 − μᵢ/mᵢ).
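As a check on that claim, writing the binomial in exponential-family form with canonical parameter θᵢ = ln(πᵢ/(1 − πᵢ)) gives the cumulant and its first two derivatives:

    b(θᵢ) = mᵢ ln(1 + e^θᵢ) = −mᵢ ln(1 − πᵢ),
    b′(θᵢ) = mᵢ e^θᵢ / (1 + e^θᵢ) = mᵢπᵢ = μᵢ,
    b″(θᵢ) = mᵢπᵢ(1 − πᵢ) = μᵢ(1 − μᵢ/mᵢ),

so the second derivative reproduces the stated variance.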

Consider the data below:

y   cases   x1   x2   x3

y indicates the number of times a specific pattern of covariates is successful; cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having a common pattern of values for x1, x2, and x3; of those three cases, one has a value of y equal to 1 and the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

Fitting this model produces the usual table of estimated odds ratios for x1, x2, and x3, together with their OIM standard errors, z statistics, p-values, and confidence intervals.
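As an illustration of the GLM approach, here is a minimal Python sketch; the grouped data are hypothetical and the use of the statsmodels package is an assumption (the entry names no particular software). statsmodels accepts a two-column (successes, failures) response for binomial models:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical grouped data: y successes out of `cases` trials
    # for each covariate pattern (x1, x2, x3).
    y     = np.array([1, 2, 1, 2, 3, 2])
    cases = np.array([3, 4, 3, 4, 5, 4])
    X     = np.array([[0, 0, 0],
                      [1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1],
                      [1, 1, 0],
                      [1, 0, 1]], dtype=float)

    X = sm.add_constant(X)                     # intercept column
    resp = np.column_stack([y, cases - y])     # (successes, failures)
    fit = sm.GLM(resp, X, family=sm.families.Binomial()).fit()

    print(np.exp(fit.params[1:]))              # odds ratios for x1, x2, x3

Expanding each grouped row into `cases` individual 0/1 observations and refitting gives identical parameter estimates, which is the point made in the next paragraph.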

The data in the above table may be restructured so that it is in individual observation format rather than grouped. The new table would have ten observations, with the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find a large individual-based data set grouped in this way into only a handful of covariate-pattern rows, as above described. Data in tables is nearly always expressed in grouped format.

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Comparing observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC; see 7Akaike's Information Criterion and 7Akaike's Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author
Joseph M. Hilbe is an emeritus professor, University of Hawaii, and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory at the California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association. Hilbe is author of Negative Binomial Regression (Cambridge University Press) and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving for many years as Software Reviews Editor for The American Statistician. He was founding editor of the Stata Technical Bulletin (1991) and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico, two years following his induction into the University's Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References
7Case-Control Studies
7Categorical Data Analysis
7Generalized Linear Models
7Multivariate Data Analysis: An Overview
7Probit Analysis
7Recursive Partitioning
7Regression Models with Symmetrical Errors
7Robust Regression Estimation in Generalized Linear Models
7Statistics: An Overview
7Target Estimation: A New Approach to Parametric Estimation

References and Further Reading
Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression: a self-learning text. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Logistic Distribution

John Hinde
Professor of Statistics
National University of Ireland, Galway, Ireland

The logistic distribution is a location-scale family distribution with a very similar shape to the normal (Gaussian) distribution but with somewhat heavier tails. The distribution has applications in reliability and survival analysis. The cumulative distribution function has been used for modelling growth functions and as a tolerance distribution in the analysis of binary data, leading to the widely used logit model. For a detailed discussion of the properties of the logistic and related distributions, see Johnson et al. (1995).

The probability density function is

    f(x) = (1/τ) exp{−(x − μ)/τ} / [1 + exp{−(x − μ)/τ}]²,    −∞ < x < ∞,        (1)

and the cumulative distribution function is

    F(x) = [1 + exp{−(x − μ)/τ}]⁻¹,    −∞ < x < ∞.

The distribution is symmetric about the mean μ and has variance τ²π²/3, so that when comparing the standard logistic distribution (μ = 0, τ = 1) with the standard normal distribution N(0, 1) it is important to allow for the different variances. The suitably scaled logistic distribution has a very similar shape to the normal, although the kurtosis is 4.2, which is somewhat larger than the value of 3 for the normal, indicating the heavier tails of the logistic distribution.

In survival analysis, one advantage of the logistic

distribution over the normal is that both right- and left-censoring can be easily handled. The survivor and hazard functions are given by

    S(x) = [1 + exp{(x − μ)/τ}]⁻¹,    −∞ < x < ∞,
    h(x) = (1/τ)[1 + exp{−(x − μ)/τ}]⁻¹.

The hazard function has the same logistic form and is monotonically increasing, so the model is only appropriate for ageing systems with an increasing failure rate over time. In modelling the dependence of failure times on explanatory variables, if we use a linear regression model for μ, then the fitted model has an accelerated failure time interpretation for the effect of the variables. Fitting of this model to right- and left-censored data is described in Aitkin et al. (2009).

One obvious extension for modelling failure times T

is to assume a logistic model for log T, giving a log-logistic model for T, analogous to the lognormal model. The resulting hazard function, based on the logistic distribution in (1), is

    h(t) = (α/θ)(t/θ)^(α−1) / [1 + (t/θ)^α],    t, θ, α > 0,

where θ = e^μ and α = 1/τ. For α ≤ 1 the hazard is monotone decreasing, and for α > 1 it has a single maximum, as for the lognormal distribution; hazards of this form may be appropriate in the analysis of data such as

heart transplant survival – there may be an initial period of increasing hazard associated with rejection, followed by decreasing hazard as the patient survives the procedure and the transplanted organ is accepted.

For the standard logistic distribution (μ = 0, τ = 1), the probability density and the cumulative distribution functions are related through the very simple identity

    f(x) = F(x)[1 − F(x)],

which in turn, by elementary calculus, implies that

    logit(F(x)) = logₑ[F(x)/(1 − F(x))] = x,        (2)

and uniquely characterizes the standard logistic distribution. Equation (2) provides a very simple way of simulating from the standard logistic distribution, by setting X = logₑ[U/(1 − U)], where U ∼ U(0, 1); for the general logistic distribution in (1) we take τX + μ.
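A short Python sketch of this simulation recipe; the values of μ and τ are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, tau = 2.0, 0.5                       # illustration values (assumed)
    u = rng.uniform(size=100_000)
    x = mu + tau * np.log(u / (1.0 - u))     # tau*X + mu, X standard logistic

    print(x.mean())                          # close to mu
    print(x.var(), tau**2 * np.pi**2 / 3)    # close to the theoretical variance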

The logit transformation is now very familiar in modelling probabilities for binary responses. Its use goes back to Berkson (1944), who suggested the use of the logistic distribution to replace the normal distribution as the underlying tolerance distribution in quantal bio-assays, where various dose levels are given to groups of subjects (animals) and a simple binary response (e.g., cure, death, etc.) is recorded for each individual (giving r-out-of-n type response data for the groups). The use of the normal distribution in this context had been pioneered by Finney through his work on 7probit analysis, and the same methods mapped across to the logit analysis; see Finney (1978) for a historical treatment of this area. The probability of response, P(d), at a particular dose level d is modelled by a linear logit model,

    logit(P(d)) = logₑ[P(d)/(1 − P(d))] = β₀ + β₁d,

which, by the identity (2), implies a logistic tolerance distribution with parameters μ = −β₀/β₁ and τ = 1/|β₁|. The logit transformation is computationally convenient and has the nice interpretation of modelling the log-odds of a response. This goes across to general logistic regression models for binary data, where parameter effects are on the log-odds scale and, for a two-level factor, the fitted effect corresponds to a log-odds-ratio. Approximate methods for parameter estimation involve using the empirical logits of the observed proportions. However, maximum likelihood estimates are easily obtained with standard generalized linear model fitting software, using a binomial response distribution with a logit link function for the response probability; this uses the iteratively reweighted least-squares Fisher-scoring algorithm of Nelder and Wedderburn (1972), although Newton-based algorithms for maximum likelihood estimation of the logit model appeared well before the unifying treatment of 7generalized linear models. A comprehensive treatment of 7logistic regression, including models and applications, is given in Agresti (2002) and Hilbe (2009).
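A sketch of such a logit fit, assuming simulated r-out-of-n dose-response data and Python's statsmodels (both assumptions, not part of the entry); it recovers the tolerance-distribution parameters from the fitted coefficients:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical quantal bio-assay: r responders out of n subjects per dose.
    dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    n    = np.array([40, 40, 40, 40, 40])
    r    = np.array([4, 10, 21, 30, 37])

    X = sm.add_constant(dose)
    fit = sm.GLM(np.column_stack([r, n - r]), X,
                 family=sm.families.Binomial()).fit()   # logit link is the default

    b0, b1 = fit.params
    print("mu  =", -b0 / b1)        # tolerance distribution location
    print("tau =", 1.0 / abs(b1))   # tolerance distribution scale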

About the Author
John Hinde is Professor of Statistics, School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland. He is past President of the Irish Statistical Association, the Statistical Modelling Society, and the European Regional Section of the International Association for Statistical Computing. He has been an active member of the International Biometric Society and is the incoming President of its British and Irish Region. He is an Elected Member of the International Statistical Institute. He has authored or coauthored many papers, mainly in the areas of statistical modelling and statistical computing, and is coauthor of several books, including most recently Statistical Modelling in R (with M. Aitkin, B. Francis and R. Darnell, Oxford University Press, 2009). He is currently an Associate Editor of Statistics and Computing and was one of the joint Founding Editors of Statistical Modelling.

Cross References
7Asymptotic Relative Efficiency in Estimation
7Asymptotic Relative Efficiency in Testing
7Bivariate Distributions
7Location-Scale Distributions
7Logistic Regression
7Multivariate Statistical Distributions
7Statistical Distributions: An Overview

References and Further Reading
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, New York
Aitkin M, Francis B, Hinde J, Darnell R (2009) Statistical modelling in R. Oxford University Press, Oxford
Berkson J (1944) Application of the logistic function to bio-assay. J Am Stat Assoc 39:357–365
Finney DJ (1978) Statistical method in biological assay, 3rd edn. Griffin, London
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous univariate distributions, vol 2, 2nd edn. Wiley, New York
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc A 135:370–384

Lorenz Curve

Johan Fellman
Professor Emeritus
Folkhälsan Institute of Genetics, Helsinki, Finland

Definition and Properties
It is a general rule that income distributions are skewed. Although various distribution models, such as the lognormal and the Pareto, have been proposed, they are usually applied in specific situations. For general studies more wide-ranging tools have to be applied, the first and most common tool of which is the Lorenz curve. Lorenz (1905) developed it in order to analyze the distribution of income and wealth within populations, describing it in the following way:

7 Plot along one axis accumulated percentages of the population from poorest to richest, and along the other wealth held by these percentages of the population.

The Lorenz curve L(p) is defined as a function of the proportion p of the population. L(p) is a curve starting from the origin and ending at the point (1, 1), with the following additional properties: (I) L(p) is monotone increasing; (II) L(p) ≤ p; (III) L(p) is convex; (IV) L(0) = 0 and L(1) = 1. The Lorenz curve is convex because the income share of the poor is less than their proportion of the population (Fig. 1).

The Lorenz curve satisfies the general rules:

7 A unique Lorenz curve corresponds to every distribution. The contrary does not hold, but every Lorenz curve L(p) is a common curve for a whole class of distributions F(θx), where θ is an arbitrary constant.

The higher the curve, the less inequality in the income distribution. If all individuals receive the same income, then the Lorenz curve coincides with the diagonal from (0, 0) to (1, 1). Increasing inequality lowers the Lorenz curve, which can converge towards the lower right corner of the square.

Consider two Lorenz curves LX(p) and LY(p). If LX(p) ≥ LY(p) for all p, then the distribution corresponding to LX(p) has lower inequality than the distribution corresponding to LY(p), and is said to Lorenz dominate the other. Figure 1 shows an example of Lorenz curves.

The inequality can be of a different type; the corresponding Lorenz curves may intersect, and for these no Lorenz ordering holds. This case is seen in Fig. 2. Under such circumstances alternative inequality measures have to be defined, the most frequently used being the Gini index G, introduced by Gini (1912). This index is the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.

Lorenz Curve Fig. 1  Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve Fig. 2  Two intersecting Lorenz curves L1(p) and L2(p); their inequality can be ranked by the Gini index but not by Lorenz ordering
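As an illustration, a small Python sketch computing an empirical Lorenz curve and the Gini index from a sample of incomes; the lognormal sample is a hypothetical example:

    import numpy as np

    def lorenz_gini(income):
        """Empirical Lorenz curve points and Gini index for a sample of incomes."""
        x = np.sort(income)
        n = len(x)
        p = np.arange(0, n + 1) / n                          # 0, 1/n, ..., 1
        L = np.concatenate([[0.0], np.cumsum(x) / x.sum()])  # L(0)=0, ..., L(1)=1
        gini = 1.0 - 2.0 * np.trapz(L, p)                    # G = 1 - 2 * area under L
        return p, L, gini

    income = np.random.default_rng(0).lognormal(sigma=1.0, size=100_000)
    _, _, g = lorenz_gini(income)
    print(round(g, 3))   # about 0.52 for a lognormal with sigma = 1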

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies, findings based on the following general theorem (Fellman 1976; Jakobsson 1976; Kakwani 1977):

Theorem 1. Let u(x) be a continuous monotone increasing function, and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μY/μX level once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
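A numerical sketch of cases (I) and (II), assuming a hypothetical lognormal income sample and the transfer rule u(x) = x^0.8, chosen only for illustration:

    import numpy as np

    def gini(v):
        # G = sum_i (2i - n - 1) * x_(i) / (n * sum(x)) for sorted incomes
        v = np.sort(v)
        n = len(v)
        return (2.0 * np.arange(1, n + 1) - n - 1) @ v / (n * v.sum())

    x = np.random.default_rng(0).lognormal(sigma=1.0, size=50_000)

    # u(x) = x**0.8: u(x)/x = x**(-0.2) is monotone decreasing -> case (I)
    print(gini(x), gini(x**0.8))      # the Gini index is reduced
    # u(x) = 0.7*x (flat tax): u(x)/x is constant -> case (II)
    print(gini(0.7 * x))              # identical to gini(x)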

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis "On the Allocation of Linear Observations", and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define

and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered because the decision differs from the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set

of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

    L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

    L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First, let us consider a test problem. Then Θ is divided into two disjoint subsets Θ₀ and Θ₁, describing the null hypothesis and the alternative set: Θ = Θ₀ + Θ₁. Then the usual loss function is given by

    L(θ, ϑ) = 0  if θ, ϑ ∈ Θ₀ or θ, ϑ ∈ Θ₁,
    L(θ, ϑ) = 1  if θ ∈ Θ₀, ϑ ∈ Θ₁ or θ ∈ Θ₁, ϑ ∈ Θ₀.

For point estimation problems we assume that Θ is a normed linear space, and let ∥·∥ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∥θ − ϑ∥, θ, ϑ ∈ Θ, can be used. Next, let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses,

    ℓ(t) = t²  if |t| ≤ γ,   and   ℓ(t) = 2γ|t| − γ²  if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses,

    ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
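For concreteness, a small Python sketch of the loss functions just described; the parameter values are arbitrary illustrations:

    import numpy as np

    def huber(t, gamma=1.0):
        # t**2 inside [-gamma, gamma]; 2*gamma*|t| - gamma**2 outside
        return np.where(np.abs(t) <= gamma, t**2, 2*gamma*np.abs(t) - gamma**2)

    def linex(t, a=1.0, b=1.0):
        # b*(exp(a*t) - a*t - 1): exponential for t > 0, roughly linear for t < 0
        return b * (np.exp(a * t) - a * t - 1.0)

    t = np.linspace(-3.0, 3.0, 7)
    print(np.abs(t)**2)     # square loss, p = 2
    print(np.abs(t))        # L1 loss, p = 1
    print(huber(t))
    print(linex(t))         # note the asymmetry between t and -t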

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter, and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see the Bischoff reference below.

In case a density of the underlying distribution of the

data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y₁, …, yₙ, observed at design points t₁, …, tₙ of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the 'best approximation', i.e., the function in F that minimizes

    Σᵢ₌₁ⁿ ℓ(rᵢᶠ),    f ∈ F,

where rᵢᶠ = yᵢ − f(tᵢ) is the residual at the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
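A minimal sketch of this 'best approximation' view of regression, assuming hypothetical straight-line data with one outlier and SciPy's general-purpose optimizer (the entry prescribes no particular algorithm):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical data: F = {f(t) = c0 + c1*t}, with one gross outlier.
    rng = np.random.default_rng(3)
    t = np.linspace(0.0, 1.0, 20)
    y = 1.0 + 2.0 * t + rng.normal(0.0, 0.1, 20)
    y[5] += 3.0                                   # outlier

    def fit(loss):
        """Minimize sum_i loss(y_i - f(t_i)) over f in F."""
        obj = lambda c: np.sum(loss(y - (c[0] + c[1] * t)))
        return minimize(obj, x0=np.zeros(2), method="Nelder-Mead").x

    gamma = 0.5
    print(fit(lambda r: r**2))                    # least squares: pulled by the outlier
    print(fit(np.abs))                            # L1 loss: robust
    print(fit(lambda r: np.where(np.abs(r) <= gamma,
                                 r**2, 2*gamma*np.abs(r) - gamma**2)))  # Huber

Comparing the three fitted coefficient pairs shows how the robust losses downweight the outlier relative to the square loss.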

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best φ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam, pp 195–208

  • L
    • Large Deviations and Applications
      • About the Author
      • Cross References
      • References and Further Reading
        • Laws of Large Numbers
          • About the Author
          • Cross References
          • References and Further Reading
            • Learning Statistics in a Foreign Language
              • Background
              • Language and Cultural Problems
              • My SQU Experience
              • What Can Be Done
              • Cross References
              • References and Further Reading
                • Least Absolute Residuals Procedure
                  • About the Author
                  • Cross References
                  • References and Further Reading
                    • Least Squares
                      • About the Author
                      • Cross References
                      • References and Further Reading
                        • Leacutevy Processes
                          • Introduction
                          • Leacutevy Processes
                          • Examples of the Leacutevy Processes
                            • The Brownian Motion
                            • The Inverse Brownian Process
                            • The Compound Poisson Process
                            • The Gamma Process
                            • The Variance Gamma Process
                              • About the Author
                              • Cross References
                              • References and Further Reading
                                • Life Expectancy
                                  • Cross References
                                  • References and Further Reading
                                    • Life Table
                                      • About the Author
                                      • Cross References
                                      • References and Further Reading
                                        • Likelihood
                                          • Introduction
                                          • Inference
                                          • Nuisance Parameters
                                          • Extensions to Likelihood
                                          • Further Reading
                                          • About the Author
                                          • Cross References
                                          • References and Further Reading
                                            • Limit Theorems of Probability Theory
                                              • About the Author
                                              • Cross References
                                              • References and Further Reading
                                                • Linear Mixed Models
                                                  • About the Author
                                                  • Cross References
                                                  • References and Further Reading
                                                    • Linear Regression Models
                                                      • Aspects of Data Model and Notation
                                                      • Terminology
                                                      • General Remarks on Fitting Linear Regression Models to Data
                                                      • Background
                                                      • Aspects of Linear Regression Models
                                                      • Classical Linear Regression Modeland Least Squares
                                                      • Algebraic Properties of Least Squares
                                                      • Generalized Least Squares
                                                      • Reproducing Kernel Generalized Linear Regression Model
                                                      • Joint Normal Distributed (x y) as Motivation for the Linear Regression Model and Least Squares
                                                      • Comparing Two Basic Linear Regression Models
                                                      • Balancing Good Fit Against Reproducibility
                                                      • Regression to Mediocrity Versus Reversion to Mediocrity or Beyond
                                                      • Cross References
                                                      • References and Further Reading
                                                        • Local Asymptotic Mixed Normal Family
                                                          • About the Author
                                                          • Cross References
                                                          • References and Further Reading
                                                            • Location-Scale Distributions
                                                              • About the Author
                                                              • Cross References
                                                              • References and Further Reading
                                                                • Logistic Normal Distribution
                                                                  • About the Author
                                                                  • Cross References
                                                                  • References and Further Reading
                                                                    • Logistic Regression
                                                                      • About the Author
                                                                      • Cross References
                                                                      • References and Further Reading
                                                                        • Logistic Distribution
                                                                          • About the Author
                                                                          • Cross References
                                                                          • References and Further Reading
                                                                            • Lorenz Curve
                                                                              • Definition and Properties
                                                                              • Income Redistributions
                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 34: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

L Logistic Distribution

appropriate caveats However it appears well establishedthat m-asymptotic residual analyses is most appropri-ate for logistic models having no continuous predictorsm-asymptotics is based on grouping observations withthe same covariate pattern in a similar manner to thegrouped or binomial logistic regression discussed earliere Hilbe () and Hosmer and Lemeshow () ref-erences below provide guidance on how best to constructand interpret this type of residualLogistic models have been expanded to include cat-

egorical responses eg proportional odds models andmultinomial logistic regression ey have also beenenhanced to include the modeling of panel and corre-lated data eg generalized estimating equations xed andrandom eects and mixed eects logistic modelsFinally exact logistic regression models have recently

been developed to allow the modeling of perfectly pre-dicted data as well as small and unbalanced datasetsIn these cases logistic models which are estimated usingGLM or full maximum likelihood will not converge Exactmodels employ entirely dierent methods of estimationbased on large numbers of permutations

About the AuthorJoseph M Hilbe is an emeritus professor University ofHawaii and adjunct professor of statistics Arizona StateUniversity He is also a Solar System Ambassador withNASAJet Propulsion Laboratory at California Institute ofTechnology Hilbe is a Fellow of the American StatisticalAssociation and Elected Member of the International Sta-tistical institute for which he is founder and chair of theISI astrostatistics committee and Network the rst globalassociation of astrostatisticians He is also chair of the ISIsports statistics committee and was on the founding exec-utive committee of the Health Policy Statistics Section ofthe American Statistical Association (ndash) Hilbeis author of Negative Binomial Regression ( Cam-bridge University Press) and Logistic Regression Models( Chapman amp Hall) two of the leading texts in theirrespective areas of statistics He is also co-author (withJames Hardin) of Generalized Estimating Equations (Chapman amp HallCRC) and two editions of GeneralizedLinear Models and Extensions ( Stata Press)and with Robert Muenchen is coauthor of the R for StataUsers ( Springer) Hilbe has also been inuential inthe production and review of statistical soware servingas Soware Reviews Editor for e American Statisticianfor years from ndash He was founding editor ofthe Stata Technical Bulletin () and was the rst to addthe negative binomial family into commercial generalizedlinear models soware Professor Hilbe was presented the

Distinguished Alumnus award at California State Univer-sity Chico in two years following his induction intothe Universityrsquos Athletic Hall of Fame (he was two-timeUSchampion track amp eld athlete) He is the only graduate ofthe university to be recognized with both honors

Cross References7Case-Control Studies7Categorical Data Analysis7Generalized Linear Models7Multivariate Data Analysis An Overview7Probit Analysis7Recursive Partitioning7Regression Models with Symmetrical Errors7Robust Regression Estimation in Generalized LinearModels7Statistics An Overview7Target Estimation A New Approach to ParametricEstimation

References and Further ReadingCollett D () Modeling binary regression nd edn Chapman amp

HallCRC Cox LondonCox DR Snell EJ () Analysis of binary data nd edn

Chapman amp Hall LondonHardin JW Hilbe JM () Generalized linear models and exten-

sions nd edn Stata Press College StationHilbe JM () Logistic regression models Chapman amp HallCRC

Press Boca RatonHosmer D Lemeshow S () Applied logistic regression nd edn

Wiley New YorkKleinbaum DG () Logistic regression a self-teaching guide

Springer New YorkLong JS () Regression models for categorical and limited

dependent variables Sage Thousand OaksMcCullagh P Nelder J () Generalized linear models nd edn

Chapman amp Hall London

Logistic Distribution

JohnHindeProfessor of StatisticsNational University of Ireland Galway Ireland

e logistic distribution is a location-scale family distri-bution with a very similar shape to the normal (Gaussian)distribution but with somewhat heavier tails e distri-bution has applications in reliability and survival analysise cumulative distribution function has been used for

Logistic Distribution L

L

modelling growth functions and as a tolerance distribu-tion in the analysis of binary data leading to the widelyused logit model For a detailed discussion of the proper-ties of the logistic and related distributions see Johnsonet al ()

e probability density function is

f (x) = τ

expminus(x minus micro)τ[ + expminus(x minus micro)τ]

minusinfin lt x ltinfin ()

and the cumulative distribution function is

F(x) = [ + expminus(x minus micro)τ] minusinfin lt x ltinfin

e distribution is symmetric about the mean micro and hasvariance τπ so that when comparing the standardlogistic distribution (micro = τ = ) with the standard nor-mal distribution N( ) it is important to allow for thedierent variancese suitably scaled logistic distributionhas a very similar shape to the normal although the kur-tosis is which is somewhat larger than the value of for the normal indicating the heavier tails of the logisticdistributionIn survival analysis one advantage of the logistic

distribution over the normal is that both right- and le-censoring can be easily handlede survivor and hazardfunctions are given by

S(x) = [ + exp(x minus micro)τ] minusinfin lt x ltinfin

h(x)= τ

[ + expminus(x minus micro)τ]

e hazard function has the same logistic form and ismonotonically increasing so themodel is only appropriatefor ageing systemswith an increasing failure rate over timeIn modelling the dependence of failure times on explana-tory variables if we use a linear regression model for microthen the tted model has an accelerated failure time inter-pretation for the eect of the variables Fitting of thismodelto right- and le-censored data is described in Aitkin et al()One obvious extension for modelling failure times T

is to assume a logistic model for logT giving a log-logisticmodel for T analagous to the lognormal modele result-ing hazard function based on the logistic distribution in() is

h(t) = αθ

(tθ)αminus

+ (tθ)α t θ α gt

where θ = emicro and α = τ For α le the hazard ismonotone decreasing and for α gt it has a single max-imum as for the lognormal distribution hazards of thisform may be appropriate in the analysis of data such as

heart transplant survival ndash there may be an initial periodof increasing hazard associated with rejection followed bydecreasing hazard as the patient survives the procedureand the transplanted organ is acceptedFor the standard logistic distribution (micro = τ = )

the probability density and the cumulative distributionfunctions are related through the very simple identity

f (x) = F(x) [ minus F(x)]

which in turn by elementary calculus implies that

logit(F(x)) = loge [F(x) minus F(x)] = x ()

and uniquely characterizes the standard logistic distribu-tion Equation () provides a very simple way for sim-ulating from the standard logistic distribution by settingX = loge[U( minus U)] where U sim U( ) for the generallogistic distribution in () we take τX + micro

e logit transformation is now very familiar in mod-elling probabilities for binary responses Its use goes backto Berkson () who suggested the use of the logis-tic distribution to replace the normal distribution as theunderlying tolerance distribution in quantal bio-assayswhere various dose levels are given to groups of subjects(animals) and a simple binary response (eg cure deathetc) is recorded for each individual (giving r-out-of-n typeresponse data for the groups)e use of the normal dis-tribution in this context had been pioneered by Finneythrough his work on 7probit analysis and the same meth-ods mapped across to the logit analysis see Finney ()for a historical treatment of this area e probability ofresponse P(d) at a particular dose level d is modelled bya linear logit model

logit(P(d)) = loge [P(d) minus P(d)] = β + βd

which by the identity () implies a logistic tolerance dis-tribution with parameters micro = minusββ and τ = ∣β∣e logit transformation is computationally convenientand has the nice interpretation of modelling the log-odds of a response is goes across to general logis-tic regression models for binary data where parametereects are on the log-odds scale and for a two-level factorthe tted eect corresponds to a log-odds-ratio Approx-imate methods for parameter estimation involve usingthe empirical logits of the observed proportions How-ever maximum likelihood estimates are easily obtainedwith standard generalized linear model tting sowareusing a binomial response distribution with a logit linkfunction for the response probability this uses the itera-tively reweighted least-squares Fisher-scoring algorithm of

L Lorenz Curve

Nelder and Wedderburn () although Newton-basedalgorithms for maximum likelihood estimation of the logitmodel appeared well before the unifying treatment of7generalized linearmodels A comprehensive treatment of7logistic regression including models and applications isgiven in Agresti () and Hilbe ()

About the AuthorJohn Hinde is Professor of Statistics School of Mathe-matics Statistics and Applied Mathematics National Uni-versity of Ireland Galway Ireland He is past Presidentof the Irish Statistical Association (ndash) the Sta-tistical Modelling Society (ndash) and the EuropeanRegional Section of the International Association of Sta-tistical Computing (ndash) He has been an activemember of the International Biometric Society and is theincoming President of the British and Irish Region ()He is an Elected member of the International Statisti-cal Institute () He has authored or coauthored over papers mainly in the area of statistical modelling andstatistical computing and is coauthor of several booksincluding most recently Statistical Modelling in R (withM Aitkin B Francis and R Darnell Oxford UniversityPress ) He is currently an Associate Editor of Statis-tics and Computing and was one of the joint FoundingEditors of Statistical Modelling

Cross References7Asymptotic Relative Eciency in Estimation7Asymptotic Relative Eciency in Testing7Bivariate Distributions7Location-Scale Distributions7Logistic Regression7Multivariate Statistical Distributions7Statistical Distributions An Overview

References and Further ReadingAgresti A () Categorical data analysis nd edn Wiley New YorkAitkin M Francis B Hinde J Darnell R () Statistical modelling

in R Oxford University Press OxfordBerkson J () Application of the logistic function to bio-assay

J Am Stat Assoc ndashFinney DJ () Statistical method in biological assay rd edn

Griffin LondonHilbe JM () Logistic regression models Chapman amp HallCRC

Press Boca RatonJohnson NL Kotz S Balakrishnan N () Continuous univariate

distributions vol nd edn Wiley New YorkNelder JA Wedderburn RWM () Generalized linear models J R

Stat Soc A ndash

Lorenz Curve

Johan FellmanProfessor EmeritusFolkhaumllsan Institute of Genetics Helsinki Finland

Definition and PropertiesIt is a general rule that income distributions are skewedAlthough various distributionmodels such as the Lognor-mal and the Pareto have been proposed they are usuallyapplied in specic situations For general studies morewide-ranging tools have to be applied the rst and mostcommon tool of which is the Lorenz curve Lorenz ()developed it in order to analyze the distribution of incomeand wealth within populations describing it in the follow-ing way

7 Plot along one axis accumulated percentages of the popula-

tion from poorest to richest and along the other wealth held

by these percentages of the population

e Lorenz curve L(p) is dened as a function of theproportion p of the population L(p) is a curve startingfrom the origin and ending at point ( ) with the follow-ing additional properties (I) L(p) is monotone increasing(II) L(p) le p (III) L(p) convex (IV) L()= and L()= e Lorenz curve is convex because the income share of thepoor is less than their proportion of the population (Fig )

e Lorenz curve satises the general rules

7 A unique Lorenz curve corresponds to every distribution The

contrary does not hold but every Lorenz L(p) is a common

curve for a whole class of distributions F(θ x) where θ is an

arbitrary constant

e higher the curve the less inequality in the incomedistribution If all individuals receive the same incomethen the Lorenz curve coincides with the diagonal from( ) to ( ) Increasing inequality lowers the Lorenzcurve which can converge towards the lower right cornerof the squareConsider two Lorenz curves LX(p) and LY(p) If

LX(p) ge LY(p) for all p then the distribution correspond-ing to LX(p) has lower inequality than the distributioncorresponding to LY(p) and is said to Lorenz dominate theother Figure shows an example of Lorenz curves

e inequality can be of a dierent type the corre-sponding Lorenz curves may intersect and for these noLorenz ordering holdsis case is seen in Fig Undersuch circumstances alternative inequality measures haveto be dened the most frequently used being the Giniindex G introduced by Gini ()is index is the ratio

Lorenz Curve L

L

10

06

08

04

02

0000 02 04 06 08

Lx(p)

Ly(p)

10

Lorenz Curve Fig Lorenz curves with Lorenz orderingthat is LX(p) ge LY(p)

1

06

08

04

02

000 02 04 06 08 1

L2(p)

L1(p)

Lorenz Curve Fig Two intersecting Lorenz curves Using theGini index L(p) has greater inequality (G = ) than L(p)

(G = )

between the area between the diagonal and the Lorenzcurve and the whole area under the diagonalis deni-tion yields Gini indices satisfying the inequality le G le

e higher the G value the greater the inequality in theincome distribution

Income RedistributionsIt is a well-known fact that progressive taxation reducesinequality Similar eects can be obtained by appropriatetransfer policies ndings based on the following generaltheorem (Fellman Jakobsson Kakwani )

Theorem 1. Let u(x) be a continuous, monotone increasing function, and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists, and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone-decreasing function satisfying condition (I), and the Gini index is reduced. Hemming and Keen (1983) gave an alternative condition for the Lorenz dominance, which is that u(x)/x crosses the μY/μX level once from above.

If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in Theorem 1 indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance.

A crucial study concerning income distributions and redistributions is the monograph by Lambert (2001).
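Case (I) can be checked numerically: a rule such as u(x) = x**0.8 has u(x)/x = x**(-0.2) monotone decreasing, so the post-tax Gini index must fall. A sketch under these assumptions, reusing the gini function above (the quoted values are approximate):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)   # skewed pre-tax incomes

u = lambda v: v ** 0.8     # progressive: u(v)/v is monotone decreasing

print(gini(x))       # approx. 0.52 for lognormal(0, 1)
print(gini(u(x)))    # approx. 0.43: smaller, as case (I) guarantees
```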

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations, and in addition he has published numerous printed articles. He served as professor in statistics at Hanken School of Economics and has since been a scientist at Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of Hanken School of Economics. He is an Elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
▶ Econometrics
▶ Economic Statistics
▶ Testing Exponentiality of Distribution

References and Further Reading
Fellman J (1976) The effect of transformations on Lorenz curves. Econometrica 44:823–824
Gini C (1912) Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ (1983) Single crossing conditions in comparisons of tax progressivity. J Publ Econ 20:373–380
Jakobsson U (1976) On the measurement of the degree of progression. J Publ Econ 5:161–168
Kakwani NC (1977) Applications of Lorenz curves in economic analysis. Econometrica 45:719–727
Lambert PJ (2001) The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO (1905) Methods for measuring concentration of wealth. J Am Stat Assoc New Ser 9:209–219

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography
Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see ▶Decision Theory: An Introduction and ▶Decision Theory: An Overview) and regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. In case the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss which is suffered by a decision unequal to the truth. The larger the value, the larger the loss which is suffered.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, on which we have information by data. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss which is suffered if θ is the true value and a is the decision. Therefore each (up to technical conditions) function L : Θ × Θ → [0, ∞) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First let us consider a test problem. Then Θ is divided in two disjoint subsets Θ0 and Θ1, describing the null hypothesis and the alternative, Θ = Θ0 + Θ1. Then the usual loss function is given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1, and L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.
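In code this 0-1 loss needs only a predicate saying which hypothesis a point belongs to; a small sketch with an illustrative encoding:

```python
def zero_one_loss(theta, decision, in_null):
    """0 if truth and decision fall in the same hypothesis set, else 1."""
    return 0 if in_null(theta) == in_null(decision) else 1

in_null = lambda v: v <= 0          # e.g., H0: theta <= 0 vs H1: theta > 0
print(zero_one_loss(-0.3, 0.2, in_null))   # 1: truth in H0, decision in H1
print(zero_one_loss(0.5, 0.2, in_null))    # 0: both in H1
```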

For point estimation problems we assume that Θ is a normed linear space, and let ∣·∣ be its norm. Such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = ∣θ − ϑ∣, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if |t| ≤ γ and ℓ(t) = 2γ|t| − γ² if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently than overestimating it. For such problems Varian (1975) introduced LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.
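The loss families above are straightforward to write down as functions of the error t; a sketch with default parameter values chosen only for illustration:

```python
import numpy as np

def square(t):
    return t ** 2                       # classical p = 2 loss

def absolute(t):
    return np.abs(t)                    # robust p = 1 (L1) loss

def huber(t, gamma=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(t) <= gamma, t ** 2,
                    2 * gamma * np.abs(t) - gamma ** 2)

def linex(t, a=1.0, b=1.0):
    # asymmetric: exponential on one side of zero, roughly linear on the other
    return b * (np.exp(a * t) - a * t - 1)

t = np.linspace(-2, 2, 5)
print(square(t), absolute(t), huber(t), linex(t))   # all vanish at t = 0
```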

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties for loss functions can be quite different. Classically it was assumed that the loss is convex (see ▶Rao–Blackwell Theorem). If the space Θ is not bounded, then it seems to be more convenient in practice to assume that the loss is bounded, which is also assumed in some branches of statistics. In case the loss is not continuous, it must be carefully defined to avoid counterintuitive results in practice; see Bischoff ().

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined. Specific cases of these losses are the Hellinger and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by the "best approximation",

i.e., the function in F that minimizes ∑_{i=1}^{n} ℓ(r_i^f), f ∈ F,

where r_i^f = y_i − f(t_i) is the residual in the ith design point. Here ℓ is also called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
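As an illustration of this regression use (simulated data, illustrative names), a straight line f(t) = c0 + c1·t can be fitted by numerically minimizing the summed loss; the squared loss reproduces least squares, while the Huber loss damps the influence of outliers:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 60)
y = 1.0 + 2.0 * t + rng.normal(scale=0.1, size=t.size)
y[::15] += 3.0                      # a few gross outliers

def huber(r, gamma=0.5):
    return np.where(np.abs(r) <= gamma, r ** 2,
                    2 * gamma * np.abs(r) - gamma ** 2)

def fit(loss):
    # minimize sum_i loss(y_i - f(t_i)) over f(t) = c[0] + c[1] * t
    obj = lambda c: np.sum(loss(y - (c[0] + c[1] * t)))
    return minimize(obj, x0=np.zeros(2)).x

print(fit(lambda r: r ** 2))   # least squares: pulled towards the outliers
print(fit(huber))              # Huber fit: stays closer to (1, 2)
```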

Cross References
▶ Advantages of Bayesian Structuring: Estimating Ranks and Histograms
▶ Bayesian Statistics
▶ Decision Theory: An Introduction
▶ Decision Theory: An Overview
▶ Entropy and Cross Entropy as Diversity and Distance Measures
▶ Sequential Sampling
▶ Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR (1975) A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North-Holland, Amsterdam, pp 195–208


                                                                            • Lorenz Curve
                                                                              • Definition and Properties
                                                                              • Income Redistributions
                                                                              • About the Author
                                                                              • Cross References
                                                                              • References and Further Reading
                                                                                • Loss Function
                                                                                  • Cross References
                                                                                  • References and Further Reading
Page 37: Large Deviations and ough Applications · 2012. 12. 3. · L Large Deviations and Applications F§ZŁhƒ« CŽfiu–« Professor UniversitéParisDiderot,Paris,France Largedeviationsisconcernedwiththestudyofrareevents

Lorenz Curve, Fig. 1 Lorenz curves with Lorenz ordering, that is, LX(p) ≥ LY(p)

Lorenz Curve, Fig. 2 Two intersecting Lorenz curves, ranked by the Gini index (the curve with the larger G shows the greater inequality)

The Gini index G is defined as the ratio between the area between the diagonal and the Lorenz curve and the whole area under the diagonal. This definition yields Gini indices satisfying the inequality 0 ≤ G ≤ 1. The higher the G value, the greater the inequality in the income distribution.
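To make the definition concrete, the following minimal sketch (not part of the original article) computes the empirical Lorenz curve and the Gini index from a sample of incomes; the function names and the trapezoidal approximation of the area under the Lorenz curve are our own choices.

```python
import numpy as np

def lorenz_curve(incomes):
    """Empirical Lorenz curve: grid p and values L(p), starting at (0, 0)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    p = np.arange(len(x) + 1) / len(x)
    L = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    return p, L

def gini(incomes):
    """Area between diagonal and Lorenz curve, divided by the area 1/2
    under the diagonal; equivalently G = 1 - 2 * (area under L)."""
    p, L = lorenz_curve(incomes)
    area_under_L = np.sum((L[1:] + L[:-1]) * np.diff(p)) / 2.0  # trapezoidal rule
    return 1.0 - 2.0 * area_under_L

print(gini([10, 10, 10, 10]))         # 0.0: perfect equality
print(round(gini([1, 1, 1, 97]), 2))  # 0.72: most income held by one person
```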

Income Redistributions
It is a well-known fact that progressive taxation reduces inequality. Similar effects can be obtained by appropriate transfer policies; these findings are based on the following general theorem (Fellman; Jakobsson; Kakwani).

Theorem. Let u(x) be a continuous monotone increasing function and assume that μY = E(u(X)) exists. Then the Lorenz curve LY(p) for Y = u(X) exists and

(I) LY(p) ≥ LX(p) if u(x)/x is monotone decreasing;
(II) LY(p) = LX(p) if u(x)/x is constant;
(III) LY(p) ≤ LX(p) if u(x)/x is monotone increasing.

For progressive taxation rules, u(x)/x measures the proportion of post-tax income to initial income and is a monotone decreasing function satisfying condition (I), so the Gini index is reduced. Hemming and Keen () gave an alternative condition for Lorenz dominance, namely that u(x)/x crosses the level μY/μX once from above. If the taxation rule is a flat tax, then (II) holds and the Lorenz curve and the Gini index remain unchanged. The third case in the theorem indicates that the ratio u(x)/x is increasing and the Gini index increases, but this case has only minor practical importance. A numerical check of condition (I) is sketched below.

A crucial study concerning income distributions and redistributions is the monograph by Lambert ().
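As an illustration (not from the original article), the sketch below applies a hypothetical progressive tax u(x) = x^0.8, for which u(x)/x = x^(−0.2) is monotone decreasing, and verifies condition (I) pointwise on a simulated sample; the lognormal income model is our own assumption.

```python
import numpy as np

def lorenz_values(incomes):
    """Empirical Lorenz curve values on the grid p = 0, 1/n, ..., 1."""
    x = np.sort(np.asarray(incomes, dtype=float))
    return np.concatenate(([0.0], np.cumsum(x) / x.sum()))

rng = np.random.default_rng(seed=1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # pre-tax incomes

y = x ** 0.8  # hypothetical progressive tax: u(x)/x is monotone decreasing

LX = lorenz_values(x)
LY = lorenz_values(y)

# Condition (I) of the theorem: L_Y(p) >= L_X(p) on the whole grid.
print(bool(np.all(LY >= LX - 1e-12)))  # expected: True
```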

About the Author
Dr. Johan Fellman is Professor Emeritus in statistics. He obtained his PhD in mathematics (University of Helsinki) with the thesis On the Allocation of Linear Observations and has in addition published numerous printed articles. He served as professor in statistics at the Hanken School of Economics and has since been a scientist at the Folkhälsan Institute of Genetics. He is a member of the Editorial Board of Twin Research and Human Genetics and of the Advisory Board of Mathematica Slovaca, and an Editor of InterStat. He has been awarded the Knight, First Class, of the Order of the White Rose of Finland and the Medal in Silver of the Hanken School of Economics. He is an elected member of the Finnish Society of Sciences and Letters and of the International Statistical Institute.


Cross References
7Econometrics
7Economic Statistics
7Testing Exponentiality of Distribution

References and Further Reading
Fellman J () The effect of transformations on Lorenz curves. Econometrica
Gini C () Variabilità e mutabilità. Bologna, Italy
Hemming R, Keen MJ () Single crossing conditions in comparisons of tax progressivity. J Publ Econ
Jakobsson U () On the measurement of the degree of progression. J Publ Econ
Kakwani NC () Applications of Lorenz curves in economic analysis. Econometrica
Lambert PJ () The distribution and redistribution of income: a mathematical analysis, 3rd edn. Manchester University Press, Manchester
Lorenz MO () Methods for measuring concentration of wealth. J Am Stat Assoc, New Ser

Loss Function

Wolfgang Bischoff
Professor and Dean of the Faculty of Mathematics and Geography, Catholic University Eichstätt–Ingolstadt, Eichstätt, Germany

Loss functions occur at several places in statistics. Here we attach importance to decision theory (see 7Decision Theory: An Introduction and 7Decision Theory: An Overview) and to regression. For both fields the same loss functions can be used, but the interpretation is different.

Decision theory gives a general framework in which to define and understand statistics as a mathematical discipline. The loss function is the essential component in decision theory: it judges a decision with respect to the truth by a real value greater than or equal to zero. If the decision coincides with the truth, there is no loss, and the value of the loss function is zero; otherwise the value gives the loss suffered by a decision unequal to the truth. The larger the value, the larger the suffered loss.

To describe this more exactly, let Θ be the known set of all outcomes for the problem under consideration, about which the data give us information. We assume that one of the values θ ∈ Θ is the true value. Each d ∈ Θ is a possible decision. The decision d is chosen according to a rule, more exactly, according to a function with values in Θ defined on the set of all possible data. Since the true value θ is unknown, the loss function L has to be defined on Θ × Θ, i.e.,

L : Θ × Θ → [0, ∞).

The first variable describes the true value, say, and the second one the decision. Thus L(θ, a) is the loss suffered if θ is the true value and a is the decision. Therefore, each function L : Θ × Θ → [0, ∞) (up to technical conditions) with the property

L(θ, θ) = 0 for all θ ∈ Θ

is a possible loss function. The loss function has to be chosen by the statistician according to the problem under consideration.

Next we describe examples of loss functions. First let us consider a test problem. Then Θ is divided into two disjoint subsets Θ0 and Θ1 describing the null hypothesis and the alternative, Θ = Θ0 + Θ1. The usual loss function is then given by

L(θ, ϑ) = 0 if θ, ϑ ∈ Θ0 or θ, ϑ ∈ Θ1,
L(θ, ϑ) = 1 if θ ∈ Θ0, ϑ ∈ Θ1 or θ ∈ Θ1, ϑ ∈ Θ0.
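In code, this 0–1 loss is a one-liner; the sketch below (our own illustration, with Θ encoded by the labels 0 for the null region and 1 for the alternative) makes the two cases explicit.

```python
def zero_one_loss(truth_region: int, decision_region: int) -> int:
    """0-1 test loss: no loss if the decision lies in the same part of
    Θ as the truth, loss 1 otherwise."""
    return 0 if truth_region == decision_region else 1

print(zero_one_loss(0, 0))  # truth and decision both in Θ0: loss 0
print(zero_one_loss(0, 1))  # truth in Θ0, decision in Θ1: loss 1
```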

For point estimation problems we assume that Θ is a normed linear space and let | · | be its norm; such a space is typical for estimating a location parameter. Then the loss L(θ, ϑ) = |θ − ϑ|, θ, ϑ ∈ Θ, can be used. Next let us consider the specific case Θ = R. Then L(θ, ϑ) = ℓ(θ − ϑ) is a typical form for loss functions, where ℓ : R → [0, ∞) is nonincreasing on (−∞, 0] and nondecreasing on [0, ∞) with ℓ(0) = 0; ℓ is also called a loss function. An important class of such functions is given by choosing ℓ(t) = |t|^p, where p > 0 is a fixed constant. There are two prominent cases: for p = 2 we get the classical square loss, and for p = 1 the robust L1-loss. Another class of robust losses are the famous Huber losses

ℓ(t) = t² if |t| ≤ γ, and ℓ(t) = 2γ|t| − γ² if |t| > γ,

where γ > 0 is a fixed constant. Up to now we have shown symmetrical losses, i.e., L(θ, ϑ) = L(ϑ, θ). There are many problems in which underestimating the true value θ has to be judged differently from overestimating it. For such problems Varian () introduced the LinEx losses

ℓ(t) = b(exp(at) − at − 1),

where a, b > 0 can be chosen suitably. Here underestimating is judged exponentially and overestimating linearly.

For other estimation problems corresponding losses are used. For instance, let us consider the estimation of a scale parameter and let Θ = (0, ∞). Then it is usual to consider losses of the form L(θ, ϑ) = ℓ(ϑ/θ), where ℓ must be chosen suitably. It is, however, more convenient to write ℓ(ln ϑ − ln θ); then ℓ can be chosen as above.

In theoretical works the assumed properties of loss functions can be quite different. Classically it was assumed that the loss is convex (see 7Rao–Blackwell Theorem). If the space Θ is not bounded, it seems more convenient in practice to assume that the loss is bounded, which is indeed assumed in some branches of statistics. If the loss is not continuous, it must be defined carefully to avoid counterintuitive results in practice; see Bischoff ().
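The standard losses above translate directly into code. The following sketch (our own illustration; the parameter defaults are arbitrary) implements the power, Huber, and LinEx families and shows that only LinEx treats under- and overestimation asymmetrically.

```python
import numpy as np

def power_loss(t, p=2):
    """|t|**p: p = 2 gives the square loss, p = 1 the robust L1-loss."""
    return np.abs(t) ** p

def huber_loss(t, gamma=1.0):
    """t**2 inside [-gamma, gamma], linear growth 2*gamma*|t| - gamma**2 outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma, t ** 2, 2 * gamma * np.abs(t) - gamma ** 2)

def linex_loss(t, a=1.0, b=1.0):
    """b*(exp(a*t) - a*t - 1): exponential on one side, linear on the other."""
    return b * (np.exp(a * t) - a * t - 1)

t = 1.5  # t = theta - decision; t > 0 means the true value is underestimated
print(power_loss(t) == power_loss(-t))  # True: symmetric
print(huber_loss(t) == huber_loss(-t))  # True: symmetric
print(linex_loss(t) > linex_loss(-t))   # True: underestimation punished harder
```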

In case a density of the underlying distribution of the data is known up to an unknown parameter, the class of divergence losses can be defined; specific cases of these losses are the Hellinger loss and the Kullback–Leibler loss.

In regression, however, the loss is used in a different way. Here it is assumed that the unknown location parameter is an element of a known class F of real-valued functions. Given n observations (data) y1, …, yn observed at design points t1, …, tn of the experimental region, a loss function is used to determine an estimate of the unknown regression function by 'best approximation', i.e., by the function in F that minimizes ∑_{i=1}^n ℓ(r_i^f), f ∈ F, where r_i^f = y_i − f(t_i) is the residual at the ith design point. Here ℓ is again called a loss function and can be chosen as described above. For instance, the least squares estimate is obtained if ℓ(t) = t².
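A minimal sketch of this 'best approximation' view (our own illustration): take F to be the class of constant functions and minimize the summed loss of the residuals numerically. The use of scipy's scalar minimizer is an implementation convenience, not part of the article.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 1.2, 0.9, 1.1, 8.0])  # observations with one outlier

def fit_constant(loss):
    """Best approximation of y within F = {constant functions}:
    the constant c minimizing sum_i loss(y_i - c)."""
    return minimize_scalar(lambda c: np.sum(loss(y - c))).x

square = lambda t: t ** 2
absolute = np.abs
huber = lambda t, g=1.0: np.where(np.abs(t) <= g, t ** 2, 2 * g * np.abs(t) - g ** 2)

print(fit_constant(square))    # the mean, pulled toward the outlier
print(fit_constant(absolute))  # the median, robust against the outlier
print(fit_constant(huber))     # a compromise between mean and median
```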

Cross References
7Advantages of Bayesian Structuring: Estimating Ranks and Histograms
7Bayesian Statistics
7Decision Theory: An Introduction
7Decision Theory: An Overview
7Entropy and Cross Entropy as Diversity and Distance Measures
7Sequential Sampling
7Statistical Inference for Quantum Systems

References and Further Reading
Bischoff W () Best ϕ-approximants for bounded weak loss functions. Stat Decis
Varian HR () A Bayesian approach to real estate assessment. In: Fienberg SE, Zellner A (eds) Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage. North Holland, Amsterdam
