Chapter 2
Optimization and Solving Nonlinear Equations
This chapter deals with an important problem in mathematics and statistics: finding values of x to satisfy f(x) = 0. Such values are called the roots of the equation and are also known as the zeros of f(x).
2.1 The bisection method
The goal is to find the solution of an equation f(x) = 0.
A question that should be raised is the following: Is there a (real) root of f(x) = 0? One answer is provided by the intermediate value theorem.
Intermediate value theorem. If f(x) is continuous on an interval [a, b], and f(a) and f(b) have opposite signs, i.e., f(a)f(b) < 0, then there exists a point ξ ∈ (a, b) such that f(ξ) = 0.
The intermediate value theorem guarantees that a root exists under those conditions. However, it does not tell us the precise value of the root ξ.
The bisection method assumes that we know two values a and b such that f(a)f(b) < 0, and works by repeatedly narrowing the gap between a and b until it closes in on the root.
It narrows the gap by taking the midpoint (a + b)/2 of a and b. If f((a + b)/2) = 0, then we have found a root at (a + b)/2. Otherwise, look at the two subintervals (a, (a + b)/2) and ((a + b)/2, b). By the intermediate value theorem again, there must be a root in the interval (a, (a + b)/2) when f(a)f((a + b)/2) < 0, or in the interval ((a + b)/2, b) when f((a + b)/2)f(b) < 0. We continue this procedure until a desired accuracy has been achieved.
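Each bisection step halves the bracketing interval, so after n steps the root is pinned down to an interval of length (b − a)/2^n. A small sketch (my addition, not in the original notes; the function name steps_needed is arbitrary) computes how many steps guarantee a given tolerance eps:

```r
# Smallest n with (b - a)/2^n <= eps, i.e., n >= log2((b - a)/eps)
steps_needed = function(a, b, eps) ceiling(log2((b - a)/eps))
steps_needed(0, 1, 1e-6)   # 20 steps pin the root down to within 1e-6
```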
Example 1. Find the zeros of f(x) = 5x^5 − 4x^4 + 3x^3 − 2x^2 + x − 1.
There is at least one real zero of f(x). (Why?)
It is a good idea to start by drawing a graph of f(x).
> f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}
> x=seq(-50, 50, length=500)
> plot(x, f(x))
> x=seq(-5, 5, length=500)
> plot(x, f(x))
> x=seq(0, 1, length=500)
> plot(x, f(x))
Next we use the bisection method to find the zero between 0 and 1.
> f(0)
[1] -1
> f(1)
[1] 2
> f(.5) # f value at midpoint of (0, 1)
[1] -0.71875 # This suggests next step go to (0.5, 1)
> f(0.75) # f value at midpoint of (0.5, 1)
[1] -0.1884766 # Go to (0.75, 1)
> f(0.875) # f value at midpoint of (0.75, 1)
[1] 0.5733337 # Go to (0.75, 0.875)
> f(0.8125) # f value at midpoint of (0.75, 0.875)
[1] 0.1285563 # Go to (0.75, 0.8125)
> f(0.78125) # at midpoint of (0.75, 0.8125)
[1] -0.04386625 # Go to (0.78125, 0.8125)
> f(0.796875) # at midpoint of (0.78125, 0.8125)
[1] 0.03862511 # Go to (0.78125, 0.796875)
> (0.78125+ 0.796875)/2
[1] 0.7890625
> f(0.7890625) # at midpoint of (0.78125, 0.796875)
[1] -0.003519249 # Go to (0.7890625, 0.796875)
2.1. THE BISECTION METHOD 21
> (0.7890625+ 0.796875)/2
[1] 0.7929688
> f(0.7929688) # at midpoint of (0.7890625, 0.796875)
[1] 0.01732467 # Go to (0.7890625, 0.7929688)
> (0.7890625+ 0.7929688)/2
[1] 0.7910157
> f(0.7910157)
[1] 0.006846331 # Go to (0.7890625, 0.7910157)
> (0.7890625+ 0.7910157)/2
[1] 0.7900391
> f((0.7890625+ 0.7910157)/2) # Go to (0.7890625, 0.7900391)
[1] 0.001649439
> (0.7890625+ 0.7900391)/2
[1] 0.7895508
> f((0.7890625+ 0.7900391)/2) # What do you think?
[1] -0.0009384231 # Is f(0.7895508) close enough to 0?
Below is a simple implementation of the procedure.
f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}
bisection=function(a,b,n){                   # n bisection steps on (a, b)
  xa=a
  xb=b
  for(i in 1:n){
    if(f(xa)*f((xa+xb)/2)<0) xb=(xa+xb)/2    # root in the left half
    else xa=(xa+xb)/2                        # root in the right half
  }
  list(left=xa, right=xb, midpoint=(xa+xb)/2)
}
> bisection(0,1,15)
$left
[1] 0.7897034
$right
[1] 0.7897339
$midpoint
[1] 0.7897186
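As a cross-check on the transcript above (my addition), base R's uniroot performs bracketing root-finding and should agree with the bisection output:

```r
f = function(x){5*x^5 - 4*x^4 + 3*x^3 - 2*x^2 + x - 1}
r0 = uniroot(f, c(0, 1), tol = 1e-9)$root   # bracketed root on (0, 1)
r0    # about 0.78972, matching the bisection midpoint above
```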
Example 2. Find the maximum of
g(x) = log x / (1 + x), x > 0.
Since g(x) is differentiable, we look at its derivative
g′(x) = { (1/x)(1 + x) − log x } / (1 + x)^2 = (1/(1 + x)^2) { (x + 1)/x − log x }, x > 0,
and find a critical point of g(x) by solving g′(x) = 0, or by solving
(x + 1)/x − log x = 0,
whose root is denoted by c. Clearly, c > 1. It can be shown that g′(x) > 0 for all x ∈ (0, c), and g′(x) < 0 for all x ∈ (c, ∞). Thus, g(c) is the maximum value of g(x).
> gd=function(x){(1+x)/x-log(x)}
> x=seq(1, 2, length=50)
> plot(x, gd(x)) # gd stays positive on (1, 2); the root c lies further right
> gd(3)
[1] 0.2347210
> gd(6)
[1] -0.6250928
> bisection=function(a,b,n){
xa=a
xb=b
for(i in 1:n){ if(gd(xa)*gd((xa+xb)/2)<0) xb=(xa+xb)/2
else xa=(xa+xb)/2}
list(left=xa,right=xb, midpoint=(xa+xb)/2)
}
> bisection(3,6,30)
$left
[1] 3.591121
$right
[1] 3.591121
$midpoint
[1] 3.591121
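At the critical point, (c + 1)/c − log c = 0 gives log c = (c + 1)/c, so the maximum value simplifies neatly: g(c) = log(c)/(1 + c) = 1/c. A quick check with the bisection output (my addition, not in the original notes):

```r
g = function(x){ log(x)/(1 + x) }   # the objective of Example 2
c0 = 3.591121                       # root of g'(x) = 0 found above
g(c0)                               # the maximum value of g
1/c0                                # same number, by the identity log(c) = (c + 1)/c
```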
Example 3. A Cauchy density function takes the form
f(x) = 1 / [π{1 + (x − θ)^2}], x ∈ R,
where θ is a parameter.
(1) Generate 50 random numbers from a Cauchy distribution with θ = 1.
data = rcauchy(50, 1)
[Figure: two panels plotted against θ over (−40, 40): the log-likelihood l(data, θ) and its derivative ld(data, θ).]
(2) Treat the data you get from step (1) as sample observations from a Cauchy distribution with an unknown θ. Plot the log-likelihood function of θ,
l(θ) = −n ln π − Σ_{i=1}^n ln{1 + (x_i − θ)^2}, θ ∈ R.
l=function(x,t){
s=0
n=length(x)
for(j in 1:n) s=s + log(1+(x[j]-t)^2)
l=-n*log(pi)-s
l
}
theta=seq(-50, 50,length=500)
plot(theta, l(data,theta), type="l",main="Log-likelihood function",
xlab=expression(theta), ylab=expression(l(data, theta)))
(3) The maximum seems visible in the above plot of the log-likelihood function of θ. Use the bisection method to find the maximum likelihood estimator of θ.
To do so, we calculate the derivative of l(θ),
l′(θ) = Σ_{i=1}^n 2(x_i − θ)/{1 + (x_i − θ)^2}, θ ∈ R.
Dropping the constant factor 2 and flipping the sign, neither of which changes the zeros, we work with
Σ_{i=1}^n (θ − x_i)/{1 + (x_i − θ)^2}, θ ∈ R,
and draw a plot of this function.
ld=function(x,t){
s=0
n=length(x)
for(j in 1:n) s=s + (t-x[j])/(1+(x[j]-t)^2)
l=s
l
}
theta=seq(-10, 10,length=500)
plot(theta, ld(data,theta), type="l",main="Derivative",
xlab=expression(theta), ylab=expression(ld(data, theta)))
The bisection method is applicable here, since the function is continuous everywhere.
f=function(t){ld(data, t)}   # reuse the bisection() from Example 1, which calls f
bisection(-10,10,30)
$left
[1] 0.9758892
$right
[1] 0.9758892
$midpoint
[1] 0.9758892
Hence, θ̂ = 0.9758892.
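As an independent check (my addition), the log-likelihood can be maximized directly with R's optimize. Since the data are random, I fix a seed here for reproducibility, so the estimate below need not equal 0.9758892, but it should likewise land near the true value θ = 1:

```r
set.seed(1)                      # arbitrary seed; your data will differ
data = rcauchy(50, 1)
l = function(t) -50*log(pi) - sum(log(1 + (data - t)^2))   # log-likelihood
fit = optimize(l, interval = c(-10, 10), maximum = TRUE)
fit$maximum                      # MLE of theta, typically close to 1
```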
2.2 Secant method
The secant method begins by finding two points (x_0, f(x_0)) and (x_1, f(x_1)) on the curve of f(x), hopefully near a root r we seek. The straight line that passes through these two points is
(y − f(x_0)) / (f(x_1) − f(x_0)) = (x − x_0) / (x_1 − x_0).
Let x_2 be the point where this line crosses the x-axis, i.e., the point with (x_2, 0) on the line. Then
(0 − f(x_0)) / (f(x_1) − f(x_0)) = (x_2 − x_0) / (x_1 − x_0).
From this we solve for x_2,
x_2 = x_1 − f(x_1) (x_0 − x_1) / (f(x_0) − f(x_1)).
Because f(x) is not exactly linear, x_2 is not equal to r, but it should be closer to r than either of the two points we began with.
If we repeat this, we have
x_{n+1} = x_n − f(x_n) (x_{n−1} − x_n) / (f(x_{n−1}) − f(x_n)), n = 1, 2, . . .
Under the assumptions that the sequence {x_n, n = 1, 2, . . .} converges to r, f(x) is differentiable near r, and f′(r) ≠ 0, taking limits on both sides of the iteration gives
r = r − f(r) · lim_{n→∞} (x_{n−1} − x_n) / (f(x_{n−1}) − f(x_n)) = r − f(r)/f′(r),
which gives f(r) = 0.
26 CHAPTER 2. OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS
Example. Find the zeros of f(x) = x^3 − 2x^2 + x − 1 using the secant method.
f=function(x){x^3-2*x^2+x-1}
secant=function(x,y,n){              # Katherine Earles's code
  if(abs(f(x))<abs(f(y))){ xa=x; xb=y }
  else { xa=y; xb=x }                # order the starting points by |f|
  for(i in 1:n){
    xc=xb-f(xb)/(f(xa)-f(xb))*(xa-xb)
    xa=xb
    xb=xc
  }
  list("x(n)"=xa, "x(n+1)"=xb)
}
> secant(0,5,12)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
> secant(5,0, 12)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
> secant(5,0, 15)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] NaN
The above code does break down for large enough values of n (it returns NaN): once xa and xb coincide, f(xa) − f(xb) = 0 and the update divides by zero. The following function h is an improvement that fixes the problem: its if statement breaks out of the loop as soon as the values of xa and xb are (numerically) equal.
g=function(x,y){y-(f(y)/(f(x)-f(y)))*(x-y)}
h=function(x,y,n){                   # Katherine Earles's code
  xa=x
  xb=y
  for(i in 1:n){
    if(identical(all.equal(xa, xb), TRUE)) break   # stop once the iterates coincide
    xc=g(xa,xb)
    xa=xb
    xb=xc
  }
  list("x(n)"=xa, "x(n+1)"=xb)
}
> h(-10,50,500)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
2.3 Newton’s method
Newton’s method or the Newton-Raphson method is a procedure or algorithm for approximating the zerosof a function f (or, equivalently, the roots of an equation f(x) = 0). It consists of the following three steps:
Step 1. Make a reasonable initial guess as to the location of a solution, which is denoted by x0.
Step 2. Calculate
x_1 = x_0 − f(x_0)/f′(x_0).
Step 3. If x_1 is sufficiently close to a solution, stop; otherwise, continue this procedure by
x_2 = x_1 − f(x_1)/f′(x_1),
x_3 = x_2 − f(x_2)/f′(x_2),
. . .
x_n = x_{n−1} − f(x_{n−1})/f′(x_{n−1}).
Under the assumptions that the sequence x_0, x_1, . . . , x_n, . . . converges to r, and that f(x) is differentiable near r with f′(r) ≠ 0, by taking the limit on both sides of
x_n = x_{n−1} − f(x_{n−1})/f′(x_{n−1}),
we obtain
r = r − f(r)/f′(r),
which results in f(r) = 0.
This method requires that the first approximation is sufficiently close to the root r.
A comparison between the secant method and Newton's method. The secant method is obtained from Newton's method by approximating the derivative f′(x_n) by the difference quotient
(f(x_n) − f(x_{n−1})) / (x_n − x_{n−1}).
Geometrically, Newton's method uses the tangent line, while the secant method approximates the tangent line by a secant line.
Example 1. Find a zero of f(x) = x^2006 + 2006x + 1.
> newton=function(x0, n){
f=function(x){x^(2006)+2006*x+1}
fd=function(x){2006*x^(2005)+2006}
x=x0
for (i in 1:n){x=x-f(x)/fd(x)}
list(x)
}
> newton(-.5, 20)
[1] -0.0004985045
> newton(3.5, 10) # It is sensitive to the initial guess x0
[1] NaN
> nr=function(x0, numstp, eps){
  f=function(x){x^(2006)+2006*x+1}
  fd=function(x){2006*x^(2005)+2006}
  small = 1.0*10^(-8)
  for(i in 1:numstp){                    # at most numstp iterations
    x1 = x0 - f(x0)/fd(x0)
    check = abs(x0-x1)/abs(x0 + small)   # relative change in the iterate
    x0 = x1
    if(check < eps) break                # stop when the change is small
  }
  list(x1=x1, check=check)
}
> nr(-.5, 20, 0.3)
$x1
[1] -0.0004985045
$check
[1] 2.174953e-16
Example 2. The Weibull distribution function is of the form
F(x) = 1 − exp{−(βx)^λ} for x ≥ 0, and F(x) = 0 elsewhere,
where λ and β are positive parameters.
(1) Generate 50 random numbers from a Weibull distribution with β = 1 and λ = 1.8.
> weib = rweibull(50, shape=1.8, scale = 1)
(2) Add three more numbers to the above group. Treat these 53 observations as your data from a Weibull distribution with an unknown λ, but keep β = 1 fixed. Plot the log-likelihood function of λ.
> mydata = c(weib, 0.9, 1, 1.1) # add 3 numbers 0.9, 1, 1.1
The likelihood and log-likelihood functions of λ are
L(λ) = λ^n (∏_{k=1}^n x_k)^{λ−1} exp(−Σ_{k=1}^n x_k^λ),
and
l(λ) = n ln λ + (λ − 1) Σ_{k=1}^n ln x_k − Σ_{k=1}^n x_k^λ,
respectively.
> loglike=function(t){
  x=mydata
  s=0
  for(i in 1:length(x)) s=s-x[i]^t+(t-1)*log(x[i])
  loglike=53*log(t)+s
  loglike
}
> l=seq(0.5, 3, len=200)
> plot(l, loglike(l), type='l', xlab=expression(lambda),
    ylab=expression(l(lambda)),
    main='loglikelihood function for Weibull Data')
It can be seen from the plot of the log-likelihood function that l(λ) is concave.
(3) Use Newton's method to find the maximum likelihood estimator of λ.
To do so, we need to solve the equation l′(λ) = 0 for stationary points. The first and second derivatives of l(λ) are
l′(λ) = n/λ + Σ_{k=1}^n ln x_k − Σ_{k=1}^n x_k^λ ln x_k,
and
l′′(λ) = −n/λ^2 − Σ_{k=1}^n x_k^λ (ln x_k)^2.
Now l′′(λ) < 0 for all λ > 0, which indicates that l(λ) has a unique maximum point.
ld=function(t){        # define l'(lambda)
  x=mydata
  s=0
  for(i in 1:length(x)) s=s-(log(x[i]))*x[i]^t+log(x[i])
  ld=s+53/t
  ld
}
ldd=function(t){       # define l''(lambda)
  x=mydata
  s=0
  for(i in 1:length(x)) s=s-(log(x[i]))^2*x[i]^t
  ldd=s-53/t^2
  ldd
}
newton=function(t,n){  # Newton's iteration
  for(i in 1:n) {t=t-ld(t)/ldd(t)}
  t
}
> newton(0.1,20) # x_0=0.1
[1] 1.704811
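A cross-check I am adding here: since l(λ) is concave, the zero of l′(λ) can also be bracketed and found with uniroot. The data are random, so with a different seed the estimate will differ from 1.704811, but it should stay in the neighborhood of the true shape 1.8:

```r
set.seed(2)                      # arbitrary seed, fixed for reproducibility
mydata = c(rweibull(50, shape = 1.8, scale = 1), 0.9, 1, 1.1)
ld = function(t) 53/t + sum(log(mydata)) - sum(mydata^t * log(mydata))  # l'(lambda)
lam = uniroot(ld, c(0.2, 10))$root
lam                              # MLE of lambda, near 1.8
```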
2.4 Fixed-point iteration: x = g(x) method
Suppose that we can bring an equation f(x) = 0 into the form x = g(x), which usually can be done in several ways. Whenever r = g(r), r is said to be a fixed point of the function g(x).
Under certain conditions, we can solve this equation using iteration.
Start with an approximation x_0 of the root, and calculate
x_1 = g(x_0),
x_2 = g(x_1),
. . .
x_{n+1} = g(x_n), n = 0, 1, 2, . . .
Example. Consider a simple equation
x^2 − 3x + 2 = 0.
It can be rewritten as x = g(x) in many ways. For instance,
x = (x^2 + 2)/3,
x = √(3x − 2),
x = −√(3x − 2),
x = x^2 − 2x + 2,
x = (1/2)√(3x − 2) + x/2.
The iteration is easily set down with a for loop.
> fixed=function(x, n){
for(i in 1:n){ x = g(x) }
x
}
Let us take a look at x = (x^2 + 2)/3.
> g=function(x){(x^2+2)/3}
> fixed(0.1, 20)
[1] 0.9999037 # It's close to 1, one of the roots.
> fixed(3, 20)
[1] Inf # A problem of the initial point?
> fixed(-4, 20)
[1] Inf
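The choice of rewriting matters. For g(x) = (x^2 + 2)/3 we have g′(x) = 2x/3, so |g′(x)| < 1 near the root 1 (the iteration converges to 1 from nearby starts) but |g′(2)| = 4/3 > 1 near the root 2, which is why the starting points 3 and −4 blew up. The rewriting x = √(3x − 2) behaves the other way around: g′(x) = 3/(2√(3x − 2)) is below 1 near x = 2, so the same iteration picks up the other root (my illustration, not in the original notes):

```r
fixed = function(x, n){ for(i in 1:n){ x = g(x) }; x }
g = function(x){ sqrt(3*x - 2) }   # the rewriting x = sqrt(3x - 2)
fixed(3, 100)                      # converges to 2, the root missed before
```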
A solution is guaranteed under the assumptions of the following theorem.
Theorem. If |g′(x)| ≤ k < 1 in an interval (a, b), and the sequence {x0, x1, ..., xn, ...} belongs to (a, b), thenthe sequence has a limit r, and r is the only root of x = g(x) in the interval (a, b).
Proof. Appealing to the mean value theorem (Lagrange's theorem), we can write
x_2 − x_1 = g(x_1) − g(x_0) = (x_1 − x_0) g′(c_1), where c_1 is between x_0 and x_1,
x_3 − x_2 = g(x_2) − g(x_1) = (x_2 − x_1) g′(c_2), where c_2 is between x_1 and x_2,
. . .
x_{n+1} − x_n = g(x_n) − g(x_{n−1}) = (x_n − x_{n−1}) g′(c_n), where c_n is between x_{n−1} and x_n.
Since |g′(x)| ≤ k < 1, we obtain
|x_2 − x_1| ≤ k |x_1 − x_0|,
|x_3 − x_2| ≤ k |x_2 − x_1| ≤ k^2 |x_1 − x_0|,
. . .
|x_{n+1} − x_n| ≤ k |x_n − x_{n−1}| ≤ · · · ≤ k^n |x_1 − x_0|,
and for m > n,
|x_m − x_n| ≤ |x_m − x_{m−1}| + |x_{m−1} − x_{m−2}| + . . . + |x_{n+1} − x_n|
≤ (k^{m−1} + . . . + k^n) |x_1 − x_0|
= (k^n − k^m)/(1 − k) · |x_1 − x_0|.
Thus, by Cauchy's criterion, the sequence {x_n, n = 0, 1, 2, . . .} converges; say the limit is r. By taking limits on both sides of the equation
x_{n+1} = g(x_n),
we obtain lim_{n→∞} x_{n+1} = lim_{n→∞} g(x_n), or
r = g(r),
which means that r is a root of the equation x = g(x).
If r_1 were a second root of x = g(x) in the interval (a, b), then
r_1 − r = g(r_1) − g(r) = (r_1 − r) g′(c), with c ∈ (a, b).
This forces g′(c) = 1, contradicting |g′(x)| ≤ k < 1. □
Notice that Newton's method is a special case of the fixed-point iteration, with
g(x) = x − f(x)/f′(x),
and
g′(x) = 1 − [{f′(x)}^2 − f(x)f′′(x)] / {f′(x)}^2 = f(x)f′′(x) / {f′(x)}^2.
Applying the above theorem to this particular case, we obtain
Corollary. Assume that the function f(x) is continuous on the interval [a, b] and is twice differentiable in (a, b), with
|f(x)f′′(x) / {f′(x)}^2| ≤ k < 1, x ∈ (a, b).
If the sequence {x_0, x_1, x_2, . . .} is generated by Newton's method with
x_{n+1} = x_n − f(x_n)/f′(x_n), n = 0, 1, 2, . . . ,
and x_n ∈ (a, b), n = 0, 1, 2, . . . , then the sequence has a limit r, and r is the only root of f(x) = 0 in the interval [a, b].
This corollary indicates that the initial point x_0 is very important for Newton's method. A good try should start with an x_0 that satisfies
|f(x_0)f′′(x_0) / {f′(x_0)}^2| ≤ k < 1.
2.5 Convergence rate
Consider a fixed-point iteration for solving the equation x = g(x) with the procedure
x_{n+1} = g(x_n), n = 0, 1, 2, . . .
Let r be the root of the equation. Define the nth-step error by
e_n = r − x_n, n = 1, 2, . . .
Since r = g(r), we obtain
e_{n+1} = r − x_{n+1} = g(r) − g(x_n) = g′(c_n)(r − x_n)   (by the mean value theorem)
= g′(c_n) e_n.
This means the error at the (n + 1)th step is linearly related to the error at the nth step.
For Newton’s method, it can be shown that the error at the (n + 1)th step is quadratically related tothe error at the nth step.
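The contrast is easy to see numerically. For f(x) = x^2 − 3x + 2 with root r = 1, the fixed-point map g(x) = (x^2 + 2)/3 shrinks the error by roughly the constant factor |g′(1)| = 2/3 per step, while Newton's method roughly squares the error each step. A sketch of the comparison (my addition, not in the original notes):

```r
f  = function(x){ x^2 - 3*x + 2 }
g  = function(x){ (x^2 + 2)/3 }        # fixed-point map: linear convergence
nw = function(x){ x - f(x)/(2*x - 3) } # Newton map: quadratic convergence
x_fp = 0.5; x_nw = 0.5
for(i in 1:6){
  x_fp = g(x_fp); x_nw = nw(x_nw)
  cat(i, "fixed-point error:", abs(x_fp - 1), " Newton error:", abs(x_nw - 1), "\n")
}
```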
2.6 Newton’s method for a system of nonlinear equations
Newton’s method can be applied for solving a system of nonlinear equations. This is particularly usefulwhen we try to find maximum likelihood estimators of several parameters.
Let F(x) be a vector-valued function of a vector argument x, assuming that both vectors contain mcomponents. To apply Newton’s method to the problem of approximating a solution of
F(x) = 0,
we would like to start from an initial point x0 and then write
x_{n+1} = x_n − F(x_n)/F′(x_n), n = 0, 1, 2, . . .
Two questions arise immediately in the above procedure. First, what is meant by F′(x_n)? And second, what is meant by the division F(x_n)/F′(x_n)?
Here, F′(x) is the matrix defined by
F′(x) = [ ∂f_1(x)/∂x_1   ∂f_1(x)/∂x_2   . . .   ∂f_1(x)/∂x_m
          ∂f_2(x)/∂x_1   ∂f_2(x)/∂x_2   . . .   ∂f_2(x)/∂x_m
          . . .
          ∂f_m(x)/∂x_1   ∂f_m(x)/∂x_2   . . .   ∂f_m(x)/∂x_m ].
This matrix is known as the Jacobian matrix for the system and is typically denoted by J(x).
For the division by a matrix, we multiply by its inverse. Thus, Newton's method takes the form
x_{n+1} = x_n − (J(x_n))^{−1} F(x_n), n = 0, 1, 2, . . .
When implementing this scheme, rather than actually computing the inverse of the Jacobian matrix, we define
v_n = −(J(x_n))^{−1} F(x_n),
and then solve the linear system of equations
J(x_n) v_n = −F(x_n)
for v_n. Once v_n is known, the next iterate is computed according to the rule
x_{n+1} = x_n + v_n, n = 0, 1, 2, . . .
Example 1. Find the solution of the system of two nonlinear equations
x_1^3 − 2x_2 + 1 = 0,
x_1 + 2x_2^3 − 3 = 0.
First of all, we set up
x = (x_1, x_2)^T, and F(x) = (x_1^3 − 2x_2 + 1, x_1 + 2x_2^3 − 3)^T.
Then, find the Jacobian matrix for the system,
J(x) = ( 3x_1^2   −2
         1        6x_2^2 ).
The following code was written by Katherine Earles.
F=function(x){ # define the (column) vector of equations
F=matrix(0,nrow=2) # nrow depends on the length of F
F[1]= x[1]^3-2*x[2]+1 # The first component of F
F[2]= x[1]+2*x[2]^3-3 # The second component of F
F # output F, a column vector of values
}
J=function(x){ # define the Jacobian of F
j=matrix(0,ncol=2,nrow=2) # ncol & nrow depend on the length of F
j[1,1]= 3*x[1]^2
j[1,2]= -2
j[2,1]= 1
j[2,2]= 6*x[2]^2
j # output j, a matrix of values
}
NNL=function(initial,n){ # Newton's method for a system of nonlinear equations
x=initial
v=matrix(0,ncol=length(x))
for (i in 1:n){
v=solve(J(x),-F(x))
x=x+v}
cat(" x1=",x[1],"\n","x2=",x[2],"\n")
}
Sometimes we may need to check whether the Jacobian matrix is invertible. For this purpose, the code above is improved as follows.
NNL=function(initial,n){ # Newton's method for a system of nonlinear equations
x=initial
v=matrix(0,ncol=length(x))
for (i in 1:n){
d=det(J(x)) # check that J(x) is invertible
if (identical(all.equal(d,0),TRUE))
{cat("Jacobian has no inverse. Try a different initial point.","\n")
break}
else
v=solve(J(x),-F(x))
x=x+v
}
cat(" x1=",x[1],"\n","x2=",x[2],"\n")
}
> NNL(c(0.1, 0.2), 1)
x1= 2.901794
x2= 0.5425269
> NNL(c(0.1, 0.2), 2)
x1= 1.969765
x2= 0.9450524
> NNL(c(0.1, 0.2), 3)
x1= 1.387231
x2= 0.9309951
> NNL(c(0.1, 0.2), 4)
x1= 1.093614
x2= 0.9872401
> NNL(c(0.1, 0.2), 5)
x1= 1.007192
x2= 0.9989359
> NNL(c(0.1, 0.2), 6)
x1= 1.000047
x2= 0.9999933
> NNL(c(0.1, 0.2), 7)
x1= 1
x2= 1
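The iterates above settle at (1, 1), and it is easy to verify (my addition) that this point solves the system exactly:

```r
F = function(x){ c(x[1]^3 - 2*x[2] + 1, x[1] + 2*x[2]^3 - 3) }
F(c(1, 1))    # both components are 0, so (1, 1) is an exact solution
```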
Example 2 (Logistic regression model). Let Y denote a binary response variable. The regression model
E(Y) = π(x) = exp(β_0 + β_1 x) / {1 + exp(β_0 + β_1 x)}
is called the logistic regression model, where β_0 and β_1 are parameters.
Suppose that Y_1, Y_2, . . . , Y_n are independent Bernoulli random variables with
E(Y_i) = π_i = exp(β_0 + β_1 x_i) / {1 + exp(β_0 + β_1 x_i)}, i = 1, . . . , n,
where the x observations are assumed to be known constants.
The likelihood function of the parameters β_0 and β_1 is
L(β_0, β_1) = ∏_{i=1}^n π_i^{y_i} (1 − π_i)^{1−y_i}
= ∏_{i=1}^n {π_i/(1 − π_i)}^{y_i} · ∏_{i=1}^n (1 − π_i)
= exp{ Σ_{i=1}^n (β_0 + β_1 x_i) y_i } · ∏_{i=1}^n {1 + exp(β_0 + β_1 x_i)}^{−1}.
From this we obtain the log-likelihood function
l(β_0, β_1) = Σ_{i=1}^n (β_0 + β_1 x_i) y_i − Σ_{i=1}^n ln{1 + exp(β_0 + β_1 x_i)}.
However, no closed-form expression exists for the values of β_0 and β_1 that maximize the log-likelihood function l(β_0, β_1), so we need to maximize l(β_0, β_1) numerically.
A data set from Kutner et al. (2005), Applied Linear Statistical Models, page 566 (x = months of experience, y = task success):
x=c(14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8) # months
y=c(0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1) # success
We start by defining the partial derivatives of l(β_0, β_1), ∂l/∂β_0 and ∂l/∂β_1, which are our target functions.
F1=function(b){
F1=0
for(i in 1:length(x)) F1=F1+y[i]-exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))
F1
}
F2=function(b){
F2=0
for(i in 1:length(x)) F2=F2+x[i]*y[i]-x[i]*exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))
F2
}
F=function(b){
F=matrix(0,nrow=2)
F[1]=F1(b)
F[2]=F2(b)
F
}
Alternatively, the vector function F(β_0, β_1) can be set up as follows.
F=function(b){
F=matrix(0,nrow=2)
s1=0
s2=0
for(i in 1:length(x)){
s1 = s1 +y[i]-((exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))
s2 = s2 +x[i]*y[i]-(x[i]*(exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))}
F[1]=s1
F[2]=s2
F}
The next step is to set down the Jacobian matrix, a 2 × 2 matrix.
J=function(b){
j=matrix(0,ncol=2,nrow=2) # The format of J is 2 by 2
s11=0
s12=0
s22=0
for(i in 1:length(x)){
s11 = s11-exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
s12 = s12 -x[i]*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
s22 = s22 -(x[i]^(2))*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
}
j[1,1]=s11
j[1,2]=s12
j[2,1]=s12
j[2,2]=s22
j
}
The R code for Newton's method is
NNL = function(initial,n){
  b=initial
  v=matrix(0,ncol=length(b))
  for (i in 1:n){
    d=det(J(b))                 # check that J(b0,b1) is invertible
    if(identical(all.equal(d,0),TRUE))
      {cat('Jacobian has no inverse. Try a different initial point.','\n')
       break}
    else
      v=solve(J(b),-F(b))
    b=b+v}
  cat(' b0=',b[1],'\n','b1=',b[2],'\n')
}
Finally let us try several particular cases.
> NNL(c(1,1),10) # A good initial point is important!
Error in qr(x, tol = tol) : NA/NaN/Inf in foreign function call (arg 1)
> NNL(c(1,0),5) # A small n
b0= -3.059696
b1= 0.1614859
> NNL(c(1,0),200) # A large n
b0= -3.059696
b1= 0.1614859
> F(c(-3.059696, 0.1614859)) # check the value of F
[1,] 2.066355e-06
[2,] 4.156266e-05
Thus, the maximum likelihood estimators of β0 and β1 are -3.059696 and 0.1614859, respectively.
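As a final sanity check (my addition, not in the original notes), R's built-in glm with family = binomial fits the same logistic model by iteratively reweighted least squares and should reproduce these estimates:

```r
x = c(14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8)
y = c(0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1)
fit = glm(y ~ x, family = binomial)
coef(fit)   # intercept about -3.0597 and slope about 0.1615, as above
```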