3 | Calculus

The relative quality of parameter values during model estimation is commonly evaluated by means of a function that measures the relative goodness of a given parameter value with respect to the model and observed data. The ability to study how the value of such a function changes in response to changes in the parameter value is an essential requirement for adapting parameter values in a sensible fashion. The evaluation of function changes is the core topic of differential calculus. In this Section, we first review some essential aspects of differential calculus from the viewpoint of developing the theory of the GLM, including the notions of derivatives of univariate real-valued functions (Section 3.1), the analytical optimization of univariate real-valued functions (Section 3.2), and derivatives of multivariate real-valued functions (Section 3.3). In a final Section (Section 3.5), we review some essential aspects of integral calculus. In the context of the theory of the GLM, integrals primarily occur as expectations, variances, and covariances of random variables and random vectors.

3.1 Derivatives of univariate real-valued functions

We first consider derivatives of univariate real-valued functions, by which we understand functions that map real numbers onto real numbers. In other words, we consider functions f of the type

f : R → R, x ↦ f(x). (3.1)

The derivative f′(x0) ∈ R of such a function at the location x0 ∈ R conveys two basic and familiar intuitions:

1. f′(x0) is a measure of the rate of change of f at location x0,

2. f′(x0) is the slope of the tangent line of f at the point (x0, f(x0)) ∈ R².

Formally, this may be expressed by the differential quotient of f. The differential quotient, also referred to as Newton's difference quotient, expresses the difference between two values of the function f(x + h) and f(x) with respect to the difference between the two locations x and x + h for h approaching zero:

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h. (3.2)

The differential quotient (3.2) represents the formal definition of the derivative of f at x and the basis for mathematical proofs of the rules of differentiation to be discussed in the following. However, its practical importance for the development of the theory of the GLM is by and large negligible.
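Although the difference quotient is rarely needed analytically in what follows, it is straightforward to evaluate numerically. The following short Python sketch is an added illustration (not part of the original text); the function f(x) := x^2 and the step sizes h are arbitrary choices.

def difference_quotient(f, x, h):
    """Newton's difference quotient (f(x + h) - f(x)) / h, cf. eq. (3.2)."""
    return (f(x + h) - f(x)) / h

# Illustrative example: f(x) := x^2 with exact derivative f'(x) = 2x.
f = lambda x: x ** 2
x0 = 1.5
for h in (1e-1, 1e-3, 1e-6):
    approx = difference_quotient(f, x0, h)
    print(f"h = {h:.0e}: difference quotient = {approx:.8f}, exact = {2 * x0:.8f}")

As h shrinks, the printed difference quotients approach the exact value 3.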

It is, however, important to distinguish two common usages of the term derivative: first, the derivative of a function f can be considered at a specific value x0 in the domain of f, denoted by

f′(x)|_{x=x0} ∈ R, (3.3)

and represented by a number. Second, if the derivative (3.3) is evaluated for all possible values in the domain of f, the derivative of f can be conceived as a function

f′ : R → R, x0 ↦ f′(x0) := f′(x)|_{x=x0}. (3.4)

Intuitively, (3.4) means that the derivative of a (differentiable) univariate real-valued function may be evaluated at any point of the real line.

Higher-order derivatives

The derivative f′ of a function f is also referred to as the first-order derivative of a function (the zeroth-order derivative of a function corresponds to the function itself). Higher-order derivatives (i.e., second-order, third-order, and so on) can be evaluated by recursively forming the derivative of the respective lower-order derivative. For example, the second-order derivative of a function corresponds to the (first-order) derivative of the first-order derivative of a function. To this end, the d/dx operator notation for derivatives is useful. Intuitively, the symbol

d/dx f (3.5)

can be understood as the imperative to evaluate the derivative of f, or simply as an alternative notation for f′. By itself, d/dx carries no meaning. We thus have for the first-order derivative

d/dx f(x) = f′(x) (3.6)

and for the second-order derivative

d²/dx² f(x) = d/dx (d/dx f(x)) = f′′(x). (3.7)

Intuitively, the second-order derivative measures the rate of change of the first-order derivative in the vicinity of x. If these first-order derivatives, which may be visualized as tangent lines, change relatively quickly in the vicinity of x, the second-order derivative is large, and the function is said to have a high curvature.

Derivatives of important functions

We next collect the derivatives of essential functions without proofs.

Constant function. The derivative of any constant function

f : R → R, x ↦ f(x) := a, where a ∈ R, (3.8)

is zero:

f′ : R → R, x ↦ f′(x) = 0. (3.9)

For example, the derivative of f(x) := 2 is f′(x) = 0.

Single-term polynomial functions. Let f be a single-term polynomial function of the form

f : R → R, x ↦ f(x) := ax^b, where a ∈ R, b ∈ R \ {0}. (3.10)

Then the derivative of f is given by

f′ : R → R, x ↦ f′(x) = bax^{b−1}. (3.11)

For example, the derivative of f(x) := 2x^3 is f′(x) = 6x^2, and the derivative of g(x) := √x = x^{1/2} is g′(x) = (1/2)x^{−1/2} = 1/(2√x).

Exponential function. Let f be the exponential function

f : R → R_{>0}, x ↦ f(x) := exp(x). (3.12)

Then the derivative of f is given by

f′ : R → R_{>0}, x ↦ f′(x) = exp(x). (3.13)

Natural logarithm. Let f be the natural logarithm

f : R_{>0} → R, x ↦ f(x) := ln(x). (3.14)

Then the derivative of f is given by

f′ : R_{>0} → R, x ↦ f′(x) = 1/x. (3.15)

Rules of differentiation

We next state important rules for evaluating the derivatives of univariate functions without proof.


Summation rule. Let

f : R → R, x ↦ f(x) := Σ_{i=1}^n gi(x) (3.16)

be the sum of n arbitrary functions gi : R → R (i = 1, 2, ..., n). Then the derivative of f is given by the sum of the derivatives of the functions gi:

f′ : R → R, x ↦ f′(x) := Σ_{i=1}^n gi′(x). (3.17)

For example, the derivative of

f(x) = x^2 + 2x, where g1(x) := x^2 and g2(x) := 2x, (3.18)

is

f′(x) = g1′(x) + g2′(x) = 2x + 2. (3.19)

Chain rule. Let h be the composition of two functions f : R → R and g : R → R, i.e.,

h : R → R, x ↦ h(x) := (g ◦ f)(x) = g(f(x)). (3.20)

Then the derivative of h is given by

h′ : R → R, x ↦ h′(x) := g′(f(x)) f′(x). (3.21)

In words, the derivative of a function that can be written as the composition of a first function f with a second function g is given by the derivative of the second function g "at the location of the function f" multiplied with the derivative of the first function f. For example, the derivative of

h(x) := exp(−(1/2)x^2), (3.22)

which can be written as the composition of a function g(x) := exp(x) with derivative g′(x) = exp(x) and a function f(x) := −(1/2)x^2 with derivative f′(x) = −x, is given by

h′(x) = −exp(−(1/2)x^2) x. (3.23)

Product rule. Let f be the product of two functions gi : R → R with i = 1, 2, i.e.,

f : R → R, x ↦ f(x) := g1(x)g2(x). (3.24)

Then the derivative of f is given by

f′ : R → R, x ↦ f′(x) := g1′(x)g2(x) + g1(x)g2′(x), (3.25)

where g1′ and g2′ denote the derivatives of g1 and g2, respectively. In words, if a function can be written as the product of a first and a second function, its derivative corresponds to the product of the derivative of the first function with the second function plus the product of the first function with the derivative of the second function. For example, the derivative of

f(x) := x^2 exp(x) (3.26)

can be found by writing f as g1 · g2 with g1(x) := x^2 and g2(x) := exp(x) with derivatives g1′(x) = 2x and g2′(x) = exp(x), respectively. This then yields

f′(x) = 2x exp(x) + x^2 exp(x). (3.27)


Quotient rule. Let f be the quotient of two functions gi : R → R with i = 1, 2, i.e.,

f : R → R, x ↦ f(x) := g1(x)/g2(x). (3.28)

Then the derivative of f is given by

f′ : R → R, x ↦ f′(x) := (g1′(x)g2(x) − g1(x)g2′(x)) / g2(x)^2. (3.29)

In words, the derivative of a function that can be written as the quotient of a first function in the numerator and a second function in the denominator is given by the difference of the product of the derivative of the first function with the second function and the product of the first function with the derivative of the second function, divided by the square of the second function, i.e., the function in the denominator of the original function. For example, the derivative of

f(x) := exp(x) / (x^2 + 1) (3.30)

can be evaluated by considering g1(x) := exp(x) with derivative g1′(x) = exp(x) and g2(x) := x^2 + 1 with derivative g2′(x) = 2x, yielding

f′(x) = (exp(x)(x^2 + 1) − exp(x) 2x) / (x^2 + 1)^2. (3.31)
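The worked examples (3.23), (3.27), and (3.31) can be checked symbolically. The following Python/SymPy sketch is an added illustration (not part of the original text and independent of the glm scripts); it differentiates the three example functions and should reproduce the stated results up to algebraic simplification.

import sympy as sp

x = sp.symbols('x')

# Chain rule example (3.22)-(3.23): h(x) = exp(-x^2/2); expected -x*exp(-x**2/2).
h = sp.exp(-sp.Rational(1, 2) * x ** 2)
print(sp.diff(h, x))

# Product rule example (3.26)-(3.27): f(x) = x^2*exp(x); expected x**2*exp(x) + 2*x*exp(x).
f = x ** 2 * sp.exp(x)
print(sp.expand(sp.diff(f, x)))

# Quotient rule example (3.30)-(3.31): f(x) = exp(x)/(x^2 + 1); the simplified result
# is algebraically equal to (exp(x)(x^2 + 1) - exp(x)*2x)/(x^2 + 1)^2.
q = sp.exp(x) / (x ** 2 + 1)
print(sp.simplify(sp.diff(q, x)))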

3.2 Analytical optimization of univariate real-valued functions

First- and second-order derivatives can be used to find local maxima and minima of functions. Finding maxima and minima of functions is a fundamental aspect of applied mathematics and is, in general, referred to as optimization. As discussed in the introductory statements of this Section, optimization is central for model estimation, as will become evident in Section 8 | Maximum likelihood estimation.

Extrema and extremal points

It is helpful to clearly differentiate between two aspects of optimization: on the one hand, when finding a maximum or a minimum, one finds a value of a function f : D → R for which the conditions f(x) ≥ f(x′) or f(x) ≤ f(x′) hold for at least all x′ in a vicinity of x ∈ D. These values, which are elements of the range of f, are called maxima or minima and are abbreviated by

max_{x∈D} f(x) and min_{x∈D} f(x), (3.32)

respectively. On the other hand, one simultaneously finds those points x in the domain of f for which f(x) assumes a maximum or minimum. These points, which are often more interesting than the corresponding values f(x) themselves, are referred to as extremal points and are abbreviated by

arg max_{x∈D} f(x) and arg min_{x∈D} f(x), (3.33)

for extremal points that correspond to maxima and minima of f, respectively. Note the difference between max_{x∈D} f(x) and arg max_{x∈D} f(x): the former refers to a point or a set of points in the range of f, the latter to a point or a set of points in the domain of f.

Necessary condition for an extremum

When using first- and second-order derivatives to find extremal points and their corresponding maxima or minima, it is helpful to distinguish necessary and sufficient conditions for extrema. The necessary condition for an extremum (i.e., a maximum or a minimum) of a function f : R → R at a point x ∈ R is that the first derivative is equal to zero: f′(x) = 0. Intuitively, this can be made transparent by considering a maximum of f at a point x_max: for all values of x which are smaller than x_max, the derivative is positive, because the function is increasing, leading up to the maximum. For all values of x which are larger than x_max, the derivative is negative, because the function is decreasing. At the location of the maximum, the function is neither increasing nor decreasing, and thus f′(x) = 0. The reverse is true for a minimum of f at a point x_min: for all values of x which are smaller than x_min, the derivative is negative, because the function is decreasing towards the minimum. For all values of x which are larger than x_min, the derivative is positive, because the function is increasing again and recovering from the minimum. Again, at the location of the minimum, the function is neither increasing nor decreasing, and thus f′(x) = 0. Based on finding a point x∗ with f′(x∗) = 0 one cannot decide whether x∗ corresponds to a maximum or minimum, because in both cases f′(x) = 0. On the other hand, if a minimum or maximum exists at a point x∗, it necessarily follows that f′(x∗) = 0; hence, the nomenclature necessary condition. In fact, there are two more possibilities for points at which f′(x∗) = 0: the function may be increasing for x < x∗ and for x > x∗, or the function may be decreasing for x < x∗ and for x > x∗. In both cases, there is neither a maximum nor a minimum at x∗, but what is referred to as a saddle point.

Sufficient conditions for an extremum

The second-order derivative f′′(x) allows for testing whether a critical point x∗ for an extremum, i.e., a point for which f′(x∗) = 0, is a maximum, a minimum, or a saddle point. In brief, if f′′(x∗) < 0, there is a maximum at x∗; if f′′(x∗) > 0, there is a minimum at x∗; and if f′′(x∗) = 0, the second-order test is inconclusive and x∗ may, for example, be a saddle point. Together with the condition f′(x∗) = 0, the first two conditions are referred to as sufficient conditions for an extremum.

The role of the second derivative can be made intuitive by considering a maximum at x∗. For points x < x∗, the slope of the tangent line at f(x) must be positive, because f is increasing towards x∗. Likewise, for points x > x∗, the slope of the tangent line at f(x) must be negative, because f is decreasing after assuming its maximum in x∗. In other words, f′(x) > 0 (positive) for x < x∗, f′(x) < 0 (negative) for x > x∗, and f′(x∗) = 0. We next consider the change of f′, i.e., f′′: in the region around the maximum, f′ decreases from a positive value to zero to a negative value, as just stated. Because f′(x) is positive just before (to the left of) x∗ and negative just after (to the right of) x∗, it obviously decreases from just before to just after x∗. But this means that its own rate of change, f′′, is negative in x∗. The reverse holds for a minimum of f in x∗.

To recapitulate, we have established the following conditions that use the derivatives of a function f : R → R to determine its extrema:

• If there is a maximum or minimum of f at x∗, then f′(x∗) = 0.

• If f′(x∗) = 0 and f′′(x∗) > 0, then there is a minimum at x∗.

• If f′(x∗) = 0 and f′′(x∗) < 0, then there is a maximum at x∗.

The first of these conditions is referred to as the necessary condition for extrema, the latter two conditions as the sufficient conditions for extrema. We next discuss three examples in which we use these conditions to determine the location of extrema (Figure 3.1).

Example 1. Consider the function

f : [0, π] → [−1, 1], x ↦ sin(x) (3.34)

depicted in Figure 3.1A by a blue curve. The first derivative of f is given by

f′ : [0, π] → [−1, 1], x ↦ d/dx sin(x) = cos(x) (3.35)

and depicted in Figure 3.1A by a red curve. Notably, in the interval [0, π], the cosine function assumes a zero point at π/2. We thus have the critical point x∗ = π/2 for an extremum. The second derivative of f is given by

f′′ : [0, π] → [−1, 1], x ↦ d/dx cos(x) = −sin(x) (3.36)

and is depicted in Figure 3.1A by a dashed red curve. Because f′′(π/2) = −sin(π/2) = −1 < 0, we can conclude that there is a maximum of f at x = π/2. Of course, this is also obvious from the graph of f.


Figure 3.1. Analytical optimization of basic functions. A: f(x) := sin(x). B: f(x) := (x − 1)². C: f(x) := −x². Each panel depicts f(x), f′(x), f′′(x), and the critical point x∗. For a detailed discussion, please refer to the main text (glm_3.m).

Example 2. Consider the function

f : R → R, x ↦ (x − 1)^2 (3.37)

depicted in Figure 3.1B by a blue curve. The first derivative of f is given by

f′ : R → R, x ↦ d/dx (x − 1)^2 = 2x − 2 (3.38)

and depicted in Figure 3.1B by a red curve. Setting this derivative to zero and solving for x yields

2x − 2 = 0 ⇔ 2x = 2 ⇔ x = 1. (3.39)

We thus have the critical point x∗ = 1 for an extremum. The second derivative of f is given by

f′′ : R → R, x ↦ d/dx (2x − 2) = 2 (3.40)

and is depicted in Figure 3.1B by a dashed red curve. The second derivative is thus a constant function, and f′′(x∗) = 2 > 0. We thus conclude that there is a minimum of f at x = 1. Again, this is also obvious from the graph of f.

Example 3. Finally, we consider the function

f : R → R, x ↦ −x^2 (3.41)

depicted in Figure 3.1C by a blue curve. The first derivative of f is given by

f′ : R → R, x ↦ d/dx (−x^2) = −2x (3.42)

and depicted in Figure 3.1C by a red curve. Setting this derivative to zero and solving for x yields

−2x = 0 ⇔ x = 0. (3.43)

We thus have the critical point x∗ = 0 for an extremum. The second derivative of f is given by

f′′ : R → R, x ↦ d/dx (−2x) = −2 (3.44)

and is depicted in Figure 3.1C by a dashed red curve. The second derivative is thus a constant function and f′′(x∗) = −2 < 0. We thus conclude that there is a maximum of f at x = 0.
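The necessary and sufficient conditions can also be combined programmatically. The following SymPy sketch is an added illustration (it is not the book's glm_3.m script); it classifies the critical points of the functions from Examples 2 and 3.

import sympy as sp

x = sp.symbols('x', real=True)

# Examples 2 and 3 from the text: f(x) := (x - 1)^2 and f(x) := -x^2.
for label, f in [("(x - 1)^2", (x - 1) ** 2), ("-x^2", -x ** 2)]:
    fp = sp.diff(f, x)                              # first derivative
    fpp = sp.diff(f, x, 2)                          # second derivative
    for x_star in sp.solve(sp.Eq(fp, 0), x):        # necessary condition f'(x*) = 0
        curv = fpp.subs(x, x_star)
        if curv > 0:
            kind = "minimum"
        elif curv < 0:
            kind = "maximum"
        else:
            kind = "second-order test inconclusive"
        print(f"f(x) = {label}: x* = {x_star}, f''(x*) = {curv} -> {kind}")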


Figure 3.2. Visualization of multivariate (here bivariate), real-valued functions. Real-valued functions of multiple variables are often visualized in a three-dimensional way as in the left panels of the Figure. Note that although this is a 3D plot, the function is bivariate, i.e., it is a function of two variables. The same information can be conveyed by using isocontour plots, which visualize the isocontours of functions in 2D. Isocontours are the lines assuming equal values in the range of the function. Usually isocontour plots suffice to convey all relevant information about a bivariate function (glm_03.m).

3.3 Derivatives of multivariate real-valued functions

Thus far, we considered functions of the form f : R → R, which map numbers x ∈ R onto numbers f(x) ∈ R. Another function type that is encountered in the development of the GLM is functions of the form

f : R^n → R, x ↦ f(x), (3.45)

where

x := (x1, ..., xn)^T ∈ R^n (3.46)

is an n-dimensional vector. Because the input argument x to such a function can vary along n ≥ 1 dimensions and its output argument f(x) is a scalar real number, such functions are also called multivariate real-valued functions. In physics, such functions are referred to as scalar fields, because they allocate scalars f(x) ∈ R to points x in the n-dimensional space R^n. An example for a function of the type (3.45) is

f : R² → R, x ↦ f(x) = f((x1, x2)^T) := x1^2 + x2^2, (3.47)

which is visualized in Figure 3.2A. Another example is the function

g : R² → R, x ↦ g(x) = g((x1, x2)^T) := exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)), (3.48)

which is visualized in Figure 3.2B. Note that functions defined on spaces R^n with n > 2 cannot be visualized easily.


As for univariate real-valued functions, one can ask how much a change in the input argument at a specific point in R^n of a multivariate real-valued function affects the value of the function. If one asks this question for each of the subcomponents xi, i = 1, ..., n of x ∈ R^n independently of the remaining n − 1 subcomponents, one is led to the concept of a partial derivative: the partial derivative of a multivariate real-valued function f : R^n → R with respect to a variable xi, i = 1, ..., n captures how much the function value changes "in the direction" of xi, i.e., in the cross-section through the space R^n defined by the variable of interest. Stated differently, the partial derivative of a function f : R^n → R in a point x ∈ R^n with respect to a variable xi is the derivative of the function f with respect to xi while all other variables xj, j = 1, ..., i − 1, i + 1, ..., n are held constant. The partial derivative of a function f : R^n → R in a point x ∈ R^n with respect to a variable xi is denoted by

∂/∂xi f(x), (3.49)

where the ∂ symbol is used to distinguish the notion of a partial derivative from a standard derivative. This notation is somewhat redundant, because the subscript i on the x in ∂/∂xi already makes it clear that the derivative is with respect to xi only. The notation is, however, commonly used, and if the subcomponents of x are not denoted by x1, ..., xn, but by, say, a := x1, b := x2, ..., e := x5, it is, in fact, helpful. Like the derivative of a univariate real-valued function, one may evaluate the partial derivative for all x ∈ R^n and hence also view the partial derivative of a multivariate real-valued function as a function

∂/∂xi f : R^n → R, x ↦ ∂/∂xi f(x). (3.50)

We next discuss two examples.

Example 1. We first consider the function

f : R² → R, x ↦ f(x) := x1^2 + x2^2. (3.51)

Because this function has a two-dimensional domain, one can evaluate two different partial derivatives,

∂/∂x1 f : R² → R, x ↦ ∂/∂x1 f(x) (3.52)

and

∂/∂x2 f : R² → R, x ↦ ∂/∂x2 f(x). (3.53)

To evaluate the partial derivative (3.52), one considers the function

f_{x2} : R → R, x1 ↦ f_{x2}(x1) := x1^2 + x2^2, (3.54)

where x2 assumes the role of a constant. To indicate that x2 is no longer an input argument of the function, but the function is still dependent on the constant x2, we have used the subscript notation f_{x2}(x1). To evaluate the partial derivative, we evaluate the standard univariate derivative of f_{x2},

f′_{x2}(x1) = 2x1. (3.55)

We thus have

∂/∂x1 f : R² → R, x ↦ ∂/∂x1 f(x) = ∂/∂x1 (x1^2 + x2^2) = f′_{x2}(x1) = 2x1. (3.56)

Accordingly, with the corresponding definition of f_{x1}, we have

∂/∂x2 f : R² → R, x ↦ ∂/∂x2 f(x) = ∂/∂x2 (x1^2 + x2^2) = f′_{x1}(x2) = 2x2. (3.57)

Example 2. We next consider the example

g : R² → R, x ↦ g(x) := exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)). (3.58)

Again, there are two partial derivatives to consider. Using the chain rule of differentiation and the logic of treating the variable with respect to which the derivative is not performed as a constant, we obtain

∂/∂x1 g(x) = ∂/∂x1 (exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)))
= exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)) ∂/∂x1 (−(1/2)((x1 − 1)^2 + (x2 − 1)^2))
= −exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)) (x1 − 1),   (3.59)

and

∂/∂x2 g(x) = ∂/∂x2 (exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)))
= exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)) ∂/∂x2 (−(1/2)((x1 − 1)^2 + (x2 − 1)^2))
= −exp(−(1/2)((x1 − 1)^2 + (x2 − 1)^2)) (x2 − 1),   (3.60)

for the values of the respective partial derivative functions.
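A quick numerical cross-check of (3.59) and (3.60) is possible with central differences. The following Python sketch is an added illustration (not part of the original text); the evaluation point x = (0.5, 2.0) and the step size h are arbitrary choices.

import numpy as np

def g(x1, x2):
    # The bivariate function from eq. (3.58).
    return np.exp(-0.5 * ((x1 - 1) ** 2 + (x2 - 1) ** 2))

def partial(f, x, i, h=1e-6):
    # Central-difference approximation of the partial derivative with respect to x[i].
    e = np.zeros_like(x, dtype=float)
    e[i] = h
    return (f(*(x + e)) - f(*(x - e))) / (2 * h)

x = np.array([0.5, 2.0])
analytic = -g(*x) * (x - 1)                               # eqs. (3.59) and (3.60)
numeric = np.array([partial(g, x, 0), partial(g, x, 1)])
print("analytic:", analytic)
print("numeric: ", numeric)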

Higher-order partial derivatives

As for the standard derivative of a univariate real-valued function f : R → R, higher-order partial derivatives can be formulated and evaluated by taking partial derivatives of partial derivatives. Because multivariate real-valued functions are functions of multiple input arguments, more possibilities exist for higher-order derivatives compared to the univariate case. For example, given the partial derivative ∂/∂x1 f of a function f : R³ → R, one may next form the partial derivative again with respect to x1, yielding the second-order partial derivative equivalent to the second-order derivative of a univariate function and denoted by ∂²/∂x1² f. However, one may also form the partial derivative with respect to x2, ∂²/∂x1∂x2 f, or with respect to x3, ∂²/∂x1∂x3 f. Note that the numerator of the partial derivative sign increases its power with the order of the derivative and the denominator denotes the variables with respect to which the derivative is taken. If the derivative is taken multiple times with respect to the same variable, the variable in the denominator is notated with the corresponding power. Again, note that these are mere conventions to signal the form of the partial derivative, but the symbols themselves do not carry any meaning beyond the implicit imperative to consider or evaluate the corresponding partial derivative.

Example. To exemplify the notation introduced above, we evaluate the first- and second-order partial derivatives of the function

f : R³ → R, x ↦ f(x) := x1^2 + x1x2 + x2√x3. (3.61)

For the first-order derivatives, we have

∂/∂x1 f(x) = ∂/∂x1 (x1^2 + x1x2 + x2√x3) = 2x1 + x2,
∂/∂x2 f(x) = ∂/∂x2 (x1^2 + x1x2 + x2√x3) = x1 + √x3,
∂/∂x3 f(x) = ∂/∂x3 (x1^2 + x1x2 + x2√x3) = x2 / (2√x3).   (3.62)

For the second-order derivatives with respect to x1, we then have

∂²/∂x1∂x1 f(x) = ∂/∂x1 (∂/∂x1 f(x)) = ∂/∂x1 (2x1 + x2) = 2,
∂²/∂x2∂x1 f(x) = ∂/∂x2 (∂/∂x1 f(x)) = ∂/∂x2 (2x1 + x2) = 1,
∂²/∂x3∂x1 f(x) = ∂/∂x3 (∂/∂x1 f(x)) = ∂/∂x3 (2x1 + x2) = 0.   (3.63)


For the second-order derivatives with respect to x2, we have

∂²/∂x1∂x2 f(x) = ∂/∂x1 (∂/∂x2 f(x)) = ∂/∂x1 (x1 + √x3) = 1,
∂²/∂x2∂x2 f(x) = ∂/∂x2 (∂/∂x2 f(x)) = ∂/∂x2 (x1 + √x3) = 0,
∂²/∂x3∂x2 f(x) = ∂/∂x3 (∂/∂x2 f(x)) = ∂/∂x3 (x1 + √x3) = 1/(2√x3).   (3.64)

Finally, for the second-order derivatives with respect to x3, we have

∂²/∂x1∂x3 f(x) = ∂/∂x1 (∂/∂x3 f(x)) = ∂/∂x1 (x2/(2√x3)) = 0,
∂²/∂x2∂x3 f(x) = ∂/∂x2 (∂/∂x3 f(x)) = ∂/∂x2 (x2/(2√x3)) = 1/(2√x3),
∂²/∂x3∂x3 f(x) = ∂/∂x3 (∂/∂x3 f(x)) = ∂/∂x3 ((1/2)x2 x3^{−1/2}) = −(1/4)x2 x3^{−3/2}.   (3.65)

Note from the above that it does not matter in which order the second derivatives are taken, as

∂²/∂x1∂x2 f(x) = ∂²/∂x2∂x1 f(x) = 1,
∂²/∂x1∂x3 f(x) = ∂²/∂x3∂x1 f(x) = 0,
∂²/∂x2∂x3 f(x) = ∂²/∂x3∂x2 f(x) = 1/(2√x3).   (3.66)

This is a general property of partial derivatives known as Schwarz' Theorem, which we state without proof.

Theorem 3.3.1 (Schwarz' Theorem). For a multivariate real-valued function

f : R^n → R, x ↦ f(x), (3.67)

it holds that

∂²/∂xi∂xj f(x) = ∂²/∂xj∂xi f(x) for all 1 ≤ i, j ≤ n. (3.68)

Schwarz' Theorem is helpful when evaluating partial derivatives: on the one hand, one can save some work by relying on it; on the other hand, it can help to validate one's analytical results, because if one finds that it does not hold for certain second-order partial derivatives, there must be an error.
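As a sanity check of the kind just described, the mixed second-order partial derivatives of the example function (3.61) can be evaluated in both orders symbolically. The following SymPy sketch is an added illustration, not part of the original text.

import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3', positive=True)
f = x1 ** 2 + x1 * x2 + x2 * sp.sqrt(x3)        # the example function (3.61)

# Evaluate each mixed second-order partial derivative in both orders.
for xi, xj in [(x1, x2), (x1, x3), (x2, x3)]:
    d_ij = sp.diff(f, xi, xj)                   # differentiate w.r.t. xi, then xj
    d_ji = sp.diff(f, xj, xi)                   # differentiate w.r.t. xj, then xi
    print(xi, xj, d_ij, d_ji, sp.simplify(d_ij - d_ji) == 0)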

Gradient and Hessian

The first- and second-order partial derivatives of a multivariate real-valued function f can be summarized in two entities known as the gradient (or gradient vector) and the Hessian (or Hessian matrix).

Gradient. The gradient of a function

f : R^n → R, x ↦ f(x) (3.69)

at a location x ∈ R^n is defined as the n-dimensional vector of the function's partial derivatives evaluated at this location and is denoted by the ∇ (nabla) symbol:

∇f : R^n → R^n, x ↦ ∇f(x) := (∂/∂x1 f(x), ∂/∂x2 f(x), ..., ∂/∂xn f(x))^T. (3.70)

Note that the gradient is a vector-valued function: it takes a vector x ∈ R^n as input and returns a vector ∇f(x) ∈ R^n. We note without proof that the gradient evaluated at x ∈ R^n is a vector that points in the direction of the greatest rate of increase (steepest ascent) of the function.


Hessian. The second-order partial derivatives of a multivariate real-valued function f can be summarized in the Hessian of the function, which hereinafter is denoted by Hf. It is defined as

Hf : R^n → R^{n×n}, x ↦ Hf(x), (3.71)

where

Hf(x) :=
[ ∂²/∂x1∂x1 f(x)  ∂²/∂x1∂x2 f(x)  · · ·  ∂²/∂x1∂xn f(x) ]
[ ∂²/∂x2∂x1 f(x)  ∂²/∂x2∂x2 f(x)  · · ·  ∂²/∂x2∂xn f(x) ]
[       ⋮                ⋮          ⋱           ⋮       ]
[ ∂²/∂xn∂x1 f(x)  ∂²/∂xn∂x2 f(x)  · · ·  ∂²/∂xn∂xn f(x) ].   (3.72)

Note that in each row of the Hessian, the second of the two partial derivatives is constant (in the order of differentiation, not in the order of notation), while the first partial derivative varies from 1 to n over columns, and the reverse is true for each column. Notably, the Hessian matrix is a matrix-valued function: it takes a vector x ∈ R^n as input and returns an n × n matrix Hf(x) ∈ R^{n×n}. Finally, note that due to Schwarz' Theorem, the Hessian matrix is symmetric, i.e.,

(Hf(x))^T = Hf(x). (3.73)
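For concreteness, the gradient and Hessian of the example function (3.47) can be evaluated symbolically. The following SymPy sketch is an added illustration (not part of the original text); it also confirms the symmetry property (3.73).

import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1 ** 2 + x2 ** 2                                   # the example function (3.47)

grad = sp.Matrix([sp.diff(f, x1), sp.diff(f, x2)])      # gradient, cf. eq. (3.70)
H = sp.hessian(f, (x1, x2))                             # Hessian, cf. eq. (3.72)

print(grad)        # Matrix([[2*x1], [2*x2]])
print(H)           # Matrix([[2, 0], [0, 2]])
print(H == H.T)    # True: symmetric, consistent with Schwarz' Theorem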

3.4 Derivatives of multivariate vector-valued functions

Multivariate vector-valued functions

Thus far, we have discussed univariate real-valued and multivariate real-valued functions. A further type of function that is commonly encountered is functions that map vectors onto vectors. A principled account and theoretical development of derivatives for such multivariate vector-valued functions is provided by Magnus and Neudecker (1989).

Multivariate vector-valued functions are functions of the form

f : R^n → R^m, x ↦ f(x) := (f1(x1, ..., xn), f2(x1, ..., xn), ..., fm(x1, ..., xn))^T. (3.74)

In physics, such functions are referred to as vector fields. The multivariate real-valued functions

fi : R^n → R, i = 1, ..., m (3.75)

are referred to as the component functions of f.

Example. A first example for a multivariate vector-valued function is

f : R³ → R², x ↦ f(x) := (x1 + x2, x2x3)^T, (3.76)

for which the component functions are given by

f1 : R³ → R, x ↦ f1(x) := x1 + x2, (3.77)

f2 : R³ → R, x ↦ f2(x) := x2x3. (3.78)

The Jacobian matrix

The first derivative of multivariate vector-valued functions evaluated at x ∈ R^n is given by the Jacobian matrix. The Jacobian matrix is denoted and defined by

Jf : R^n → R^{m×n}, x ↦ Jf(x) := (∂/∂xj fi(x))_{i=1,...,m; j=1,...,n} =
[ ∂/∂x1 f1(x)  · · ·  ∂/∂xn f1(x) ]
[      ⋮         ⋱         ⋮     ]
[ ∂/∂x1 fm(x)  · · ·  ∂/∂xn fm(x) ].   (3.79)


In words, the Jacobian matrix of a multivariate vector-valued function f : R^n → R^m with component functions fi, i = 1, ..., m is the m × n matrix of the n partial derivatives of the m component functions with respect to the n input vector components xj, j = 1, ..., n. Note that the gradient of a multivariate real-valued function corresponds to the transpose of the Jacobian matrix of the function: for

f : R^n → R^m (3.80)

with m = 1, we have

∇f(x) = (Jf(x))^T. (3.81)

This can be readily seen by inspecting the entries of the first row of (3.79). Finally, note that the determinant of the Jacobian matrix is often referred to as the Jacobian.

Example. As an example, consider the Jacobian matrix of the function defined in (3.76). By evaluation of the respective partial derivatives, we have

Jf : R³ → R^{2×3}, x ↦ Jf(x) :=
[ ∂/∂x1 f1(x)  ∂/∂x2 f1(x)  ∂/∂x3 f1(x) ]     [ 1   1   0  ]
[ ∂/∂x1 f2(x)  ∂/∂x2 f2(x)  ∂/∂x3 f2(x) ]  =  [ 0   x3  x2 ].   (3.82)

3.5 Basic integrals

In this Section, we review the intuition of the definite integral as the signed area under a function's graph and the notion of indefinite integration as the inverse of differentiation.

Definite integrals of univariate real-valued functions

We denote the definite integral of a univariate real-valued function f on an interval [a, b] ⊂ R by the real number

I := ∫_a^b f(x) dx ∈ R. (3.83)

It is important to realize two aspects of (3.83): first, the definite integral is a real number and second, the right-hand side of (3.83) is merely notational and to be understood as the imperative for integrating the function f on the interval [a, b]. In other words, there is no mathematical meaning associated with the dx or the ∫_a^b that goes beyond the definition of the integral boundaries a and b. The term definite is used here to distinguish this integral from the indefinite integral discussed below. Put simply, definite integrals are those integrals for which the integral boundaries appear at the integral sign, although they may sometimes be omitted, e.g., if the interval of integration is the entire real line. Intuitively, the definite integral ∫_a^b f(x)dx is best understood as the continuous generalization of the discrete sum

Σ_{i=1}^n f(xi)∆x, (3.84)

where

a =: x1, x2 := x1 + ∆x, x3 := x2 + ∆x, ..., x_{n+1} := b (3.85)

corresponds to an equipartition of the interval [a, b], i.e., a partition of the interval [a, b] into n bins of equal size ∆x. The term f(xi)∆x for i = 1, ..., n in (3.84) corresponds to the area of the rectangle formed by the value of the function f at xi (i.e., the upper left corner of the rectangle) as height and the bin width ∆x as width. Summing over all rectangles then yields an approximation of the area under the graph of the function f, where terms with negative values of f(xi) enter the sum with a negative sign. Intuitively, letting the bin width ∆x in the sum (3.84) approach zero then approximates the integral of f on the interval [a, b]:

∫_a^b f(x)dx ≈ Σ_{i=1}^n f(xi)∆x for ∆x → 0. (3.86)

This approximation approach to the definite integral is visualized in Figure 3.3.


Figure 3.3. Evaluation of a definite integral by means of the approximation approach described in eq. (3.86).
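The approximation (3.86) translates directly into code. The following Python sketch is an added illustration (the integrand sin on [0, π], with exact integral 2, is an arbitrary choice); it evaluates the left-endpoint Riemann sum over an equipartition of [a, b] for an increasing number of bins.

import numpy as np

def riemann_sum(f, a, b, n):
    # Left-endpoint Riemann sum over an equipartition of [a, b], cf. eq. (3.84).
    dx = (b - a) / n
    x = a + dx * np.arange(n)      # left endpoints x_1, ..., x_n
    return np.sum(f(x) * dx)

for n in (10, 100, 10000):
    print(n, riemann_sum(np.sin, 0.0, np.pi, n))    # approaches 2 as n grows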

Definite integrals have a linearity property, which is often useful when evaluating integrals analytically. Based on the intuition that for a function f : R → R, the definite integral corresponds to

∫_a^b f(x)dx ≈ Σ_{i=1}^n f(xi)∆x, (3.87)

and the fact that for a second function g : R → R, we have

Σ_{i=1}^n (f(xi) + g(xi))∆x = Σ_{i=1}^n (f(xi)∆x + g(xi)∆x) = Σ_{i=1}^n f(xi)∆x + Σ_{i=1}^n g(xi)∆x, (3.88)

and for a constant c ∈ R we have

Σ_{i=1}^n c f(xi)∆x = c Σ_{i=1}^n f(xi)∆x, (3.89)

we can infer the following two properties of the integral:

∫_a^b (f(x) + g(x))dx = ∫_a^b f(x)dx + ∫_a^b g(x)dx (3.90)

and

∫_a^b c f(x)dx = c ∫_a^b f(x)dx. (3.91)

In words, first, the integral of the sum of two functions f + g over an interval [a, b] corresponds to the sum of the integrals of the individual functions f and g on [a, b]. Second, the integral of a function f multiplied by a constant c on an interval [a, b] corresponds to the integral of the function f on an interval [a, b] multiplied by the constant. Both properties are very useful when evaluating integrals analytically: the first allows for decomposing integrals of composite functions into sums of integrals of less complex functions, while the second allows for removing constants from integration.

Indefinite integrals

Consider a univariate real-valued function

f : R → R, x ↦ f(x). (3.92)

Next, consider a second function that is defined in terms of definite integrals of f by making the upper integration boundary of these definite integrals its argument:

F : R_{≥0} → R, x ↦ F(x) := ∫_0^x f(s) ds. (3.93)


From the discussion above, we have that the value of F at x corresponds to the signed area under the graph of the function f on the interval from 0 to x. Notably, the derivative F′(x) of F at x corresponds to the value of the function f, i.e.,

F′(x) = d/dx (∫_0^x f(s) ds) = f(x). (3.94)

Intuitively, eq. (3.94) states that integration is the inverse of differentiation, in the sense that first integrating f from 0 to x and then computing the derivative with respect to x yields f. Any function F with the property F′(x) = f(x) for a function f is called an anti-derivative or indefinite integral of f. An indefinite integral is denoted by

F : R → R, x ↦ F(x) = ∫ f(s) ds. (3.95)

Note that the definite integral defined above corresponds to a real scalar number, while the indefinite integral is a function.

Proof of (3.94)

While the statement of equation (3.94) is familiar and intuitive, it is not necessarily formally easy to grasp. Here, we provide a proof of this equation based on Leithold (1976). The proof makes use of limiting processes and the mean value theorem (Spivak, 2008). Let f : R → R, s ↦ f(s) be a univariate real-valued function, and define another function

F : R → R, x ↦ F(x) := ∫_a^x f(s) ds. (3.96)

For any two numbers x1 and x1 + ∆x in the (closed) interval [a, b] ⊂ R, we then have

F(x1) = ∫_a^{x1} f(s) ds and F(x1 + ∆x) = ∫_a^{x1+∆x} f(s) ds. (3.97)

Subtraction of these two equalities yields

F(x1 + ∆x) − F(x1) = ∫_a^{x1+∆x} f(s) ds − ∫_a^{x1} f(s) ds. (3.98)

From the intuition of the integral as the area between the function f and the x-axis, it follows naturally that the sum of the areas of two adjacent regions is equal to the area of both regions combined, i.e.,

∫_a^{x1} f(s) ds + ∫_{x1}^{x1+∆x} f(s) ds = ∫_a^{x1+∆x} f(s) ds. (3.99)

From this it follows that the difference above evaluates to

F(x1 + ∆x) − F(x1) = ∫_{x1}^{x1+∆x} f(s) ds. (3.100)

According to the mean value theorem for integration, there exists a real number c_∆x ∈ [x1, x1 + ∆x] (the dependence on ∆x of which we have denoted by the subscript) with

∫_{x1}^{x1+∆x} f(s) ds = f(c_∆x)∆x, (3.101)

and we hence obtain

F(x1 + ∆x) − F(x1) = f(c_∆x)∆x. (3.102)

Division by ∆x then yields

(F(x1 + ∆x) − F(x1)) / ∆x = f(c_∆x), (3.103)

where the left-hand side corresponds to Newton's difference quotient. Taking the limit ∆x → 0 on both sides then yields

lim_{∆x→0} (F(x1 + ∆x) − F(x1)) / ∆x = lim_{∆x→0} f(c_∆x) ⇔ F′(x1) = lim_{∆x→0} f(c_∆x) (3.104)

by definition of the derivative as the limit of Newton's difference quotient. The limit on the right-hand side of (3.104) remains to be evaluated. To this end, we recall that c_∆x ∈ [x1, x1 + ∆x] or, in other words, that x1 ≤ c_∆x ≤ x1 + ∆x. Notably, lim_{∆x→0} x1 = x1 and lim_{∆x→0} (x1 + ∆x) = x1. Therefore, we can conclude that lim_{∆x→0} c_∆x = x1, as c_∆x is squeezed between x1 and x1 + ∆x. Using the continuity of f, we thus find

F′(x1) = lim_{∆x→0} f(c_∆x) = f(x1), (3.105)


which concludes the proof. □

Indefinite integrals allow for the evaluation of definite integrals ∫_a^b f(s)ds by means of the fundamental theorem of calculus

∫_a^b f(s)ds = F(b) − F(a). (3.106)

In words, to evaluate the integral of a univariate real-valued function f on the interval [a, b], one has to first compute the anti-derivative of f, and then compute the difference between the anti-derivative evaluated at the upper integral interval boundary b and the anti-derivative evaluated at the lower integral interval boundary a. Equation (3.106) is very familiar. We first consider some of its properties and then provide a formal justification.

Properties of indefinite integrals. We first note without proof that the linearity properties of the definite integral also hold for the indefinite integral: for functions f, g : R → R and a constant c ∈ R we have

∫ (f(x) + g(x)) dx = ∫ f(x)dx + ∫ g(x) dx (3.107)

and

∫ c f(x)dx = c ∫ f(x) dx. (3.108)

As for differentiation, it is useful to know the anti-derivatives of a handful of univariate real-valued functions that are commonly encountered. A selection of anti-derivatives is presented below. These can readily be verified by evaluating the derivatives of the respective anti-derivatives to recover the original functions. Note that the derivative of the constant function f(x) := c, c ∈ R is zero. We have

f(x) := a        ⇒  F(x) = ax + c
f(x) := x^a      ⇒  F(x) = (1/(a+1)) x^{a+1} + c   (a ≠ −1)
f(x) := x^{−1}   ⇒  F(x) = ln x + c
f(x) := exp(x)   ⇒  F(x) = exp(x) + c
f(x) := sin(x)   ⇒  F(x) = −cos(x) + c
f(x) := cos(x)   ⇒  F(x) = sin(x) + c
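The entries of this table can be verified with a computer algebra system. The following SymPy sketch is an added illustration; note that sympy.integrate returns one particular anti-derivative, i.e., the additive constant c is omitted.

import sympy as sp

x, a = sp.symbols('x a', positive=True)

# One anti-derivative per function in the table above (the constant c is omitted).
for f in (a, x ** a, 1 / x, sp.exp(x), sp.sin(x), sp.cos(x)):
    print(f, "->", sp.integrate(f, x))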

Proof of (3.106)

As for the statement that the derivative of an anti-derivative is the original function, the fundamental theorem of calculus is very familiar but a formal derivation is somewhat more involved. The proof provided here, which again follows Leithold (1976), makes use of limiting processes and the mean value theorem of differentiation. We first consider the quantity F(b) − F(a). To this end, we select numbers x0, ..., xn such that

a := x0 < x1 < x2 < . . . < x_{n−1} < xn =: b. (3.109)

It then follows that

F(b) − F(a) = F(xn) − F(x0). (3.110)

Next, each F(xi), i = 1, . . . , n − 1 is added to the quantity F(b) − F(a) together with its additive inverse:

F(b) − F(a) = F(xn) + (−F(x_{n−1}) + F(x_{n−1})) + . . . + (−F(x1) + F(x1)) − F(x0)
= (F(xn) − F(x_{n−1})) + (F(x_{n−1}) − F(x_{n−2})) + . . . + (F(x1) − F(x0))
= Σ_{i=1}^n (F(xi) − F(x_{i−1})). (3.111)

For a function F : [a, b] → R, the mean value theorem of differentiation states that under certain constraints on F, which we assume to be fulfilled, there exists a number c ∈ ]a, b[ such that

F′(c) = (F(b) − F(a)) / (b − a). (3.112)

From the mean value theorem of differentiation, it thus follows that for the terms of the sum above, we have with appropriately chosen ci ∈ ]x_{i−1}, xi[, i = 1, . . . , n

F(xi) − F(x_{i−1}) = F′(ci)(xi − x_{i−1}), (3.113)


and substitution then yields

F(b) − F(a) = Σ_{i=1}^n F′(ci)(xi − x_{i−1}). (3.114)

By definition, it follows that

F′(ci) = f(ci), (3.115)

and setting ∆x_{i−1} := xi − x_{i−1} yields

F(b) − F(a) = Σ_{i=1}^n f(ci)∆x_{i−1}. (3.116)

We now let the ∆x_{i−1} approach zero. F(b) and F(a) do not depend on the ∆x_{i−1}, and the left-hand side of the above thus remains F(b) − F(a). For the right-hand side, we note that x_{i−1} ≤ ci ≤ x_{i−1} + ∆x_{i−1} and thus lim_{∆x_{i−1}→0} x_{i−1} = x_{i−1} and lim_{∆x_{i−1}→0} (x_{i−1} + ∆x_{i−1}) = x_{i−1}, from which it follows that lim_{∆x_{i−1}→0} ci = x_{i−1}. We thus have

F(b) − F(a) = Σ_{i=1}^n f(x_{i−1})∆x_{i−1} = Σ_{i=0}^{n−1} f(xi)∆xi ≈ ∫_a^b f(s) ds (3.117)

with the definition of the definite integral under the generalization that the ∆xi may not be equally spaced. This concludes the proof. □

Example. To illustrate the theory and interplay of indefinite and definite integrals, we evaluate the definite integral of the function

f : R → R, x ↦ f(x) := 2x^2 + x + 1 (3.118)

on the interval [1, 2]. To this end, we first use the linearity property of the indefinite integral, which yields

F : R → R, x ↦ F(x) := ∫ f(x)dx = ∫ (2x^2 + x + 1) dx = 2 ∫ x^2 dx + ∫ x dx + ∫ 1 dx. (3.119)

We then make use of the table of commonly encountered anti-derivatives to evaluate the remaining integral terms, yielding

F(x) = (2/3)x^3 + (1/2)x^2 + x + c, (3.120)

where the constant c ∈ R comprises all constant terms. Importantly, this constant term vanishes once we evaluate a definite integral by means of the fundamental theorem of calculus:

∫_1^2 f(x) dx = F(2) − F(1)
= (2/3)·2^3 + (1/2)·2^2 + 2 + c − ((2/3)·1^3 + (1/2)·1^2 + 1 + c)
= 16/3 + 4/2 + 2 + c − 2/3 − 1/2 − 1 − c
= 32/6 + 12/6 + 12/6 + c − 4/6 − 3/6 − 6/6 − c
= 43/6.   (3.121)
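The result 43/6 ≈ 7.1667 can be confirmed by numerical quadrature. The following sketch is an added illustration using scipy.integrate.quad; any standard quadrature routine would do.

from scipy.integrate import quad

f = lambda x: 2 * x ** 2 + x + 1
value, abs_err = quad(f, 1, 2)       # numerical definite integral on [1, 2]
print(value, 43 / 6)                 # both approximately 7.1667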

Integration by parts

Integration by parts can be considered an analogue of the product rule of differentiation for integrals. For two functions f : [a, b] → R and g : [a, b] → R, the integration by parts rule states that

∫_a^b f′(x)g(x) dx = f(b)g(b) − f(a)g(a) − ∫_a^b f(x)g′(x) dx. (3.122)

The integration by parts rule can be useful if the anti-derivative f of f′ and the integral on the right-hand side are readily available.


Proof. With the product rule of differentiation, we have

(f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)
⇔ f′(x)g(x) = (f(x)g(x))′ − f(x)g′(x)
⇔ ∫_a^b f′(x)g(x) dx = ∫_a^b (f(x)g(x))′ dx − ∫_a^b f(x)g′(x) dx. (3.123)

Because f(x)g(x) is an anti-derivative of (f(x)g(x))′, it follows immediately with the fundamental theorem of calculus that

∫_a^b f′(x)g(x) dx = f(b)g(b) − f(a)g(a) − ∫_a^b f(x)g′(x) dx. (3.124)
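As an added illustration of (3.122) (not part of the original text), the following SymPy sketch evaluates both sides of the rule for the hypothetical choice f′(x) := exp(x) and g(x) := x on [0, 1], for which f(x) = exp(x) and g′(x) = 1.

import sympy as sp

x = sp.symbols('x')
f, g = sp.exp(x), x          # f'(x) = exp(x), g'(x) = 1

lhs = sp.integrate(sp.diff(f, x) * g, (x, 0, 1))
rhs = (f.subs(x, 1) * g.subs(x, 1) - f.subs(x, 0) * g.subs(x, 0)
       - sp.integrate(f * sp.diff(g, x), (x, 0, 1)))
print(lhs, rhs)              # both sides evaluate to 1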

Integration by substitution

The fundamental theorem of calculus allows the evaluation of certain integrals by means of an integration rule that is known as integration by substitution and is sometimes referred to as "integration by a change of variables". Specifically, for two functions f : I → R and g : [a, b] → R with g([a, b]) ⊆ I, it holds that

∫_a^b f(g(x))g′(x) dx = ∫_{g(a)}^{g(b)} f(x) dx. (3.125)

Proof. We first note that an anti-derivative of f(g(x))g′(x) is given by (F ◦ g)(x), where F denotes an anti-derivative of f, because

(F ◦ g)′(x) = F′(g(x))g′(x) = f(g(x))g′(x). (3.126)

With the fundamental theorem of calculus, we then have

∫_a^b f(g(x))g′(x) dx = (F ◦ g)(b) − (F ◦ g)(a) = F(g(b)) − F(g(a)) = ∫_{g(a)}^{g(b)} f(x) dx. (3.127)

Definite integrals of multivariate real-valued functions on rectangles

The notion of the definite integral of a univariate real-valued function can be generalized to the definite integral of multivariate real-valued functions. Specifically, let

R := [a1, b1] × · · · × [an, bn] ⊆ R^n (3.128)

denote a rectangle, where the ai, bi, i = 1, ..., n may be finite or infinite. Further, let

f : R^n → R, x ↦ f(x) (3.129)

denote a multivariate real-valued function. Then, under certain regularity conditions which are omitted here, Fubini's theorem states that

∫_R f(x) dx = ∫_{a1}^{b1} · · · ∫_{an}^{bn} f(x1, ..., xn) dxn · · · dx1. (3.130)

In words, the definite integral ∫_R f(x) dx of the multivariate real-valued function f on the rectangle R can be evaluated as the iterated integral ∫_{a1}^{b1} · · · ∫_{an}^{bn} f(x1, ..., xn) dxn · · · dx1, which corresponds to a sequence of definite integrals of univariate real-valued functions. Crucially, Fubini's theorem implies that the order of integration in iterated integrals does not matter.

Example. As an example for the definite integral of a multivariate real-valued function, consider the integral of the function

f : R² → R, x ↦ f(x) := 2x1 + x2 (3.131)

on R := [0, 3] × [0, 2]. We have

∫_R f(x) dx = ∫_{[0,2]} (∫_{[0,3]} 2x1 + x2 dx1) dx2
= ∫_{[0,2]} (∫_{[0,3]} 2x1 dx1 + ∫_{[0,3]} x2 dx1) dx2
= ∫_{[0,2]} (∫_{[0,3]} 2x1 dx1 + x2 ∫_{[0,3]} 1 dx1) dx2.   (3.132)


For the integrals with respect to x1, we then have with the fundamental theorem of calculus and the fact that the anti-derivatives of g(x1) = 2x1 and h(x1) = 1 evaluate to G(x1) = x1^2 + c and H(x1) = x1 + c, respectively,

∫_R f(x) dx = ∫_{[0,2]} (3^2 − 0^2 + x2(3 − 0)) dx2 = ∫_{[0,2]} (3x2 + 9) dx2. (3.133)

With the fact that the anti-derivative of g(x2) = 3x2 + 9 is given by G(x2) = (3/2)x2^2 + 9x2, we then have

∫_R f(x) dx = (3/2)·2^2 + 9·2 − (3/2)·0^2 − 9·0 = 24. (3.134)
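The value 24 can be confirmed numerically. The following sketch is an added illustration using scipy.integrate.dblquad; note that dblquad expects the integrand as func(y, x) with x as the outer and y as the inner variable, so x plays the role of x1 ∈ [0, 3] and y the role of x2 ∈ [0, 2] here.

from scipy.integrate import dblquad

# f(x) := 2*x1 + x2 integrated over the rectangle [0, 3] x [0, 2].
value, abs_err = dblquad(lambda x2, x1: 2 * x1 + x2,
                         0, 3,                        # outer variable x1 in [0, 3]
                         lambda x1: 0, lambda x1: 2)  # inner variable x2 in [0, 2]
print(value)    # 24.0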

3.6 Bibliographic remarks

The material presented in this chapter is standard and can be found in any undergraduate textbook on calculus. A good starting point is Spivak (2008), which also provides many justifications for the results presented in the current chapter. Abbott (2015) provides a very readable treatment of the more subtle aspects of real analysis. A theoretically grounded introduction to multivariate calculus is provided by Magnus and Neudecker (1989). As previously mentioned, for justifying the central results on definite and indefinite integration, we consulted Leithold (1976).

3.7 Study questions

1. Give a brief explanation of the notion of a derivative of a univariate function f in a point x.

2. Provide brief explanations of the symbols d/dx, d²/dx², ∂/∂x, and ∂²/∂x².

3. Compute the first derivatives of the following functions:

f : R → R, x ↦ f(x) := 3 exp(−x^2) (3.135)

g : R → R, x ↦ g(x) := (x^2 + 2 ln(x) − a)^3. (3.136)

4. Determine the minimum of the function

f : R → R, x ↦ f(x) := x^2 + 3x − 2. (3.137)

5. Compute the partial derivatives of the function

f : R² → R, (x, y) ↦ f(x, y) := ln(x) + Σ_{i=1}^n (y − 3)^2. (3.138)

6. Write down the definition of the gradient of a multivariate real-valued function.

7. Write down the definition of the Hessian of a multivariate real-valued function.

8. Evaluate the gradient and the Hessian of

f : R² → R, (x, y) ↦ f(x, y) := 2 exp(x^2 − 3y). (3.139)

9. State the intuitions for the definite integral ∫_a^b f(x)dx and the indefinite integral ∫ f(x)dx of a univariate real-valued function f.

10. Evaluate the definite integral

I := ∫_1^3 (5x^2 + 2x) dx. (3.140)


References

Abbott, S. (2015). Understanding Analysis. Undergraduate Texts in Mathematics. Springer New York, New York, NY.

Leithold, L. (1976). The Calculus, with Analytic Geometry. Harper & Row, New York, 3rd edition.

Magnus, J. R. and Neudecker, H. (1989). Matrix Differential Calculus with Applications in Statistics and Econometrics. Journal of the American Statistical Association, 84(408):1103.

Spivak, M. (2008). Calculus. Publish or Perish, Inc., fourth edition.

The General Linear Model | © 2020 Dirk Ostwald CC BY-NC-SA 4.0