Angewandte Multivariate Statistik
Prof. Dr. Ostap Okhrin
Ostap Okhrin 1 of 461
Basis
These slides are strongly based on those made by the Ladislaus von Bortkiewicz Chair of Statistics, Humboldt University Berlin
Applied Multivariate Statistical Analysis (W. Härdle, L. Simar) - lvb.wiwi.hu-berlin.de
Angewandte Multivariate Statistik Comparison of Batches Boxplots
Comparison of Batches
An old Swiss 1000-franc bank note.
Example: Swiss bank data
The authorities have measured
X1 = length of the bill
X2 = height of the bill (left)
X3 = height of the bill (right)
X4 = distance of the inner frame to the lower border
X5 = distance of the inner frame to the upper border
X6 = length of the diagonal of the central picture
Example: (cont.)
The dataset consists of 200 measurements on Swiss bank notes. The first half of these bank notes are genuine, the other half are forged bank notes.
It is important to be able to decide whether a given bank note is genuine.
We want to derive a good rule that separates the genuine and counterfeit bank notes.
Which measurement is the most informative? We have to visualize the difference.
Boxplots
The boxplot is a graphical technique for displaying the distribution of variables. It
- helps us in seeing location, skewness, spread, tail length and outlying points,
- is particularly useful in comparing different batches,
- is a graphical representation of the Five Number Summary.
City         Country      Pop. (10000)  Order statistic
Tokyo        Japan        3420          x(15)
Mexico City  Mexico       2280          x(14)
Seoul        South Korea  2230          x(13)
New York     USA          2190          x(12)
Sao Paulo    Brazil       2020          x(11)
Bombay       India        1985          x(10)
Delhi        India        1970          x(9)
Shanghai     China        1815          x(8)
Los Angeles  USA          1800          x(7)
Osaka        Japan        1680          x(6)
Jakarta      Indonesia    1655          x(5)
Calcutta     India        1565          x(4)
Cairo        Egypt        1560          x(3)
Manila       Philippines  1495          x(2)
Karachi      Pakistan     1430          x(1)

Table 1: The 15 largest world cities in 2006.
Five Number Summary
- Upper quartile FU
- Lower quartile FL
- Median = deepest point
- Extremes

Consider the order statistics. The depth of a data value x(i) is min{i, n − i + 1}.

depth of fourth = ([depth of median] + 1) / 2,

where [·] denotes the integer part.
Median
The order statistics x(1), x(2), . . . , x(n) are the ordered values of x1, x2, . . . , xn, where x(1) denotes the minimum and x(n) the maximum.

Median M:

M = x((n+1)/2)                    if n is odd,
M = {x(n/2) + x(n/2+1)} / 2       if n is even.
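The case distinction above can be sketched in a few lines; a minimal illustration (the function name is ours, not from the slides):

```python
import numpy as np

def median_from_order_stats(x):
    """Median via order statistics x_(1) <= ... <= x_(n)."""
    xs = np.sort(np.asarray(x, dtype=float))  # the order statistics
    n = len(xs)
    if n % 2 == 1:                            # n odd: M = x_((n+1)/2)
        return xs[(n + 1) // 2 - 1]
    # n even: M = {x_(n/2) + x_(n/2+1)} / 2
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])
```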
Construction of the Boxplot
Median: 1815 (depth of data 8)
Fourths (depth = 4.5): FL = 1610, FU = 2105
Extremes (depth = 1): 1430, 3420
F-spread: dF = FU − FL
Outside bars: FU + 1.5 dF, FL − 1.5 dF
1. Construct the box with borders at FU and FL.
2. Draw the median as | and the mean as ...
3. Draw the whiskers to the data within the outside bars.
4. Mark outliers by • if they are outside [FL − 1.5 dF, FU + 1.5 dF] and by ? if they lie outside [FL − 3 dF, FU + 3 dF].
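The depth rules above can be coded directly; a minimal sketch (`five_number_summary` is a hypothetical helper, and the populations are those of Table 1):

```python
import numpy as np

def five_number_summary(x):
    """Five-number summary via depths:
    depth of median = (n + 1) / 2,
    depth of fourth = ([depth of median] + 1) / 2.
    A half-integer depth averages the two neighbouring order statistics."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)

    def at_depth(d):
        lo, hi = int(np.floor(d)) - 1, int(np.ceil(d)) - 1
        lower = 0.5 * (xs[lo] + xs[hi])                  # counted from the minimum
        upper = 0.5 * (xs[n - 1 - lo] + xs[n - 1 - hi])  # counted from the maximum
        return lower, upper

    d_med = (n + 1) / 2
    d_fourth = (np.floor(d_med) + 1) / 2
    M = at_depth(d_med)[0]
    FL, FU = at_depth(d_fourth)
    return xs[0], FL, M, FU, xs[-1]

# World-city populations (in 10000s) from Table 1
pop = [3420, 2280, 2230, 2190, 2020, 1985, 1970, 1815,
       1800, 1680, 1655, 1565, 1560, 1495, 1430]
```

For the city data this reproduces the slide's values FL = 1610, M = 1815, FU = 2105.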
Figure (Car Data): Boxplot for the mileage of U.S. American, Japanese and European cars (from left to right). MVAboxcar
Figure (Swiss Bank Notes): Variable X6 (diagonal) of the bank notes, the genuine on the left. MVAboxbank6
Figure (Swiss Bank Notes): Variable X1 (length) of the bank notes, the genuine on the left. MVAboxbank1
Summary: Boxplots
- Median and mean bars indicate the central locations.
- The relative location of the median (and mean) in the box is a measure of skewness.
- The length of the box and whiskers is a measure of spread.
- The length of the whiskers indicates the tail length of the distribution.
Summary: Boxplots
- The outliers are marked by • if they are outside [FL − 1.5 dF, FU + 1.5 dF] and by ? if they lie outside [FL − 3 dF, FU + 3 dF].
- The boxplots do not indicate multi-modality or clusters.
- If we compare the relative size and location of the boxes, we are comparing distributions.
Angewandte Multivariate Statistik Comparison of Batches Histograms
Histograms
f̂_h(x) = n^{-1} h^{-1} Σ_{j∈Z} Σ_{i=1}^{n} I{x_i ∈ B_j(x0, h)} I{x ∈ B_j(x0, h)}

B_j(x0, h) = [x0 + (j − 1)h, x0 + jh), j ∈ Z.
- [·, ·) denotes a left-closed and right-open interval.
- I{·} denotes the indicator function.
- h is a smoothing parameter and controls the width of the histogram bins.
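The double sum collapses to counting the observations in the bin that contains x; a minimal sketch with an assumed helper name:

```python
import numpy as np

def hist_estimate(x, data, x0, h):
    """Histogram density estimate f_h(x) with bins B_j(x0,h) = [x0+(j-1)h, x0+jh)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    j = np.floor((x - x0) / h)                  # bin index containing x
    in_same_bin = np.floor((data - x0) / h) == j  # I{x_i in B_j(x0,h)}
    return in_same_bin.sum() / (n * h)
```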
Figure (Swiss Bank Notes): Diagonal of counterfeit bank notes. Histograms with x0 = 137.8 and h = 0.1 (upper left), h = 0.2 (lower left), h = 0.3 (upper right), h = 0.4 (lower right). MVAhisbank1
Figure (Swiss Bank Notes): Diagonal of counterfeit bank notes. Histograms with h = 0.4 and origins x0 = 137.65 (upper left), x0 = 137.75 (lower left), x0 = 137.85 (upper right), x0 = 137.95 (lower right). MVAhisbank2
Summary: Histograms
- Modes of the density are detected with a histogram.
- Modes correspond to strong peaks in the histogram.
- Histograms with the same h need not be identical. They also depend on the origin x0 of the grid.
- The influence of the origin x0 is drastic. Changing x0 creates different-looking histograms.
Summary: Histograms
- The consequence of a too large h is a flat and unstructured histogram.
- A too small binwidth h results in an unstable histogram.
- There is an optimal binwidth h_opt = (24 √π / n)^{1/3}.
- It is recommended to use averaged histograms. They are kernel densities.
Angewandte Multivariate Statistik Comparison of Batches Kernel densities
Kernel densities
The histogram (at the center of a bin) can be written as

f̂_h(x) = n^{-1} (2h)^{-1} Σ_{i=1}^{n} I(|x − x_i| ≤ h)

Define K(u) = (1/2) I(|u| ≤ 1) (so that the two expressions agree); then

f̂_h(x) = n^{-1} h^{-1} Σ_{i=1}^{n} K( (x − x_i) / h )

K is the kernel.
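The kernel form can be sketched directly; a minimal illustration with the uniform kernel as default (names are ours, not the slides'):

```python
import numpy as np

def kde(x, data, h, kernel=lambda u: 0.5 * (np.abs(u) <= 1.0)):
    """f_h(x) = n^{-1} h^{-1} sum_i K((x - x_i)/h); default K(u) = (1/2) I(|u|<=1)."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    return kernel(u).sum() / (len(data) * h)
```

Any kernel from Table 2 can be passed in place of the default, e.g. the Gaussian `lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)`.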
Kernel functions
K(u)                                     Kernel
K(u) = (1/2) I(|u| ≤ 1)                  Uniform
K(u) = (1 − |u|) I(|u| ≤ 1)              Triangle
K(u) = (3/4)(1 − u²) I(|u| ≤ 1)          Epanechnikov
K(u) = (15/16)(1 − u²)² I(|u| ≤ 1)       Quartic (Biweight)
K(u) = (1/√(2π)) exp(−u²/2) = φ(u)       Gaussian

Table 2: Kernel functions.
Kernel functions
Figure: Kernel functions (Uniform, Triangle, Epanechnikov, Quartic (biweight), Gaussian). MVAkernelfunctions
Figure (Swiss bank notes): Density estimates for the diagonals of genuine and counterfeit bank notes. MVAdenbank
Choice of the bandwidth h
Silverman's rule of thumb

Gaussian kernel: K(u) = (1/√(2π)) exp(−u²/2)

h_G = 1.06 σ̂ n^{−1/5}

Quartic kernel: K(u) = (15/16)(1 − u²)² I(|u| ≤ 1)

h_Q = 2.62 h_G

Sample standard deviation: σ̂ = √( n^{−1} Σ_{i=1}^{n} (x_i − x̄)² )
Figure: Contours of the density of (X5, X6) of genuine and counterfeit bank notes. MVAcontbank2
Summary: Kernel densities
- Kernel densities estimate distribution densities by the kernel method.
- The bandwidth h determines the degree of smoothness of the estimate f̂.
- Kernel densities are smooth functions and they can graphically represent distributions (up to 3 dimensions).
Summary: Kernel densities
- A simple (but not necessarily correct) way to find a good bandwidth is to compute the rule-of-thumb bandwidth h_G = 1.06 σ̂ n^{−1/5}. This bandwidth is to be used only in combination with a Gaussian kernel φ.
- Kernel density estimates are a good descriptive tool for seeing modes, location, skewness, tails, asymmetry, etc.
Angewandte Multivariate Statistik Comparison of Batches Scatterplots
Scatterplots
Scatterplots are bivariate or trivariate plots of variables against each other.
- Rotation of data
- Separation lines
- Draftman's plot
- Brushing
- Parallel coordinate plots
Figure (Swiss bank notes): 2D scatterplot for X5 vs. X6 of the bank notes. Genuine notes are circles, counterfeit are triangles. MVAscabank56
Figure (Swiss bank notes): 3D scatterplot for (X4, X5, X6) of the bank notes. Genuine notes are circles, counterfeit are triangles. MVAscabank456
Figure: Draftman's plot of the bank notes. The pictures in the left-hand column show (X3, X4), (X3, X5) and (X3, X6), in the middle we have (X4, X5) and (X4, X6), and in the lower right (X5, X6). The upper right half contains the corresponding density contour plots. MVAdraftbank4
Summary: Scatterplots
- Scatterplots in two and three dimensions help us in seeing separated points, clouds or sub-clusters.
- They help us in judging positive or negative dependence.
- Draftman scatterplot matrices are useful for detecting structures conditioned on values of certain other variables.
- As the brush of a scatterplot matrix moves through the point cloud, we can study conditional dependence.
Angewandte Multivariate Statistik Comparison of Batches Chernoff-Flury faces
Chernoff-Flury Faces
Figure: Chernoff-Flury faces for observations 91 to 110 of the bank notes. MVAfacebank10
Six variables - face elements
X1 = 1, 19 (eye sizes)
X2 = 2, 20 (pupil sizes)
X3 = 4, 22 (eye slants)
X4 = 11, 29 (upper hair lines)
X5 = 12, 30 (lower hair lines)
X6 = 13, 14, 31, 32 (face lines and darkness of hair)
Figure (Observations 1 to 50): Flury faces for observations 1 to 50 of the bank notes. MVAfacebank50
Figure (Observations 51 to 100): Flury faces for observations 51 to 100 of the bank notes. MVAfacebank50
Figure (Observations 101 to 150): Flury faces for observations 101 to 150 of the bank notes. MVAfacebank50
Figure (Observations 151 to 200): Flury faces for observations 151 to 200 of the bank notes. MVAfacebank50
Summary: Faces
- Faces can be used to detect subgroups in multivariate data.
- Subgroups are characterized by similar looking faces.
- Outliers are identified by extreme faces (e.g. dark hair, smile or happy face).
- If one element of X is unusual, the corresponding face element changes significantly in shape.
Angewandte Multivariate Statistik Comparison of Batches Andrews’ Curves
Andrews’ Curves
Each multivariate observation Xi = (Xi,1, . . . , Xi,p) ∈ R^p is transformed into a curve as follows:

p odd:

f_i(t) = Xi,1/√2 + Xi,2 sin(t) + Xi,3 cos(t) + . . . + Xi,p−1 sin( ((p − 1)/2) t ) + Xi,p cos( ((p − 1)/2) t )

p even:

f_i(t) = Xi,1/√2 + Xi,2 sin(t) + Xi,3 cos(t) + . . . + Xi,p sin( (p/2) t )

such that the observation represents the coefficients of a so-called Fourier series, t ∈ [−π, π].
Andrews’ Curves
- Subgroups are characterized by similar curves.
- Outliers are characterized by single curves.
- Order plays an important role in the interpretation.
Let us take the 96th observation of the Swiss bank note dataset,
X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)
The Andrews’ curve is:
f96(t) = 215.6/√2 + 129.9 sin(t) + 129.9 cos(t) + 9.0 sin(2t) + 9.5 cos(2t) + 141.7 sin(3t)
Figure (Andrews curves, Bank data): Andrews' curves of the observations 96–105 of the Swiss bank note data. The order of the variables is 1,2,3,4,5,6. MVAandcur
Let us take the 96th observation of the Swiss bank note dataset,
X96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)
The Andrews’ curve using the reversed order of variables is:
f96(t) = 141.7/√2 + 9.5 sin(t) + 9.0 cos(t) + 129.9 sin(2t) + 129.9 cos(2t) + 215.6 sin(3t)
Figure (Andrews curves, Bank data): Andrews' curves of the observations 96–105 of the Swiss bank note data. The order of the variables is 6,5,4,3,2,1. MVAandcur2
Summary: Andrews’ Curves
- Outliers appear as single Andrews' curves which look different from the rest.
- A subgroup is characterized by a set of similar curves.
- The order of the variables plays an important role for interpretation.
- The order of variables may be optimized by Principal Component Analysis.
- For more than 20 observations we obtain a bad "signal-to-ink ratio", which means we cannot see the structure of so many curves.
Angewandte Multivariate Statistik Comparison of Batches Parallel coordinate plots
Parallel Coordinate Plots
Parallel coordinate plots
- are not based on an orthogonal coordinate system,
- allow to see more than four dimensions.

Idea
Instead of plotting observations in an orthogonal coordinate system, one draws their coordinates in a system of parallel axes. This way of representation is, however, sensitive to the order of the variables.
Figure (Parallel coordinates plot, Bank data): Parallel coordinate plot of observations 96–105. MVAparcoo1
Figure (Parallel coordinates plot, Bank data): The full bank dataset. Genuine bank notes are displayed as black lines, the forged bank notes as red lines. MVAparcoo2
Summary: Parallel coordinate plots
- Parallel coordinate plots overcome the visualisation problem of the Cartesian coordinate system for dimensions greater than 4.
- Outliers are seen as outlying polygon curves.
- The order of variables is still important for the detection of subgroups.
- Subgroups may be screened by selective coloring in an interactive manner.
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Elementary Operations
A Short Excursion into Matrix Algebra
A(n × p) = ( a11 · · · a1p
              ⋮    ⋱    ⋮
             an1 · · · anp )
Definition           Notation
Transpose            A⊤
Sum                  A + B
Difference           A − B
Scalar product       c · A
Product              A · B
Rank                 rank(A)
Trace                tr(A)
Determinant          det(A) = |A|
Inverse              A^{−1}
Generalised inverse  A^− : A A^− A = A

Table 3: Elementary matrix operations.
Name             Definition        Notation  Example
scalar           p = n = 1         a         3
column vector    p = 1             a         (1, 3)⊤
row vector       n = 1             a⊤        (1  3)
vector of ones   (1, . . . , 1)⊤   1_n       (1, 1)⊤
vector of zeros  (0, . . . , 0)⊤   0_n       (0, 0)⊤
square matrix    n = p             A(p × p)  (2 0; 0 2)

Table 4: Special matrices and vectors.
Name              Definition               Notation    Example
diagonal matrix   a_ij = 0, i ≠ j, n = p   diag(a_ii)  (1 0; 0 2)
identity matrix   diag(1, . . . , 1)       I_p         (1 0; 0 1)
unit matrix       a_ij = 1, n = p          1_n 1_n⊤    (1 1; 1 1)
symmetric matrix  a_ij = a_ji                          (1 2; 2 3)

Table 5: Special matrices and vectors.
Name                     Definition       Example
null matrix              a_ij = 0         (0 0; 0 0)
upper triangular matrix  a_ij = 0, i > j  (1 2 4; 0 1 3; 0 0 1)
idempotent matrix        A² = A           (1/2 1/2; 1/2 1/2)
orthogonal matrix        A⊤A = I = AA⊤    (1/√2 1/√2; 1/√2 −1/√2)

Table 6: Special matrices and vectors.
Properties of a Square Matrix
For any A(n × n) and B(n × n) and any scalar c
tr(A + B) = tr(A) + tr(B)
tr(cA) = c tr(A)
|cA| = c^n |A|
tr(AB) = tr(BA)
|AB| = |BA|
|AB| = |A| |B|
|A^{−1}| = |A|^{−1}
Eigenvalues and Eigenvectors
Square matrix A(n × n)
Eigenvalue λ = Eval(A), eigenvector γ = Evec(A):

A γ = λ γ

Using the spectral decomposition, it can be shown that:

|A| = Π_{j=1}^{n} λ_j,    tr(A) = Σ_{j=1}^{n} λ_j
Summary: Matrix Algebra
- The determinant |A| is a product of the eigenvalues of A.
- The inverse of a matrix A exists if |A| ≠ 0.
- The trace tr(A) is the sum of the eigenvalues of A.
- The sum of the traces of two matrices equals the trace of the sum of the two matrices.
- The trace tr(AB) equals tr(BA).
- rank(A) is the maximum number of linearly independent rows (columns) of A.
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Spectral Decomposition
Spectral Decomposition
Every symmetric matrix A(p × p) can be written as

A = Γ Λ Γ⊤ = Σ_{j=1}^{p} λ_j γ_j γ_j⊤

with Λ = diag(λ1, · · · , λp) and Γ = (γ1, · · · , γp).
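numpy's symmetric eigendecomposition returns exactly the Γ and Λ of this representation; a quick check on a 2 × 2 example (the matrix is our choice):

```python
import numpy as np

# Spectral decomposition of a symmetric matrix: A = Gamma Lambda Gamma^T
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])            # Sigma with rho = 0.5
lam, Gamma = np.linalg.eigh(A)        # eigenvalues ascending, columns = eigenvectors
reconstructed = Gamma @ np.diag(lam) @ Gamma.T
```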
Covariance matrix
Σ = ( 1  ρ
      ρ  1 )

Eigenvalues:

| 1 − λ    ρ    |
|   ρ    1 − λ  | = 0

λ1 = 1 + ρ, λ2 = 1 − ρ, Λ = diag(1 + ρ, 1 − ρ)

Eigenvectors:

( 1  ρ ) ( x1 )           ( x1 )
( ρ  1 ) ( x2 )  = (1 + ρ) ( x2 )

MVAspecdecomp
x1 + ρ x2 = x1 + ρ x1
ρ x1 + x2 = x2 + ρ x2
⇒ x1 = x2.

γ1 = (1/√2, 1/√2)⊤
γ2 = (1/√2, −1/√2)⊤

Γ = (γ1, γ2) = ( 1/√2   1/√2
                 1/√2  −1/√2 )

Check: A = Γ Λ Γ⊤
Eigenvectors
The direction of the first eigenvector is the main direction of the point cloud. The second eigenvector is orthogonal to the first one.
This eigenvector direction is in general different from the LS regression line.
Figure (normal sample, n = 150): Scatterplot of observed data (sample size n = 150) and the same data displayed in the coordinate system given by the eigenvectors of the covariance matrix.
Singular Value Decomposition (SVD)
A(n × p), rank(A) = r

A = Γ Λ ∆⊤

Γ(n × r), ∆(p × r), Γ⊤Γ = ∆⊤∆ = I_r and Λ = diag(λ1^{1/2}, . . . , λr^{1/2}), λ_j > 0.
λ_j = Eval(A⊤A)
Γ and ∆ consist of the corresponding eigenvectors of AA⊤ and A⊤A.
A G-inverse of A may be defined as A^− = ∆ Λ^{−1} Γ⊤, so that A A^− A = A.
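The G-inverse construction can be checked with numpy's SVD; `g_inverse` is a hypothetical helper (numpy's own `np.linalg.pinv` computes the Moore-Penrose inverse in the same spirit):

```python
import numpy as np

def g_inverse(A):
    """G-inverse via the thin SVD A = Gamma Lambda Delta^T: A^- = Delta Lambda^{-1} Gamma^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-12))                       # rank(A)
    return Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
```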
Summary: Spectral Decomposition
- The spectral (Jordan) decomposition gives a representation of a symmetric matrix in terms of eigenvalues and eigenvectors.
- The eigenvectors belonging to the largest eigenvalues point into the "main direction" of the data.
- The Jordan decomposition allows to easily compute the power of a matrix A: A^α = Γ Λ^α Γ⊤.
- A^{−1} = Γ Λ^{−1} Γ⊤, A^{1/2} = Γ Λ^{1/2} Γ⊤.
Summary: Spectral Decomposition
- The singular value decomposition (SVD) is a generalization of the Jordan decomposition to non-quadratic matrices.
- The direction of the first eigenvector of the covariance matrix of a two-dimensional point cloud is different from the least squares regression line.
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Quadratic Forms
Quadratic Forms
A quadratic form with a symmetric matrix A(p × p) can be written as

Q(x) = x⊤Ax = Σ_{i=1}^{p} Σ_{j=1}^{p} a_ij x_i x_j

Definiteness:
Q(x) > 0 for all x ≠ 0: positive definite (pd)
Q(x) ≥ 0 for all x ≠ 0: positive semidefinite (psd)

A is pd (psd) iff Q(x) = x⊤Ax is pd (psd).
Example:

Q(x) = x⊤Ax = x1² + x2², A = (1 0; 0 1)
Eigenvalues: λ1 = λ2 = 1 → positive definite

Q(x) = (x1 − x2)², A = (1 −1; −1 1)
Eigenvalues: λ1 = 2, λ2 = 0 → positive semidefinite

Q(x) = x1² − x2²
Eigenvalues: λ1 = 1, λ2 = −1 → indefinite
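These classifications follow directly from the eigenvalues; a minimal sketch of such a check (the function name is ours):

```python
import numpy as np

def definiteness(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    return "indefinite"
```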
Theorem
If A is symmetric and Q(x) = x⊤Ax is the corresponding quadratic form, then there exists a transformation x ↦ Γ⊤x = y such that

x⊤Ax = Σ_{i=1}^{p} λ_i y_i²,

where λ_i are the eigenvalues of A.

Lemma
A > 0 ⇔ λ_i > 0, A ≥ 0 ⇔ λ_i ≥ 0, i = 1, . . . , p.
Theorem (Theorem 2.5)
If A and B are symmetric and B > 0, then the maximum of x⊤Ax / x⊤Bx is given by the largest eigenvalue of B^{−1}A. More generally,

max_x x⊤Ax / x⊤Bx = λ1 ≥ λ2 ≥ · · · ≥ λp = min_x x⊤Ax / x⊤Bx,

where λ1, . . . , λp denote the eigenvalues of B^{−1}A. The vector which maximises (minimises) x⊤Ax / x⊤Bx is the eigenvector of B^{−1}A which corresponds to the largest (smallest) eigenvalue of B^{−1}A. If x⊤Bx = 1, we get

max_x x⊤Ax = λ1 ≥ λ2 ≥ · · · ≥ λp = min_x x⊤Ax
Summary: Quadratic forms
- A quadratic form can be described by a symmetric quadratic matrix A.
- Quadratic forms can always be diagonalized.
- Positive definiteness of a quadratic form is equivalent to positiveness of the eigenvalues of the matrix A.
- The maximum and minimum of a quadratic form under constraints can be expressed in terms of eigenvalues.
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Derivatives
Derivatives
For f : R^p → R and a (p × 1) vector x:

∂f(x)/∂x   is the column vector of the partial derivatives ∂f(x)/∂x_j, j = 1, . . . , p,
∂f(x)/∂x⊤  is the row vector of the same derivatives.

∂f(x)/∂x is called the gradient of f.
Second order derivatives:

∂²f(x)/∂x∂x⊤ is the (p × p) Hessian matrix of the second derivatives ∂²f(x)/∂x_i∂x_j, i = 1, . . . , p, j = 1, . . . , p.

Some useful formulae
For A(p × p) with A = A⊤, x(p × 1) ∈ R^p, a(p × 1):

∂a⊤x/∂x = ∂x⊤a/∂x = a
Example:
f : R^p → R, f(x) = a⊤x, a = (1, 2)⊤, x = (x1, x2)⊤

∂a⊤x/∂x = ∂(x1 + 2 x2)/∂x = (1, 2)⊤ = a
Derivatives of the quadratic form:

∂x⊤Ax/∂x = 2Ax

∂²x⊤Ax/∂x∂x⊤ = 2A
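The gradient formula can be verified numerically with central differences (the matrix and point are our choices; for a quadratic form the central difference is exact up to rounding):

```python
import numpy as np

# Numerical check that d(x^T A x)/dx = 2 A x for symmetric A
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, -2.0])
Q = lambda v: v @ A @ v

eps = 1e-6
grad_num = np.array([(Q(x + eps * e) - Q(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])  # central differences per coordinate
grad_exact = 2 * A @ x
```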
Summary: Derivatives
- The column vector ∂f(x)/∂x is called the gradient.
- The gradient ∂a⊤x/∂x = ∂x⊤a/∂x equals a.
- The derivative of the quadratic form ∂x⊤Ax/∂x equals 2Ax.
- The Hessian of f : R^p → R is the (p × p) matrix of the second derivatives ∂²f(x)/∂x_i∂x_j.
- The Hessian of the quadratic form x⊤Ax equals 2A.
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Partitioned Matrices
Partitioned Matrices
A(n × p), B(n × p) partitioned as

A = ( A11  A12
      A21  A22 )

with A_ij(n_i × p_j), n1 + n2 = n and p1 + p2 = p.

A + B = ( A11 + B11   A12 + B12
          A21 + B21   A22 + B22 )

B⊤ = ( B11⊤  B21⊤
       B12⊤  B22⊤ )

AB⊤ = ( A11 B11⊤ + A12 B12⊤    A11 B21⊤ + A12 B22⊤
        A21 B11⊤ + A22 B12⊤    A21 B21⊤ + A22 B22⊤ )
For A(p × p) nonsingular, partitioned in such a way that A11 and A22 are square matrices,

A^{−1} = ( A^{11}  A^{12}
           A^{21}  A^{22} )

where

A^{11} = (A11 − A12 A22^{−1} A21)^{−1} =: (A11·2)^{−1}
A^{12} = −(A11·2)^{−1} A12 A22^{−1}
A^{21} = −A22^{−1} A21 (A11·2)^{−1}
A^{22} = A22^{−1} + A22^{−1} A21 (A11·2)^{−1} A12 A22^{−1}
If A11 is non-singular:

|A| = |A11| |A22 − A21 A11^{−1} A12|

If A22 is non-singular:

|A| = |A22| |A11 − A12 A22^{−1} A21|

For B = ( 1  b⊤
          a  A ):

|B| = |A − a b⊤| = |A| |1 − b⊤ A^{−1} a|

(A − a b⊤)^{−1} = A^{−1} + (A^{−1} a b⊤ A^{−1}) / (1 − b⊤ A^{−1} a)
Summary: Partitioned Matrices
For partitioned matrices A(n × p) = (A11 A12; A21 A22) and B(n × p) = (B11 B12; B21 B22) it holds that

A + B = ( A11 + B11   A12 + B12
          A21 + B21   A22 + B22 ).
Summary: Partitioned Matrices
The product AB⊤ equals

( A11 B11⊤ + A12 B12⊤    A11 B21⊤ + A12 B22⊤
  A21 B11⊤ + A22 B12⊤    A21 B21⊤ + A22 B22⊤ ).
Summary: Partitioned Matrices
For A nonsingular with A11, A22 square matrices,

A^{−1} = ( A^{11}  A^{12}
           A^{21}  A^{22} )

A^{11} = (A11 − A12 A22^{−1} A21)^{−1} =: (A11·2)^{−1}
A^{12} = −(A11·2)^{−1} A12 A22^{−1}
A^{21} = −A22^{−1} A21 (A11·2)^{−1}
A^{22} = A22^{−1} + A22^{−1} A21 (A11·2)^{−1} A12 A22^{−1}
Summary: Partitioned Matrices
For B = ( 1  b⊤
          a  A )
and non-singular A we have

|B| = |A − a b⊤| = |A| |1 − b⊤ A^{−1} a|

(A − a b⊤)^{−1} = A^{−1} + (A^{−1} a b⊤ A^{−1}) / (1 − b⊤ A^{−1} a)
Angewandte Multivariate Statistik A Short Excursion into Matrix Algebra Geometrical Aspects
Geometrical Aspects
Distance function d : R^{2p} → R_+

d²(x, y) = (x − y)⊤ A (x − y), A > 0

A = I_p: Euclidean distance.

Iso-distance curve: E_d = {x ∈ R^p | (x − x0)⊤(x − x0) = d²}

Example: x ∈ R², x0 = 0: x1² + x2² = 1

Norm of a vector w.r.t. the metric I_p:

‖x‖_{I_p} = d(0, x) = √(x⊤x)
Angle between Vectors
Scalar product:

⟨x, y⟩ = x⊤y
⟨x, y⟩_A = x⊤Ay

Norm of a vector:

‖x‖_{I_p} = d(0, x) = √(x⊤x)
‖x‖_A = √(x⊤Ax)

Unit vectors: x with ‖x‖ = 1
Angle between Two Vectors
The angle θ between vectors x and y can be calculated as

cos θ = x⊤y / (‖x‖ ‖y‖)

Example: Angle = Correlation
Observations {x_i}_{i=1}^{n}, {y_i}_{i=1}^{n} with x̄ = ȳ = 0:

r_XY = Σ x_i y_i / √( Σ x_i² Σ y_i² ) = cos θ

Correlation corresponds to the angle between x, y ∈ R^n.
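The angle-correlation identity is easy to confirm on centered data (the vectors are our choice):

```python
import numpy as np

# For centered vectors, the correlation equals the cosine of the angle between them
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 5.0])
xc, yc = x - x.mean(), y - y.mean()   # center: x_bar = y_bar = 0

r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))          # empirical correlation
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
```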
Column space
X (n × p) data matrix
C(X) = {x ∈ R^n | ∃ a ∈ R^p so that X a = x}

Projection matrix:
P(n × n) with P = P⊤ = P² (P is idempotent).
For b ∈ R^n, a = P b is the projection of b on C(P).
Projection on C(X)
X(n × p), P = X (X⊤X)^{−1} X⊤

P X = X, P is a projector: P P = P.

Q = I_n − P, Q² = Q, Q X = 0.

The projection of x on y is

p_x = (y⊤x / ‖y‖²) y
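The projector identities can be checked numerically; a minimal sketch (the design matrix is our choice):

```python
import numpy as np

# Projection onto the column space C(X): P = X (X^T X)^{-1} X^T
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(3) - P                      # projector onto the orthogonal complement
```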
Summary: Geometrical aspects
- A distance between two p-dimensional points x, y is a quadratic form (x − y)⊤A(x − y) in the vector of differences (x − y). A distance defines the norm of a vector.
- Iso-distance curves of a point x0 are all those points which have the same distance from x0. Iso-distance curves are ellipsoids whose principal axes are determined by the directions of the eigenvectors of A. The half-lengths of the principal axes are proportional to the inverse of the square roots of the eigenvalues of A.
Summary: Geometrical aspects
- The angle between two vectors x and y w.r.t. the metric A is given by cos θ = x⊤Ay / (‖x‖_A ‖y‖_A).
- For the Euclidean distance with A = I, the correlation between two centered data vectors x and y is given by the cosine of the angle between them, i.e. cos θ = r_XY.
- P = X (X⊤X)^{−1} X⊤ is the projection onto the column space C(X) of X.
- The projection of x ∈ R^n on y ∈ R^n is given by p_x = (y⊤x / ‖y‖²) y.
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Covariance
Covariance is a measure of (linear) dependency between variables.
σXY = Cov(X ,Y ) = E(XY )− (EX )(EY )
Covariance of X with itself:
σXX = Var(X ) = Cov(X ,X )
Covariance matrix for p-dimensional X :
Σ = [ σ_X1X1 ... σ_X1Xp ]
    [   ...  ...   ...  ]
    [ σ_XpX1 ... σ_XpXp ]
Ostap Okhrin 99 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Empirical versions:
s_XY = n^{-1} ∑_{i=1}^n (x_i − x̄)(y_i − ȳ)

s_XX = n^{-1} ∑_{i=1}^n (x_i − x̄)²
Empirical covariance matrix:
S = [ s_X1X1 ... s_X1Xp ]
    [   ...  ...   ...  ]
    [ s_XpX1 ... s_XpXp ]
Ostap Okhrin 100 of 461
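The empirical formulas above translate directly into matrix code. A small numpy sketch on simulated data (the factor 1/n of the slides vs. the small-sample factor 1/(n − 1) used by `np.cov` is shown explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))         # simulated data matrix
n = X.shape[0]

xbar = X.mean(axis=0)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I_n - n^{-1} 1 1'
S = X.T @ H @ X / n                   # S = n^{-1} X' H X (factor 1/n)

# agrees with the elementwise definition s_jk = n^{-1} sum_i (x_ij - xbar_j)(x_ik - xbar_k)
S2 = (X - xbar).T @ (X - xbar) / n
print(np.allclose(S, S2))

# numpy's np.cov uses the small-sample factor 1/(n-1) by default
print(np.allclose(np.cov(X, rowvar=False), S * n / (n - 1)))
```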
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Example: Swiss bank data
X1 = length of the billX2 = height of the bill (left)X3 = height of the bill (right)X4 = distance of the inner frame to the lower borderX5 = distance of the inner frame to the upper borderX6 = length of the diagonal of the central picture.
Ostap Okhrin 101 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
X full bank dataset
S = [  0.14  0.03  0.02 −0.10 −0.01  0.08 ]
    [  0.03  0.12  0.10  0.21  0.10 −0.21 ]
    [  0.02  0.10  0.16  0.28  0.12 −0.24 ]
    [ −0.10  0.21  0.28  2.07  0.16 −1.03 ]
    [ −0.01  0.10  0.12  0.16  0.64 −0.54 ]
    [  0.08 −0.21 −0.24 −1.03 −0.54  1.32 ]

s_X1X1 = s_11 = 0.14
s_X4X5 = 0.16
Ostap Okhrin 102 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Scatterplots with point clouds that are "upward-sloping" show variables with positive covariance.

Scatterplots with a "downward-sloping" structure show variables with negative covariance.
Ostap Okhrin 103 of 461
[Figure: Scatterplot of variables X4 vs. X5 of the full bank dataset. MVAscabank45]
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Example: "classic blue" pullover
Sales of "classic blue" pullovers in 10 periods.
X1 number of pullovers sold
X2 price in EUR
X3 advertisement cost in EUR
X4 presence of sales assistant in hours per period
Does price have a big influence on the number of pullovers sold?
s_X1X2 = −80.02
Ostap Okhrin 105 of 461
[Figure: Scatterplot of variables X2 (price) vs. X1 (sales) of the pullovers dataset. MVAscapull1]
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Summary: Covariance
The covariance is a measure of dependence.
Covariance measures only linear dependence.
There are nonlinear dependencies that have zero covariance.
Zero covariance does not imply independence.
Independence implies zero covariance.
Covariance is scale dependent.
Ostap Okhrin 107 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Covariance
Summary: Covariance
Negative covariance corresponds to downward-sloping scatterplots.
Positive covariance corresponds to upward-sloping scatterplots.
The covariance of a variable with itself is its variance: Cov(X, X) = σ_XX.
For small n we should replace the factor 1/n in the computation of the covariance by 1/(n − 1).
Ostap Okhrin 108 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Correlation
ρ_XY = Cov(X, Y) / √{Var(X) Var(Y)}

The empirical version of ρ_XY:

r_XY = s_XY / √(s_XX s_YY)
Ostap Okhrin 109 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Correlation matrix:
P = [ ρ_X1X1 ... ρ_X1Xp ]
    [   ...  ...   ...  ]
    [ ρ_XpX1 ... ρ_XpXp ]

Empirical correlation matrix:

R = [ r_X1X1 ... r_X1Xp ]
    [   ...  ...   ...  ]
    [ r_XpX1 ... r_XpXp ]
Ostap Okhrin 110 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Example: Swiss bank dataFor genuine bank notes:
R_g = [  1.00  0.41  0.41  0.22  0.05  0.03 ]
      [  0.41  1.00  0.66  0.24  0.20 −0.25 ]
      [  0.41  0.66  1.00  0.25  0.13 −0.14 ]
      [  0.22  0.24  0.25  1.00 −0.63 −0.00 ]
      [  0.05  0.20  0.13 −0.63  1.00 −0.25 ]
      [  0.03 −0.25 −0.14 −0.00 −0.25  1.00 ]
Ostap Okhrin 111 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
For forged bank notes:
R_f = [  1.00  0.35  0.24 −0.25  0.08  0.06 ]
      [  0.35  1.00  0.61 −0.08 −0.07 −0.03 ]
      [  0.24  0.61  1.00 −0.05  0.00  0.20 ]
      [ −0.25 −0.08 −0.05  1.00 −0.68  0.37 ]
      [  0.08 −0.07  0.00 −0.68  1.00 −0.06 ]
      [  0.06 −0.03  0.20  0.37 −0.06  1.00 ]
The correlation between X4 and X5 is negative!
Ostap Okhrin 112 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
If X and Y are independent, then Cov(X ,Y ) = ρ(X ,Y ) = 0.
The converse is not true in general.
Example:
standard normally distributed random variable X
random variable Y = X², which is surely not independent of X

Cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) = 0

(because E(X) = 0 and E(X³) = 0 by the symmetry of the standard normal distribution) and therefore ρ(X, Y) = 0, too.
Ostap Okhrin 113 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Test of Correlation
Fisher's Z-transformation (variance stabilizing transformation):

W = (1/2) log{(1 + r_XY)/(1 − r_XY)}

E(W) ≈ (1/2) log{(1 + ρ_XY)/(1 − ρ_XY)}
Var(W) ≈ 1/(n − 3)

Z = {W − E(W)} / √Var(W)  →_L  N(0, 1)
Ostap Okhrin 114 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Example: Car dataset
Correlation between mileage (X2) and weight (X8)
n = 74, r_X2X8 = −0.823

H0: ρ = 0  H1: ρ ≠ 0

w = (1/2) log{(1 + r_X2X8)/(1 − r_X2X8)} = −1.166,  z = (−1.166 − 0)/√{1/71} = −9.825

H0: ρ = −0.75

z = {−1.166 − (−0.973)}/√{1/71} = −1.627.
Ostap Okhrin 115 of 461
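The two z-statistics of the car example can be reproduced from the slide's summary values (n = 74, r = −0.823); a short Python sketch:

```python
import numpy as np

# car data example, summary values taken from the slide
n, r = 74, -0.823

w = 0.5 * np.log((1 + r) / (1 - r))          # Fisher's Z-transformation
z0 = (w - 0) / np.sqrt(1 / (n - 3))          # test of H0: rho = 0

rho0 = -0.75                                 # test of H0: rho = -0.75
Ew = 0.5 * np.log((1 + rho0) / (1 - rho0))   # approx. E(W) under H0
z1 = (w - Ew) / np.sqrt(1 / (n - 3))

print(round(w, 3), round(z0, 2), round(z1, 2))
```

The printed values agree with the slide's −1.166, −9.825 and −1.627 up to rounding.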
[Figure: Mileage (X2) vs. weight (X8) of U.S. (star), European (plus) and Japanese (circle) cars. MVAscacar]
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Summary: Correlation
The correlation is a standardized measure of dependence.
The absolute value of the correlation is always less than or equal to one.
Correlation measures only linear dependence.
There are nonlinear dependencies that have zero correlation.
Zero correlation does not imply independence.
Ostap Okhrin 117 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Correlation
Summary: Correlation
Independence implies zero correlation.
Negative correlation corresponds to downward-sloping scatterplots.
Positive correlation corresponds to upward-sloping scatterplots.
Fisher's Z-transformation helps us in testing hypotheses on correlation.
For small samples, Fisher's Z-transformation can be improved by W* = W − {3W + tanh(W)}/{4(n − 1)}.
Ostap Okhrin 118 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Summary Statistics
X (n × p) data matrix
X = [ x_11 ... x_1p ]
    [  ... ...  ... ]
    [ x_n1 ... x_np ]

x_i = (x_i1, ..., x_ip)^⊤ ∈ R^p: i-th observation of a p-dimensional random variable X ∈ R^p
Ostap Okhrin 119 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Mean
x̄ = (x̄_1, ..., x̄_p)^⊤ = n^{-1} X^⊤ 1_n

Empirical covariance matrix

S = n^{-1} X^⊤X − x̄ x̄^⊤
  = n^{-1} (X^⊤X − n^{-1} X^⊤ 1_n 1_n^⊤ X) = n^{-1} X^⊤ H X

Centering matrix
H = I_n − n^{-1} 1_n 1_n^⊤
Ostap Okhrin 120 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Empirical correlation matrix
R = D^{-1/2} S D^{-1/2}

with D = diag(s_XjXj) and D^{-1/2} = diag(s_XjXj^{-1/2}) for j = 1, ..., p.
Ostap Okhrin 121 of 461
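The standardization R = D^{-1/2} S D^{-1/2} is a one-liner in matrix code; a sketch on simulated data, cross-checked against numpy's built-in correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
n = X.shape[0]

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                              # empirical covariance matrix
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(S)))  # D^{-1/2} = diag(s_jj^{-1/2})
R = D_inv_sqrt @ S @ D_inv_sqrt                # R = D^{-1/2} S D^{-1/2}

print(np.allclose(np.diag(R), 1))              # unit diagonal
print(np.allclose(R, np.corrcoef(X, rowvar=False)))
```

Note that the factor 1/n vs. 1/(n − 1) in S cancels in R, so the result matches `np.corrcoef` exactly.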
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Linear Transformations
A (q × p) matrix
Y = XA^⊤ = (y_1, ..., y_n)^⊤

ȳ = n^{-1} Y^⊤ 1_n = A x̄
S_Y = n^{-1} Y^⊤ H Y = A S_X A^⊤

Example:
Let x̄ = (1, 2)^⊤ and y = 4x for x ∈ R^2.
Then ȳ = 4x̄ = (4, 8)^⊤.
Ostap Okhrin 122 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Mahalanobis Transformation
Z = (z_1, ..., z_n)^⊤
z_i = S^{-1/2}(x_i − x̄), i = 1, ..., n

S_Z = n^{-1} Z^⊤ H Z = I_p,  z̄ = 0

The Mahalanobis transformation leads to a standardized, uncorrelated, zero-mean data matrix Z.
Ostap Okhrin 123 of 461
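The matrix root S^{-1/2} can be computed from the spectral decomposition of S. A numpy sketch on simulated correlated data, checking that the transformed data have identity covariance and zero mean:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]], size=500)
n = X.shape[0]

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / n

# matrix square root S^{-1/2} via the spectral decomposition S = G L G'
lam, G = np.linalg.eigh(S)
S_inv_sqrt = G @ np.diag(lam ** -0.5) @ G.T

Z = (X - xbar) @ S_inv_sqrt                # z_i = S^{-1/2}(x_i - xbar)
print(np.allclose(Z.mean(axis=0), 0, atol=1e-10))
SZ = Z.T @ Z / n
print(np.allclose(SZ, np.eye(2)))          # S_Z = I_p
```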
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Summary: Summary Statistics
The center of gravity of a data matrix is given by its mean vector x̄ = n^{-1} X^⊤ 1_n.

The dispersion of the observations in a data matrix is given by the empirical covariance matrix S = n^{-1} X^⊤ H X.

The empirical correlation matrix is given by R = D^{-1/2} S D^{-1/2}.
Ostap Okhrin 124 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Summary Statistics
Summary: Summary Statistics
A linear transformation Y = XA^⊤ of a data matrix X has mean A x̄ and empirical covariance A S_X A^⊤.

The Mahalanobis transformation is a linear transformation z_i = S^{-1/2}(x_i − x̄) which gives a standardized, uncorrelated data matrix Z.
Ostap Okhrin 125 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
One-sample t-test
We have iid observations x_1, ..., x_n.
Assume that the observations stem from N(µ, σ²).
Then x̄_n ∼ N(µ, σ²/n), i.e.

√n (x̄_n − µ)/σ ∼ N(0, 1).
Ostap Okhrin 126 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
H0: µ = µ0  H1: µ ≠ µ0
Assume that σ² is known. Under H0,

√n (x̄_n − µ0)/σ ∼ N(0, 1),

and we reject H0 if √n |x̄_n − µ0|/σ exceeds the 1 − α/2 standard normal quantile.

Show that P(reject H0 | H0 is true) = α.
Ostap Okhrin 127 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
Usually σ² is not known and we have to estimate it:

σ̂²_n = {1/(n − 1)} ∑_{i=1}^n (x_i − x̄_n)².

It can be shown that

√n (x̄_n − µ)/σ̂_n ∼ t_{n−1}.

Note: the t-distribution t_n approaches N(0, 1) as n → ∞ (parameter n: degrees of freedom).
Ostap Okhrin 128 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
Test:
H0: E(X) = µ0  H1: E(X) ≠ µ0
We reject H0 if

√n |x̄_n − µ0|/σ̂_n > t_{1−α/2;n−1}.

t_{1−α/2;n−1}: critical value at level α (i.e. the 1 − α/2 quantile) of the Student's t-distribution with (n − 1) degrees of freedom.
Ostap Okhrin 129 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
Example: Car damage
McCullagh and Nelder (1989). The response variable C is "average costs of claims (in British pounds)".
H0: average costs = 200  H1: average costs ≠ 200

C̄_n = 222.11,  σ̂_n = 123.22,  n = 128

√n (C̄_n − 200)/σ̂_n = 2.0301 > t_{0.975;n−1} = 1.9788

We reject that the average costs are equal to 200.
Ostap Okhrin 130 of 461
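Taking the slide's summary statistics as given, the test statistic and the critical value can be reproduced with scipy:

```python
import numpy as np
from scipy import stats

# car damage example, summary values taken from the slide
n, mean, sd, mu0 = 128, 222.11, 123.22, 200.0

t_stat = np.sqrt(n) * (mean - mu0) / sd        # sqrt(n)(xbar - mu0)/sigma_hat
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 5% critical value

print(round(t_stat, 4), round(t_crit, 4))      # 2.0301 > 1.9788: reject H0
```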
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
Two-sample t-test
We have two iid samples y_11, ..., y_1n and y_21, ..., y_2m.
Assume that Y_1i ∼ N(µ1, σ²) and Y_2j ∼ N(µ2, σ²).

H0: µ1 = µ2  H1: µ1 ≠ µ2
Pooled estimate of the variance:

σ̂²_P = {1/(m + n − 2)} { ∑_{i=1}^n (y_1i − ȳ_1)² + ∑_{j=1}^m (y_2j − ȳ_2)² }
Ostap Okhrin 131 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions One-Sample and Two-Sample t-Test
Test statistic
T = √{mn/(m + n)} {(ȳ_1 − ȳ_2) − (µ1 − µ2)}/σ̂_P ∼ t_{n+m−2}

Reject H0 if |T| > t_{1−α/2;n+m−2}.
Ostap Okhrin 132 of 461
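The pooled statistic agrees with the usual equal-variance two-sample t-test; a sketch on simulated samples, cross-checked against scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y1 = rng.normal(5.0, 2.0, size=12)             # sample 1, size n
y2 = rng.normal(5.5, 2.0, size=15)             # sample 2, size m
n, m = len(y1), len(y2)

# pooled variance estimate
sp2 = (np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)) / (m + n - 2)
# T = sqrt(mn/(m+n)) (ybar1 - ybar2) / sigma_P   (under H0: mu1 = mu2)
T = np.sqrt(m * n / (m + n)) * (y1.mean() - y2.mean()) / np.sqrt(sp2)

# matches scipy's equal-variance two-sample t-test
t_ref, p_ref = stats.ttest_ind(y1, y2, equal_var=True)
print(np.isclose(T, t_ref))
```

The prefactor works because √{mn/(m+n)} = 1/√(1/n + 1/m).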
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Linear Model for Two Variables
y_i = β0 + β1 x_i + ε_i,  E(ε_i) = 0,  Var(ε_i) = σ²,  i = 1, ..., n
β0 = intercept, β1 = slope

Estimate (β0, β1) by least squares:

(β̂0, β̂1) = argmin_{(β0,β1)} ∑_{i=1}^n (y_i − β0 − β1 x_i)²

β̂1 = s_XY/s_XX = Cov(x, y)/Var(x)

β̂0 = ȳ − β̂1 x̄
Ostap Okhrin 133 of 461
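The closed-form estimates β̂1 = s_XY/s_XX and β̂0 = ȳ − β̂1 x̄ are easy to verify against a library fit. A sketch on simulated price/sales-style data (not the actual pullover dataset):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(80, 120, size=10)              # e.g. prices
y = 210 - 0.4 * x + rng.normal(0, 5, size=10)  # e.g. sales

s_xy = np.mean(x * y) - x.mean() * y.mean()    # empirical covariance (factor 1/n)
s_xx = np.mean(x**2) - x.mean()**2             # empirical variance

beta1 = s_xy / s_xx                            # slope:     beta1 = s_XY / s_XX
beta0 = y.mean() - beta1 * x.mean()            # intercept: beta0 = ybar - beta1 xbar

# agrees with numpy's least-squares fit
b1_ref, b0_ref = np.polyfit(x, y, deg=1)
print(np.allclose([beta0, beta1], [b0_ref, b1_ref]))
```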
[Figure: Regression of sales (X1) on price (X2) of pullovers, β̂0 = 210.8, β̂1 = −0.36. MVAregpull]
[Figure: Regression of upper inner frame (X5) on lower inner frame (X4) for genuine bank notes. MVAregbank]
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Total variation
Regression equations: y_i = β0 + β1 x_i + ε_i and ŷ_i = β̂0 + β̂1 x_i

∑_{i=1}^n (y_i − ȳ)² = ∑_{i=1}^n (ŷ_i − ȳ)² + ∑_{i=1}^n (y_i − ŷ_i)²
       SSTO                  SSTR                   SSE

SSTO = SSTR + SSE

SSTO - variation in the response variable (total variation)
SSTR - variation explained by the linear regression
SSE - error sum of squares
Ostap Okhrin 136 of 461
[Figure: Regression of sales (X1) on price (X2) of pullovers with highlighted distances. MVAregzoom]
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Coefficient of determination
r² = ∑_{i=1}^n (ŷ_i − ȳ)² / ∑_{i=1}^n (y_i − ȳ)² = SSTR/SSTO

r² = 1: variation fully explained by the linear regression, i.e. y is a linear function of x.

r² = 1 − ∑_{i=1}^n (y_i − ŷ_i)² / ∑_{i=1}^n (y_i − ȳ)²
Ostap Okhrin 138 of 461
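The decomposition SSTO = SSTR + SSE and the equivalent r² formulas can be checked numerically; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
yhat = beta0 + beta1 * x                       # fitted values

SSTO = np.sum((y - y.mean())**2)               # total variation
SSTR = np.sum((yhat - y.mean())**2)            # explained variation
SSE = np.sum((y - yhat)**2)                    # error sum of squares

print(np.isclose(SSTO, SSTR + SSE))            # SSTO = SSTR + SSE
r2 = SSTR / SSTO
print(np.isclose(r2, 1 - SSE / SSTO))          # equivalent formulas
print(np.isclose(r2, np.corrcoef(x, y)[0, 1]**2))   # r^2 = (r_XY)^2
```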
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Example: "Classic blue" pullover data
Regress sales on price: β̂0 = 210.774, β̂1 = −0.364, r² = 0.028.
Low r²: sales are not influenced very much by the price (in a linear way).

Regression of Y on X is dissimilar to regression of X on Y.
Ostap Okhrin 139 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
t-Test for β1
H0: β1 = 0 (ρ_XY = 0)  H1: β1 ≠ 0

Var(β̂1) = σ²/(n · s_XX),  SE(β̂1) = σ/(n · s_XX)^{1/2},  t = β̂1/SE(β̂1)

t_{1−α/2;n−2}: critical value at level α (i.e. the 1 − α/2 quantile) of the Student's t-distribution with (n − 2) degrees of freedom

Do not reject H0 if |t| ≤ t_{1−α/2;n−2}
Ostap Okhrin 140 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Example: Swiss bank data

Distance of the inner frame to the lower and to the upper border, i.e. X4 vs. X5.
Why is a negative slope to be expected?

β̂0 = 14.666 and β̂1 = s_XY/s_XX = −0.26347/0.41321 = −0.626.

|t| = |−8.064| > t_{0.975;98} = 1.9845
Ostap Okhrin 141 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Summary: Linear Regression
The linear regression y = β0 + β1x + ε models a linear relationbetween two one-dimensional variables.
The sign of the slope β1 is the same as that of the covariance andthe correlation of x and y .
A linear regression predicts values of Y given a possibleobservation x of X .
Ostap Okhrin 142 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Summary: Linear Regression
The coefficient of determination r2 measures the amount ofvariation in Y which is explained by a linear regression on X .
If the coefficient of determination is r2 = 1, then all points lie onone line.
The regression line of X on Y and the regression line of Y on Xare in general different.
Ostap Okhrin 143 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Linear Model for Two Variables
Summary: Linear Regression
The t-test for the hypothesis β1 = 0 is t = β̂1/SE(β̂1), where SE(β̂1) = σ̂/(n · s_XX)^{1/2}.

The t-test rejects the null hypothesis β1 = 0 at the level of significance α if |t| ≥ t_{1−α/2;n−2}, where t_{1−α/2;n−2} is the 1 − α/2 quantile of the Student's t-distribution with (n − 2) degrees of freedom.

The standard error SE(β̂1) increases/decreases with less/more spread in the X variable.
Ostap Okhrin 144 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Simple Analysis of Variance (ANOVA)
Assumptions
Average values of the response variable y are induced by one simple factor.
The factor takes on p values.
For each factor level, we have m = n/p observations.
All observations are independent.
Ostap Okhrin 145 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
sample element | factor levels l
      1        | y_11 ... y_1l ... y_1p
      2        |  ...        ...
      k        | y_k1 ... y_kl ... y_kp
     ...       |  ...        ...
   m = n/p     | y_m1 ... y_ml ... y_mp

Table 7: Observation structure of a simple ANOVA.
Ostap Okhrin 146 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Simple ANOVA Model
y_kl = µ_l + ε_kl  for k = 1, ..., m and l = 1, ..., p.  (1)

Note
- Each factor level has a mean value µ_l.
- Observation y_kl equals the sum of µ_l and a zero-mean random error ε_kl.
- Linear regression model: m = 1, p = n and µ_i = α + β x_i, where x_i is the i-th level value of the factor.
Ostap Okhrin 147 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Example: “Classic blue” pullover data
Analyse the effect of three marketing strategies:1. Advertisement in local newspapers2. Presence of sales assistant3. Luxury presentation in shop windows
p = 3 factors, 10 different shops and n = mp = 30 observations
Ostap Okhrin 148 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
shop | marketing strategy (factor l)
  k  |   1    2    3
  1  |   9   10   18
  2  |  11   15   14
  3  |  10   11   17
  4  |  12   15    9
  5  |   7   15   14
  6  |  11   13   17
  7  |  12    7   16
  8  |  10   15   14
  9  |  11   13   17
 10  |  13   10   15

Table 8: Pullover sales as a function of marketing strategy.
Ostap Okhrin 149 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Do all three strategies have the same mean effect?
Test
H0 : µl = µ for l = 1, . . . , p vs. H1 : µl 6= µl ′ for some l and l ′
Alternative: one marketing strategy is better than the others
Ostap Okhrin 150 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Decomposition of sums of squares

∑_{l=1}^p ∑_{k=1}^m (y_kl − ȳ)² = m ∑_{l=1}^p (ȳ_l − ȳ)² + ∑_{l=1}^p ∑_{k=1}^m (y_kl − ȳ_l)²

Total variation (sum of squares = SS)

SS(reduced) = ∑_{l=1}^p ∑_{k=1}^m (y_kl − ȳ)²,  ȳ = n^{-1} ∑_{l=1}^p ∑_{k=1}^m y_kl

Variation under H1

SS(full) = ∑_{l=1}^p ∑_{k=1}^m (y_kl − ȳ_l)²,  ȳ_l = m^{-1} ∑_{k=1}^m y_kl
Ostap Okhrin 151 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
F-test

F = [{SS(reduced) − SS(full)}/{df(r) − df(f)}] / {SS(full)/df(f)}

Degrees of freedom
- Number of observations minus the number of parameters
- Full model: df(f) = n − p
- Reduced model: df(r) = n − 1
Ostap Okhrin 152 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
ANOVA Table

     SS        |  df   |  MS                    |  F-stat                     |  p-value
SS(explained)  | p − 1 | SS(explained)/(p − 1)  | {SS(explained)/(p−1)}/MSE   |  p-value
SS(full)       | n − p | SS(full)/(n − p) = MSE |                             |
SS(reduced)    | n − 1 |                        |                             |

F ∼ F_{p−1,n−p}

Test: reject H0 if F > F_{1−α;p−1,n−p}, or if p-value < α.
Ostap Okhrin 153 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Example: "Classic blue" pullover data

Reduced model: H0: µ_l = µ, l = 1, 2, 3
Full model: H1: the µ_l are different

df(r) = n − #parameters(r) = 30 − 1 = 29
df(f) = n − #parameters(f) = 30 − 3 = 27

SS(reduced) = 260.3
SS(full) = 157.7

F = {(260.3 − 157.7)/(29 − 27)} / (157.7/27) = 8.78 > F_{2,27}(0.95) = 3.35
Ostap Okhrin 154 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
  SS   df    MS    F-stat  p-value
102.6   2  51.30    8.78    0.001
157.7  27   5.84
260.3  29
Ostap Okhrin 155 of 461
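The whole ANOVA table can be recomputed from the sales data in Table 8; a numpy/scipy sketch:

```python
import numpy as np
from scipy import stats

# pullover sales under the three marketing strategies (Table 8); rows = shops
y = np.array([[ 9, 10, 18], [11, 15, 14], [10, 11, 17], [12, 15,  9],
              [ 7, 15, 14], [11, 13, 17], [12,  7, 16], [10, 15, 14],
              [11, 13, 17], [13, 10, 15]], dtype=float)
m, p = y.shape
n = m * p

ybar = y.mean()                                # grand mean
ybar_l = y.mean(axis=0)                        # factor-level means

SS_reduced = np.sum((y - ybar)**2)             # total variation
SS_full = np.sum((y - ybar_l)**2)              # variation under H1
F = ((SS_reduced - SS_full) / (p - 1)) / (SS_full / (n - p))

print(SS_reduced, SS_full, round(F, 2))        # 260.3, 157.7, 8.78
print(F > stats.f.ppf(0.95, p - 1, n - p))     # True: reject H0
```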
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
F-test in a linear regression model

Reduced model: y_i = β0 + 0 · x_i + ε_i

SS(reduced) = ∑_{i=1}^n (y_i − ȳ)²

SS(full) = ∑_{i=1}^n (y_i − ŷ_i)² = RSS

F = [{SS(reduced) − SS(full)}/1] / {SS(full)/(n − 2)}
Ostap Okhrin 156 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Explained Variation

∑_{i=1}^n (ŷ_i − ȳ)² = ∑_{i=1}^n (β̂0 + β̂1 x_i − ȳ)² = ∑_{i=1}^n β̂1² (x_i − x̄)² = β̂1² n s_XX

F = β̂1² n s_XX / {RSS/(n − 2)} = {β̂1/SE(β̂1)}²
Ostap Okhrin 157 of 461
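The identity F = {β̂1/SE(β̂1)}² can be confirmed numerically; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

s_xx = np.mean(x**2) - x.mean()**2
beta1 = (np.mean(x * y) - x.mean() * y.mean()) / s_xx
beta0 = y.mean() - beta1 * x.mean()
RSS = np.sum((y - beta0 - beta1 * x)**2)

sigma2 = RSS / (n - 2)                         # estimate of sigma^2
SE_beta1 = np.sqrt(sigma2 / (n * s_xx))        # SE(beta1) = sigma / (n s_XX)^{1/2}
t = beta1 / SE_beta1                           # t-statistic for beta1 = 0

F = (beta1**2 * n * s_xx) / (RSS / (n - 2))    # explained variation / MSE
print(np.isclose(F, t**2))                     # the F-statistic equals t^2
```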
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Summary: ANOVA
Simple ANOVA models an output Y as a function of one factor.
The reduced model is the hypothesis of equal means.
The full model is the alternative hypothesis of different means.
The F-test is based on a comparison of the sums of squares under the full and the reduced models.
Ostap Okhrin 158 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions Simple Analysis of Variance
Summary: ANOVA
The degrees of freedom are calculated as the number of observations minus the number of parameters.

The F-statistic is

F = [{SS(reduced) − SS(full)}/{df(r) − df(f)}] / {SS(full)/df(f)}.

Reject the null if the F-statistic is larger than the (1 − α)-quantile of the F_{df(r)−df(f),df(f)} distribution.

The F-test statistic for the slope of the linear regression model y_i = β0 + β1 x_i + ε_i is the square of the t-test statistic.
Ostap Okhrin 159 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Multiple Linear Model
y (n × 1), X (n × p), β = (β1, ..., βp)^⊤

Approximate y by a linear combination ŷ of the columns of X.
Find β̂ such that ŷ = X β̂ is the best fit of y = Xβ + ε (errors ε):

β̂ = argmin_β (y − Xβ)^⊤(y − Xβ) = argmin_β ∑_{i=1}^n (y_i − x_i^⊤β)² = (X^⊤X)^{-1} X^⊤ y,

if X^⊤X is of full rank.
Ostap Okhrin 160 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Linear Model with Intercept
y_i = β0 + β1 x_i1 + ... + βp x_ip + ε_i,  i = 1, ..., n

can be written as

y = X*β* + ε

where
X* = (1_n  X)

β* = (β0, β^⊤)^⊤,  β̂* = (X*^⊤X*)^{-1} X*^⊤ y
Ostap Okhrin 161 of 461
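The normal-equations solution β̂* = (X*^⊤X*)^{-1} X*^⊤ y with a prepended column of ones is a few lines of numpy; a sketch on simulated data, cross-checked against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + rng.normal(size=n)

Xstar = np.column_stack([np.ones(n), X])       # X* = (1_n  X): adds the intercept
beta = np.linalg.inv(Xstar.T @ Xstar) @ Xstar.T @ y   # (X*'X*)^{-1} X*' y

# agrees with numpy's least-squares solver
beta_ref, *_ = np.linalg.lstsq(Xstar, y, rcond=None)
print(np.allclose(beta, beta_ref))
```

In practice `lstsq` (or a QR decomposition) is preferred over forming the inverse explicitly; the explicit formula is shown only to mirror the slide.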
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Example: "Classic blue" pullover data
Approximate the sales as a linear function of the three other variables: price (X2), advertisement (X3) and presence of sales assistants (X4).
Adding a column of ones to the data (in order to also estimate the intercept β0) leads to

β̂0 = 65.670, β̂1 = −0.216, β̂2 = 0.485, β̂3 = 0.844.

Coefficient of determination: r² = 0.907
Ostap Okhrin 162 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Remark:
The coefficient of determination is influenced by the number of regressors.
For a given sample size n, the r² value will increase by adding more regressors into the linear model.
A corrected coefficient of determination for p regressors and a constant intercept:

r²_adj = r² − p(1 − r²)/{n − (p + 1)}
Ostap Okhrin 163 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Example: "Classic blue" pullover data
Corrected coefficient of determination:

r²_adj = 0.907 − 3(1 − 0.907)/(10 − 3 − 1) = 0.861.

86.1% of the variation of the response variable is explained by the explanatory variables.
Ostap Okhrin 164 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Simple ANOVA Model
Example: "Classic blue" pullover data

X = [ 1_m 0_m 0_m ]
    [ 0_m 1_m 0_m ]
    [ 0_m 0_m 1_m ]

m = 10, p = 3, n = mp = 30; X (n × p)
β = (µ1, µ2, µ3)^⊤ parameter vector
y = Xβ + ε linear model
Ostap Okhrin 165 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Reduced model (µ1 = µ2 = µ3 = µ)

β̂_H0 = ȳ,  df(r) = n − 1

Full model (µ_i ≠ µ_j)

β̂_H1 = (X^⊤X)^{-1} X^⊤ y,  df(f) = n − 3

SS(reduced) = ∑_{i=1}^n (y_i − ŷ_i)² = ‖y − X β̂_H0‖², with ŷ_i = ȳ under H0

SS(full) = ‖y − X β̂_H1‖²
Ostap Okhrin 166 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Simple ANOVA Model: F-test

F = [{SS(reduced) − SS(full)}/{df(r) − df(f)}] / {SS(full)/df(f)}
  = [{‖y − X β̂_H0‖² − ‖y − X β̂_H1‖²}/{df(r) − df(f)}] / {‖y − X β̂_H1‖²/df(f)}

Comparing the lengths of projections onto different column spaces.
Ostap Okhrin 167 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Summary: Multiple Linear Model
The relation y = Xβ + ε models a linear relation between a one-dimensional variable Y and a p-dimensional variable X. ŷ = Py gives the best linear regression fit of the vector y onto C(X). The least squares parameter estimator is β̂ = (X^⊤X)^{-1} X^⊤ y.
The simple ANOVA model can be written as a linear model.
Ostap Okhrin 168 of 461
Angewandte Multivariate Statistik Moving to Higher Dimensions The Multiple Linear Model
Summary: Multiple Linear Model
The ANOVA model can be tested by comparing the length of theprojection vectors.
The test statistic of the F-test can be written as

[{‖y − X β̂_H0‖² − ‖y − X β̂_H1‖²}/{df(r) − df(f)}] / {‖y − X β̂_H1‖²/df(f)}.
The adjusted coefficient of determination is

r²_adj = r² − p(1 − r²)/{n − (p + 1)}.
Ostap Okhrin 169 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Multivariate Distributions
Random vector X ∈ Rp
(Multivariate) distribution function is
F(x) = P(X ≤ x) = P(X1 ≤ x1, X2 ≤ x2, ..., Xp ≤ xp)

f(x) denotes the density of X, i.e.

F(x) = ∫_{−∞}^x f(u) du,  ∫_{−∞}^∞ f(u) du = 1

P{X ∈ (a, b)} = ∫_a^b f(x) dx
Ostap Okhrin 170 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
X = (X1, X2)^⊤, X1 ∈ R^k, X2 ∈ R^{p−k}

Marginal density of X1:

f_X1(x1) = ∫_{−∞}^∞ f(x1, x2) dx2

Conditional density of X2 (conditioned on X1 = x1):

f_{X2|X1=x1}(x2) = f(x1, x2)/f_X1(x1)
Ostap Okhrin 171 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Example

f(x1, x2) = (1/2)x1 + (3/2)x2 for 0 ≤ x1, x2 ≤ 1, and 0 otherwise.

f(x1, x2) is a density since

∫∫ f(x1, x2) dx1 dx2 = (1/2)[x1²/2]_0^1 + (3/2)[x2²/2]_0^1 = 1/4 + 3/4 = 1.
Ostap Okhrin 172 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
The marginal densities:

f_X1(x1) = ∫ f(x1, x2) dx2 = ∫_0^1 {(1/2)x1 + (3/2)x2} dx2 = (1/2)x1 + 3/4;

f_X2(x2) = ∫ f(x1, x2) dx1 = ∫_0^1 {(1/2)x1 + (3/2)x2} dx1 = (3/2)x2 + 1/4.

The conditional densities:

f(x2 | x1) = {(1/2)x1 + (3/2)x2} / {(1/2)x1 + 3/4}  and  f(x1 | x2) = {(1/2)x1 + (3/2)x2} / {(3/2)x2 + 1/4}.

These conditional pdf's are nonlinear in x1 and x2 although the joint pdf has a simple (linear) structure.
Ostap Okhrin 173 of 461
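The marginal and conditional densities of this example can be checked by numerical integration; a scipy sketch:

```python
from scipy.integrate import quad

f = lambda x1, x2: 0.5 * x1 + 1.5 * x2          # joint density on [0,1]^2

# marginal of X1: integrate the joint density over x2
fX1 = lambda x1: quad(lambda x2: f(x1, x2), 0, 1)[0]
print(abs(fX1(0.3) - (0.5 * 0.3 + 0.75)) < 1e-10)   # f_X1(x1) = x1/2 + 3/4

# the conditional density of X2 given X1 = x1 integrates to one
x1 = 0.3
mass = quad(lambda x2: f(x1, x2) / fX1(x1), 0, 1)[0]
print(abs(mass - 1.0) < 1e-10)
```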
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Definition of independence
X1, X2 are independent iff
f (x) = f (x1, x2) = fX1(x1)fX2(x2)
Two random variables may have identical marginals but differentjoint distribution.
Ostap Okhrin 174 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Example
f(x1, x2) = 1, 0 < x1, x2 < 1,

f(x1, x2) = 1 + α(2x1 − 1)(2x2 − 1), 0 < x1, x2 < 1, −1 ≤ α ≤ 1.

Both joint densities have the same marginals f_X1(x1) = 1 and f_X2(x2) = 1, since

∫_0^1 {1 + α(2x1 − 1)(2x2 − 1)} dx2 = 1 + α(2x1 − 1)[x2² − x2]_0^1 = 1.
Ostap Okhrin 175 of 461
[Figure: Univariate estimates of the density of X4 (left) and X5 (right) of the bank notes. MVAdenbank2]
[Figure: Product of univariate density estimates for X4 and X5 of the bank notes. MVAdenbank3]
[Figure: Joint density estimate for X4 and X5 of the bank notes. MVAdenbank3]
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Summary: Distributions
The cumulative distribution function (cdf) is F(x) = P(X ≤ x).
If a probability density function (pdf) f exists then

F(x) = ∫_{−∞}^x f(u) du.

Let X = (X1, X2)^⊤ be partitioned in subvectors X1 and X2 with joint cdf F. Then F_X1(x1) = P(X1 ≤ x1) is the marginal cdf of X1. The marginal pdf of X1 is f_X1(x1) = ∫_{−∞}^∞ f(x1, x2) dx2.
Ostap Okhrin 179 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multivariate Distributions
Summary: Distributions
Different joint pdf's may have the same marginal pdf's.

The conditional pdf of X2 given X1 = x1 is f(x2 | x1) = f(x1, x2)/f_X1(x1).

Two random variables X1, X2 are called independent iff f(x1, x2) = f_X1(x1) f_X2(x2). This is equivalent to f(x2 | x1) = f_X2(x2).
Ostap Okhrin 180 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Moments and Characteristic Functions
EX ∈ R^p denotes the p-dimensional vector of expected values of the random vector X:

EX = (EX1, ..., EXp)^⊤ = ∫ x f(x) dx = ( ∫ x1 f(x) dx, ..., ∫ xp f(x) dx )^⊤ = µ.

The properties of the expected value follow from the properties of the integral:

E(αX + βY) = α EX + β EY
Ostap Okhrin 181 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
If X and Y are independent then

E(XY^⊤) = ∫∫ x y^⊤ f(x) f(y) dx dy = ∫ x f(x) dx ∫ y^⊤ f(y) dy = EX EY^⊤

Definition of the covariance matrix (Σ):

Σ = Var(X) = E{(X − µ)(X − µ)^⊤}

We say that a random vector X has a distribution with the vector of expected values µ and the covariance matrix Σ:

X ∼ (µ, Σ)
Ostap Okhrin 182 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Properties of the Covariance Matrix
Elements of Σ are the variances and covariances of the components of the random vector X:

Σ = (σ_XiXj),  σ_XiXj = Cov(Xi, Xj),  σ_XiXi = Var(Xi)

Computational formula: Σ = E(XX^⊤) − µµ^⊤
The covariance matrix is positive semidefinite, Σ ≥ 0
(the variance a^⊤Σa of any linear combination a^⊤X cannot be negative).
Ostap Okhrin 183 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Properties of Variances and Covariances
Var(a^⊤X) = a^⊤ Var(X) a = ∑_{i,j} a_i a_j σ_XiXj

Var(AX + b) = A Var(X) A^⊤

Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)

Var(X + Y) = Var(X) + Cov(X, Y) + Cov(Y, X) + Var(Y)

Cov(AX, BY) = A Cov(X, Y) B^⊤
Ostap Okhrin 184 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Example

f(x1, x2) = (1/2)x1 + (3/2)x2 for 0 ≤ x1, x2 ≤ 1, and 0 otherwise.

The conditional densities:

f(x2 | x1) = {(1/2)x1 + (3/2)x2} / {(1/2)x1 + 3/4}  and  f(x1 | x2) = {(1/2)x1 + (3/2)x2} / {(3/2)x2 + 1/4}.
Ostap Okhrin 185 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
µ1 = ∫∫ x1 f(x1, x2) dx1 dx2 = ∫_0^1 ∫_0^1 x1 {(1/2)x1 + (3/2)x2} dx1 dx2
   = ∫_0^1 x1 {(1/2)x1 + 3/4} dx1 = (1/2)[x1³/3]_0^1 + (3/4)[x1²/2]_0^1
   = 1/6 + 3/8 = (4 + 9)/24 = 13/24,

µ2 = ∫∫ x2 f(x1, x2) dx1 dx2 = ∫_0^1 ∫_0^1 x2 {(1/2)x1 + (3/2)x2} dx1 dx2
   = ∫_0^1 x2 {1/4 + (3/2)x2} dx2 = (1/4)[x2²/2]_0^1 + (3/2)[x2³/3]_0^1
   = 1/8 + 1/2 = (1 + 4)/8 = 5/8.
Ostap Okhrin 186 of 461
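The exact values 13/24 and 5/8 can be confirmed by numerical double integration; a scipy sketch:

```python
from scipy.integrate import dblquad

f = lambda x1, x2: 0.5 * x1 + 1.5 * x2          # joint density on [0,1]^2

# dblquad integrates func(y, x); here y = x2 (inner) and x = x1 (outer)
mu1 = dblquad(lambda x2, x1: x1 * f(x1, x2), 0, 1, lambda _: 0, lambda _: 1)[0]
mu2 = dblquad(lambda x2, x1: x2 * f(x1, x2), 0, 1, lambda _: 0, lambda _: 1)[0]

print(abs(mu1 - 13/24) < 1e-10, abs(mu2 - 5/8) < 1e-10)
```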
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Covariance Matrix

σ_X1X1 = EX1² − µ1² with

EX1² = ∫_0^1 ∫_0^1 x1² {(1/2)x1 + (3/2)x2} dx1 dx2 = (1/2)[x1⁴/4]_0^1 + (3/4)[x1³/3]_0^1 = 3/8

σ_X2X2 = EX2² − µ2² with

EX2² = ∫_0^1 ∫_0^1 x2² {(1/2)x1 + (3/2)x2} dx1 dx2 = (1/4)[x2³/3]_0^1 + (3/2)[x2⁴/4]_0^1 = 11/24
Ostap Okhrin 187 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
σ_X1X2 = E(X1X2) − µ1µ2 with

E(X1X2) = ∫_0^1 ∫_0^1 x1 x2 {(1/2)x1 + (3/2)x2} dx1 dx2
        = ∫_0^1 {(1/6)x2 + (3/4)x2²} dx2
        = (1/6)[x2²/2]_0^1 + (3/4)[x2³/3]_0^1 = 1/12 + 1/4 = 1/3.

Since µ1µ2 = (13/24)(5/8) = 65/192 > 1/3, the covariance σ_X1X2 = 1/3 − 65/192 = −1/192 is negative:

Σ = [  0.0815 −0.0052 ]
    [ −0.0052  0.0677 ]
Ostap Okhrin 188 of 461
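The entries of Σ (exactly 47/576, 13/192 and −1/192) can be double-checked by numerical integration; a scipy sketch:

```python
from scipy.integrate import dblquad

f = lambda x1, x2: 0.5 * x1 + 1.5 * x2          # joint density on [0,1]^2

# E g(X1, X2) under the density f, via double integration
E = lambda g: dblquad(lambda x2, x1: g(x1, x2) * f(x1, x2),
                      0, 1, lambda _: 0, lambda _: 1)[0]

mu1, mu2 = E(lambda a, b: a), E(lambda a, b: b)
s11 = E(lambda a, b: a * a) - mu1**2            # 3/8 - (13/24)^2 = 47/576
s22 = E(lambda a, b: b * b) - mu2**2            # 11/24 - (5/8)^2 = 13/192
s12 = E(lambda a, b: a * b) - mu1 * mu2         # 1/3 - (13/24)(5/8) = -1/192

print(round(s11, 4), round(s22, 4), round(s12, 4))
```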
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Conditional Expectations
Random vector X = (X1, X2)^⊤, X1 ∈ R^k, X2 ∈ R^{p−k}

Conditional expectation of X2, given X1 = x1:

E(X2 | x1) = ∫ x2 f(x2 | x1) dx2

and conditional expectation of X1, given X2 = x2:

E(X1 | x2) = ∫ x1 f(x1 | x2) dx1

The conditional expectation E(X2 | x1) is a function of x1. A typical example of this setup is the linear regression, where E(Y | X = x) = x^⊤β.
Ostap Okhrin 189 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Error term in approximation:
U = X2 − E(X2 | X1)

(1) E(U) = 0
(2) E(X2 | X1) is the best approximation of X2 by a function h(X1) of X1 in the sense of mean squared error (MSE), where MSE(h) = E[{X2 − h(X1)}^⊤{X2 − h(X1)}] and h: R^k → R^{p−k}.
Ostap Okhrin 190 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Summary: Moments
The expectation of a random vector X is µ = ∫ x f(x) dx, the covariance matrix is Σ = Var(X) = E{(X − µ)(X − µ)^⊤}. We denote X ∼ (µ, Σ).

Expectations are linear, i.e. E(αX + βY) = α EX + β EY. If X, Y are independent then E(XY^⊤) = EX EY^⊤.
Ostap Okhrin 191 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Summary: Moments
The covariance between two random vectors X, Y is Σ_XY = Cov(X, Y) = E{(X − EX)(Y − EY)^⊤} = E(XY^⊤) − EX EY^⊤. If X, Y are independent then Cov(X, Y) = 0.

The conditional expectation E(X2 | X1) is the MSE-best approximation of X2 by a function of X1.
Ostap Okhrin 192 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Characteristic Functions
The characteristic function (cf) of a random vector X ∈ R^p is defined as

ϕ_X(t) = E(e^{it^⊤X}) = ∫ e^{it^⊤x} f(x) dx,  t ∈ R^p,

where i is the complex unit: i² = −1.
Ostap Okhrin 193 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Properties of the cf: ϕ_X(0) = 1, |ϕ_X(t)| ≤ 1.

If ϕ is absolutely integrable (∫_{−∞}^∞ |ϕ(x)| dx exists and is finite), then

f(x) = {1/(2π)^p} ∫_{−∞}^∞ e^{−it^⊤x} ϕ_X(t) dt.

If X = (X1, X2, ..., Xp)^⊤, then for t = (t1, t2, ..., tp)^⊤:

ϕ_X1(t1) = ϕ_X(t1, 0, ..., 0), ..., ϕ_Xp(tp) = ϕ_X(0, ..., 0, tp).
Ostap Okhrin 194 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
For X1, ..., Xp independent RVs and t = (t1, t2, ..., tp)^⊤:

ϕ_X(t) = ∏_{j=1}^p ϕ_Xj(tj).

For X1, ..., Xp independent RVs and t ∈ R:

ϕ_{X1+...+Xp}(t) = ∏_{j=1}^p ϕ_Xj(t).

The characteristic function allows us to recover all the cross-product moments of any order: for all j_k ≥ 0, k = 1, ..., p, and t = (t1, ..., tp)^⊤ we have

E(X1^{j1} ··· Xp^{jp}) = {1/i^{j1+...+jp}} [∂^{j1+...+jp} ϕ_X(t) / (∂t1^{j1} ··· ∂tp^{jp})]_{t=0}.
Ostap Okhrin 195 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
X ∈ R1 follows the standard normal distribution
fX (x) =1√2π
exp(−x2
2
)
ϕX (t) =1√2π
∫ ∞−∞
e itx exp(−x2
2
)dx
= exp(− t2
2
) ∫ ∞−∞
1√2π
exp−(x − it)2
2
dx
= exp(− t2
2
),
since i2 = −1 and∫ 1√
2πexp− (x−it)2
2
dx = 1.
Ostap Okhrin 196 of 461
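As a quick sanity check of the derivation above, one can approximate the cf by Monte Carlo: E(e^{itX}) is estimated by the sample mean of e^{itX_i}. This is a minimal sketch, not from the slides; all variable names are illustrative.

```python
import numpy as np

# Monte Carlo check that the cf of a standard normal equals exp(-t^2/2).
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

t = 1.3
ecf = np.mean(np.exp(1j * t * x))     # empirical cf: mean of e^{itX}
theory = np.exp(-t**2 / 2)            # exp(-t^2/2), real-valued

err = abs(ecf - theory)
```

The imaginary part of the empirical cf is close to zero, as it must be for a symmetric distribution.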
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Theorem (Cramér–Wold)
The distribution of X ∈ R^p is completely determined by the set of all (one-dimensional) distributions of t>X, t ∈ R^p.
This theorem says that we can determine the distribution of X in R^p by specifying all one-dimensional distributions of the linear combinations
∑_{j=1}^p tj Xj = t>X, t = (t1, t2, . . . , tp)>.
Ostap Okhrin 197 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Summary: Characteristic Functions
- The characteristic function (cf) of a random vector X is ϕX(t) = E(e^{it>X}).
- The distribution of a p-dimensional random variable X is completely determined by all one-dimensional distributions of t>X, t ∈ R^p (Theorem of Cramér–Wold).
Ostap Okhrin 198 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Cumulants
For a random variable X with density f and finite moments of order k, the cumulants are obtained from the derivatives of log ϕX(t) at zero:
κj = (1/i^j) [∂^j log ϕX(t)/∂t^j]_{t=0}, j = 1, . . . , k.
The values κj are called cumulants or semi-invariants, since κj does not change (for j > 1) under a shift transformation X ↦ X + a. The cumulants are natural parameters for dimension-reduction methods, in particular the Projection Pursuit method.
Ostap Okhrin 199 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
The relation between the first k moments m1, . . . , mk and the cumulants is given by
\[
\kappa_k = (-1)^{k-1}
\begin{vmatrix}
m_1 & 1 & \cdots & 0 \\
m_2 & \binom{1}{0}m_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
m_k & \binom{k-1}{0}m_{k-1} & \cdots & \binom{k-1}{k-2}m_1
\end{vmatrix}.
\]
Ostap Okhrin 200 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Suppose that k = 1; then κ1 = m1.
For k = 2 we obtain
\[
\kappa_2 = -\begin{vmatrix} m_1 & 1 \\ m_2 & \binom{1}{0}m_1 \end{vmatrix} = m_2 - m_1^2.
\]
Ostap Okhrin 201 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
For k = 3 we have to calculate
\[
\kappa_3 = \begin{vmatrix} m_1 & 1 & 0 \\ m_2 & m_1 & 1 \\ m_3 & m_2 & 2m_1 \end{vmatrix}.
\]
Expanding this determinant along the first column we arrive at
\[
\kappa_3 = m_1 \begin{vmatrix} m_1 & 1 \\ m_2 & 2m_1 \end{vmatrix}
- m_2 \begin{vmatrix} 1 & 0 \\ m_2 & 2m_1 \end{vmatrix}
+ m_3 \begin{vmatrix} 1 & 0 \\ m_1 & 1 \end{vmatrix}
= m_1(2m_1^2 - m_2) - m_2(2m_1) + m_3
= m_3 - 3m_1 m_2 + 2m_1^3.
\]
In a similar way one calculates
κ4 = m4 − 4m3m1 − 3m2² + 12m2m1² − 6m1⁴.
Ostap Okhrin 202 of 461
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
In a similar fashion we find the moments from the cumulants:
m1 = κ1
m2 = κ2 + κ1²
m3 = κ3 + 3κ2κ1 + κ1³
m4 = κ4 + 4κ3κ1 + 3κ2² + 6κ2κ1² + κ1⁴
A very simple relationship can be observed between the semi-invariants and the central moments µk = E(X − µ)^k, where µ = m1 as defined before: κ2 = µ2, κ3 = µ3, κ4 = µ4 − 3µ2².
Ostap Okhrin 203 of 461
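The moment–cumulant relations above can be checked on a distribution whose moments are known in closed form. A minimal sketch (not from the slides): for the Exp(1) distribution the raw moments are m_k = k! and the cumulants are κ_k = (k − 1)!.

```python
# Verify the moment <-> cumulant relations for Exp(1): m_k = k!, kappa_k = (k-1)!.
m1, m2, m3, m4 = 1.0, 2.0, 6.0, 24.0

# cumulants from moments
k1 = m1
k2 = m2 - m1**2
k3 = m3 - 3*m1*m2 + 2*m1**3
k4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4

# inverse direction: moments from cumulants
m2_back = k2 + k1**2
m3_back = k3 + 3*k2*k1 + k1**3
m4_back = k4 + 4*k3*k1 + 3*k2**2 + 6*k2*k1**2 + k1**4
```

Both directions reproduce the known values κ = (1, 1, 2, 6) and m = (1, 2, 6, 24).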
Angewandte Multivariate Statistik Multivariate Distributions Moments and Characteristic Functions
Skewness γ3 and kurtosis γ4 are defined as
γ3 = E(X − µ)³/σ³
γ4 = E(X − µ)⁴/σ⁴.
The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness of a normal distribution is 0 and its kurtosis equals 3. The relation of these parameters to the cumulants is given by
γ3 = κ3/κ2^{3/2}
γ4 = κ4/κ2² + 3.
Ostap Okhrin 204 of 461
Angewandte Multivariate Statistik Multivariate Distributions Transformations
Transformations
Suppose X ∼ fX. What is the pdf of Y = 3X?
Let X = u(Y) for a one-to-one transformation u: R^p → R^p with Jacobian
J = (∂xi/∂yj) = (∂ui(y)/∂yj).
Then
fY(y) = abs(|J|) fX{u(y)}.
Ostap Okhrin 205 of 461
Angewandte Multivariate Statistik Multivariate Distributions Transformations
Example
(x1, . . . , xp)> = u(y1, . . . , yp)
Y = 3X → X = (1/3)Y = u(Y)
J = diag(1/3, . . . , 1/3)
abs(|J|) = (1/3)^p
Ostap Okhrin 206 of 461
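For p = 1 the example above says fY(y) = (1/3) fX(y/3). A minimal sketch (not from the slides) checks this against the directly known density of Y = 3X for X ∼ N(0, 1), namely N(0, 9):

```python
import numpy as np
from scipy.stats import norm

# For X ~ N(0,1) and Y = 3X: f_Y(y) = (1/3) f_X(y/3) must equal the N(0,9) pdf.
y = np.linspace(-6.0, 6.0, 201)
lhs = norm.pdf(y, loc=0, scale=3)   # density of Y = 3X directly
rhs = norm.pdf(y / 3) / 3           # abs(|J|) f_X{u(y)} with abs(|J|) = 1/3
max_gap = np.max(np.abs(lhs - rhs))
```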
Angewandte Multivariate Statistik Multivariate Distributions Transformations
Y = AX + b, A nonsingular
X = A^{−1}(Y − b)
J = A^{−1}
fY(y) = abs(|A|^{−1}) fX{A^{−1}(y − b)}
Ostap Okhrin 207 of 461
Angewandte Multivariate Statistik Multivariate Distributions Transformations
X = (X1, X2)> ∈ R² with density fX(x) = fX(x1, x2),
A = (1 1; 1 −1), b = (0, 0)>.
Y = AX + b = (X1 + X2, X1 − X2)>
|A| = −2, abs(|A|^{−1}) = 1/2, A^{−1} = (1/2)(1 1; 1 −1).
fY(y) = (1/2) fX{(y1 + y2)/2, (y1 − y2)/2}.
Ostap Okhrin 208 of 461
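The two-by-two example can be verified numerically. A minimal sketch (not from the slides), taking fX to be the standard bivariate normal so that Y = AX ∼ N(0, AA>):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Check f_Y(y) = (1/2) f_X{(y1+y2)/2, (y1-y2)/2} against the N(0, AA^T) density.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
det_A = np.linalg.det(A)                          # -2, as on the slide
A_inv = np.linalg.inv(A)                          # (1/2)(1 1; 1 -1)

fX = multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf
fY = multivariate_normal(mean=[0, 0], cov=A @ A.T).pdf  # direct density of Y

y = np.array([0.7, -1.2])
via_formula = 0.5 * fX([(y[0] + y[1]) / 2, (y[0] - y[1]) / 2])
gap = abs(via_formula - fY(y))
```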
Angewandte Multivariate Statistik Multivariate Distributions Transformations
Summary: Transformations
- If X has pdf fX(x), then the transformed random vector Y with X = u(Y) has pdf fY(y) = abs(|J|) · fX{u(y)}, where J denotes the Jacobian J = (∂ui(y)/∂yj).
- In the case of a linear relation Y = AX + b, the pdfs of X and Y are related via fY(y) = abs(|A|^{−1}) fX{A^{−1}(y − b)}.
Ostap Okhrin 209 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multinormal Distribution
Multinormal Distribution
The pdf of a multinormal is (assuming that Σ has full rank):
f(x) = |2πΣ|^{−1/2} exp{−(1/2)(x − µ)>Σ^{−1}(x − µ)},   X ∼ Np(µ,Σ).
The expected value is EX = µ; the covariance matrix of X is Var(X) = Σ > 0.
(What is the meaning of the quadratic form (x − µ)>Σ^{−1}(x − µ) in the formula for the density?)
Ostap Okhrin 210 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multinormal Distribution
Geometry of the Np(µ,Σ) Distribution
The density of Np(µ,Σ) is constant on ellipsoids of the form
(x − µ)>Σ^{−1}(x − µ) = d².
If X ∼ Np(µ,Σ), then the variable Y = (X − µ)>Σ^{−1}(X − µ) is χ²_p distributed, since the Mahalanobis transformation yields Z = Σ^{−1/2}(X − µ) ∼ Np(0, Ip) and Y = Z>Z = ∑_{j=1}^p Zj².
Ostap Okhrin 211 of 461
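The χ²_p claim is easy to check by simulation. A minimal sketch (not from the slides): since E(χ²_2) = 2 and Var(χ²_2) = 4, the squared Mahalanobis distances of N2(µ,Σ) draws should match these moments.

```python
import numpy as np

# Monte Carlo: (X-mu)^T Sigma^{-1} (X-mu) should behave like chi^2_2 for p = 2.
rng = np.random.default_rng(42)
mu = np.array([3.0, 2.0])
Sigma = np.array([[1.0, -1.5], [-1.5, 4.0]])

X = rng.multivariate_normal(mu, Sigma, size=100_000)
Z = (X - mu) @ np.linalg.inv(Sigma)
d2 = np.sum(Z * (X - mu), axis=1)   # squared Mahalanobis distances

mean_d2 = d2.mean()                 # should be close to p = 2
var_d2 = d2.var()                   # chi^2_2 variance is 2p = 4
```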
Scatterplot of a normal sample and contour ellipses for µ = (3, 2)> and Σ = (1.0 −1.5; −1.5 4.0). MVAcontnorm
Angewandte Multivariate Statistik Multivariate Distributions Multinormal Distribution
Singular Normal Distribution
Definition of the “normal” distribution in case the matrix Σ is singular: we use its eigenvalues λi and the generalized inverse Σ−. With rank(Σ) = k < p and eigenvalues λ1, . . . , λk:
f(x) = (2π)^{−k/2} (λ1 · · · λk)^{−1/2} exp{−(1/2)(x − µ)>Σ−(x − µ)},
where Σ− is a G-inverse of Σ.
Ostap Okhrin 213 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multinormal Distribution
Summary: Multinormal Distribution
- The pdf of a p-dimensional multinormal X ∼ Np(µ,Σ) is
f(x) = |2πΣ|^{−1/2} exp{−(1/2)(x − µ)>Σ^{−1}(x − µ)}.
- The contour curves of a multinormal are ellipsoids with half-lengths proportional to √λi, where λi, i = 1, . . . , p, denote the eigenvalues of Σ.
- The Mahalanobis transformation transforms X ∼ Np(µ,Σ) to Y = Σ^{−1/2}(X − µ) ∼ Np(0, Ip). Vice versa, one can create X ∼ Np(µ,Σ) from Y ∼ Np(0, Ip) via X = Σ^{1/2}Y + µ.
Ostap Okhrin 214 of 461
Angewandte Multivariate Statistik Multivariate Distributions Multinormal Distribution
Summary: Multinormal Distribution
- If the covariance matrix Σ is singular (i.e., rank(Σ) < p), then it defines a singular normal distribution.
- The density of a singular normal distribution is given by
f(x) = (2π)^{−k/2} (λ1 · · · λk)^{−1/2} exp{−(1/2)(x − µ)>Σ−(x − µ)},
where Σ− denotes the G-inverse of Σ.
Ostap Okhrin 215 of 461
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Limit Theorems
The Central Limit Theorem describes the (asymptotic) behaviour of the sample mean. For X1, X2, . . . , Xn i.i.d. with Xi ∼ (µ,Σ):
√n(x̄ − µ) →L Np(0,Σ) for n → ∞.
The CLT can easily be applied for testing. The normal distribution plays a central role in statistics.
Ostap Okhrin 216 of 461
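The Bernoulli illustration that follows can be reproduced numerically. A minimal sketch (not from the slides): the standardized mean of Bernoulli(p) samples should already look N(0, 1) for moderate n.

```python
import numpy as np

# CLT for Bernoulli(p): standardized sample means are approximately N(0,1).
rng = np.random.default_rng(1)
p, n, reps = 0.5, 35, 20_000

x = rng.binomial(1, p, size=(reps, n)).mean(axis=1)   # sample means
z = np.sqrt(n) * (x - p) / np.sqrt(p * (1 - p))       # standardized means

m, s = z.mean(), z.std()
```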
The CLT for Bernoulli distributed random variables. Sample size n = 5 (left) and n = 35 (right). MVAcltbern
The CLT in the two-dimensional case. Sample size n = 5 (left) and n = 85 (right). MVAcltbern2
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Let Σ̂ be a consistent estimator of Σ: Σ̂ →P Σ. Then x̄ is asymptotically normal:
√n Σ̂^{−1/2}(x̄ − µ) →L Np(0, Ip) as n → ∞.
Confidence interval for the (univariate) mean, Xi ∼ N(µ, σ²):
√n (x̄ − µ)/σ̂ →L N(0, 1) as n → ∞.
Ostap Okhrin 219 of 461
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Define u_{1−α/2} as the (1 − α/2) quantile of the N(0, 1) distribution. Then we get the following (1 − α) confidence interval:
C_{1−α} = [x̄ − (σ̂/√n) u_{1−α/2}, x̄ + (σ̂/√n) u_{1−α/2}]
P(µ ∈ C_{1−α}) → 1 − α for n → ∞.
Ostap Okhrin 220 of 461
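The stated coverage can be checked by simulation. A minimal sketch (not from the slides), with σ = 1 treated as known for simplicity:

```python
import numpy as np

# Monte Carlo coverage of the CLT interval [xbar -+ sigma/sqrt(n) * u_{0.975}]
# for the mean of N(0,1) data; the true mean is 0.
rng = np.random.default_rng(7)
n, reps, u = 50, 2_000, 1.959963984540054   # u_{0.975} quantile of N(0,1)

x = rng.standard_normal((reps, n))
xbar = x.mean(axis=1)
half = u / np.sqrt(n)                        # sigma = 1 assumed known here
covered = np.mean((xbar - half <= 0) & (0 <= xbar + half))
```

The empirical coverage should be close to 0.95.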
The standard normal cdf and the empirical distribution function for n = 100. MVAedfnormal
The standard normal cdf and the empirical distribution function for n = 1000. MVAedfnormal
The edf Fn and two bootstrap edfs F*n, n = 100. MVAedfbootstrap
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Bootstrap confidence intervals
Empirical distribution function (edf): Fn(x) = n^{−1} ∑_{i=1}^n I(xi ≤ x)
Xi ∼ F, X*i ∼ Fn, x̄* = mean of the bootstrap sample
sup_u | P*{√n(x̄* − x̄)/σ̂* < u} − P{√n(x̄ − µ)/σ̂ < u} | →a.s. 0
Construction of confidence intervals is possible: the unknown distribution of x̄ can be approximated by the known distribution of x̄*.
Ostap Okhrin 224 of 461
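The resampling scheme X*i ∼ Fn amounts to drawing indices with replacement. A minimal sketch (not from the slides) of a percentile-bootstrap confidence interval for the mean:

```python
import numpy as np

# Percentile bootstrap CI for the mean: resample indices with replacement
# (i.e., draw X*_i from the edf F_n) and take quantiles of the xbar* values.
rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100)   # one observed sample
B = 2_000                                  # number of bootstrap replicates

idx = rng.integers(0, x.size, size=(B, x.size))
boot_means = x[idx].mean(axis=1)           # xbar* for each bootstrap sample

lo, hi = np.quantile(boot_means, [0.025, 0.975])
```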
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Transformation of Statistics
If √n(t − µ) →L Np(0,Σ) and f = (f1, . . . , fq)>: R^p → R^q are real-valued functions which are differentiable at µ ∈ R^p, then f(t) is asymptotically normal with mean f(µ) and covariance D>ΣD, i.e.,
√n{f(t) − f(µ)} →L Nq(0, D>ΣD) for n → ∞,
where
D = (∂fj/∂ti)|_{t=µ}
is the (p × q) matrix of all partial derivatives.
This theorem can be applied, e.g., to find a “variance stabilizing” transformation.
Ostap Okhrin 225 of 461
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Example
Suppose Xi, i = 1, . . . , n, ∼ (µ,Σ); µ = (0, 0)>, Σ = (1 0.5; 0.5 1), p = 2.
By the CLT, for n → ∞, √n(x̄ − µ) →L N(0,Σ).
What is the distribution of (x̄1² − x̄2, x̄1 + 3x̄2)>?
This means to consider f = (f1, f2)> with
f1(x1, x2) = x1² − x2, f2(x1, x2) = x1 + 3x2, q = 2.
Ostap Okhrin 226 of 461
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Then f(µ) = (0, 0)> and
D = (dij), dij = (∂fj/∂xi)|_{x=µ} = (2x1 1; −1 3)|_{x=0} = (0 1; −1 3).
We have the covariance
D>ΣD = (0 −1; 1 3)(1 0.5; 0.5 1)(0 1; −1 3) = (1 −7/2; −7/2 13).
This yields
This yields
√n (x̄1² − x̄2, x̄1 + 3x̄2)> →L N2((0, 0)>, (1 −7/2; −7/2 13)).
Ostap Okhrin 227 of 461
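The matrix product D>ΣD from the example is small enough to verify directly. A minimal sketch (not from the slides):

```python
import numpy as np

# Delta-method covariance D^T Sigma D from the example, with D evaluated at mu = 0.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
D = np.array([[0.0, 1.0],      # row i = partial derivatives d f_j / d x_i
              [-1.0, 3.0]])

cov = D.T @ Sigma @ D
expected = np.array([[1.0, -3.5], [-3.5, 13.0]])
```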
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Summary: Limit Theorems
- If X1, . . . , Xn are i.i.d. random vectors with Xi ∼ (µ,Σ), then the distribution of √n(x̄ − µ) is asymptotically N(0,Σ) (Central Limit Theorem).
- If X1, . . . , Xn are i.i.d. random variables with Xi ∼ (µ, σ²), then an asymptotic confidence interval can be constructed via the CLT: x̄ ± (σ̂/√n) u_{1−α/2}.
Ostap Okhrin 228 of 461
Angewandte Multivariate Statistik Multivariate Distributions Limit Theorems
Summary: Limit Theorems
- For small sample sizes the bootstrap improves the precision of this confidence interval.
- The bootstrap estimates x̄* have the same asymptotic limit.
- If t is a statistic that is asymptotically normal, i.e., √n(t − µ) →L Np(0,Σ), then this also holds for a function f(t), i.e., √n{f(t) − f(µ)} is asymptotically normal.
Ostap Okhrin 229 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Heavy-Tailed Distributions
- Introduced by Pareto, studied by Paul Lévy
- Applications: finance, medicine, seismology, engineering
- asset returns in financial markets
- stream flow in hydrology
- insurance
- precipitation and hurricane damage in meteorology
- earthquake prediction in seismology
- pollution
- material strength
Ostap Okhrin 230 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Definition
A distribution is called heavy-tailed if it has higher probability density in its tail area than a normal distribution with the same mean µ and variance σ².
Ostap Okhrin 231 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 1: Comparison of the pdf of a standard Gaussian (blue) and a Cauchy distribution (red) with location parameter 0 and scale parameter 1. MVAgausscauchy
Ostap Okhrin 232 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Kurtosis
In terms of kurtosis, a heavy-tailed distribution has kurtosis greater than 3 and is called leptokurtic, in contrast to a mesokurtic distribution (kurtosis = 3) and a platykurtic distribution (kurtosis < 3).
Ostap Okhrin 233 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Generalised Hyperbolic Distribution
Introduced by Barndorff-Nielsen and first applied to model the grain-size distribution of wind-blown sand.
Applications: stock price modelling, market risk measurement.
Ostap Okhrin 234 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF of GH Distribution
The density of a one-dimensional generalised hyperbolic (GH) distribution for x ∈ R is
f_GH(x; λ, α, β, δ, µ) = {√(α² − β²)/δ}^λ / {√(2π) K_λ(δ√(α² − β²))} · K_{λ−1/2}{α√(δ² + (x − µ)²)} / {√(δ² + (x − µ)²)/α}^{1/2−λ} · e^{β(x−µ)},
where K_λ is a modified Bessel function of the third kind with index λ:
K_λ(x) = (1/2) ∫_0^∞ y^{λ−1} e^{−(x/2)(y + y^{−1})} dy.
Ostap Okhrin 235 of 461
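As a consistency check of the density formula, it should integrate to one. A minimal sketch (not from the slides), using `scipy.special.kv` for K_λ and the parameter set λ = 1, α = 1, β = 0, δ = 1, µ = 0 (a hyperbolic distribution):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv   # modified Bessel function K_lambda

# GH density implemented directly from the slide's formula.
def f_gh(x, lam, alpha, beta, delta, mu):
    gamma = np.sqrt(alpha**2 - beta**2)
    q = np.sqrt(delta**2 + (x - mu)**2)
    const = (gamma / delta)**lam / (np.sqrt(2 * np.pi) * kv(lam, delta * gamma))
    return const * kv(lam - 0.5, alpha * q) * (q / alpha)**(lam - 0.5) \
        * np.exp(beta * (x - mu))

total, _ = quad(lambda x: f_gh(x, 1.0, 1.0, 0.0, 1.0, 0.0), -np.inf, np.inf)
```

For this parameter set K_{1/2} has a closed form, and the density reduces to e^{−√(1+x²)}/{2K_1(1)}.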
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Parameters
The domain of variation of the parameters is µ ∈ R and
δ ≥ 0, |β| < α, if λ > 0,
δ > 0, |β| < α, if λ = 0,
δ > 0, |β| ≤ α, if λ < 0,
where µ is the location and δ the scale parameter.
Ostap Okhrin 236 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean and Variance of GH Distribution
With ζ = δ√(α² − β²):
E[X] = µ + {δβ/√(α² − β²)} · K_{λ+1}(ζ)/K_λ(ζ)
Var[X] = δ² [ K_{λ+1}(ζ)/{ζ K_λ(ζ)} + {β²/(α² − β²)} { K_{λ+2}(ζ)/K_λ(ζ) − (K_{λ+1}(ζ)/K_λ(ζ))² } ]
Ostap Okhrin 237 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Hyperbolic and Normal-Inverse GaussianDistributions
With specific values of λ we obtain different sub-classes of GH.
For λ = 1 we obtain the hyperbolic distribution (HYP):
f_HYP(x; α, β, δ, µ) = √(α² − β²) / {2αδ K_1(δ√(α² − β²))} · e^{−α√(δ² + (x−µ)²) + β(x−µ)},
where x, µ ∈ R, δ ≥ 0 and |β| < α.
For λ = −1/2 we obtain the normal-inverse Gaussian distribution (NIG):
f_NIG(x; α, β, δ, µ) = (αδ/π) · K_1{α√(δ² + (x − µ)²)} / √(δ² + (x − µ)²) · e^{δ√(α² − β²) + β(x−µ)}.
Ostap Okhrin 238 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 2: pdf (left) and cdf (right) of GH (λ = 0.5), HYP and NIG with α = 1, β = 0, δ = 1, µ = 0. MVAghdis
Ostap Okhrin 239 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Student’s t-distribution
Introduced by Gosset (1908), who published under the pseudonym “Student” at the request of his employer.
Let X be a normally distributed rv with mean µ and variance σ², and let Y be an rv such that Y²/σ² has a chi-square distribution with n degrees of freedom. Assume that X and Y are independent; then
t := X√n / Y
is distributed as Student's t with n degrees of freedom.
Ostap Okhrin 240 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF of Student’s t-distribution
The t-distribution has the following density function
ft(x; n) = Γ{(n + 1)/2} / {√(nπ) Γ(n/2)} · (1 + x²/n)^{−(n+1)/2},
where n is the number of degrees of freedom, −∞ < x < ∞, and Γ is the gamma function
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
Ostap Okhrin 241 of 461
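The density formula above matches standard implementations. A minimal sketch (not from the slides), comparing it with `scipy.stats.t`:

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.special import gammaln

# The slide's t density, written with log-gamma for numerical stability.
def f_t(x, n):
    logc = gammaln((n + 1) / 2) - gammaln(n / 2) - 0.5 * np.log(n * np.pi)
    return np.exp(logc) * (1 + x**2 / n) ** (-(n + 1) / 2)

x = np.linspace(-5, 5, 101)
gap = np.max(np.abs(f_t(x, 3) - student_t.pdf(x, df=3)))
```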
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
−4 −2 0 2 4
0.0
0.1
0.2
0.3
0.4
X
Y
t3t6t30
PDF of t−distribution
−4 −2 0 2 4
0.0
0.2
0.4
0.6
0.8
1.0
X
Y
t3t6t30
CDF of t−distribution
Abbildung 3: pdf (left) and cdf (right) of t-distribution with different degrees offreedom (t3 stands for t-distribution with 3 degrees of freedom) MVAtdis
Ostap Okhrin 242 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean, Variance, Skewness and Kurtosis
The mean, variance, skewness and kurtosis of Student's t-distribution (n > 4) are:
µ = 0
σ² = n/(n − 2)
Skewness = 0
Kurtosis = 3 + 6/(n − 4)
Ostap Okhrin 243 of 461
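The kurtosis formula can be cross-checked against scipy, which reports excess kurtosis (kurtosis minus 3). A minimal sketch, not from the slides:

```python
from scipy.stats import t as student_t

# scipy returns the excess kurtosis, so for n > 4 it equals 6/(n-4);
# the non-excess kurtosis of the slide is then 3 + 6/(n-4).
n = 10
excess = float(student_t.stats(df=n, moments='k'))
kurtosis = 3 + excess
```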
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Property
Student's t-distribution approaches the normal distribution as n increases, since
lim_{n→∞} ft(x; n) = (1/√(2π)) e^{−x²/2}.
Ostap Okhrin 244 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Tail of Student’s t-distribution
In the tails, the density of the t-distribution is proportional to |x|^{−(n+1)}.
Figure 4: Tails of pdf curves of t-distributions. With higher degrees of freedom, the t-distribution decays faster. MVAdistail
Ostap Okhrin 245 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Laplace Distribution
The univariate Laplace distribution with mean zero was introduced byLaplace (1774).The Laplace distribution can be defined as the distribution ofdifferences between two independent variates with identicalexponential distributions.
Ostap Okhrin 246 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF and CDF of Laplace Distribution
The Laplace distribution with mean µ and scale parameter θ has the pdf
f_Laplace(x; µ, θ) = {1/(2θ)} e^{−|x−µ|/θ}
and the cdf
F_Laplace(x; µ, θ) = (1/2)[1 + sgn(x − µ){1 − e^{−|x−µ|/θ}}],
where sgn is the signum function.
Ostap Okhrin 247 of 461
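Both formulas agree with scipy's parameterization `laplace(loc=µ, scale=θ)`. A minimal sketch, not from the slides:

```python
import numpy as np
from scipy.stats import laplace

# The slide's Laplace pdf and cdf against scipy's implementation.
mu, theta = 0.5, 2.0
x = np.linspace(-8, 8, 201)

pdf = np.exp(-np.abs(x - mu) / theta) / (2 * theta)
cdf = 0.5 * (1 + np.sign(x - mu) * (1 - np.exp(-np.abs(x - mu) / theta)))

pdf_gap = np.max(np.abs(pdf - laplace.pdf(x, loc=mu, scale=theta)))
cdf_gap = np.max(np.abs(cdf - laplace.cdf(x, loc=mu, scale=theta)))
```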
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean, Variance, Skewness and Kurtosis
The mean, variance, skewness and kurtosis of the Laplace distribution are:
Mean = µ
σ² = 2θ²
Skewness = 0
Kurtosis = 6
Ostap Okhrin 248 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 5: pdf (left) and cdf (right) of Laplace distributions with zero mean and different scale parameters (L1 stands for the Laplace distribution with θ = 1). MVAlaplacedis
Ostap Okhrin 249 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Standard Laplace Distribution
The standard Laplace distribution has mean 0 and θ = 1:
f(x) = e^{−|x|}/2
F(x) = e^x/2 for x < 0, and 1 − e^{−x}/2 for x ≥ 0.
Ostap Okhrin 250 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Cauchy Distribution
Named after Augustin Cauchy and Hendrik Lorentz.
Applications: in physics, the solution to the differential equation describing forced resonance; in spectroscopy, the description of the shape of spectral lines.
Ostap Okhrin 251 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF and CDF of the Cauchy Distribution
f_Cauchy(x; m, s) = 1/(sπ) · 1/{1 + ((x − m)/s)²}
F_Cauchy(x; m, s) = 1/2 + (1/π) arctan{(x − m)/s},
where m and s are the location and scale parameter, respectively.
Ostap Okhrin 252 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Standard Cauchy Distribution
The standard Cauchy distribution has m = 0 and s = 1:
f_Cauchy(x) = 1/{π(1 + x²)}
F_Cauchy(x) = 1/2 + arctan(x)/π
Ostap Okhrin 253 of 461
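As with the Laplace case, the location-scale formulas match scipy's `cauchy(loc=m, scale=s)`. A minimal sketch, not from the slides:

```python
import numpy as np
from scipy.stats import cauchy

# The slide's Cauchy pdf and cdf against scipy's implementation.
m, s = 1.0, 1.5
x = np.linspace(-10, 10, 201)

pdf = 1 / (s * np.pi) / (1 + ((x - m) / s) ** 2)
cdf = 0.5 + np.arctan((x - m) / s) / np.pi

pdf_gap = np.max(np.abs(pdf - cauchy.pdf(x, loc=m, scale=s)))
cdf_gap = np.max(np.abs(cdf - cauchy.cdf(x, loc=m, scale=s)))
```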
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 6: pdf (left) and cdf (right) of Cauchy distributions with m = 0 and different scale parameters (C1 stands for the Cauchy distribution with s = 1). MVAcauchy
Ostap Okhrin 254 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean, Variance, Skewness and Kurtosis
The mean, variance, skewness and kurtosis of the Cauchy distribution are all undefined, since the corresponding moment integrals do not converge. It does, however, have a mode and a median, both equal to the location parameter m.
Ostap Okhrin 255 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mixture Model
Mixture modelling concerns modelling a distribution by a mixture(weighted sum) of different distributions.
Ostap Okhrin 256 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF of Mixture Model
The pdf of a mixture distribution is
f(x) = ∑_{l=1}^n wl pl(x)
under the constraints
0 ≤ wl ≤ 1, ∑_{l=1}^n wl = 1, ∫ pl(x) dx = 1,
where pl(x) is the pdf of the l-th component density and wl is a weight.
Ostap Okhrin 257 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean, Variance, Skewness and Kurtosis
µ = ∑_{l=1}^n wl µl
σ² = ∑_{l=1}^n wl {σl² + (µl − µ)²}
Skewness = ∑_{l=1}^n wl [ (σl/σ)³ SKl + 3σl²(µl − µ)/σ³ + {(µl − µ)/σ}³ ]
Kurtosis = ∑_{l=1}^n wl [ (σl/σ)⁴ Kl + 6(µl − µ)²σl²/σ⁴ + 4(µl − µ)σl³ SKl/σ⁴ + {(µl − µ)/σ}⁴ ],
where µl, σl, SKl and Kl correspond to the l-th distribution.
Ostap Okhrin 258 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Gaussian Mixture Models
The pdf of a Gaussian mixture is
f_GM(x) = ∑_{l=1}^n wl/{√(2π) σl} · e^{−(x−µl)²/(2σl²)}.
When the Gaussian components all have mean 0:
f_GM(x) = ∑_{l=1}^n wl/{√(2π) σl} · e^{−x²/(2σl²)},
with variance, skewness and kurtosis
σ² = ∑_{l=1}^n wl σl², Skewness = 0, Kurtosis = ∑_{l=1}^n 3wl (σl/σ)⁴.
Ostap Okhrin 259 of 461
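The closed-form kurtosis of a zero-mean Gaussian mixture can be checked by numerical integration. A minimal sketch (not from the slides), for an equal-weight mixture with σ = (1, 2):

```python
import numpy as np
from scipy.integrate import quad

# Zero-mean Gaussian mixture: compare the closed-form kurtosis
# sum_l 3 w_l (sigma_l/sigma)^4 with numerical integration of x^4 f(x).
w = np.array([0.5, 0.5])
sig = np.array([1.0, 2.0])

def f_gm(x):
    return np.sum(w / (np.sqrt(2 * np.pi) * sig) * np.exp(-x**2 / (2 * sig**2)))

var = np.sum(w * sig**2)                       # sigma^2 = 2.5
m4, _ = quad(lambda x: x**4 * f_gm(x), -np.inf, np.inf)
kurt_numeric = m4 / var**2
kurt_formula = np.sum(3 * w * (sig**2 / var) ** 2)
```

Both values exceed 3: the mixture is leptokurtic even though each component is Gaussian.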
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 7: pdf (left) and cdf (right) of a Gaussian mixture. MVAmixture
Remark: The Gaussian mixture is in general not a Gaussian distribution.
Ostap Okhrin 260 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Multivariate Generalised HyperbolicDistribution
The multivariate Generalised Hyperbolic distribution (GHd) has the following pdf:
f_GHd(x; λ, α, β, δ, ∆, µ) = ad · K_{λ−d/2}{α√(δ² + (x − µ)>∆^{−1}(x − µ))} / {α^{−1}√(δ² + (x − µ)>∆^{−1}(x − µ))}^{d/2−λ} · e^{β>(x−µ)},
with normalizing constant
ad = ad(λ, α, β, δ, ∆) = {√(α² − β>∆β)/δ}^λ / {(2π)^{d/2} K_λ(δ√(α² − β>∆β))}.
Ostap Okhrin 261 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Parameters of GHd
The domain of variation of the parameters:
λ ∈ R, β, µ ∈ R^d
δ > 0, α² > β>∆β
∆ ∈ R^{d×d} a positive definite matrix with |∆| = 1.
For λ = (d + 1)/2 we obtain the multivariate hyperbolic (HYP) distribution; for λ = −1/2 we get the multivariate normal-inverse Gaussian (NIG) distribution.
Ostap Okhrin 262 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Second Parameterization
Blæsild and Jensen (1981) introduced a second parameterization (ζ, Π, Σ), where
ζ = δ√(α² − β>∆β)
Π = β∆^{1/2} / √(α² − β>∆β)
Σ = δ²∆
Ostap Okhrin 263 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Second Parameterization
The mean and variance of X ∼ GHd are
E[X] = µ + δ Rλ(ζ) Π∆^{1/2}
Var[X] = δ² [ζ^{−1} Rλ(ζ) ∆ + Sλ(ζ) (Π∆^{1/2})>(Π∆^{1/2})],
where
Rλ(x) = K_{λ+1}(x)/K_λ(x)
Sλ(x) = {K_{λ+2}(x) K_λ(x) − K²_{λ+1}(x)} / K²_λ(x)
Ostap Okhrin 264 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Multivariate t-distribution
If X ∼ Np(µ,Σ) and Y ∼ χ²_n are independent and X√(n/Y) = t − µ, then the pdf of t is
f_t(t; n, Σ, µ) = Γ{(n + p)/2} / [Γ(n/2) n^{p/2} π^{p/2} |Σ|^{1/2} {1 + (1/n)(t − µ)>Σ^{−1}(t − µ)}^{(n+p)/2}].
The distribution of t is the noncentral t-distribution with n degrees of freedom and noncentrality parameter µ.
Ostap Okhrin 265 of 461
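For µ = 0 the density above is the (central) multivariate t density, which scipy implements as `scipy.stats.multivariate_t` (scipy ≥ 1.6). A minimal sketch, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import gammaln

# The slide's multivariate t density (mu = 0) against scipy's implementation.
n, p = 5, 2
mu = np.zeros(p)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def f_mvt(x):
    d = x - mu
    q = d @ np.linalg.inv(Sigma) @ d
    logc = gammaln((n + p) / 2) - gammaln(n / 2) \
        - (p / 2) * np.log(n * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return np.exp(logc) * (1 + q / n) ** (-(n + p) / 2)

x = np.array([0.4, -1.1])
gap = abs(f_mvt(x) - multivariate_t(loc=mu, shape=Sigma, df=n).pdf(x))
```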
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Multivariate Laplace Distribution
Let g and G be the pdf and cdf of a d-dimensional Gaussian distribution Nd(0,Σ). The pdf and cdf of a multivariate Laplace distribution can be written as
f_MLaplace_d(x; m, Σ) = ∫_0^∞ g(z^{−1/2}x − z^{1/2}m) z^{−d/2} e^{−z} dz
F_MLaplace_d(x; m, Σ) = ∫_0^∞ G(z^{−1/2}x − z^{1/2}m) e^{−z} dz
Ostap Okhrin 266 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
PDF of Multivariate Laplace Distribution
The pdf can also be described as
f_MLaplace_d(x; m, Σ) = 2e^{x>Σ^{−1}m} / {(2π)^{d/2} |Σ|^{1/2}} · {x>Σ^{−1}x / (2 + m>Σ^{−1}m)}^{λ/2} · K_λ{√((2 + m>Σ^{−1}m)(x>Σ^{−1}x))},
where λ = (2 − d)/2 and K_λ(x) is the modified Bessel function of the third kind:
K_λ(x) = (1/2)(x/2)^λ ∫_0^∞ t^{−λ−1} e^{−t − x²/(4t)} dt, x > 0.
Ostap Okhrin 267 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Mean and Variance of Multivariate LaplaceDistribution
E[X ] = m
Cov[X ] = Σ + mm>
Ostap Okhrin 268 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Multivariate Mixture Model
A multivariate mixture model combines multivariate component distributions; e.g., the pdf of a multivariate Gaussian mixture can be written as
f(x) = ∑_{l=1}^n wl |2πΣl|^{−1/2} e^{−(1/2)(x−µl)>Σl^{−1}(x−µl)}.
Ostap Okhrin 269 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Generalised Hyperbolic Distribution
The GH distribution has exponentially decaying tails:
f_GH(x; λ, α, β, δ, µ = 0) ∼ x^{λ−1} e^{−(α−β)x} as x → ∞.
Ostap Okhrin 270 of 461
Angewandte Multivariate Statistik Multivariate Distributions Heavy-Tailed Distributions
Figure 8: Graphical comparison of tail behavior. For all distributions the means equal 0 and the variances equal 1. The NIG distribution (line) with λ = −1/2 has the second-fastest tail decay and the highest peak. The Cauchy distribution (dots) has the lowest peak and the fattest tails. MVAghadatail
Ostap Okhrin 271 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Copulae vs Normal Distribution
1. The empirical marginal distributions are skewed and fat-tailed.
2. The multivariate normal distribution does not capture the possibility of extreme joint co-movements of asset returns.
The dependency structure of portfolio asset returns differs from the Gaussian one.
Ostap Okhrin 272 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Advantages
1. Copulae are useful tools to simulate asset return distributions in a more realistic way.
2. Copulae allow one to model the dependence structure separately from the marginal distributions:
- construct a multivariate distribution with different margins
- the dependence structure is given by the copula.
Ostap Okhrin 273 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Dependency Structures
Figure 9: Scatter plots of bivariate samples with different dependency structures and equal correlation coefficients.
Ostap Okhrin 274 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Varying Dependency
Figure 10: Standardized log-returns of Bayer and Siemens, 20000103–20020101 (left) and 20040101–20060102 (right). MVAscalogret
Ostap Okhrin 275 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Outline
1. Motivation X
2. Copulae
3. Parameter Estimation
4. Sampling from Copulae
5. Tail Dependence
6. Value-at-Risk with Copulae
7. Application
Ostap Okhrin 276 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Copulae
A copula is a multivariate distribution function defined on the unit cube [0, 1]^d with uniformly distributed margins. It links a joint distribution to its margins:
P(X1 ≤ x1, . . . , Xd ≤ xd) = C{P(X1 ≤ x1), . . . , P(Xd ≤ xd)} = C{F1(x1), . . . , Fd(xd)}
Ostap Okhrin 277 of 461
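The simplest instance of this identity uses the product (independence) copula C(u1, u2) = u1·u2. A minimal sketch (not from the slides), for independent N(0, 1) margins:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Independent N(0,1) margins: the joint cdf factorizes through the
# product copula, C{F1(x1), F2(x2)} = F1(x1) * F2(x2).
joint = multivariate_normal(mean=[0, 0], cov=np.eye(2))

x1, x2 = 0.5, -1.0
lhs = joint.cdf([x1, x2])              # P(X1 <= x1, X2 <= x2)
rhs = norm.cdf(x1) * norm.cdf(x2)      # product copula of the margins
gap = abs(lhs - rhs)
```

The small residual comes from the numerical accuracy of the multivariate normal cdf.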
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Applications
1. medicine
2. hydrology
3. finance (portfolio selection, time series, risk management)
2
the others work in banks, insurance companies and financial institutions. Their writings appeared in some 165 journals and conference proceedings. The most striking feature of the data set is the rapid growth in the annual number of contributions to the subject. This is illustrated in Figure 1. More detailed examination reveals that the growth falls into three periods:
a) Before 1986, the literature was sparse and mostly mathematical. The concept of copula can be traced back at least to the work of Wassily Hoeffding and Maurice Fréchet, though the term itself was coined by Sklar (1959). Many contributions were related to the study of probabilistic metric spaces, as described in the book by Schweizer & Sklar (1983).
b) Beginning in 1986, one can see a slow, systematic rise in the number of publications. Growth was largely due to the emergence of the concept of copula in statistics and to three conferences devoted to the subject: Rome (1990), Seattle (1993), and Prague (1996).
c) From 1999 on, the number of contributions grew considerably. The books by Joe (1997) and Nelsen (1999) were influential in disseminating copula theory; the book by Drouet-Mari & Kotz (2001), which focuses on correlation and dependence, is also noteworthy. Actuarial and financial applications were fuelled by Frees & Valdez (1998) and Embrechts et al. (1999), who illustrated the potential for copula modeling in these fields.
Figure 1. Number of documents on copula theory, 1971–2005 3. Breakdown by field of study What is the part of finance to the spectacular growth of copula methodology in the past few years? To investigate this issue, we subjectively grouped the 871 documents in our database into 9 mutually exclusive categories: mathematics; statistics; biostatistics; operations research; natural sciences; engineering; actuarial science; economics; and finance. We achieved this classification by carefully examining the contents of each document. About 1% of them did not match any of the categories and were left unclassified. Figure 2 shows the results of the grouping. Even though people in finance have been interested in copulas only since 2000, they produced the largest proportion of documents, i.e., 41%. Next come statistics (28%), biostatistics (10%), mathematics (8%), and actuarial science (8%). Interestingly, in June 2006 finance and actuarial science together contributed 47% of the literature, whereas mathematics, statistics, and biostatistics together accounted for 46%. No doubt
0 25 50 75
100 125 150 175 200 225
1972 1974 197619781980198219841986198819901992199419961998200020022004
3
finance-related documents now account for over half of the literature on the subject. We will later discuss the nature of these contributions.
Figure 2. Breakdown by discipline of the 871 documents in the database
The level of activity in each discipline is also reflected by Table 1, which lists the peer-reviewed journals that carried the largest number of articles concerned with copulas. As of June 2006, statistics continued to lead the roster. This is not surprising, given that copulas have a long history in this area. Interestingly, Risk Magazine and Quantitative Finance make the list, even though the earliest papers on the topic appeared there in 2001. A fair proportion of copula-related articles in Insurance: Mathematics and Economics also pertain to finance.
Table 1. List of journals that published the largest number of copula-related articles
Rank  Journal                                          Papers published
1     Journal of Multivariate Analysis                 29
2     Statistics & Probability Letters                 26
3     Insurance: Mathematics & Economics               23
4     Communications in Statistics: Theory & Methods   19
5     Biometrika                                       14
6     Risk                                             14
7     The Canadian Journal of Statistics               12
8     Biometrics                                       12
9     Quantitative Finance                             11
10    Journal of Nonparametric Statistics              10
Figure 11: Number of documents on copula theory, 1971-2005. Breakdown by discipline of the 871 documents in the database (41% Finance, 28% Statistics, 10% Biostatistics, 8% Mathematics, 6% Insurance).
Ostap Okhrin 278 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
F-volume
Let U1 and U2 be two sets in R̄ = R ∪ {+∞} ∪ {−∞} and consider the function F : U1 × U2 −→ R. The F-volume of a rectangle B = [x1, x2] × [y1, y2] ⊂ U1 × U2 is defined as:
VF (B) = F (x2, y2)− F (x1, y2)− F (x2, y1) + F (x1, y1) (2)
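As a quick numerical illustration, the F-volume in (2) can be computed directly; the example function F(x, y) = xy is an assumption chosen for illustration (it is the product copula, so all of its volumes are non-negative):

```python
# F-volume of a rectangle B = [x1, x2] x [y1, y2], following (2).
def f_volume(F, x1, x2, y1, y2):
    return F(x2, y2) - F(x1, y2) - F(x2, y1) + F(x1, y1)

# Illustrative choice: F(x, y) = x * y, the product copula on [0, 1]^2.
def F(x, y):
    return x * y

print(f_volume(F, 0.0, 1.0, 0.0, 1.0))  # 1.0 (total mass of the unit square)
print(f_volume(F, 0.2, 0.5, 0.1, 0.7))  # equals (0.5 - 0.2) * (0.7 - 0.1) up to rounding
```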
Ostap Okhrin 279 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
2-increasing Function
F is said to be a 2-increasing function if for everyB = [x1, x2]× [y1, y2] ⊂ U1 × U2,
VF (B) ≥ 0 (3)
Remark
Note that "to be a 2-increasing function" neither implies nor is implied by "to be increasing in each argument".
Ostap Okhrin 280 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
2-increasing Function
Lemma
Let U1 and U2 be non-empty sets in R̄ and let F : U1 × U2 −→ R be a 2-increasing function. Let x1, x2 be in U1 with x1 ≤ x2, and y1, y2 be in U2 with y1 ≤ y2. Then the function t 7→ F(t, y2) − F(t, y1) is non-decreasing on U1 and the function t 7→ F(x2, t) − F(x1, t) is non-decreasing on U2.
Ostap Okhrin 281 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Grounded Function
If U1 and U2 have a smallest element min U1 and min U2 respectively, then we say that a function F : U1 × U2 −→ R is grounded if:

for all x ∈ U1: F(x, min U2) = 0 and (4)
for all y ∈ U2: F(min U1, y) = 0 (5)
Ostap Okhrin 282 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Distribution Function
A distribution function is a function from R̄² to [0, 1] which:
 is grounded
 is 2-increasing
 satisfies F(∞, ∞) = 1.
Ostap Okhrin 283 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Margins
If U1 and U2 have a greatest element max U1 and max U2 respectively, then we say that a function F : U1 × U2 −→ R has margins, and that the margins of F are given by:

F(x) = F(x, max U2) for all x ∈ U1 (6)
F(y) = F(max U1, y) for all y ∈ U2 (7)
Ostap Okhrin 284 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Bivariate Copulae
A 2-dimensional copula is a function C : [0, 1]² → [0, 1] with the following properties:
1. For every u ∈ [0, 1], C(0, u) = C(u, 0) = 0 (grounded).
2. For every u ∈ [0, 1], C(u, 1) = u and C(1, u) = u.
3. For every (u1, u2), (v1, v2) ∈ [0, 1] × [0, 1] with u1 ≤ v1 and u2 ≤ v2: C(v1, v2) − C(v1, u2) − C(u1, v2) + C(u1, u2) ≥ 0 (2-increasing).
Ostap Okhrin 285 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Fréchet-Hoeffding Bounds
1. Every copula C satisfies

W(u1, u2) ≤ C(u1, u2) ≤ M(u1, u2)

2. The upper and lower bounds are themselves copulae:
M(u1, u2) = min(u1, u2)
W (u1, u2) = max(u1 + u2 − 1, 0)
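A small sketch checking the bounds on a grid of interior points; the Clayton copula with θ = 2 stands in for C here and is an illustrative assumption, not part of the slide:

```python
# Verify W(u1, u2) <= C(u1, u2) <= M(u1, u2) on a grid.
def W(u1, u2):
    return max(u1 + u2 - 1.0, 0.0)

def M(u1, u2):
    return min(u1, u2)

def clayton(u1, u2, theta=2.0):  # example copula C (illustrative)
    return (u1 ** -theta + u2 ** -theta - 1.0) ** (-1.0 / theta)

grid = [i / 20.0 for i in range(1, 20)]
ok = all(W(u, v) <= clayton(u, v) <= M(u, v) for u in grid for v in grid)
print(ok)  # True
```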
Ostap Okhrin 286 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Fréchet Copulae
Figure 12: M(u, v) = min(u, v), W(u, v) = max(u + v − 1, 0) and Π(u, v) = uv. SFEfrechet
Fréchet, Maurice R. on BBI:
Ostap Okhrin 287 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Sklar’s Theorem in Two Dimensions
Let F be a two-dimensional distribution function with marginal distribution functions FX1 and FX2. Then a copula C exists such that for all (x1, x2) ∈ R̄²:

F(x1, x2) = C{FX1(x1), FX2(x2)} (8)

Moreover, if FX1 and FX2 are continuous, then C is unique. Otherwise C is uniquely determined on the Cartesian product Im(FX1) × Im(FX2). Conversely, if C is a copula and FX1 and FX2 are distribution functions, then F defined by (8) is a two-dimensional distribution function with marginals FX1 and FX2.
Ostap Okhrin 288 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Gauss Copula
C(u1, u2) = Φρ{Φ⁻¹(u1), Φ⁻¹(u2)}

= ∫_{−∞}^{Φ⁻¹(u1)} ∫_{−∞}^{Φ⁻¹(u2)} 1/{2π√(1 − ρ²)} exp{−(x² − 2ρxy + y²)/(2(1 − ρ²))} dx dy
Figure 13: Gauss copula density, ρ = 0.4. MSRpdf_cop_Gauss
Gauss, Carl F. on BBI:
Ostap Okhrin 289 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
t-Student Copula
C(u1, u2) = tρ,ν{tν⁻¹(u1), tν⁻¹(u2)}

= ∫_{−∞}^{tν⁻¹(u1)} ∫_{−∞}^{tν⁻¹(u2)} 1/{2π√(1 − ρ²)} {1 + (x² − 2ρxy + y²)/(ν(1 − ρ²))}^{−(ν+2)/2} dx dy
Figure 14: t-Student copula density, ν = 3, ρ = 0.4. MSRpdf_cop_tStudent
Gosset, William S. on BBI:
Ostap Okhrin 290 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Archimedean Copulae
Archimedean copula:
C(u, v) = ψ^{[−1]}{ψ(u) + ψ(v)}

for a continuous, decreasing and convex ψ with ψ(1) = 0, where

ψ^{[−1]}(t) = ψ⁻¹(t) for 0 ≤ t ≤ ψ(0), and ψ^{[−1]}(t) = 0 for ψ(0) < t ≤ ∞.

The function ψ is called a generator of the Archimedean copula. For ψ(0) = ∞: ψ^{[−1]} = ψ⁻¹, and ψ is called a strict generator.
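A sketch of the generator construction: with the Clayton generator ψ(t) = (t^{−θ} − 1)/θ (a standard example, assumed here for illustration), ψ^{[−1]}{ψ(u) + ψ(v)} reproduces the closed-form Clayton copula:

```python
theta = 2.0

def psi(t):        # Clayton generator; psi(1) = 0, psi(0) = inf (strict)
    return (t ** -theta - 1.0) / theta

def psi_inv(s):    # inverse of the strict generator
    return (1.0 + theta * s) ** (-1.0 / theta)

def C(u, v):       # Archimedean construction C(u, v) = psi^{-1}{psi(u) + psi(v)}
    return psi_inv(psi(u) + psi(v))

def clayton(u, v): # closed form (u^-theta + v^-theta - 1)^(-1/theta)
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

print(abs(C(0.3, 0.8) - clayton(0.3, 0.8)) < 1e-12)  # True
```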
Ostap Okhrin 291 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Gumbel Copula
C(u, v) = exp[−{(− log u)^θ + (− log v)^θ}^{1/θ}]
Figure 15: Gumbel copula density, parameter θ = 2. MSRpdf_cop_Gumbel
E. Gumbel on BBI:
Ostap Okhrin 292 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Clayton Copula
C(u, v) = max{(u^{−θ} + v^{−θ} − 1)^{−1/θ}, 0}
Figure 16: Clayton copula density, θ = 2. MSRpdf_cop_Clayton
Ostap Okhrin 293 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Frank Copula
C(u, v) = −(1/θ) log[1 + (e^{−θu} − 1)(e^{−θv} − 1)/(e^{−θ} − 1)]
Figure 17: Frank copula density, θ = 2. MSRpdf_cop_Frank
Ostap Okhrin 294 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Figure 18: Monte Carlo sample of 10,000 realizations of pseudo random variables with uniform marginals in [0, 1] and dependence structure given by the Clayton (left) and Gumbel (right) copulae with θ = 3. MVAgumbelclayton
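A Monte Carlo experiment of this kind can be sketched with the standard conditional-inverse sampler for the Clayton copula (θ = 3 as in the figure; sample size and seed are illustrative). Kendall's τ for Clayton equals θ/(θ + 2) = 0.6, which the sample reproduces roughly:

```python
import random

random.seed(1)
theta = 3.0

def rclayton():
    # conditional-inverse method: u uniform, v solved from C_{2|1}(v | u) = w
    u = 1.0 - random.random()
    w = 1.0 - random.random()
    v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

sample = [rclayton() for _ in range(1000)]

# empirical Kendall's tau (O(n^2), fine for n = 1000)
n = len(sample)
conc = sum(
    1 if (sample[i][0] - sample[j][0]) * (sample[i][1] - sample[j][1]) > 0 else -1
    for i in range(n) for j in range(i + 1, n)
)
tau = conc / (n * (n - 1) / 2)
print(round(tau, 2))  # close to theta / (theta + 2) = 0.6
```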
Ostap Okhrin 295 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Transformations of Margins
If (X1, X2) has copula C and g1, g2 are two continuous increasing functions, then (g1(X1), g2(X2)) has the copula C, too.
Ostap Okhrin 296 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Product Copula
Independence implies that the product of the cdf’s FX1 and FX2 equalsthe joint distribution function F , i.e.:
F (x1, x2) = FX1(x1)FX2(x2) (9)
Thus, we obtain the independence or product copulaC = Π(u, v) = uv .
Ostap Okhrin 297 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Product Copula
Let X1 and X2 be random variables with continuous distribution functions F1 and F2 and joint distribution function H. Then X1 and X2 are independent if and only if CX1X2 = Π. According to Sklar's Theorem, there exists a unique copula C with

P(X1 ≤ x1, X2 ≤ x2) = H(x1, x2) = C{F1(x1), F2(x2)} = F1(x1) · F2(x2)
Ostap Okhrin 298 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Partial Derivatives

Let C(u, v) be a copula. For any v ∈ I, the partial derivative ∂C(u, v)/∂v exists for almost all u ∈ I. For such u and v one has:

∂C(u, v)/∂v ∈ I (10)

The analogous statement is true for the partial derivative ∂C(u, v)/∂u:

∂C(u, v)/∂u ∈ I (11)

Moreover, the functions

u 7→ Cv(u) := ∂C(u, v)/∂v and v 7→ Cu(v) := ∂C(u, v)/∂u

are defined and non-decreasing almost everywhere on I.
Ostap Okhrin 299 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Copulae in d-Dimensions
Let U1, U2, . . . , Ud be non-empty sets in R̄ and consider the function F : U1 × U2 × . . . × Ud −→ R. For a = (a1, a2, . . . , ad) and b = (b1, b2, . . . , bd) with a ≤ b (i.e. ak ≤ bk for all k), let B = [a, b] = [a1, b1] × [a2, b2] × . . . × [ad, bd] be the d-box with vertices c = (c1, c2, . . . , cd), where each ck is either equal to ak or to bk.
Ostap Okhrin 300 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
F -volume
The F-volume of a d-box B = [a, b] = [a1, b1] × [a2, b2] × . . . × [ad, bd] ⊂ U1 × U2 × . . . × Ud is defined as:

VF(B) = ∑_c sgn(c) F(c) (12)

where the sum is taken over all vertices c of B, and sgn(c) = 1 if ck = ak for an even number of k's, sgn(c) = −1 if ck = ak for an odd number of k's.
Ostap Okhrin 301 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
d-increasing Function
F is said to be a d-increasing function if for all d-boxes B with vertices in U1 × U2 × . . . × Ud:
VF (B) ≥ 0. (13)
Ostap Okhrin 302 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Grounded Function
If U1, U2, . . . , Ud have a smallest element min U1, min U2, . . . , min Ud respectively, then we say that a function F : U1 × U2 × . . . × Ud −→ R is grounded if:

F(x) = 0 for all x ∈ U1 × U2 × . . . × Ud (14)

such that xk = min Uk for at least one k.
Ostap Okhrin 303 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Multivariate Copula
A d-dimensional copula is a function C : [0, 1]^d → [0, 1] such that:
1. C(u1, . . . , ui−1, 0, ui+1, . . . , ud) = 0 (at least one ui is 0);
2. for u ∈ [0, 1]^d, C(1, . . . , 1, ui, 1, . . . , 1) = ui (all coordinates except ui are 1);
3. for each u ≤ v ∈ [0, 1]^d (ui ≤ vi):

VC([u, v]) = ∑_a sgn(a) C(a) ≥ 0,

where the sum is taken over all vertices a of [u, v], with sgn(a) = 1 if ak = uk for an even number of k's and sgn(a) = −1 if ak = uk for an odd number of k's (d-increasing).
Ostap Okhrin 304 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Sklar’s Theorem
For a distribution function F with marginals FX1, . . . , FXd, there exists a copula C : [0, 1]^d → [0, 1], such that

F(x1, . . . , xd) = C{FX1(x1), . . . , FXd(xd)} (15)

for all xi ∈ R̄, i = 1, . . . , d. If FX1, . . . , FXd are continuous, then C is unique. If C is a copula and FX1, . . . , FXd are cdfs, then the function F defined in (15) is a joint cdf with marginals FX1, . . . , FXd.
Ostap Okhrin 305 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
 a copula C and marginal distributions can be "coupled" together into a distribution function:

FX(x1, . . . , xd) = C{FX1(x1), . . . , FXd(xd)}

 a (unique) copula is obtained by "decoupling" every (continuous) multivariate distribution function from its marginal distributions:

C(u1, . . . , ud) = FX{FX1⁻¹(u1), . . . , FXd⁻¹(ud)}, uj = FXj(xj), j = 1, . . . , d
Ostap Okhrin 306 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
 if C is absolutely continuous, there exists a copula density

c(u1, . . . , ud) = ∂^d C(u1, . . . , ud)/(∂u1 . . . ∂ud)

 the joint density fX is

fX(x1, . . . , xd) = c{FX1(x1), . . . , FXd(xd)} ∏_{j=1}^d fj(xj)
Ostap Okhrin 307 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Fréchet-Hoeffding Bounds, Product Copula
1. Every copula C satisfies
W d(u1, . . . , ud) ≤ C (u1, . . . , ud) ≤ Md(u1, . . . , ud)
2. Upper and lower bounds
Md(u1, . . . , ud) = min(u1, . . . , ud)
W d(u1, . . . , ud) = max(u1 + . . . + ud − d + 1, 0)

3. Product copula Πd(u1, . . . , ud) = u1 · · · ud
4. The functions Md and Πd are d-copulae for all d ≥ 2, thefunction W d is not a d-copula for any d > 2.
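Point 4 can be verified directly with the d-box volume (12): the sketch below computes the W³-volume of the box [1/2, 1]³, which turns out negative, so W³ assigns negative mass and cannot be a copula:

```python
from itertools import product

def W3(u):  # lower Frechet-Hoeffding bound for d = 3
    return max(sum(u) - 2.0, 0.0)

a, b = (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)
vol = 0.0
for c in product(*zip(a, b)):                  # all 2^3 vertices of [a, b]
    k = sum(ci == ai for ci, ai in zip(c, a))  # coordinates taken from a
    vol += (-1.0) ** k * W3(c)                 # sgn(c) = (-1)^k
print(vol)  # -0.5
```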
Ostap Okhrin 308 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Multivariate Elliptical Copulae
Gauss

C(u1, . . . , ud) = ∫_{−∞}^{Φ⁻¹(u1)} . . . ∫_{−∞}^{Φ⁻¹(ud)} (2π)^{−d/2} |R|^{−1/2} exp(−r>R⁻¹r/2) dr1 . . . drd,

where r = (r1, . . . , rd)>

t-Student

C(u1, . . . , ud) = ∫_{−∞}^{tν⁻¹(u1)} . . . ∫_{−∞}^{tν⁻¹(ud)} Γ{(ν + d)/2}/{Γ(ν/2)(νπ)^{d/2}|R|^{1/2}} {1 + r>R⁻¹r/ν}^{−(ν+d)/2} dr1 . . . drd,

where r = (r1, . . . , rd)>
Ostap Okhrin 309 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Multivariate Archimedean Copulae
 Gumbel

C(u1, . . . , ud) = exp[−{(− log u1)^θ + . . . + (− log ud)^θ}^{1/θ}]

 Cook-Johnson

C(u1, . . . , ud) = (u1^{−θ} + . . . + ud^{−θ} − d + 1)^{−1/θ}

 Frank

C(u1, . . . , ud) = −(1/θ) log[1 + (e^{−θu1} − 1) · · · (e^{−θud} − 1)/(e^{−θ} − 1)^{d−1}]
Ostap Okhrin 310 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Dimensionality
In d dimensions:
1. Elliptical copulae: correlation matrix with d(d − 1)/2 parameters
2. Archimedean copulae: 1 parameter
Ostap Okhrin 311 of 461
Angewandte Multivariate Statistik Multivariate Distributions Copulae
Conclusions
Pluses of copulae:
 flexible and wide range of dependence
 easy to simulate, estimate, implement
 explicit form of copula densities
 modelling of fat tails, asymmetries

Minuses of copulae:
 elliptical: correlation matrix, symmetry
 Archimedean: too restrictive, single parameter, exchangeable
 selection of the copula
Ostap Okhrin 312 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Theory of the Multinormal
Elementary Properties of the Multinormal

The pdf of X ∼ Np(µ, Σ) is given by:

f(x) = |2πΣ|^{−1/2} exp{−(x − µ)>Σ⁻¹(x − µ)/2}
The expectation and variance are respectively given by:
E(X ) = µ,Var(X ) = Σ
Ostap Okhrin 313 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Linear transformations
Linear transformations turn normal random variables into normal random variables: if X ∼ Np(µ, Σ), A(p × p), c ∈ Rp, then

Y = AX + c ∼ Np(Aµ + c, AΣA>).
Ostap Okhrin 314 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Theorem
Let X = (X1, X2)> ∼ Np(µ, Σ) with X1 ∈ Rr and X2 ∈ Rp−r, and define X2.1 = X2 − Σ21Σ11⁻¹X1, where

Σ = ( Σ11  Σ12
      Σ21  Σ22 ).

Then X1 ∼ Nr(µ1, Σ11) and X2.1 ∼ Np−r(µ2.1, Σ22.1) are independent, with

µ2.1 = µ2 − Σ21Σ11⁻¹µ1,  Σ22.1 = Σ22 − Σ21Σ11⁻¹Σ12.
Ostap Okhrin 315 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Corollary
Let X = (X1, X2)> ∼ Np(µ, Σ). Then Σ12 = 0 if and only if X1 is independent of X2.

The independence of two linear transforms of a multinormal X can be shown via the following corollary.

Corollary
If X ∼ Np(µ, Σ) and A and B are matrices, then AX and BX are independent if and only if AΣB> = 0.
Ostap Okhrin 316 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Theorem
If X ∼ Np(µ, Σ) and A(q × p), c ∈ Rq, q ≤ p, then Y = AX + c is a q-variate normal, i.e.,
Y ∼ Nq(Aµ+ c ,AΣA>).
Ostap Okhrin 317 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Theorem
The conditional distribution of X2 given X1 = x1 is normal with mean µ2 + Σ21Σ11⁻¹(x1 − µ1) and covariance Σ22.1, i.e.,

(X2 | X1 = x1) ∼ Np−r(µ2 + Σ21Σ11⁻¹(x1 − µ1), Σ22.1).
The conditional mean E(X2 | X1 = x1) is a LINEAR function of X1!
Ostap Okhrin 318 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Example
p = 2, r = 1, µ = (0, 0)>, Σ = ( 1     −0.8
                                 −0.8   2 )

Σ11 = 1, Σ21 = −0.8, Σ22.1 = 2 − (0.8)² = 1.36.

⇒ fX1(x1) = (2π)^{−1/2} exp(−x1²/2)
⇒ f(x2 | x1) = {2π(1.36)}^{−1/2} exp{−(x2 + 0.8x1)²/(2 · 1.36)}.
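A one-line numerical check of this example (pure Python, no libraries):

```python
# mu = (0, 0), Sigma = [[1, -0.8], [-0.8, 2]]: X2 | X1 = x1 is N(-0.8 x1, 1.36).
s11, s12, s22 = 1.0, -0.8, 2.0
beta = s12 / s11               # regression coefficient Sigma21 Sigma11^-1
s22_1 = s22 - s12 ** 2 / s11   # conditional variance Sigma22.1
print(beta, round(s22_1, 2))   # -0.8 1.36
```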
Ostap Okhrin 319 of 461
Figure: Conditional normal densities f(x2 | x1); shifts in the conditional density. MVAcondnorm
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Theorem
If X1 ∼ Nr(µ1, Σ11) and (X2 | X1 = x1) ∼ Np−r(Ax1 + b, Ω), where Ω does not depend on x1, then

X = (X1, X2)> ∼ Np(µ, Σ),

where

µ = ( µ1
      Aµ1 + b )

and

Σ = ( Σ11    Σ11A>
      AΣ11   Ω + AΣ11A> ).
Ostap Okhrin 321 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Conditional Approximations
Best approximation of X2 ∈ Rp−r by X1 ∈ Rr:

X2 = E(X2 | X1) + U = µ2 + Σ21Σ11⁻¹(X1 − µ1) = β0 + BX1 + U

with B = Σ21Σ11⁻¹, β0 = µ2 − Bµ1 and U ∼ Np−r(0, Σ22.1).
Ostap Okhrin 322 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Consider the case where X2 ∈ R, i.e., r = p − 1. Now B is a (1 × r) row vector β> such that:

X2 = β0 + β>X1 + U.

This means that the best MSE approximation of X2 by a function of X1 is a hyperplane.
Ostap Okhrin 323 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Σ = ( Σ11  σ12
      σ21  σ22 )  with σ12 ∈ Rr and σ22 ∈ R.

Marginal variance of X2:

σ22 = β>Σ11β + σ22.1 = σ21Σ11⁻¹σ12 + σ22.1.

Squared multiple correlation between X2 and the r variables X1:

ρ²2.1...r = σ21Σ11⁻¹σ12/σ22.
Ostap Okhrin 324 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Example: classic blue pullover data
Suppose that X1 (sales), X2 (price), X3 (advertisement) and X4 (sales assistants) are normally distributed with

µ = (172.7, 104.6, 104.0, 93.8)>

and (showing the lower triangle of the symmetric matrix)

Σ = ( 1037.21
      −80.02   219.84
      1430.70   92.10   2624.00
      271.44   −91.58   210.30   177.36 ).
Ostap Okhrin 325 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
The conditional distribution of X1 given (X2, X3, X4) is univariate normal with mean

µ1 + σ12Σ22⁻¹(X2 − µ2, X3 − µ3, X4 − µ4)> = 65.7 − 0.2X2 + 0.5X3 + 0.8X4

and variance

σ11.2 = σ11 − σ12Σ22⁻¹σ21 = 96.761.

The multiple correlation is ρ²1.234 = σ12Σ22⁻¹σ21/σ11 = 0.907.
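These numbers can be reproduced from µ and Σ with a few lines of linear algebra; the small Gauss-Jordan solver below is just a stand-in for a linear-algebra library:

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (A small, well-conditioned)
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        M[i] = [x / M[i][i] for x in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [x - M[r][i] * y for x, y in zip(M[r], M[i])]
    return [row[-1] for row in M]

s11 = 1037.21
s12 = [-80.02, 1430.70, 271.44]                  # Cov(X1, (X2, X3, X4))
S22 = [[219.84, 92.10, -91.58],
       [92.10, 2624.00, 210.30],
       [-91.58, 210.30, 177.36]]

w = solve(S22, s12)                              # Sigma22^-1 sigma21
explained = sum(si * wi for si, wi in zip(s12, w))
print([round(x, 3) for x in w])                  # conditional-mean coefficients
print(round(s11 - explained, 3))                 # sigma_{11.2} (slide: 96.761)
print(round(explained / s11, 3))                 # rho^2_{1.234} (slide: 0.907)
```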
Ostap Okhrin 326 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
The correlation matrix between the 4 variables is given by

P = ( 1
      −0.168   1
      0.867    0.121   1
      0.633   −0.464   0.308   1 ).

The conditional distribution of (X1, X2) given (X3, X4) is bivariate normal with mean:

(µ1, µ2)> + (σ13 σ14; σ23 σ24)(σ33 σ34; σ43 σ44)⁻¹ (X3 − µ3, X4 − µ4)>

= ( 32.516 + 0.467X3 + 0.977X4
    153.644 + 0.085X3 − 0.617X4 )
Ostap Okhrin 327 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
and covariance matrix:

(σ11 σ12; σ21 σ22) − (σ13 σ14; σ23 σ24)(σ33 σ34; σ43 σ44)⁻¹ (σ31 σ32; σ41 σ42)

= ( 104.006  −33.574
    −33.574  155.592 ).

This covariance matrix allows us to compute the partial correlation between X1 and X2 for a fixed level of X3 and X4:

ρX1X2|X3X4 = −33.574/√(104.006 · 155.592) = −0.264.
Ostap Okhrin 328 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Mahalanobis Transform
If X ∼ Np(µ, Σ), then the Mahalanobis transform is

Y = Σ^{−1/2}(X − µ) ∼ Np(0, Ip)

and it holds that

Y>Y = (X − µ)>Σ⁻¹(X − µ) ∼ χ²p.

 Y is a random vector and Y>Y is a scalar.
 Y>Y can be used for testing (assuming that Σ is known).
 Normally, we do not know Σ. The tests in this situation can be carried out using the Wishart and Hotelling distributions.
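A sketch of the transform for p = 2 with the example covariance Σ = [[1, −0.8], [−0.8, 2]] from the conditional-normal slide; Σ^{−1/2} is built from an explicit 2×2 eigendecomposition, and the check confirms that it whitens Σ:

```python
import math

a, b, d = 1.0, -0.8, 2.0        # Sigma = [[a, b], [b, d]], with b != 0
tr, det = a + d, a * d - b * b
l1 = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0   # eigenvalues of Sigma
l2 = (tr - math.sqrt(tr * tr - 4.0 * det)) / 2.0

v = (b, l1 - a)                 # eigenvector for l1 (valid since b != 0)
nrm = math.hypot(*v)
v1 = (v[0] / nrm, v[1] / nrm)
v2 = (-v1[1], v1[0])            # orthogonal unit eigenvector for l2

# Sigma^{-1/2} = V diag(l^{-1/2}) V^T
R = [[l1 ** -0.5 * v1[i] * v1[j] + l2 ** -0.5 * v2[i] * v2[j]
      for j in range(2)] for i in range(2)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

S = [[a, b], [b, d]]
I2 = matmul(matmul(R, S), R)    # should be (numerically) the identity matrix
print([[round(x, 8) for x in row] for row in I2])
```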
Ostap Okhrin 329 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Summary: Elementary Properties
If X ∼ Np(µ,Σ) then a linear transformationAX + c ,A(q × p), c ∈ Rq has distribution Nq(Aµ+ c ,AΣA>).
Two linear transformations AX and BX of X ∼ Np(µ,Σ) areindependent if and only if AΣB> = 0.
If X1 and X2 are partitions of X ∼ Np(µ,Σ) then the conditionaldistribution of X2 given X1 = x1 is normal again.
Ostap Okhrin 330 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Elementary Properties
Summary: Elementary Properties
 In the multivariate normal case, X1 is independent of X2 if and only if Σ12 = 0.
 The conditional expectation of (X2 | X1) is a linear function if (X1, X2)> ∼ Np(µ, Σ).
 The multiple correlation coefficient is defined as ρ²2.1...r = σ21Σ11⁻¹σ12/σ22.
 The multiple correlation coefficient is the percentage of the variance of X2 explained by the linear approximation β0 + β>X1.
Ostap Okhrin 331 of 461
Angewandte Multivariate Statistik Theory of the Multinormal The Wishart Distribution
Wishart Distribution
Let X(n × p) be a data matrix with rows drawn from Np(µ, Σ), µ = 0. Then

M(p × p) = X>X ∼ Wp(Σ, n)

Example (the Wishart distribution is a generalization of χ²): p = 1, X ∼ N1(0, σ²),

X = (x1, . . . , xn)>, M = X>X = ∑_{i=1}^n xi² ∼ σ²χ²n = W1(σ², n)
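A quick simulation sketch of the p = 1 case (sample size, σ² and seed are illustrative): M = ∑ xi² has mean nσ², the mean of a σ²χ²n variable:

```python
import random

random.seed(0)
n, sigma2, reps = 50, 4.0, 2000
ms = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    ms.append(sum(x * x for x in xs))          # one draw of M ~ W1(sigma2, n)
mean_M = sum(ms) / reps
print(round(mean_M / (n * sigma2), 2))         # close to 1
```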
Ostap Okhrin 332 of 461
Angewandte Multivariate Statistik Theory of the Multinormal The Wishart Distribution
Linear Transformation of the Data Matrix
Theorem
M ∼ Wp(Σ, n), B(p × q)

⇒ B>MB ∼ Wq(B>ΣB, n)
Ostap Okhrin 333 of 461
Angewandte Multivariate Statistik Theory of the Multinormal The Wishart Distribution
Wishart and χ2p- Distribution
Theorem
M ∼ Wp(Σ, n), a ∈ Rp, a>Σa ≠ 0

⇒ a>Ma/(a>Σa) ∼ χ²n
Ostap Okhrin 334 of 461
Angewandte Multivariate Statistik Theory of the Multinormal The Wishart Distribution
Theorem (Cochran)
Let X(n × p) be a data matrix from a Np(0, Σ) distribution. Then:
 nS = X>HX ∼ Wp(Σ, n − 1), where S is the sample covariance matrix and H the centering matrix
 x̄ and S are independent
Ostap Okhrin 335 of 461
Angewandte Multivariate Statistik Theory of the Multinormal The Wishart Distribution
Summary: Wishart Distribution
 The Wishart distribution is a generalization of the χ²-distribution. In particular, W1(σ², n) = σ²χ²n.
 The empirical covariance matrix S has a (1/n)Wp(Σ, n − 1) distribution.
 In the normal case, x̄ and S are independent.
 For M ∼ Wp(Σ, m): a>Ma/(a>Σa) ∼ χ²m.
Ostap Okhrin 336 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Hotelling Distribution
Hotelling’s T 2-Distribution
Assume that the random vector Y ∼ Np(0, I) is independent of the random matrix M ∼ Wp(I, n). Then

n Y>M⁻¹Y ∼ T²(p, n).

Hotelling's T² is a generalization of Student's t-distribution. The critical values of Hotelling's T² can be calculated using the F-distribution:

T²(p, n) = np/(n − p + 1) Fp,n−p+1
Ostap Okhrin 337 of 461
Angewandte Multivariate Statistik Theory of the Multinormal Hotelling Distribution
Summary: Hotelling’s T 2-Distribution
 Hotelling's T²-distribution is a generalization of the t-distribution. In particular, T(1, n) = tn.
 (n − 1)(x̄ − µ)>S⁻¹(x̄ − µ) has a T²(p, n − 1) distribution.
 The relation between Hotelling's T²- and Fisher's F-distribution is given by T²(p, n) = np/(n − p + 1) Fp,n−p+1.
Ostap Okhrin 338 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Theory of Estimation
In parametric statistics, θ is a k-variate vector θ ∈ Rk characterizing the unknown properties of the population pdf f(x; θ).

The aim is to estimate θ from the sample X through estimators θ̂ which are functions of the sample: θ̂ = θ̂(X).

We must derive the sampling distribution of θ̂ to analyze its properties (is it related to the unknown quantity θ it is supposed to estimate?).
We will utilise the maximum likelihood theory.
Ostap Okhrin 339 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
The Likelihood Function
Let X ∼ f(x; θ) be the pdf of an i.i.d. sample {xi}ni=1 with parameter θ.

Likelihood function:

L(X; θ) = ∏_{i=1}^n f(xi; θ)

MLE:

θ̂ = argmax_θ L(X; θ)

Log-likelihood:

ℓ(X; θ) = log L(X; θ)
Ostap Okhrin 340 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example
Consider a sample {xi}ni=1 from Np(µ, I), i.e. from the pdf

f(x; θ) = (2π)^{−p/2} exp{−(x − θ)>(x − θ)/2}

where θ = µ ∈ Rp is the mean vector parameter. The log-likelihood is

ℓ(X; θ) = ∑_{i=1}^n log f(xi; θ) = log(2π)^{−np/2} − (1/2) ∑_{i=1}^n (xi − θ)>(xi − θ).

The term (xi − θ)>(xi − θ) equals

(xi − x̄)>(xi − x̄) + (x̄ − θ)>(x̄ − θ) + 2(x̄ − θ)>(xi − x̄).
Ostap Okhrin 341 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (cont'd)
Summing this term over i = 1, . . . , n, we see that

∑_{i=1}^n (xi − θ)>(xi − θ) = ∑_{i=1}^n (xi − x̄)>(xi − x̄) + n(x̄ − θ)>(x̄ − θ).

Hence

ℓ(X; θ) = log(2π)^{−np/2} − (1/2) ∑_{i=1}^n (xi − x̄)>(xi − x̄) − (n/2)(x̄ − θ)>(x̄ − θ).

Only the last term depends on θ, and it is obviously maximized for

θ̂ = µ̂ = x̄.

Thus x̄ is the MLE.
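The conclusion θ̂ = x̄ can be checked numerically (p = 1 for brevity; the simulated data are illustrative): the log-likelihood at x̄ dominates its value at nearby points, since ℓ is exactly quadratic in θ:

```python
import math
import random

random.seed(7)
xs = [random.gauss(1.5, 1.0) for _ in range(100)]   # sample from N(1.5, 1)
xbar = sum(xs) / len(xs)

def loglik(theta):
    n = len(xs)
    return -n / 2.0 * math.log(2.0 * math.pi) - 0.5 * sum((x - theta) ** 2 for x in xs)

print(all(loglik(xbar) >= loglik(xbar + e) for e in (-0.5, -0.1, 0.1, 0.5)))  # True
```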
Ostap Okhrin 342 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (MLE's from a Normal Distribution)
{xi}ni=1 is a sample from a normal distribution Np(µ, Σ). Due to the symmetry of Σ, the unknown parameter θ is in fact {p + p(p + 1)/2}-dimensional. Then

L(X; θ) = |2πΣ|^{−n/2} exp{−(1/2) ∑_{i=1}^n (xi − µ)>Σ⁻¹(xi − µ)}

and

ℓ(X; θ) = −(n/2) log |2πΣ| − (1/2) ∑_{i=1}^n (xi − µ)>Σ⁻¹(xi − µ).
Ostap Okhrin 343 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (MLE's from a Normal Distribution - cont'd)
The term (xi − µ)>Σ⁻¹(xi − µ) equals

(xi − x̄)>Σ⁻¹(xi − x̄) + (x̄ − µ)>Σ⁻¹(x̄ − µ) + 2(x̄ − µ)>Σ⁻¹(xi − x̄).

Summing this term over i = 1, . . . , n, we see that

∑_{i=1}^n (xi − µ)>Σ⁻¹(xi − µ) = ∑_{i=1}^n (xi − x̄)>Σ⁻¹(xi − x̄) + n(x̄ − µ)>Σ⁻¹(x̄ − µ).
Ostap Okhrin 344 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (MLE's from a Normal Distribution - cont'd)
Note that

(xi − x̄)>Σ⁻¹(xi − x̄) = tr{(xi − x̄)>Σ⁻¹(xi − x̄)} = tr{Σ⁻¹(xi − x̄)(xi − x̄)>}.

Summing over the index i:

∑_{i=1}^n (xi − µ)>Σ⁻¹(xi − µ) = tr{Σ⁻¹ ∑_{i=1}^n (xi − x̄)(xi − x̄)>} + n(x̄ − µ)>Σ⁻¹(x̄ − µ)
= tr{Σ⁻¹nS} + n(x̄ − µ)>Σ⁻¹(x̄ − µ).
Ostap Okhrin 345 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (MLE's from a Normal Distribution - cont'd)
Thus the log-likelihood function for Np(µ, Σ) is

ℓ(X; θ) = −(n/2) log |2πΣ| − (n/2) tr{Σ⁻¹S} − (n/2)(x̄ − µ)>Σ⁻¹(x̄ − µ).

We can easily see that the third term is maximized by µ = x̄. The MLE's are given by

µ̂ = x̄, Σ̂ = S.

Note that the unbiased covariance estimator Su = n/(n − 1) S is not the MLE!
Ostap Okhrin 346 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (Linear Regression Model)
Consider the linear regression model yi = β>xi + εi, i = 1, . . . , n, with εi i.i.d. N(0, σ²) and xi ∈ Rp. Here θ = (β>, σ) is a (p + 1)-dimensional parameter vector. Denote

y = (y1, . . . , yn)>, X = (x1, . . . , xn)>.

Then

L(y; θ) = ∏_{i=1}^n {√(2π)σ}⁻¹ exp{−(yi − β>xi)²/(2σ²)}

and
Ostap Okhrin 347 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (Linear Regression Model - cont'd)

ℓ(y; θ) = log{(2π)^{−n/2}σ^{−n}} − (1/(2σ²)) ∑_{i=1}^n (yi − β>xi)²
= −(n/2) log(2π) − n log σ − (1/(2σ²))(y − Xβ)>(y − Xβ)
= −(n/2) log(2π) − n log σ − (1/(2σ²))(y>y + β>X>Xβ − 2β>X>y)

Differentiating w.r.t. the parameters yields

∂ℓ/∂β = −(1/(2σ²))(2X>Xβ − 2X>y)

∂ℓ/∂σ = −n/σ + (1/σ³)(y − Xβ)>(y − Xβ).
Ostap Okhrin 348 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (Linear Regression Model - cont'd)
∂ℓ/∂β is the vector of the derivatives w.r.t. all components of β (the gradient). Since the first equation depends only on β, we start by deriving β̂:

X>Xβ̂ = X>y =⇒ β̂ = (X>X)⁻¹X>y

Now we plug β̂ into the second equation, which gives

n/σ̂ = (1/σ̂³)(y − Xβ̂)>(y − Xβ̂) =⇒ σ̂² = (1/n)||y − Xβ̂||²,

with || • || denoting the Euclidean vector norm.
Ostap Okhrin 349 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (Linear Regression Model - cont'd)
We see that the MLE β̂ is identical to the least squares estimator. The variance estimator

σ̂² = (1/n) ∑_{i=1}^n (yi − β̂>xi)²

is the residual sum of squares (RSS), generalized to the case of multivariate xi and divided by n.
Ostap Okhrin 350 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Example (Linear Regression Model - cont'd)
Note that in a fixed design situation, where the xi are considered fixed, we have

E(y) = Xβ and Var(y) = σ²In.

Then, using the properties of moments, we have

E(β̂) = (X>X)⁻¹X> E(y) = β, Var(β̂) = σ²(X>X)⁻¹.
Ostap Okhrin 351 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Summary: Likelihood Function
 If {xi}ni=1 is an i.i.d. sample from a distribution with pdf f(x; θ), then L(X; θ) = ∏_{i=1}^n f(xi; θ) is the likelihood function.
 The maximum likelihood estimator (MLE) is the value of θ which maximizes L(X; θ). Equivalently, one can maximize the log-likelihood ℓ(X; θ).
Ostap Okhrin 352 of 461
Angewandte Multivariate Statistik Theory of Estimation The Likelihood Function
Summary: Likelihood Function
 The MLE's of µ, Σ from a Np(µ, Σ) distribution are µ̂ = x̄ and Σ̂ = S. Note that the MLE for Σ is not unbiased.
 The MLE's in a linear model y = Xβ + ε, ε ∼ Nn(0, σ²I) are given by the least squares estimator β̂ = (X>X)⁻¹X>y and σ̂² = (1/n)||y − Xβ̂||². E(β̂) = β and Var(β̂) = σ²(X>X)⁻¹.
Ostap Okhrin 353 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Cramer-Rao Lower bound
 One typical property we want for an estimator is unbiasedness: E(θ̂) = θ. (x̄ is an unbiased estimator of µ, and S is a biased estimator of Σ in finite samples.)
 We look for an unbiased estimator with the smallest possible variance.
 The Cramer-Rao lower bound achieves this, and it provides the asymptotic optimality property of maximum likelihood estimators.
 The Cramer-Rao theorem involves the score function and its properties, which are derived first.
Ostap Okhrin 354 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Score Function and Fisher Information
The score function is

s(X; θ) = ∂ℓ(X; θ)/∂θ.

The covariance matrix Fn = Var{s(X; θ)} is called the Fisher information matrix.
Ostap Okhrin 355 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Example (Score Function and Fisher Information)
Suppose that X ∼ Np(θ, I). Then

s(X; θ) = ∂ℓ(X; θ)/∂θ = −(1/2) ∂/∂θ ∑_{i=1}^n (xi − θ)>(xi − θ) = n(x̄ − θ),

hence the information matrix is Fn = Var{n(x̄ − θ)} = nIp.
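The identity Fn = −E{∂²ℓ/∂θ∂θ>} (see the next slide) is easy to check here for p = 1: the second derivative of ℓ for an N(θ, 1) sample is exactly −n, so a finite-difference estimate recovers the Fisher information n (sample values illustrative):

```python
import random

random.seed(11)
n = 40
xs = [random.gauss(0.3, 1.0) for _ in range(n)]

def loglik(theta):          # log-likelihood up to an additive constant
    return -0.5 * sum((x - theta) ** 2 for x in xs)

h, t0 = 1e-4, 0.0
second = (loglik(t0 + h) - 2.0 * loglik(t0) + loglik(t0 - h)) / h ** 2
print(round(-second))  # 40, the Fisher information F_n = n
```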
Ostap Okhrin 356 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Theorem
If s = s(X; θ) is the score function and if θ̂ = t = t(X, θ) is any function of X and θ, then under regularity conditions

E(st>) = ∂E(t>)/∂θ − E(∂t>/∂θ).

Corollary
If s = s(X; θ) is the score function, and θ̂ = t = t(X) is any unbiased estimator of θ (i.e., E(t) = θ), then

E(st>) = Cov(s, t) = Ik.
Ostap Okhrin 357 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Note that the score function has mean zero:

E{s(X; θ)} = 0.

Hence, E(ss>) = Var(s) = Fn and it follows that

Fn = −E{∂²ℓ(X; θ)/∂θ∂θ>}.

Remark
If x1, · · · , xn are i.i.d., then Fn = nF1, where F1 is the Fisher information matrix for sample size n = 1.
Ostap Okhrin 358 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
All estimators which are unbiased and attain the Cramer-Rao lower bound are minimum variance estimators.

Theorem (Cramer-Rao)
If θ̂ = t = t(X) is any unbiased estimator for θ, then under regularity conditions

Var(t) ≥ Fn⁻¹,

where

Fn = E{s(X; θ)s(X; θ)>} = Var{s(X; θ)}

is the Fisher information matrix.
Ostap Okhrin 359 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Proof.
Consider the correlation ρY,Z between Y and Z, where Y = a>t, Z = c>s, s is the score, and a, c ∈ Rp. By the Corollary, Cov(s, t) = I and thus

Cov(Y, Z) = a> Cov(t, s)c = a>c
Var(Z) = c> Var(s)c = c>Fnc.

Hence,

ρ²Y,Z = Cov²(Y, Z)/{Var(Y) Var(Z)} = (a>c)²/{a> Var(t)a · c>Fnc} ≤ 1.
Ostap Okhrin 360 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
cont'd.
In particular, this holds for any c ≠ 0. Therefore it also holds for the maximum of the left-hand side with respect to c. Since

max_c c>aa>c/(c>Fnc) = max_{c>Fnc=1} c>aa>c

and

max_{c>Fnc=1} c>aa>c = a>Fn⁻¹a
Ostap Okhrin 361 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
By the maximization theorem in the chapter on Matrix Algebra we have

a>Fn⁻¹a/{a> Var(t)a} ≤ 1 for all a ∈ Rp, a ≠ 0,

i.e.,

a>{Var(t) − Fn⁻¹}a ≥ 0 for all a ∈ Rp, a ≠ 0,

which is equivalent to Var(t) ≥ Fn⁻¹.
Ostap Okhrin 362 of 461
Angewandte Multivariate Statistik Theory of Estimation Cramer-Rao Lower Bound
Asymptotic Sampling Distribution of the MLE
Maximum likelihood estimators (MLE's) attain the lower bound as the sample size n goes to infinity. The next theorem states this and, in addition, gives the asymptotic sampling distribution of the maximum likelihood estimator, which turns out to be multinormal.
Theorem
Suppose that the sample {xi}ni=1 is i.i.d. If θ̂ is the MLE for θ ∈ Rk, i.e., θ̂ = arg maxθ L(X ; θ), then under some regularity conditions, as n → ∞:
√n(θ̂ − θ) L−→ Nk(0, F1−1),
where F1 denotes the Fisher information for sample size n = 1.
As a consequence we see that (under regularity conditions) the MLE is asymptotically unbiased, efficient (minimum variance) and normally distributed.
It follows that asymptotically
n(θ̂ − θ)>F1(θ̂ − θ) L−→ χ²p.
If F̂1 is a consistent estimator of F1, then also
n(θ̂ − θ)>F̂1(θ̂ − θ) L−→ χ²p.
This expression is useful for testing hypotheses about θ and for constructing confidence regions for θ in a very general setup. It is clear that
P(n(θ̂ − θ)>F̂1(θ̂ − θ) ≤ χ²1−α;p) ≈ 1 − α,
where χ²ν;p denotes the ν-quantile of a χ²p random variable. So the ellipsoid {θ ∈ Rp : n(θ̂ − θ)>F̂1(θ̂ − θ) ≤ χ²1−α;p} provides an asymptotic (1 − α)-confidence region for θ.
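For a scalar parameter this Wald-type region reduces to an interval. A minimal sketch (my own example, not from the slides) for a Bernoulli parameter θ, with F̂1 = 1/{θ̂(1 − θ̂)} and the table value χ²0.95;1 ≈ 3.841 hardcoded as an assumption:

```python
# Asymptotic Wald confidence interval for a Bernoulli parameter theta:
# n (theta_hat - theta)^2 * F1_hat <= chi2 quantile, F1_hat = 1/{th(1-th)}.
CHI2_95_DF1 = 3.841  # chi^2_{0.95;1}, table value (assumption)

def wald_interval(successes: int, n: int) -> tuple:
    th = successes / n                      # MLE theta_hat = sample proportion
    f1_hat = 1.0 / (th * (1.0 - th))        # estimated Fisher information F1
    half = (CHI2_95_DF1 / (n * f1_hat)) ** 0.5
    return th - half, th + half

lo, hi = wald_interval(30, 100)             # theta_hat = 0.3
print(lo, hi)                               # roughly (0.210, 0.390)
```

Solving n(θ̂ − θ)²F̂1 ≤ χ²0.95;1 for θ gives exactly the half-width used in the code, √{χ²0.95;1/(nF̂1)}.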
Summary: Cramer-Rao Lower Bound

 The score function is the derivative s(X ; θ) = ∂ℓ(X ; θ)/∂θ of the log-likelihood with respect to θ. The covariance matrix of s(X ; θ) is the Fisher information matrix.
 Any unbiased estimator θ̂ = t = t(X ) has a variance that is bounded from below by the inverse of the Fisher information. Thus, an estimator which attains this lower bound is a minimum variance estimator.
 MLEs attain the lower bound in an asymptotic sense, i.e.,
√n(θ̂ − θ) L−→ N(0, F1−1)
if θ̂ is the MLE, θ̂ = arg maxθ L(X ; θ).
Angewandte Multivariate Statistik Hypothesis Testing Likelihood Ratio Test
Likelihood Ratio Test

Suppose that the distribution of {xi}ni=1, xi ∈ Rp, depends on a parameter vector θ. Then we test
H0 : θ ∈ Ω0
H1 : θ ∈ Ω1.
The hypothesis H0 corresponds to the “reduced model” and H1 to the “full model”.
Example
Xi ∼ Np(θ, I)
H0 : θ = θ0
H1 : no constraints for θ,
or equivalently Ω0 = {θ0}, Ω1 = Rp.
Likelihood Ratio
Define L∗j = maxθ∈Ωj L(X ; θ), the maximum of the likelihood under each of the hypotheses. The likelihood ratio is
λ(X ) = L∗0/L∗1.
Likelihood Ratio Test
Rejection region: R = {x : λ(x) < c}, where c is chosen such that
supθ∈Ω0 Pθ(x ∈ R) = α.
Theorem (Wilks)
If Ω1 ⊂ Rq is a q-dimensional space and if Ω0 ⊂ Ω1 is an r-dimensional subspace, then under regularity conditions, for n → ∞:
∀ θ ∈ Ω0 : −2 log λ L−→ χ²q−r.
Test problem 1
X1, . . . , Xn i.i.d. with Xi ∼ Np(µ, Σ)
H0 : µ = µ0, Σ known; H1 : no constraints.
Ω0 = {µ0}, r = 0; Ω1 = Rp, q = p
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n(x̄ − µ0)>Σ−1(x̄ − µ0)
−2 log λ ∼ χ²p
Rejection region R: reject H0 if −2 log λ > χ²0.95;p.
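A small sketch of Test Problem 1 with hypothetical numbers (p = 2, all of n, Σ and x̄ − µ0 invented for illustration), using a hand-coded 2 × 2 inverse:

```python
# -2 log lambda = n (xbar - mu0)^T Sigma^{-1} (xbar - mu0) for p = 2,
# with Sigma known; compare against chi^2_{0.95;2} = 5.991.
def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(d, m_inv):
    """Quadratic form d^T m_inv d for a 2-vector d."""
    t0 = m_inv[0][0] * d[0] + m_inv[0][1] * d[1]
    t1 = m_inv[1][0] * d[0] + m_inv[1][1] * d[1]
    return d[0] * t0 + d[1] * t1

n = 100
sigma = [[2.0, 1.0], [1.0, 2.0]]            # known covariance (assumed)
d = [0.3, -0.6]                             # xbar - mu0 (assumed)
stat = n * quad_form(d, inv2(sigma))
print(stat)                                 # 42.0 here, > 5.991: reject H0
```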
Example (Bank Data)
µ0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)>
x̄ = (214.8, 130.3, 130.2, 10.5, 11.1, 139.4)>
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n(x̄ − µ0)>Σ−1(x̄ − µ0) = 7362.32
The LR test statistic −2 log λ ∼ χ²6 is highly significant.
Test problem 2
Xi ∼ Np(µ, Σ) i.i.d.
H0 : µ = µ0, Σ unknown; H1 : no constraints.
Under H0 it can be shown that
ℓ∗0 = ℓ(µ0, S + dd>), d = (x̄ − µ0),
and under H1 we have
ℓ∗1 = ℓ(x̄, S).
This leads to
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n log(1 + d>S−1d). (16)
Test problem 2 cont'd
Note that this statistic depends on (n − 1)d>S−1d, which has, under H0, a Hotelling T²-distribution. Therefore,
(n − 1)(x̄ − µ0)>S−1(x̄ − µ0) ∼ T²(p, n − 1), (17)
or equivalently
{(n − p)/p}(x̄ − µ0)>S−1(x̄ − µ0) ∼ Fp,n−p.
So the rejection region may be defined as
{(n − p)/p}(x̄ − µ0)>S−1(x̄ − µ0) > F1−α;p,n−p.
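The T²-to-F conversion in (17) can be sketched with made-up numbers (n = 10, p = 2; the sample covariance S and the difference d = x̄ − µ0 are assumptions for illustration):

```python
# Hotelling's T^2 = (n-1) d^T S^{-1} d and its exact F transform
# F = {(n-p)/p} d^T S^{-1} d, for d = xbar - mu0.
def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

n, p = 10, 2
S = [[1.0, 0.5], [0.5, 1.0]]                # sample covariance (assumed)
d = [0.2, 0.1]                              # xbar - mu0 (assumed)

Si = inv2(S)
q = sum(d[i] * sum(Si[i][j] * d[j] for j in range(2)) for i in range(2))
T2 = (n - 1) * q                            # ~ T^2(p, n-1) under H0
F = (n - p) / p * q                         # ~ F_{p, n-p} under H0
print(T2, F)
```

The two statistics are proportional; only the reference distribution changes.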
Test problem 2 cont'd
Alternatively we have, under H0, the asymptotic distribution
−2 log λ L−→ χ²p,
leading to the rejection region
n log{1 + (x̄ − µ0)>S−1(x̄ − µ0)} > χ²1−α;p.
Confidence region for µ
Since {(n − p)/p}(x̄ − µ)>S−1(x̄ − µ) ∼ Fp,n−p,
{µ ∈ Rp | (µ − x̄)>S−1(µ − x̄) ≤ {p/(n − p)} F1−α;p,n−p}
is a confidence region at level (1 − α) for µ; it is the interior of an iso-distance ellipsoid in Rp.
When p is large, ellipsoids are difficult to handle in practice. One is thus interested in finding confidence intervals for µ1, µ2, . . . , µp such that the simultaneous confidence over all intervals reaches the desired level, say 1 − α.
Simultaneous Confidence Intervals for a>µ

An obvious confidence interval for a given a>µ is
|√(n − 1)(a>µ − a>x̄)/√(a>Sa)| ≤ t1−α/2;n−1,
or equivalently
t²(a) = (n − 1){a>(µ − x̄)}²/(a>Sa) ≤ F1−α;1,n−1,
which provides the (1 − α) confidence interval for a>µ:
a>x̄ − √{F1−α;1,n−1 a>Sa/(n − 1)} ≤ a>µ ≤ a>x̄ + √{F1−α;1,n−1 a>Sa/(n − 1)}.
Using the theorem on the maximum of quadratic forms we see that
maxa t²(a) = (n − 1)(x̄ − µ)>S−1(x̄ − µ) ∼ T²(p, n − 1),
which implies that simultaneous confidence intervals for all possible linear combinations a>µ, a ∈ Rp, of the elements of µ are given by
(a>x̄ − √(Kα a>Sa), a>x̄ + √(Kα a>Sa)),
where Kα = {p/(n − p)} F1−α;p,n−p.
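A sketch of one such simultaneous interval (all inputs hypothetical; the quantile F0.95;2,98 ≈ 3.09 is a hardcoded approximate table value, an assumption rather than a computed quantity):

```python
# Scheffe-type simultaneous confidence interval for a^T mu:
# a^T xbar +- sqrt(K_alpha * a^T S a), K_alpha = {p/(n-p)} F_{1-alpha;p,n-p}.
n, p = 100, 2
F_95 = 3.09                                 # approx F_{0.95;2,98} (table value, assumption)
K = p / (n - p) * F_95

xbar = [5.0, 3.0]                           # sample mean (assumed)
S = [[4.0, 1.0], [1.0, 2.0]]                # sample covariance (assumed)
a = [1.0, 0.0]                              # picks out mu_1

aSa = sum(a[i] * sum(S[i][j] * a[j] for j in range(2)) for i in range(2))
centre = sum(a[i] * xbar[i] for i in range(2))
half = (K * aSa) ** 0.5
lo, hi = centre - half, centre + half
print(lo, hi)
```

The same K covers every choice of a simultaneously, which is what distinguishes these intervals from the one-at-a-time t-intervals above.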
Example
The 95% confidence region for µf, the mean of the forged banknotes, is given by the ellipsoid
{µ ∈ R6 | (µ − x̄f)>Sf−1(µ − x̄f) ≤ (6/94) F0.95;6,94}.
The 95% simultaneous c.i. are given by (using F0.95;6,94 = 2.1966)
214.692 ≤ µ1 ≤ 214.954
130.205 ≤ µ2 ≤ 130.395
130.082 ≤ µ3 ≤ 130.304
10.108 ≤ µ4 ≤ 10.952
10.896 ≤ µ5 ≤ 11.370
139.242 ≤ µ6 ≤ 139.658
Example (cont'd)
Comparison with µ0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)> shows that almost all components (except the first one) are responsible for the rejection of µ0.
In addition, choosing e.g. a> = (0, 0, 0, 1, −1, 0) gives the c.i. −1.211 ≤ µ4 − µ5 ≤ 0.005, which shows that for the forged bills the lower border is essentially smaller than the upper border.
Test problem 3
Xi ∼ Np(µ, Σ)
H0 : Σ = Σ0, µ unknown; H1 : no constraints.
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n tr(Σ0−1S) − n log |Σ0−1S| − np
−2 log λ → χ²m, m = ½ p(p + 1)
Example (US companies data)
S = 10⁷ × [ 1.6635 1.2410
            1.2410 1.3747 ]  (energy sector)
We want to test if
Var(X1, X2)> = 10⁷ × [ 1.2248 1.1425
                       1.1425 1.5112 ] = Σ0
(where Σ0 is the covariance matrix of the manufacturing sector).
The LR test statistic −2 log λ = 2.7365 is not significant for χ²3. Hence, we do not reject the null hypothesis H0 and we cannot conclude that Σ ≠ Σ0.
Test problem 4
Yi ∼ N1(β>xi, σ²), xi ∈ Rp
H0 : β = β0, σ² unknown; H1 : no constraints.
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n log(||y − Xβ0||² / ||y − X β̂||²) −→ χ²p
Recall
F = {(n − p)/p} (||y − Xβ0||²/||y − X β̂||² − 1) ∼ Fp,n−p.
Example (Classic blue pullover example)
(α, β)> = (211, 0)>
y = (y1, . . . , y10)> = (x1,1, . . . , x10,1)>,  X = [ 1 x1,2
                                                       ⋮  ⋮
                                                       1 x10,2 ].
The test statistic for the LR test is −2 log λ = 4.55, which is not significant under the χ²2 distribution. However, the exact F-test statistic F = 5.93 is significant under the F2,8 distribution (F2,8;0.95 = 4.46).
Summary: Hypothesis Testing

 The hypotheses H0 : θ ∈ Ω0 against H1 : θ ∈ Ω1 can be tested by means of the likelihood ratio test (LRT).
 The likelihood ratio (LR) is the quotient λ(X ) = L∗0/L∗1, where the L∗j are the maxima of the likelihood under each of the hypotheses.
 The test statistic in the LRT is λ(X ) or, equivalently, its logarithm log λ(X ).
 If Ω1 is q-dimensional and Ω0 ⊂ Ω1 is r-dimensional, then the asymptotic distribution of −2 log λ is χ²q−r. This allows H0 to be tested against H1 by calculating the test statistic −2 log λ = 2(ℓ∗1 − ℓ∗0), where ℓ∗j = log L∗j.
 The hypothesis H0 : µ = µ0 for X ∼ Np(µ, Σ), Σ known, leads to −2 log λ = n(x̄ − µ0)>Σ−1(x̄ − µ0) ∼ χ²p.
 The hypothesis H0 : µ = µ0 for X ∼ Np(µ, Σ), Σ unknown, leads to −2 log λ = n log{1 + (x̄ − µ0)>S−1(x̄ − µ0)} −→ χ²p, and (n − 1)(x̄ − µ0)>S−1(x̄ − µ0) ∼ T²(p, n − 1).
Summary: Hypothesis Testing
The hypothesis H0 : Σ = Σ0 for X ∼ Np(µ,Σ), µ unknown, leadsto −2 log λ = n tr
(Σ−1
0 S)− n log |Σ−1
0 S| − np −→ χ2m, m =
12p(p + 1).
The hypothesis H0 : β = β0 for Yi ∼ N1(β>xi , σ2), σ2 unknown,
leads to −2 log λ = n2 log
(||y−Xβ0||2
||y−X β||2
)−→ χ2
p.
Angewandte Multivariate Statistik Hypothesis Testing Linear Hypothesis
Linear Hypothesis

We present a general procedure which allows a linear hypothesis to be tested. Linear hypotheses are of the form Aµ = a with known matrices A(q × p) and a(q × 1), where q ≤ p.

Example
Suppose that X1 ∼ N(µ1, σ) and X2 ∼ N(µ2, σ) are independent and that we want to test the hypothesis H0 : µ1 = µ2. This can be written as the linear hypothesis
H0 : Aµ = (1 −1)(µ1, µ2)> = 0.
Test problem 5
Xi ∼ Np(µ, Σ)
H0 : Aµ = a, Σ known; H1 : no constraints.
The results of Test Problems 1 and 2 can be used directly on µy, the mean of Yi = AXi. Indeed Yi ∼ Nq(µy, Σy), where µy = Aµ and Σy = AΣA>. Accordingly we have ȳ = Ax̄, Sy = ASA>, d = Ax̄ − a, and
n(Ax̄ − a)>(AΣA>)−1(Ax̄ − a) ∼ χ²q.
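A sketch of Test Problem 5 for the simplest contrast A = (1, −1), testing µ1 = µ2 with known Σ (all numbers hypothetical); AΣA> then collapses to the scalar σ11 − 2σ12 + σ22:

```python
# -2 log lambda = n (A xbar - a)^T (A Sigma A^T)^{-1} (A xbar - a)
# for A = (1, -1), a = 0: tests H0: mu_1 = mu_2 with known Sigma.
n = 50
sigma = [[2.0, 1.0], [1.0, 2.0]]            # known covariance (assumed)
xbar = [1.5, 1.0]                           # sample mean (assumed)

diff = xbar[0] - xbar[1]                    # A xbar with A = (1, -1)
a_sigma_a = sigma[0][0] - 2 * sigma[0][1] + sigma[1][1]  # A Sigma A^T (scalar)
stat = n * diff**2 / a_sigma_a              # ~ chi^2_1 under H0
print(stat)                                 # 6.25 > 3.841: reject mu_1 = mu_2
```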
Example
We consider hypotheses on the partitioned mean µ = (µ1, µ2)>:
H0 : µ1 = µ2, H1 : no constraints,
for N2p((µ1, µ2)>, (Σ 0; 0 Σ)) with known Σ.
This is equivalent to A = (Ip, −Ip), a = (0, . . . , 0)> ∈ Rp, and leads to
−2 log λ = n(x̄1 − x̄2)>(2Σ)−1(x̄1 − x̄2) ∼ χ²p.
Example
Another example is the test whether µ1 = 0, i.e.,
H0 : µ1 = 0, H1 : no constraints,
for N2p((µ1, µ2)>, (Σ 0; 0 Σ)) with known Σ.
This is equivalent to Aµ = a with A = (I, 0), a = (0, . . . , 0)> ∈ Rp. Hence
−2 log λ = n x̄1>Σ−1x̄1 ∼ χ²p.
Test problem 6
Xi ∼ Np(µ, Σ)
H0 : Aµ = a, Σ unknown; H1 : no constraints.
Example
Consider the bank data set and test if µ4 = µ5, i.e., if the lower border mean equals the upper border mean for the forged bills:
A = (0 0 0 1 −1 0), a = 0.
The test statistic is
99(Ax̄)>(ASfA>)−1(Ax̄) ∼ T²(1, 99) = F1,99.
The observed value is 13.638, which is significant.
Repeated Measurements

Frequently, n independent sampling units are observed under p different experimental conditions (different treatments, . . . ): X1, . . . , Xn are i.i.d. with Xi ∼ Np(µ, Σ), given p repeated measures.
The hypothesis of interest in that case is that there are no treatment effects, H0 : µ1 = µ2 = . . . = µp. This hypothesis is a direct application of Test Problem 6:
H0 : Cµ = 0, where C((p − 1) × p) = [ 1 −1  0 · · ·  0
                                      0  1 −1 · · ·  0
                                      ⋮             ⋮
                                      0 · · ·  0  1 −1 ]
Note that in many cases one of the experimental conditions is the “control” (a placebo, standard drug or reference condition). In this case,
C((p − 1) × p) = [ 1 −1  0 · · ·  0
                   1  0 −1 · · ·  0
                   ⋮             ⋮
                   1  0  0 · · · −1 ]
The null hypothesis will be rejected when
{(n − p + 1)/(p − 1)} x̄>C>(CSC>)−1Cx̄ > F1−α;p−1,n−p+1.
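Both families of contrast matrices can be generated programmatically. A small sketch (the helper names `successive_diff` and `vs_control` are my own, and p = 4 is an arbitrary choice) that builds them and checks the defining property C1p = 0:

```python
# Build the (p-1) x p contrast matrices used for repeated measurements:
# successive differences, and comparisons against the first (control) condition.
def successive_diff(p):
    """Row i contrasts condition i with condition i+1."""
    return [[1 if j == i else -1 if j == i + 1 else 0 for j in range(p)]
            for i in range(p - 1)]

def vs_control(p):
    """Row i contrasts the control (first) condition with condition i+1."""
    return [[1 if j == 0 else -1 if j == i + 1 else 0 for j in range(p)]
            for i in range(p - 1)]

C1 = successive_diff(4)   # rows (1,-1,0,0), (0,1,-1,0), (0,0,1,-1)
C2 = vs_control(4)        # rows (1,-1,0,0), (1,0,-1,0), (1,0,0,-1)
print(C1)
print(C2)
# every row is a contrast: C 1_p = 0
assert all(sum(row) == 0 for row in C1 + C2)
```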
Simultaneous confidence intervals for linear combinations of the mean of Yi have already been derived. For all a ∈ Rp−1, with probability (1 − α) we have
a>Cµ ∈ a>Cx̄ ± √[{(p − 1)/(n − p + 1)} F1−α;p−1,n−p+1 a>CSC>a].
The row sums of C are zero: C1p = 0; therefore a>C is a vector whose elements sum to zero. Such a vector is called a contrast.
Let b = C>a; then b>1p = Σpj=1 bj = 0, so the result above provides, for all contrasts b>µ of µ, simultaneous confidence intervals at level (1 − α):
b>µ ∈ b>x̄ ± √[{(p − 1)/(n − p + 1)} F1−α;p−1,n−p+1 b>Sb].
Contrasts are e.g. b> = (1, −1, 0, 0), (1, 0, 0, −1), (1, −1/3, −1/3, −1/3).
Example
40 children were randomly chosen and then followed from grade level 8 to 11; the scores were obtained from a test of their vocabulary.
x̄> = (1.086, 2.544, 2.851, 3.420)
S = [ 2.902 2.438 2.963 2.183
      2.438 3.049 2.775 2.319
      2.963 2.775 4.281 2.939
      2.183 2.319 2.939 3.162 ]
Example (cont'd)
The matrix C providing successive differences of the µj is
C = [ 1 −1  0  0
      0  1 −1  0
      0  0  1 −1 ].
The test statistic is Fobs = 53.134, which is significant for F3,37. We have the following simultaneous 95% confidence intervals:
−1.958 ≤ µ1 − µ2 ≤ −0.959
−0.949 ≤ µ2 − µ3 ≤ 0.335
−1.171 ≤ µ3 − µ4 ≤ 0.036
Example (cont'd)
The rejection of H0 is mainly due to the difference between the first- and second-year performance of the children. The following confidence intervals for further contrasts may also be of interest:
−2.283 ≤ µ1 − (µ2 + µ3 + µ4)/3 ≤ −1.423
−1.777 ≤ (µ1 + µ2 + µ3)/3 − µ4 ≤ −0.742
−1.479 ≤ µ2 − µ4 ≤ −0.272
i.e., µ1 is different from the average of the 3 other years, and µ4 turns out to be better than µ2.
Test Problem 7
Suppose Y1, . . . , Yn independent with Yi ∼ N1(β>xi, σ²), xi ∈ Rp.
H0 : Aβ = a, σ² unknown; H1 : no constraints.
The constrained maximum likelihood estimators under H0 are
β̃ = β̂ − (X>X)−1A>{A(X>X)−1A>}−1(Aβ̂ − a)
for β and σ̃² = (1/n)(y − X β̃)>(y − X β̃); β̂ denotes the unconstrained MLE as before. The LR statistic is
−2 log λ = 2(ℓ∗1 − ℓ∗0) = n log(||y − X β̃||² / ||y − X β̂||²) −→ χ²q.
Example (“classic blue” pullovers)
Let us test if β = 0 in the regression of sales on prices. This is equivalent to
β = 0 ←→ (0 1)(α, β)> = 0.
The LR statistic here is
−2 log λ = 0.142,
which is not significant for the χ²1 distribution. The F-test statistic
F = 0.231
is also not significant.
Example (“classic blue” pullovers cont'd)
We can thus assume that sales do not depend on prices (alone).
Multivariate regression in the “classic blue” pullovers example: parameter estimates in the model
X1 = α + β1X2 + β2X3 + β3X4 + ε
are
α̂ = 65.670, β̂1 = −0.216, β̂2 = 0.485, β̂3 = 0.844.
Let us now test the hypothesis
H0 : β1 = −½ β2.
Example (“classic blue” pullovers cont'd)
This is equivalent to
(0 1 ½ 0)(α, β1, β2, β3)> = 0.
The LR statistic in this case is equal to
−2 log λ = 0.006,
and the F statistic is
F = 0.007.
Hence, in both cases we do not reject the hypothesis.
Test Problem 8 (Comparison of two means)
Suppose Xi1 ∼ Np(µ1, Σ), i = 1, . . . , n1, and Xj2 ∼ Np(µ2, Σ), j = 1, . . . , n2, all the variables being independent.
H0 : µ1 = µ2, H1 : no constraints.
Both samples provide the statistics x̄k and Sk, k = 1, 2. Let δ = µ1 − µ2; we have
(x̄1 − x̄2) ∼ Np(δ, {(n1 + n2)/(n1n2)} Σ),
n1S1 + n2S2 ∼ Wp(Σ, n1 + n2 − 2).
The rejection region will thus be given by
{n1n2(n1 + n2 − p − 1)}/{p(n1 + n2)²} (x̄1 − x̄2)>S−1(x̄1 − x̄2) ≥ F1−α;p,n1+n2−p−1.
A (1 − α) · 100% confidence region for δ is given by the ellipsoid centered at (x̄1 − x̄2):
{δ − (x̄1 − x̄2)}>S−1{δ − (x̄1 − x̄2)} ≤ {p(n1 + n2)²}/{(n1 + n2 − p − 1)n1n2} F1−α;p,n1+n2−p−1,
and the simultaneous confidence intervals for all linear combinations a>δ of the elements of δ are given by
a>δ ∈ a>(x̄1 − x̄2) ± √[{p(n1 + n2)²}/{(n1 + n2 − p − 1)n1n2} F1−α;p,n1+n2−p−1 a>Sa].
Example
We want to compare the means of the assets (X1) and of the sales (X2) of the two sectors energy (group 1) and manufacturing (group 2). We have the following statistics: n1 = 15, n2 = 10, p = 2,
x̄1 = (4084, 2580.5)>, x̄2 = (4307.2, 4925.2)>
and
S1 = 10⁷ × [ 1.6635 1.2410
             1.2410 1.3747 ],
S2 = 10⁷ × [ 1.2248 1.1425
             1.1425 1.5112 ],
so that
S = 10⁷ × [ 1.4880 1.2016
            1.2016 1.4293 ].
Example
The observed value of the test statistic is Fobs = 2.7036. Since F0.95;2,22 = 3.4434, the hypothesis of equal means of the two groups is not rejected, although it would be rejected at a less severe level (p-value = 0.0892). The 95% simultaneous confidence intervals for the differences are given by
−4628.6 ≤ µ1a − µ2a ≤ 4182.2
−6662.4 ≤ µ1s − µ2s ≤ 1973.0
Example
Let us compare the vectors of means of the forged and the genuine bank notes. The matrices Sf and Sg were already calculated, and since here nf = ng = 100, S is simply the mean of Sf and Sg: S = ½(Sf + Sg).
x̄g> = (214.97, 129.94, 129.72, 8.305, 10.168, 141.52)
x̄f> = (214.82, 130.3, 130.19, 10.53, 11.133, 139.45)
The test statistic is Fobs = 391.92, which is highly significant for F6,193.
Example
The 95% simultaneous confidence intervals for the differences δj = µgj − µfj, j = 1, . . . , p, are:
−0.0443 ≤ δ1 ≤ 0.3363
−0.5186 ≤ δ2 ≤ −0.1954
−0.6416 ≤ δ3 ≤ −0.3044
−2.6981 ≤ δ4 ≤ −1.7519
−1.2952 ≤ δ5 ≤ −0.6348
1.8072 ≤ δ6 ≤ 2.3268
All the components (except for the first) show a significant difference in the means, the main effects being carried by the lower border (X4) and the diagonal (X6).
Test Problem 9 (Comparison of Covariance Matrices)
Let Xih ∼ Np(µh, Σh), i = 1, . . . , nh; h = 1, . . . , k, all variables being independent.
H0 : Σ1 = Σ2 = · · · = Σk, H1 : no constraints.
Each subsample provides Sh, an estimator of Σh, with
nhSh ∼ Wp(Σh, nh − 1).
Under H0, Σkh=1 nhSh ∼ Wp(Σ, n − k), where Σ is the common covariance matrix and n = Σkh=1 nh. Let S = (n1S1 + · · · + nkSk)/n be the weighted average of the Sh (it is in fact the MLE of Σ when H0 is true). The likelihood ratio test leads to the statistic
−2 log λ = n log |S| − Σkh=1 nh log |Sh|,
which under H0 is approximately distributed as χ²m, where m = ½(k − 1)p(p + 1).
Example
Coming back to the US companies data, where the means of assets and sales have been compared for companies from the energy and manufacturing sectors, the test Σ1 = Σ2 leads to the value of the test statistic
−2 log λ = 0.9076,
which is not significant (p-value under χ²3: 0.82). We cannot reject H0, and the comparison of the means above is valid.
Test Problem 10 (Comparison of two means, unequal covariance matrices, large samples)
Suppose Xi1 ∼ Np(µ1, Σ1), i = 1, . . . , n1, and Xj2 ∼ Np(µ2, Σ2), j = 1, . . . , n2, all the variables being independent.
H0 : µ1 = µ2, H1 : no constraints.
(x̄1 − x̄2) ∼ Np(δ, Σ1/n1 + Σ2/n2).
Therefore,
(x̄1 − x̄2)>(Σ1/n1 + Σ2/n2)−1(x̄1 − x̄2) ∼ χ²p.
Since Si is a consistent estimator of Σi, i = 1, 2, we have
(x̄1 − x̄2)>(S1/n1 + S2/n2)−1(x̄1 − x̄2) → χ²p. (18)
Example
Let us compare the forged and the genuine bank notes again (n1 and n2 are large). The test statistic turns out to be 2436.8, which is highly significant. The 95% simultaneous confidence intervals are now:
−0.0389 ≤ δ1 ≤ 0.3309
−0.5140 ≤ δ2 ≤ −0.2000
−0.6368 ≤ δ3 ≤ −0.3092
−2.6846 ≤ δ4 ≤ −1.7654
−1.2858 ≤ δ5 ≤ −0.6442
1.8146 ≤ δ6 ≤ 2.3194
showing that all the components except the first are different from zero, the largest differences coming from X6 (length of the diagonal) and X4 (lower border).
Profile analysis

 p measures are reported in the same units.
 For instance, measures of blood pressure at p different moments, one group being the control group and the other the group receiving a new treatment.
One is then interested in comparing the profiles of the groups, a profile being just the vector of means of the p responses (the comparison may be visualized in a two-dimensional graph using the parallel coordinate plot).
Figure 19: Population profiles (mean response vs. treatments 1–5, Group 1 and Group 2) MVAprofil
The following questions are of interest:
1) Are the profiles similar in the sense of being parallel (which means no interaction between the treatments and the groups)?
2) If the profiles are parallel, are they at the same level?
3) If the profiles are parallel, is there any treatment effect (are the profiles horizontal)?
The above questions are easily translated into linear constraints on the means, and test statistics are obtained accordingly.
Parallelism

Let C be a ((p − 1) × p) matrix defined as
C = [ 1 −1  0 · · ·  0
      0  1 −1 · · ·  0
      0 · · ·  0  1 −1 ].
The hypothesis to be tested is H0(1) : C(µ1 − µ2) = 0. Under H0(1),
{n1n2/(n1 + n2)²}(n1 + n2 − 2) {C(x̄1 − x̄2)}>(CSC>)−1C(x̄1 − x̄2) ∼ T²(p − 1, n1 + n2 − 2),
where S is the pooled covariance matrix. The hypothesis is rejected if
{n1n2(n1 + n2 − p)}/{(n1 + n2)²(p − 1)} {C(x̄1 − x̄2)}>(CSC>)−1C(x̄1 − x̄2) > F1−α;p−1,n1+n2−p.
Equality of two levels

The question of the equality of the two levels is meaningful only if the two profiles are parallel. In the case of interaction (rejection of H0(1)), the two populations react differently to the treatments and the question of the level has no meaning.
The equality of the two levels is written as
H0(2) : 1p>(µ1 − µ2) = 0.
{n1n2/(n1 + n2)²}(n1 + n2 − 2) {1p>(x̄1 − x̄2)}²/(1p>S1p) ∼ T²(1, n1 + n2 − 2) = F1,n1+n2−2.
The rejection region is thus
{n1n2(n1 + n2 − 2)}/(n1 + n2)² {1p>(x̄1 − x̄2)}²/(1p>S1p) > F1−α;1,n1+n2−2.
Treatment effect

If the parallelism of the profiles has been rejected, then two independent analyses should be done on the two groups using the repeated measurements approach (see above). But if the parallelism is accepted, we can exploit the information contained in both groups (possibly at different levels) to test for a treatment effect, i.e., the horizontality of the two profiles. This may be written as
H0(3) : C(µ1 + µ2) = 0.
It is easy to prove that H0(3) together with H0(1) implies that
C{(n1µ1 + n2µ2)/(n1 + n2)} = 0.
So under parallel, horizontal profiles we have
√(n1 + n2) Cx̄ ∼ Np(0, CΣC>).
We obtain
(n1 + n2 − 2)(Cx̄)>(CSC>)−1Cx̄ ∼ T²(p − 1, n1 + n2 − 2).
This leads to the rejection region of H0(3):
{(n1 + n2 − p)/(p − 1)}(Cx̄)>(CSC>)−1Cx̄ > F1−α;p−1,n1+n2−p.
Example
Wechsler Adult Intelligence Scale (WAIS) for 2 categories of people: group 1 consists of n1 = 37 people who do not present a senile factor, group 2 of the n2 = 12 people presenting a senile factor. The four WAIS subtests are X1 (information), X2 (similarities), X3 (arithmetic) and X4 (picture completion). The relevant statistics are
x̄1> = (12.57, 9.57, 11.49, 7.97)
x̄2> = (8.75, 5.33, 8.50, 4.75)
Example
S1 = [ 11.164  8.840  6.210  2.020
        8.840 11.759  5.778  0.529
        6.210  5.778 10.790  1.743
        2.020  0.529  1.743  3.594 ]
S2 = [  9.688  9.583  8.875  7.021
        9.583 16.722 11.083  8.167
        8.875 11.083 12.083  4.875
        7.021  8.167  4.875 11.688 ]
Example
The test statistic for testing the parallelism of the two profiles is Fobs = 0.4634, which is not significant (p-value = 0.71), so we can accept the parallelism.
The second test (equality of the levels of the 2 profiles) gives Fobs = 17.2146, which is highly significant (p-value ≈ 10⁻⁴): the global level of the test for the non-senile people is superior to that of the senile group.
Finally, the third test (horizontality of the average profile) gives Fobs = 53.317, which is also highly significant (p-value ≈ 10⁻¹⁴). There are significant differences among the means of the different subtests.
Summary: Linear Hypothesis

 Hypotheses about µ can often be written as Aµ = a, with known matrix A and vector a.
 The hypothesis H0 : Aµ = a for X ∼ Np(µ, Σ) with Σ known leads to −2 log λ = n(Ax̄ − a)>(AΣA>)−1(Ax̄ − a) ∼ χ²q, where q is the number of elements in a.
 The hypothesis H0 : Aµ = a for X ∼ Np(µ, Σ) with Σ unknown leads to −2 log λ = n log{1 + (Ax̄ − a)>(ASA>)−1(Ax̄ − a)} −→ χ²q, where q is the number of elements in a, and we have the exact test (n − 1)(Ax̄ − a)>(ASA>)−1(Ax̄ − a) ∼ T²(q, n − 1).
 The hypothesis H0 : Aβ = a for Yi ∼ N1(β>xi, σ²) with σ² unknown leads to −2 log λ = n log(||y − X β̃||²/||y − X β̂||²) −→ χ²q, with q being the length of a, and with
{(n − p)/q} (Aβ̂ − a)>{A(X>X)−1A>}−1(Aβ̂ − a) / {(y − X β̂)>(y − X β̂)} ∼ Fq,n−p.
Angewandte Multivariate Statistik Regression Models
Regression Models

Linear Regression
y = Xβ + ε
X(n × p): explanatory variables; y(n × 1): response.
Example
Let x1, x2 be two factors that explain the variation of the response y:
yi = β0 + β1xi1 + β2xi2 + β3x²i1 + β4x²i2 + β5xi1xi2 + εi, i = 1, . . . , n
X = [ 1 x11 x12 x²11 x²12 x11x12
      1 x21 x22 x²21 x²22 x21x22
      ⋮
      1 xn1 xn2 x²n1 x²n2 xn1xn2 ]
Figure 20: 3-D response surface MVAresponsesurface
Angewandte Multivariate Statistik Regression Models General ANOVA and ANCOVA Models
ANOVA Models

One-factor (p levels) model
yk` = µ + α` + εk`, k = 1, . . . , n`, and ` = 1, . . . , p
Pullover example: p = 3 marketing strategies, y = Xβ + ε with
X = [ 1  1  0
      1  1  0
      1  0  1
      1  0  1
      1 −1 −1
      1 −1 −1 ]
Multiple-Factor Models

Example: 3 marketing strategies, 2 locations

      A1            A2       A3
B1    18 15         5 8 8    10 14
B2    15 20 25 30   10 12    20 25

Table 9: A two-factor ANOVA data set: factor A with three levels of the marketing strategy and factor B with two levels for the location. The figures represent the resulting sales during the same period.
General Two-Factor Model

yijk = µ + αi + γj + (αγ)ij + εijk,
i = 1, . . . , r, j = 1, . . . , s, k = 1, . . . , nij,
with side conditions
Σri=1 αi = 0, Σsj=1 γj = 0,
Σri=1 (αγ)ij = 0, Σsj=1 (αγ)ij = 0.
For the marketing data: r = 3, s = 2. Interactions: (αγ)ij.
Example

 (αγ)11 > 0: the effect of A1 (advertisement in local newspaper) is more successful in location B1 (commercial centre).
 (αγ)31 < 0: A3 (luxury presentation) is less effective in B1 than in B2 (non-commercial centre).
Model without Interactions

y = (18 15 15 20 25 30 5 8 8 10 12 10 14 20 25)>
X = [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 0 0 0 0 0 −1 −1 −1 −1
      0 0 0 0 0 0 1 1 1 1 1 −1 −1 −1 −1
      1 1 −1 −1 −1 −1 1 1 1 −1 −1 1 1 −1 −1 ]>
Model with Interactions

X = [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 0 0 0 0 0 −1 −1 −1 −1
      0 0 0 0 0 0 1 1 1 1 1 −1 −1 −1 −1
      1 1 −1 −1 −1 −1 1 1 1 −1 −1 1 1 −1 −1
      1 1 −1 −1 −1 −1 0 0 0 0 0 −1 −1 1 1
      0 0 0 0 0 0 1 1 1 −1 −1 −1 −1 1 1 ]>
Example

         β̂       p-value
µ       15.25
α1       4.25    0.0218
α2      −6.25    0.0033
γ1      −3.42    0.0139
(αγ)11   0.42    0.7922
(αγ)21   1.42    0.8096

Table 10: The values of β̂ in the full model with interactions for the marketing data (RSSfull = 158)
ANCOVA Models

Regression models in which some of the explanatory variables are qualitative and others are continuous.

Example: Consider the car data and analyse the effect of weight (W) and displacement (D) on the mileage (M). Test whether the origin of the car (C) has an effect on the response, and whether the effect of the continuous variables is the same for the different factor levels.
Example

     β̂        p-value    β̂        p-value
µ    41.0066   0.0000    43.4031   0.0000
W    −0.0073   0.0000    −0.0074   0.0000
D     0.0118   0.2250     0.0081   0.4140
C                        −0.9675   0.1250

Table 11: Estimation of the effects of weight and displacement on the mileage MVAcareffect
Example

        µ̂       p-value    W        p-value    D        p-value
c = 1   40.043   0.0000    −0.0065   0.0000     0.0058   0.3790
c = 2   47.557   0.0005     0.0081   0.3666    −0.3582   0.0160
c = 3   44.174   0.0002     0.0039   0.7556    −0.2650   0.3031

Table 12: Different factor levels on the response MVAcareffect
Ostap Okhrin 444 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Categorical Responses
- The response variable is categorical (qualitative)
- Observe counts y_k for class k = 1, …, K
- Likelihood:

  L = (n! / ∏_{k=1}^K y_k!) ∏_{k=1}^K (m_k / n)^{y_k}

- Idea: make log m_k linear in X
Ostap Okhrin 445 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Two-Way Tables
y_jk is the number of observations in cell (j, k)

Multinomial likelihood:

L = (n! / ∏_{j=1}^J ∏_{k=1}^K y_jk!) ∏_{j=1}^J ∏_{k=1}^K (m_jk / n)^{y_jk}

No interaction:

log m_jk = µ + α_j + γ_k  for j = 1, …, J; k = 1, …, K

In matrix form: log m = Xβ
Ostap Okhrin 446 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Model without Interaction
log m = (log m_11, log m_12, log m_13, log m_21, log m_22, log m_23)⊤

X =
⎡ 1  1  1  0 ⎤
⎢ 1  1  0  1 ⎥
⎢ 1  1 −1 −1 ⎥
⎢ 1 −1  1  0 ⎥
⎢ 1 −1  0  1 ⎥
⎣ 1 −1 −1 −1 ⎦

β = (β_0, β_1, β_2, β_3)⊤
Ostap Okhrin 447 of 461
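For the no-interaction (independence) log-linear model, the ML fitted counts are the product of the table margins divided by n; a short numpy check, using an illustrative 2×3 table of counts (the numbers are assumptions for the example).

```python
import numpy as np

# A small J x K = 2 x 3 table of counts (illustrative numbers).
y = np.array([[20, 30, 10],
              [25, 40, 15]], float)
n = y.sum()

# Under log m_jk = mu + alpha_j + gamma_k (no interaction), the ML
# fitted counts are the product of the margins: m_jk = y_j. * y_.k / n.
m = np.outer(y.sum(axis=1), y.sum(axis=0)) / n
print(np.round(m, 2))
```

The fitted table reproduces both margins and sums to n, so log m̂ lies in the column space of a design matrix of the form shown on the slide.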
Angewandte Multivariate Statistik Regression Models Categorical Responses
Model without Interaction
Log-likelihood:

ℓ(β) = ∑_{j=1}^J ∑_{k=1}^K y_jk log m_jk   s.t.  ∑_{j,k} m_jk = n

α_1 = β_1, α_2 = −β_1

γ_1 = β_2, γ_2 = β_3, γ_3 = −(β_2 + β_3)
Ostap Okhrin 448 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Model with Interactions
log m_jk = µ + α_j + γ_k + (αγ)_jk,  j = 1, …, J; k = 1, …, K

∑_{k=1}^K (αγ)_jk = 0  for j = 1, …, J

∑_{j=1}^J (αγ)_jk = 0  for k = 1, …, K
Ostap Okhrin 449 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Testing with Count Data
y_k: count data;  m̂_k: value predicted by the model

Pearson chi-square:

χ² = ∑_{k=1}^K (y_k − m̂_k)² / m̂_k

Deviance:

G² = 2 ∑_{k=1}^K y_k log(y_k / m̂_k)
Ostap Okhrin 450 of 461
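Both test statistics are one-liners once the fitted counts are available. A numpy sketch on an illustrative 2×2 table (the counts are assumptions), with the independence fit of the margins as the model:

```python
import numpy as np

# Observed counts and model-fitted counts (independence fit of the margins).
y = np.array([[10, 20],
              [30, 40]], float)
n = y.sum()
m = np.outer(y.sum(axis=1), y.sum(axis=0)) / n   # fitted under independence

chi2 = ((y - m) ** 2 / m).sum()                  # Pearson chi-square
g2 = 2 * (y * np.log(y / m)).sum()               # deviance G^2
print(round(chi2, 4), round(g2, 4))
```

For tables without very small cells the two statistics are close, reflecting their common asymptotic χ² distribution.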
Angewandte Multivariate Statistik Regression Models Categorical Responses
Testing with Count Data
Both statistics are asymptotically χ2 distributed
Degrees of freedom
d.f. = # free cells − # free parameters estimated

Test:
H_0: reduced model with r degrees of freedom
H_1: full model with f degrees of freedom
Ostap Okhrin 451 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Testing with Count Data
G²_{H_0} − G²_{H_1} ∼ χ²_{r−f}

Reject H_0 when the p-value

P(χ²_{r−f} > (G²_{H_0} − G²_{H_1})_observed)

is small.
Ostap Okhrin 452 of 461
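The comparison reduces to one χ² tail probability. A short sketch with hypothetical deviances and degrees of freedom (the numbers are assumptions, not from the slides), using `scipy.stats.chi2.sf` for the upper tail:

```python
from scipy.stats import chi2

# Hypothetical deviances: reduced model (r d.f.) vs. full model (f d.f.).
g2_h0, r = 12.3, 8      # reduced model
g2_h1, f = 4.1, 3       # full model

stat = g2_h0 - g2_h1            # asymptotically chi^2 with r - f d.f.
p_value = chi2.sf(stat, df=r - f)
print(round(stat, 2), round(p_value, 4))
```

A small p-value means the reduction in fit caused by dropping parameters is too large to be explained by chance, so the reduced model is rejected.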
Angewandte Multivariate Statistik Regression Models Categorical Responses
Example
2× 2× 5 table of n = 5833 counts on prescribed drugs
Men (M)      A1    A2    A3    A4    A5
  DY         21    32    70    43    19
  DN        683   596   705   295    99

Women (F)    A1    A2    A3    A4    A5
  DY         46    89   169    98    51
  DN        738   700   847   336   196

Table 13: A three-way contingency table: top table for men, bottom table for women. MVAdrug
Ostap Okhrin 453 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Example
β0   intercept           5.0089  |  β10                     0.0205
β1   gender: M          −0.2867  |  β11                     0.0482
β2   drug: DY           −1.0660  |  β12  drug*age          −0.4983
β3   age                −0.0080  |  β13                    −0.1807
β4                       0.2151  |  β14                     0.0857
β5                       0.6607  |  β15                     0.2766
β6                      −0.0463  |  β16  gender*drug*age   −0.0134
β7   gender*drug        −0.1632  |  β17                    −0.0523
β8   gender*age          0.0713  |  β18                    −0.0112
β9                      −0.0092  |  β19                    −0.0102
Ostap Okhrin 454 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Example
β0   intercept      5.0051  |  β8   gender*age    0.0795
β1   gender: M     −0.2919  |  β9                 0.0321
β2   drug: DY      −1.0717  |  β10                0.0265
β3   age           −0.0030  |  β11                0.0534
β4                  0.2358  |  β12  drug*age     −0.4915
β5                  0.6649  |  β13               −0.1576
β6                 −0.0425  |  β14                0.0917
β7   gender*drug   −0.1734  |  β15                0.2822

Table 14: Coefficient estimates based on the saturated model (previous slide) and the ML method (current slide). MVAdrug3waysTab
Ostap Okhrin 455 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Logit Models
p(x_i) = P(y_i = 1 | x_i) = exp(β_0 + ∑_{j=1}^p β_j x_ij) / {1 + exp(β_0 + ∑_{j=1}^p β_j x_ij)}
The log odds ratio is linear:

log [p(x_i) / {1 − p(x_i)}] = β_0 + ∑_{j=1}^p β_j x_ij
Ostap Okhrin 456 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Logit Models
Likelihood function:

L(β_0, β) = ∏_{i=1}^n p(x_i)^{y_i} {1 − p(x_i)}^{1−y_i}

Log-likelihood function:

ℓ(β_0, β) = ∑_{i=1}^n [y_i log p(x_i) + (1 − y_i) log{1 − p(x_i)}]
Ostap Okhrin 457 of 461
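This log-likelihood has no closed-form maximizer, so it is maximized numerically. A self-contained Newton-Raphson sketch on simulated data (the sample size, true coefficients, and seed are assumptions for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data generated from a known logit model (assumed for illustration).
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 1.0, -2.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# Newton-Raphson on l(beta) from the slide:
# gradient X'(y - p), Hessian -X'WX with W = diag(p(1 - p)).
beta = np.zeros(3)
for _ in range(25):
    prob = 1 / (1 + np.exp(-X @ beta))
    W = prob * (1 - prob)
    step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - prob))
    beta += step
    if np.abs(step).max() < 1e-10:
        break
print(np.round(beta, 3))
```

At convergence the score equations X⊤(y − p) = 0 hold, so with an intercept the fitted probabilities average to the observed response rate.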
Angewandte Multivariate Statistik Regression Models Categorical Responses
Example
        β        p-value
β0     3.6042    0.0660
β3    −0.2031    0.0037
β4    −0.0205    0.0183
β5    −1.1841    0.3108

Table 15: Estimation of the effects of the financial characteristics on bank bankruptcy with the logit model. MVAbankrupt
Ostap Okhrin 458 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Summary: Regression Models
In contingency tables, the categories are defined by the qualitative variables.

The saturated model has all of the interaction terms and 0 degrees of freedom.

A non-saturated model is a reduced model, since it fixes some parameters to be zero.
Ostap Okhrin 459 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Summary: Regression Models
Two statistics for testing the reduced model against the full model are:

X² = ∑_{k=1}^K (y_k − m̂_k)² / m̂_k

G² = 2 ∑_{k=1}^K y_k log(y_k / m̂_k)
Ostap Okhrin 460 of 461
Angewandte Multivariate Statistik Regression Models Categorical Responses
Summary: Regression Models
The logit models allow the column categories to be a quantitative variable, and quantify the effect of the column category using fewer parameters and more flexible relationships than just a linear one.

The logit model is equivalent to a log-linear model:

log [p(x_i) / {1 − p(x_i)}] = β_0 + ∑_{j=1}^p β_j x_ij
Ostap Okhrin 461 of 461