Exploratory analysis for high-dimensional extremes: support identification, anomaly detection and clustering, Principal Component Analysis.
Anne Sabourin1
+ many others: Maël Chiapino (PhD), Stephan Clémençon (Télécom Paris), Holger
Drees (U. Hamburg), Vincent Feuillard (Airbus), Nicolas Goix (PhD), Johan Segers
(UC Louvain)
1 Télécom Paris, Institut polytechnique de Paris, France.
Chair Stress test, Ecole Polytechnique, BNPP, 2020/04/02
1/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
1/47
Motivation(s)
• Multivariate heavy-tailed random vector X = (X1, . . . , Xd)
(e.g. spatial field (temperature, precipitation), asset (negative) prices, . . . )
• Focus on the distribution of the largest values: Law(X | ‖X‖ > t), t ≫ 1, with P(‖X‖ > t) small.
Possible goals: simulation (stress test), anomaly detection (preprocessing) among extreme values, . . .
• d ≫ 1: modeling Law(X | ‖X‖ > t) is unfeasible.
• Dimension reduction problem(s):
1. Identify the groups of features α ⊂ {1, . . . , d} which may be large together (while the others stay small), given that one of them is large.
2. Identify a single low-dimensional projection subspace V0 such that Law(X | ‖X‖ > t) is approximately concentrated on V0.
2/47
Examples: It cannot rain everywhere at the same time
(daily precipitation)
(air pollutants)
3/47
Applications to risk management
Sensor networks (road traffic, river streamflow, temperature, internet traffic, . . . ) or financial asset prices:
→ extreme event = traffic jam, flood, heatwave, network congestion, falling price
→ question: which groups of sensors / assets are likely to be jointly impacted?
→ how to define alert regions (alert groups of features/components)?
spatial case: one feature = one sensor
4/47
Applications to anomaly detection
• Training step: learn a ‘normal region’ (e.g. an approximate support).
• Prediction step (with new data): anomalies = points outside the ‘normal region’.
If ‘normal’ data are heavy-tailed, Abnormal ⇎ Extreme: there may be extreme ‘normal’ data.
How to distinguish between large anomalies and normal extremes?
5/47
Standardized data
• Random vector X = (X1, . . . , Xd), with Xj ≥ 0.
• Margins: Xj ∼ Fj , 1 ≤ j ≤ d (continuous).
• Preliminary step: standardization (here: to Pareto margins):
Vj = 1 / (1 − Fj(Xj)),  so that P(Vj > v) = 1/v.
• Goal: P(V ∈ A), for A ‘far from 0’?
• Each component j is homogeneous (of order −1): for all t > 0,
P(Vj ∈ tA) / P(Vj > t) = P(Vj ∈ A).
6/47
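As a concrete illustration of this standardization step, here is a minimal Python sketch using empirical ranks in place of the unknown Fj (the function name and toy data are ours; the rank convention F̂j(Xi,j) = (rank(Xi,j) − 1)/n matches the estimation slides):

```python
import numpy as np

def pareto_standardize(X):
    """Rank-transform each column to (approximately) unit-Pareto margins.

    Empirical version of V_j = 1 / (1 - F_j(X_j)), with
    F_hat_j(X_ij) = (rank(X_ij) - 1) / n.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # ranks in 1..n per column (ties have probability zero for continuous margins)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return 1.0 / (1.0 - (ranks - 1) / n)

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 3))   # toy data with arbitrary margins
V = pareto_standardize(X)
# each column now ranges from 1 (smallest point) to n (largest point)
```

With this convention the largest observation of each column is mapped exactly to V = n, so the transformed sample mimics unit-Pareto margins regardless of the original distribution.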
Multivariate extremes: regular variation
• Informally: the marginal homogeneity property remains valid in the multivariate sense.
• A random vector V = (V1, . . . , Vd) ∈ Rd is regularly varying if there exists a limit measure µ such that
P(V ∈ tA) / P(‖V‖ > t) −−−→ µ(A)  as t → ∞,
for A ⊂ Rd with 0 ∉ closure(A) and µ(∂A) = 0.
• Necessarily µ is homogeneous: µ(rA) = r^{−α} µ(A), for some α > 0 (the tail index).
• With Vj = 1/(1 − Fj(Xj)), necessarily α = 1.
7/47
Multivariate extremes: regular variation (Cont’d)
• µ rules the (probabilistic) behaviour of extremes: if A is far from theorigin,
P(V ∈ A) ≈ µ(A)
• Examples: Max stable vectors with standardized margins, multivariatestudents, . . .
• Statistical procedures based on Extreme Value theory: 2 steps.
1. Learn useful features of µ using the k observations V(1), . . . ,V(k) withlargest norm, with k � n number of available data
2. Use the approximation P(V ∈ A) ≈ µ(A) for A far from 0.
8/47
Angular measure
• Homogeneity of µ → polar coordinates are convenient:
r = ‖x‖ (any norm); θ = r^{−1} x.
• Angular measure Φ on the corresponding unit sphere:
Φ(B) = µ{r > 1, θ ∈ B}.
• Then µ decomposes as a product, and only Φ needs to be estimated:
µ{r > t, θ ∈ B} = t^{−α} Φ(B).
9/47
Angular measure
• Φ rules the joint distribution of extremes
• Asymptotic dependence: (V1,V2) may be large together.
vs
• Asymptotic independence: only V1 or V2 may be large.
No assumption on Φ: non-parametric framework.
10/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
10/47
Towards high dimension
• Reasonable hope: only a moderate number of the Vj’s may be simultaneously large → sparse angular measure.
• Our goal, from a multivariate EVT point of view:
Estimate the (sparse) support of the angular measure (i.e. the dependence structure).
Which components may be large together, while the others are small?
11/47
Sparse angular support
Full support: anything may happen. Sparse support: V1 not large if V2 or V3 is large.
Where is the mass?
Subcones of Rd+, for α ⊂ {1, . . . , d}:
Cα = {x ⪰ 0 : xj > 0 for j ∈ α, xj = 0 for j ∉ α, ‖x‖ ≥ 1}.
12/47
Support recovery + representation
• {Cα, α ⊂ {1, . . . , d}}: partition of {x : ‖x‖ ≥ 1}.
• Goal 1: learn the (2^d − 1)-dimensional representation (potentially sparse)
M = ( µ(Cα) )_{α ⊂ {1,...,d}, α ≠ ∅};  support S = {α : µ(Cα) > 0}.
Main interest:
µ(Cα) > 0 ⟺ features j ∈ α may be large together while the others are small.
13/47
Identifying non-empty edges
Issue: real data are non-asymptotic: Vj > 0 for every j.
One cannot just count data on each subcone: only the largest-dimensional one has empirical mass!
14/47
Identifying non-empty edges
Fix ε > 0 and assign each data point ε-close to an edge to that edge:
Cα → Rεα.
→ New partition of the input space, compatible with non-asymptotic data.
15/47
Empirical estimator of µ(Cα)
(Counts the standardized points in Rεα, far from 0.) Data: Xi = (Xi,1, . . . , Xi,d), i = 1, . . . , n.
• Standardize: V̂i,j = 1 / (1 − F̂j(Xi,j)), with F̂j(Xi,j) = (rank(Xi,j) − 1)/n.
• Natural estimator:
µ̂n(Cα) = (n/k) Pn( V̂ ∈ (n/k) Rεα )  →  M̂ = ( µ̂n(Cα), α ⊂ {1, . . . , d} ).
• Estimated support: Ŝ = {α : µ̂n(Cα) > µ0}.
16/47
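The estimator above can be sketched in a few lines; this is a minimal, hypothetical implementation (function name, sup-norm choice, eps value and toy data are ours, not the reference DAMEX code):

```python
import numpy as np
from collections import Counter

def damex_masses(V, k, eps=0.1):
    """Sketch of the mass estimates mu_hat(C_alpha).

    V: (n, d) array with (approximately) unit-Pareto margins.
    A point with sup-norm above n/k is assigned to the subcone alpha of
    coordinates exceeding eps * (n/k); mu_hat(C_alpha) is (n/k) times the
    fraction of points landing in the corresponding rectangle R_eps_alpha.
    """
    n, d = V.shape
    t = n / k                        # radial threshold
    extreme = V[V.max(axis=1) > t]   # points with ||V||_inf > n/k
    counts = Counter()
    for v in extreme:
        alpha = frozenset(np.flatnonzero(v > eps * t))
        counts[alpha] += 1
    return {alpha: (n / k) * c / n for alpha, c in counts.items()}

rng = np.random.default_rng(1)
n = 10_000
# toy data: coordinates 0 and 1 comonotone (large together), coordinate 2 independent
U = rng.uniform(size=(n, 2))
V = np.column_stack([1 / U[:, 0], 1 / U[:, 0], 1 / U[:, 1]])  # Pareto(1) margins
M = damex_masses(V, k=200, eps=0.3)
```

On this toy sample, essentially all the mass lands on the faces {0, 1} and {2}, reflecting that coordinates 0 and 1 are asymptotically dependent while coordinate 2 is extreme on its own.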
Sparsity in real datasets
Data: 50 wave directions from buoys in the North Sea (Shell Research; thanks to J. Wadsworth).
17/47
Finite sample error bound
VC-bound adapted to low-probability regions (see Goix, S., Clémençon, 2015).
Theorem
If the margins Fj are continuous and the density of the angular measure is bounded by M > 0 on each subface (infinity norm), then there is a constant C such that for any n, d, k, δ ≥ e^{−k}, ε ≤ 1/4, with probability ≥ 1 − δ,
max_α |µ̂n(Cα) − µ(Cα)| ≤ Cd ( √( (1/(kε)) log(d/δ) ) + Mdε ) + Bias_{n/k, ε}(F, µ).
Bias: using non-asymptotic data to learn about an asymptotic quantity;
Regular variation ⟺ Bias_{t,ε} → 0 as t → ∞.
• Existing literature (d = 2): rate 1/√k.
• Here: 1/√(kε) + Mdε — the price to pay for biasing the estimator with ε.
OK if kε → ∞ and ε → 0. Choice of ε: cross-validation, or simply ‘ε = 0.1’.
18/47
Tools for the proof
1. VC inequality for small-probability classes (Goix et al., 2015):
→ max deviations ≤ √p × (usual bound).
2. Apply it to the VC-class of rectangles {(k/n) R(x, z, α), x, z ⪰ ε}:
→ p ≤ d kε/n
⇒ sup_α |µ̂n − µ|(Rεα) ≤ Cd √( (1/(εk)) log(d/δ) ).
3. Approximate µ(Cα) with µ(Rεα) → error ≤ Mdε (bounded angular density).
19/47
Algorithm DAMEX (Detecting Anomalies with Multivariate Extremes) (Goix, S., Clémençon, 2016)
Anomaly = new observation ‘violating the sparsity pattern’, i.e. observed in an empty or light subcone.
Scoring function: for x whose standardized version v̂ falls in the rectangle associated with Cα,
sn(x) = (1/‖v̂‖) µ̂n(Rεα) ≃ P(V ∈ Cα, ‖V‖ > ‖x‖)  for large x.
20/47
Extension: feature clustering (Chiapino, S., 2016; Chiapino, S., Segers, 2019 a.)
• Motivating example: river stream-flow dataset, d = 92 gauging stations.
• Typical groups jointly impacted by extreme records include noisy additional features!
→ Empirical µ-mass scattered over many Cα’s
→ No apparent sparsity pattern.
• How to gather ‘close-by’ α’s into feature clusters? → ‘robust’ version of DAMEX: the CLEF algorithm and variants + asymptotic analysis.
21/47
Conclusion I
• Discovering subgroups of components that are likely to be simultaneously large is doable, with an error scaling as 1/√k, where k is the number of extreme observations.
• Two algorithms:
• DAMEX: easy to implement, linear complexity O(dn log n), not very robust to weak signals / noisy dependence structure.
• CLEF: a bit more complex (graph mining, but existing Python packages can help); complexity OK only if the dependence graph is sparse, but more robust.
• Statistical guarantees: non-asymptotic (DAMEX) and asymptotic (CLEF).
• Open questions: optimal choice of tuning parameters (cross-validation is common practice, but there is no theory yet in an extreme-value context).
22/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
22/47
Application: clustering extreme data (Chiapino et al., 2019 b.)
• Context: monitoring a high-dimensional system (e.g. air flight data from Airbus: 82 parameters, 18 000 observations), where extremes are of particular interest (associated with anomalies / risk regions).
• Naive idea: use the list of maximal dependent subgroups {αk, k ≤ K} issued from DAMEX/CLEF and the corresponding rectangles of the kind
tRεα = {x ∈ Rd : xj > tε for j ∈ α, xj < tε for j ∉ α, ‖x‖ > t}.
• Issue: in practice, many data points fall outside the tRεα’s. How to assign them to a cluster?
23/47
Mixture model for extremes
• See the dependent subgroups αk ⊂ {1, . . . , d}, k ≤ K, issued from DAMEX/CLEF as components of a mixture model.
• Zk: hidden indicator variable of component k. Conditionally on {‖V‖ > r0, Zk = 1},
V = Vk + εk = Rk Wk + εk,  (1)
where
• Vk ∈ Cαk,
• εk ∈ C⊥αk, with i.i.d. Exponential coordinates,
• Rk = ‖Vk‖ ∼ Pareto(1),
• Wk = Rk^{−1} Vk ∈ Sαk ∼ Φk (angular measure restricted to the k-th face: Dirichlet distribution with the L1 norm).
24/47
Model for the k-th mixture component
V = Vk + εk = Rk Wk + εk,  (2)
• Training the mixture model: EM algorithm.
25/47
Clustering extremes
• After training: each extreme point Vi has probability pi,k of coming from mixture component k.
• Similarity measure between Vi and Vj:
si,j = P(Vi, Vj ∈ same component) = Σ_{k=1}^{K} pi,k pj,k
→ Similarity matrix (si,j) ∈ [0, 1]^{N×N}, where N is the number of extreme points.
• Clustering based on the similarity matrix using off-the-shelf techniques (e.g. spectral clustering).
26/47
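The similarity-then-cluster step can be sketched with simulated posterior probabilities standing in for the EM output (all names and data here are illustrative, and a bare-bones Fiedler-vector split stands in for an off-the-shelf spectral clustering routine):

```python
import numpy as np

# Toy posterior probabilities p[i, k] = P(extreme point i comes from component k);
# in the slides these come from the EM step of the mixture model.
rng = np.random.default_rng(0)
N, K = 60, 2
labels_true = rng.integers(0, K, size=N)
p = np.full((N, K), 0.1)
p[np.arange(N), labels_true] = 0.9     # each point mostly in one component

# Similarity s_ij = P(points i and j come from the same mixture component)
S = p @ p.T                            # (N, N) matrix with entries in [0, 1]

# Minimal spectral clustering on the similarity matrix:
# the Fiedler vector (second-smallest eigenvector of the graph Laplacian)
# splits the two groups by sign.
L = np.diag(S.sum(axis=1)) - S
eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalue order
labels = (eigvecs[:, 1] > 0).astype(int)
```

For more than two clusters one would instead run k-means on the first few Laplacian eigenvectors, or feed S to a library routine accepting a precomputed affinity matrix.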
Some results
Table: Shuttle dataset (9 attributes, 7 classes): purity score, compared with standard approaches for different extreme sample sizes.

                     n0=500  n0=400  n0=300  n0=200  n0=100
Dirichlet mixture    0.8     0.82    0.82    0.84    0.85
Kmeans               0.72    0.73    0.75    0.78    0.8
Spectral clustering  0.78    0.77    0.82    0.81    0.8
27/47
Flights anomaly clustering + Visual display
Using standard visualization tools (Python package Networkx)
[Figure: anomaly graph of ~300 flights, nodes colored by spectral cluster]
28/47
Conclusion II
• DAMEX/CLEF output can be used to perform clustering of extremes.
• Mixture modelling: { mixture components } = output of DAMEX/CLEF.
• Additional layer: building a similarity matrix to perform clustering.
• Question: what if the angular measure is concentrated on an ‘oblique’ subspace of the central subsphere (only particular linear combinations of all components are likely to be large)? Then CLEF/DAMEX fail, because all the mass is in the central subsphere and no particular structure can be discovered.
→ Idea: perform some sort of PCA on extreme data.
29/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
29/47
PCA for extremes: context and motivation
• (X1, . . . , Xd): a multivariate random vector with tail index α > 0 and limit measure µ.
• Motivating assumption (not strictly necessary):
Hypothesis 1
The vector space S0 = span(supp µ) generated by the support of µ has dimension p < d.
• Purpose of this work: recover S0 from the data, with guarantees concerning the reconstruction error.
30/47
Motivating assumption: interpretation
dim(S0) = p < d ; S0 = span(supp(µ))
⇐⇒
Certain linear combinations are much likelier to be large than others.
31/47
Dimension reduction in EVT: quick overview
• Looking for multiple subspaces where µ concentrates:
• Chautru, 2015 (clustering + principal nested spheres)
• Goix et al., 2016, 17; Chiapino et al. (space partitioning); Simpson et al., 20++ (relaxing the partition)
• Engelke & Hitz, 20++ (graphical models)
• K-means clustering: Janssen & Wan, 20++
• Dimension reduction in regression analysis: Gardes, 2018.
• PCA on a transformed version of the data: Cooley & Thibaud (20++)
32/47
Heavy-tailed scarecrow against using PCA for extremes
• ‘Classical dimension reduction tools such as PCA fail for multivariate extremes because they require the existence of second moments.’
• Possible answer: since µ is homogeneous, what matters is the angular component.
Proposed method for recovering the support of µ
• Perform PCA on angular data (or on any rescaled version of the data with enough moments) corresponding to the observations with largest norm.
• The first eigenvectors of the rescaled empirical covariance matrix provide an estimate of S0 = span(supp(µ)).
33/47
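A minimal sketch of this idea, under the scaling choice ω(x) = 1/‖x‖ (the function name and the one-dimensional toy example are ours, not the paper's reference code):

```python
import numpy as np

def extreme_pca(X, k, p):
    """Eigendecompose the second-moment matrix of the self-normalized
    (angular) observations with the k largest norms.

    Returns an orthonormal (d, p) basis estimating S0 = span(supp(mu)).
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]          # k most extreme points
    Theta = X[idx] / norms[idx, None]     # omega(x) = 1/||x||, so ||Theta|| = 1
    Sigma = Theta.T @ Theta / k           # empirical second-moment matrix
    eigval, eigvec = np.linalg.eigh(Sigma)
    return eigvec[:, ::-1][:, :p]         # leading p eigenvectors

# Toy check: heavy tail concentrated near a 1-dim subspace, plus light noise
rng = np.random.default_rng(0)
n, d = 5000, 5
R = 1.0 / rng.uniform(size=n)                       # Pareto(1) radii
u = np.ones(d) / np.sqrt(d)                         # true extremal direction
X = R[:, None] * u + rng.normal(scale=0.5, size=(n, d))
B = extreme_pca(X, k=100, p=1)                      # B[:, 0] should align with u
```

Because only the self-normalized angles of the top-k points enter the covariance, no moment assumption on X itself is needed, which is precisely the answer to the scarecrow above.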
Toy example
[Figure: scatter plot of a simulated heavy-tailed sample]
34/47
Toy example
[Figure: the same sample, with the limit subspace V_0 highlighted]
34/47
Toy example, proposed method
[Figure: the proposed method applied to the toy sample — extreme observations selected, rescaled to the sphere, and projected]
35/47
Empirical Risk Minimization setting
• ‖ · ‖: Euclidean norm.
• ΠS (resp. Π⊥S): orthogonal projection operator onto the linear space S (resp. S⊥).
• Rescaled observations: Θ = θ(X) = ω(X) · X,
• ω : Rd → R+: a suitable scaling function (think ω(x) = 1/‖x‖; variants are allowed such that E(‖Θ‖²) < ∞).
• Risk above level t: Rt(S) = E( ‖Π⊥S Θ‖² | ‖X‖ > t ).
• Empirical counterpart:
Rn,k(S) = (1/k) Σ_{i=1}^{k} ‖Π⊥S Θ(i)‖²,
with ‖X‖(1) ≥ . . . ≥ ‖X‖(n) the order statistics of the norm and Θ(i) the correspondingly rescaled data.
36/47
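Since the PCA-optimal subspace of each dimension q attains a risk equal to the sum of the trailing eigenvalues of the empirical second-moment matrix (the standard PCA identity), the risk curve used later to pick p̂ can be sketched as follows (illustrative names and toy data, with ω(x) = 1/‖x‖ assumed):

```python
import numpy as np

def empirical_risk_curve(X, k):
    """Minimal risk R_{n,k}(S_q) over subspaces of each dimension q = 1..d:
    the sum of the trailing eigenvalues of the empirical second-moment
    matrix of the self-normalized top-k observations."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    Theta = X[idx] / norms[idx, None]
    Sigma = Theta.T @ Theta / k
    eigval = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # descending
    return np.array([eigval[q:].sum() for q in range(1, len(eigval) + 1)])

# Toy data whose limit measure lives on a random 2-dimensional subspace
rng = np.random.default_rng(1)
n, d = 5000, 6
R = 1.0 / rng.uniform(size=(n, 2))               # two Pareto(1) radii
U = np.linalg.qr(rng.normal(size=(d, 2)))[0]     # orthonormal basis of the subspace
X = R @ U.T + rng.normal(scale=0.3, size=(n, d))
risks = empirical_risk_curve(X, k=150)
# an elbow between risks[0] and risks[1] suggests p_hat = 2
```

On such data the risk drops sharply once q reaches the true dimension and is nearly flat afterwards, which is the elbow one looks for in the risk plots of the simulation section.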
Minimizing a risk ⟺ Diagonalizing a covariance matrix
• Denote by Eq the set of all q-dimensional subspaces of Rd, 1 ≤ q ≤ d.
• Σt = E( ΘΘ⊤ | ‖X‖ > t ): conditional second-moment matrix.
• Standard fact from Principal Component Analysis: assume for simplicity that Σt has distinct eigenvalues, and let (u1, . . . , ud) denote the eigenvectors associated with the eigenvalues in decreasing order. Then
argmin_{S ∈ Eq} Rt(S) = span(u1, . . . , uq).
• Similarly,
argmin_{S ∈ Eq} Rn,k(S) = span(u1ⁿ, . . . , uqⁿ),
where (ujⁿ) are the eigenvectors of the empirical second-moment matrix of the Θ(i), i ≤ k.
37/47
ERM setting: risk at the limit
• Limit risk above extreme levels:
R∞(S) := E∞ ‖Π⊥S Θ‖²,
where E∞ is the expectation w.r.t. the limit conditional distribution
P∞( · ) = lim_{t→∞} P( X ∈ t( · ) | ‖X‖ > t ) = µ( · ) / µ({x : ‖x‖ > 1}).
• Hypothesis 1 (S0 = span supp µ, dim(S0) = p) ⇒
{S0} = argmin_{Ep} R∞,  R∞(S0) = 0,
and for all S′ of dimension p′ < p, R∞(S′) > 0.
38/47
Questions
• Is the empirical minimizer Ŝn of Rn,k consistent?
• Uniform, non-asymptotic bounds on |Rn,k(S) − R_{t_{n,k}}(S)|? (a classical goal in statistical learning)
• Relevance for practical applications (improved performance for non-parametric estimation of the probability of failure regions)?
39/47
Convergence of minimizers of the true conditional risk
• Scaling condition on the weight ω (→ second moments of Θ exist):
∃β ∈ (1 − α/2, 1] such that ∀λ > 0, x ∈ Rd: ω(λx) = λ^{−β} ω(x), and cω := sup_{‖x‖=1} ω(x) < ∞.
Uniform risk bound
• Stronger condition on ω: ω(x) ≤ 1/‖x‖ (thus ‖Θ‖ ≤ 1).
• t_{n,k}: quantile of level 1 − k/n of ‖X‖.
• St := E( ‖Θ‖⁴ | ‖X‖ > t ) − tr(Σt²);  Σt = E( ΘΘ⊤ | ‖X‖ > t ).
Theorem 3 (Drees, S., 20++), simplified version
With probability at least 1 − δ,
sup_{S ∈ Ep} |Rn,k(S) − R_{t_{n,k}}(S)| ≤ [ (p ∧ (d − p)) / k · S_{t_{n,k}} ]^{1/2} + [ (8/k)(1 + k/n) log(4/δ) ]^{1/2} + 4 log(4/δ) / (3k).
(Variant of the bounded difference inequality (McDiarmid, 98) + arguments from Blanchard et al., 07.)
• NB: the term S_{t_{n,k}} is unknown; an alternative statement is proven with only empirical quantities in the upper bound.
41/47
Simulations: questions
• Can p = dim(V0) be chosen empirically from the risk plots?
• Does the empirical angular measure, after projection on the subspace learned by PCA, provide better estimates than the classical one for the risk-related quantities:
(i) lim_{u→∞} P( p^{−1} Σ_{1≤j≤p} Xj/‖X‖ > t(i) | ‖X‖ > u ) = H{x : p^{−1} Σ_{j=1}^{p} xj > t(i)}, for some t(i) ∈ (0, p^{−1/2});
(ii) lim_{u→∞} P( min_{1≤j≤p} Xj > u, max_{p+1≤j≤d} Xj ≤ u | ‖X‖ > u ) = ∫ ( (min_{1≤j≤p} xj)^α − (max_{p+1≤j≤d} xj)^α )_+ H(dx);
(iii) lim_{u→∞} P( X1 > u | max_{1≤j≤d} Xj > u ) = ∫ (x1)^α H(dx) / ∫ (max_{1≤j≤d} xj)^α H(dx);
(iv) lim_{u→∞} P( min_{1≤j≤d} Xj > u | ‖X‖ > u ) = ∫ (min_{1≤j≤d} xj)^α H(dx).
42/47
Simulations: models
• d-dimensional vectors with limit measure concentrated on a p < d dimensional subspace.
• Structure: p-dimensional max-stable model + d-dimensional Gaussian noise (absolute values), ρ = 0.2, σ2 ∈ {105/d, 10/d}.
• Unit Fréchet margins with tail index α ∈ {1, 2}.
• Dependence for the p-dimensional model:
• Max-stable vector from the Dirichlet model (Coles & Tawn, 91; see Segers, 2012 for simulation), with parameter (3, . . . , 3).
• other settings (not shown here)
• n = 1000, k ∈ {5, 10, 15, . . . , 200}, 1000 replications.
43/47
choice of p̂, Dirichlet model, p = 2, d = 10
[Figure: mean empirical risk (left) and empirical risk for one sample (right) versus k, for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10]
→ The choice p̂ = 2 is obvious for small k; p̂ ∈ {2, 3} for k ≥ 50.
44/47
Performance for estimating failure probabilities: RMSEs related to the angular measure H
[Figure: RMSEs for targets (i)–(iv), based on Ĥ_{n,k} (black, solid), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), versus k, in the Dirichlet model with parameter 3, p = 2 and d = 10]
• PCA step with 10 observations → estimators relatively insensitive to the choice of k for Ĥ_{n,k}.
45/47
Conclusion III (PCA)
• Plotting the empirical risk is useful for choosing p̂.
• In case of doubt, choose the highest plausible dimension.
• For estimating failure probabilities, estimators including a PCA step are competitive; for probability (i) [concomitance of extremes] they are superior.
• Choosing kPCA < k offers improved robustness w.r.t. the choice of k in the second step.
46/47
Bibliography I
• Blanchard, G., Bousquet, O., & Zwald, L. (2007). Statistical properties of kernelprincipal component analysis. Machine Learning, 66(2-3), 259-294.
• Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis.Electronic journal of statistics, 9(1), 383-418.
• Chiapino, M., Sabourin, A. (2016). Feature clustering for extreme eventsanalysis, with application to extreme stream-flow data. In InternationalWorkshop on New Frontiers in Mining Complex Patterns (pp. 132-147).Springer, Cham.
• Chiapino, M., Sabourin, A., Segers, J. (2019). Identifying groups of variableswith the potential of being large simultaneously. Extremes, 22(2), 193-222.
• Chiapino, M., Clémençon, S., Feuillard, V., Sabourin, A. (2019). Amultivariate extreme value theory approach to anomaly clustering andvisualization. Computational Statistics, 1-22.
• Cooley, D., & Thibaud, E. Decompositions of dependence for high-dimensionalextremes. arXiv:1612.07190.
• J-J. Cai, J. Einmahl, and L. De Haan. ”Estimation of extreme risk regions undermultivariate regular variation.” AoS,2011
46/47
Bibliography II
• N. Goix, A. S., S. Clémençon. "Learning the dependence structure of rare events: a non-asymptotic study", COLT, 2015.
• Drees, H. & Sabourin, A. Principal Component Analysis for multivariateextremes, arXiv:1906.11043
• Engelke, S., & Hitz, A. S. Graphical models for extremes. arXiv:1812.01734.
• Gardes, L. (2018). Tail dimension reduction for extreme quantile estimation. Extremes, 21(1), 57-95.
• Goix, N., Sabourin, A., Clémençon, S. (2016). Sparse Representation ofMultivariate Extremes with Applications to Anomaly Ranking. In AISTATS(pp. 75-83).
• N. Goix, A. S., and S. Clémençon. (2017) Sparse representation ofmultivariate extremes with applications to anomaly detection. JMVA
• Simpson, E. S., Wadsworth, J. L., & Tawn, J. A. Determining the DependenceStructure of Multivariate Extremes. arXiv:1809.01606.
• Janssen, A., & Wan, P. k-means clustering of extremes. arXiv:1904.02970.
47/47
More material on DAMEX
1/19
Estimation of the dependence structure: Φ(B) or µ([0, x]c)
• Flexible multivariate models for moderate dimension (d ≈ 5):
Dirichlet mixtures (Boldi, Davison 07; S., Naveau 12), logistic family (Stephenson 09; Fougères et al., 13), pairwise Beta (Cooley et al.), . . .
• Asymptotic theory: rates under second-order conditions (Einmahl, 01); empirical likelihood (Einmahl, Segers 09); asymptotic normality (Einmahl et al., 12, 15) (parametric).
• Finite-sample, non-parametric error bounds on
sup_{x ⪰ R} |µ̂n([0, x]c) − µ([0, x]c)|
(Goix, S., Clémençon, 15).
None of these tell ‘which components may be large together’.
2/19
DAMEX results: support recovery
• Asymmetric logistic, d = 10, dependence parameter α = 0.1 → non-asymptotic data (not exactly generalized Pareto).
• K randomly chosen (asymptotically) non-empty faces.
• Parameters: k = √n, ε = 0.1.
• Heuristic for setting the minimum mass µ0: eliminate faces supporting less than 1% of the total mass.

# sub-cones K             10    15    20    30    35    40    45    50
Aver. # errors (n=5e4)    0.01  0.09  0.39  1.82  3.59  6.59  8.06  11.21
Aver. # errors (n=15e4)   0.06  0.02  0.14  0.98  1.85  3.14  5.23  7.87
3/19
More material on CLEF and variants
4/19
CLEF method: relaxed constraints on the region of interest
Initial regions of interest:
Cα = {v ⪰ 0 : vj large for j ∈ α, vj small for j ∉ α}.  Question: µ(Cα) > 0?
Modified regions (relaxed constraints, larger regions, more points):
{v ⪰ 0 : vj large for j ∈ α},
more precisely
Γα = {v ⪰ 0 : ∀j ∈ α, vj > 1};  µ(Γα) = lim_{t→∞} t P( ∀j ∈ α, Vj > t ).
Alternative question: µ(Γα) > 0?
5/19
Problem statement (CLEF)
Goal: estimate the family of subsets
S̃ = {α ⊂ {1, . . . , d} : µ(Γα) > 0}.
Recall the initial problem: estimate S = {α : µ(Cα) > 0}.
Lemma: ‘equivalence’ of the two problems DAMEX/CLEF
α is a maximal element of S̃ ⟺ α is a maximal element of S.
6/19
Conditional criterion in CLEF
• One needs an empirical criterion for ‘testing’ dependence, µ(Γα) > 0: e.g. µ̂n(Γα) > µ0.
• Issue: µ(Γα) decreases as |α| increases. How to set the threshold according to |α|?
• Way around: condition upon joint exceedance of ‘all but one’ components:
κα = lim_{t→∞} P( ∀j ∈ α, Vj > t | Vj > t for all but at most one j ∈ α ).
Empirical criterion:
κ̂α,t = ( Σ_{i=1}^{n} 1{V̂i,j > t for all j ∈ α} ) / ( Σ_{i=1}^{n} 1{V̂i,j > t for all but at most one j ∈ α} ),
Ŝ = {α : κ̂α,t > κ0 and β ∈ Ŝ for all β ⊂ α};  Ŝ0 = {maximal such α’s}.
7/19
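The empirical criterion κ̂α,t can be sketched directly (illustrative function name and toy data; for simplicity the margins are taken as exactly unit Pareto rather than rank-transformed):

```python
import numpy as np

def kappa_hat(V, alpha, t):
    """CLEF dependence criterion: ratio of joint exceedances of all
    coordinates in alpha to exceedances of all but at most one of them."""
    Va = np.asarray(V, dtype=float)[:, list(alpha)]
    exceed = (Va > t).sum(axis=1)              # how many coords of alpha exceed t
    num = (exceed == len(alpha)).sum()
    den = (exceed >= len(alpha) - 1).sum()
    return num / den if den > 0 else 0.0

rng = np.random.default_rng(2)
n = 100_000
P = 1.0 / rng.uniform(size=(n, 2))
V = np.column_stack([P[:, 0], P[:, 0], P[:, 1]])   # Pareto(1) margins
k_dep = kappa_hat(V, (0, 1), t=100.0)   # asymptotically dependent pair: near 1
k_ind = kappa_hat(V, (0, 2), t=100.0)   # asymptotically independent pair: near 0
```

The conditioning in the denominator is what keeps the criterion on a comparable scale as |α| grows, which a raw threshold on µ̂n(Γα) does not achieve.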
Coping with combinatorial complexity: O(2^d) subsets!
• Good news: µ(Γα) = 0 ⇒ ∀β ⊃ α, µ(Γβ) = 0 → follow the Hasse diagram.
CLEF algorithm (CLustering Extreme Features, Chiapino, S., 16)
• Start with pairs: Â2 = {α : |α| = 2, κ̂(α) > κ0}, with κ̂ a dependence summary (next slide).
• Stage k: Âk = {α : |α| = k, κ̂(α) > κ0}; → candidates for Âk+1:
{α : |α| = k + 1, ∀β ⊂ α s.t. |β| = k, β ∈ Âk}. Not too many!
• If Âk = ∅, return M̂0 = the maximal elements of ∪_{j<k} Âj.
Refined statistical analysis (Chiapino et al., 2019 a.)
How to turn the heuristic stopping criterion ‘κ̂α < κ0’ into astatistical test with controllable asymptotic level?
9/19
Strategy 1: testing H0 : κα > κ0
Theorem (Chiapino, S., Segers, 2019)
Under the conditions from Einmahl et al., 12, th. 4.6 (second order + smoothness of ℓ), if κα > 0,
√k (κ̂α − κα) →_W Zκ,α, a centered Gaussian variable whose variance involves unknown quantities that can be estimated empirically.
• Proof: κα is a function of the (λβ : β ⊂ α), with λβ = µ{v ∈ Rd+ : ∃j ∈ β, vj > 1} (extremal coefficient), + ∆-method.
• Corollary: τα,n = 1{ κ̂α < κ0 + qδ √(σ̂α²/k) }, where qδ is the δ-quantile of N(0, 1), is a test for H0 : κα > κ0 of asymptotic level δ.
10/19
Strategy 2: estimating a tail dependence coefficient ηα
• Issue with strategy 1: κ0 > 0 is arbitrary but unavoidable. Indeed, testing ‘µ(Γα) > 0’ is not manageable in this framework, because the limit distribution degenerates as µ(Γα) → 0 (or as κα → 0).
• Alternative framework: Ledford & Tawn’s model (multivariate case: de Haan, Zhou, 11; Eastoe, Tawn, 12):
P( ∀j ∈ α : Vj > t ) = t^{−1/ηα} Lα(t), where Lα is slowly varying.  (3)
• For our purposes: µ(Γα) > 0 ⇒ (3) holds with ηα = 1.
• Consequence: any test for H̃0 : ηα = 1 (a simple hypothesis) is also a test for H0 : µ(Γα) > 0.
11/19
• Estimation: using a Pickands estimator η̂α,P (Peng, 99, bivariate case) or a Hill estimator η̂α,H (Draisma et al., 04, and Drees, 98 a,b).
• Technical challenge: accounting for unknown margins (V ← V̂, using the empirical cdf).
• Theorems (Chiapino, S., Segers, 2019): under H0, √k(η̂α,P − 1) and √k(η̂α,H − 1) both converge to Gaussian limits whose variances involve unknown quantities that can be estimated empirically.
• Tools for the proofs:
• for η̂α,P: ∆-method + Einmahl et al. (12);
• for η̂α,H: extending Draisma et al. (04) and Drees (98 a,b)’s proofs from the bivariate to the multivariate case.
12/19
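A minimal sketch of the Hill-type estimator of ηα via the structure variable T = min_{j∈α} Vj (illustrative assumptions of ours: exact unit-Pareto margins instead of rank transforms, and a fixed number k of order statistics):

```python
import numpy as np

def eta_hill(V, alpha, k):
    """Hill estimate of the tail dependence coefficient eta_alpha, computed
    on T = min_{j in alpha} V_j (V assumed to have unit-Pareto margins).
    Under asymptotic dependence of the coordinates in alpha, eta_alpha = 1."""
    T = np.asarray(V, dtype=float)[:, list(alpha)].min(axis=1)
    T_sorted = np.sort(T)[::-1]                       # decreasing order statistics
    return np.mean(np.log(T_sorted[:k] / T_sorted[k]))

rng = np.random.default_rng(3)
n = 200_000
P = 1.0 / rng.uniform(size=(n, 2))
V = np.column_stack([P[:, 0], P[:, 0], P[:, 1]])      # Pareto(1) margins
e_dep = eta_hill(V, (0, 1), k=500)   # asymptotic dependence: eta near 1
e_ind = eta_hill(V, (0, 2), k=500)   # independent pair: eta near 1/2
```

For the independent pair, P(min(V1, V2) > t) = t^{−2}, so the structure variable has tail index 2 and the Hill estimate of η is near 1/2; under asymptotic dependence it stays near 1, which is exactly what the test H̃0 : ηα = 1 exploits.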
Experiments: d = 100, asymmetric logistic (AL) model with random perturbation
• Comparison between CLEF and the three considered variants (heuristic stopping criterion replaced with a test).
• M = {α : µ(Γα) > 0} = 80 randomly chosen subsets.
• Perturbation: for each i ≤ n, Xi ∼ perturbed AL distribution: each α ∈ M augmented with a randomly chosen ji ∈ {1, . . . , d} \ α.
• The DAMEX algorithm (Goix, S., Clémençon, 16, 17) fails (miserably).

        # true recovered   # false α ⊂ β ∈ M   # false α ⊃ β ∈ M   # other false
η̂H     79.0 (1.4)         2.4 (3.4)            0.04 (0.2)           18.0 (7.0)
η̂P     79.6 (0.7)         1.0 (2.5)            0. (0.)              3.4 (2.8)
κ̂      71.1 (2.3)         7.4 (4.7)            5.1 (2.1)            28.0 (13.3)
CLEF    69.9 (4.4)         16.2 (8.1)           0.5 (0.6)            2.3 (2.2)
50 datasets, n = 5e4, k = 150, conf. level 0.001; κ0 = 0.05 (CLEF).
Tests based on η̂P are modified to accommodate the case µ̂(Γα) ≤ 0.05.
More on PCA for extremes
14/19
PCA for extremes: uniform risk bound II
• R̂t: empirical risk conditional on {‖X‖ > t}.
• Data-dependent bound (with empirical moments instead of St):
Theorem 4 (Drees, S., 20++)
For all ℓ > 1, u, v > 0,
P( sup_{S ∈ Ep} |R̂t(S) − Rt(S)| ≥ [ (p ∧ (d − p)) ( S̃t/(ℓ − 1) + v/ℓ ) ]^{1/2} + u | Nt = ℓ ) ≤ 2 exp(−2ℓu²) + exp(−⌊ℓ/2⌋ v²/2),
with S̃t := Nt^{−1} Σ_{i=1}^{n} ‖Θi,t‖⁴ − tr( (Nt^{−1} Σ_{i=1}^{n} Θi,t Θi,t⊤)² ) and Θi,t = Θi 1{‖Xi‖ > t}.
(Adapting arguments from Blanchard et al., 07, conditioning upon ‖X‖ > t.)
• Corollary: confidence interval for Rt.
15/19
Mean distance ρ(Ŝn,k, S0) vs. k
[Figure: |||Π_{Ŝn,k} − Π_{S0}||| as a function of k, in the Dirichlet model with parameter 3, p = 2 and d = 10]
• Performance deteriorates quickly → choose kPCA = 10 < k (angular measure).
16/19
PCA: choice of p̂, p = 5, d = 100
[Figure: mean empirical risk for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10 (left), and distance ρ(Ŝn,k, S0) with p̂ = p (right), versus k, in the Dirichlet model with parameter 3]
• p̂ ∈ {4, 5, 6}: reasonable. kPCA should remain small (again).
17/19
RMSEs for probabilities (i)–(iv), Dirichlet model, d = 100, p = 5
[Figure: estimators based on Ĥ_{n,k} (black), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), vs. k]
18/19
RMSEs for probabilities (i)–(iv), randomly rotated Dirichlet model, d = 10, p = 3
[Figure: estimators based on Ĥ_{n,k} (black, solid), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), vs. k]
19/19