Exploratory analysis for high-dimensional extremes: support identification, anomaly detection and clustering, Principal Component Analysis.
Anne Sabourin1
+ many others: Maël Chiapino (PhD), Stephan Clémençon (Télécom Paris), Holger
Drees (U. Hamburg), Vincent Feuillard (Airbus), Nicolas Goix (PhD), Johan Segers
(UC Louvain)
1 Télécom Paris, Institut polytechnique de Paris, France.
Chair Stress test, Ecole Polytechnique, BNPP, 2020/04/02
1/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
1/47
Motivation(s)
• Multivariate heavy-tailed random vector X = (X1, . . . , Xd)
(e.g. spatial field (temperature, precipitation), asset (negative) prices, . . . )
• Focus on the distribution of the largest values: Law(X | ‖X‖ > t), t ≫ 1, with P(‖X‖ > t) small.
Possible goals: simulation (stress test), anomaly detection (preprocessing) among extreme values, . . .
• d ≫ 1: modeling Law(X | ‖X‖ > t) is unfeasible.
• Dimension reduction problem(s):
1. Identify the groups of features α ⊂ {1, . . . , d} which may be large together (while the others stay small), given that one of them is large.
2. Identify a single low-dimensional projection subspace V0 such that Law(X | ‖X‖ > t) is approximately concentrated on V0.
2/47
Examples: It cannot rain everywhere at the same time
(daily precipitation)
(air pollutants)
3/47
Applications to risk management
Sensor networks (road traffic, river streamflow, temperature, internet traffic, . . . ) or financial asset prices:
→ extreme event = traffic jam, flood, heatwave, network congestion, falling price
→ question: which groups of sensors / assets are likely to be jointly impacted?
→ how to define alert regions (alert groups of features/components)?
spatial case: one feature = one sensor
4/47
Applications to anomaly detection
• Training step: learn a ‘normal region’ (e.g. an approximate support).
• Prediction step (with new data): anomalies = points outside the ‘normal region’.
If ‘normal’ data are heavy-tailed, Abnormal ⇎ Extreme: there may be extreme ‘normal’ data.
How to distinguish between large anomalies and normal extremes?
5/47
Standardized data
• Random vector X = (X1, . . . , Xd), with Xj ≥ 0.
• Margins: Xj ∼ Fj , 1 ≤ j ≤ d (continuous).
• Preliminary step: standardization (here: to Pareto margins):
Vj = 1 / (1 − Fj(Xj)),  so that P(Vj > v) = 1/v.
• Goal: P(V ∈ A), for A ‘far from 0’?
• Each component j is homogeneous (of order −1): for all t > 0,
P(Vj ∈ tA) / P(Vj > t) = P(Vj ∈ A).
6/47
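As a concrete illustration of this standardization step, here is a minimal Python sketch using empirical ranks in place of the unknown Fj (the function name and toy data are ours; the rank convention F̂j(Xi,j) = (rank(Xi,j) − 1)/n matches the estimation slides):

```python
import numpy as np

def pareto_standardize(X):
    """Rank-transform each column to (approximately) unit-Pareto margins.

    Empirical version of V_j = 1 / (1 - F_j(X_j)), with
    F_hat_j(X_ij) = (rank(X_ij) - 1) / n.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # ranks in 1..n per column (ties have probability zero for continuous margins)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return 1.0 / (1.0 - (ranks - 1) / n)

rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 3))   # toy data with arbitrary margins
V = pareto_standardize(X)
# each column now ranges from 1 (smallest point) to n (largest point)
```

With this convention the largest observation of each column is mapped exactly to V = n, so the transformed sample mimics unit-Pareto margins regardless of the original distribution.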
Multivariate extremes: regular variation
• Informally: the marginal homogeneity property remains valid in the multivariate sense.
• A random vector V = (V1, . . . , Vd) ∈ Rd is regularly varying if there exists a limit measure µ such that
P(V ∈ tA) / P(‖V‖ > t) −−−→ µ(A)  as t → ∞,
for A ⊂ Rd with 0 ∉ closure(A) and µ(∂A) = 0.
• Necessarily µ is homogeneous: µ(rA) = r^{−α} µ(A), for some α > 0 (the tail index).
• With Vj = 1/(1 − Fj(Xj)), necessarily α = 1.
7/47
Multivariate extremes: regular variation (Cont’d)
• µ rules the (probabilistic) behaviour of extremes: if A is far from theorigin,
P(V ∈ A) ≈ µ(A)
• Examples: Max stable vectors with standardized margins, multivariatestudents, . . .
• Statistical procedures based on Extreme Value theory: 2 steps.
1. Learn useful features of µ using the k observations V(1), . . . ,V(k) withlargest norm, with k � n number of available data
2. Use the approximation P(V ∈ A) ≈ µ(A) for A far from 0.
8/47
Angular measure
• Homogeneity of µ → polar coordinates are convenient:
r = ‖x‖ (any norm); θ = r^{−1} x.
• Angular measure Φ on the corresponding unit sphere:
Φ(B) = µ{r > 1, θ ∈ B}.
• Then µ decomposes as a product, and only Φ needs to be estimated:
µ{r > t, θ ∈ B} = t^{−α} Φ(B).
9/47
Angular measure
• Φ rules the joint distribution of extremes
• Asymptotic dependence: (V1,V2) may be large together.
vs
• Asymptotic independence: only V1 or V2 may be large.
No assumption on Φ: non-parametric framework.
10/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
10/47
Towards high dimension
• Reasonable hope: only a moderate number of the Vj’s may be simultaneously large → sparse angular measure.
• Our goal, from a multivariate EVT point of view:
Estimate the (sparse) support of the angular measure (i.e. the dependence structure).
Which components may be large together, while the others are small?
11/47
Sparse angular support
Full support: anything may happen. Sparse support: V1 not large if V2 or V3 is large.
Where is the mass?
Subcones of Rd+, for α ⊂ {1, . . . , d}:
Cα = {x ⪰ 0 : xj > 0 for j ∈ α, xj = 0 for j ∉ α, ‖x‖ ≥ 1}.
12/47
Support recovery + representation
• {Cα, α ⊂ {1, . . . , d}}: partition of {x : ‖x‖ ≥ 1}.
• Goal 1: learn the (2^d − 1)-dimensional representation (potentially sparse)
M = ( µ(Cα) )_{α ⊂ {1,...,d}, α ≠ ∅};  support S = {α : µ(Cα) > 0}.
Main interest:
µ(Cα) > 0 ⟺ features j ∈ α may be large together while the others are small.
13/47
Identifying non-empty edges
Issue: real data are non-asymptotic: Vj > 0 for every j.
One cannot just count data on each subcone: only the largest-dimensional one has empirical mass!
14/47
Identifying non-empty edges
Fix ε > 0 and assign each data point ε-close to an edge to that edge:
Cα → Rεα.
→ New partition of the input space, compatible with non-asymptotic data.
15/47
Empirical estimator of µ(Cα)
(Counts the standardized points in Rεα, far from 0.) Data: Xi = (Xi,1, . . . , Xi,d), i = 1, . . . , n.
• Standardize: V̂i,j = 1 / (1 − F̂j(Xi,j)), with F̂j(Xi,j) = (rank(Xi,j) − 1)/n.
• Natural estimator:
µ̂n(Cα) = (n/k) Pn( V̂ ∈ (n/k) Rεα )  →  M̂ = ( µ̂n(Cα), α ⊂ {1, . . . , d} ).
• Estimated support: Ŝ = {α : µ̂n(Cα) > µ0}.
16/47
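The estimator above can be sketched in a few lines; this is a minimal, hypothetical implementation (function name, sup-norm choice, eps value and toy data are ours, not the reference DAMEX code):

```python
import numpy as np
from collections import Counter

def damex_masses(V, k, eps=0.1):
    """Sketch of the mass estimates mu_hat(C_alpha).

    V: (n, d) array with (approximately) unit-Pareto margins.
    A point with sup-norm above n/k is assigned to the subcone alpha of
    coordinates exceeding eps * (n/k); mu_hat(C_alpha) is (n/k) times the
    fraction of points landing in the corresponding rectangle R_eps_alpha.
    """
    n, d = V.shape
    t = n / k                        # radial threshold
    extreme = V[V.max(axis=1) > t]   # points with ||V||_inf > n/k
    counts = Counter()
    for v in extreme:
        alpha = frozenset(np.flatnonzero(v > eps * t))
        counts[alpha] += 1
    return {alpha: (n / k) * c / n for alpha, c in counts.items()}

rng = np.random.default_rng(1)
n = 10_000
# toy data: coordinates 0 and 1 comonotone (large together), coordinate 2 independent
U = rng.uniform(size=(n, 2))
V = np.column_stack([1 / U[:, 0], 1 / U[:, 0], 1 / U[:, 1]])  # Pareto(1) margins
M = damex_masses(V, k=200, eps=0.3)
```

On this toy sample, essentially all the mass lands on the faces {0, 1} and {2}, reflecting that coordinates 0 and 1 are asymptotically dependent while coordinate 2 is extreme on its own.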
Sparsity in real datasets
Data: 50 wave directions from buoys in the North Sea (Shell Research; thanks to J. Wadsworth).
17/47
Finite sample error bound
VC-bound adapted to low-probability regions (see Goix, S., Clémençon, 2015).
Theorem
If the margins Fj are continuous and the density of the angular measure is bounded by M > 0 on each subface (infinity norm), then there is a constant C such that for any n, d, k, δ ≥ e^{−k}, ε ≤ 1/4, with probability ≥ 1 − δ,
max_α |µ̂n(Cα) − µ(Cα)| ≤ Cd ( √( (1/(kε)) log(d/δ) ) + Mdε ) + Bias_{n/k, ε}(F, µ).
Bias: using non-asymptotic data to learn about an asymptotic quantity;
Regular variation ⟺ Bias_{t,ε} → 0 as t → ∞.
• Existing literature (d = 2): rate 1/√k.
• Here: 1/√(kε) + Mdε — the price to pay for biasing the estimator with ε.
OK if kε → ∞ and ε → 0. Choice of ε: cross-validation, or simply ‘ε = 0.1’.
18/47
Tools for the proof
1. VC inequality for small-probability classes (Goix et al., 2015):
→ max deviations ≤ √p × (usual bound).
2. Apply it to the VC-class of rectangles {(k/n) R(x, z, α), x, z ⪰ ε}:
→ p ≤ d kε/n
⇒ sup_α |µ̂n − µ|(Rεα) ≤ Cd √( (1/(εk)) log(d/δ) ).
3. Approximate µ(Cα) with µ(Rεα) → error ≤ Mdε (bounded angular density).
19/47
Algorithm DAMEX (Detecting Anomalies with Multivariate Extremes) (Goix, S., Clémençon, 2016)
Anomaly = new observation ‘violating the sparsity pattern’, i.e. observed in an empty or light subcone.
Scoring function: for x whose standardized version v̂ falls in the rectangle associated with Cα,
sn(x) = (1/‖v̂‖) µ̂n(Rεα) ≃ P(V ∈ Cα, ‖V‖ > ‖x‖)  for large x.
20/47
Extension: feature clustering (Chiapino, S., 2016; Chiapino, S., Segers, 2019 a.)
• Motivating example: river stream-flow dataset, d = 92 gauging stations.
• Typical groups jointly impacted by extreme records include noisy additional features!
→ Empirical µ-mass scattered over many Cα’s
→ No apparent sparsity pattern.
• How to gather ‘close-by’ α’s into feature clusters? → ‘robust’ version of DAMEX: the CLEF algorithm and variants + asymptotic analysis.
21/47
Conclusion I
• Discovering subgroups of components that are likely to be simultaneously large is doable, with an error scaling as 1/√k, where k is the number of extreme observations.
• Two algorithms:
• DAMEX: easy to implement, linear complexity O(dn log n), not very robust to weak signals / noisy dependence structure.
• CLEF: a bit more complex (graph mining, but existing Python packages can help); complexity OK only if the dependence graph is sparse, but more robust.
• Statistical guarantees: non-asymptotic (DAMEX) and asymptotic (CLEF).
• Open questions: optimal choice of tuning parameters (cross-validation is common practice, but there is no theory yet in an extreme-value context).
22/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
22/47
Application: clustering extreme data (Chiapino et al., 2019 b.)
• Context: monitoring a high-dimensional system (e.g. air flight data from Airbus: 82 parameters, 18 000 observations), where extremes are of particular interest (associated with anomalies / risk regions).
• Naive idea: use the list of maximal dependent subgroups {αk, k ≤ K} issued from DAMEX/CLEF and the corresponding rectangles of the kind
tRεα = {x ∈ Rd : xj > tε for j ∈ α, xj < tε for j ∉ α, ‖x‖ > t}.
• Issue: in practice, many data points fall outside the tRεα’s. How to assign them to a cluster?
23/47
Mixture model for extremes
• See the dependent subgroups αk ⊂ {1, . . . , d}, k ≤ K, issued from DAMEX/CLEF as components of a mixture model.
• Zk: hidden indicator variable of component k. Conditionally on {‖V‖ > r0, Zk = 1},
V = Vk + εk = Rk Wk + εk,  (1)
where
• Vk ∈ Cαk,
• εk ∈ C⊥αk, with i.i.d. Exponential coordinates,
• Rk = ‖Vk‖ ∼ Pareto(1),
• Wk = Rk^{−1} Vk ∈ Sαk ∼ Φk (angular measure restricted to the k-th face: Dirichlet distribution with the L1 norm).
24/47
Model for the k-th mixture component
V = Vk + εk = Rk Wk + εk,  (2)
• Training the mixture model: EM algorithm.
25/47
Clustering extremes
• After training: each extreme point Vi has probability pi,k of coming from mixture component k.
• Similarity measure between Vi and Vj:
si,j = P(Vi, Vj ∈ same component) = Σ_{k=1}^{K} pi,k pj,k
→ Similarity matrix (si,j) ∈ [0, 1]^{N×N}, where N is the number of extreme points.
• Clustering based on the similarity matrix using off-the-shelf techniques (e.g. spectral clustering).
26/47
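The similarity-then-cluster step can be sketched with simulated posterior probabilities standing in for the EM output (all names and data here are illustrative, and a bare-bones Fiedler-vector split stands in for an off-the-shelf spectral clustering routine):

```python
import numpy as np

# Toy posterior probabilities p[i, k] = P(extreme point i comes from component k);
# in the slides these come from the EM step of the mixture model.
rng = np.random.default_rng(0)
N, K = 60, 2
labels_true = rng.integers(0, K, size=N)
p = np.full((N, K), 0.1)
p[np.arange(N), labels_true] = 0.9     # each point mostly in one component

# Similarity s_ij = P(points i and j come from the same mixture component)
S = p @ p.T                            # (N, N) matrix with entries in [0, 1]

# Minimal spectral clustering on the similarity matrix:
# the Fiedler vector (second-smallest eigenvector of the graph Laplacian)
# splits the two groups by sign.
L = np.diag(S.sum(axis=1)) - S
eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalue order
labels = (eigvecs[:, 1] > 0).astype(int)
```

For more than two clusters one would instead run k-means on the first few Laplacian eigenvectors, or feed S to a library routine accepting a precomputed affinity matrix.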
Some results
Table: Shuttle dataset (9 attributes, 7 classes): purity score, compared with standard approaches for different extreme sample sizes.

                     n0=500  n0=400  n0=300  n0=200  n0=100
Dirichlet mixture    0.8     0.82    0.82    0.84    0.85
Kmeans               0.72    0.73    0.75    0.78    0.8
Spectral clustering  0.78    0.77    0.82    0.81    0.8
27/47
Flights anomaly clustering + Visual display
Using standard visualization tools (Python package Networkx)
[Figure: anomaly graph of ~300 flights, nodes colored by spectral cluster]
28/47
Conclusion II
• DAMEX/CLEF output can be used to perform clustering of extremes.
• Mixture modelling: { mixture components } = output of DAMEX/CLEF.
• Additional layer: building a similarity matrix to perform clustering.
• Question: what if the angular measure is concentrated on an ‘oblique’ subspace of the central subsphere (only particular linear combinations of all components are likely to be large)? Then CLEF/DAMEX fail, because all the mass is in the central subsphere and no particular structure can be discovered.
→ Idea: perform some sort of PCA on extreme data.
29/47
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 17., Chiapino et al. 2016,2019 a. )
Application: extremes/anomalies clustering (Chiapino et al., 2019 b.)
Principal Component Analysis for extremes (S. and Drees, 202+)
29/47
PCA for extremes: context and motivation
• (X1, . . . , Xd): a multivariate random vector with tail index α > 0 and limit measure µ.
• Motivating assumption (not strictly necessary):
Hypothesis 1
The vector space S0 = span(supp µ) generated by the support of µ has dimension p < d.
• Purpose of this work: recover S0 from the data, with guarantees concerning the reconstruction error.
30/47
Motivating assumption: interpretation
dim(S0) = p < d ; S0 = span(supp(µ))
⇐⇒
Certain linear combinations are much likelier to be large than others.
31/47
Dimension reduction in EVT: quick overview
• Looking for multiple subspaces where µ concentrates:
• Chautru, 2015 (clustering + principal nested spheres)
• Goix et al., 2016, 17; Chiapino et al. (space partitioning); Simpson et al., 20++ (relaxing the partition)
• Engelke & Hitz, 20++ (graphical models)
• K-means clustering: Janssen & Wan, 20++
• Dimension reduction in regression analysis: Gardes, 2018.
• PCA on a transformed version of the data: Cooley & Thibaud (20++)
32/47
Heavy-tailed scarecrow against using PCA for extremes
• ‘Classical dimension reduction tools such as PCA fail for multivariate extremes because they require the existence of second moments.’
• Possible answer: since µ is homogeneous, what matters is the angular component.
Proposed method for recovering the support of µ
• Perform PCA on angular data (or on any rescaled version of the data with enough moments) corresponding to the observations with largest norm.
• The first eigenvectors of the rescaled empirical covariance matrix provide an estimate of S0 = span(supp(µ)).
33/47
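A minimal sketch of this idea, under the scaling choice ω(x) = 1/‖x‖ (the function name and the one-dimensional toy example are ours, not the paper's reference code):

```python
import numpy as np

def extreme_pca(X, k, p):
    """Eigendecompose the second-moment matrix of the self-normalized
    (angular) observations with the k largest norms.

    Returns an orthonormal (d, p) basis estimating S0 = span(supp(mu)).
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]          # k most extreme points
    Theta = X[idx] / norms[idx, None]     # omega(x) = 1/||x||, so ||Theta|| = 1
    Sigma = Theta.T @ Theta / k           # empirical second-moment matrix
    eigval, eigvec = np.linalg.eigh(Sigma)
    return eigvec[:, ::-1][:, :p]         # leading p eigenvectors

# Toy check: heavy tail concentrated near a 1-dim subspace, plus light noise
rng = np.random.default_rng(0)
n, d = 5000, 5
R = 1.0 / rng.uniform(size=n)                       # Pareto(1) radii
u = np.ones(d) / np.sqrt(d)                         # true extremal direction
X = R[:, None] * u + rng.normal(scale=0.5, size=(n, d))
B = extreme_pca(X, k=100, p=1)                      # B[:, 0] should align with u
```

Because only the self-normalized angles of the top-k points enter the covariance, no moment assumption on X itself is needed, which is precisely the answer to the scarecrow above.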
Toy example
[Figure: scatter plot of a simulated heavy-tailed sample]
34/47
Toy example
[Figure: the same sample, with the limit subspace V_0 highlighted]
34/47
Toy example, proposed method
[Figure: the proposed method applied to the toy sample — extreme observations selected, rescaled to the sphere, and projected]
35/47
Empirical Risk Minimization setting
• ‖ · ‖: Euclidean norm.
• ΠS (resp. Π⊥S): orthogonal projection operator onto the linear space S (resp. S⊥).
• Rescaled observations: Θ = θ(X) = ω(X) · X,
• ω : Rd → R+: a suitable scaling function (think ω(x) = 1/‖x‖; variants are allowed such that E(‖Θ‖²) < ∞).
• Risk above level t: Rt(S) = E( ‖Π⊥S Θ‖² | ‖X‖ > t ).
• Empirical counterpart:
Rn,k(S) = (1/k) Σ_{i=1}^{k} ‖Π⊥S Θ(i)‖²,
with ‖X‖(1) ≥ . . . ≥ ‖X‖(n) the order statistics of the norm and Θ(i) the correspondingly rescaled data.
36/47
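Since the PCA-optimal subspace of each dimension q attains a risk equal to the sum of the trailing eigenvalues of the empirical second-moment matrix (the standard PCA identity), the risk curve used later to pick p̂ can be sketched as follows (illustrative names and toy data, with ω(x) = 1/‖x‖ assumed):

```python
import numpy as np

def empirical_risk_curve(X, k):
    """Minimal risk R_{n,k}(S_q) over subspaces of each dimension q = 1..d:
    the sum of the trailing eigenvalues of the empirical second-moment
    matrix of the self-normalized top-k observations."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    Theta = X[idx] / norms[idx, None]
    Sigma = Theta.T @ Theta / k
    eigval = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # descending
    return np.array([eigval[q:].sum() for q in range(1, len(eigval) + 1)])

# Toy data whose limit measure lives on a random 2-dimensional subspace
rng = np.random.default_rng(1)
n, d = 5000, 6
R = 1.0 / rng.uniform(size=(n, 2))               # two Pareto(1) radii
U = np.linalg.qr(rng.normal(size=(d, 2)))[0]     # orthonormal basis of the subspace
X = R @ U.T + rng.normal(scale=0.3, size=(n, d))
risks = empirical_risk_curve(X, k=150)
# an elbow between risks[0] and risks[1] suggests p_hat = 2
```

On such data the risk drops sharply once q reaches the true dimension and is nearly flat afterwards, which is the elbow one looks for in the risk plots of the simulation section.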
Minimizing a risk ⟺ Diagonalizing a covariance matrix
• Denote by Eq the set of all q-dimensional subspaces of Rd, 1 ≤ q ≤ d.
• Σt = E( ΘΘ⊤ | ‖X‖ > t ): conditional second-moment matrix.
• Standard fact from Principal Component Analysis: assume for simplicity that Σt has distinct eigenvalues, and let (u1, . . . , ud) denote the eigenvectors associated with the eigenvalues in decreasing order. Then
argmin_{S ∈ Eq} Rt(S) = span(u1, . . . , uq).
• Similarly,
argmin_{S ∈ Eq} Rn,k(S) = span(u1ⁿ, . . . , uqⁿ),
where (ujⁿ) are the eigenvectors of the empirical second-moment matrix of the Θ(i), i ≤ k.
37/47
ERM setting: risk at the limit
• Limit risk above extreme levels:
R∞(S) := E∞ ‖Π⊥S Θ‖²,
where E∞ is the expectation w.r.t. the limit conditional distribution
P∞( · ) = lim_{t→∞} P( X ∈ t( · ) | ‖X‖ > t ) = µ( · ) / µ({x : ‖x‖ > 1}).
• Hypothesis 1 (S0 = span supp µ, dim(S0) = p) ⇒
{S0} = argmin_{Ep} R∞,  R∞(S0) = 0,
and for all S′ of dimension p′ < p, R∞(S′) > 0.
38/47
Questions
• Is the empirical minimizer Ŝn of Rn,k consistent?
• Uniform, non-asymptotic bounds on |Rn,k(S) − R_{t_{n,k}}(S)|? (a classical goal in statistical learning)
• Relevance for practical applications (improved performance for non-parametric estimation of the probability of failure regions)?
39/47
Convergence of minimizers of the true conditional risk
• Scaling condition on the weight ω (→ second moments of Θ exist):
∃β ∈ (1 − α/2, 1] such that ∀λ > 0, x ∈ Rd: ω(λx) = λ^{−β} ω(x), and cω := sup_{‖x‖=1} ω(x) < ∞.
Uniform risk bound
• Stronger condition on ω: ω(x) ≤ 1/‖x‖ (thus ‖Θ‖ ≤ 1).
• t_{n,k}: quantile of level 1 − k/n of ‖X‖.
• St := E( ‖Θ‖⁴ | ‖X‖ > t ) − tr(Σt²);  Σt = E( ΘΘ⊤ | ‖X‖ > t ).
Theorem 3 (Drees, S., 20++), simplified version
With probability at least 1 − δ,
sup_{S ∈ Ep} |Rn,k(S) − R_{t_{n,k}}(S)| ≤ [ (p ∧ (d − p)) / k · S_{t_{n,k}} ]^{1/2} + [ (8/k)(1 + k/n) log(4/δ) ]^{1/2} + 4 log(4/δ) / (3k).
(Variant of the bounded difference inequality (McDiarmid, 98) + arguments from Blanchard et al., 07.)
• NB: the term S_{t_{n,k}} is unknown; an alternative statement is proven with only empirical quantities in the upper bound.
41/47
Simulations: questions
• Can p = dim(V0) be chosen empirically from the risk plots?
• Does the empirical angular measure, after projection on the subspace learned by PCA, provide better estimates than the classical one for the risk-related quantities:
(i) lim_{u→∞} P( p^{−1} Σ_{1≤j≤p} Xj/‖X‖ > t(i) | ‖X‖ > u ) = H{x : p^{−1} Σ_{j=1}^{p} xj > t(i)}, for some t(i) ∈ (0, p^{−1/2});
(ii) lim_{u→∞} P( min_{1≤j≤p} Xj > u, max_{p+1≤j≤d} Xj ≤ u | ‖X‖ > u ) = ∫ ( (min_{1≤j≤p} xj)^α − (max_{p+1≤j≤d} xj)^α )_+ H(dx);
(iii) lim_{u→∞} P( X1 > u | max_{1≤j≤d} Xj > u ) = ∫ (x1)^α H(dx) / ∫ (max_{1≤j≤d} xj)^α H(dx);
(iv) lim_{u→∞} P( min_{1≤j≤d} Xj > u | ‖X‖ > u ) = ∫ (min_{1≤j≤d} xj)^α H(dx).
42/47
Simulations: models
• d-dimensional vectors with limit measure concentrated on a p < d dimensional subspace.
• Structure: p-dimensional max-stable model + d-dimensional Gaussian noise (absolute values), ρ = 0.2, σ2 ∈ {105/d, 10/d}.
• Unit Fréchet margins with tail index α ∈ {1, 2}.
• Dependence for the p-dimensional model:
• Max-stable vector from the Dirichlet model (Coles & Tawn, 91; see Segers, 2012 for simulation), with parameter (3, . . . , 3).
• other settings (not shown here)
• n = 1000, k ∈ {5, 10, 15, . . . , 200}, 1000 replications.
43/47
choice of p̂, Dirichlet model, p = 2, d = 10
[Figure: mean empirical risk (left) and empirical risk for one sample (right) versus k, for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10]
→ The choice p̂ = 2 is obvious for small k; p̂ ∈ {2, 3} for k ≥ 50.
44/47
Performance for estimating failure probabilities: RMSEs related to the angular measure H
[Figure: RMSEs for targets (i)–(iv), based on Ĥ_{n,k} (black, solid), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), versus k, in the Dirichlet model with parameter 3, p = 2 and d = 10]
• PCA step with 10 observations → estimators relatively insensitive to the choice of k for Ĥ_{n,k}.
45/47
Conclusion III (PCA)
• Plotting the empirical risk is useful for choosing p̂.
• In case of doubt, choose the highest plausible dimension.
• For estimating failure probabilities, estimators including a PCA step are competitive; for probability (i) [concomitance of extremes] they are superior.
• Choosing kPCA < k offers improved robustness w.r.t. the choice of k in the second step.
46/47
Bibliography I
• Blanchard, G., Bousquet, O., & Zwald, L. (2007). Statistical properties of kernelprincipal component analysis. Machine Learning, 66(2-3), 259-294.
• Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis.Electronic journal of statistics, 9(1), 383-418.
• Chiapino, M., Sabourin, A. (2016). Feature clustering for extreme eventsanalysis, with application to extreme stream-flow data. In InternationalWorkshop on New Frontiers in Mining Complex Patterns (pp. 132-147).Springer, Cham.
• Chiapino, M., Sabourin, A., Segers, J. (2019). Identifying groups of variableswith the potential of being large simultaneously. Extremes, 22(2), 193-222.
• Chiapino, M., Clémençon, S., Feuillard, V., Sabourin, A. (2019). Amultivariate extreme value theory approach to anomaly clustering andvisualization. Computational Statistics, 1-22.
• Cooley, D., & Thibaud, E. Decompositions of dependence for high-dimensionalextremes. arXiv:1612.07190.
• J-J. Cai, J. Einmahl, and L. De Haan. ”Estimation of extreme risk regions undermultivariate regular variation.” AoS,2011
46/47
Bibliography II
• N. Goix, A. S., S. Clémençon. "Learning the dependence structure of rare events: a non-asymptotic study", COLT, 2015.
• Drees, H. & Sabourin, A. Principal Component Analysis for multivariateextremes, arXiv:1906.11043
• Engelke, S., & Hitz, A. S. Graphical models for extremes. arXiv:1812.01734.
• Gardes, L. (2018). Tail dimension reduction for extreme quantile estimation. Extremes, 21(1), 57-95.
• Goix, N., Sabourin, A., Clémençon, S. (2016). Sparse Representation ofMultivariate Extremes with Applications to Anomaly Ranking. In AISTATS(pp. 75-83).
• N. Goix, A. S., and S. Clémençon. (2017) Sparse representation ofmultivariate extremes with applications to anomaly detection. JMVA
• Simpson, E. S., Wadsworth, J. L., & Tawn, J. A. Determining the DependenceStructure of Multivariate Extremes. arXiv:1809.01606.
• Janssen, A., & Wan, P. k-means clustering of extremes. arXiv:1904.02970.
47/47
More material on DAMEX
1/19
Estimation of the dependence structure: Φ(B) or µ([0, x]c)
• Flexible multivariate models for moderate dimension (d ≈ 5):
Dirichlet mixtures (Boldi, Davison 07; S., Naveau 12), logistic family (Stephenson 09; Fougères et al., 13), pairwise Beta (Cooley et al.), . . .
• Asymptotic theory: rates under second-order conditions (Einmahl, 01); empirical likelihood (Einmahl, Segers 09); asymptotic normality (Einmahl et al., 12, 15) (parametric).
• Finite-sample, non-parametric error bounds on
sup_{x ⪰ R} |µ̂n([0, x]c) − µ([0, x]c)|
(Goix, S., Clémençon, 15).
None of these tell ‘which components may be large together’.
2/19
DAMEX results: support recovery
• Asymmetric logistic, d = 10, dependence parameter α = 0.1 → non-asymptotic data (not exactly generalized Pareto).
• K randomly chosen (asymptotically) non-empty faces.
• Parameters: k = √n, ε = 0.1.
• Heuristic for setting the minimum mass µ0: eliminate faces supporting less than 1% of the total mass.

# sub-cones K             10    15    20    30    35    40    45    50
Aver. # errors (n=5e4)    0.01  0.09  0.39  1.82  3.59  6.59  8.06  11.21
Aver. # errors (n=15e4)   0.06  0.02  0.14  0.98  1.85  3.14  5.23  7.87
3/19
More material on CLEF and variants
4/19
CLEF method: relaxed constraints on the region of interest
Initial regions of interest:
Cα = {v ⪰ 0 : vj large for j ∈ α, vj small for j ∉ α}.  Question: µ(Cα) > 0?
Modified regions (relaxed constraints, larger regions, more points):
{v ⪰ 0 : vj large for j ∈ α},
more precisely
Γα = {v ⪰ 0 : ∀j ∈ α, vj > 1};  µ(Γα) = lim_{t→∞} t P( ∀j ∈ α, Vj > t ).
Alternative question: µ(Γα) > 0?
5/19
Problem statement (CLEF)
Goal: estimate the family of subsets
S̃ = {α ⊂ {1, . . . , d} : µ(Γα) > 0}.
Recall the initial problem: estimate S = {α : µ(Cα) > 0}.
Lemma: ‘equivalence’ of the two problems DAMEX/CLEF
α is a maximal element of S̃ ⟺ α is a maximal element of S.
6/19
Conditional criterion in CLEF
• One needs an empirical criterion for ‘testing’ dependence, µ(Γα) > 0: e.g. µ̂n(Γα) > µ0.
• Issue: µ(Γα) decreases as |α| increases. How to set the threshold according to |α|?
• Way around: condition upon joint exceedance of ‘all but one’ components:
κα = lim_{t→∞} P( ∀j ∈ α, Vj > t | Vj > t for all but at most one j ∈ α ).
Empirical criterion:
κ̂α,t = ( Σ_{i=1}^{n} 1{V̂i,j > t for all j ∈ α} ) / ( Σ_{i=1}^{n} 1{V̂i,j > t for all but at most one j ∈ α} ),
Ŝ = {α : κ̂α,t > κ0 and β ∈ Ŝ for all β ⊂ α};  Ŝ0 = {maximal such α’s}.
7/19
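The empirical criterion κ̂α,t can be sketched directly (illustrative function name and toy data; for simplicity the margins are taken as exactly unit Pareto rather than rank-transformed):

```python
import numpy as np

def kappa_hat(V, alpha, t):
    """CLEF dependence criterion: ratio of joint exceedances of all
    coordinates in alpha to exceedances of all but at most one of them."""
    Va = np.asarray(V, dtype=float)[:, list(alpha)]
    exceed = (Va > t).sum(axis=1)              # how many coords of alpha exceed t
    num = (exceed == len(alpha)).sum()
    den = (exceed >= len(alpha) - 1).sum()
    return num / den if den > 0 else 0.0

rng = np.random.default_rng(2)
n = 100_000
P = 1.0 / rng.uniform(size=(n, 2))
V = np.column_stack([P[:, 0], P[:, 0], P[:, 1]])   # Pareto(1) margins
k_dep = kappa_hat(V, (0, 1), t=100.0)   # asymptotically dependent pair: near 1
k_ind = kappa_hat(V, (0, 2), t=100.0)   # asymptotically independent pair: near 0
```

The conditioning in the denominator is what keeps the criterion on a comparable scale as |α| grows, which a raw threshold on µ̂n(Γα) does not achieve.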
Coping with combinatorial complexity: O(2^d) subsets!
• Good news: µ(Γα) = 0 ⇒ ∀β ⊃ α, µ(Γβ) = 0 → follow the Hasse diagram.
CLEF algorithm (CLustering Extreme Features, Chiapino, S., 16)
• Start with pairs: Â2 = {α : |α| = 2, κ̂(α) > κ0}, with κ̂ a dependence summary (next slide).
• Stage k: Âk = {α : |α| = k, κ̂(α) > κ0}; → candidates for Âk+1:
{α : |α| = k + 1, ∀β ⊂ α s.t. |β| = k, β ∈ Âk}. Not too many!
• If Âk = ∅, return M̂0 = the maximal elements of ∪_{j<k} Âj.
Refined statistical analysis (Chiapino et al., 2019 a.)
How to turn the heuristic stopping criterion ‘κ̂α < κ0’ into astatistical test with controllable asymptotic level?
9/19
Strategy 1: testing H0 : κα > κ0
Theorem (Chiapino, S., Segers, 2019)
Under the conditions from Einmahl et al., 12, th. 4.6 (second order + smoothness of ℓ), if κα > 0,
√k (κ̂α − κα) →_W Zκ,α, a centered Gaussian variable whose variance involves unknown quantities that can be estimated empirically.
• Proof: κα is a function of the (λβ : β ⊂ α), with λβ = µ{v ∈ Rd+ : ∃j ∈ β, vj > 1} (extremal coefficient), + ∆-method.
• Corollary: τα,n = 1{ κ̂α < κ0 + qδ √(σ̂α²/k) }, where qδ is the δ-quantile of N(0, 1), is a test for H0 : κα > κ0 of asymptotic level δ.
10/19
Strategy 2: estimating a tail dependence coefficient ηα
• Issue with strategy 1: κ0 > 0 is arbitrary but unavoidable. Indeed, testing ‘µ(Γα) > 0’ is not manageable in this framework, because the limit distribution degenerates as µ(Γα) → 0 (or as κα → 0).
• Alternative framework: Ledford & Tawn’s model (multivariate case: de Haan, Zhou, 11; Eastoe, Tawn, 12):
P( ∀j ∈ α : Vj > t ) = t^{−1/ηα} Lα(t), where Lα is slowly varying.  (3)
• For our purposes: µ(Γα) > 0 ⇒ (3) holds with ηα = 1.
• Consequence: any test for H̃0 : ηα = 1 (a simple hypothesis) is also a test for H0 : µ(Γα) > 0.
11/19
• Estimation: using a Pickands estimator η̂α,P (Peng, 99, bivariate case) or a Hill estimator η̂α,H (Draisma et al., 04, and Drees, 98 a,b).
• Technical challenge: accounting for unknown margins (V ← V̂, using the empirical cdf).
• Theorems (Chiapino, S., Segers, 2019): under H0, √k(η̂α,P − 1) and √k(η̂α,H − 1) both converge to Gaussian limits whose variances involve unknown quantities that can be estimated empirically.
• Tools for the proofs:
• for η̂α,P: ∆-method + Einmahl et al. (12);
• for η̂α,H: extending Draisma et al. (04) and Drees (98 a,b)’s proofs from the bivariate to the multivariate case.
12/19
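A minimal sketch of the Hill-type estimator of ηα via the structure variable T = min_{j∈α} Vj (illustrative assumptions of ours: exact unit-Pareto margins instead of rank transforms, and a fixed number k of order statistics):

```python
import numpy as np

def eta_hill(V, alpha, k):
    """Hill estimate of the tail dependence coefficient eta_alpha, computed
    on T = min_{j in alpha} V_j (V assumed to have unit-Pareto margins).
    Under asymptotic dependence of the coordinates in alpha, eta_alpha = 1."""
    T = np.asarray(V, dtype=float)[:, list(alpha)].min(axis=1)
    T_sorted = np.sort(T)[::-1]                       # decreasing order statistics
    return np.mean(np.log(T_sorted[:k] / T_sorted[k]))

rng = np.random.default_rng(3)
n = 200_000
P = 1.0 / rng.uniform(size=(n, 2))
V = np.column_stack([P[:, 0], P[:, 0], P[:, 1]])      # Pareto(1) margins
e_dep = eta_hill(V, (0, 1), k=500)   # asymptotic dependence: eta near 1
e_ind = eta_hill(V, (0, 2), k=500)   # independent pair: eta near 1/2
```

For the independent pair, P(min(V1, V2) > t) = t^{−2}, so the structure variable has tail index 2 and the Hill estimate of η is near 1/2; under asymptotic dependence it stays near 1, which is exactly what the test H̃0 : ηα = 1 exploits.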
Experiments: d = 100, asymmetric logistic (AL) model with random perturbation
• Comparison between CLEF and the three considered variants (heuristic stopping criterion replaced with a test).
• M = {α : µ(Γα) > 0} = 80 randomly chosen subsets.
• Perturbation: for each i ≤ n, Xi ∼ perturbed AL distribution: each α ∈ M augmented with a randomly chosen ji ∈ {1, . . . , d} \ α.
• The DAMEX algorithm (Goix, S., Clémençon, 16, 17) fails (miserably).

        # true recovered   # false α ⊂ β ∈ M   # false α ⊃ β ∈ M   # other false
η̂H     79.0 (1.4)         2.4 (3.4)            0.04 (0.2)           18.0 (7.0)
η̂P     79.6 (0.7)         1.0 (2.5)            0. (0.)              3.4 (2.8)
κ̂      71.1 (2.3)         7.4 (4.7)            5.1 (2.1)            28.0 (13.3)
CLEF    69.9 (4.4)         16.2 (8.1)           0.5 (0.6)            2.3 (2.2)
50 datasets, n = 5e4, k = 150, conf. level 0.001; κ0 = 0.05 (CLEF).
Tests based on η̂P are modified to accommodate the case µ̂(Γα) ≤ 0.05.
More on PCA for extremes
14/19
PCA for extremes: uniform risk bound II
• R̂t: empirical risk conditional on {‖X‖ > t}.
• Data-dependent bound (with empirical moments instead of St):
Theorem 4 (Drees, S., 20++)
For all ℓ > 1, u, v > 0,
P( sup_{S ∈ Ep} |R̂t(S) − Rt(S)| ≥ [ (p ∧ (d − p)) ( S̃t/(ℓ − 1) + v/ℓ ) ]^{1/2} + u | Nt = ℓ ) ≤ 2 exp(−2ℓu²) + exp(−⌊ℓ/2⌋ v²/2),
with S̃t := Nt^{−1} Σ_{i=1}^{n} ‖Θi,t‖⁴ − tr( (Nt^{−1} Σ_{i=1}^{n} Θi,t Θi,t⊤)² ) and Θi,t = Θi 1{‖Xi‖ > t}.
(Adapting arguments from Blanchard et al., 07, conditioning upon ‖X‖ > t.)
• Corollary: confidence interval for Rt.
15/19
Mean distance ρ(Ŝn,k, S0) vs. k
[Figure: |||Π_{Ŝn,k} − Π_{S0}||| as a function of k, in the Dirichlet model with parameter 3, p = 2 and d = 10]
• Performance deteriorates quickly → choose kPCA = 10 < k (angular measure).
16/19
PCA: choice of p̂, p = 5, d = 100
[Figure: mean empirical risk for PCA projecting onto a subspace of dimension 1 ≤ p̃ ≤ 10 (left), and distance ρ(Ŝn,k, S0) with p̂ = p (right), versus k, in the Dirichlet model with parameter 3]
• p̂ ∈ {4, 5, 6}: reasonable. kPCA should remain small (again).
17/19
RMSEs for probabilities (i)–(iv), Dirichlet model, d = 100, p = 5
[Figure: estimators based on Ĥ_{n,k} (black), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), vs. k]
18/19
RMSEs for probabilities (i)–(iv), randomly rotated Dirichlet model, d = 10, p = 3
[Figure: estimators based on Ĥ_{n,k} (black, solid), Ĥ^{PCA}_{n,k} (blue, dashed) and Ĥ^{PCA}_{n,k,10} (red, dash-dotted), vs. k]
19/19