Applied Multivariate Statistics Spring 2012 - ETH Zurich
Finding Multivariate Outlier
Applied Multivariate Statistics – Spring 2012
Goals
Concept: Detecting outliers with (robustly) estimated
Mahalanobis distance and QQ-plot
R: chisq.plot, pcout from package “mvoutlier”
Outlier in one dimension - easy
Look at scatterplots
Find dimensions of outliers
Find extreme samples just in these dimensions
Remove outlier
2d: More tricky
[Figure: scatterplot with one point labelled "Outlier"; it is no outlier in x or y alone]
Recap: Mahalanobis distance
True Mahalanobis distance:
$$\mathrm{MD}(x) = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
Estimated Mahalanobis distance:
$$\widehat{\mathrm{MD}}(x) = \sqrt{(x-\hat{\mu})^T \hat{\Sigma}^{-1} (x-\hat{\mu})}$$
Sq. Mahalanobis distance $\mathrm{MD}^2(x)$ = sq. distance from mean in standard deviations IN DIRECTION OF X
Mahalanobis distance: Example
$$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
(20, 0): MD = 4
(0, 10): MD = 10
(10, 7): MD = 7.3
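The three example values can be checked with a few lines of code. The following is a minimal Python sketch (the slides themselves use R); the helper name `mahalanobis_2d` is made up for this illustration and only handles the 2-d case.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2-d point x from mean mu under covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

SIGMA = [[25.0, 0.0], [0.0, 1.0]]
MU = [0.0, 0.0]

# The three points from the example slides.
example_md = {p: mahalanobis_2d(p, MU, SIGMA) for p in [(20, 0), (0, 10), (10, 7)]}
for point, md in example_md.items():
    print(point, round(md, 1))  # rounds to 4.0, 10.0 and 7.3
```

The point (20, 0) is far from the mean in Euclidean terms but only 4 standard deviations away in its direction, while (0, 10) is closer in Euclidean terms but 10 standard deviations away.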
Theory of Mahalanobis Distance
Assume data is multivariate normally distributed
(d dimensions)
Squared Mahalanobis distance of samples follows a Chi-Square distribution
with d degrees of freedom
(“By definition”: Sum of d squared independent standard normal random variables has
Chi-Square distribution with d degrees of freedom.)
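This claim can be illustrated by simulation. The Python sketch below assumes d = 2 with mean 0 and identity covariance, so the squared Mahalanobis distance is simply the sum of two squared standard normal draws; the sample size and seed are arbitrary choices. For d = 2 the Chi-Square quantile has a closed form, which avoids a statistics library.

```python
import math
import random

random.seed(0)
d = 2
n = 20000

# Squared Mahalanobis distance of a standard normal vector:
# the sum of d squared independent N(0, 1) draws.
sq_md = [sum(random.gauss(0, 1) ** 2 for _ in range(d)) for _ in range(n)]

# A Chi-Square distribution with d degrees of freedom has mean d.
mean_sq_md = sum(sq_md) / n

# For d = 2 the Chi-Square cdf is 1 - exp(-q / 2), so the quantile is explicit.
q975 = -2 * math.log(1 - 0.975)
frac_below = sum(v <= q975 for v in sq_md) / n

print(round(mean_sq_md, 2), round(q975, 3), round(frac_below, 3))
```

The mean should come out close to 2 and roughly 97.5% of the squared distances should fall below the quantile.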
Check for multivariate outlier
Are there samples with estimated Mahalanobis distance
that don’t fit at all to a Chi-Square distribution?
Check with a QQ-Plot
Technical details:
- Chi-Square distribution is still reasonably good for
(squared) estimated Mahalanobis distance
- use robust estimates for $\mu, \Sigma$
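The coordinates of such a QQ-plot can be sketched as follows. This Python example is only an illustration: it assumes d = 2, uses hypothetical simulated data with known mean 0 and identity covariance instead of robust estimates, and the helper `chisq2_qq_points` is an invented name. Plotting the returned pairs is left to any plotting tool.

```python
import math
import random

def chisq2_qq_points(sq_md):
    """Pair sorted squared distances with Chi-Square(2) quantiles for a QQ-plot."""
    n = len(sq_md)
    observed = sorted(sq_md)
    # Plotting positions (i - 0.5) / n; the Chi-Square(2) quantile is -2 * log(1 - p).
    theoretical = [-2 * math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

random.seed(1)
# 99 clean 2-d standard normal samples (hypothetical data) ...
sq_md = [random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2 for _ in range(99)]
# ... plus one planted outlier at (6, 6), i.e. squared distance 72.
sq_md.append(6.0 ** 2 + 6.0 ** 2)

qq = chisq2_qq_points(sq_md)
top_theoretical, top_observed = qq[-1]
print(round(top_theoretical, 2), round(top_observed, 2))
```

The largest observed value (72) lies far above its theoretical quantile (about 10.6), which is the kind of point that "sticks out" of the line in the QQ-plot.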
Robust Estimates: Income of 7 people
[Figures: robust scatter; Std. Dev. vs. Robust Std. Dev.]
Robust Estimates for outlier detection
If scatter is estimated robustly, outliers “stick out” much
more
Robust Mahalanobis distance:
Mean and covariance matrix estimated robustly
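A one-dimensional Python sketch shows the effect. The incomes below are hypothetical (not the values from the slides), and median and scaled MAD are used as one common choice of robust location and scale; in one dimension the Mahalanobis distance reduces to the absolute deviation divided by the scale.

```python
import statistics

# Hypothetical incomes of 7 people (not the slide data); the last one is extreme.
incomes = [40, 42, 45, 47, 50, 52, 500]

# Classical estimates: both are pulled towards the extreme value.
mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)

# Robust estimates: median and MAD.
median = statistics.median(incomes)
# MAD scaled by 1.4826 estimates the standard deviation under normality.
mad = 1.4826 * statistics.median([abs(v - median) for v in incomes])

classical_dist = abs(500 - mean) / sd
robust_dist = abs(500 - median) / mad
print(round(classical_dist, 2), round(robust_dist, 2))
```

With the classical estimates the extreme income is only a bit more than 2 standard deviations from the mean, because it inflates the standard deviation itself; with the robust estimates it is dozens of scale units away.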
Example - continued
Outlier easily detected!
Outliers in >2d can be well hidden!
No outlier, right?
Wrong!
This outlier can’t be seen in the scatterplot matrix (but in a 3d plot)
Method 1: Quantile of Chi-Square distribution
Compute for each sample (in d dimensions) the robustly
estimated squared Mahalanobis distance $\mathrm{MD}^2(x_i)$
Compute the 97.5%-Quantile Q of the Chi-Square
distribution with d degrees of freedom
All samples with $\mathrm{MD}^2(x_i) > Q$ are declared outliers
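The cutoff rule can be sketched in Python as below. Note that the Chi-Square quantile refers to the squared distance, so squared distances are compared with Q. The sketch assumes d = 2 (where the quantile has a closed form; general d needs a statistics library) and takes the squared distances as given; the function names and the fourth data value are invented for this illustration.

```python
import math

def chisq2_quantile(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom."""
    return -2 * math.log(1 - p)

def flag_outliers(sq_md, p=0.975):
    """Indices of samples whose squared distance exceeds the Chi-Square quantile."""
    cutoff = chisq2_quantile(p)
    return [i for i, v in enumerate(sq_md) if v > cutoff]

# Squared distances of the points (20, 0), (0, 10), (10, 7) under
# Sigma = diag(25, 1) and mu = 0, plus one hypothetical point near the mean.
sq_md_points = [16.0, 100.0, 53.0, 1.2]

flagged = flag_outliers(sq_md_points)
print(round(chisq2_quantile(0.975), 2), flagged)  # 7.38 [0, 1, 2]
```

In practice the squared distances would come from robust estimates of mean and covariance rather than from known parameters.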
Method 2: Adjusted Quantile
Adjusted Quantile for outlier: Depends on distance
between cdf of Chi-Square and ecdf of samples in tails
Simulate “normal” deviations in the tails
Outliers have “abnormally large” deviations in the tails
(e.g. more than seen in 100 simulations without outliers)
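One possible reading of the "distance between cdf and ecdf in the tails" is the largest amount by which the Chi-Square cdf exceeds the ecdf beyond the 97.5% quantile. The Python sketch below computes only that deviation for d = 2 on constructed data; the simulation step that decides whether the deviation is "abnormally large" is omitted, and nothing here is claimed to match the internals of any particular R function.

```python
import bisect
import math

def chisq2_cdf(u):
    """cdf of the Chi-Square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-u / 2.0)

def tail_deviation(sq_md, delta):
    """Largest amount by which the Chi-Square(2) cdf exceeds the ecdf beyond delta."""
    xs = sorted(sq_md)
    n = len(xs)
    worst = 0.0
    for x in xs:
        if x > delta:
            # Just below x the ecdf equals the share of samples strictly smaller than x.
            ecdf_left = bisect.bisect_left(xs, x) / n
            worst = max(worst, chisq2_cdf(x) - ecdf_left)
    return worst

delta = -2 * math.log(1 - 0.975)  # 97.5% quantile of Chi-Square(2)

# Idealised clean sample: 90 squared distances placed exactly on Chi-Square(2) quantiles.
clean = [-2 * math.log(1 - (i - 0.5) / 90) for i in range(1, 91)]
# Hypothetical contamination: 10 samples with squared distances far out in the tail.
contaminated = clean + [50.0 + k for k in range(10)]

dev_clean = tail_deviation(clean, delta)
dev_contaminated = tail_deviation(contaminated, delta)
print(round(dev_clean, 3), round(dev_contaminated, 3))
```

In this constructed example the tail deviation rises from below 0.01 for the clean sample to roughly 0.10 once 10% of the samples sit far out in the tail.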
ECDF leaves “plausible” range
Defines adaptive cutoff
Function “aq.plot”
Method 3: State of the art - pcout
Complex method based on robust principal components
Pretty involved methodology
Very fast – good for high dimensions
R: Function “pcout” in package “mvoutlier”
$wfinal01: 0 marks an outlier
$wfinal: Smaller values indicate more severe outliers
P. Filzmoser, R. Maronna, M. Werner. Outlier identification
in high dimensions, Computational Statistics and Data
Analysis, 52, 1694-1711, 2008
Automatic outlier detection
It is always better to look at a QQ-plot to find outliers!
Just find points “sticking out”; no distributional assumption
If you can’t: Automatic outlier detection
- usually finds too many or too few outliers depending on
parameter settings
- depends on distribution assumptions
(e.g. multivariate normality)
+ good for screening of large amounts of data
Concepts to know
Find multivariate outliers with robustly estimated
Mahalanobis distance
Cutoff
- by eye (best method)
- quantile of Chi-Square distribution
R commands to know
chisq.plot, pcout in package “mvoutlier”
Next week
Missing values