Multivariate outlier detection
Estimating Distance Distributions and Testing Observation Outlyingness for Complex Surveys
Jianqiang Wang
Major Professor: Jean Opsomer
Committee: Wayne A. Fuller
Song X. Chen
Dan Nettleton
Dimitris Margaritis
Outline
- Introduction
- Notation and assumptions
- Mean, median-based inference
- Variance estimation
- Simulation study
- Application in National Resources Inventory
- Theoretical extensions
Structure of survey data
- Many finite populations targeted by surveys consist of homogeneous subpopulations.
- "Homogeneity" refers to the variables being collected, which are generally different from the design variables.
- Example: interested in the health condition of U.S. residents between 45 and 60 years old, we stratify by county, and homogeneity refers to the health condition variables we collect.
Conceptual ideas
- Given this population structure, provide a measure of outlyingness and flag unusual points.
- Assign each point to a subpopulation and define a suitable measure of outlyingness.
- Reduce dimension, describe multivariate populations, identify outliers, and discriminate objects.
Outlier identification procedure (1)
- Identify the target variables on which we want to test outlyingness.
- Partition the population into a number of relatively homogeneous groups.
- Define a measure of subpopulation center and a distance metric from each point to its subpopulation center.
- Define the outlyingness of each point as the fraction of points with a less extreme distance in its subpopulation.
Outlier identification procedure (2)
- Estimate the distance distribution and the outlyingness of each point.
- Flag observations whose measure of outlyingness exceeds a prespecified threshold (e.g., 0.95 or 0.98).
- Make decisions on the list of suspicious points.
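The steps above can be sketched in code. This is a minimal equal-weight illustration (design weights are ignored here; the function name and the choice of the Euclidean distance are mine, not prescribed by the slides):

```python
import numpy as np

def outlyingness(y, groups, threshold=0.95):
    """Steps 3-4 of the procedure: distance of each point to its
    subpopulation center, then outlyingness = fraction of same-group
    points with a strictly smaller distance (equal weights assumed)."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    out = np.empty(len(y))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        center = y[idx].mean(axis=0)                  # subpopulation mean vector
        d = np.linalg.norm(y[idx] - center, axis=1)   # Euclidean distance metric
        # fraction of same-group points with a less extreme distance
        out[idx] = (d[:, None] > d[None, :]).mean(axis=1)
    flagged = np.where(out > threshold)[0]
    return out, flagged
```

A point far from its group center has outlyingness near 1 and is flagged once it exceeds the threshold.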
Inference in survey sampling
- The target measure of outlyingness is defined at the finite population level.
- Two mechanisms:
  - Mechanism for generating the finite population
  - Mechanism for drawing a sample
- Condition on the finite population and use design-based inference.
- Asymptotic theory in survey sampling:
  - Sequence of finite populations
  - Sequence of sampling designs
Sequence of finite populations
- Let $\nu$ be the population index.
- Associated with the $i$-th population element is a $p$-dimensional vector $y_i = (y_{i,1}, \dots, y_{i,p})$, with inclusion probability $\pi_i$.
- Finite populations $U_\nu = \{1, 2, \dots, N_\nu\}$ of sizes $\{N_\nu\}$, and samples $A_\nu = \cup_{g=1}^G A_{\nu,g}$, with expected subpopulation sample size $n^*_{\nu g} = E(n_{\nu g} \mid F_\nu)$.
- Population composition $U_\nu = \cup_{g=1}^G U_{\nu g}$, with $N_g = f_{gN} N$ and $f_{gN} \in [f_L, f_H]$.
- Assume $G$ and the subpopulation association of each element are known.
- Let $F_\nu$ be the power set of $\{y_1, y_2, \dots, y_{N_\nu}\}$.
Sequence of sampling designs
- A probability sample $A_N$ is drawn from $U_N$ with respect to some measurable design.
- Associate a sampling indicator $I_{(i \in A_N)}$ with each element.
- Inclusion probabilities:
  $\pi_i = \Pr(i \in A_N) = E(I_{(i \in A_N)} \mid F_N)$,
  $\pi_{ij} = \Pr(i, j \in A_N) = E(I_{(i \in A_N)} I_{(j \in A_N)} \mid F_N)$.
- Sample size $n$ with expectation $n^* = E(n \mid F_N)$.
Examples of sampling designs and estimators
- Simple random sampling without replacement:
  $\pi_i = \frac{n}{N}$, and $\pi_{ij} = \frac{(n-1)n}{(N-1)N}$ for $i \ne j$.
- Poisson sampling: arbitrary $\pi_i$, with
  $\pi_{ij} = \pi_i \pi_j$ for $i \ne j$ and $\pi_{ii} = \pi_i$.
- Horvitz-Thompson and Hajek estimators of the mean.
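The two mean estimators mentioned above can be sketched as follows (a minimal scalar illustration; the function names are mine):

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the mean: (1/N) * sum_i y_i / pi_i.
    Requires the known population size N."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return np.sum(y / pi) / N

def hajek_mean(y, pi):
    """Hajek estimator: replaces N by its estimate N_hat = sum_i 1/pi_i."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return np.sum(y / pi) / np.sum(1.0 / pi)
```

Under simple random sampling without replacement, where every $\pi_i = n/N$, both reduce to the ordinary sample mean; they differ under unequal-probability designs.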
Norms
- Use the notion of a norm to quantify the distance between an observation and the measure of center.
- A norm $\|\cdot\| : \mathbb{R}^p \to \mathbb{R}_+$ satisfies:
  - Non-degeneracy: $\|\mu\| = 0 \Leftrightarrow \mu = 0$
  - Homogeneity: $\|\alpha\mu\| = |\alpha| \|\mu\|$
  - Triangle inequality: $\|\mu_1 + \mu_2\| \le \|\mu_1\| + \|\mu_2\|$
Examples of norms and unit circles
- Manhattan distance: $L_1$: $\|\mu\|_1 = \sum_{i=1}^p |\mu_i|$
- Euclidean distance: $L_2$: $\|\mu\|_2 = \sqrt{\sum_{i=1}^p \mu_i^2}$
- Supremum norm: $L_\infty$: $\|\mu\|_\infty = \max\{|\mu_1|, \dots, |\mu_p|\}$
- Quadratic norm: $L_A$: $\|\mu\|_A = \sqrt{\mu' A \mu}$
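The four norms above are one-liners in code (a small sketch; the function names are mine):

```python
import numpy as np

def norm_l1(u):
    """Manhattan distance."""
    return np.sum(np.abs(u))

def norm_l2(u):
    """Euclidean distance."""
    return np.sqrt(np.sum(np.square(u)))

def norm_sup(u):
    """Supremum (L-infinity) norm."""
    return np.max(np.abs(u))

def norm_quadratic(u, A):
    """Quadratic norm sqrt(u' A u); A must be positive definite."""
    u = np.asarray(u, float)
    return np.sqrt(u @ np.asarray(A, float) @ u)
```

With $A$ equal to the identity matrix, the quadratic norm reduces to the Euclidean norm; with $A$ an inverse covariance matrix it gives the Mahalanobis distance used later.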
Distribution of population distances
- Population: $D_{\nu,d}(\mu_\nu) = \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu_\nu\| \le d)$, where $\mu_\nu$ is the location and $d$ the radius.
- Sample: $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$, where $\hat{N}_\nu = \sum_{A_\nu} \frac{1}{\pi_i}$.
- Measure of center: the mean vector
  - Population: $\mu_\nu = \frac{1}{N_\nu} \sum_{U_\nu} y_i$
  - Sample: $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{y_i}{\pi_i}$
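The sample estimator of the distance distribution can be sketched directly from these definitions (Euclidean norm assumed; the function name is mine):

```python
import numpy as np

def estimated_distance_distribution(y, pi, d):
    """Sample estimator of the distance distribution:
    D_hat = (1/N_hat) * sum_i (1/pi_i) * I(||y_i - mu_hat|| <= d),
    with N_hat = sum_i 1/pi_i and mu_hat the design-weighted mean vector."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi                                   # design weights
    N_hat = w.sum()
    mu_hat = (w[:, None] * y).sum(axis=0) / N_hat  # weighted mean vector
    dist = np.linalg.norm(y - mu_hat, axis=1)
    return np.sum(w * (dist <= d)) / N_hat
```

Evaluated over a grid of radii $d$, this traces out the estimated distance distribution function for a subpopulation.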
Bivariate population
[Figure: a bivariate population with the circle $\{y : \|y - \mu\| = d\}$, illustrating $D_{\nu,d}(\mu_\nu)$ and $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$.]
Nondifferentiability with respect to location
[Figure]
General design assumptions
- Assumptions on $\pi_i$, $\pi_{ij}$, and the design variance:
  - $K_L \le \frac{N}{n}\pi_i \le K_U$
  - $n = O_p(N_\nu^{\beta})$, with $\beta \in (\frac{2p}{2p+1}, 1]$
- For any vector $z$ with finite $2+\delta$ moments, define $\bar{z}_{N,\pi}$ as the HT estimator of the mean, and assume
  $\mathrm{Var}(\bar{z}_{N,\pi} \mid F_N) \le K_1 \mathrm{Var}_{SRS}(\bar{z}_{N,SRS} \mid F_N)$.
- For any $z$ with positive definite population variance-covariance matrix and finite fourth moments,
  $[V(\bar{z}_{N,\pi} \mid F_N)]^{-1} \hat{V}_{HT}\{\bar{z}_{N,\pi}\} - I_{p \times p} = O_p(n^{*-1/2})$
  and
  $n^{*1/2}(\bar{z}_{N,\pi} - \bar{z}_N) \mid F_N \xrightarrow{d} N(0, \Sigma_{zz})$.
Application specific assumption 1
- The population distance distribution converges to a limiting function:
  $\lim_{N \to \infty} D_{\nu,d}(\mu) = D_d(\mu)$, where $(d, \mu) \in [0, \infty) \times \mathbb{R}^p$.
- The limiting function $D_d(\mu)$ is continuous in $d \in [0, \infty)$ and $\mu \in \mathbb{R}^p$, with finite derivatives $\frac{\partial D_d(\mu)}{\partial d}$, $\frac{\partial D_d(\mu)}{\partial \mu}$ and $\frac{\partial^2 D_d(\mu)}{\partial \mu^2}$.
- The norm $\|\cdot\|$ is continuous on $\mathbb{R}^p$, with a continuous derivative $\psi(\cdot)$ and bounded second derivative matrix $H_s(\cdot)$.
Application specific assumption 2
- The population quantity
  $\sqrt{\frac{N_\nu}{n}} \left\{ \frac{1}{N_\nu} \sum_{U_\nu} I(d < \|y_i - \mu\| \le d + h_{N_\nu}) - \frac{\partial D_d(\mu)}{\partial d} h_{N_\nu} \right\} \to 0$,
  where $h_\nu = O(N_\nu^{-\alpha})$ and $\alpha \in [\frac{1}{4}, 1)$.
- Justification assumes a probabilistic model.
- Proof: Markov's inequality, Borel-Cantelli lemma.
Application specific assumption 3
- The population quantity
  $\frac{n_\nu^{*1/2}}{N_\nu} \sum_{U_\nu} \left[ I(\|y_i - \mu - n_\nu^{*-1/2} s\| \le d) - I(\|y_i - \mu\| \le d) - D_d(\mu + n_\nu^{*-1/2} s) + D_d(\mu) \right]$
  converges to 0 uniformly for $s \in C_s$ and $\mu \in \mathbb{R}^p$.
- Justification assumes a probabilistic model.
- Proof: empirical process theory.
Design consistency
- Decomposition:
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right) = n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right) + n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) + n_\nu^{*1/2}\left( D_d(\hat{\mu}_\nu) - D_d(\mu_\nu) \right)$
- Intermediate results:
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right) \xrightarrow{p} 0$ and
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) \mid F_\nu = O_p(1)$
- Consistency: $\hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \xrightarrow{p} 0$.
Asymptotic normality
- Let $b_{\mu,i} = (I(\|y_i - \mu_\nu\| \le d), 1, y_i')'$ and let $\Sigma_{\mu,d}$ be the design variance-covariance matrix of the HT estimator of the mean of $b_{\mu,i}$.
- Asymptotic normality:
  $\left( a_\mu' \Sigma_{\mu,d} a_\mu \right)^{-1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right) \mid F_\nu \xrightarrow{d} N(0, 1)$,
  where
  $a_\mu = \left[ 1, -D_{\nu,d}(\mu_\nu) - \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}' \mu_\nu, \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}' \right]'$.
- The slide distinguishes the cases where the subpopulation size and mean are assumed known, where the subpopulation size is unknown, and where the subpopulation mean is unknown; the components of $a_\mu$ account for the unknown quantities.
Multivariate median
- Mean vector.
- Generalized median:
  - Population: $q_\nu = \arg\inf_q \sum_{U_\nu} \|y_i - q\|$
  - Sample: $\hat{q}_\nu = \arg\inf_q \sum_{A_\nu} \frac{1}{\pi_i} \|y_i - q\|$
- Existence and uniqueness.
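The minimization defining the generalized median has no closed form under the Euclidean norm. One standard way to compute it numerically is the Weiszfeld iteration; this is a sketch (the slides do not prescribe an algorithm, and the tolerance guard is mine):

```python
import numpy as np

def weighted_geometric_median(y, w, tol=1e-10, max_iter=1000):
    """Weiszfeld iteration for argmin_q sum_i w_i ||y_i - q||;
    take w_i = 1/pi_i for the design-weighted sample version."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    q = (w[:, None] * y).sum(axis=0) / w.sum()        # start from the weighted mean
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(y - q, axis=1), 1e-12)  # avoid division by zero
        u = w / dist
        q_new = (u[:, None] * y).sum(axis=0) / u.sum()
        if np.linalg.norm(q_new - q) < tol:
            break
        q = q_new
    return q
```

Each step is a weighted average with weights inversely proportional to the current distances, which is what makes the median resistant to outlying points.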
Multivariate median: estimating equations
- Population: $\sum_{U_\nu} \psi(y_i - q) = 0$
- Sample: $\sum_{A_\nu} \frac{1}{\pi_i} \psi(y_i - q) = 0$
- Linearization of $\hat{q}_\nu$:
  $\hat{q}_\nu = q_\nu + \left[ \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{H_s(y_i - q_\nu)}{\pi_i} \right]^{-1} \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(n_\nu^{*-1/2})$
- What if the estimating equation is not differentiable?
Median-based distances: asymptotic results
- Design consistency and asymptotic normality of $\hat{q}_\nu$ for $q_\nu$.
- Design consistency and asymptotic normality of $\hat{D}_{\nu,d}(\hat{q}_\nu)$ as an estimator of $D_{\nu,d}(q_\nu)$.
Mahalanobis distances
- Mean and median-based inference.
- Choose an appropriate norm to match the shape of the underlying multivariate distribution.
- Estimate the variance-covariance matrix or another shape measure of the subpopulation, and use the Mahalanobis distance.
- Estimate the distribution of Mahalanobis distances.
- See the application section for more details.
Naive variance estimator
- Use the mean-based case to explain the variance estimators.
- Recall the asymptotic variance of $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$:
  $V\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = a_\mu' \Sigma_{\mu,d} a_\mu$,
  where
  $a_\mu = \left( 1, -D_{\nu,d}(\mu_\nu) - \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}' \mu_\nu, \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}' \right)'$.
- Claim: the extra variance due to estimating the center can be ignored for elliptical distributions under a quadratic norm.
- Naive variance estimator, ignoring the gradient vector:
  $\hat{\sigma}^2_{\mu,d,\mathrm{naive}} = \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) \hat{\Sigma}_{\mu,d} \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)'$
Estimating the gradient vector by kernel smoothing
- Idea: estimate $D_d(\mu) = \lim_{\nu \to \infty} \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu\| \le d)$ by
  $\frac{1}{\hat{N}_\nu} \sum_{A_\nu} K\left( \frac{d - \|y_i - \mu\|}{h} \right) \frac{1}{\pi_i}$,
  where $K(\cdot) = \int k(t)\,dt$ is an integrated kernel, e.g., the CDF of the standard normal.
- Kernel estimator of the gradient:
  $\hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu h} \sum_{A_\nu} k\left( \frac{d - \|y_i - \hat{\mu}_\nu\|}{h} \right) \psi(y_i - \hat{\mu}_\nu) \frac{1}{\pi_i}$
- Design consistent for $\frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}$ under mild assumptions.
- Jackknife variance estimation has been proposed for the mean-based case.
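The kernel gradient estimator is short to code. In this sketch I assume a Gaussian kernel $k$ and the Euclidean norm, whose derivative is $\psi(u) = u/\|u\|$ (both are my choices for illustration; the slides leave the kernel and norm generic):

```python
import numpy as np

def kernel_gradient_estimate(y, pi, mu_hat, d, h):
    """Kernel-smoothed estimate of the gradient of D_d(mu) at mu_hat:
    (1/(N_hat*h)) * sum_i k((d - ||y_i - mu_hat||)/h) psi(y_i - mu_hat) / pi_i."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    mu_hat = np.asarray(mu_hat, float)
    w = 1.0 / pi
    N_hat = w.sum()
    r = y - mu_hat
    dist = np.maximum(np.linalg.norm(r, axis=1), 1e-12)
    k = np.exp(-0.5 * ((d - dist) / h) ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    psi = r / dist[:, None]                                          # Euclidean norm derivative
    return ((w * k)[:, None] * psi).sum(axis=0) / (N_hat * h)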
Jackknife variance estimator
- Recall $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$.
- Recalculate the mean for each jackknife replicate? Inconsistent!
- Proposed idea: incorporate an estimated gradient vector in the replication estimation.
- For the $l$-th replicate sample, calculate
  $\hat{D}^{(l)}(\hat{\mu}_\nu) = \hat{D}^{(l)}_{\nu,d}(\hat{\mu}_\nu) + \hat{\zeta}_{\nu,d}(\hat{\mu}_\nu)'(\hat{\mu}^{(l)}_\nu - \hat{\mu}_\nu)$
  and use
  $\hat{V}_{JK}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = \sum_{l=1}^L c_l \left( \hat{D}^{(l)}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)^2$
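A delete-one version of the corrected jackknife can be sketched as follows. The replicate coefficients $c_l = (n-1)/n$ and the Euclidean norm are my assumptions for illustration; `grad` stands in for the kernel-smoothed gradient estimate:

```python
import numpy as np

def jackknife_variance(y, pi, d, grad):
    """Delete-one jackknife for D_hat(mu_hat): each replicate keeps the
    full-sample center but adds the gradient correction
    grad' * (mu_hat^(l) - mu_hat)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    grad = np.asarray(grad, float)
    n, w = len(y), 1.0 / pi

    def weighted_mean(wts):
        return (wts[:, None] * y).sum(axis=0) / wts.sum()

    mu_full = weighted_mean(w)
    inside = np.linalg.norm(y - mu_full, axis=1) <= d   # indicator at the full-sample center
    D_full = np.sum(w * inside) / w.sum()

    reps = np.empty(n)
    for l in range(n):
        wl = w.copy()
        wl[l] = 0.0                                     # drop element l
        D_l = np.sum(wl * inside) / wl.sum()            # center held fixed at mu_full
        reps[l] = D_l + grad @ (weighted_mean(wl) - mu_full)  # gradient correction
    return (n - 1) / n * np.sum((reps - D_full) ** 2)
```

Setting `grad` to zero recovers the naive jackknife that ignores the effect of estimating the center.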
Simulation study
- Goals of the simulation study:
  - Assess asymptotic properties of the estimators.
  - Compare the naive variance estimator with the kernel estimator.
- Simulation parameters:
  - $p = 2$, $G = 5$.
  - Subpopulations 1-4 are elliptically contoured; subpopulation 5 is skewed.
  - Stratified SRS.
  - Norm: Euclidean.
Simulated population
[Figure]

Subpopulation distance distribution functions
[Figure]
Effect of estimating the center (N=5000, n=1000, G=5)

                              Cluster 4            Cluster 5
d                           1.00  1.41  2.45    1.00  1.41  2.45
D_{ν,d}(μ_ν)                0.44  0.54  0.71    0.31  0.52  0.85
bias(D̂(μ̂)) / sd(D̂(μ̂))  -0.11  0.00 -0.00   -0.00 -0.00  0.05
bias(D̂(μ)) / sd(D̂(μ))    -0.01  0.00 -0.01    0.00  0.00  0.01
sd(D̂(μ̂)) / sd(D̂(μ))      1.03  1.00  1.00    1.30  1.13  1.00
Effect of estimating the center (N=5000, n=200, G=5)

                              Cluster 4            Cluster 5
d                           1.00  1.41  2.45    1.00  1.41  2.45
D_{ν,d}(μ_ν)                0.43  0.53  0.68    0.35  0.55  0.88
bias(D̂(μ̂)) / sd(D̂(μ̂))  -0.28 -0.05  0.12    0.05  0.14  0.10
bias(D̂(μ)) / sd(D̂(μ))     0.03  0.04  0.06    0.00 -0.01  0.02
sd(D̂(μ̂)) / sd(D̂(μ))      1.17  1.03  1.01    1.16  1.04  0.96
Average estimated variance relative to MC variance (N=5000, n=1000, G=5)

                               Cluster 4            Cluster 5
d                            1.00  1.41  2.45    1.00  1.41  2.45
D_{ν,d}(μ_ν)                 0.44  0.54  0.71    0.31  0.52  0.85
σ²_{d,NV} / σ²_{d,MC}        0.94  1.00  1.00    0.53  0.78  1.07
σ²_{d,SM} / σ²_{d,MC} (h=0.1) 1.21  1.15  1.12    1.00  1.01  1.04
σ²_{d,SM} / σ²_{d,MC} (h=0.4) 1.07  1.06  1.04    0.85  0.94  0.98
NRI application
- Introduction to the NRI.
- Outlier identification for a longitudinal survey.
- Strategy for initial partitioning in the NRI.
- How to define Mahalanobis distances.
- Analysis of identified points.
National Resources Inventory (1)
- The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
- Conducted by the USDA NRCS, in cooperation with the CSSM at Iowa State University.
- Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
- Information was updated every 5 years before 1997, and annually since then through a partially overlapping subsampling design.
National Resources Inventory (2)
- Covers various aspects of land use, farming practice, and environmentally important variables like wetland status and soil erosion.
- Measures both level and change over time in these variables.
- The primary mode of data collection is a combination of aerial photography and field collection.
- Outliers arise from errors in data collection or processing, or from real points that genuinely behave abnormally.
Outlier identification for a longitudinal survey
- Identify outliers for periodically updated data.
- Build outlier identification rules on previous years' data and use the rules to flag current observations.
- Observed years: 2001-2005.
- Training set: (2001, 2002, 2003). Test set: (2003, 2004, 2005).
Target variables
- Non-pseudo core points with soil erosion in years 2001-2005.
- Variables: broad use, land use, USLE C factor, support practice factor, slope, slope length, and USLE loss.
- USLE loss represents the potential long-term soil loss in tons/acre:
  USLE loss = R * K * LS * C * P
Point classification

b.u.  Point type                 b.u.  Point type
1     Cultivated cropland         7    Urban and built-up land
2     Noncultivated cropland      8    Rural transportation
3     Pastureland                 9    Small water areas
4     Rangeland                  10    Large water areas
5     Forest land                11    Federal land
6     Minor land                 12    CRP
Initial partitioning
- Initial partitioning uses geographical association and broad use category.
- Partition national data into state-wise categories; collapse the northeastern states.
- Partition each region, based on the broad use sequence, into (1,1,1), (2,2,2), (3,3,3), ..., (12,12,12) and points with broad use change.
- Merge points with the same broad use change pattern, e.g., (2,2,3), (1,1,12).
Defining distances
- Estimate the subpopulation mean vector and covariance matrix:
  $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} \frac{y_i}{\pi_i}$,
  $\hat{\Sigma}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} (y_i - \hat{\mu}_\nu)(y_i - \hat{\mu}_\nu)' \frac{1}{\pi_i}$
- Calculate the distance to the center:
  $\|y_i - \hat{\mu}_\nu\|_{\hat{\Sigma}_\nu} = \sqrt{(y_i - \hat{\mu}_\nu)' \hat{\Sigma}_\nu^- (y_i - \hat{\mu}_\nu)}$
- The inverse matrix $\hat{\Sigma}_\nu^-$ is defined through a principal value decomposition.
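The construction above can be sketched in code. I use an eigendecomposition and drop near-zero eigenvalues to form the pseudo-inverse; the tolerance and function name are my assumptions, not taken from the slides:

```python
import numpy as np

def mahalanobis_distances(y, pi, tol=1e-10):
    """Design-weighted Mahalanobis distances, with the inverse covariance
    defined through an eigendecomposition: eigenvalues below `tol` are
    dropped, giving a pseudo-inverse."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    N_hat = w.sum()
    mu = (w[:, None] * y).sum(axis=0) / N_hat            # weighted mean vector
    r = y - mu
    S = (r * w[:, None]).T @ r / N_hat                   # weighted covariance matrix
    vals, vecs = np.linalg.eigh(S)
    keep = vals > tol
    S_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    return np.sqrt(np.einsum('ij,jk,ik->i', r, S_pinv, r))
```

Dropping small eigenvalues keeps the distance well defined even when the estimated covariance matrix is singular or nearly so.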
Source of outlyingness
- Flagged 1% of points in the training set, and compared test distances with the 99%-quantile of the training distances.
- Source of outlyingness:
  $\hat{e}_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2}(\hat{\mu}_\nu - y_i)}{\|\hat{\Sigma}_\nu^{-1/2}(\hat{\mu}_\nu - y_i)\|}$
Analysis of flagged points
- Agricultural specialists analyzed the identified points by suspicious variables.
- C factor: almost all points were considered suspicious.
  - Data entry errors.
  - Invalid entries: C factor = 1 for hayland, pastureland or CRP.
  - Unusual levels or trends in relation to land use, e.g.,
    (0.013, 0.13, 0.013, 0.013, 0.013) or (0.011, 0.06, 0.11, 0.003, 0.003).
Analysis of flagged points
- P factor: all points are candidates for review because of the change over time, e.g., (1.0, 1.0, 1.0, 0.6, 1.0).
- Slope length: all points were flagged because of the level, not change over time.
- Land use: most points were flagged because of a change in the type of hayland or pastureland over time; not a major concern to NRCS reviewers.
Nondifferentiable survey estimators
- The sample distance distribution $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$ is a nondifferentiable function of the estimated location parameter.
- A general class of survey estimators:
  $\hat{T}(\hat{\lambda}) = \frac{1}{\hat{N}} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda})$,
  with corresponding population quantity
  $T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^N h(y_i; \lambda_N)$,
  where $h$ is not necessarily differentiable.
- A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.
Asymptotics
- We provide a set of sufficient conditions on the limiting function and a number of population quantities under which
  $n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \mid F \xrightarrow{d} N(0, 1)$,
  where
  $V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]' \right) V(\bar{z}_\pi) \left( 1, [\zeta(\lambda_N)]' \right)'$.
- The extra variance due to estimating the unknown parameter may or may not be negligible, depending on the derivative.
- Propose a kernel estimator for the unknown derivative.
Estimating a distribution function using auxiliary information
- Ratio model: $y_i = R x_i + \epsilon_i$, $\epsilon_i \sim ID(0, x_i \sigma^2)$.
- Use $\hat{R} x_i$ as a substitute for $y_i$, where $\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}$.
- Difference estimator:
  $\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_U I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}$
- The extra variance due to estimating the ratio is negligible (RKM, 1990).
Estimating a fraction below an estimated quantity
- Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
  $\hat{T}(\hat{q}) = \frac{1}{\hat{N}} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6\hat{q})$,
  with population quantity
  $T_N(q_N) = \frac{1}{N} \sum_{i=1}^N I(y_i \le 0.6 q_N)$.
- Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\frac{\partial F_Y(0.6\gamma)}{\partial \gamma}$.
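The poverty-fraction estimator can be sketched directly (a minimal illustration with Hajek weighting; the function names are mine):

```python
import numpy as np

def weighted_quantile(y, w, p):
    """Smallest t with estimated CDF >= p (the inf in the estimating equation)."""
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cdf, p)]

def poverty_fraction(y, pi):
    """Estimated fraction with income at or below 60% of the
    estimated median income."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    q_hat = weighted_quantile(y, w, 0.5)               # estimated median income
    return np.sum(w * (y <= 0.6 * q_hat)) / np.sum(w)
```

The plug-in threshold $0.6\hat{q}$ is itself random, which is exactly why the extra variance term involving the density at the threshold appears in the asymptotics.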
Nondifferentiable estimating equations
- The sample $p$-th quantile can be defined through estimating equations:
  $\hat{S}(t) = \frac{1}{\hat{N}} \sum_{i \in S} \frac{1}{\pi_i} I(y_i - t \le 0) - p$, with $\hat{\xi} = \inf\{t : \hat{S}(t) \ge 0\}$,
  and population analogue
  $S_N(t) = \frac{1}{N} \sum_{i=1}^N I(y_i - t \le 0) - p$, with $\xi_N = \inf\{t : S_N(t) \ge 0\}$.
- The usual practice is to linearize the estimating function, but this approach is not applicable due to nondifferentiability.
- We provide a set of sufficient conditions on the monotonicity and smoothness of $S_N(t)$ and its limit for the proof.
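The sample quantile defined by this estimating equation can be computed exactly, because $\hat{S}(t)$ is a step function that jumps only at the data points (a sketch; the function name is mine):

```python
import numpy as np

def design_quantile(y, pi, p):
    """xi_hat = inf{t : S_hat(t) >= 0}, with
    S_hat(t) = (1/N_hat) sum_i (1/pi_i) I(y_i - t <= 0) - p.
    S_hat is right-continuous and jumps only at data points,
    so the infimum is attained at one of them."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    N_hat = w.sum()
    for t in np.sort(y):                      # scan jump points in increasing order
        if np.sum(w * (y <= t)) / N_hat - p >= 0:
            return t
    return np.max(y)
```

The nondifferentiability of $\hat{S}(t)$ is visible here: the function is flat between data points, so the usual linearization of the estimating function has nothing to differentiate.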
Concluding remarks
- Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
- Application in a large-scale longitudinal survey.
- Theoretical extensions to nondifferentiable survey estimators.

Thank you