Statistics 202: Data Mining
Week 2
Based in part on slides from the textbook and slides of Susan Holmes
© Jonathan Taylor
October 3, 2012
Part I
Other datatypes, preprocessing
Other datatypes
Document data
You might start with a collection of n documents (e.g. all of Shakespeare's works in some digital format, or all of WikiLeaks' U.S. Embassy Cables).
This is not yet a data matrix . . .
Given p terms of interest, {Al Qaeda, Iran, Iraq, etc.}, one can form a term-document matrix filled with counts.
Other datatypes
Document data
Document ID | Al Qaeda | Iran | Iraq | ...
1           | 0        | 0    | 3    | ...
...         | ...      | ...  | ...  | ...
250000      | 2        | 1    | 1    | ...
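The counts in such a table can be computed directly from raw text. Below is a minimal sketch using scikit-learn's CountVectorizer; the two toy documents and the three-term vocabulary are invented for illustration, not taken from the cables data.

```python
# A minimal sketch: building a (document x term) count matrix with scikit-learn.
# The documents and the vocabulary below are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "report on Iraq and Iran relations",
    "cable discussing Al Qaeda activity in Iraq",
]
vocab = ["al qaeda", "iran", "iraq"]  # the p terms of interest

vectorizer = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(docs)    # sparse n x p matrix of term counts

print(vectorizer.get_feature_names_out())
print(X.toarray())
```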
Other datatypes
Time series
Imagine recording the minute-by-minute prices of all stocks in the S&P 500 for the last 200 days of trading.
The data can be represented by a 78000 × 500 matrix (390 trading minutes per day × 200 days).
BUT, there is definite structure across the rows of this matrix.
They are not unrelated "cases" as they might be in other applications.
Transaction data
Social media data
Graph data
Other data types
Graph data
Nodes on the graph might be Facebook users, or public pages.
Weights on the edges could be the number of messages sent in a prespecified period.
If you take weekly intervals, this leads to a sequence of graphs
G_i = communication over the i-th week.
How this graph changes over time is of obvious interest . . .
Even the structure of just one graph is of interest; we'll come back to this when we talk about spectral methods . . .
Data quality
Some issues to keep in mind
Is the data experimental or observational?
If observational, what do we know about the data-generating mechanism? For example, although the S&P 500 example can be represented as a data matrix, there is clearly structure across its rows.
General quality issues:
How much of the data is missing? Is missingness informative?
Is the data very noisy? Are there outliers?
Are there a large number of duplicates?
Preprocessing
General procedures
Aggregation: combining features (or cases) into a new feature (or case). Example: pooling county-level data to state-level data.
Discretization: breaking up a continuous variable (or a set of continuous variables) into an ordinal (or nominal) discrete variable; a small code sketch follows this list.
Transformation: a simple transformation of a feature (log or exp), or mapping to a new space (Fourier transform / power spectrum, wavelet transform).
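Below is a minimal sketch, in Python, of the three discretization strategies illustrated on the next few slides (fixed width, fixed quantile, clustering); the simulated data and the choice of 4 bins are assumptions for illustration.

```python
# Sketch: three ways to discretize a continuous variable into 4 bins.
# The data is simulated; in practice x would be a column of the data matrix.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# 1. Fixed width: bins of equal length over the range of x.
fixed_width = pd.cut(x, bins=4, labels=False)

# 2. Fixed quantile: bins containing (roughly) equal numbers of observations.
fixed_quantile = pd.qcut(x, q=4, labels=False)

# 3. Clustering: 1-d k-means, each observation labeled by its cluster.
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(x.reshape(-1, 1))
```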
A continuous variable that could be discretized
[Figure: x-axis approximately −4 to 4, y-axis 0.0 to 1.0]
Discretization by fixed width
[Figure: x-axis approximately −4 to 4, y-axis 0.0 to 1.0]
Discretization by fixed quartile
[Figure: x-axis approximately −4 to 4, y-axis 0.0 to 1.0]
Discretization by clustering
[Figure: x-axis approximately −4 to 4, y-axis 0.0 to 1.0]
Variable transformation: bacterial decay
Variable transformation: bacterial decay
BMW daily returns (fEcofin package)
ACF of BMW daily returns
Discrete wavelet transform of BMW daily returns
Part II
Dimension reduction, PCA & eigenanalysis
Dimension reduction
Combinations of features
Given a data matrix X (n × p) with p fairly large, it can be difficult to visualize structure.
It is often useful to look at linear combinations of the features.
Each β ∈ R^p determines a linear rule
f_β(x) = x^T β.
Evaluating this on each row X_i of X yields a vector
(f_β(X_i))_{1≤i≤n} = Xβ.
Dimension reduction
Dimension reduction
By choosing β appropriately, we may find "interesting" new features.
Suppose we take k (much smaller than p) such vectors β, which we write as a matrix B (p × k).
The new data matrix XB has fewer columns (k instead of p) than X.
This is dimension reduction . . .
Dimension reduction
Principal Components
This method of dimension reduction seeks features that maximize sample variance . . .
Specifically, for β ∈ R^p, define the sample variance of V_β = Xβ ∈ R^n:
Var(V_β) = (1/(n−1)) Σ_{i=1}^n (V_{β,i} − V̄_β)²,   where   V̄_β = (1/n) Σ_{i=1}^n V_{β,i}.
Note that Var(V_{Cβ}) = C² · Var(V_β), so we usually standardize ‖β‖₂ = √(Σ_{i=1}^p β_i²) = 1.
Dimension reduction
Principal Components
Define the n × n matrix
H = I_{n×n} − (1/n) 1 1^T,   where 1 ∈ R^n is the vector of all ones.
This matrix removes means:
(Hv)_i = v_i − v̄.
It is also a projection matrix:
H^T = H,   H² = H.
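A quick numerical check of these properties, as a sketch with numpy (n and v are arbitrary):

```python
# Sketch: verify that H = I - (1/n) 1 1^T removes means and is a projection.
import numpy as np

n = 6
v = np.random.default_rng(1).normal(size=n)

ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n

print(np.allclose(H @ v, v - v.mean()))   # (Hv)_i = v_i - vbar
print(np.allclose(H, H.T))                # symmetric: H^T = H
print(np.allclose(H @ H, H))              # idempotent: H^2 = H
```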
Dimension reduction
Principal Components with Matrices
With this matrix,
Var(V_β) = (1/(n−1)) β^T X^T H X β.
So, maximizing the sample variance subject to ‖β‖₂ = 1 is
maximize_{‖β‖₂=1}  β^T (X^T H X) β.
This boils down to an eigenvalue problem . . .
Dimension reduction
Eigenanalysis
The matrix X^T H X is symmetric, so it can be written as
X^T H X = V D V^T
where D (k × k) = diag(d_1, . . . , d_k), k = rank(X^T H X), and V (p × k) has orthonormal columns, i.e. V^T V = I_{k×k}.
We always have d_i ≥ 0, and we order d_1 ≥ d_2 ≥ . . . ≥ d_k.
Dimension reduction
Eigenanalysis & PCA
Suppose now that β = Σ_{j=1}^k a_j v_j, with v_j the columns of V. Then
‖β‖₂ = √(Σ_{j=1}^k a_j²)
β^T (X^T H X) β = Σ_{j=1}^k a_j² d_j.
Choosing a_1 = 1 and a_j = 0 for j ≥ 2 maximizes this quantity subject to ‖β‖₂ = 1.
Dimension reduction
Eigenanalysis & PCA
Therefore, β_1 = v_1, the first column of V, solves
maximize_{‖β‖₂=1}  β^T (X^T H X) β.
This yields the scores H X β_1 ∈ R^n.
These are the 1st principal component scores.
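A sketch of this computation with numpy, cross-checked against scikit-learn's PCA; the data matrix is simulated for illustration.

```python
# Sketch: 1st principal component scores via an eigenanalysis of X^T H X.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated features

ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n

d, V = np.linalg.eigh(X.T @ H @ X)    # eigenvalues in ascending order
beta1 = V[:, -1]                      # eigenvector with the largest eigenvalue
scores1 = H @ X @ beta1               # 1st principal component scores

# Cross-check against scikit-learn (the sign of a component is arbitrary).
sk_scores1 = PCA(n_components=1).fit_transform(X).ravel()
print(np.allclose(np.abs(scores1), np.abs(sk_scores1)))
```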
Dimension reduction
Higher order components
Having found the direction with "maximal sample variance", we might look for the "second most variable" direction by solving
maximize_{‖β‖₂=1, β^T v_1 = 0}  β^T (X^T H X) β.
Note we restricted our search so we would not just recover v_1 again.
It is not hard to see that if β = Σ_{j=1}^k a_j v_j, this is solved by taking a_2 = 1 and a_j = 0 for j ≠ 2.
Dimension reduction
Higher order components
In matrix terms, all the principal component scores are
(H X V)   (an n × k matrix)
and the loadings are the columns of V.
This information can be summarized in a biplot.
The loadings describe how each feature contributes to each principal component score.
Olympic data
Olympic data: screeplot
Dimension reduction
The importance of scale
The PCA scores are not invariant to scaling of the features: for Q (p × p) diagonal, the PCA scores of XQ are not the same as those of X.
It is common to convert all variables to the same scale before applying PCA.
Define the scalings to be the sample standard deviations of each feature. In matrix form, let
S² = (1/(n−1)) diag(X^T H X).
Define X̃ = H X S^{−1}. The normalized PCA loadings are given by an eigenanalysis of X̃^T X̃.
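A sketch of this normalized (correlation-based) PCA with numpy; the simulated features are given wildly different scales to make the point.

```python
# Sketch: normalized PCA by standardizing the columns before the eigenanalysis.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 100.0, 0.1])  # different scales

Xc = X - X.mean(axis=0)              # HX: remove the column means
Xt = Xc / X.std(axis=0, ddof=1)      # X~ = H X S^{-1}: divide by sample std devs

d, V = np.linalg.eigh(Xt.T @ Xt)     # loadings = eigenvectors of X~^T X~
loadings = V[:, ::-1]                # reorder so the largest eigenvalue comes first
scores = Xt @ loadings               # normalized PCA scores
```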
Olympic data
Olympic data: screeplot
Dimension reduction
PCA and the SVD
The singular value decomposition of the centered matrix tells us we can write
H X = U Δ V^T
with Δ (k × k) = diag(δ_1, . . . , δ_k), δ_j ≥ 0, k = rank(H X), and U^T U = V^T V = I_{k×k}.
Recall that the scores were
H X V = (U Δ V^T) V = U Δ.
Also,
(H X)^T (H X) = X^T H X = V Δ² V^T,
so D = Δ².
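A numerical sketch of the SVD route (numpy, simulated data), checking that the squared singular values match the eigenvalues of X^T H X and that the scores agree up to sign.

```python
# Sketch: PCA via the SVD of the centered matrix HX.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                    # HX

U, delta, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * delta                     # U Delta = (HX) V

d, V = np.linalg.eigh(Xc.T @ Xc)           # X^T H X, eigenvalues ascending
print(np.allclose(np.sort(delta ** 2), d)) # D = Delta^2
scores_eig = Xc @ V[:, ::-1]               # (HX) V, largest eigenvalue first
print(np.allclose(np.abs(scores_svd), np.abs(scores_eig)))
```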
Dimension reduction
PCA and the SVD
Recall X̃ = H X S^{−1}.
We saw that the normalized PCA loadings are given by an eigenanalysis of X̃^T X̃.
It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of X̃ X̃^T, because
(X̃ V)   (an n × k matrix)   = U Δ,
where U contains the eigenvectors of
X̃ X̃^T = U Δ² U^T.
Dimension reduction
Another characterization of SVD
Given a data matrix X (or its centered and scaled version X̃) we might try solving
minimize_{Z : rank(Z) = k}  ‖X − Z‖²_F
where F stands for the Frobenius norm on matrices:
‖A‖²_F = Σ_{i=1}^n Σ_{j=1}^p A_{ij}².
Dimension reduction
Another characterization of SVD
It can be proven that the minimizer is
Z = S_{n×k} (V^T)_{k×p}
where S_{n×k} is the matrix of the first k PCA scores and the columns of V are the first k PCA loadings.
This approximation is related to the screeplot: the height of each bar describes the additional drop in Frobenius norm as the rank of the approximation increases.
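A sketch of this best rank-k approximation via the truncated SVD (numpy, simulated data); the identity checked in the last line is what the screeplot bars reflect.

```python
# Sketch: best rank-k approximation of X in Frobenius norm via the truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
k = 2

U, delta, Vt = np.linalg.svd(X, full_matrices=False)
Z = (U[:, :k] * delta[:k]) @ Vt[:k, :]     # scores (U Delta) times loadings (V^T)

# The squared Frobenius error is the sum of the discarded squared singular values.
print(np.allclose(np.linalg.norm(X - Z, "fro") ** 2, np.sum(delta[k:] ** 2)))
```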
Olympic data: screeplot
Dimension reduction
Other types of dimension reduction
Instead of maximizing sample variance, we might try maximizing some other quantity . . .
Independent Component Analysis tries to maximize the "non-Gaussianity" of V_β. In practice, it uses skewness, kurtosis or other moments to quantify "non-Gaussianity."
PCA and ICA are both unsupervised approaches.
Often, these are combined with supervised approaches into an algorithm like:
Feature creation: build some set of features using PCA.
Validate: use the derived features to see if they are helpful in the supervised problem.
Olympic data: 1st component vs. total score
Part III
Distances and similarities
Distances and similarities
Similarities
Start with X̃, which we assume is centered and standardized.
The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.
The first 2 (or any 2) PCA scores yield an n × 2 matrix that can be visualized as a scatter plot.
Similarly, the first 2 (or any 2) PCA loadings yield a p × 2 matrix that can be visualized as a scatter plot.
Olympic data
Distances and similarities
Similarities
The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.
The visualization of the cases based on the PCA scores was determined by the eigenvectors of X̃ X̃^T, an n × n matrix.
To see this, remember that
X̃ = U Δ V^T,
so
X̃ X̃^T = U Δ² U^T
X̃^T X̃ = V Δ² V^T.
Distances and similarities
Similarities
The matrix X̃ X̃^T is a measure of similarity between cases.
The matrix X̃^T X̃ is a measure of similarity between features.
Structure in the two similarity matrices yields insight into the set of cases, or the set of features . . .
Distances and similarities
Distances
Distances are inversely related to similarities.
If A and B are similar, then d(A, B) should be small, i.e. they should be near each other.
If A and B are distant, then they should not be similar.
For a data matrix, there is a natural distance between cases:
d(X_i, X_k)² = ‖X_i − X_k‖₂²
             = Σ_{j=1}^p (X_{ij} − X_{kj})²
             = (X X^T)_{ii} − 2 (X X^T)_{ik} + (X X^T)_{kk}.
Distances and similarities
Distances
This suggests a natural transformation between a similarity matrix S and a distance matrix D:
D_{ik} = (S_{ii} − 2 S_{ik} + S_{kk})^{1/2}.
The reverse transformation is not so obvious. Some suggestions from your book:
S_{ik} = −D_{ik},   S_{ik} = e^{−D_{ik}},   or   S_{ik} = 1 / (1 + D_{ik}).
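A small sketch of these conversions in numpy, with the Gram matrix X X^T playing the role of the similarity matrix S:

```python
# Sketch: distances from a similarity (Gram) matrix, and similarities from distances.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

S = X @ X.T                                          # similarity between cases
diag = np.diag(S)
D = np.sqrt(diag[:, None] - 2 * S + diag[None, :])   # D_ik = (S_ii - 2 S_ik + S_kk)^{1/2}
print(np.allclose(D, cdist(X, X)))                   # matches the Euclidean distance

# Three possible reverse transformations (the book's suggestions):
S_neg = -D
S_exp = np.exp(-D)
S_inv = 1.0 / (1.0 + D)
```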
Distances and similarities
Distances
A distance (or a metric) on a set S is a function d : S × S → [0, +∞) that satisfies
d(x, x) = 0, and d(x, y) = 0 ⟺ x = y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y).
If instead d(x, y) = 0 is allowed for some x ≠ y, then d is a pseudo-metric.
Distances and similarities
Similarities
A similarity on a set S is a function s : S × S → R and should satisfy
s(x, x) ≥ s(x, y) for all x ≠ y
s(x, y) = s(y, x).
By adding a constant, we can often assume that s(x, y) ≥ 0.
Distances and similarities
Examples: nominal data
The simplest example for nominal data is just the discrete metric
d(x, y) = 0 if x = y, and 1 otherwise.
The corresponding similarity would be
s(x, y) = 1 if x = y, and 0 otherwise, i.e. s(x, y) = 1 − d(x, y).
Distances and similarities
Examples: ordinal data
If S is ordered, we can think of S as (or identify S with) a subset of the non-negative integers.
If |S| = m, then a natural distance is
d(x, y) = |x − y| / (m − 1) ≤ 1.
The corresponding similarity would be
s(x, y) = 1 − d(x, y).
Distances and similarities
Examples: vectors of continuous data
If S = R^k there are lots of distances determined by norms.
The Minkowski p or ℓ_p norm, for p ≥ 1:
d(x, y) = ‖x − y‖_p = (Σ_{i=1}^k |x_i − y_i|^p)^{1/p}.
Examples:
p = 2: the usual Euclidean distance, ℓ_2
p = 1: d(x, y) = Σ_{i=1}^k |x_i − y_i|, the "taxicab distance", ℓ_1
p = ∞: d(x, y) = max_{1≤i≤k} |x_i − y_i|, the sup norm, ℓ_∞
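A sketch computing these with scipy; the two vectors are arbitrary examples.

```python
# Sketch: Minkowski / l_p distances for a few values of p.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.0, 1.5, 3.0])

d2 = distance.minkowski(x, y, p=2)    # Euclidean distance, l_2
d1 = distance.minkowski(x, y, p=1)    # taxicab distance, l_1
dinf = distance.chebyshev(x, y)       # sup norm, l_infinity
print(d2, d1, dinf)
```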
Distances and similarities
Examples: vectors of continuous data
If 0 < p < 1 this no longer defines a norm; however, the quantity Σ_{i=1}^k |x_i − y_i|^p (without the 1/p power) still defines a metric.
The preceding transformations can be used to construct similarities.
Examples: binary vectors
If S = {0, 1}^k ⊂ R^k, the vectors can be thought of as vectors of bits.
The `1 norm counts the number of mismatched bits.
This is known as Hamming distance.
Distances and similarities
Example: Mahalanobis distance
Given Σ (k × k) that is positive definite, we define the Mahalanobis distance on R^k by
d_Σ(x, y) = ((x − y)^T Σ^{−1} (x − y))^{1/2}.
This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes.
If Σ is only non-negative definite, then we can replace Σ^{−1} with Σ†, its pseudo-inverse. This yields a pseudo-metric because it fails the test
d(x, y) = 0 ⟺ x = y.
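A sketch with scipy; here Σ is estimated from a simulated sample, purely for illustration.

```python
# Sketch: Mahalanobis distance with an estimated covariance matrix.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated sample
Sigma = np.cov(sample, rowvar=False)                          # k x k covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

x, y = sample[0], sample[1]
d = mahalanobis(x, y, Sigma_inv)      # ((x - y)^T Sigma^{-1} (x - y))^{1/2}
print(d)
```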
Distances and similarities
Example: similarities for binary vectors
Given binary vectors x, y ∈ {0, 1}^k we can summarize their agreement in a 2 × 2 table:

        x = 0   x = 1   total
y = 0   f00     f01     f0·
y = 1   f10     f11     f1·
total   f·0     f·1     k
Distances and similarities
Example: similarities for binary vectors
We define the simple matching coefficient (SMC) similarity by
SMC(x, y) = (f00 + f11) / (f00 + f01 + f10 + f11) = (f00 + f11) / k = 1 − ‖x − y‖₁ / k.
The Jaccard coefficient ignores entries where x_i = y_i = 0:
J(x, y) = f11 / (f01 + f10 + f11).
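A sketch computing both coefficients for two made-up binary vectors:

```python
# Sketch: simple matching coefficient and Jaccard coefficient for binary vectors.
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
k = len(x)

matches = np.sum(x == y)                  # f00 + f11
smc = matches / k                         # = 1 - ||x - y||_1 / k
f11 = np.sum((x == 1) & (y == 1))         # both bits on
f01_f10_f11 = np.sum((x == 1) | (y == 1)) # at least one bit on
jaccard = f11 / f01_f10_f11

print(smc, jaccard)
```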
Distances and similarities
Example: cosine similarity & correlation
Given vectors x, y ∈ R^k, the cosine similarity is defined as
−1 ≤ cos(x, y) = ⟨x, y⟩ / (‖x‖₂ ‖y‖₂) ≤ 1.
The correlation between two vectors is defined as
−1 ≤ cor(x, y) = ⟨x − x̄·1, y − ȳ·1⟩ / (‖x − x̄·1‖₂ ‖y − ȳ·1‖₂) ≤ 1.
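A sketch showing that the correlation is just the cosine similarity of the mean-centered vectors (numpy, simulated vectors):

```python
# Sketch: cosine similarity, and correlation as the cosine of centered vectors.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 0.5 * x + rng.normal(size=20)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

cos_sim = cosine(x, y)
corr = cosine(x - x.mean(), y - y.mean())
print(cos_sim, corr)
print(np.allclose(corr, np.corrcoef(x, y)[0, 1]))  # matches numpy's correlation
```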
Distances and similarities
Example: correlation
An alternative, perhaps more familiar definition:
cor(x, y) = S_{xy} / (S_x S_y)
S_{xy} = (1/(k−1)) Σ_{i=1}^k (x_i − x̄)(y_i − ȳ)
S_x² = (1/(k−1)) Σ_{i=1}^k (x_i − x̄)²
S_y² = (1/(k−1)) Σ_{i=1}^k (y_i − ȳ)².
Distances and similarities
Correlation & PCA
The matrix (1/(n−1)) X̃^T X̃ was actually the matrix of pair-wise correlations of the features. Why? How?
(1/(n−1)) (X̃^T X̃)_{ij} = (1/(n−1)) (S^{−1} X^T H X S^{−1})_{ij}.
The diagonal entries of S² are the sample variances of each feature.
The inner matrix multiplication computes the pair-wise dot-products of the columns of H X:
(X^T H X)_{ij} = Σ_{k=1}^n (X_{ki} − X̄_i)(X_{kj} − X̄_j).
High positive correlation
High negative correlation
No correlation
Small positive correlation
Small negative correlation
Distances and similarities
Combining similarities
In a given data set, each case may have many attributes or features.
Example: see the health data set for HW 1.
To compute similarities of cases, we must pool similarities across features.
In a data set with M different features, we write x_i = (x_{i1}, . . . , x_{iM}), with each x_{im} ∈ S_m.
Distances and similarities
Combining similarities
Given similarities s_m on each S_m, we can define an overall similarity between case x_i and case x_j by
s(x_i, x_j) = Σ_{m=1}^M w_m s_m(x_{im}, x_{jm}),
with optional weights w_m for each feature.
Your book modifies this to deal with "asymmetric attributes", i.e. attributes for which the Jaccard similarity might be used.
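A sketch of such a weighted combination; the feature types, per-feature similarity functions, and weights below are hypothetical examples, not taken from the homework data.

```python
# Sketch: pooling per-feature similarities into one overall similarity.
# Feature names, similarity choices, and weights are made-up examples.
import numpy as np

def sim_nominal(a, b):            # discrete-metric similarity
    return 1.0 if a == b else 0.0

def sim_ordinal(a, b, m):         # 1 - |a - b| / (m - 1) on an ordered scale with m levels
    return 1.0 - abs(a - b) / (m - 1)

def sim_continuous(a, b):         # one of the book's distance-to-similarity transforms
    return 1.0 / (1.0 + abs(a - b))

def combined_similarity(xi, xj, weights):
    sims = [
        sim_nominal(xi["color"], xj["color"]),
        sim_ordinal(xi["rating"], xj["rating"], m=5),
        sim_continuous(xi["age"], xj["age"]),
    ]
    return float(np.dot(weights, sims))

xi = {"color": "red", "rating": 4, "age": 31.0}
xj = {"color": "blue", "rating": 5, "age": 29.0}
print(combined_similarity(xi, xj, weights=[0.2, 0.3, 0.5]))
```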