Neural Computation 0368-4149-01 Prof. Nathan Intrator Tuesday 16:00-19:00 Schreiber 007 Office...
Date posted: 19-Dec-2015
![Page 1: Neural Computation 0368-4149-01 Prof. Nathan Intrator Tuesday 16:00-19:00 Schreiber 007 Office hours: Wed 4-5 nin@tau.ac.il.](https://reader035.fdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a286/html5/thumbnails/1.jpg)
Neural Computation 0368-4149-01
Prof. Nathan Intrator
Tuesday 16:00-19:00 Schreiber 007
Office hours: Wed 4-5 nin@tau.ac.il
Outline
• Goals for neural learning - Unsupervised
• Goals for statistical/computational learning
  – PCA
  – ICA
  – Exploratory Projection Pursuit
  – Search for non-Gaussian distributions
• Practical implementations
Statistical Approach to Unsupervised Learning
• Understanding the nature of data variability
• Modeling the data (sometimes very flexible model)
• Understanding the nature of the noise
• Applying prior knowledge
• Extracting features based on:
  – Prior knowledge
  – Class prediction
  – Unsupervised learning
Principal Component Analysis.
Włodzisław Duch
SCE, NTU, Singapore
http://www.ntu.edu.sg/home/aswduch
Linear transformations – example
2D vectors X in a unit circle with mean (1,1); Y = A*X, A = 2x2 matrix
The shape is elongated, rotated and the mean is shifted.
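$$
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} =
\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
$$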
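The effect of this transformation can be checked numerically. The following is a minimal sketch, assuming NumPy, the matrix A = [[2, 1], [1, 1]] of this example, and points drawn uniformly inside the unit circle; the sample size and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 2D points uniformly inside a unit circle, then shift the mean to (1, 1).
n = 2000
radius = np.sqrt(rng.uniform(0.0, 1.0, n))   # sqrt gives uniform density over the disc
angle = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.vstack([radius * np.cos(angle), radius * np.sin(angle)]) + 1.0  # shape (2, n)

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
Y = A @ X                                    # each column is one transformed vector

mean_X = X.mean(axis=1)
mean_Y = Y.mean(axis=1)                      # equals A @ mean_X, close to (3, 2)
```

Plotting the columns of X and Y as scatter plots shows the circle turned into a rotated ellipse with a shifted center.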
Invariant distances
Euclidean distance is not invariant to general linear transformations $Y = AX$:
$$
\|Y^{(1)} - Y^{(2)}\|^2 = (Y^{(1)} - Y^{(2)})^T (Y^{(1)} - Y^{(2)}) = (X^{(1)} - X^{(2)})^T A^T A \, (X^{(1)} - X^{(2)})
$$
The distance is invariant only for orthonormal matrices, $A^T A = I$, which make rigid rotations without stretching or shrinking distances.
Idea: standardize the data in some way to create invariant distances.
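A small numerical check of this statement, as a sketch assuming NumPy; the two points, the matrix A and the rotation angle are arbitrary illustrative choices.

```python
import numpy as np

x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                        # general linear transformation
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: R.T @ R = I

diff = x1 - x2
d_x = diff @ diff                                 # squared distance before transforming
d_A = np.sum((A @ x1 - A @ x2) ** 2)              # changed by A
d_R = np.sum((R @ x1 - R @ x2) ** 2)              # preserved by the rotation
d_A_formula = diff @ A.T @ A @ diff               # same value as d_A, via A^T A
```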
Data standardization
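For each vector $X^{(j)T} = (X_1^{(j)}, \ldots, X_d^{(j)})$, $j = 1 \ldots n$, calculate the mean and std of each component: n – number of vectors, d – their dimension.
$$
\bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^{(j)}; \qquad \bar{X} = \frac{1}{n} \sum_{j=1}^{n} X^{(j)}
$$
$\bar{X}$ is the vector of mean feature values: averages over rows of the d x n data matrix.
$$
\begin{pmatrix}
X_1^{(1)} & X_1^{(2)} & \cdots & X_1^{(n)} \\
X_2^{(1)} & X_2^{(2)} & \cdots & X_2^{(n)} \\
\vdots & \vdots & & \vdots \\
X_d^{(1)} & X_d^{(2)} & \cdots & X_d^{(n)}
\end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_d \end{pmatrix}
$$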
Standard deviation
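Calculate standard deviation:
$$
\bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^{(j)}; \qquad
\sigma_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left( X_i^{(j)} - \bar{X}_i \right)^2
$$
$\bar{X}$ is the vector of mean feature values.
Variance = square of standard deviation (std): the sum of squared deviations from the mean value, divided by n − 1.
Transform X => Z, standardized data vectors:
$$
Z_i^{(j)} = \left( X_i^{(j)} - \bar{X}_i \right) / \sigma_i
$$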
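The standardization step can be sketched as follows, assuming NumPy and a d x n data matrix with one row per feature; the function name `standardize` and the synthetic data are illustrative only.

```python
import numpy as np

def standardize(X):
    """Standardize a d x n data matrix row by row (one row per feature)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)   # ddof=1: divide by n - 1
    return (X - mean) / std

rng = np.random.default_rng(0)
# Three features with very different means and scales (made-up values).
X = rng.normal(loc=[[5.0], [-2.0], [0.5]],
               scale=[[3.0], [0.1], [10.0]],
               size=(3, 500))
Z = standardize(X)   # every row now has zero mean and unit variance
```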
Std data
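Std data: zero mean and unit variance.
$$
\bar{Z}_i = \frac{1}{n} \sum_{j=1}^{n} Z_i^{(j)} = \frac{1}{n} \sum_{j=1}^{n} \frac{X_i^{(j)} - \bar{X}_i}{\sigma_i} = 0
$$
$$
\sigma_{Z,i}^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left( Z_i^{(j)} - \bar{Z}_i \right)^2 = \frac{1}{n-1} \sum_{j=1}^{n} \frac{\left( X_i^{(j)} - \bar{X}_i \right)^2}{\sigma_i^2} = 1
$$
Standardize data after making data transformation.
Effect: data is invariant to scaling only (diagonal transformation).
Distances are invariant, data distribution is the same.
How to make data invariant to any linear transformations?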
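A short sketch of the point above, assuming NumPy: standardized data are unchanged by a diagonal rescaling of the features, but not by a general linear transformation that mixes them. The matrices D and A are illustrative choices.

```python
import numpy as np

def standardize(X):
    """Zero mean and unit variance for every row of a d x n matrix."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 300))

D = np.diag([10.0, 0.01])            # pure rescaling of each feature
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])           # mixes the two features

same_after_scaling = np.allclose(standardize(D @ X), standardize(X))
same_after_mixing = np.allclose(standardize(A @ X), standardize(X))
```

Here `same_after_scaling` comes out true and `same_after_mixing` false, which is why a stronger normalization is needed for general linear transformations.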
Data standardization example
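For our example Y=AX, assuming X means = 1 and variances = 1 (uncorrelated components):
Transformation:
$$
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} =
\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
$$
Vector of mean feature values:
$$
\bar{X} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}; \qquad
\bar{Y} = \begin{pmatrix} 3 \\ 2 \end{pmatrix} =
\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix}
$$
Variance (check it!):
$$
\sigma_X = \begin{pmatrix} 1 \\ 1 \end{pmatrix}; \qquad
\sigma_Y = \sqrt{\mathrm{Diag}\left( A A^T \right)} = \begin{pmatrix} \sqrt{5} \\ \sqrt{2} \end{pmatrix}
$$
$$
\|Y^{(1)} - Y^{(2)}\|^2 = (X^{(1)} - X^{(2)})^T A^T A \, (X^{(1)} - X^{(2)})
$$
How to make this invariant?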
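The "check it!" step can be done in a few lines. This is a sketch assuming NumPy and assuming that the components of X are uncorrelated with unit variance, so that the covariance of X is the identity matrix.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
mean_X = np.array([1.0, 1.0])
C_X = np.eye(2)                       # assumption: unit variances, no correlation

mean_Y = A @ mean_X                   # (3, 2)
C_Y = A @ C_X @ A.T                   # [[5, 3], [3, 2]]
sigma_Y = np.sqrt(np.diag(C_Y))       # (sqrt(5), sqrt(2))
```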
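Covariance matrix
Variance (spread around mean value) + correlation between features.
$$
C_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} \left( X_i^{(k)} - \bar{X}_i \right) \left( X_j^{(k)} - \bar{X}_j \right); \qquad i, j = 1 \ldots d
$$
$$
C_X = \frac{1}{n-1} \sum_{k=1}^{n} \left( X^{(k)} - \bar{X} \right) \left( X^{(k)} - \bar{X} \right)^T = \frac{1}{n-1} X X^T
$$
where X is the d x n matrix of vectors shifted to their means; $C_X$ is d x d.
Covariance matrix is symmetric, $C_{ij} = C_{ji}$, and positive semi-definite (positive definite when the centered data span all d dimensions).
Diagonal elements are variances (square of std), $\sigma_i^2 = C_{ii}$.
Correlation coefficient: $r_{ij} = C_{ij} / (\sigma_i \sigma_j) \in [-1, +1]$.
Spherical distribution of data has $C = I$ (unit matrix).
Elongated ellipsoids: large off-diagonal elements, strong correlations between features.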
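The covariance and correlation formulas can be sketched as below, assuming NumPy and a d x n data matrix; the synthetic data and the injected correlation are illustrative. The result is compared against NumPy's own `np.cov` and `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 1000
X = rng.normal(size=(d, n))
X[1] += 0.8 * X[0]                        # introduce correlation between features 0 and 1

Xc = X - X.mean(axis=1, keepdims=True)    # shift vectors to their means
C = Xc @ Xc.T / (n - 1)                   # d x d covariance matrix

sigma = np.sqrt(np.diag(C))               # standard deviations from the diagonal
R = C / np.outer(sigma, sigma)            # correlation coefficients r_ij
```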
Mahalanobis distance
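Linear combinations of features lead to rotations and scaling of data:
$$
Y = AX; \qquad \bar{Y} = A \bar{X}; \qquad C_Y = A C_X A^T
$$
Mahalanobis distance:
$$
\|X\|^2_{C_X^{-1}} = X^T C_X^{-1} X
$$
is invariant to linear transformations (for invertible A):
$$
\|Y^{(1)} - Y^{(2)}\|^2_{C_Y^{-1}}
= (Y^{(1)} - Y^{(2)})^T C_Y^{-1} (Y^{(1)} - Y^{(2)})
= (X^{(1)} - X^{(2)})^T A^T \left( A^T \right)^{-1} C_X^{-1} A^{-1} A \, (X^{(1)} - X^{(2)})
= \|X^{(1)} - X^{(2)}\|^2_{C_X^{-1}}
$$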
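The invariance can be verified numerically. The sketch below assumes NumPy, an invertible matrix A, and covariance matrices estimated from the same sample before and after the transformation; `mahalanobis_sq` is a hypothetical helper name.

```python
import numpy as np

def mahalanobis_sq(u, v, C):
    """Squared Mahalanobis distance between u and v for covariance matrix C."""
    diff = u - v
    return diff @ np.linalg.solve(C, diff)   # avoids forming the inverse of C explicitly

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 500))
X[1] += 0.5 * X[0]
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                   # invertible (determinant 1)
Y = A @ X

C_X = np.cov(X)
C_Y = np.cov(Y)                              # equals A @ C_X @ A.T

d_X = mahalanobis_sq(X[:, 0], X[:, 1], C_X)  # distance between the first two vectors
d_Y = mahalanobis_sq(Y[:, 0], Y[:, 1], C_Y)  # same value after the transformation
```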
Principal components
How to avoid correlated features?
Correlations mean that the covariance matrix is non-diagonal!
Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate features.
$$
Y = Z^T X; \qquad C_X Z^{(i)} = \lambda_i Z^{(i)}; \qquad C_X Z = Z \Lambda
$$
$$
C_Y = Z^T C_X Z = Z^T Z \Lambda = \Lambda
$$
C – symmetric, positive semi-definite matrix, $X^T C X \ge 0$ (strictly > 0 for ||X|| > 0 when C has full rank);
its eigenvectors are orthonormal: $Z^{(i)T} Z^{(j)} = \delta_{ij}$;
its eigenvalues are all non-negative.
Z – matrix of orthonormal eigenvectors (because C is real and symmetric), transforms X into Y, with diagonal $C_Y$, i.e. decorrelated.
In matrix form, X, Y are d x n; Z, $C_X$, $C_Y$ are d x d.
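A minimal sketch of this diagonalization, assuming NumPy; `np.linalg.eigh` is used because the covariance matrix is symmetric, and the correlated test data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.array([[2.0, 1.0], [1.0, 1.0]]) @ rng.normal(size=(2, 1000))  # correlated features
Xc = X - X.mean(axis=1, keepdims=True)

C_X = np.cov(Xc)
eigvals, Z = np.linalg.eigh(C_X)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder so the largest variance comes first
eigvals, Z = eigvals[order], Z[:, order]

Y = Z.T @ Xc                          # transformed, decorrelated vectors
C_Y = np.cov(Y)                       # diagonal, with the eigenvalues on the diagonal
```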
Matrix form
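Eigenproblem for the C matrix in matrix form: $C_X Z = Z \Lambda$
$$
\begin{pmatrix}
C_{11} & C_{12} & \cdots & C_{1d} \\
C_{21} & C_{22} & \cdots & C_{2d} \\
\vdots & \vdots & & \vdots \\
C_{d1} & C_{d2} & \cdots & C_{dd}
\end{pmatrix}
\begin{pmatrix}
Z_{11} & Z_{12} & \cdots & Z_{1d} \\
Z_{21} & Z_{22} & \cdots & Z_{2d} \\
\vdots & \vdots & & \vdots \\
Z_{d1} & Z_{d2} & \cdots & Z_{dd}
\end{pmatrix}
=
\begin{pmatrix}
Z_{11} & Z_{12} & \cdots & Z_{1d} \\
Z_{21} & Z_{22} & \cdots & Z_{2d} \\
\vdots & \vdots & & \vdots \\
Z_{d1} & Z_{d2} & \cdots & Z_{dd}
\end{pmatrix}
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_d
\end{pmatrix}
$$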
Principal components
PCA: old idea, K. Pearson (1901), H. Hotelling (1933)
Result: PCs are linear combinations of all features, providing new uncorrelated features, with diagonal covariance matrix = eigenvalues.
$$
Y = Z^T X; \qquad C_Y = Z^T C_X Z = \Lambda
$$
Y – principal components, or vectors X transformed using eigenvectors of $C_X$.
Covariance matrix of transformed vectors is diagonal => ellipsoidal distribution of data.
Small $\lambda_i$ means small variance: data change little in direction $Y_i$.
PCA minimizes C matrix reconstruction errors: $Z_i$ vectors for large $\lambda_i$ are sufficient to get
$$
C_X \approx Z \Lambda Z^T
$$
because vectors for small eigenvalues will have very small contribution to the covariance matrix.
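The covariance reconstruction from the leading eigenvectors can be illustrated as follows. This is a sketch assuming NumPy and synthetic data with two dominant directions; the scales and the helper `reconstruct` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 2000
scales = np.array([5.0, 3.0, 0.2, 0.1, 0.05])    # two dominant directions (made-up values)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthonormal basis
X = Q @ (scales[:, None] * rng.normal(size=(d, n)))

C = np.cov(X)
eigvals, Z = np.linalg.eigh(C)
eigvals, Z = eigvals[::-1], Z[:, ::-1]           # descending order

def reconstruct(k):
    """Approximate C using only the k eigenvectors with the largest eigenvalues."""
    return Z[:, :k] @ np.diag(eigvals[:k]) @ Z[:, :k].T

# Frobenius norm of the reconstruction error for k = 1 .. d
errors = [np.linalg.norm(C - reconstruct(k)) for k in range(1, d + 1)]
```

With these scales the error after k = 2 is already a small fraction of the norm of C, and it vanishes (up to rounding) for k = d.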
Two components for visualization
New coordinate system: axes ordered according to variance = size of the eigenvalue.
First k dimensions account for
$$
V_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}
$$
fraction of all variance (please note that $\lambda_i$ are variances); frequently 80-90% is sufficient for rough description.
Diagonalization methods: see Numerical Recipes, www.nr.com
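The fraction of variance covered by the first k dimensions can be computed directly from the eigenvalues. A small sketch, assuming NumPy; the eigenvalues below are hypothetical numbers chosen only for illustration, and `variance_fraction` is an illustrative helper name.

```python
import numpy as np

def variance_fraction(eigvals):
    """Cumulative fraction V_k of total variance for k = 1..d."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest eigenvalue first
    return np.cumsum(lam) / lam.sum()

# Hypothetical eigenvalues, chosen only for illustration.
V = variance_fraction([0.5, 6.0, 2.5, 1.0])   # 0.60, 0.85, 0.95, 1.00
k_90 = int(np.searchsorted(V, 0.90) + 1)      # smallest k with V_k >= 0.90
```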
PCA properties
PC Analysis (PCA) may be achieved by:
• transformation making covariance matrix diagonal
• projecting the data on a line for which the sum of squared distances from original points to projections is minimal.
• orthogonal transformation to new variables that have stationary variances
True covariance matrices are usually not known and must be estimated from data.
This works well on single-cluster data; more complex structure may require local PCA, separately for each cluster.
PCA is useful for: finding new, more informative, uncorrelated features;
reducing dimensionality: reject low variance features;
reconstructing covariance matrices from low-dim data.
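The equivalence of the first two characterizations in the list above (diagonalizing the covariance matrix, and projecting on the line with minimal squared distances) can be checked in 2D with a brute-force scan over line directions. This is a sketch assuming NumPy; the grid resolution and the test data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.array([[2.0, 1.0], [1.0, 1.0]]) @ rng.normal(size=(2, 500))
Xc = X - X.mean(axis=1, keepdims=True)

angles = np.linspace(0.0, np.pi, 1801)               # candidate line directions
W = np.vstack([np.cos(angles), np.sin(angles)])      # unit vectors as columns

proj = W.T @ Xc                                      # projections, one row per direction
variance = (proj ** 2).sum(axis=1)                   # spread along each line
residual = (Xc ** 2).sum() - variance                # squared distances to each line

best_by_variance = angles[np.argmax(variance)]
best_by_residual = angles[np.argmin(residual)]

eigvals, Z = np.linalg.eigh(np.cov(Xc))
z1 = Z[:, -1]                                        # eigenvector of the largest eigenvalue
pc_angle = np.arctan2(z1[1], z1[0]) % np.pi          # its direction, folded into [0, pi)
```

All three angles agree up to the grid resolution: the direction of largest variance is also the line with the smallest sum of squared distances, and it coincides with the leading eigenvector.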
PCA Wisconsin example
Wisconsin Breast Cancer data:
• Collected at the University of Wisconsin Hospitals, USA.
• 699 cases, 458 (65.5%) benign (red), 241 malignant (green).
• 9 features: quantized 1, 2 .. 10, cell properties, e.g.:
Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei,
Bland Chromatin, Normal Nucleoli, Mitoses.
2D scatterograms do not show any structure no matter which subspaces are taken!
Example cont.
PC gives useful information already in 2D.
Taking first PCA component of the standardized data:
If (Y1>0.41) then benign else malignant
18 errors / 699 cases, i.e. 97.4% accuracy
Transformed vectors are not standardized, std’s are below.
Eigenvalues converge slowly, but classes are separated well.
PCA disadvantages
Useful for dimensionality reduction but:
• Largest variance determines which components are used, but does not guarantee interesting viewpoint for clustering data.
• The meaning of features is lost when linear combinations are formed.
Analysis of coefficients in Z1 and other important eigenvectors may show which original features are given much weight.
PCA may be also done in an efficient way by performing singular value decomposition of the standardized data matrix.
PCA is also called Karhunen-Loève transformation.
Many variants of PCA are described in A. Webb, Statistical pattern recognition, J. Wiley 2002.
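The SVD route mentioned above can be compared with the eigen-decomposition route. A sketch assuming NumPy and a centered d x n data matrix (rows could additionally be divided by their std to work with standardized data); the random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 4, 300
X = rng.normal(size=(d, d)) @ rng.normal(size=(d, n))
Xc = X - X.mean(axis=1, keepdims=True)     # centered data matrix

# Route 1: eigenvalues of the covariance matrix, largest first.
C = np.cov(Xc)
eigvals = np.linalg.eigvalsh(C)[::-1]

# Route 2: SVD of the data matrix, Xc = U S V^T, so C = U (S^2 / (n - 1)) U^T.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S ** 2 / (n - 1)                # same values as eigvals
C_in_U_basis = U.T @ C @ U                 # diagonal: columns of U are eigenvectors of C
```

The SVD works on the data matrix itself and never forms the covariance matrix, which is usually the numerically safer choice.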
2 skewed distributions
PCA transformation for 2D data:
First component will be chosen along the largest variance line, both clusters will strongly overlap, no interesting structure will be visible.
In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power.
Discriminant coordinates should be used to reveal class structure.
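This situation is easy to reproduce with synthetic data. The sketch below assumes NumPy and two made-up elongated clusters that are wide along one axis and separated along the other; the `separation` helper (distance between class means in units of the pooled std) is a hypothetical measure used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
# Two elongated clusters: wide along the x axis, separated along the y axis.
spread = np.array([[5.0], [0.3]])
class_a = spread * rng.normal(size=(2, n)) + np.array([[0.0], [1.0]])
class_b = spread * rng.normal(size=(2, n)) + np.array([[0.0], [-1.0]])
X = np.hstack([class_a, class_b])
labels = np.array([0] * n + [1] * n)

Xc = X - X.mean(axis=1, keepdims=True)
eigvals, Z = np.linalg.eigh(np.cov(Xc))
Y = Z[:, ::-1].T @ Xc                  # row 0: first PC, row 1: second PC

def separation(y):
    """Distance between class means in units of the pooled std (hypothetical helper)."""
    a, b = y[labels == 0], y[labels == 1]
    pooled = np.sqrt(0.5 * (a.var(ddof=1) + b.var(ddof=1)))
    return abs(a.mean() - b.mean()) / pooled

sep_pc1, sep_pc2 = separation(Y[0]), separation(Y[1])
```

Here the first component follows the long axis of the clusters and barely separates them, while the second component separates them clearly.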
High Dimensional Data
Dimension Reduction
Feature Extraction
Visualisation
Classification
Analysis
Projection Pursuit
what: An automated procedure that seeks interesting low dimensional projections of a high dimensional cloud by numerically maximizing an objective function or projection index.
Huber, 1985
Projection Pursuit
why:
Curse of dimensionality
• Less Robustness
• worse mean squared error
• greater computational cost
• slower convergence to limiting distributions
• …
• Required number of labelled samples increases with dimensionality.
What is an interesting projection
In general: the projection that reveals more
information about the structure.
In pattern recognition:
a projection that maximises class separability in a low dimensional
subspace.
Projection Pursuit
Dimensional Reduction
Find lower-dimensional projections of a high-dimensional point cloud to facilitate classification.
Exploratory Projection Pursuit
Reduce the dimension of the problem to facilitate visualization.
Projection Pursuit
How many dimensions to use
• for visualization
• for classification/analysis
Which Projection Index to use
• measure of variation (Principal Components)
• departure from normality (negative entropy)
• class separability (distance, Bhattacharyya, Mahalanobis, ...)
• …
Projection Pursuit
Which optimization method to choose
We are trying to find the global optimum among local ones
• hill climbing methods (simulated annealing)
• regular optimization routines with random starting points.
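One possible way to combine a projection index with random starting points is sketched below. It assumes NumPy, uses the absolute excess kurtosis as a simple stand-in for a "departure from normality" index, and replaces a regular optimization routine with a crude random-perturbation hill climber; the function names, step size and synthetic data are all hypothetical choices, not a method prescribed by these slides.

```python
import numpy as np

def kurtosis_index(w, X):
    """Projection index: |excess kurtosis| of the data projected on direction w."""
    y = w @ X
    y = (y - y.mean()) / y.std()
    return abs(np.mean(y ** 4) - 3.0)          # close to 0 for a Gaussian projection

def pursue(X, n_starts=10, n_steps=200, step=0.2, seed=0):
    """Random-restart hill climbing over unit directions (illustrative only)."""
    rng = np.random.default_rng(seed)
    best_w, best_val = None, -np.inf
    for _ in range(n_starts):
        w = rng.normal(size=X.shape[0])        # random starting direction
        w /= np.linalg.norm(w)
        val = kurtosis_index(w, X)
        for _ in range(n_steps):
            cand = w + step * rng.normal(size=w.size)   # random perturbation
            cand /= np.linalg.norm(cand)
            cand_val = kurtosis_index(cand, X)
            if cand_val > val:                          # keep only improvements
                w, val = cand, cand_val
        if val > best_val:                     # best local optimum over all starts
            best_w, best_val = w, val
    return best_w, best_val

rng = np.random.default_rng(9)
n = 2000
# Hypothetical data: one bimodal (non-Gaussian) direction hidden among Gaussian ones.
X = rng.normal(size=(4, n))
X[2] = np.where(rng.random(n) < 0.5, -2.0, 2.0) + 0.3 * rng.normal(size=n)

w, value = pursue(X)                           # w should point mostly along feature 2
```

Restarting from several random directions is what guards against stopping at a poor local optimum; a gradient-based optimizer could replace the inner loop.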
Timetable for Dimensionality reduction
• Begin 16 April 1998
• Report on the state-of-the-art. 1 June 1998
• Begin software implementation 15 June 1998
• Prototype software presentation 1 November 1998