Machine Learning and Statistical Analysis
Transcript of Machine Learning and Statistical Analysis
Jong Youl Choi, Computer Science Department ([email protected])
Social Bookmarking: socialized tags, bookmarks
Principles of Machine Learning: Bayes’ theorem and maximum likelihood
Machine Learning Algorithms: clustering analysis, dimension reduction, classification
Parallel Computing: general parallel computing architecture, parallel algorithms
Definition: Algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, and information theory.
Algorithm types: unsupervised learning, supervised learning, reinforcement learning
Topics
Models
▪ Artificial Neural Network (ANN)
▪ Support Vector Machine (SVM)
Optimization
▪ Expectation-Maximization (EM)
▪ Deterministic Annealing (DA)
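Bayes’ theorem: posterior probability of $\theta_i$, given X
$P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\, P(\theta_i)}{\sum_j P(X \mid \theta_j)\, P(\theta_j)}$
▪ $\theta_i \in \Theta$ : parameter
▪ X : observations
▪ $P(\theta_i)$ : prior (or marginal) probability
▪ $P(X \mid \theta_i)$ : likelihood
Maximum Likelihood (ML)
▪ Used to find the most plausible $\theta_i \in \Theta$, given X
▪ Computing the maximum likelihood (ML) or log-likelihood is an optimization problem: $\theta_{ML} = \arg\max_{\theta \in \Theta} \log P(X \mid \theta)$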
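A minimal sketch of Bayes’ theorem and maximum likelihood over a small discrete parameter set. It assumes a hypothetical coin whose heads-probability θ is one of three candidate values; the function names, the candidates, the uniform prior, and the observations are all illustrative and not from the slides.

```python
import math

def likelihood(theta, observations):
    """P(X | theta) for independent coin flips (1 = heads, 0 = tails)."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in observations)

def posterior(thetas, priors, observations):
    """Bayes' theorem: P(theta_i | X) = P(X | theta_i) P(theta_i) / sum_j P(X | theta_j) P(theta_j)."""
    joint = [likelihood(t, observations) * p for t, p in zip(thetas, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def max_likelihood(thetas, observations):
    """Return the candidate theta with the highest likelihood P(X | theta)."""
    return max(thetas, key=lambda t: likelihood(t, observations))

thetas = [0.2, 0.5, 0.8]          # hypothetical candidate parameters
priors = [1 / 3, 1 / 3, 1 / 3]    # uniform prior (assumption)
flips = [1, 1, 0, 1, 1, 1, 0, 1]  # made-up observations: 6 heads, 2 tails

post = posterior(thetas, priors, flips)
theta_ml = max_likelihood(thetas, flips)
```

With a uniform prior the posterior ranks the candidates in the same order as the likelihood, so the ML estimate and the most probable posterior candidate coincide here.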
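Problem: Estimate hidden parameters ($\theta = \{\mu, \sigma\}$) from the given data extracted from k Gaussian distributions (Mitchell, 1997)
Gaussian distribution: $N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Maximum likelihood: $\theta_{ML} = \arg\max_{\theta} \sum_i \log P(x_i \mid \theta)$
With Gaussian (P = N): $\theta_{ML} = \arg\max_{\theta} \sum_i \log N(x_i \mid \mu, \sigma)$
Solve by either brute force or a numerical method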
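A rough sketch of the brute-force route for the simplest case, a single Gaussian rather than k of them. The grid ranges, the sample values, and the function names are assumptions made for illustration; the closed-form estimates are computed alongside only as a sanity check.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log N(x | mu, sigma) over all samples."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2.0 * math.pi)) - sq / (2.0 * sigma ** 2)

def brute_force_mle(data, mu_grid, sigma_grid):
    """Evaluate the log-likelihood on a grid and keep the best (mu, sigma)."""
    return max(
        ((mu, sigma) for mu in mu_grid for sigma in sigma_grid),
        key=lambda p: gaussian_log_likelihood(data, p[0], p[1]),
    )

data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]    # made-up samples
mu_grid = [4.0 + 0.05 * i for i in range(41)]      # 4.00 .. 6.00
sigma_grid = [0.05 + 0.05 * i for i in range(20)]  # 0.05 .. 1.00

mu_hat, sigma_hat = brute_force_mle(data, mu_grid, sigma_grid)

# Closed-form ML estimates for a single Gaussian, for comparison.
mu_closed = sum(data) / len(data)
sigma_closed = math.sqrt(sum((x - mu_closed) ** 2 for x in data) / len(data))
```

The grid answer can only be as accurate as the grid spacing, which is one reason a numerical optimizer or EM is preferred once several Gaussians are involved.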
Problems in ML estimation
▪ Observation X is often incomplete
▪ Latent (hidden) variable Z exists
▪ Hard to explore the whole parameter space
Expectation-Maximization algorithm
Objective: find the ML estimate over the latent distribution $P(Z \mid X, \theta)$
Steps
0. Init: choose a random $\theta_{old}$
1. E-step: compute the expectation under $P(Z \mid X, \theta_{old})$
2. M-step: find $\theta_{new}$ that maximizes the likelihood
3. Go to step 1 after updating $\theta_{old} \leftarrow \theta_{new}$
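The EM steps above can be sketched for a one-dimensional mixture of two Gaussians, where the latent variable Z is the component each sample came from. This is an illustrative sketch under assumptions: the data, the initial guess, the standard-deviation floor, and all names are made up, and the initial guess is fixed rather than random so the run is repeatable.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, mus, sigmas):
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)))
        for x in data
    )

def em_step(data, weights, mus, sigmas):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    k = len(mus)
    # E-step: responsibilities P(z = j | x, theta_old).
    resp = []
    for x in data:
        joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    # M-step: parameters that maximize the expected complete-data log-likelihood.
    new_w, new_mu, new_sigma = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_mu.append(mu)
        new_sigma.append(max(math.sqrt(var), 1e-3))  # floor guards against a collapsed component
    return new_w, new_mu, new_sigma

data = [0.9, 1.1, 1.3, 0.8, 1.0, 4.9, 5.2, 5.0, 4.7, 5.1]  # made-up, two clear groups
weights, mus, sigmas = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]   # step 0: initial guess
history = [log_likelihood(data, weights, mus, sigmas)]
for _ in range(20):
    weights, mus, sigmas = em_step(data, weights, mus, sigmas)  # steps 1-3
    history.append(log_likelihood(data, weights, mus, sigmas))
```

The `history` list tracks the log-likelihood after each iteration; EM should never decrease it, which is a convenient check when debugging an implementation.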
Definition: Grouping unlabeled data into clusters, for the purpose of inferring hidden structures or information
Dissimilarity measurement
▪ Distance: Euclidean (L2), Manhattan (L1), …
▪ Angle: inner product, …
▪ Non-metric: rank, intensity, …
Types of clustering
▪ Hierarchical: agglomerative or divisive
▪ Partitioning: K-means, VQ, MDS, …
(Matlab help page)
K-means: find K partitions with the total intra-cluster variance minimized (MacKay, 2003)
Iterative method
▪ Initialization: randomized $y_i$
▪ Assignment of x ($y_i$ fixed)
▪ Update of $y_i$ (x fixed)
Problem? Trapped in local minima
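The iterative method above (initialize, assign, update) can be sketched as follows. The function names, the seed, and the sample points are illustrative assumptions; centers are initialized by picking K data points at random, which is one common choice among several.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initialization: random y_i
    labels = None
    for _ in range(iterations):
        # Assignment of x (centers fixed): the nearest center wins.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points
        ]
        # Update of y_i (assignments fixed): mean of the assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centers, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]  # made-up data
centers, labels = kmeans(points, 2)
```

On data this well separated the result does not depend on the starting centers, but in general different seeds can end in different local minima, which is the problem the slide points out.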
Deterministic Annealing (DA): deterministically avoid local minima
▪ No stochastic process (random walk)
▪ Tracing the global solution by changing the level of randomness
Statistical mechanics
▪ Gibbs distribution: $P(E_x) = \frac{1}{Z} \exp(-E_x / T)$, with $Z = \sum_x \exp(-E_x / T)$
▪ Helmholtz free energy: $F = D - TS$
▪ Average energy: $D = \langle E_x \rangle$
▪ Entropy: $S = -\sum_x P(E_x) \ln P(E_x)$
▪ $F = -T \ln Z$
In DA, we minimize F
(Maxima and Minima, Wikipedia)
Analogy to the physical annealing process
▪ Control energy (randomness) by temperature (high → low)
▪ Starting with high temperature ($T = \infty$): soft (or fuzzy) association probability; smooth cost function with one global minimum
▪ Lowering the temperature ($T \to 0$): hard association; revealing full complexity as clusters emerge
Minimization of F, using $E(x, y_j) = \|x - y_j\|^2$, proceeds iteratively
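A sketch of one way the annealing loop can be written for one-dimensional data, assuming the usual deterministic-annealing clustering updates: a Gibbs-style soft association of each point with each center, followed by a probability-weighted mean for each center. The cooling schedule, the start and end temperatures, the symmetry-breaking offsets, and the data are assumptions for illustration and are not taken from the slides.

```python
import math

def da_cluster(xs, k, t_start=10.0, t_end=0.01, cooling=0.8, inner_iters=30):
    """Deterministic-annealing clustering of 1-D data with E(x, y_j) = (x - y_j)^2."""
    mean = sum(xs) / len(xs)
    # Tiny deterministic offsets break the symmetry between otherwise identical centers.
    centers = [mean + 1e-3 * (j - (k - 1) / 2.0) for j in range(k)]
    t = t_start
    while t > t_end:
        for _ in range(inner_iters):
            # Soft association: Gibbs distribution over centers at temperature t.
            probs = []
            for x in xs:
                energies = [(x - y) ** 2 for y in centers]
                e_min = min(energies)  # shift for numerical stability
                w = [math.exp(-(e - e_min) / t) for e in energies]
                z = sum(w)
                probs.append([wj / z for wj in w])
            # Center update: probability-weighted mean of the data.
            for j in range(k):
                mass = sum(p[j] for p in probs)
                centers[j] = sum(p[j] * x for p, x in zip(probs, xs)) / mass
        t *= cooling  # lower the temperature
    return sorted(centers)

data = [0.9, 1.1, 1.3, 0.8, 1.0, 4.9, 5.2, 5.0, 4.7, 5.1]  # made-up, two clear groups
da_centers = da_cluster(data, 2)
```

At high temperature both centers sit near the overall mean; as the temperature drops they separate and the associations harden, with no random restarts involved.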
Definition: Process of transforming high-dimensional data into low-dimensional data to improve accuracy, aid understanding, or remove noise
Curse of dimensionality: complexity grows exponentially in volume as extra dimensions are added
Types
▪ Feature selection: choose representatives (e.g., filter, …)
▪ Feature extraction: map to lower dimensions (e.g., PCA, MDS, …)
(Koppen, 2000)
Finding a map of the principal components (PCs) of data into an orthogonal space, such that y = W x, where $W \in \mathbb{R}^{d \times h}$ ($h \gg d$)
▪ PCs: variables with the largest variances
▪ Orthogonality
▪ Linearity: optimal least mean-square error
Limitations?
▪ Strict linearity
▪ Specific distribution
▪ Large variance assumption
(Figure: data in the (x1, x2) plane with principal axes PC 1 and PC 2)
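As a small illustration of the largest-variance idea, the sketch below finds only the first principal component of two-dimensional data by power iteration on the covariance matrix. Power iteration is swapped in here to avoid a linear-algebra dependency; the slides do not say how the PCs are computed, and the data and names are made up.

```python
import math

def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(d)]
    centered = [[r[i] - means[i] for i in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)] for i in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rows = [(2.0, 1.9), (1.0, 1.1), (0.0, 0.1), (-1.0, -0.9), (-2.0, -2.2)]  # made-up data
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)  # plays the role of one row of W
projected = [sum(c[i] * pc1[i] for i in range(len(pc1))) for c in centered]
projected_variance = sum(p * p for p in projected) / (len(projected) - 1)
```

The variance of the projected values is at least as large as the variance along either original axis, which is what "PCs are the variables with the largest variances" means in practice.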
Like PCA, reduction of dimension by y = R x, where R is a random matrix with i.i.d. columns and $R \in \mathbb{R}^{d \times p}$ ($p \gg d$)
Johnson-Lindenstrauss lemma: when projecting to a randomly selected subspace, distances are approximately preserved
Generating R
▪ Hard to obtain an orthogonalized R
▪ Gaussian R
▪ Simple approach: choose $r_{ij} \in \{+\sqrt{3}, 0, -\sqrt{3}\}$ with probability 1/6, 4/6, 1/6 respectively
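A sketch of the simple approach: build a d × p matrix from the three values with the stated probabilities, project two points, and compare their distance before and after. The 1/sqrt(d) scaling is an assumption added here so that squared lengths are preserved in expectation; the dimensions, seeds, and test points are also illustrative.

```python
import math
import random

def sparse_random_matrix(d, p, seed=0):
    """d x p matrix with entries +sqrt(3), 0, -sqrt(3) drawn with probability 1/6, 4/6, 1/6."""
    rng = random.Random(seed)
    values = [math.sqrt(3.0), 0.0, -math.sqrt(3.0)]
    return [rng.choices(values, weights=[1, 4, 1], k=p) for _ in range(d)]

def project(r, x):
    """y = R x, scaled by 1/sqrt(d) (assumed normalization)."""
    d = len(r)
    return [sum(rij * xj for rij, xj in zip(row, x)) / math.sqrt(d) for row in r]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(1)
p, d = 1000, 200                         # original and reduced dimensions (made up)
a = [rng.gauss(0, 1) for _ in range(p)]  # two made-up high-dimensional points
b = [rng.gauss(0, 1) for _ in range(p)]
r = sparse_random_matrix(d, p)
ratio = distance(project(r, a), project(r, b)) / distance(a, b)
```

A `ratio` close to 1 means the pairwise distance survived the projection; how close depends on d, in line with the Johnson-Lindenstrauss lemma.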
Dimension reduction preserving the distance proximities observed in the original data set
Loss functions: inner product, distance, squared distance
Classical MDS: minimizing STRAIN, given the dissimilarity matrix $\Delta$
▪ From $\Delta$, find the inner product matrix B (double centering)
▪ From B, recover the coordinates X’ (i.e., $B = X' X'^T$)
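The double-centering step can be sketched as below, assuming the input is a matrix of squared Euclidean distances. The check at the end compares B with the Gram matrix of the mean-centered points, which is what double centering should reproduce. Recovering X’ from B needs an eigendecomposition and is left out; the point configuration and names are made up.

```python
def double_center(sq_dist):
    """B = -1/2 * J D2 J with J = I - (1/n) 11^T, written element-wise."""
    n = len(sq_dist)
    row_mean = [sum(row) / n for row in sq_dist]
    col_mean = [sum(sq_dist[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq_dist[i][j] - row_mean[i] - col_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]

points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]  # made-up configuration
sq_dist = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
b_matrix = double_center(sq_dist)

# Reference: Gram matrix of the mean-centered points, which B should match.
centroid = [sum(c) / len(points) for c in zip(*points)]
centered = [[a - m for a, m in zip(p, centroid)] for p in points]
gram = [[sum(a * b for a, b in zip(p, q)) for q in centered] for p in centered]
```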
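SMACOF: minimizing STRESS
▪ Majorization: for a complex f(x), find a simple auxiliary g(x, y) such that $f(x) \le g(x, y)$ for all x and $f(y) = g(y, y)$
▪ Majorization for STRESS: minimizing the majorizing function, which contains the term $\mathrm{tr}(X^T B(Y) Y)$, gives the update known as the Guttman transform
(Cox, 2001)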
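A compact sketch of the SMACOF iteration with unit weights, where each step applies the Guttman transform X = (1/n) B(Y) Y to the previous configuration Y. The unit weights, the target configuration used to build the dissimilarities, and the starting layout are assumptions for illustration.

```python
import math

def pairwise(x):
    return [[math.dist(a, b) for b in x] for a in x]

def stress(x, delta):
    """Raw STRESS: sum over pairs of (d_ij(X) - delta_ij)^2."""
    d = pairwise(x)
    n = len(x)
    return sum((d[i][j] - delta[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))

def guttman_transform(y, delta):
    """X = (1/n) B(Y) Y for unit weights."""
    n, dim = len(y), len(y[0])
    d = pairwise(y)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and d[i][j] > 0.0:
                b[i][j] = -delta[i][j] / d[i][j]
        b[i][i] = -sum(b[i][j] for j in range(n) if j != i)
    return [[sum(b[i][k] * y[k][c] for k in range(n)) / n for c in range(dim)] for i in range(n)]

target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # made-up points defining delta
delta = pairwise(target)
x = [[0.1, 0.0], [0.9, 0.3], [0.4, 1.2], [-0.3, 0.5]]       # arbitrary starting layout
history = [stress(x, delta)]
for _ in range(50):
    x = guttman_transform(x, delta)
    history.append(stress(x, delta))
```

Because each step minimizes a function that majorizes STRESS, the values in `history` should never increase.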
Competitive and unsupervised learning process for clustering and visualization
Result: similar data move closer together in the model space
(Figure: input space and model space)
Learning
▪ Choose the model vector $m_j$ most similar to $x_i$
▪ Update the winner and its neighbors by $m_k = m_k + \alpha(t)\, \phi(t)\, (x_i - m_k)$
▪ $\alpha(t)$ : learning rate
▪ $\phi(t)$ : neighborhood size
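The winner-and-neighbors update above has the form used by self-organizing maps, so the sketch below assumes that setting. The one-dimensional chain of model vectors, the Gaussian neighborhood, and the names `learning_rate` and `radius` are illustrative assumptions rather than details from the slides.

```python
import math

def best_matching_unit(models, x):
    """Index of the model vector closest to the input x."""
    return min(range(len(models)), key=lambda k: math.dist(models[k], x))

def som_update(models, x, learning_rate, radius):
    """One update on a 1-D chain of model vectors (modified in place)."""
    winner = best_matching_unit(models, x)
    for k, m in enumerate(models):
        # Gaussian neighborhood on the chain index (an assumed, common choice).
        h = math.exp(-((k - winner) ** 2) / (2.0 * radius ** 2))
        models[k] = [mi + learning_rate * h * (xi - mi) for mi, xi in zip(m, x)]
    return winner

models = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # made-up model vectors
x = [1.2, 0.9]                                 # made-up input
before = [math.dist(m, x) for m in models]
winner = som_update(models, x, learning_rate=0.5, radius=1.0)
after = [math.dist(m, x) for m in models]
```

In a full training loop both the learning rate and the radius would shrink over time, which is why the slide writes them as functions of t.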
Definition: A procedure that divides data into a given set of categories, based on the training set, in a supervised way
Generalization vs. specialization
▪ Hard to achieve both
▪ Avoid overfitting (overtraining): early stopping, holdout validation, K-fold cross-validation, leave-one-out cross-validation
(Figure: training error and validation error curves, with underfitting and overfitting regions; Overfitting, Wikipedia)
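A small sketch of how K-fold cross-validation splits the data: each fold is held out once for validation while the remaining folds are used for training. The function names are made up, and the folds are contiguous for simplicity; shuffling the indices first is usually advisable.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_validation_splits(n_samples, k):
    """Yield (train_indices, validation_indices); each fold is held out once."""
    folds = k_fold_indices(n_samples, k)
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

splits = list(train_validation_splits(10, 3))
```

Setting k equal to the number of samples gives leave-one-out cross-validation as a special case.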
Perceptron: a computational unit with a binary threshold
Abilities
▪ Linearly separable decision surface
▪ Represent boolean functions (AND, OR, NOT)
Network (multilayer) of perceptrons: various network architectures and capabilities
(Figure: weighted sum followed by an activation function; Jain, 1996)
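A sketch of a single perceptron learning the boolean AND and OR functions with the classic perceptron rule. The learning rate, the epoch count, and the names are illustrative choices.

```python
def predict(weights, bias, x):
    """Binary threshold unit: 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: w <- w + learning_rate * (target - output) * x."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
and_w, and_b = train_perceptron(and_samples)
or_w, or_b = train_perceptron(or_samples)
```

Both functions are linearly separable, so the rule settles on weights that classify all four inputs correctly. XOR is not linearly separable, which is one motivation for multilayer networks.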
Learning weights: random initialization and updating
Error-correction training rules
▪ Difference between training data and output: E(t, o)
▪ Gradient descent (batch learning): with $E = \sum_i E_i$, $w \leftarrow w - \eta\, \nabla_w E$
▪ Stochastic approach (on-line learning): update the gradient for each result
Various error functions
▪ Adding a weight regularization term ($\lambda \sum_i w_i^2$) to avoid overfitting
▪ Adding momentum ($\alpha\, \Delta w_i(n-1)$) to expedite convergence
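To make the update rules concrete, here is a sketch of batch gradient descent with a weight regularization term and momentum, applied to a single linear unit rather than a full multilayer network. The squared-error function, the parameter names `eta`, `weight_decay`, and `momentum`, their values, and the training data are all assumptions for illustration.

```python
def train_linear_unit(samples, epochs=200, eta=0.05, weight_decay=0.001, momentum=0.5):
    """Minimize sum_i 0.5 * (t_i - o_i)^2 + 0.5 * weight_decay * sum_j w_j^2 by batch descent."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    previous_step = [0.0] * n_inputs
    errors = []
    for _ in range(epochs):
        grad = [weight_decay * w for w in weights]  # gradient of the regularization term
        total_error = 0.0
        for x, target in samples:
            output = sum(w * xi for w, xi in zip(weights, x))
            total_error += 0.5 * (target - output) ** 2
            for j, xi in enumerate(x):
                grad[j] -= (target - output) * xi   # gradient of the squared error
        errors.append(total_error)
        # Step = -eta * gradient + momentum * previous step.
        step = [-eta * g + momentum * p for g, p in zip(grad, previous_step)]
        weights = [w + s for w, s in zip(weights, step)]
        previous_step = step
    return weights, errors

# Made-up targets generated by t = 2 * x1 - 1 * x2.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
weights, errors = train_linear_unit(samples)
```

The on-line (stochastic) variant would apply the same step after each sample instead of after the whole batch.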
Q: How to draw the optimal linear separating hyperplane? A: Maximize the margin
Margin maximization
▪ The distance between $H_{+1}$ and $H_{-1}$: $2 / \|w\|$
▪ Thus, $\|w\|$ should be minimized
(Figure: margin between the two hyperplanes)
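A quick numeric check of the margin, assuming the usual convention that the two margin hyperplanes are w·x + b = +1 and w·x + b = −1. The hyperplane and the two points are made up so that the arithmetic is easy to follow.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = (3.0, 4.0), -5.0                  # hypothetical hyperplane with ||w|| = 5
on_plus = (3.0 * 6 / 25, 4.0 * 6 / 25)   # satisfies w.x + b = +1
on_minus = (3.0 * 4 / 25, 4.0 * 4 / 25)  # satisfies w.x + b = -1
gap = signed_distance(w, b, on_plus) - signed_distance(w, b, on_minus)
```

Here `gap` and `margin_width(w)` agree, and shrinking ||w|| widens both, which is why the optimization minimizes ||w||.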
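Constrained optimization problem
▪ Given training set $\{x_i, y_i\}$ ($y_i \in \{+1, -1\}$): minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$
Lagrangian equation with saddle points: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
▪ Minimized w.r.t. the primal variables w and b: $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$
▪ Maximized w.r.t. the dual variables $\alpha_i$ (all $\alpha_i \ge 0$)
▪ $x_i$ with $\alpha_i > 0$ (not $\alpha_i = 0$) is called a support vector (SV)
Soft margin (non-separable case)
▪ Slack variables $\xi_i$
▪ Optimization with the additional constraint $\alpha_i \le C$
Non-linear SVM
▪ Map non-linear input to a feature space
▪ Kernel function $k(x, y) = \langle \phi(x), \phi(y) \rangle$
▪ Kernel classifier with support vectors $s_i$: $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, k(s_i, x) + b \right)$
(Figure: input space and feature space)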
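A sketch of how a kernel classifier evaluates a new point once the support vectors and their multipliers are known. The RBF kernel is one common choice and is an assumption here, and the support vectors, multipliers, and bias are hand-picked for an XOR-like layout rather than produced by an optimizer.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2), one common kernel choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_value(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * k(s_i, x) + b."""
    return sum(a * y * kernel(s, x) for s, a, y in zip(support_vectors, alphas, labels)) + bias

def classify(x, support_vectors, alphas, labels, bias):
    """Predicted class is the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, alphas, labels, bias) >= 0 else -1

# Hand-picked values for illustration only.
support_vectors = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
alphas = [1.0, 1.0, 1.0, 1.0]
bias = 0.0
predictions = [classify(s, support_vectors, alphas, labels, bias) for s in support_vectors]
```

No straight line separates this layout in the input space, yet the kernel classifier labels all four points correctly, which illustrates the role of the feature-space mapping.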
Memory Architecture (Barney, 2007)
Shared memory
▪ Symmetric Multiprocessor (SMP)
▪ OpenMP, POSIX, pthread, MPI
▪ Easy to manage but expensive
Distributed memory
▪ Commodity, off-the-shelf processors
▪ MPI
▪ Cost effective but hard to maintain
Decomposition Strategy
▪ Task: e.g., Word, IE, …
▪ Data: scientific problem
▪ Pipelining: task + data
Shrinking
▪ Recall: only support vectors ($\alpha_i > 0$) are used in SVM optimization
▪ Predict whether each data point is an SV or a non-SV
▪ Remove non-SVs from the problem space
Parallel SVM (Graf, 2005)
▪ Partition the problem
▪ Merge data hierarchically
▪ Each unit finds support vectors
▪ Loop until convergence