Transcript of "Recent Advances of Machine Learning" (source: ranking.uos.ac.kr/material/presentation/overview.pdf)

Page 1

Recent Advances of Machine Learning

Jong-June Jeon

University of Seoul

September 7, 2017

Page 2

Learning based on data

Regularization method

Classification and deep learning

Page 3

Learning based on data

Page 4

Regression: a statistical method for learning the relation between two or more variables

Figure: Scatter plots of paired data

Page 5

Regression: a statistical method for learning the relation between two or more variables

Figure: Which one is better?

Page 6

Conventional interest in regression models

Is the detected signal statistically significant?

If the true model is linear, how can I measure the uncertainty of the estimated model?

What is the lower bound of the asymptotic variance of the estimated models?

What is the most efficient estimation method?

How can I select the true model under large samples?

Inference!

Page 7

Prediction

If we are only interested in prediction, there is another story.

Inference is out of interest (e.g. p-value, R², selection consistency).

Under the circumstances of accumulating data, the learned (estimated) model depends on the observed data.

Predictive performance should be assessed on future data.

The goal is to improve prediction accuracy!

Page 8

Change of paradigm

Past: data was a scarce resource. Once data was observed, the model was estimated based on that observed data.

Now: data is no longer a scarce resource. Whenever data is observed, the model is learned (estimated) based on the data.

Page 9

Linear model

Explanatory variable: x ∈ R^p

Response variable: y ∈ R

Linear predictor: y = x'β

β = (β1, · · · , βp)' ∈ R^p is called the regression coefficient.

Page 10

Linear model: estimation

Let the training set be Tr = {(yi, xi) : 1 ≤ i ≤ n}.

The least squares method is given by the minimizer of RSS(β), where

RSS(β) = ∑_{i=1}^n (yi − xi'β)²

This estimator is called the least squares estimator (LSE).
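As a sketch, the LSE can be computed with NumPy. The data below are hypothetical, simulated from a known linear model for illustration:

```python
import numpy as np

# Simulate a hypothetical linear model y_i = x_i' beta + eps_i.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# The LSE minimizes RSS(beta) = sum_i (y_i - x_i' beta)^2;
# lstsq solves the normal equations in a numerically stable way.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```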

Page 11

Figure: Estimated predictor: Y = 0.5274 + 1.0212X

Page 12

Linear model: estimation

Generative model assumption:

yi = xi'β* + εi,  εi ~iid (0, σ²)

If the true model is linear, it is known that the LSE β̂ converges to the true parameter in probability.

√n(β̂ − β*) converges weakly to a Gaussian distribution.

Page 13

Linear model: estimation

Note that the uncertainty of β̂ depends on the training set Tr.

Let x0 be an evaluation point; then the linear predictor is x0'β̂, and the risk defined through the l2 loss function is given by E_F(Y − x0'β̂)², where Y ~ F (the conditional distribution given X = x0).

Generally, the risk of the predicted model is given by

R(β̂) = E_F(Y − X'β̂)²

where (Y, X) ~ F.

Page 14

Linear model: estimation

Model bias: the discrepancy between the true model and the expectation of the estimated model.

Suppose that E(Y|X = x) ≠ x'β for every β ∈ R^p.

For example,

Yi = 3 + Xi² + εi where εi ~ N(0, 1)

Note that E(Yi|Xi = x) = 3 + x².

Try to fit a model in the linear model space.
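A small simulation illustrates this model bias: fitting a straight line to hypothetical data generated from 3 + x² recovers the best linear approximation, not the true curve. (The data and design range here are assumptions for illustration.)

```python
import numpy as np

# Hypothetical data from the quadratic model Y = 3 + X^2 + eps, eps ~ N(0, 1).
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-2, 2, size=n)
y = 3 + x**2 + rng.normal(size=n)

# Least squares fit in the linear model space {a + b*x}.
A = np.column_stack([np.ones(n), x])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

# On a symmetric design the best linear fit has slope ~ 0 and intercept
# ~ 3 + E(X^2) = 3 + 4/3; the remaining gap to 3 + x^2 is the model bias.
print(a_hat, b_hat)
```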

Page 15

Linear model: estimation

Figure: Blue line denotes the estimated model in the linear model space

Page 16

Learning based on algorithm

K-nearest neighbors (K-NN) algorithm

Let (yi, xi) for i = 1, · · · , n be training samples.

Nk(x): the index set of the k samples closest to x.

The predictor is given by

Ŷ(x) = (1/k) ∑_{xi∈Nk(x)} yi
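A minimal sketch of this predictor on one-dimensional toy data (the helper name `knn_predict` and the data values are chosen here, not from the slides):

```python
import numpy as np

def knn_predict(x0, X, y, k):
    # Y_hat(x0): average the responses of the k training points nearest to x0.
    idx = np.argsort(np.abs(X - x0))[:k]   # indices of the k closest samples
    return y[idx].mean()

# Hypothetical training data with two clusters of x values.
X = np.array([0.0, 0.1, 0.2, 0.9, 1.0, 1.1])
y = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
print(knn_predict(0.05, X, y, k=3))   # averages the three leftmost responses
```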

Page 19

10-NN algorithm

Ŷ(0) = 2.446989

Page 20

10-NN algorithm

Ŷ(1) = 4.024909

Page 21

10-NN algorithm

Ŷ(−1) = 4.314436

Page 22

10-NN algorithm

Page 23

Linear model in classification problem

Figure: x = (x1, x2)' ∈ R²

Page 24

Linear model in classification problem

Page 25

Linear model in classification problem

Linear predictor: y = 0.62223 − 0.03565x1 − 0.05486x2

Decision boundary: {(x1, x2) ∈ R² : 0.62223 − 0.03565x1 − 0.05486x2 = 0.5}

Page 26

Linear model in classification problem

Figure: the decision boundary is given by a hyperplane

Page 27

Linear model in classification problem

Page 28

K-NN in the classification problem

Ŷ(x) = 1 if (1/k) ∑_{xi∈Nk(x)} yi ≥ 1/2, and 0 otherwise.
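A minimal sketch of this rule in two dimensions (the helper name `knn_classify` and the data points are hypothetical):

```python
import numpy as np

def knn_classify(x0, X, y, k):
    # Predict 1 when the local average of the 0/1 labels is at least 1/2.
    dist = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to x0
    idx = np.argsort(dist)[:k]              # the k nearest neighbours
    return 1 if y[idx].mean() >= 0.5 else 0

# Hypothetical labelled points: a cluster of 1s near (0, 2), 0s near (3, 0).
X = np.array([[0.0, 2.0], [0.1, 2.1], [0.0, 1.9], [3.0, 0.0], [3.1, 0.1]])
y = np.array([1, 1, 1, 0, 0])
print(knn_classify(np.array([0.0, 2.0]), X, y, k=3))
```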

Page 30

K-NN in the classification problem

(1/k) ∑_{xi∈Nk(x)} yi = 9/11 ≥ 1/2 ⇒ Ŷ((0, 2)) = 1

Page 32

K-NN in the classification problem

(1/k) ∑_{xi∈Nk(x)} yi = 5/11 < 1/2 ⇒ Ŷ((0, 0)) = 0

Page 33

K-NN in the classification problem

A non-linear decision boundary is obtained by the K-NN algorithm.

Page 34

K-NN in the classification problem

Page 35

Model complexity

The selection of K is crucial to obtain the best predictive performance.

Page 36–39

Model complexity (figure-only slides)

Page 40

Idea of K-NN algorithm

The K-NN algorithm can produce a nonlinear model easily.

The idea of the K-NN algorithm is local approximation.

Ŷ(x) = (1/k) ∑_{xi∈Nk(x)} yi

For small K, fewer data points are used to estimate the local mean, so the model becomes more complex and the variance of the local mean increases.

For large K, vice versa.

Page 41

Bias-Variance trade-off

Prediction error = (Model bias)² + Model variance

Complex model: large variance and small bias

Simple model: small variance and large bias
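The trade-off can be illustrated with a small Monte Carlo sketch using K-NN regression at a single point (the model y = x² + noise, the sample sizes, and the helper name `knn_at_zero` are assumptions for illustration):

```python
import numpy as np

# Monte Carlo sketch of the bias-variance trade-off for K-NN at x0 = 0
# under the hypothetical model y = x^2 + eps, so E(Y | X = 0) = 0.
rng = np.random.default_rng(2)

def knn_at_zero(k, n=100):
    x = rng.uniform(-1, 1, size=n)
    y = x**2 + 0.3 * rng.normal(size=n)
    idx = np.argsort(np.abs(x))[:k]          # k nearest neighbours of 0
    return y[idx].mean()

results = {}
for k in (1, 50):
    preds = np.array([knn_at_zero(k) for _ in range(500)])
    # (squared bias, variance) of the K-NN estimate over repeated samples
    results[k] = ((preds.mean() - 0.0) ** 2, preds.var())
print(results)
```

Small K gives large variance and small bias; large K gives small variance and larger bias, matching the slide.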

Page 42

Selecting a model with moderate complexity is required to achieve the best predictive model under a restricted training sample.

Page 43

Ensemble method: a variance-reduction method that combines predicted models (weak learners).

We will see examples of the ensemble method applied to tree models.

Page 44

Tree model

A learning algorithm that splits the input space into regions

Page 45

Figure: data and the underlying function

Page 46

Single regression tree

Page 47

10 regression trees using randomly sampled data

Page 48

Average of the trees

Page 49

A hard problem for a classification tree

Page 50

single tree

Page 51

25 averaged trees

Page 52

25 voted trees

Page 53

Bagging: Bootstrap Aggregating

Bootstrapping: re-sample the data B times to obtain Tr^(b).

Learn a model f̂b using Tr^(b) for each b = 1, · · · , B.

Aggregating:

Regression: f̂bag(x) = (1/B) ∑_{b=1}^B f̂b(x)

Classification: f̂bag(x) = Voting{f̂b(x) : 1 ≤ b ≤ B}
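The two steps can be sketched as follows for regression; the choice of a one-split decision stump as the weak learner, the helper names, and the step-function data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_stump(x, y):
    # One-split regression tree: choose the split s minimizing the RSS.
    best = None
    for s in np.unique(x):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or rss < best[0]:
            best = (rss, s, left.mean(), right.mean())
    _, s, ml, mr = best
    return lambda t: np.where(t <= s, ml, mr)

def bagging_predict(x, y, x_eval, B=25):
    n = len(x)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap resample T_r^(b)
        preds.append(fit_stump(x[idx], y[idx])(x_eval))
    return np.mean(preds, axis=0)                   # aggregate by averaging

# Hypothetical data: a noisy step function.
x = np.linspace(0, 1, 50)
y = (x > 0.5).astype(float) + 0.1 * rng.normal(size=50)
pred = bagging_predict(x, y, np.array([0.1, 0.9]))
print(pred)
```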

Page 54

Theoretical background of bagging

Variance reduction

Bias does not change

Note that "Prediction error = Bias² + Variance".

Page 55

Linear model

Why did so many statisticians and mathematicians study the linear model?

Page 56

High dimensional linear model

Suppose that Yi = f(Zi) + εi, where f is a smooth function.

For sufficiently large p, f(z) ≈ β0 + ∑_{j=1}^p βj z^j and

E(Yi|Zi) ≈ β0 + ∑_{j=1}^p βj Xij

where Xij = Zi^j.

Let Xi = (Xi1, · · · , Xip)' ∈ R^p and Yi ∈ R.

Linear model:

Yi = β0 + ∑_{j=1}^p βj Xij + εi

Page 57

High dimensional linear model

Even for a non-smooth function f,

E(Yi|Xi) ≈ β0 + ∑_{j=1}^p βj Bj(Xi)

by an appropriate selection of basis functions Bj.

How can we control model complexity in the high dimensional linear model?

Page 58

Regularization method

Page 59

Statistical learning by empirical risk minimization

zi = (yi, xi')' ~iid P

yi = g(xi) + εi for i = 1, · · · , n, for g ∈ G

Loss function: l : Z × G → R+

l2 loss function: l(zi, g) = (yi − g(xi))²

Risk function: R(g) = E_P l(z, g)

Empirical risk function: Rn(g) = ∑_{i=1}^n l(zi, g)/n

Statistical learning:

ĝ = argmin_{g∈G} Rn(g)

Page 60

Visualization of bias-variance trade-off

The regularization method utilizes the bias-variance trade-off by restricting the model space.

Page 61

Regularization

ĝ = argmin_{g∈G} Rn(g) subject to J(g) ≤ C,

where J : G → R+ is a penalty (regularization) function.

We can write the above optimization problem as follows:

ĝ = argmin_g Rn(g) + λJ(g)

for some λ ≥ 0.

Page 62

LSE

(yi, xi) for i = 1, · · · , n: pairs of response and explanatory variables (yi ∈ R and xi ∈ R^p)

β = (β1, · · · , βp)'

yi = xi'β + εi (εi ~iid (0, σ²))

LSE:

β̂ = argmin_β ∑_{i=1}^n (yi − xi'β)²

Page 63

LSE

The LSE is the best linear unbiased estimator (BLUE).

When p > n, the LSE is not unique.

When n ≈ p, the variance of the LSE is large.

Page 64

Ridge estimator

β̂λ = argmin_β ∑_{i=1}^n (yi − xi'β)² + λ ∑_{j=1}^p βj²  (the second term is the penalty function)

for λ > 0.

The ridge estimator always exists.

There exists a λ > 0 such that the ridge estimator corresponding to that λ has better predictive performance than the LSE.

When p > n, the consistency of β̂λ (convergence to the true parameter) is not guaranteed.
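One standard way to compute the ridge estimator is the closed form (X'X + λI)⁻¹X'y, which I use in this sketch; the data are hypothetical, with p > n so that the LSE is not unique but the ridge estimator still exists:

```python
import numpy as np

# Hypothetical high dimensional data: p = 50 coefficients, only n = 20 samples.
rng = np.random.default_rng(4)
n, p = 20, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(lam):
    # X'X + lam*I is positive definite for lam > 0, so the solve succeeds
    # even though X'X alone is singular here (rank <= n < p).
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_1 = ridge(1.0)
beta_100 = ridge(100.0)
# A larger lambda shrinks the estimator toward zero.
print(np.linalg.norm(beta_1), np.linalg.norm(beta_100))
```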

Page 65

LASSO estimator

β̂λ = argmin_β ∑_{i=1}^n (yi − xi'β)² + λ ∑_{j=1}^p |βj|  (the second term is the penalty function)

for λ > 0.

In the high dimensional problem (p ≈ exp(n)), the LASSO works well.

Under regularity conditions, the lasso estimator achieves the minimax optimal error bound.

But strong model conditions are required for selection consistency.
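One way to compute the LASSO is proximal gradient descent (ISTA) with soft-thresholding; this solver choice is an assumption, not from the slides, and the sparse data below are hypothetical:

```python
import numpy as np

# Hypothetical sparse regression: only the first 3 of 20 coefficients are nonzero.
rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

# ISTA minimizes (1/2)||y - X beta||^2 + lam * ||beta||_1
# (the slide's objective with lambda rescaled by a factor of 2).
lam = 10.0
step = 1.0 / np.linalg.norm(X.T @ X, 2)     # 1/L, L = largest eigenvalue of X'X
beta = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ beta - y)             # gradient of the smooth RSS term
    z = beta - step * grad
    # soft-thresholding: the proximal operator of the l1 penalty,
    # which sets small coefficients exactly to zero (sparsity)
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.round(beta, 2))
```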

Page 66

Regularized estimator with nonconvex penalty

β̂λ = argmin_β ∑_{i=1}^n (yi − xi'β)² + ∑_{j=1}^p Gλ(|βj|)  (the second term is the penalty function)

for λ > 0.

When the signals are large (> O(1/√n)), the method can select the signal variables well (the oracle property).

But strong model conditions are required for minimax optimality.

Page 67

The regularization method provides a useful view of machine learning applications.

Page 68

Sparsity

From the non-differentiable points in the model constraints, we know that there is positive probability that the estimated model has exactly zero coefficients.

By introducing various types of constraints with non-differentiable points, structural learning is possible.

Page 69

Structural sparse modeling 1

yi = µi + εi for i = 1, · · · , n.

µ̂ = (µ̂1, · · · , µ̂n):

(µ̂1, · · · , µ̂n) = argmin_µ ∑_{i=1}^n (yi − µi)² + λ ∑_{j=1}^{n−1} |µ_{j+1} − µj|  (the second term is the penalty function)

Note that

λ = ∞: µ̂j = ȳ for all j.

λ = 0: µ̂j = yj for all j.

λ controls the number of change points.

Page 70

Signal approximator

Model:

yi = βi + εi

where εi ~iid N(0, σ²) for i = 1, · · · , n.

Figure: Signal approximator

Page 71

Structural sparse modeling 2

yi = x′iβ + εi for i = 1, · · · , n.

We obtain the following estimator:

β̂ = argmin_β ∑_{i=1}^n (yi − xi'β)² + λ ∑_{j≠k} |βj − βk|

This yields clustering of the estimated coefficients.

Page 72

Map of estimated trends in extreme precipitation

Page 73

Classification and deep learning

Page 74

Binary Classification

Response variable: y ∈ Y = {−1, 1}

Explanatory variable: x ∈ X

Classification function: C : X → Y

e.g.) Let y be a variable denoting disease or normal, and let x be a vector of diagnostic results. A doctor is a classification function mapping x to Y.

Page 75

Let P be the distribution of (y, x).

Misclassification error of C on the population:

P(C(x) ≠ y) = P(C(x) = −1, y = 1) + P(C(x) = 1, y = −1)

Bayes error:

min_C P(C(x) ≠ y)

Bayes classifier:

C* = argmin_C P(C(x) ≠ y)

Page 76

Note that the Bayes classifier is given by

C*(x) = 1 if P(y = 1|x) ≥ 0.5, and −1 otherwise.
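A minimal sketch, assuming the posterior probability P(y = 1|x) is known exactly; the logistic form of the posterior below is a hypothetical choice for illustration:

```python
import math

def p_y1_given_x(x):
    # Hypothetical known posterior probability: a logistic curve in x.
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def bayes_classifier(x):
    # The Bayes rule: predict 1 exactly when P(y = 1 | x) >= 0.5.
    return 1 if p_y1_given_x(x) >= 0.5 else -1

print(bayes_classifier(1.0), bayes_classifier(-1.0))  # → 1 -1
```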

Page 77

We assume that (yi,xi) for i = 1, · · · , n are iid random samples.

min_C ∑_{i=1}^n I(C(xi) ≠ yi)/n

The final goal is to estimate a classifier which minimizes the classification error.

The function class of C is too complex, so we adopt an alternative method to construct a classifier.

Page 78

Scoring function

f : X → R: a function that assigns a score to the observation with covariate x. Using this score, we construct a classifier as follows:

C(x; f) = 1 if f(x) ≥ 0, and −1 otherwise.

Page 79

Estimation and surrogate loss function

It is reasonable to find f minimizing

∑_{i=1}^n I(C(xi; f) ≠ yi)/n,

which is equivalently written as ∑_{i=1}^n I(yi f(xi) < 0)/n. Here yi f(xi) is called the margin.

cf) I(x > 0) = 1 if x > 0, and I(x > 0) = 0 otherwise.

Page 80

Convex surrogate loss function

However, this task requires heavy computation, so we cannot use this method directly. The complexity comes from the 0-1 loss function. We replace the 0-1 loss function with another convex loss function, called the surrogate loss function.

Page 81

Surrogate loss function

Page 82

Surrogate loss function

Finally, we minimize the surrogate risk function

L(f; φ) = ∑_{i=1}^n φ(yi f(xi))/n

The estimator in the classification problem is given by

f̂ = argmin_f L(f; φ)

Page 83

If the region f(x) = 0 is similar to the region P(Y = 1|x) = 0.5, then f(x) gives an approximate Bayes classifier:

C(x) = 1 if f(x) ≥ 0, and C(x) = −1 otherwise.

If f(x) is linear, the decision boundary f(x) = 0 is a hyperplane. If the Bayes classifier is not linear, a model bias in classification always exists.


Example: logistic regression

linear model: f(x) = x′β

logistic loss: φ(yf(x)) = log(1 + exp(−yf(x)))

estimator: argmin_β ∑_{i=1}^{n} φ(y_i x_i′β)

The estimator is equal to the maximum likelihood estimator in the logistic regression model.
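A sketch of this estimator (made-up data, illustrative learning rate): plain gradient descent on the empirical surrogate risk with the logistic loss.

```python
import numpy as np

# Hypothetical data generated from a logistic model with beta = (2, -1, 0.5)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X @ np.array([2.0, -1.0, 0.5]) + rng.logistic(size=200) >= 0, 1, -1)

def risk(beta):
    # (1/n) sum_i phi(y_i x_i'beta) with phi(z) = log(1 + exp(-z))
    return np.mean(np.log1p(np.exp(-y * (X @ beta))))

beta = np.zeros(3)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(y * (X @ beta)))        # -phi'(y_i x_i'beta)
    grad = -np.mean((y * p)[:, None] * X, axis=0)   # gradient of the risk
    beta -= 0.5 * grad

print(risk(np.zeros(3)), risk(beta))                # risk drops below log 2
```

The minimizer coincides with the maximum likelihood estimator because φ is exactly the per-observation negative log-likelihood of the logistic model.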


How can we produce a complex model f?


Estimation of non-linear classifiers

Additive models: assume that f is an additive model of trees (stumps) or other simple classifiers. When φ(z) = exp(−z) and a ridge penalty (see regularization) is applied, the classification method is called AdaBoost.

Feature mapping: assume that the input space is projected onto a feature space. When φ(z) = (1 − z)_+ and a ridge penalty (see regularization) is applied, the classification method is called the support vector machine.
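The hinge-loss case can be sketched in its simplest primal form (made-up linearly separable data, hypothetical penalty weight and learning rate); this is a simplified stand-in for the SVM, which in practice is usually solved through the dual problem with kernels.

```python
import numpy as np

# Hypothetical data separable by the direction (1, -1)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] - X[:, 1] >= 0, 1, -1)

lam = 0.01                 # ridge penalty weight (illustrative choice)
beta = np.zeros(2)

def objective(b):
    # (1/n) sum_i (1 - y_i x_i'b)_+  +  lam * ||b||^2
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ b))) + lam * np.sum(b ** 2)

for _ in range(300):
    margins = y * (X @ beta)
    active = margins < 1                              # where (1 - z)_+ has slope -1
    grad = -np.mean((y * active)[:, None] * X, axis=0) + 2 * lam * beta
    beta -= 0.1 * grad                                # subgradient step

print(objective(np.zeros(2)), objective(beta))
```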


Neural Network for classification

Composition

Input data is mapped onto a linear space.
Its image is transformed by a non-linear activation function (sigmoid, tanh, ReLU, ...).
The transformed data is mapped onto a linear space again.
· · ·
A feature of the input data is obtained by the above compositions.
Apply a conventional classification method.
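The steps above can be sketched with random (untrained) weight matrices; the shapes and the tanh activation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def layer(W):
    # one composition step: a linear map followed by a non-linear activation
    return lambda h: np.tanh(h @ W)

x = rng.normal(size=(5, 4))                       # 5 inputs with 4 raw features
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

feature = layer(W2)(layer(W1)(x))                 # h2(h1(x)): the composed feature
scores = feature @ rng.normal(size=3)             # conventional linear classifier on top
labels = np.where(scores >= 0, 1, -1)
print(labels)
```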


Neural Network for classification

Figure: Visualization of neural network


Neural Network with l2 loss for classification

Here we consider a single-layer neural network without a bias term for notational simplicity. The objective function is given by

L(β, α) = (1/n) ∑_{i=1}^{n} (1 − y_i ∑_{j=1}^{k} β_j σ(x_i′α_j))²,

where σ is the sigmoid function, β = (β_1, · · · , β_k) and α = (α_1, · · · , α_k).


Neural Network for classification

argmin_{β,α} L(β, α) = argmin_{β,α} ∑_{i=1}^{n} (1 − y_i ∑_{j=1}^{k} β_j σ(x_i′α_j))²

Note that the score function is given by f(x) = ∑_{j=1}^{k} β_j σ(x′α_j) and the surrogate loss function is φ(z) = (1 − z)².
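A minimal sketch of fitting this single-layer network by gradient descent on the squared surrogate (made-up XOR-like data; the width k = 8, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Hypothetical non-linear target: y = sign(x1 * x2)
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] >= 0, 1, -1)

k = 8                                   # number of hidden units
alpha = rng.normal(size=(2, k))
beta = 0.1 * rng.normal(size=k)

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def risk():
    f = sigma(X @ alpha) @ beta         # f(x_i) = sum_j beta_j sigma(x_i'alpha_j)
    return np.mean((1.0 - y * f) ** 2)  # (1/n) sum_i (1 - y_i f(x_i))^2

risk_init = risk()
for _ in range(2000):
    H = sigma(X @ alpha)                              # n x k hidden activations
    r = -2.0 * y * (1.0 - y * (H @ beta)) / len(y)    # d risk / d f_i
    g_beta = H.T @ r                                  # gradient in beta
    g_alpha = X.T @ (np.outer(r, beta) * H * (1.0 - H))  # back-prop through sigma
    beta -= 0.05 * g_beta
    alpha -= 0.05 * g_alpha

print(risk_init, risk())
```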


It is known that all of these methods (additive models, feature mapping, composition) can achieve the ideal decision boundary as the number of data points goes to infinity.

The natural question is then: "what is the more efficient way to construct a complex model?" That is, which method requires fewer parameters to construct a nearly optimal decision function?

The answer is simple: it depends on the true model. But many difficult problems have been solved by deep neural networks (image and video data analysis). ⇒ High-order composition works! (deep learning)


Deep neural network

Construct a nonlinear model by compositions:

f(x) = h_k ∘ h_{k−1} ∘ · · · ∘ h_1(x)

Regularization (tuning the estimated model)

scheduling the learning rate
selection of the momentum parameter
drop-out rate
number of layers (depth)
· · ·

Computational issues

the number of parameters is very large (10^7 ∼)
development of parallel optimization algorithms
GPGPU computing
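The compositional construction can be sketched as follows (the layer sizes and tanh activation are illustrative; real deep networks have vastly larger parameter counts):

```python
import numpy as np
from functools import reduce

# f(x) = h_k o ... o h_1(x), each h_j a linear map followed by an activation
rng = np.random.default_rng(5)
dims = [4, 16, 16, 8, 1]                         # input dim, hidden widths, output dim

weights = [rng.normal(size=(d_in, d_out)) * d_in ** -0.5
           for d_in, d_out in zip(dims[:-1], dims[1:])]

def f(x):
    # compose h_1, ..., h_k from the inside out
    return reduce(lambda h, W: np.tanh(h @ W), weights, x)

x = rng.normal(size=(3, 4))
print(f(x).shape)                                # (3, 1)

# the parameter count grows with depth and width
n_params = sum(W.size for W in weights)
print(n_params)
```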


Recent advances in machine learning answer the question, "how can we efficiently construct a model space that fits the considered data well?"

Three approaches are competing: additive models, high-order feature mapping, and compositions.

Successes in engineering fields tell us that the complex model space induced by compositions is useful in many areas.

But regularizing the model is still a crucial task for selecting the best model.


Thank you
