
Introduction to Machine Learning

Non-linear prediction with kernels

Prof. Andreas Krause
Learning and Adaptive Systems (las.ethz.ch)

Recall: Linear classifiers

[Figure: linearly separable data, points labeled + and –, with separating hyperplane and weight vector w]

$y = \operatorname{sign}(w^\top x)$

Recall: The Perceptron problem

Solve

$w = \operatorname{argmin}_{w} \frac{1}{n} \sum_{i=1}^{n} \ell_P(w; y_i, x_i)$

where

$\ell_P(w; y_i, x_i) = \max(0, -y_i w^\top x_i)$

Optimize via Stochastic Gradient Descent
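As an illustration, here is a minimal NumPy sketch of stochastic gradient descent on the perceptron loss above; the toy data, learning rate, and iteration count are assumptions for the example, not part of the slides.

```python
import numpy as np

def perceptron_sgd(X, y, eta=0.1, n_iter=1000, seed=0):
    """Minimize (1/n) sum_i max(0, -y_i w^T x_i) via SGD.

    X: (n, d) feature matrix, y: labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iter):
        i = rng.integers(n)              # pick a data point uniformly at random
        if y[i] * w @ X[i] <= 0:         # misclassified (<= so learning starts from w = 0)
            w = w + eta * y[i] * X[i]    # subgradient step on the perceptron loss
    return w

# Usage on toy data:
# X = np.array([[1., 2.], [2., 1.], [-1., -2.], [-2., -1.]])
# y = np.array([1, 1, -1, -1])
# w = perceptron_sgd(X, y)
# y_pred = np.sign(X @ w)
```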

Solving non-linear classification tasks

How can we find non-linear classification boundaries? As in regression, we can use non-linear transformations of the feature vectors, followed by linear classification.


Recall: linear regression for polynomials

We can fit non-linear functions via linear regression, using non-linear features of our data (basis functions):

$f(x) = \sum_{i=1}^{d} w_i \phi_i(x)$

For example: polynomials (in 1-D):

$f(x) = \sum_{i=0}^{m} w_i x^i$
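As a quick sketch (the degree, the toy data, and the use of np.linalg.lstsq are illustrative assumptions), fitting such a polynomial is just linear regression on the monomial features:

```python
import numpy as np

def poly_features(x, m):
    """Map scalar inputs x to monomial features [1, x, x^2, ..., x^m]."""
    return np.vstack([x**i for i in range(m + 1)]).T   # shape (n, m+1)

# Toy 1-D data: noisy cubic
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(50)

Phi = poly_features(x, m=3)                  # non-linear features, linear model
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit of f(x) = sum_i w_i x^i
y_hat = Phi @ w
```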

Polynomials in higher dimensions

Suppose we wish to use polynomial features, but our input is higher-dimensional. We can still use monomial features. Example: the monomials in 2 variables of degree 2 are $x_1^2$, $x_1 x_2$, $x_2^2$.


Avoiding the feature explosion

Need $O(d^k)$ dimensions to represent (multivariate) polynomials of degree k on d features. Example: d = 10000, k = 2 ⇒ need ~100M dimensions.

In the following, we will see how we can efficiently and implicitly operate in such high-dimensional feature spaces (i.e., without ever explicitly computing the transformation).


Revisiting the Perceptron/SVM

Fundamental insight: Optimal hyperplane lies in the span of the data:

$w = \sum_{i=1}^{n} \alpha_i y_i x_i$

(Handwavy) proof: (Stochastic) gradient descent starting from 0 constructs such a representation.

Perceptron: $w_{t+1} = w_t + \eta_t y_t x_t \,[\,y_t w_t^\top x_t < 0\,]$

SVM: $w_{t+1} = w_t (1 - 2\lambda \eta_t) + \eta_t y_t x_t \,[\,y_t w_t^\top x_t < 1\,]$

More abstract proof: Follows from the „representer theorem“ (not discussed here).
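To make the handwavy argument concrete, here is a small NumPy sketch (toy data and step size are assumptions) that runs the perceptron updates from w = 0 while tracking coefficients α_i, and checks that w = Σ_i α_i y_i x_i holds after training:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # linearly separable toy labels
eta = 0.5

w = np.zeros(3)
alpha = np.zeros(len(X))                      # coefficients of each data point in w
for t in range(200):
    i = rng.integers(len(X))
    if y[i] * (w @ X[i]) <= 0:                # perceptron update condition
        w += eta * y[i] * X[i]
        alpha[i] += eta                       # bookkeeping: w stays in the span of the data

assert np.allclose(w, (alpha * y) @ X)        # w == sum_i alpha_i y_i x_i
```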

Reformulating the Perceptron

$\alpha = \operatorname{argmin}_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0, -\sum_{j=1}^{n} \alpha_j y_i y_j x_i^\top x_j\Big\}$

Advantage of reformulation

Key observation: The objective only depends on inner products of pairs of data points. Thus, we can implicitly work in high-dimensional spaces

$x \mapsto \phi(x)$

as long as we can compute inner products efficiently:

$x^\top x' \mapsto \phi(x)^\top \phi(x') =: k(x, x')$

„Kernels = efficient inner products“

Often, $k(x, x')$ can be computed much more efficiently than $\phi(x)^\top \phi(x')$.

Simple example: Polynomial kernel of degree 2

Polynomial kernels (degree 2)

Suppose $x = [x_1, \ldots, x_d]^\top$ and $x' = [x'_1, \ldots, x'_d]^\top$. Then

$(x^\top x')^2 = \Big(\sum_{i=1}^{d} x_i x'_i\Big)^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} (x_i x_j)(x'_i x'_j) = \phi(x)^\top \phi(x')$

where $\phi(x)$ collects all $d^2$ degree-2 monomials $x_i x_j$.
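A small NumPy check (the vectors and the explicit degree-2 feature map are illustrative assumptions) confirms that the kernel equals the inner product of the explicit monomial features:

```python
import numpy as np
from itertools import product

def phi_deg2(x):
    """Explicit degree-2 monomial features: all products x_i * x_j (d^2 of them)."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)

explicit = phi_deg2(x) @ phi_deg2(xp)   # O(d^2) work in the feature space
kernel = (x @ xp) ** 2                  # O(d) work with the kernel
assert np.isclose(explicit, kernel)
```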

Polynomial kernels: Fixed degree

The kernel $k(x, x') = (x^\top x')^m$ implicitly represents all monomials of degree m.

How can we get monomials up to order m?

Polynomial kernels

The polynomial kernel $k(x, x') = (1 + x^\top x')^m$ implicitly represents all monomials of up to degree m.

Representing the monomials (and computing the inner product explicitly) is exponential in m!

The „Kernel Trick“

Express problem s.t. it only depends on inner products.
Replace inner products $x_i^\top x_j$ by kernels $k(x_i, x_j)$.

This „trick“ is very widely applicable!

Example: Perceptron

$\alpha = \operatorname{argmin}_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0, -\sum_{j=1}^{n} \alpha_j y_i y_j x_i^\top x_j\Big\}$

becomes

$\alpha = \operatorname{argmin}_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0, -\sum_{j=1}^{n} \alpha_j y_i y_j k(x_i, x_j)\Big\}$

Will see further examples later.

Derivation: Kernelized Perceptron


Kernelized Perceptron

Training:
Initialize $\alpha_1 = \cdots = \alpha_n = 0$
For t = 1, 2, ...
  Pick data point $(x_i, y_i)$ uniformly at random
  Predict $\hat{y} = \operatorname{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x_i)\Big)$
  If $\hat{y} \neq y_i$, set $\alpha_i \leftarrow \alpha_i + \eta_t$

Prediction: For a new point $x$, predict

$y = \operatorname{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x)\Big)$
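A minimal NumPy sketch of this training loop and prediction rule (the RBF kernel choice, its bandwidth, the learning rate, and the iteration count are assumptions for illustration):

```python
import numpy as np

def rbf_kernel(A, B, h=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-||a - b||^2 / h^2) for rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / h**2)

def train_kernel_perceptron(X, y, kernel, eta=1.0, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    K = kernel(X, X)                       # precompute the Gram matrix
    alpha = np.zeros(n)
    for t in range(n_iter):
        i = rng.integers(n)                # pick a data point uniformly at random
        y_hat = np.sign(np.sum(alpha * y * K[:, i]))
        if y_hat != y[i]:                  # mistake: increase this point's coefficient
            alpha[i] += eta
    return alpha

def predict(X_train, y_train, alpha, kernel, X_new):
    K = kernel(X_train, X_new)             # k(x_j, x) for all training/new pairs
    return np.sign((alpha * y_train) @ K)
```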

Demo: Kernelized Perceptron


Questions

What are valid kernels?
How can we select a good kernel for our problem?
Can we use kernels beyond the perceptron?
Kernels work in very high-dimensional spaces. Doesn't this lead to overfitting?

Properties of kernel functions

Data space $X$. A kernel is a function $k : X \times X \to \mathbb{R}$. Can we use any function?

k must be an inner product in a suitable space ⇒ k must be symmetric!

⇒ Are there other properties that it must satisfy?

Positive semi-definite matrices

A symmetric matrix $M \in \mathbb{R}^{n \times n}$ is positive semidefinite iff $x^\top M x \geq 0$ for all $x \in \mathbb{R}^n$.

Kernels ⇒ semi-definite matrices

Data space $X$ (possibly infinite). Kernel function $k : X \times X \to \mathbb{R}$. Take any finite subset of data $S = \{x_1, \ldots, x_n\} \subseteq X$. Then the kernel (Gram) matrix

$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^\top \phi(x_1) & \cdots & \phi(x_1)^\top \phi(x_n) \\ \vdots & & \vdots \\ \phi(x_n)^\top \phi(x_1) & \cdots & \phi(x_n)^\top \phi(x_n) \end{pmatrix}$

is positive semidefinite.
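A quick numerical illustration (random data, the RBF kernel, and the eigenvalue tolerance are assumptions): build a Gram matrix and confirm its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))

# Gram matrix of the Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / h^2)
h = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / h**2)

eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so use eigvalsh
assert np.all(eigvals >= -1e-10)         # positive semidefinite up to round-off
```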

Semi-definite matrices ⇒ kernels

Suppose the data space $X = \{1, \ldots, n\}$ is finite, and we are given a positive semidefinite matrix $K \in \mathbb{R}^{n \times n}$. Then we can always construct a feature map $\phi : X \to \mathbb{R}^n$ such that $K_{i,j} = \phi(i)^\top \phi(j)$.
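One such construction is sketched below (the choice of an eigendecomposition, rather than e.g. a Cholesky factorization, is an assumption): take $\phi(i)$ to be the i-th row of $V \Lambda^{1/2}$, where $K = V \Lambda V^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
K = A @ A.T                                # any PSD matrix on X = {1, ..., n}

lam, V = np.linalg.eigh(K)                 # K = V diag(lam) V^T with lam >= 0
lam = np.clip(lam, 0, None)                # guard against tiny negative round-off
Phi = V * np.sqrt(lam)                     # row i is the feature vector phi(i)

assert np.allclose(Phi @ Phi.T, K)         # K_ij = phi(i)^T phi(j)
```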

Outlook: Mercer's Theorem

Let $X$ be a compact subset of $\mathbb{R}^n$ and $k : X \times X \to \mathbb{R}$ a kernel function. Then one can expand k in a uniformly convergent series of bounded functions $\phi_i$ s.t.

$k(x, x') = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(x) \, \phi_i(x')$

Can be generalized even further.

Definition: kernel functions

Data space $X$. A kernel is a function $k : X \times X \to \mathbb{R}$ satisfying

1) Symmetry: For any $x, x' \in X$ it must hold that $k(x, x') = k(x', x)$

2) Positive semi-definiteness: For any n, any set $S = \{x_1, \ldots, x_n\} \subseteq X$, the kernel (Gram) matrix

$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$

must be positive semi-definite.

Examples of kernels on $\mathbb{R}^d$

Linear kernel: $k(x, x') = x^\top x'$

Polynomial kernel: $k(x, x') = (x^\top x' + 1)^m$

Gaussian (RBF, squared exp.) kernel: $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$

Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|_1 / h)$
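For reference, minimal NumPy implementations of these four kernels on single vectors (the function names and default parameters are assumptions for the sketch); any of them can be plugged into the kernelized perceptron above:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, m=2):
    return (x @ xp + 1) ** m

def gaussian_kernel(x, xp, h=1.0):          # RBF / squared exponential
    return np.exp(-np.sum((x - xp)**2) / h**2)

def laplacian_kernel(x, xp, h=1.0):
    return np.exp(-np.sum(np.abs(x - xp)) / h)

# Example usage:
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(x, xp, h=2.0))
```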