# CS 446: Machine Learning


- CS 446: Machine Learning. Instructor: Gerald DeJong, [email_address], 3-0491, 3320 SC. Recent approval for a TA, to be named later.

- Office hours: after most classes and Thur @ 3

- Text: Mitchell's *Machine Learning*

- Midterm: Oct. 4

- Final: Dec. 12 (midterm, final, and homework each count for a third)

- Homeworks / projects

- Submit at the beginning of class

- Late penalty: 20% / day up to 3 days

- Programming, some in-class assignments

- Class web site soon

- Cheating: none allowed! We adopt the department's policy.

Please answer these and hand in now

- Name

- Department

- Where (if?*) you took an Intro AI course

- Who taught it (esp. if not here)

- 1) Why interested in Machine Learning?

- 2) Any topics you would like to see covered?

- * may require significant additional effort

Approx. Course Overview / Topics

- Introduction: Basic problems and questions

- A detailed example: Linear threshold units

- Basic Paradigms:

- PAC (Risk Minimization); Bayesian Theory; SRM (Structural Risk Minimization); Compression; Maximum Entropy;

- Generative/Discriminative; Classification/Skill;

- Learning Protocols

- Online/Batch; Supervised/Unsupervised/Semi-supervised; Delayed supervision

- Algorithms:

- Decision Trees (C4.5)

- [Rules and ILP (Ripper, Foil)]

- Linear Threshold Units (Winnow, Perceptron; Boosting; SVMs; Kernels)

- Probabilistic Representations (naïve Bayes, Bayesian trees; density estimation)

- Delayed supervision: RL

- Unsupervised/Semi-supervised: EM

- Clustering, Dimensionality Reduction, or others of student interest

What to Learn

- Classifiers: Learn a hidden function

- Concept Learning: chair? face? game?

- Diagnosis: medical; risk assessment

- Models: Learn a map (and use it to navigate)

- Learn a distribution (and use it to answer queries)

- Learn a language model; Learn an Automaton

- Skills:

- Learn to play games; Learn a Plan / Policy

- Learn to Reason; Learn to Plan

- Clusterings:

- Shapes of objects; Functionality; Segmentation

- Abstraction

- Focus on classification (importance, theoretical richness, generality, ...)

What to Learn?

- Direct Learning: (discriminative, model-free [a bad name])

- Learn a function that maps an input instance to the sought-after property.

- Model Learning: (indirect, generative)

- Learn a model of the domain, then use it to answer various questions about the domain

- In both cases, several protocols can be used

- Supervised: the learner is given examples and answers

- Unsupervised: examples, but no answers

- Semi-supervised: some examples with answers, others without

- Delayed supervision

Supervised Learning

- Given: Examples (x, f(x)) of some unknown function f

- Find: A good approximation to f

- x provides some representation of the input

- The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)

- $x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$

- The target function (label)

- $f(x) \in \{-1,+1\}$: Binary classification

- $f(x) \in \{1, 2, 3, \dots, k-1\}$: Multi-class classification

- $f(x) \in \mathbb{R}$: Regression

Example and Hypothesis Spaces

- X: Example Space, the set of all well-formed inputs [with a distribution]

- H: Hypothesis Space, the set of all well-formed outputs

Supervised Learning: Examples

- Disease diagnosis

- x: Properties of patient (symptoms, lab tests)

- f : Disease (or maybe: recommended therapy)

- Part-of-Speech tagging

- x: An English sentence (e.g., The can will rust)

- f : The part of speech of a word in the sentence

- Face recognition

- x: Bitmap picture of a person's face

- f : Name the person (or maybe: a property of the person)

- Automatic Steering

- x: Bitmap picture of road surface in front of car

- f : Degrees to turn the steering wheel

A Learning Problem

- Unknown function $y = f(x_1, x_2, x_3, x_4)$ (Boolean: $x_1, x_2, x_3, x_4, f$)

Training Set:

| Example | $x_1\ x_2\ x_3\ x_4$ | y |
|---|---|---|
| 1 | 0 0 1 0 | 0 |
| 2 | 0 1 0 0 | 0 |
| 3 | 0 0 1 1 | 1 |
| 4 | 1 0 0 1 | 1 |
| 5 | 0 1 1 0 | 0 |
| 6 | 1 1 0 0 | 0 |
| 7 | 0 1 0 1 | 0 |

Hypothesis Space

- Complete Ignorance:

- How many possible functions?

- $2^{16} = 65{,}536$ over four input features.

- After seven examples, how many possibilities for f?

- $2^9 = 512$ possibilities remain for f (see the sketch after the table below)

- How many examples until we figure out which is correct?

- We need to see labels for all 16 examples!

- Is Learning Possible?

All 16 possible examples; the seven training examples fix only seven labels:

| $x_1\ x_2\ x_3\ x_4$ | y |
|---|---|
| 0 0 0 0 | ? |
| 0 0 0 1 | ? |
| 0 0 1 0 | 0 |
| 0 0 1 1 | 1 |
| 0 1 0 0 | 0 |
| 0 1 0 1 | 0 |
| 0 1 1 0 | 0 |
| 0 1 1 1 | ? |
| 1 0 0 0 | ? |
| 1 0 0 1 | 1 |
| 1 0 1 0 | ? |
| 1 0 1 1 | ? |
| 1 1 0 0 | 0 |
| 1 1 0 1 | ? |
| 1 1 1 0 | ? |
| 1 1 1 1 | ? |
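A quick sanity check of the counting argument above, as a Python sketch (mine, not part of the original slides):

```python
# "Complete ignorance": a hypothesis is a full truth table over the
# 2**4 = 16 possible inputs, one output bit per input.
n_inputs = 2 ** 4                        # 16 possible input vectors
n_functions = 2 ** n_inputs              # 2**16 = 65,536 Boolean functions

# The 7 training examples pin down 7 of the 16 output bits;
# the remaining 9 bits are free, so 2**9 hypotheses stay consistent.
n_consistent = 2 ** (n_inputs - 7)

print(n_functions, n_consistent)         # 65536 512
```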

Another Hypothesis Space

- Simple Rules: There are only 16 simple conjunctive rules of the form $y = x_i \wedge x_j \wedge x_k \dots$ (counting the constant rule $y = c$).

- No simple rule explains the data. The same is true for simple clauses.

| Rule | Counterexample |
|---|---|
| y = c | 0011 (y = 1) if c = 0; 0010 (y = 0) if c = 1 |
| y = x_1 | 1100 (y = 0) |
| y = x_2 | 0100 (y = 0) |
| y = x_3 | 0110 (y = 0) |
| y = x_4 | 0101 (y = 0) |
| y = x_1 ∧ x_2 | 1100 (y = 0) |
| y = x_1 ∧ x_3 | 0011 (y = 1) |
| y = x_1 ∧ x_4 | 0011 (y = 1) |
| y = x_2 ∧ x_3 | 0011 (y = 1) |
| y = x_2 ∧ x_4 | 0011 (y = 1) |
| y = x_3 ∧ x_4 | 1001 (y = 1) |
| y = x_1 ∧ x_2 ∧ x_3 | 0011 (y = 1) |
| y = x_1 ∧ x_2 ∧ x_4 | 0011 (y = 1) |
| y = x_1 ∧ x_3 ∧ x_4 | 0011 (y = 1) |
| y = x_2 ∧ x_3 ∧ x_4 | 0011 (y = 1) |
| y = x_1 ∧ x_2 ∧ x_3 ∧ x_4 | 0011 (y = 1) |

Third Hypothesis Space

- m-of-n rules: There are 32 possible rules of the form: y = 1 if and only if at least m of the following n variables are 1

- Found a consistent hypothesis: 2-of $\{x_1, x_3, x_4\}$ (see the table and the sketch below).

| Variables | 1-of | 2-of | 3-of | 4-of |
|---|---|---|---|---|
| {x_1} | 3 | - | - | - |
| {x_2} | 2 | - | - | - |
| {x_3} | 1 | - | - | - |
| {x_4} | 7 | - | - | - |
| {x_1, x_2} | 2 | 3 | - | - |
| {x_1, x_3} | 1 | 3 | - | - |
| {x_1, x_4} | 6 | 3 | - | - |
| {x_2, x_3} | 2 | 3 | - | - |
| {x_2, x_4} | 2 | 3 | - | - |
| {x_3, x_4} | 1 | 4 | - | - |
| {x_1, x_2, x_3} | 1 | 3 | 3 | - |
| {x_1, x_2, x_4} | 2 | 3 | 3 | - |
| {x_1, x_3, x_4} | 1 | ✓ | 3 | - |
| {x_2, x_3, x_4} | 1 | 5 | 3 | - |
| {x_1, x_2, x_3, x_4} | 1 | 5 | 3 | 3 |

(Each entry gives the index of a training example that contradicts the rule; "-" means m > n does not apply; ✓ marks the consistent rule.)
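To double-check the table, here is a small Python sketch (mine, not from the slides) that enumerates all 32 m-of-n rules over $x_1, \dots, x_4$ and tests each one against the seven training examples:

```python
from itertools import combinations

# Training set from the slide: (x1, x2, x3, x4) -> y
data = [
    ((0, 0, 1, 0), 0),  # example 1
    ((0, 1, 0, 0), 0),  # example 2
    ((0, 0, 1, 1), 1),  # example 3
    ((1, 0, 0, 1), 1),  # example 4
    ((0, 1, 1, 0), 0),  # example 5
    ((1, 1, 0, 0), 0),  # example 6
    ((0, 1, 0, 1), 0),  # example 7
]

# Enumerate every m-of-n rule over the 4 variables and keep those
# consistent with all seven examples.
for k in range(1, 5):                       # size of the variable subset
    for subset in combinations(range(4), k):
        for m in range(1, k + 1):           # threshold m
            if all((sum(x[i] for i in subset) >= m) == y for x, y in data):
                vars_ = ", ".join(f"x{i+1}" for i in subset)
                print(f"{m}-of {{{vars_}}} is consistent")
# Prints only: 2-of {x1, x3, x4} is consistent
```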

Views of Learning

- Learning is the removal of our remaining uncertainty:

- Suppose we knew that the unknown function was an m-of-n Boolean function; then we could use the training data to infer which function it is.

- Learning requires guessing a good, small hypothesis class:

- We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

- We could be wrong!

- Our prior knowledge might be wrong: $y = x_4 \wedge \text{one-of}(x_1, x_3)$ is also consistent

- Our guess of the hypothesis class could be wrong

- If this is the unknown function, then we will make errors when we are given new examples and are asked to predict the value of the function

General strategy for Machine Learning

- H should respect our prior understanding:

- Excess expressivity makes learning difficult

- Expressivity of H should match our ignorance

- Understand flexibility of std. hypothesis spaces:

- Decision trees, neural networks, rule grammars, stochastic models

- Hypothesis spaces of flexible size; Nested collections of hypotheses.

- ML succeeds when these interrelate

- Develop algorithms for finding a hypothesis h that fits the data

- h will likely perform well when the richness of H is less than the information in the training set

Terminology

- Training example: A pair of the form (x, f(x))

- Target function (concept): The true function f

- Hypothesis: A proposed function h, believed to be similar to f.

- Concept: A Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances). (Sometimes used interchangeably with Hypothesis.)

- Classifier: A discrete-valued function. The possible values of f, {1, 2, ..., K}, are the classes or class labels.

- Hypothesis space: The space of all hypotheses that can, in principle, be output by the learning algorithm.

- Version Space: The space of all hypotheses in the hypothesis space that have not yet been ruled out.

Key Issues in Machine Learning

- Modeling

- How to formulate application problems as machine learning problems?

- Learning Protocols (where is the data coming from, how?)

- Project examples: [complete products]

- Given a seminar announcement, place the relevant information in my Outlook

- Given a message, place it in the appropriate folder

- Image processing:

- Given a folder with pictures, automatically rotate all those that need it.

- My office:

- Have my office greet me in the morning and unlock the door (but do it only for me!)

- Context-sensitive spelling: Incorporate into Word

Key Issues in Machine Learning

- Modeling

- How to formulate application problems as machine learning problems?

- Learning Protocols (where is the data coming from, how?)

- Representation:

- What are good hypothesis spaces?

- Any rigorous way to find these? Any general approach?

- Algorithms:

- What are good algorithms?

- How do we define success?

- Generalization vs. overfitting

- The computational problem

Example: Generalization vs Overfitting

- What is a tree?

- A botanist: A tree is something with leaves.

- Her brother: A tree is a green thing I've seen before.

- Neither will generalize well.

Self-organize into Groups of 4 or 5

- Assignment 1

- The Badges Game

- Prediction or Modeling?

- Representation

- Background Knowledge

- When did learning take place?

- Learning Protocol?

- What is the problem?

- Algorithms

Linear Discriminators

- I don't know {whether, weather} to laugh or cry

- How can we make this a learning problem?

- We will look for a function

- F: Sentences → {whether, weather}

- We need to define the domain of this function better.

- An option: For each word w in English, define a Boolean feature $x_w$:

- $[x_w = 1]$ iff w is in the sentence

- This maps a sentence to a point in $\{0,1\}^{50{,}000}$

- In this space: some points are whether points

- some are weather points (a sketch of this mapping follows)
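A minimal Python sketch of this Boolean bag-of-words mapping (the vocabulary here is a toy stand-in for the 50,000-word one; names and sentences are illustrative, not from the slides):

```python
# One Boolean feature x_w per vocabulary word w, with x_w = 1 iff w
# occurs in the sentence.
vocabulary = ["i", "don't", "know", "whether", "weather", "to", "laugh",
              "or", "cry", "the", "is", "nice", "today"]

def to_features(sentence: str) -> list[int]:
    """Map a sentence to a point in {0,1}^|vocabulary|."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

print(to_features("the weather is nice today"))
```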

Learning Protocol? Supervised? Unsupervised? What's Good?

- Learning problem :

- Find a function that best separates the data

- What function?

- What's best?

- How to find it?

- A possibility: Define the learning problem to be:

- Find a (linear) function that best separates the data

Exclusive-OR (XOR)

- $(x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$

- In general: a parity function.

- $x_i \in \{0,1\}$

- $f(x_1, x_2, \dots, x_n) = 1$ iff $\sum_i x_i$ is even

- This function is not linearly separable (a brute-force check follows).
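A small sketch (mine, not from the slides) that searches for a linear threshold separator for XOR and fails:

```python
import itertools

# Brute-force sketch: scan a grid of weights (w1, w2) and thresholds theta
# for a linear threshold unit sgn(w1*x1 + w2*x2 - theta) that computes XOR.
# None exists, illustrating that parity is not linearly separable.
# (Algebraically: x = (0,0) forces 0 <= theta, the points (1,0) and (0,1)
#  force w1 > theta and w2 > theta, and (1,1) forces w1 + w2 <= theta;
#  but then w1 + w2 > 2*theta >= theta, a contradiction.)
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [round(-2 + 0.1 * i, 1) for i in range(41)]
found = any(
    all((w1 * x1 + w2 * x2 > theta) == bool(y) for (x1, x2), y in xor.items())
    for w1, w2, theta in itertools.product(grid, repeat=3)
)
print("separator found:", found)   # separator found: False
```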

Sometimes Functions Can be Made Linear

- $x_1 x_2 x_4 \vee x_2 x_4 x_5 \vee x_1 x_3 x_7$

- Space: $X = (x_1, x_2, \dots, x_n)$

- Input transformation

- New space: $Y = \{y_1, y_2, \dots\} = \{x_i, x_i x_j, x_i x_j x_k, \dots\}$

- The new discriminator $y_3 \vee y_4 \vee y_7$ is functionally simpler

Weather Whether

- Data are not separable in one dimension

- Not separable if you insist on using a specific class of functions

(Figure: the original feature space x vs. the blown-up feature space.)

- Data are separable in the blown-up $(x, x^2)$ space

- Key issue: what features to use. Computationally, this can be done implicitly (kernels). (A sketch follows.)
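A tiny sketch of the blow-up idea with the quadratic map $x \to (x, x^2)$ on made-up one-dimensional data (the points and the threshold 2.5 are illustrative):

```python
# Points from two classes that are not separable on the line x become
# separable in (x, x**2): class 1 sits on both sides of class 0 in 1-D,
# so no single threshold on x works, but the parabola lifts the outer
# points above the inner ones.
points = [(-2.0, 1), (-1.0, 0), (0.5, 0), (2.5, 1)]   # (x, label)

for x, label in points:
    y1, y2 = x, x * x                  # the blown-up representation
    predicted = 1 if y2 > 2.5 else 0   # a line in the new space separates
    print(f"x={x:+.1f} -> ({y1:+.1f}, {y2:.2f})  predicted={predicted}  true={label}")
```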

A General Framework for Learning

- Goal: predict an unobserved output value $y \in Y$

- based on an observed input vector $x \in X$

- Estimate a functional relationship $y \sim f(x)$

- from a set $\{(x, y)_i\}_{i=1,\dots,n}$

- Most relevant: Classification: $y \in \{0,1\}$ (or $y \in \{1, 2, \dots, k\}$)

- (But within the same framework we can also talk about Regression, $y \in \mathbb{R}$)

- What do we want f(x) to satisfy?

- We want to minimize the Loss (Risk): $L(f(\cdot)) = E_{X,Y}\big([f(x) \ne y]\big)$

- where $E_{X,Y}$ denotes the expectation with respect to the true distribution.

Simply: the number of mistakes; $[\cdot]$ is an indicator function.

A General Framework for Learning (II)

- We want to minimize the Loss: $L(f(\cdot)) = E_{X,Y}\big([f(X) \ne Y]\big)$

- where $E_{X,Y}$ denotes the expectation with respect to the true distribution.

- We cannot do that. Why not?

- Instead, we try to minimize the empirical classification error.

- For a set of training examples $\{(X_i, Y_i)\}_{i=1,\dots,n}$

- Try to minimize the observed loss

- (Issue I: when is this good enough? Not now)

- This minimization problem is typically NP-hard.

- To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function

- $I(f(x), y) = [f(x) \ne y]$: 1 when $f(x) \ne y$, 0 otherwise

Learning as an Optimization Problem

- A Loss Function $L(f(x), y)$ measures the penalty incurred by a classifier $f$ on example $(x, y)$.

- There are many different loss functions one could define:

- Misclassification Error:

- $L(f(x), y) = 0$ if $f(x) = y$; 1 otherwise

- Squared Loss:

- $L(f(x), y) = (f(x) - y)^2$

- Input-dependent loss:

- $L(f(x), y) = 0$ if $f(x) = y$; $c(x)$ otherwise.

A continuous convex loss function also allows a conceptually simple optimization algorithm. (The sketch below writes out the three losses above.)
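The three losses in this list, written out as Python one-liners (a sketch; `c` stands in for whatever example-dependent cost $c(x)$ the application supplies):

```python
def misclassification_loss(fx, y):
    # 0/1 loss: penalty 1 for any mistake
    return 0 if fx == y else 1

def squared_loss(fx, y):
    # quadratic penalty, grows with the size of the error
    return (fx - y) ** 2

def input_dependent_loss(fx, y, c):
    # mistake cost c depends on the example
    return 0 if fx == y else c

print(misclassification_loss(1, 0))   # 1
print(squared_loss(0.8, 1.0))         # ~0.04
print(input_dependent_loss(1, 0, 5))  # 5
```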

How to Learn?

- Local search:

- Start with a linear threshold function.

- See how well you are doing.

- Correct

- Repeat until you converge.

- There are other ways that do not search directly in the hypothesis space

- Directly compute the hypothesis?

Learning Linear Separators (LTU)

- $f(x) = \operatorname{sgn}(x \cdot w - \theta) = \operatorname{sgn}\!\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$

- $x = (x_1, x_2, \dots, x_n) \in \{0,1\}^n$ is the feature-based encoding of the data point

- $w = (w_1, w_2, \dots, w_n) \in \mathbb{R}^n$ is the target function.

- $\theta$ determines the shift with respect to the origin

Expressivity

- $f(x) = \operatorname{sgn}(x \cdot w - \theta) = \operatorname{sgn}\!\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$

- Many functions are linear:

- Conjunctions:

- $y = x_1 \wedge x_3 \wedge x_5$

- $y = \operatorname{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 3\}$

- At least m of n:

- $y$ = at least 2 of $\{x_1, x_3, x_5\}$

- $y = \operatorname{sgn}\{1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 - 2\}$

- Many functions are not:

- XOR: $y = (x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$

- Non-trivial DNF: $y = (x_1 \wedge x_2) \vee (x_3 \wedge x_4)$

- But some can be made linear (the encodings above are checked below)

- Probabilistic classifiers as well
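A brute-force check of the two linear encodings above over all Boolean inputs (a sketch, not from the slides):

```python
from itertools import product

# Verify that the conjunction y = x1 AND x3 AND x5 equals the linear
# threshold unit sgn(x1 + x3 + x5 - 3), and that "at least 2 of
# {x1, x3, x5}" equals sgn(x1 + x3 + x5 - 2), on every Boolean input.
for x1, x3, x5 in product([0, 1], repeat=3):
    conj = x1 and x3 and x5
    ltu3 = int(x1 + x3 + x5 - 3 >= 0)   # threshold 3
    two_of = int(x1 + x3 + x5 >= 2)
    ltu2 = int(x1 + x3 + x5 - 2 >= 0)   # threshold 2
    assert conj == ltu3 and two_of == ltu2
print("both identities hold")
```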

Canonical Representation

- $f(x) = \operatorname{sgn}(x \cdot w - \theta) = \operatorname{sgn}\!\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$

- $\operatorname{sgn}(x \cdot w - \theta) \equiv \operatorname{sgn}(x' \cdot w')$

- where:

- $x' = (x, -\theta)$ and $w' = (w, 1)$

- Moved from an n-dimensional representation to an (n+1)-dimensional representation, but now we can look for hyperplanes that go through the origin (verified below).
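A worked one-line check of this augmentation (my arithmetic, following the definitions above):

```latex
% With x' = (x, -\theta) and w' = (w, 1):
\[
x' \cdot w'
  = (x_1, \dots, x_n, -\theta) \cdot (w_1, \dots, w_n, 1)
  = \sum_{i=1}^{n} w_i x_i - \theta
  = x \cdot w - \theta,
\]
\[
\text{hence } \operatorname{sgn}(x \cdot w - \theta)
  = \operatorname{sgn}(x' \cdot w'),
\text{ a hyperplane through the origin in } \mathbb{R}^{n+1}.
\]
```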

LMS: An online, local search algorithm

- A local search learning algorithm requires:

- Hypothesis Space:

- Linear Threshold Units

- Loss function:

- Squared loss

- LMS (Least Mean Squares, $L_2$)

- Search procedure:

- Gradient Descent

LMS: An online, local search algorithm

- Let $w^{(j)}$ be our current weight vector

- Our prediction on the d-th example $x$ is therefore: $o_d = w^{(j)} \cdot x$

- Let $t_d$ be the target value for this example (a real value; it represents $u \cdot x$)

- A convenient error function of the data set is: $E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

(Subscript i: vector component; superscript (j): time; d: example number.) Assumption: $x \in \mathbb{R}^n$; $u \in \mathbb{R}^n$ is the target weight vector; the target (label) is $t_d = u \cdot x$. Noise has been added, so, possibly, no weight vector is consistent with the data.

Gradient Descent

- We use gradient descent to determine the weight vector that minimizes $E(w)$;

- Fixing the set D of examples, E is a function of $w^{(j)}$

- At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

(Figure: the error surface $E(w)$ over weight space, with successive iterates $w_1, w_2, w_3, w_4$ descending toward the minimum.)

- To find the best direction in the weight space we compute the gradient of E with respect to each of the components of $w$:

- $\nabla E(w) = \left[ \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$

- This vector specifies the direction that produces the steepest increase in E;

- We want to modify $w$ in the direction of $-\nabla E(w)$:

- $w^{(j+1)} = w^{(j)} + \Delta w$, where $\Delta w = -R \, \nabla E(w)$

Gradient Descent

- We have: $E(w) = \frac{1}{2} \sum_{d} (t_d - o_d)^2$ with $o_d = w \cdot x_d$

- Therefore: $\frac{\partial E}{\partial w_i} = \sum_{d} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - w \cdot x_d) = \sum_{d} (t_d - o_d)(-x_{i,d})$


Gradient Descent: LMS

- Weight update rule: $\Delta w_i = R \sum_{d} (t_d - o_d) \, x_{i,d}$

- Gradient descent algorithm for training linear units:

- Start with an initial random weight vector.

- For every example d with target value $t_d$:

- Evaluate the linear unit: $o_d = w \cdot x_d$

- Update $w$ by adding $\Delta w_i$ to each component.

- Continue until E falls below some threshold.


- Because the surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the examples are linearly separable. (A batch-LMS sketch follows.)
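A minimal batch-LMS sketch in Python, under the noise assumption above (the synthetic data, learning rate, epoch count, and stopping threshold are illustrative choices of mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 100
u = rng.normal(size=n)                     # hidden target weight vector
X = rng.normal(size=(m, n))                # examples
t = X @ u + 0.1 * rng.normal(size=m)       # noisy targets t_d = u.x + noise

w = rng.normal(size=n)                     # initial random weight vector
R = 0.001                                  # learning rate (step size)
for epoch in range(1000):
    o = X @ w                              # evaluate the linear unit on every d
    if 0.5 * np.sum((t - o) ** 2) < 1.0:   # stop once E falls below a threshold
        break
    w += R * (X.T @ (t - o))               # batch update: Dw = R sum_d (t_d - o_d) x_d

print(np.round(w - u, 2))                  # near zero: w approximates the target u
```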


Incremental Gradient Descent: LMS

- Weight update rule (per example): $\Delta w_i = R \, (t_d - o_d) \, x_{i,d}$

- Gradient descent algorithm for training linear units:

- Start with an initial random weight vector.

- For every example d with target value $t_d$:

- Evaluate the linear unit: $o_d = w \cdot x_d$

- Update $w$ by incrementally adding $\Delta w_i$ to each component (i.e., update after each example).

- Continue until E falls below some threshold.

- In general this does not converge to the global minimum.

- Decreasing R with time guarantees convergence.

- Incremental algorithms are sometimes advantageous (a sketch follows).
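An incremental (per-example) LMS sketch in the same synthetic setting as the batch version; the decreasing schedule for R is one illustrative choice among many:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 200
u = rng.normal(size=n)
X = rng.normal(size=(m, n))
t = X @ u + 0.1 * rng.normal(size=m)

w = np.zeros(n)
step = 0
for epoch in range(50):
    for x_d, t_d in zip(X, t):
        step += 1
        R = 1.0 / (10.0 + step)        # learning rate decreasing with time
        o_d = x_d @ w                  # evaluate the linear unit on example d
        w += R * (t_d - o_d) * x_d     # update on this single example only

print(np.round(w - u, 2))              # near zero after enough passes
```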

Learning Rates and Convergence

- In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence. It cannot decrease too quickly nor too slowly.

- The learning rate is called the step size. There are more sophisticated algorithms (Conjugate Gradient) that choose the step size automatically and converge faster.

- There is only one basin for linear threshold units, so a local minimum is the global minimum. However, choosing a good starting point can make the algorithm converge much faster.

Computational Issues

Assume the data is linearly separable.

Sample complexity: Suppose we want to ensure that our LTU has an error rate (on new examples) of less than $\epsilon$ with high probability (at least $1 - \delta$). How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems $m = O\!\left(\frac{1}{\epsilon}\left[\ln\frac{1}{\delta} + (n+1)\ln\frac{1}{\epsilon}\right]\right)$ (evaluated numerically below).

Computational complexity: What can be said? It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU (by reduction from linear programming). (Online algorithms have inverse quadratic dependence on the margin.)
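Treating the $O(\cdot)$ as an exact constant of 1, purely for illustration, the bound can be evaluated numerically (this calculator is a sketch of mine, not from the slides):

```python
from math import ceil, log

# m = (1/eps) * (ln(1/delta) + (n + 1) * ln(1/eps)),
# with the hidden constant in the O(.) taken to be 1.
def ltu_sample_bound(eps: float, delta: float, n: int) -> int:
    return ceil((1 / eps) * (log(1 / delta) + (n + 1) * log(1 / eps)))

# e.g. 100 features, 5% error, 95% confidence:
print(ltu_sample_bound(eps=0.05, delta=0.05, n=100))  # about 6100 examples
```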

Other methods for LTUs

- Direct Computation:

- Set the gradient $\nabla J(w) = 0$ and solve for $w$. Can be accomplished using SVD methods.

- Fisher Linear Discriminant:

- A direct computation method.

- Probabilistic methods (naive Bayes):

- Produces a stochastic classifier that can be viewed as a linear threshold unit.

- Winnow:

- A multiplicative update algorithm with the property that it can handle large numbers of irrelevant attributes.

Summary of LMS algorithms for LTUs

- Local search: Begins with an initial weight vector, and modifies it iteratively to minimize an error function. The error function is loosely related to the goal of minimizing the number of classification errors.

- Memory: The classifier is constructed from the training examples. The examples can then be discarded.

- Online or Batch: Both online and batch variants of the algorithms can be used.

Fisher Linear Discriminant

- This is a classical method for discriminant analysis.

- It is based on dimensionality reduction: finding a better representation for the data.

- Notice that just finding good representations for the data may not always be good for discrimination. [E.g., O, Q]

- Intuition:

- Consider projecting data from d dimensions onto a line.

- This likely results in a mixed set of points and poor separation.

- However, by moving the line around we might be able to find an orientation for which the projected samples are well separated.

Fisher Linear Discriminant

- Sample $S = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^d$

- P, N are the positive, negative examples, resp.

- Let $w \in \mathbb{R}^d$, and assume $\|w\| = 1$. Then:

- The projection of a vector $x$ on a line in the direction $w$ is $w^t x$.

- If the data is linearly separable, there exists a good direction $w$.

(All vectors are column vectors.)

Finding a Good Direction

- Sample mean (positive P; negative N):

- $M_P = \frac{1}{|P|} \sum_{P} x_i$

- The mean of the projected (positive, negative) points

- $m_P = \frac{1}{|P|} \sum_{P} w^t x_i = \frac{1}{|P|} \sum_{P} y_i = w^t M_P$

- is simply the projection of the sample mean.

- Therefore, the distance between the projected means is:

- $|m_P - m_N| = |w^t (M_P - M_N)|$

Want a large difference.

Finding a Good Direction (2)

- Scaling $w$ isn't the solution. We want the difference to be large relative to some measure of standard deviation for each class.

- $S_P^2 = \sum_{P} (y - m_P)^2 \qquad S_N^2 = \sum_{N} (y - m_N)^2$

- $\frac{1}{n}(S_P^2 + S_N^2)$ is the within-class scatter: it estimates the variance of the sample.

- The Fisher linear discriminant employs the linear function $w^t x$ for which

- $J(w) = \frac{|m_P - m_N|^2}{S_P^2 + S_N^2}$

- is maximized.

- How to make this a classifier?

- How to find the optimal w?

- Some Algebra

J as an explicit function of w (1)

- Compute the scatter matrices:

- $S_P = \sum_{P} (x - M_P)(x - M_P)^t \qquad S_N = \sum_{N} (x - M_N)(x - M_N)^t$

- and

- $S_W = S_P + S_N$

- We can write:

- $S_P^2 = \sum_{P} (y - m_P)^2 = \sum_{P} (w^t x - w^t M_P)^2 = \sum_{P} w^t (x - M_P)(x - M_P)^t w = w^t S_P w$

- Therefore:

- $S_P^2 + S_N^2 = w^t S_W w$

- $S_W$ is the within-class scatter matrix. It is proportional to the sample covariance matrix for the d-dimensional sample.

J as an explicit function of w (2)

- We can do a similar computation for the means:

- $S_B = (M_P - M_N)(M_P - M_N)^t$

- and we can write:

- $(m_P - m_N)^2 = (w^t M_P - w^t M_N)^2 = w^t (M_P - M_N)(M_P - M_N)^t w$

- Therefore:

- $(m_P - m_N)^2 = w^t S_B w$

- $S_B$ is the between-class scatter matrix. It is the outer product of two vectors, and therefore its rank is at most 1.

- $S_B w$ is always in the direction of $(M_P - M_N)$.

J as an explicit function of w (3)

- Now we can compute $J$ explicitly:

- $J(w) = \frac{|m_P - m_N|^2}{S_P^2 + S_N^2} = \frac{w^t S_B w}{w^t S_W w}$

- We are looking for the value of $w$ that maximizes this expression.

- This is a generalized eigenvalue problem; when $S_W$ is nonsingular, it is just an eigenvalue problem. The solution can be written without solving the problem, as:

- $w = S_W^{-1} (M_P - M_N)$

- This is the Fisher Linear Discriminant.

- 1: We converted a d-dimensional problem to a 1-dimensional problem and suggested a solution that makes some sense.

- 2: We have a solution that makes sense; how do we make it a classifier? And how good is it? (A sketch follows.)
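A direct numpy sketch of this recipe on synthetic Gaussian data (the data, the midway threshold, and all names are illustrative choices of mine, not from the slides):

```python
import numpy as np

# Fisher linear discriminant: w = S_W^{-1} (M_P - M_N)
rng = np.random.default_rng(2)
P = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))    # positives
N = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(100, 2))   # negatives

M_P, M_N = P.mean(axis=0), N.mean(axis=0)                   # sample means
S_W = ((P - M_P).T @ (P - M_P)) + ((N - M_N).T @ (N - M_N)) # within-class scatter
w = np.linalg.solve(S_W, M_P - M_N)                         # Fisher direction

# One simple way to turn the direction into a classifier: project on w
# and threshold midway between the projected means.
theta = (w @ M_P + w @ M_N) / 2
accuracy = (np.sum(P @ w > theta) + np.sum(N @ w <= theta)) / 200
print(f"training accuracy: {accuracy:.2f}")
```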

Fisher Linear Discriminant - Summary

- It turns out that both problems can be solved if we make assumptions. E.g., suppose the data consist of two classes of points, generated according to normal distributions with the same covariance. Then:

- The solution is optimal.

- Classification can be done by choosing a threshold, which can be computed.

- Is this satisfactory?

Introduction - Summary

- We introduced the technical part of the class by giving two examples of (very different) approaches to linear discrimination.

- There are many other solutions.

- Question 1: But this assumes that we are linear. Can we learn a function that is more flexible in terms of what it does with the feature space?

- Question 2: Can we say something about the quality of what we learn (sample complexity, time complexity, quality)?