STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2019
Prof. Allie Fletcher
Lecture 2
People: Prof. Allie Fletcher. TA: Ruiqi Gao [email protected]
Where: MW 3:30-4:45pm, Public Affairs Bldg 2238
Grading:
C261: Midterm 20%, Final 35%, HW and labs 25%, Quizzes & Participation 10%, Project 10%
C161: Midterm 20%, Final 35%, HW and labs 35%, Quizzes & Participation 10%
Project is for graduate students only (see below). Homework will include programming assignments. Midterm tentatively May 8. Midterm and final are closed book; an equation sheet is provided.
Course Admin
Decision Theory: Classification, Maximum Likelihood and Log Likelihood; MAP Estimation, Bayes Risk; Probability of Errors, ROC
Empirical Risk Minimization: Problems with Decision Theory, Empirical Risk Minimization; Probably Approximately Correct Learning
Curse of Dimensionality
Parameter Estimation: Probabilistic Models for Supervised and Unsupervised Learning; ML and MAP Estimation; Examples
Outline
How do we make decisions in the presence of uncertainty? History: prominent in WWII (radar for detecting aircraft, codebreaking, decryption)
Observed data $x \in \mathcal{X}$, state $y \in \mathcal{Y}$; $p(x|y)$: conditional distribution
For each class, a model of how the data is generated. Example: $y \in \{0,1\}$ (salmon vs. sea bass, or airplane vs. bird, etc.), $x$: length of fish
Classification
General classification problem: assume each sample belongs to one of $K$ classes. Observe data $\mathbf{x}$ on the sample. Want to estimate the class label $y = 0, 1, \ldots, K-1$. E.g. dog/cat, spam/real, …
Strong assumption needed for decision theory: given each class label $y_k$, we know the conditional distribution $p(\mathbf{x}|y_k)$, a model of how the data is generated. We will discuss how we learn this density later…
Classification
Which fish type is more likely given the observed fish length $x$?
If $p(x|y=1) > p(x|y=0)$, guess sea bass; otherwise classify the fish as salmon. $p(x|y)$ is called the likelihood of $x$ given class $y$. Select the class with the highest likelihood: $\hat{y} = \arg\max_y p(x|y)$
Likelihood ratio test (LRT): if $\dfrac{p(x|y=1)}{p(x|y=0)} > 1$, guess sea bass
Maximum Likelihood (ML) Decision
ML classification: $\hat{y} = \arg\max_y p(x|y)$
Binary case: $\hat{y} = 1$ if $p(x|1) > p(x|0)$, and $\hat{y} = 0$ if $p(x|1) \le p(x|0)$
For the density on the right, we get a thresholding decision rule in terms of $x$:
$\hat{y} = 1$ if $x > t$, $\hat{y} = 0$ if $x \le t$
where $t$ is the threshold value at which $p(t|1) = p(t|0)$
ML Classification
[Figure: class-conditional densities with threshold $t$; select $\hat{y}=1$ if $x > t$, $\hat{y}=0$ if $x \le t$]
With likelihoods, it is often easier to work in the log domain. Consider binary classification: $y \in \{0,1\}$. Define the log likelihood ratio:
$L(x) := \ln \dfrac{p(x|y=1)}{p(x|y=0)}$
ML estimation = likelihood ratio test (LRT):
$\hat{y} = 1$ if $L(x) > 0$, $\hat{y} = 0$ if $L(x) \le 0$
What do we do at the boundary? When $L(x) = 0$, we can select either class: flip a coin, always select $y=0$, always select $y=1$, … It doesn't really matter; if $x$ is continuous, the probability that $L(x) = 0$ exactly is zero
Likelihood Ratio
[Figure: log likelihood ratio $L(x)$ with decision regions $L(x) > 0$ and $L(x) < 0$, and boundary $L(x) = 0$]
Classic Iris dataset used for teaching machine learning. Get data $\mathbf{x} = [x_1, x_2, x_3, x_4]$ for 4 features: sepal length, sepal width, petal length, petal width. 150 samples total, 50 samples from each class
Class label $y \in \{0,1,2\}$ for versicolor, setosa, virginica. Problem: learn a classifier for the type of Iris ($y$) from the data $\mathbf{x}$
Example: Iris Classification
To make this example simple, assume for now that we classify using only one feature, $x$ = sepal width (cm), and select between two classes: versicolor ($y=0$) and setosa ($y=1$)
Also, assume we are given two densities, $p(x|y=0)$ and $p(x|y=1)$, and that they are conditionally Gaussian: $p(x|y=i) = \mathcal{N}(x\,|\,\mu_i, \sigma_i^2)$. The densities represent the conditional density of sepal width given the class. We will talk about how we get these densities from data later…
Example: Decision Theory for Iris Classification
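A minimal sketch (not from the lecture) of this setup: fit the two class-conditional Gaussians to sepal width and apply the ML rule. It assumes scikit-learn's bundled Iris data; note that scikit-learn encodes setosa as 0 and versicolor as 1, so we relabel to match the slide's convention.

```python
# Sketch: estimate p(x|y) for sepal width and classify by maximum likelihood.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sepal_width = iris.data[:, 1]          # feature x = sepal width (cm)
# scikit-learn labels: 0 = setosa, 1 = versicolor, 2 = virginica.
# Relabel to the slide's convention: y = 0 versicolor, y = 1 setosa.
x0 = sepal_width[iris.target == 1]     # versicolor -> y = 0
x1 = sepal_width[iris.target == 0]     # setosa     -> y = 1

# Fit p(x|y=i) = N(x | mu_i, sigma_i^2) by the sample mean and variance.
mu0, sig0 = x0.mean(), x0.std()
mu1, sig1 = x1.mean(), x1.std()

def log_likelihood(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - (x - mu)**2 / (2 * sig**2)

def ml_classify(x):
    # Select the class with the higher likelihood p(x|y).
    return int(log_likelihood(x, mu1, sig1) > log_likelihood(x, mu0, sig0))

print(mu0, sig0, mu1, sig1)
print(ml_classify(3.0), ml_classify(3.8))
```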
Decision theory requires that we know $p(x|y)$. This is a big assumption! $p(x|y)$ is called the population likelihood; it describes the theoretical distribution of all samples
But in most real problems we have only data samples $(x_i, y_i)$. Ex: in the Iris dataset, we have 50 samples per class
To use decision theory, we could estimate a density $p(x|y=k)$ for each $k$ from the samples. Ex: could assume $p(x|y)$ is Gaussian and estimate the mean and variance from the samples
Later, we will talk about how to do density estimation, and whether density estimation + decision theory is a good idea
How do we get $p(x|y)$?
Histograms for the two Iris classes; also plotted is a Gaussian with the same mean and variance
Consider binary classification: $y \in \{0,1\}$, $p(x|y=i) = \mathcal{N}(x\,|\,\mu_i, \sigma^2)$, $\mu_1 > \mu_0$: two Gaussians with the same variance
Likelihood: $p(x|y=i) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{1}{2\sigma^2}(x-\mu_i)^2\right)$
$L(x) := \ln\dfrac{p(x|1)}{p(x|0)} = -\dfrac{1}{2\sigma^2}\left[(x-\mu_1)^2 - (x-\mu_0)^2\right]$
With some algebra: $L(x) = \dfrac{\mu_1-\mu_0}{\sigma^2}(x - \bar{x})$, $\quad \bar{x} = \dfrac{\mu_0+\mu_1}{2}$
ML estimate: $\hat{y} = 1 \iff L(x) \ge 0 \iff x \ge \bar{x}$, i.e. $\hat{y} = 1$ if $x > \bar{x}$ and $\hat{y} = 0$ if $x \le \bar{x}$
Example Problem: ML for Two Gaussians, Different Means
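A small sketch of this rule; the parameter values below are made up for illustration, not from the lecture.

```python
# Sketch: ML rule for two Gaussians with equal variance reduces to
# thresholding at xbar = (mu0 + mu1) / 2.
import numpy as np

mu0, mu1, sigma = -1.0, 2.0, 1.5        # assumed parameters, mu1 > mu0
xbar = (mu0 + mu1) / 2                  # threshold where p(x|1) = p(x|0)

def L(x):
    # Log likelihood ratio L(x) = (mu1 - mu0)/sigma^2 * (x - xbar)
    return (mu1 - mu0) / sigma**2 * (x - xbar)

def ml_decision(x):
    return 1 if L(x) >= 0 else 0        # equivalently: x >= xbar

for x in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    print(x, L(x), ml_decision(x))
```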
Consider binary classification: $y \in \{0,1\}$, $p(x|y=i) = \mathcal{N}(x\,|\,0, \sigma_i^2)$, $\sigma_0 < \sigma_1$: two Gaussians with different variances, zero mean
Likelihood: $p(x|y=i) = \dfrac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left(-\dfrac{x^2}{2\sigma_i^2}\right)$
Log likelihood ratio: $L(x) := \ln\dfrac{p(x|1)}{p(x|0)} = \dfrac{x^2}{2\sigma_0^2} - \dfrac{x^2}{2\sigma_1^2} + \dfrac{1}{2}\ln\dfrac{\sigma_0^2}{\sigma_1^2}$
ML estimate: $\hat{y} = 1 \iff L(x) \ge 0 \iff |x| > t$
The threshold satisfies $t^2 = \left(\dfrac{1}{\sigma_0^2} - \dfrac{1}{\sigma_1^2}\right)^{-1}\ln\dfrac{\sigma_1^2}{\sigma_0^2}$
Example 2: ML for Two Gaussians, Different Variances
[Figure: decision regions for the two-variance example: $\hat{y}=1$ for $x < -t$ or $x > t$, $\hat{y}=0$ for $-t \le x \le t$]
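A small sketch of this two-variance case; the standard deviations below are illustrative assumptions.

```python
# Sketch: ML rule for two zero-mean Gaussians with sigma0 < sigma1.
# The decision regions are |x| > t.
import numpy as np

sig0, sig1 = 1.0, 3.0                   # assumed standard deviations
t = np.sqrt(np.log(sig1**2 / sig0**2) / (1/sig0**2 - 1/sig1**2))

def L(x):
    # Log likelihood ratio for zero means and different variances.
    return x**2 / (2*sig0**2) - x**2 / (2*sig1**2) + 0.5 * np.log(sig0**2 / sig1**2)

def ml_decision(x):
    return 1 if L(x) >= 0 else 0        # equivalently: |x| > t

print("threshold t =", t)
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(x, ml_decision(x))
```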
Decision Theory: Classification, Maximum Likelihood and Log Likelihood; MAP Estimation, Bayes Risk; Probability of Errors, ROC
Empirical Risk Minimization: Problems with Decision Theory, Empirical Risk Minimization; Probably Approximately Correct Learning
Curse of Dimensionality
Parameter Estimation: Probabilistic Models for Supervised and Unsupervised Learning; ML and MAP Estimation; Examples
Outline
What if one class is more likely than the other? Introduce prior probabilities $p(y=0)$ and $p(y=1)$. Salmon more likely than sea bass: $p(y=0) > p(y=1)$
Bayes' rule: $p(y|x) = \dfrac{p(x|y)\,p(y)}{p(x)}$
We are then interested in the class with the highest posterior probability $p(y|x)$. Including the prior probabilities:
If $p(y=0|x) > p(y=1|x)$, guess salmon; otherwise, pick sea bass
We can write $p(y=0|x) = \dfrac{p(x|y=0)\,p(y=0)}{p(x)}$, $\quad p(y=1|x) = \dfrac{p(x|y=1)\,p(y=1)}{p(x)}$
MAP classification
Including the prior probabilities: if $p(y=0|x) > p(y=1|x)$, guess salmon; otherwise, pick sea bass
Maximum a posteriori (MAP) estimation: $\hat{y}_{\mathrm{MAP}} = \alpha(x) = \arg\max_y p(y|x) = \arg\max_y p(x|y)\,p(y)$
Select the class with the highest posterior probability $p(y|x)$. Binary case: select $\hat{y}_{\mathrm{MAP}} = 1$ if $p(y=1|x) > p(y=0|x)$
From Bayes: $p(y=0|x) = \dfrac{p(x|y=0)\,p(y=0)}{p(x)}$, $\quad p(y=1|x) = \dfrac{p(x|y=1)\,p(y=1)}{p(x)}$
So we select class 1 if $\dfrac{p(x|y=1)}{p(x|y=0)}\,\dfrac{p(y=1)}{p(y=0)} \ge 1$
MAP classification
Consider the binary case: $y \in \{0,1\}$
MAP estimate: select $\hat{y} = 1 \iff \dfrac{p(x|y=1)}{p(x|y=0)}\,\dfrac{p(y=1)}{p(y=0)} \ge 1 \iff \dfrac{p(x|y=1)}{p(x|y=0)} \ge \dfrac{p(y=0)}{p(y=1)}$
Log domain: select $\hat{y} = 1$ when $\ln\dfrac{p(x|y=1)}{p(x|y=0)} \ge \ln\dfrac{p(y=0)}{p(y=1)} \iff L(x) \ge \gamma$
$L(x) = \ln\dfrac{p(x|y=1)}{p(x|y=0)}$ is the log likelihood ratio; $\gamma = \ln\dfrac{p(y=0)}{p(y=1)}$ is the threshold on the likelihood ratio
In the special case $p(y=1) = p(y=0) = \frac{1}{2}$, the threshold is $\gamma = 0$ and the MAP estimate becomes identical to the ML estimate. Note that you can solve this to express the rule as a threshold on $x$, which we denote $t$
MAP Estimation via LRT
Consider binary classification: $y \in \{0,1\}$, $p(x|y=i) = \mathcal{N}(x\,|\,\mu_i, \sigma^2)$, $\mu_1 > \mu_0$, with priors $p_i = p(y=i)$
The log likelihood ratio is $L(x) = \ln\dfrac{p(x|1)}{p(x|0)} = \dfrac{(\mu_1-\mu_0)(x-\bar{x})}{\sigma^2}$, $\quad \bar{x} = \dfrac{\mu_0+\mu_1}{2}$
MAP estimate: let $\gamma = \ln\dfrac{p_0}{p_1}$. Then $\hat{y} = 1 \iff L(x) \ge \gamma \iff x \ge \bar{x} + \dfrac{\sigma^2\gamma}{\mu_1-\mu_0}$
The threshold is shifted by the prior through $\gamma$: if $p(y=1) > p(y=0)$ then $\gamma < 0$, the threshold $t$ shifts to the left, and the estimator is more likely to select $\hat{y} = 1$
Example: MAP for Two Gaussians, Different Means
[Figure: MAP thresholds $t$ for $p(y=1) = 0.5$, $0.8$, and $0.2$]
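A small sketch of the prior-shifted threshold; the means, variance, and priors below are illustrative assumptions.

```python
# Sketch: MAP rule for two equal-variance Gaussians with unequal priors.
# The prior shifts the threshold t away from xbar.
import numpy as np

mu0, mu1, sigma = -1.0, 2.0, 1.5
xbar = (mu0 + mu1) / 2

def map_threshold(p1):
    # gamma = ln(p0/p1); t = xbar + sigma^2 * gamma / (mu1 - mu0)
    p0 = 1.0 - p1
    gamma = np.log(p0 / p1)
    return xbar + sigma**2 * gamma / (mu1 - mu0)

for p1 in [0.2, 0.5, 0.8]:
    print(f"p(y=1) = {p1}: threshold t = {map_threshold(p1):.3f}")
# Larger p(y=1) gives gamma < 0, so t moves left and y_hat = 1 is chosen more often.
```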
Two possible hypotheses for the data: $H_0$: null hypothesis, $y = 0$; $H_1$: alternative hypothesis, $y = 1$
Model statistically: $p(x|H_i)$, $i = 0,1$. Assume some distribution for each hypothesis
Given the likelihoods $p(x|H_i)$, $i = 0,1$, and the prior probabilities $p_i = P(H_i)$,
compute the posterior $P(H_i|x)$: how likely is $H_i$ given the data and prior knowledge?
Bayes' rule:
$P(H_i|x) = \dfrac{p(x|H_i)\,p_i}{p(x)} = \dfrac{p(x|H_i)\,p_i}{p(x|H_0)\,p_0 + p(x|H_1)\,p_1}$
This setup is often written more formally as hypothesis testing
Hypothesis Testing
Probability of error: $P_{\mathrm{err}} = P(\hat{H} \ne H) = P(\hat{H}=0\,|\,H_1)\,p_1 + P(\hat{H}=1\,|\,H_0)\,p_0$
Written as an integral: $P(\hat{H} \ne H) = \int p(x)\,P(\hat{H} \ne H\,|\,x)\,dx$
It can be shown (you won't have to) that the error is minimized by the MAP estimator: $\hat{H} = 1 \iff P(H_1|x) \ge P(H_0|x)$
Key takeaway: the MAP estimator minimizes the probability of error
MAP: Minimum Probability of Error
What does it cost to make a mistake? A plane with a missile, not a big bird? Define a loss or cost:
$L(\alpha(x), y)$: cost of decision $\alpha(x)$ when the state is $y$
also often denoted $C_{ij}$
Making it more interesting: full-on Bayes

                   $y = 0$                       $y = 1$
$\alpha(x) = 0$    correct, cost $L(0,0)$        incorrect, cost $L(0,1)$
$\alpha(x) = 1$    incorrect, cost $L(1,0)$      correct, cost $L(1,1)$
Classic: Pascal's wager
So now we have: the likelihood functions $p(x|y)$, priors $p(y)$, a decision rule $\alpha(x)$, and a loss function $L(\alpha(x), y)$. Risk is expected loss:
$E[L] = L(0,0)\,P(\alpha(x)=0, y=0) + L(0,1)\,P(\alpha(x)=0, y=1) + L(1,0)\,P(\alpha(x)=1, y=0) + L(1,1)\,P(\alpha(x)=1, y=1)$
Without loss of generality, assume zero cost for correct decisions:
$E[L] = L(1,0)\,P(\alpha(x)=1\,|\,y=0)\,p(y=0) + L(0,1)\,P(\alpha(x)=0\,|\,y=1)\,p(y=1)$
Bayes decision theory says: pick the decision rule $\alpha(x)$ to minimize the risk
Risk Minimization
As before, express the risk as an integral over $x$:
$R = \sum_{i}\sum_{j} C_{ij} \int p(y=j\,|\,x)\,\mathbf{1}\{\hat{y}(x)=i\}\,p(x)\,dx$
To minimize, select $\hat{y} = 1$ when $C_{10}\,p(y=0|x) + C_{11}\,p(y=1|x) \le C_{00}\,p(y=0|x) + C_{01}\,p(y=1|x)$, i.e.
$\dfrac{p(y=1|x)}{p(y=0|x)} \ge \dfrac{C_{10} - C_{00}}{C_{01} - C_{11}}$
By Bayes' theorem, this is equivalent to an LRT:
$\dfrac{p(x|y=1)}{p(x|y=0)} \ge \dfrac{(C_{10} - C_{00})\,p_0}{(C_{01} - C_{11})\,p_1}$
Bayes Risk Minimization
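A small sketch of the Bayes-risk LRT; the costs, priors, and Gaussian parameters below are illustrative assumptions, not from the lecture.

```python
# Sketch: Bayes-risk decision via the LRT with threshold
# (C10 - C00) p0 / ((C01 - C11) p1).
import numpy as np

C00, C11 = 0.0, 0.0          # no cost for correct decisions
C10, C01 = 1.0, 10.0         # false alarm cheap, missed detection expensive
p0, p1 = 0.9, 0.1
mu0, mu1, sigma = -1.0, 2.0, 1.5

def likelihood(x, mu):
    return np.exp(-(x - mu)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)

def bayes_decision(x):
    lrt_threshold = (C10 - C00) * p0 / ((C01 - C11) * p1)
    return 1 if likelihood(x, mu1) / likelihood(x, mu0) >= lrt_threshold else 0

for x in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(x, bayes_decision(x))
```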
Decision Theory: Classification, Maximum Likelihood and Log Likelihood; MAP Estimation, Bayes Risk; Probability of Errors, ROC
Empirical Risk Minimization: Problems with Decision Theory, Empirical Risk Minimization; Probably Approximately Correct Learning
Curse of Dimensionality
Parameter Estimation: Probabilistic Models for Supervised and Unsupervised Learning; ML and MAP Estimation; Examples
Outline
How do we compute errors?
Suppose the decision rule is of the form: $\hat{y} = 1$ if $d(x) > t$, $\hat{y} = 0$ if $d(x) \le t$
$d(x)$ is called the discriminator and $t$ is the threshold
Ex: decision rule for scalar Gaussians:
$\hat{y} = 1$ if $x > t$, $\hat{y} = 0$ if $x \le t$
This uses the linear discriminator $d(x) = x$. The threshold $t$ will depend on the estimator type
(ML, MAP, Bayes risk, …)
We will compute the error as a function of $t$
Computing Error Probabilities
Consider the binary case: $y \in \{0,1\}$. Two possible errors: Type I error (false alarm or false positive): decide $\hat{y}=1$ when $y=0$. Type II error (missed detection or false negative): decide $\hat{y}=0$ when $y=1$
The effect of the errors may be very different
Example: disease diagnosis: $y = 1$ patient has the disease, $y = 0$ patient is healthy. Type I error: you say the patient is sick when the patient is healthy
This error can cause extra unnecessary tests, stress to the patient, etc. Type II error: you say the patient is fine when the patient is sick
This error can miss the disease, and the disease could progress, …
Types of Errors
Type I error (false alarm or false positive): decide $H_1$ when $H_0$ is true. Type II error (missed detection or false negative): decide $H_0$ when $H_1$ is true. There is a trade-off between the two; we can work out the error probabilities from the conditional probabilities
Visualizing Errors
Scalar Gaussian: for $i = 0,1$: $p(x|y=i) = \mathcal{N}(x\,|\,\mu_i, \sigma^2)$, $\mu_1 > \mu_0$
False alarm: $P_{FA} = P(\hat{y}=1\,|\,y=0) = P(x \ge t\,|\,y=0)$
This is the area under the curve: $P_{FA} = \int_t^{\infty} p(x|y=0)\,dx$
But we can compute this using Gaussians: given $y=0$, $x \sim \mathcal{N}(\mu_0, \sigma^2)$
Therefore: $P_{FA} = P(x \ge t\,|\,y=0) = Q\!\left(\dfrac{t-\mu_0}{\sigma}\right)$
Scalar Gaussian Example: False Alarm
Scalar Gaussian: for $i = 0,1$: $p(x|y=i) = \mathcal{N}(x\,|\,\mu_i, \sigma^2)$, $\mu_1 > \mu_0$
Missed detection can be computed similarly: $P_{MD} = P(\hat{y}=0\,|\,y=1) = P(x \le t\,|\,y=1)$. This is the area under the curve. But we can compute this using Gaussians: given $y=1$, $x \sim \mathcal{N}(\mu_1, \sigma^2)$
Therefore: $P_{MD} = P(x \le t\,|\,y=1) = 1 - Q\!\left(\dfrac{t-\mu_1}{\sigma}\right)$
Scalar Gaussian Example: Missed Detection
Problem: suppose $X \sim \mathcal{N}(\mu, \sigma^2)$. We often must compute probabilities like $P(X \ge t)$, which has no closed-form expression
Define the Gaussian Q-function: $Q(z) = P(Z \ge z)$, $Z \sim \mathcal{N}(0,1)$
Let $Z = (X - \mu)/\sigma$
Then
$P(X \ge t) = P\!\left(Z \ge \dfrac{t-\mu}{\sigma}\right) = Q\!\left(\dfrac{t-\mu}{\sigma}\right)$
Review: Gaussian Q-Function
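A small sketch of these computations using SciPy's complementary error function; the values of $\mu_0$, $\mu_1$, $\sigma$, and $t$ are illustrative assumptions.

```python
# Sketch: the Gaussian Q-function and the resulting P_FA and P_MD
# for the scalar Gaussian example.
import numpy as np
from scipy.special import erfc

def Q(z):
    # Q(z) = P(Z >= z) for Z ~ N(0,1) = 0.5 * erfc(z / sqrt(2))
    return 0.5 * erfc(z / np.sqrt(2))

mu0, mu1, sigma = -1.0, 2.0, 1.5
t = (mu0 + mu1) / 2

P_FA = Q((t - mu0) / sigma)        # P(x >= t | y = 0)
P_MD = 1 - Q((t - mu1) / sigma)    # P(x <= t | y = 1)
print(P_FA, P_MD)
```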
We see that there is a tradeoff: increasing the threshold $t$ decreases $P_{FA}$, but increasing $t$ also increases $P_{MD}$
Which threshold value we select depends on their relative costs: what is the effect of a FA vs. an MD? Consider the medical diagnosis case
FA vs. MD Tradeoff
Receiver operating characteristic (ROC) curve: for each threshold level $t$, compute $P_D(t) = 1 - P_{MD}(t)$ and $P_{FA}(t)$, then plot $P_D(t)$ vs. $P_{FA}(t)$. This shows how large the detection probability can be for a given $P_{FA}$. The name "ROC" comes from communications receivers, where these curves were first used
Comparing ROC curves: a higher curve is better. Random guessing gives the red line (guess $\hat{y}=1$ with probability $t$), so any decent estimator should be above the red line
ROC Curve
[Figure: ROC curves for the threshold estimator and for random guessing]
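A small sketch of tracing out the ROC curve by sweeping the threshold; the Gaussian parameters are illustrative assumptions.

```python
# Sketch: ROC curve for the scalar Gaussian example by sweeping t.
import numpy as np
from scipy.special import erfc

def Q(z):
    return 0.5 * erfc(z / np.sqrt(2))

mu0, mu1, sigma = -1.0, 2.0, 1.5
ts = np.linspace(-6, 8, 200)
P_FA = Q((ts - mu0) / sigma)       # false-alarm probability at each t
P_D = Q((ts - mu1) / sigma)        # detection probability = 1 - P_MD

# Plot P_D against P_FA; the diagonal P_D = P_FA is the random-guess line.
# import matplotlib.pyplot as plt
# plt.plot(P_FA, P_D); plt.plot([0, 1], [0, 1], 'r--'); plt.show()
for i in range(0, 200, 40):
    print(f"t = {ts[i]:5.2f}  P_FA = {P_FA[i]:.3f}  P_D = {P_D[i]:.3f}")
```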
Often we have multiple classes: $y \in \{1, \ldots, K\}$. Most methods extend easily:
ML: take the max of the $K$ likelihoods: $\hat{y} = \arg\max_{k=1,\ldots,K} p(x|y=k)$
MAP: take the max of the $K$ posteriors: $\hat{y} = \arg\max_{k=1,\ldots,K} p(y=k|x) = \arg\max_{k=1,\ldots,K} p(x|y=k)\,p(y=k)$
LRT: take the max of $K$ weighted likelihoods: $\hat{y} = \arg\max_{k=1,\ldots,K} p(x|y=k)\,\gamma_k$
Multiple Classes
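A small sketch of the multi-class MAP rule as an argmax; the class means and priors below are illustrative assumptions.

```python
# Sketch: MAP over K classes as argmax of p(x|y=k) p(y=k),
# with Gaussian class-conditional models.
import numpy as np

mus = np.array([-2.0, 0.0, 3.0])        # assumed class means, K = 3
sigma = 1.0
priors = np.array([0.5, 0.3, 0.2])      # assumed p(y = k)

def map_classify(x):
    # Log posterior up to a constant that does not depend on k.
    log_post = -(x - mus)**2 / (2*sigma**2) + np.log(priors)
    return int(np.argmax(log_post))

for x in [-3.0, -0.5, 1.0, 4.0]:
    print(x, map_classify(x))
```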
Decision Theory: Classification, Maximum Likelihood and Log Likelihood; MAP Estimation, Bayes Risk; Probability of Errors, ROC
Empirical Risk Minimization: Problems with Decision Theory, Empirical Risk Minimization; Probably Approximately Correct Learning
Curse of Dimensionality
Parameter Estimation: Probabilistic Models for Supervised and Unsupervised Learning; ML and MAP Estimation; Examples
Outline
The Bayesian formulation for classification requires that we know $p(x|y)$. But we only have samples $(x_i, y_i)$, $i = 1, \ldots, N$, from this density. What do we do?
Approach 1: probabilistic approach. Learn the distributions $p(x|y)$ from the data $(x_i, y_i)$, then apply Bayesian decision theory using the estimated densities
Approach 2: decision rule. Use hypothesis testing to select a form for the classifier, then learn the parameters of the classifier directly from the data
Two Approaches
Given data $(x_i, y_i)$, $i = 1, \ldots, N$
Probabilistic approach: assume $x_i \sim \mathcal{N}(\mu_0, \sigma^2)$ when $y_i = 0$ and $x_i \sim \mathcal{N}(\mu_1, \sigma^2)$ when $y_i = 1$. Learn the sample means for the two classes: $\hat{\mu}_k$ = mean of the samples $x_i$ in class $k$. From decision theory, we have the decision rule:
$\hat{y} = \alpha(x, t) = 1$ if $x > t$, $0$ if $x < t$, $\quad t = \dfrac{\hat{\mu}_0 + \hat{\mu}_1}{2}$
Empirical risk minimization: for each threshold $t$, we get decisions on the training data: $\hat{y}_i = \alpha(x_i, t)$
Look at the empirical risk, e.g. the training error $L(t) := \dfrac{1}{N}\,\#\{\hat{y}_i \ne y_i\}$
Select $t$ to minimize the empirical risk: $\hat{t} = \arg\min_t L(t)$
Example with Scalar Data and Linear Discriminator
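A small sketch of ERM over the threshold on synthetic 1-D data; the data-generating choices below are illustrative (class 0 is deliberately bimodal, as in the next slide's example).

```python
# Sketch: empirical risk minimization over the threshold t.
import numpy as np

rng = np.random.default_rng(0)
# Class 0 is bimodal (so a Gaussian fit would be misleading), class 1 is not.
x0 = np.concatenate([rng.normal(-4, 0.5, 50), rng.normal(0, 0.5, 50)])
x1 = rng.normal(1.5, 0.5, 100)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])

def training_error(t):
    y_hat = (x > t).astype(float)
    return np.mean(y_hat != y)

candidates = np.sort(x)                  # candidate thresholds at the data points
t_hat = min(candidates, key=training_error)
print("ERM threshold:", t_hat, "training error:", training_error(t_hat))
```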
Suppose the data is as shown. We estimate the class means: $\hat{\mu}_0 \approx -2$, $\hat{\mu}_1 \approx 1$. The decision rule from the probabilistic approach is
$\hat{y} = 1$ if $x > t$, $0$ if $x < t$, $\quad t = \dfrac{\hat{\mu}_0 + \hat{\mu}_1}{2} \approx -0.5$
This threshold misclassifies many points. Empirical risk minimization selects $t$ to minimize the classification errors on the training data; it will give $t \approx 0.5$, which leads to a better rule
Why did the probabilistic approach fail? We assumed both distributions were Gaussian, but $p(x|y=0)$ is not Gaussian; it is bimodal. ERM does not require such assumptions
Why ERM may be Better
[Figure: the threshold from the probabilistic approach does not separate the classes; the threshold from ERM separates the classes well]
Decision rule approach:
Assume a rule: $\hat{y} = \alpha(x) = 1$ if $x > t$, $0$ if $x < t$
The rule has an unknown parameter $t$
Find $t$ to minimize the empirical risk $\hat{R}_{\mathrm{emp}}(\alpha, \mathcal{D}) := \dfrac{1}{N}\sum_i \mathbf{1}(y_i \ne \alpha(x_i))$
This minimizes the error on the training data
Motivation for the decision rule approach over the probabilistic approach: why bother learning probability densities if your final goal is a decision rule? Assumptions on the probability densities may be incorrect (see next slide). Concentrate your efforts on the data that is hard to classify
Example of Decision Rule Approach
The probabilistic approach needs to assume a specific form for the densities. Ex: suppose we assume Gaussian densities. Gaussians are not robust: outlier values can make large changes in
the mean and variance estimates
Risk minimization alternative: search over planes that separate the classes. Only pay attention to the data near the boundary. Good in the case of limited data
Dangers of Using Probabilistic Approach
Decision Theory: Classification, Maximum Likelihood and Log Likelihood; MAP Estimation, Bayes Risk; Probability of Errors, ROC
Empirical Risk Minimization: Problems with Decision Theory, Empirical Risk Minimization; Probably Approximately Correct Learning
Curse of Dimensionality
Parameter Estimation: Probabilistic Models for Supervised and Unsupervised Learning; ML and MAP Estimation; Examples
Outline
Examples of Bayes decision theory can be misleading: the examples are in low-dimensional spaces (1 or 2 dimensions), while most machine learning problems today have high dimension. Our geometric intuition in high dimensions is often wrong
Example: consider the volume of a sphere of radius $r = 1$ in $D$ dimensions. What fraction of the volume is in a thin shell $1 - \epsilon \le r \le 1$?
Intuition in High-Dimensions
[Figure: unit sphere ($r = 1$) with a thin shell of thickness $\epsilon$]
Let $V_D(r)$ = volume of a sphere of radius $r$ in dimension $D$. From geometry: $V_D(r) = K_D r^D$
Let $f_D(\epsilon)$ = fraction of the volume in a shell of thickness $\epsilon$:
$f_D(\epsilon) = \dfrac{V_D(1) - V_D(1-\epsilon)}{V_D(1)} = \dfrac{K_D - K_D(1-\epsilon)^D}{K_D} = 1 - (1-\epsilon)^D$
For any $\epsilon$, we see that $f_D(\epsilon) \to 1$ as $D \to \infty$: all of the volume concentrates in a thin shell. This is very different from low dimensions
Example: Sphere Hardening
[Figure: $f_D(\epsilon)$ vs. $\epsilon$ for $D = 1, 5, 10, 200$]
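A quick numeric check of the shell-fraction formula; the value of $\epsilon$ below is an illustrative choice.

```python
# Sketch: fraction of a unit sphere's volume within a shell of thickness
# epsilon, f_D(eps) = 1 - (1 - eps)^D, for a few dimensions D.
eps = 0.01
for D in [1, 5, 10, 200, 1000]:
    print(f"D = {D:5d}:  f_D({eps}) = {1 - (1 - eps)**D:.4f}")
# Even a 1% shell contains essentially all of the volume once D is large.
```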
Consider a Gaussian i.i.d. vector $x = (x_1, \ldots, x_D)$, $x_i \sim \mathcal{N}(0,1)$
As $D \to \infty$, the probability density concentrates on the shell $\|x\| \approx D^{1/2}$, even though $x = 0$ is the most likely point
Let $R = \left(x_1^2 + x_2^2 + \cdots + x_D^2\right)^{1/2}$
$D = 1$: $p(R) \propto e^{-R^2/2}$
$D = 2$: $p(R) \propto R\,e^{-R^2/2}$
General $D$: $p(R) \propto R^{D-1}\,e^{-R^2/2}$
Gaussian Sphere Hardening
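A small simulation sketch of this concentration effect; the sample size and dimensions below are illustrative choices.

```python
# Sketch: simulate i.i.d. N(0,1) vectors and look at where the norm
# R = ||x|| concentrates as D grows.
import numpy as np

rng = np.random.default_rng(0)
for D in [1, 2, 10, 100, 1000]:
    x = rng.standard_normal((10000, D))
    R = np.linalg.norm(x, axis=1)
    print(f"D = {D:5d}:  mean R = {R.mean():7.2f},  sqrt(D) = {np.sqrt(D):7.2f}")
# The norm clusters tightly around sqrt(D) even though x = 0 maximizes the density.
```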
Conclusion: as the dimension increases, all of the volume of a sphere concentrates at its surface!
Similar example: consider a Gaussian i.i.d. vector $x = (x_1, \ldots, x_D)$, $x_i \sim \mathcal{N}(0,1)$. As $D \to \infty$, the probability density concentrates on the shell
$\|x\|^2 \approx D$, even though $x = 0$ is the most likely point
Example: Sphere Hardening
In high dimensions, classifiers need a large number of parameters. Example: suppose $x = (x_1, \ldots, x_D)$ and each $x_i$ takes on $L$ values; hence $x$ takes on $L^D$ values
Consider a general classifier $f(x)$ that assigns each $x$ some value. If there are no restrictions on $f(x)$, it needs $L^D$ parameters
Computational Issues
Curse of dimensionality: as the dimension increases, the number of parameters needed to represent general functions grows exponentially
Most operations become computationally intractable: fitting the function, optimizing, storage
What ML is doing today: finding tractable approximate approaches for high dimensions
Curse of Dimensionality