Bayesian Linear Regression
Pattern Recognition 2016
Sandro Schönborn
University of Basel
Outline
• Regression problem: continuous labels, no classes; an accessible Bayesian example
• Least squares regression
• Bayesian regression: weighted average of all models
• Uncertainty: Bayesian inference and subjective probability
• Outlook: kernel ridge regression and Gaussian Processes
Motivation: Regression
Not all data inference problems are about classification. Sometimes we need to predict a continuous value (e.g. the price of a fish instead of its class).
• Machine learning problem, now with continuous labels: regression
We did well with probabilistic methods. They deliver good and valuable results. The discriminative approach is simpler.
• Regression as a discriminative, probabilistic method
More than one solution is good. We want to average over all possible results and not select only the single best one.
• Regression is a tractable example of a Bayesian method
Regression
[Figure: two scatter plots over features x1 and x2; left panel "Regression" (continuous labels), right panel "Classification" (discrete classes)]
Regression: Formal Setup
• Data: $\boldsymbol{x} \in \mathbb{R}^d$; for now, standard vector-space data (a feature vector)
• Labels: $y \in \mathbb{R}$, labels are continuous
• Training data: $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$, known labels for our training data
• Goal: regression on test data
• Predict a good label for a given datum $\boldsymbol{x}$: $\hat{y} = f(\boldsymbol{x})$
• Machine learning problem: find a function $f$ to predict the label
• Learning/estimation on (limited) training data
• Prediction quality with respect to (unknown) test data
Linear Regression
• Standard method: linear least squares fit to data
• Known in 1d from school: "Ausgleichsgerade" (line of best fit)
• Known in n-d from basic lectures
• Linear model for the label variable $y$:
$y = \boldsymbol{w}^T\boldsymbol{x} + w_0$
• Training/learning with a dataset $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$
How to find $\boldsymbol{w}, w_0$? How to measure the label/prediction error?
We use an old trick to keep it simple:
$\boldsymbol{w} := \begin{pmatrix} w_0 \\ \boldsymbol{w} \end{pmatrix}, \qquad \boldsymbol{x} := \begin{pmatrix} 1 \\ \boldsymbol{x} \end{pmatrix}$
Least Squares Solution
The linear model should fit the training data optimally. The easiest loss function to minimize is the squared error:
$L(y, \boldsymbol{x}, f) = (y - f(\boldsymbol{x}))^2$
Training: find $\boldsymbol{w}, w_0$ such that the sum of the squared reconstruction errors over the training set is minimal:
$\boldsymbol{w}, w_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2$
Well-known solution: $\boldsymbol{w} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
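As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution (not part of the original slides; the toy data and the generating weights are invented for the example):

```python
import numpy as np

# Toy 1-d data from an assumed ground truth y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)

# Columns of X are the augmented feature vectors (1, x_i):
# the "old trick" that folds the offset w0 into w.
X = np.vstack([np.ones_like(x), x])            # shape (2, N)

# Normal equations: w = (X X^T)^{-1} X y.  Solving the linear system
# is numerically preferable to forming the inverse explicitly.
w = np.linalg.solve(X @ X.T, X @ y)
print(w)                                        # approximately [1.0, 2.0]
```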
Probabilistic Setup
In our probabilistic setup, we have a distribution of predictions given a data point: $P(y \mid \boldsymbol{x})$
• Similar to the posterior class probability with Bayes, but the label is now continuous: there are more than two values!
• The best single prediction to make depends on our risk function. Very often it is the expected value (e.g. for squared-loss risk): $\hat{y} = E[y \mid \boldsymbol{x}]$
• Direct posterior model: a discriminative method
Probabilistic Setup
We use a simple posterior model for the label given the data:
$P(y \mid \boldsymbol{x}; \boldsymbol{w}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• Each observation is affected by a noise value $\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$:
$y = \boldsymbol{w}^T\boldsymbol{x} + \varepsilon$
• The single best prediction of $y$ is standard linear regression:
$\hat{y} = E[y] = \boldsymbol{w}^T\boldsymbol{x}$
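A short sketch of this observation model, with invented parameters, makes the role of the noise explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([1.0, 2.0])     # assumed parameters (w0, w1) for the example
sigma = 0.5                  # observation noise standard deviation

# Augmented inputs (1, x_i) as columns, following the slides' convention.
X = np.vstack([np.ones(100), rng.uniform(0, 10, size=100)])

# Generative view: y = w^T x + eps with eps ~ N(0, sigma^2).
y = w @ X + rng.normal(0.0, sigma, size=100)

# The single best (expected-value) prediction is the noise-free line.
y_hat = w @ X                # E[y | x] = w^T x
```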
Maximum Likelihood: Regression
• The discriminative probabilistic model can be trained by maximum likelihood estimation
• The result is identical to the known least squares solution; least squares usually corresponds to Gaussian noise assumptions
• Again: maximize the posterior of the data (the discriminative likelihood)
$\boldsymbol{w}, w_0 = \arg\max_{\boldsymbol{w}, w_0} P(Y \mid \boldsymbol{X}, \boldsymbol{w})$
$P(Y \mid \boldsymbol{X}) = \prod_i P(y_i \mid \boldsymbol{x}_i) = \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)$
Maximum Likelihood: Regression
$\log P(Y \mid \boldsymbol{X}) = \sum_i \left[ -\frac{1}{2\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2 - \frac{1}{2}\log 2\pi - \log\sigma \right]$
$\frac{\partial}{\partial\boldsymbol{w}} \log P(Y \mid \boldsymbol{X}) = \sum_i \frac{1}{\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T \stackrel{!}{=} 0$
$\sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T = 0$
$\boldsymbol{w}_{\mathrm{ML}} = \Big(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T\Big)^{-1} \sum_i y_i\boldsymbol{x}_i$
Data Matrix Notation
Using matrix notation, the result becomes more accessible:
$\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T = \boldsymbol{X}\boldsymbol{X}^T, \qquad \sum_i y_i\boldsymbol{x}_i = \boldsymbol{X}\boldsymbol{y}$
$\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N), \qquad \boldsymbol{y} = (y_1, y_2, \ldots, y_N)^T$
$\boldsymbol{w}_{\mathrm{ML}} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
The standard least squares solution! $(\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}$ is the pseudo-inverse of the matrix $\boldsymbol{X}^T$.
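In code, the two forms coincide; a quick NumPy check with made-up data that the normal-equation solution equals the Moore-Penrose pseudo-inverse of $\boldsymbol{X}^T$ applied to $\boldsymbol{y}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([np.ones(30), rng.uniform(0, 10, size=30)])  # data points as columns
y = 2.0 * X[1] + 1.0 + rng.normal(0.0, 0.5, size=30)

w_normal = np.linalg.solve(X @ X.T, X @ y)   # (X X^T)^{-1} X y
w_pinv = np.linalg.pinv(X.T) @ y             # pseudo-inverse of X^T

assert np.allclose(w_normal, w_pinv)         # identical up to numerics
```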
Shortcoming 1: Outliers
• Outliers affect the results:
1. Least squares: outliers affect the squared loss massively
2. Probabilistic: the Gaussian assigns very low probability to large deviations
Least squares solutions tend to equalize all errors.
[Figure: real problem, illumination estimation on a face; the least squares estimate is too dark because of sunglasses, while a robust estimation is not]
Shortcoming 2: Overfitting
• Too many parameters lead to undecidable models, or to models which can explain the data perfectly (overfitting)
• In general, we have multiple solutions which fit the data
Illustration with fitting polynomials of degree $M$ (non-linear basis functions):
[Figure: polynomial fits; model too simple, model fits data, overfitting (too complex)]
Figs: Bishop PRML, 2006
Regularization
As a solution, we introduce prior assumptions about the solution $\boldsymbol{w}$. Actually, we just make our prior assumptions explicit: you always have them.
We want to prefer small $\boldsymbol{w}$: the model should show a tendency towards lower influence of a feature when not enough data is available.
[Figure: polynomial fit with the desired regularization]
Figs: Bishop PRML, 2006
Regularized Regression: MAP
The natural way of dealing with priors in the probabilistic view: the maximum a posteriori (MAP) estimate
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
The Gaussian prior is a very common choice: we prefer solutions with a small magnitude.
$P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
This will lead to regularized least squares.
MAP Estimate
$\log P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \sum_i \left[ -\frac{1}{2\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2 - \frac{1}{2}\log 2\pi\sigma^2 \right] - \frac{1}{2\sigma_w^2}\|\boldsymbol{w}\|^2 - \frac{d}{2}\log 2\pi\sigma_w^2$
$\frac{\partial}{\partial\boldsymbol{w}} \log P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \frac{1}{\sigma^2}\sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T - \frac{1}{\sigma_w^2}\boldsymbol{w}^T \stackrel{!}{=} 0$
$\boldsymbol{w}_{\mathrm{MAP}} = \Big(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T + \frac{\sigma^2}{\sigma_w^2}\boldsymbol{I}\Big)^{-1} \sum_i y_i\boldsymbol{x}_i$
$\boldsymbol{w}_{\mathrm{MAP}} = (\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}\boldsymbol{y}, \qquad \lambda := \frac{\sigma^2}{\sigma_w^2}$
Special name: ridge regression
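A minimal sketch of this closed form (the function name and data layout are our own; the slides only give the formula):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X X^T + lam I)^{-1} X y.

    X holds one (augmented) data point per column;
    lam = sigma^2 / sigma_w^2 trades data fit against the Gaussian prior.
    """
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```

For $\lambda \to 0$ this recovers the maximum likelihood / least squares solution.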
Ridge Regression
• The parameter $\lambda$ needs to be adapted to the problem
Typically through cross-validation: optimization on test/validation data (see the sketch below)
Rarely through "real" prior knowledge
[Figure: polynomial fits with the desired regularization, too weak, and too strong $\lambda$]
Figs: Bishop PRML, 2006
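As a hedged illustration of choosing $\lambda$ by cross-validation, scikit-learn's RidgeCV can search a grid of regularization strengths (by default with an efficient leave-one-out scheme); the polynomial features and data here are invented, and note that scikit-learn stores samples in rows, transposed with respect to the slides:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, 0.2, size=30)

# Degree-9 polynomial features, prone to overfitting without regularization.
Phi = x ** np.arange(10)

# RidgeCV selects lambda (called alpha here) by cross-validation.
model = RidgeCV(alphas=np.logspace(-6, 2, 50)).fit(Phi, y)
print(model.alpha_)          # the selected regularization strength
```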
Bayesian Linear Regression
We still only select a single solution. A probably better alternative would be to consider all of them, in a proper way of averaging.
• Compare to logistic regression with many decision planes: we discussed averaging only conceptually. How to actually do it?
• Conceptual framework: Bayesian inference. It defines the proper way of averaging: marginalization.
• Bayesian linear regression is a nice application example which is still fully tractable and illustrates the concept very well. Bayesian methods tend to become intractable for more complex models.
Bayesian Inference for Regression
Classification: average many possible decision planes
Regression: average many possible regression lines
Figs: Bishop PRML, 2006
Probabilistic Setup
• The MAP estimate can easily be extended to a full Bayesian treatment. Instead of taking only the maximum, we use the whole distribution of $\boldsymbol{w}$:
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) \propto P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) = \dfrac{P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$
This is a distribution over values of $\boldsymbol{w}$. This interpretation makes $\boldsymbol{w}$ a random variable!
Posterior of the Parameter
• Calculation of the posterior of our parameter $\boldsymbol{w}$: $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$
• Application of Bayes' rule:
$P(\boldsymbol{w} \mid \boldsymbol{X}, Y) = \dfrac{P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$
The likelihood $P(Y \mid \boldsymbol{X}, \boldsymbol{w})$ measures how likely the dataset is for a single value of $\boldsymbol{w}$.
The prior $P(\boldsymbol{w})$ expresses the assumptions we hold about $\boldsymbol{w}$ before seeing data.
The normalization measures how likely the dataset is on average, considering all values of $\boldsymbol{w}$: the marginal likelihood $P(Y \mid \boldsymbol{X})$.
The posterior $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$ expresses the certainty we have about a specific value of $\boldsymbol{w}$, considering data and prior assumptions.
Posterior of the Parameter
We now have the posterior distribution instead of a single best value. It contains our knowledge about the compatibility of all possible solutions with our data and assumptions.
• What is it good for? It expresses our certainty about all possible solutions, a "rating" for each solution. Single maximum? Peaked? Broad? Valuable information. System integration: down-stream methods can account for regression uncertainty.
• What to do with it? We can use all this information to make more informed predictions. An analysis (e.g. of risk factors) of $\boldsymbol{w}$ has more information available.
[Figure: prior $P(\boldsymbol{w})$ vs. posterior $P(\boldsymbol{w} \mid D)$]
Bayesian Inference
Training data: $D = (\boldsymbol{X}, \boldsymbol{y})$
$P(\boldsymbol{w} \mid D) = \frac{1}{Z}\, P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \frac{1}{Z} \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
$= \frac{1}{Z'} \exp\left( -\frac{1}{2\sigma^2}\,\|\boldsymbol{X}^T\boldsymbol{w} - \boldsymbol{y}\|^2 - \frac{1}{2\sigma_w^2}\,\|\boldsymbol{w}\|^2 \right)$
$P(\boldsymbol{w} \mid D) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
The posterior is again a Gaussian! Its mean is the MAP estimate: $\boldsymbol{\mu} = \boldsymbol{w}_{\mathrm{MAP}}$
Bishop, PRML, section 3.3.1, pp. 152–156 (eq. 3.49–3.54), Springer 2006
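These two formulas translate directly into a short NumPy sketch (our own function; data layout as in the slides, points as columns):

```python
import numpy as np

def posterior(X, y, sigma, sigma_w):
    """Gaussian posterior N(w | mu, Sigma) of Bayesian linear regression.

    Sigma^{-1} = X X^T / sigma^2 + I / sigma_w^2,  mu = Sigma X y / sigma^2.
    """
    d = X.shape[0]
    Sigma_inv = X @ X.T / sigma**2 + np.eye(d) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ (X @ y) / sigma**2    # coincides with the MAP/ridge estimate
    return mu, Sigma
```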
Posterior of Linear Regression
$P(\boldsymbol{w} \mid D) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
[Figure: the posterior contracting as data arrives; no data, N=1, N=2, N=19]
Figs: Bishop PRML, 2006
Predictive Distribution
How to predict a label for a new data point? We now have very many solutions and know how well each one fits our training data and our prior assumptions.
• Prediction is probabilistic (a posterior for prediction/classification)
• The prediction should include all our knowledge about possible solutions (it should "average" over parameter values): $P(y \mid \boldsymbol{x}, D)$
• We only have a prediction for a single value of $\boldsymbol{w}$: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Averaging should respect the different quality of each $\boldsymbol{w}$: $P(\boldsymbol{w} \mid D)$. Bad solutions should not contribute, while we want to focus on good ones.
Predictive Distribution (II)
Example: polynomial fit (basis functions)
• Blue: data points
• Green line: generating process / ground truth
• Red line: best fit to the blue data points
• Shaded red: region of probable prediction
Tells us about the outcome's certainty!
Figs: Bishop PRML, 2006
Predictive Distribution: Calculation
The averaging method is called marginalization. The predictive distribution is
$P(y \mid \boldsymbol{x}, D) = \int P(y \mid \boldsymbol{x}, \boldsymbol{w})\, P(\boldsymbol{w} \mid D)\, \mathrm{d}\boldsymbol{w}$
$P(y \mid \boldsymbol{x}, D) = \int \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, \mathrm{d}\boldsymbol{w}$
$P(y \mid \boldsymbol{x}, D) = \mathcal{N}(y \mid \boldsymbol{\mu}^T\boldsymbol{x},\; \sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
The expected/best prediction is still linear.
Bishop, PRML, section 3.3.2, p. 156 (eq. 3.57–3.59), Springer 2006
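The resulting predictive distribution is equally short in code (a sketch continuing the posterior function above; x_new is an augmented input vector and the toy numbers are invented):

```python
import numpy as np

def predictive(x_new, mu, Sigma, sigma):
    """Predictive distribution N(y | mu^T x, sigma^2 + x^T Sigma x)."""
    mean = mu @ x_new                        # still linear in x
    var = sigma**2 + x_new @ Sigma @ x_new   # noise + parameter uncertainty
    return mean, var

# Toy usage with an invented 2-d posterior (bias + one feature):
mu, Sigma, sigma = np.array([1.0, 2.0]), 0.1 * np.eye(2), 0.5
print(predictive(np.array([1.0, 3.0]), mu, Sigma, sigma))
```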
Predictive Distribution: Result
• The prediction mean is linear: $\boldsymbol{\mu}^T\boldsymbol{x}$
• The prediction variance is a quadratic function: $\sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}$
The prediction now includes a quality estimate together with the actual prediction!
• The quality is higher where we have more data
• The certainty is bounded by our observations' uncertainty: the predictive variance never drops below $\sigma^2$
Uncertainty
• We calculated many probabilities. How are they to be interpreted? They can seem contradictory: why does the distribution change when we have more data? Shouldn't there be a real distribution $P(\boldsymbol{w})$?
• Bayesian inference relies on a subjective perspective: probability is used to express our current knowledge. It can change when we learn or see more: with more data, we are more certain about our result.
• Not subjective in the sense that it is arbitrary! There are quantitative rules to follow mathematically.
• Probability expresses an observer's certainty, often called belief.
Subjectivity: there is no single, real underlying distribution. A probability distribution expresses our knowledge. It is different in different situations and for different observers, since they have different knowledge.
Bayesian Inference
Bayesian inference is the mathematical tool to calculate changes in certainty when the underlying knowledge changes through observations: belief dynamics, belief updates.
Evolution of beliefs by conditioning on data according to Bayes' rule:
$P(x) \to P(x \mid D), \qquad P(x \mid D) = \dfrac{P(D \mid x)\, P(x)}{P(D)}$
$P(x) \to P(x \mid D_1) \to P(x \mid D_2) \to \cdots$
Conditioning is done with a likelihood model: how can the data be explained?
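For the Gaussian regression model, this belief update can be sketched one data point at a time; carrying the posterior in natural parameters ($\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ and $\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, a representation we choose for convenience) makes each conditioning step a simple addition; the data points below are invented:

```python
import numpy as np

def belief_update(Lmbd, eta, x, y, sigma):
    """One Bayes-rule step P(w|D_k) -> P(w|D_{k+1}) for Gaussian regression,
    in natural parameters Lmbd = Sigma^{-1}, eta = Sigma^{-1} mu."""
    return Lmbd + np.outer(x, x) / sigma**2, eta + y * x / sigma**2

# Start from the prior N(0, sigma_w^2 I) and condition on points one by one.
d, sigma, sigma_w = 2, 0.5, 1.0
Lmbd, eta = np.eye(d) / sigma_w**2, np.zeros(d)
for x_i, y_i in [(np.array([1.0, 0.5]), 2.0), (np.array([1.0, 2.0]), 5.1)]:
    Lmbd, eta = belief_update(Lmbd, eta, x_i, y_i, sigma)

mu = np.linalg.solve(Lmbd, eta)   # posterior mean after both updates
```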
Kernel Regression
A non-linear extension is possible and powerful:
• Kernels and sample-space expansion: kernel regression (expansion & kernel trick)
$\boldsymbol{w}^T\boldsymbol{x} = \sum_i \alpha_i\, \boldsymbol{x}_i^T\boldsymbol{x} \quad\longrightarrow\quad y = \sum_i \alpha_i\, k(\boldsymbol{x}_i, \boldsymbol{x})$
• With regularization (MAP with a Gaussian prior): kernel ridge regression. Least squares solution:
$\boldsymbol{\alpha}^* = (\boldsymbol{K} + \lambda\boldsymbol{I})^{-1}\boldsymbol{y}, \qquad \boldsymbol{K}_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$
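A compact sketch of kernel ridge regression with an RBF kernel (the kernel choice, function names, and row-wise data layout are ours, not the slides'):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha* = (K + lam I)^{-1} y

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha      # y(x) = sum_i alpha_i k(x_i, x)
```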
Kernel Ridge Regression
• Non-linear fitting with a proper kernel
[Figure: non-linear fits with KRR (Kernel Ridge Regression) and GPR (Gaussian Process Regression); scikit-learn.org, Jan Hendrik Metzen]
Gaussian Process
• $P(\boldsymbol{w})$ actually describes a distribution over functions. Every single value of $\boldsymbol{w}$ defines a linear function:
$f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x}, \qquad P(\boldsymbol{w}) \to P(f_{\boldsymbol{w}})$
• Gaussian Process: a Gaussian distribution over functions. It directly models the distribution of our regression functions $f$. Gaussian: they have a mean $\mu(\boldsymbol{x})$ and a covariance $k(\boldsymbol{x}, \boldsymbol{x}')$:
$f(\boldsymbol{x}) \sim GP(\mu(\boldsymbol{x}), k(\boldsymbol{x}, \boldsymbol{x}'))$
• Covariance functions are essentially the same thing as a kernel: they specify a similarity measure between points $\boldsymbol{x}$ ("similar" → high correlation)
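To see the "distribution over functions" concretely, a few lines suffice to draw sample functions from a GP prior with zero mean and a squared-exponential covariance (a standard kernel choice we assume here; not from the slides):

```python
import numpy as np

def sqexp(a, b, length=1.0):
    """Squared-exponential covariance k(x, x') for scalar inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

xs = np.linspace(-5.0, 5.0, 100)
K = sqexp(xs, xs) + 1e-8 * np.eye(len(xs))       # jitter for numerical stability

# Each row is one function f ~ GP(0, k) evaluated on the grid xs.
samples = np.random.default_rng(4).multivariate_normal(np.zeros(len(xs)), K, size=3)
```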
Gaussian Process (II)
• Fully Bayesian treatment of many problems thanks to the Gaussian structure: closed-form solutions are available (usually this needs Gaussian likelihoods as well)
• Due to the use of covariance functions (~kernels), this is also true for very complex non-linear models
• A very powerful framework:
• Non-linear Bayesian regression for machine learning
• E.g. full shape models which combine statistical approaches ("PCA") with more general assumptions, like "smoothness"
Gaussian Process: Shapes
Lüthi, Marcel, et al. "Gaussian Process Morphable Models." arXiv preprint arXiv:1603.07254 (2016).
http://shapemodelling.cs.unibas.ch/
https://www.futurelearn.com/courses/statistical-shape-modelling
Bayesian Model Selection
• Bayesian methods average over whole model classes $M$, e.g. over all $\boldsymbol{w}$ values for a given polynomial degree $M$
• The marginal likelihood captures the average fit of a model $M$ to given data $D$; it is the evidence for a given model class:
$P(D \mid M) = \int P(D \mid \boldsymbol{w}, M)\, P(\boldsymbol{w} \mid M)\, \mathrm{d}\boldsymbol{w}$
• Different models can be compared with respect to their marginal likelihoods: find models which fit a dataset well on average
• Model selection: select the best one
• Model "averaging": predict with a weighted average over all models
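For Bayesian linear regression this integral has a closed form: marginalizing $\boldsymbol{w} \sim \mathcal{N}(0, \sigma_w^2\boldsymbol{I})$ out of $\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{w} + \boldsymbol{\varepsilon}$ leaves $\boldsymbol{y} \sim \mathcal{N}(0,\; \sigma_w^2\boldsymbol{X}^T\boldsymbol{X} + \sigma^2\boldsymbol{I})$. A sketch of this evidence computation (our own function, SciPy assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, y, sigma, sigma_w):
    """Log marginal likelihood log P(y | X) for Bayesian linear regression.

    With w ~ N(0, sigma_w^2 I) and y = X^T w + noise, the marginal is
    y ~ N(0, sigma_w^2 X^T X + sigma^2 I); X holds data points as columns.
    """
    N = y.shape[0]
    C = sigma_w**2 * (X.T @ X) + sigma**2 * np.eye(N)
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(y)
```

Evaluating log_evidence on feature matrices of different polynomial degrees ranks the model classes, as on the following slides.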
Bayesian Model Selection (II)
• The marginal likelihood is a normalized distribution: it measures the degree of fit to data vs. the complexity of the model. Normalization has a natural regularization effect: in a complex model, many datasets are quite likely because there is some suitable parameter which explains each of them well (e.g. high-degree polynomials). But if many datasets have a high likelihood, the likelihood of an individual dataset must be rather low, because the marginal likelihood is normalized: the "area" is always 1.
[Figure: evidence over datasets for $M_1$ (simple model), $M_2$ (intermediate model), $M_3$ (complex model)]
Figs: Bishop PRML, 2006
Bayesian Model Selection (III)
• Evidence for the polynomial example:
[Figure: model evidence (log) as a function of the polynomial degree]
Figs: Bishop PRML, 2006
Summary: Regression
• Regression: machine learning with a continuous label
• Least squares regression: ML estimation; corresponds to a Gaussian observation error
• Regularization: MAP estimation
• Reduces overfitting
• Regularized least squares: ridge regression
• Bayesian regression
• Posterior distribution of regression models: $P(\boldsymbol{w} \mid D)$
• Average all models for a prediction: the predictive distribution $P(y \mid \boldsymbol{x}, D)$
• Uncertainty treatment with Bayesian inference: belief updates
Summary: Probabilistic Methods (I)
[Diagram: overview of methods, grouped by classification vs. regression and probabilistic vs. discriminative: Bayes Classifier, Naïve Bayes, Logistic Regression, Bayesian Regression, Least Squares Regression, Gaussian Process, and (discriminative, not probabilistic) SVM, Decision Tree, ANN, Perceptron.
Bold: part of this block; italic: not in this lecture.]
Summary: Probabilistic Methods (II)
• Build & construct a model, according to idea and concept: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Estimate parameters: maximum likelihood
• The idiomatic probabilistic way of learning
• Realize the shortcomings of the result, due to a lack of data
• Estimation with prior knowledge: MAP estimation
• MAP includes our knowledge about the problem in the estimation
• Full Bayesian treatment
• Express certainty by considering all possible solutions
• Weighted averaging: the weight is the degree of fit with the training data
Summary: Probabilistic Methods (III)

Method: Naïve Bayes, bag-of-words
• Model: $P(y \mid \mathbf{w}) \propto \prod_w P(w \mid y)\, P(y)$, with $P(w \mid y) = h_{w,y}$
• ML estimate: $h_w = \dfrac{N_w}{\sum_{w'} N_{w'}}$
• Shortcoming of ML: unseen words: zero counts
• Prior knowledge: pseudo-counts: each word already seen once
• MAP estimate: $h_w = \dfrac{N_w + 1}{\sum_{w'} (N_{w'} + 1)}$
• Bayes ($P(\boldsymbol{w} \mid D)$, $P(y \mid \boldsymbol{x}, D)$): (did not discuss) Latent Dirichlet Allocation

Method: Logistic Regression
• Model: $P(y \mid \boldsymbol{x}) = \dfrac{1}{1 + \exp(-\boldsymbol{w}^T\boldsymbol{x})}$
• ML estimate: $\sum_i \big(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\big)\,\boldsymbol{x}_i^T \stackrel{!}{=} 0$, Iteratively Reweighted Least Squares
• Shortcoming of ML: separable data: infinite certainty
• Prior knowledge: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma^2\boldsymbol{I})$
• MAP estimate: $\sum_i \big(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\big)\,\boldsymbol{x}_i^T - \dfrac{1}{\sigma^2}\boldsymbol{w}^T \stackrel{!}{=} 0$, Iteratively Reweighted Least Squares
• Bayes: average the classifiers resulting from all possible classification hyperplanes

Method: Linear Regression
• Model: $P(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• ML estimate: $\boldsymbol{w}_{\mathrm{ML}} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
• Shortcoming of ML: underdetermined solution & overfitting
• Prior knowledge: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
• MAP estimate: $\boldsymbol{w}_{\mathrm{MAP}} = (\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}\boldsymbol{y}$
• Bayes: average over all $\boldsymbol{w}$, weighted by how well they explain the training data: a Gaussian with linear mean and quadratic covariance