Bayesian Linear Regression
Pattern Recognition 2016
Sandro Schönborn
University of Basel
Outline
• Regression problem: continuous labels, no classes; an accessible Bayesian example
• Least squares regression
• Bayesian regression: weighted average of all models
• Uncertainty: Bayesian inference and subjective probability
• Outlook: kernel ridge regression and Gaussian Processes
Motivation: Regression
Not all data inference problems are about classification. Sometimes we need to predict a continuous value (e.g. the price of a fish instead of its class).
• Machine learning problem, now with continuous labels: regression
We did well with probabilistic methods. They deliver good and valuable results. The discriminative approach is simpler.
• Regression as a discriminative, probabilistic method
More than one solution is good. We want to average over all possible results and not select only the single best one.
• Regression is a tractable example of a Bayesian method
Regression
[Figure: two scatter plots over features x1 and x2; left panel "Regression" (continuous labels), right panel "Classification" (discrete classes)]
Regression: Formal Setup
• Data: $\boldsymbol{x} \in \mathbb{R}^d$; for now, standard vector-space data (a feature vector)
• Labels: $y \in \mathbb{R}$, labels are continuous
• Training data: $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$, known labels for our training data
• Goal: regression on test data
• Predict a good label for a given datum $\boldsymbol{x}$: $\hat{y} = f(\boldsymbol{x})$
• Machine learning problem: find a function $f$ to predict the label
• Learning/estimation on (limited) training data
• Prediction quality with respect to (unknown) test data
Linear Regression
• Standard method: linear least squares fit to data
• Known in 1d from school: "Ausgleichsgerade" (line of best fit)
• Known in n-d from basic lectures
• Linear model for the label variable $y$:
$y = \boldsymbol{w}^T\boldsymbol{x} + w_0$
• Training/learning with a dataset $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^N$
How to find $\boldsymbol{w}, w_0$? How to measure the label/prediction error?
We use an old trick to keep it simple:
$\boldsymbol{w} := \begin{pmatrix} w_0 \\ \boldsymbol{w} \end{pmatrix}, \qquad \boldsymbol{x} := \begin{pmatrix} 1 \\ \boldsymbol{x} \end{pmatrix}$
Least Squares Solution
The linear model should fit the training data optimally. The easiest loss function to minimize is the squared error:
$L(y, \boldsymbol{x}, f) = (y - f(\boldsymbol{x}))^2$
Training: find $\boldsymbol{w}, w_0$ such that the sum of the squared reconstruction errors over the training set is minimal:
$\boldsymbol{w}, w_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2$
Well-known solution: $\boldsymbol{w} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
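As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution (not part of the original slides; the toy data and the generating weights are invented for the example):

```python
import numpy as np

# Toy 1-d data from an assumed ground truth y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=50)

# Columns of X are the augmented feature vectors (1, x_i):
# the "old trick" that folds the offset w0 into w.
X = np.vstack([np.ones_like(x), x])            # shape (2, N)

# Normal equations: w = (X X^T)^{-1} X y.  Solving the linear system
# is numerically preferable to forming the inverse explicitly.
w = np.linalg.solve(X @ X.T, X @ y)
print(w)                                        # approximately [1.0, 2.0]
```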
Probabilistic Setup
In our probabilistic setup, we have a distribution of predictions given a data point: $P(y \mid \boldsymbol{x})$
• Similar to the posterior class probability with Bayes, but the label is now continuous: there are more than two values!
• The best single prediction to make depends on our risk function. Very often it is the expected value (e.g. for squared-loss risk): $\hat{y} = E[y \mid \boldsymbol{x}]$
• Direct posterior model: a discriminative method
Probabilistic Setup
We use a simple posterior model for the label given the data:
$P(y \mid \boldsymbol{x}; \boldsymbol{w}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• Each observation is affected by a noise value $\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$:
$y = \boldsymbol{w}^T\boldsymbol{x} + \varepsilon$
• The single best prediction of $y$ is standard linear regression:
$\hat{y} = E[y] = \boldsymbol{w}^T\boldsymbol{x}$
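A short sketch of this observation model, with invented parameters, makes the role of the noise explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([1.0, 2.0])     # assumed parameters (w0, w1) for the example
sigma = 0.5                  # observation noise standard deviation

# Augmented inputs (1, x_i) as columns, following the slides' convention.
X = np.vstack([np.ones(100), rng.uniform(0, 10, size=100)])

# Generative view: y = w^T x + eps with eps ~ N(0, sigma^2).
y = w @ X + rng.normal(0.0, sigma, size=100)

# The single best (expected-value) prediction is the noise-free line.
y_hat = w @ X                # E[y | x] = w^T x
```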
Maximum Likelihood: Regression
• The discriminative probabilistic model can be trained by maximum likelihood estimation
• The result is identical to the known least squares solution; least squares usually corresponds to Gaussian noise assumptions
• Again: maximize the posterior of the data (the discriminative likelihood)
$\boldsymbol{w}, w_0 = \arg\max_{\boldsymbol{w}, w_0} P(Y \mid \boldsymbol{X}, \boldsymbol{w})$
$P(Y \mid \boldsymbol{X}) = \prod_i P(y_i \mid \boldsymbol{x}_i) = \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)$
Maximum Likelihood: Regression
$\log P(Y \mid \boldsymbol{X}) = \sum_i \left[ -\frac{1}{2\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2 - \frac{1}{2}\log 2\pi - \log\sigma \right]$
$\frac{\partial}{\partial\boldsymbol{w}} \log P(Y \mid \boldsymbol{X}) = \sum_i \frac{1}{\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T \stackrel{!}{=} 0$
$\sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T = 0$
$\boldsymbol{w}_{\mathrm{ML}} = \Big(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T\Big)^{-1} \sum_i y_i\boldsymbol{x}_i$
Data Matrix Notation
Using matrix notation, the result becomes more accessible:
$\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T = \boldsymbol{X}\boldsymbol{X}^T, \qquad \sum_i y_i\boldsymbol{x}_i = \boldsymbol{X}\boldsymbol{y}$
$\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N), \qquad \boldsymbol{y} = (y_1, y_2, \ldots, y_N)^T$
$\boldsymbol{w}_{\mathrm{ML}} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
The standard least squares solution! $(\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}$ is the pseudo-inverse of the matrix $\boldsymbol{X}^T$.
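In code, the two forms coincide; a quick NumPy check with made-up data that the normal-equation solution equals the Moore-Penrose pseudo-inverse of $\boldsymbol{X}^T$ applied to $\boldsymbol{y}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([np.ones(30), rng.uniform(0, 10, size=30)])  # data points as columns
y = 2.0 * X[1] + 1.0 + rng.normal(0.0, 0.5, size=30)

w_normal = np.linalg.solve(X @ X.T, X @ y)   # (X X^T)^{-1} X y
w_pinv = np.linalg.pinv(X.T) @ y             # pseudo-inverse of X^T

assert np.allclose(w_normal, w_pinv)         # identical up to numerics
```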
Shortcoming 1: Outliers
• Outliers affect the results:
1. Least squares: outliers affect the squared loss massively
2. Probabilistic: the Gaussian assigns very low probability to large deviations
Least squares solutions tend to equalize all errors.
[Figure: real problem, illumination estimation on a face; the least squares estimate is too dark because of sunglasses, while a robust estimation is not]
Shortcoming 2: Overfitting
• Too many parameters lead to undecidable models, or to models which can explain the data perfectly (overfitting)
• In general, we have multiple solutions which fit the data
Illustration with fitting polynomials of degree $M$ (non-linear basis functions):
[Figure: polynomial fits; model too simple, model fits data, overfitting (too complex)]
Figs: Bishop PRML, 2006
Regularization
As a solution, we introduce prior assumptions about the solution $\boldsymbol{w}$. Actually, we just make our prior assumptions explicit: you always have them.
We want to prefer small $\boldsymbol{w}$: the model should show a tendency towards lower influence of a feature when not enough data is available.
[Figure: polynomial fit with the desired regularization]
Figs: Bishop PRML, 2006
Regularized Regression: MAP
The natural way of dealing with priors in the probabilistic view: the maximum a posteriori (MAP) estimate
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
The Gaussian prior is a very common choice: we prefer solutions with a small magnitude.
$P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
This will lead to regularized least squares.
MAP Estimate
$\log P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \sum_i \left[ -\frac{1}{2\sigma^2}(y_i - \boldsymbol{w}^T\boldsymbol{x}_i)^2 - \frac{1}{2}\log 2\pi\sigma^2 \right] - \frac{1}{2\sigma_w^2}\|\boldsymbol{w}\|^2 - \frac{d}{2}\log 2\pi\sigma_w^2$
$\frac{\partial}{\partial\boldsymbol{w}} \log P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \frac{1}{\sigma^2}\sum_i (y_i - \boldsymbol{w}^T\boldsymbol{x}_i)\,\boldsymbol{x}_i^T - \frac{1}{\sigma_w^2}\boldsymbol{w}^T \stackrel{!}{=} 0$
$\boldsymbol{w}_{\mathrm{MAP}} = \Big(\sum_i \boldsymbol{x}_i\boldsymbol{x}_i^T + \frac{\sigma^2}{\sigma_w^2}\boldsymbol{I}\Big)^{-1} \sum_i y_i\boldsymbol{x}_i$
$\boldsymbol{w}_{\mathrm{MAP}} = (\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}\boldsymbol{y}, \qquad \lambda := \frac{\sigma^2}{\sigma_w^2}$
Special name: ridge regression
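A minimal sketch of this closed form (the function name and data layout are our own; the slides only give the formula):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X X^T + lam I)^{-1} X y.

    X holds one (augmented) data point per column;
    lam = sigma^2 / sigma_w^2 trades data fit against the Gaussian prior.
    """
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```

For $\lambda \to 0$ this recovers the maximum likelihood / least squares solution.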
Ridge Regression
• The parameter $\lambda$ needs to be adapted to the problem
Typically through cross-validation: optimization on test/validation data (see the sketch below)
Rarely through "real" prior knowledge
[Figure: polynomial fits with the desired regularization, too weak, and too strong $\lambda$]
Figs: Bishop PRML, 2006
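As a hedged illustration of choosing $\lambda$ by cross-validation, scikit-learn's RidgeCV can search a grid of regularization strengths (by default with an efficient leave-one-out scheme); the polynomial features and data here are invented, and note that scikit-learn stores samples in rows, transposed with respect to the slides:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, 0.2, size=30)

# Degree-9 polynomial features, prone to overfitting without regularization.
Phi = x ** np.arange(10)

# RidgeCV selects lambda (called alpha here) by cross-validation.
model = RidgeCV(alphas=np.logspace(-6, 2, 50)).fit(Phi, y)
print(model.alpha_)          # the selected regularization strength
```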
Bayesian Linear Regression
We still only select a single solution. A probably better alternative would be to consider all of them, in a proper way of averaging.
• Compare to logistic regression with many decision planes: we discussed averaging only conceptually. How to actually do it?
• Conceptual framework: Bayesian inference. It defines the proper way of averaging: marginalization.
• Bayesian linear regression is a nice application example which is still fully tractable and illustrates the concept very well. Bayesian methods tend to become intractable for more complex models.
Bayesian Inference for Regression
Classification: average many possible decision planes
Regression: average many possible regression lines
Figs: Bishop PRML, 2006
Probabilistic Setup
• The MAP estimate can easily be extended to a full Bayesian treatment. Instead of taking only the maximum, we use the whole distribution of $\boldsymbol{w}$:
$\boldsymbol{w}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{w}} P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) \propto P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})$
$P(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) = \dfrac{P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$
This is a distribution over values of $\boldsymbol{w}$. This interpretation makes $\boldsymbol{w}$ a random variable!
Posterior of the Parameter
• Calculation of the posterior of our parameter $\boldsymbol{w}$: $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$
• Application of Bayes' rule:
$P(\boldsymbol{w} \mid \boldsymbol{X}, Y) = \dfrac{P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})}{\int P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w})\, \mathrm{d}\boldsymbol{w}}$
The likelihood $P(Y \mid \boldsymbol{X}, \boldsymbol{w})$ measures how likely the dataset is for a single value of $\boldsymbol{w}$.
The prior $P(\boldsymbol{w})$ expresses the assumptions we hold about $\boldsymbol{w}$ before seeing data.
The normalization measures how likely the dataset is on average, considering all values of $\boldsymbol{w}$: the marginal likelihood $P(Y \mid \boldsymbol{X})$.
The posterior $P(\boldsymbol{w} \mid \boldsymbol{X}, Y)$ expresses the certainty we have about a specific value of $\boldsymbol{w}$, considering data and prior assumptions.
Posterior of the Parameter
We now have the posterior distribution instead of a single best value. It contains our knowledge about the compatibility of all possible solutions with our data and assumptions.
• What is it good for? It expresses our certainty about all possible solutions, a "rating" for each solution. Single maximum? Peaked? Broad? Valuable information. System integration: down-stream methods can account for regression uncertainty.
• What to do with it? We can use all this information to make more informed predictions. An analysis (e.g. of risk factors) of $\boldsymbol{w}$ has more information available.
[Figure: prior $P(\boldsymbol{w})$ vs. posterior $P(\boldsymbol{w} \mid D)$]
Bayesian Inference
Training data: $D = (\boldsymbol{X}, \boldsymbol{y})$
$P(\boldsymbol{w} \mid D) = \frac{1}{Z}\, P(Y \mid \boldsymbol{X}, \boldsymbol{w})\, P(\boldsymbol{w}) = \frac{1}{Z} \prod_i \mathcal{N}(y_i \mid \boldsymbol{w}^T\boldsymbol{x}_i, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
$= \frac{1}{Z'} \exp\left( -\frac{1}{2\sigma^2}\,\|\boldsymbol{X}^T\boldsymbol{w} - \boldsymbol{y}\|^2 - \frac{1}{2\sigma_w^2}\,\|\boldsymbol{w}\|^2 \right)$
$P(\boldsymbol{w} \mid D) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
The posterior is again a Gaussian! Its mean is the MAP estimate: $\boldsymbol{\mu} = \boldsymbol{w}_{\mathrm{MAP}}$
Bishop, PRML, section 3.3.1, pp. 152–156 (eq. 3.49–3.54), Springer 2006
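These two formulas translate directly into a short NumPy sketch (our own function; data layout as in the slides, points as columns):

```python
import numpy as np

def posterior(X, y, sigma, sigma_w):
    """Gaussian posterior N(w | mu, Sigma) of Bayesian linear regression.

    Sigma^{-1} = X X^T / sigma^2 + I / sigma_w^2,  mu = Sigma X y / sigma^2.
    """
    d = X.shape[0]
    Sigma_inv = X @ X.T / sigma**2 + np.eye(d) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ (X @ y) / sigma**2    # coincides with the MAP/ridge estimate
    return mu, Sigma
```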
Posterior of Linear Regression
$P(\boldsymbol{w} \mid D) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
[Figure: the posterior contracting as data arrives; no data, N=1, N=2, N=19]
Figs: Bishop PRML, 2006
Predictive Distribution
How to predict a label for a new data point? We now have very many solutions and know how well each one fits our training data and our prior assumptions.
• Prediction is probabilistic (a posterior for prediction/classification)
• The prediction should include all our knowledge about possible solutions (it should "average" over parameter values): $P(y \mid \boldsymbol{x}, D)$
• We only have a prediction for a single value of $\boldsymbol{w}$: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Averaging should respect the different quality of each $\boldsymbol{w}$: $P(\boldsymbol{w} \mid D)$. Bad solutions should not contribute, while we want to focus on good ones.
Predictive Distribution (II)
Example: polynomial fit (basis functions)
• Blue: data points
• Green line: generating process / ground truth
• Red line: best fit to the blue data points
• Shaded red: region of probable prediction
Tells us about the outcome's certainty!
Figs: Bishop PRML, 2006
Predictive Distribution: Calculation
The averaging method is called marginalization. The predictive distribution is
$P(y \mid \boldsymbol{x}, D) = \int P(y \mid \boldsymbol{x}, \boldsymbol{w})\, P(\boldsymbol{w} \mid D)\, \mathrm{d}\boldsymbol{w}$
$P(y \mid \boldsymbol{x}, D) = \int \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)\; \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, \mathrm{d}\boldsymbol{w}$
$P(y \mid \boldsymbol{x}, D) = \mathcal{N}(y \mid \boldsymbol{\mu}^T\boldsymbol{x},\; \sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}), \qquad \boldsymbol{\mu} = \frac{1}{\sigma^2}\boldsymbol{\Sigma}\boldsymbol{X}\boldsymbol{y}, \qquad \boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2}\boldsymbol{X}\boldsymbol{X}^T + \frac{1}{\sigma_w^2}\boldsymbol{I}$
The expected/best prediction is still linear.
Bishop, PRML, section 3.3.2, p. 156 (eq. 3.57–3.59), Springer 2006
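The resulting predictive distribution is equally short in code (a sketch continuing the posterior function above; x_new is an augmented input vector and the toy numbers are invented):

```python
import numpy as np

def predictive(x_new, mu, Sigma, sigma):
    """Predictive distribution N(y | mu^T x, sigma^2 + x^T Sigma x)."""
    mean = mu @ x_new                        # still linear in x
    var = sigma**2 + x_new @ Sigma @ x_new   # noise + parameter uncertainty
    return mean, var

# Toy usage with an invented 2-d posterior (bias + one feature):
mu, Sigma, sigma = np.array([1.0, 2.0]), 0.1 * np.eye(2), 0.5
print(predictive(np.array([1.0, 3.0]), mu, Sigma, sigma))
```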
Predictive Distribution: Result
• The prediction mean is linear: $\boldsymbol{\mu}^T\boldsymbol{x}$
• The prediction variance is a quadratic function: $\sigma^2 + \boldsymbol{x}^T\boldsymbol{\Sigma}\boldsymbol{x}$
The prediction now includes a quality estimate together with the actual prediction!
• The quality is higher where we have more data
• The certainty is bounded by our observations' uncertainty: the predictive variance never drops below $\sigma^2$
Uncertainty
• We calculated many probabilities. How are they to be interpreted? They can seem contradictory: why does the distribution change when we have more data? Shouldn't there be a real distribution $P(\boldsymbol{w})$?
• Bayesian inference relies on a subjective perspective: probability is used to express our current knowledge. It can change when we learn or see more: with more data, we are more certain about our result.
• Not subjective in the sense that it is arbitrary! There are quantitative rules to follow mathematically.
• Probability expresses an observer's certainty, often called belief.
Subjectivity: there is no single, real underlying distribution. A probability distribution expresses our knowledge. It is different in different situations and for different observers, since they have different knowledge.
Bayesian Inference
Bayesian inference is the mathematical tool to calculate changes in certainty when the underlying knowledge changes through observations: belief dynamics, belief updates.
Evolution of beliefs by conditioning on data according to Bayes' rule:
$P(x) \to P(x \mid D), \qquad P(x \mid D) = \dfrac{P(D \mid x)\, P(x)}{P(D)}$
$P(x) \to P(x \mid D_1) \to P(x \mid D_2) \to \cdots$
Conditioning is done with a likelihood model: how can the data be explained?
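For the Gaussian regression model, this belief update can be sketched one data point at a time; carrying the posterior in natural parameters ($\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ and $\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, a representation we choose for convenience) makes each conditioning step a simple addition; the data points below are invented:

```python
import numpy as np

def belief_update(Lmbd, eta, x, y, sigma):
    """One Bayes-rule step P(w|D_k) -> P(w|D_{k+1}) for Gaussian regression,
    in natural parameters Lmbd = Sigma^{-1}, eta = Sigma^{-1} mu."""
    return Lmbd + np.outer(x, x) / sigma**2, eta + y * x / sigma**2

# Start from the prior N(0, sigma_w^2 I) and condition on points one by one.
d, sigma, sigma_w = 2, 0.5, 1.0
Lmbd, eta = np.eye(d) / sigma_w**2, np.zeros(d)
for x_i, y_i in [(np.array([1.0, 0.5]), 2.0), (np.array([1.0, 2.0]), 5.1)]:
    Lmbd, eta = belief_update(Lmbd, eta, x_i, y_i, sigma)

mu = np.linalg.solve(Lmbd, eta)   # posterior mean after both updates
```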
Kernel Regression
A non-linear extension is possible and powerful:
• Kernels and sample-space expansion: kernel regression (expansion & kernel trick)
$\boldsymbol{w}^T\boldsymbol{x} = \sum_i \alpha_i\, \boldsymbol{x}_i^T\boldsymbol{x} \quad\longrightarrow\quad y = \sum_i \alpha_i\, k(\boldsymbol{x}_i, \boldsymbol{x})$
• With regularization (MAP with a Gaussian prior): kernel ridge regression. Least squares solution:
$\boldsymbol{\alpha}^* = (\boldsymbol{K} + \lambda\boldsymbol{I})^{-1}\boldsymbol{y}, \qquad \boldsymbol{K}_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$
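A compact sketch of kernel ridge regression with an RBF kernel (the kernel choice, function names, and row-wise data layout are ours, not the slides'):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha* = (K + lam I)^{-1} y

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha      # y(x) = sum_i alpha_i k(x_i, x)
```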
Kernel Ridge Regression
• Non-linear fitting with a proper kernel
[Figure: non-linear fits with KRR (Kernel Ridge Regression) and GPR (Gaussian Process Regression); scikit-learn.org, Jan Hendrik Metzen]
Gaussian Process
• $P(\boldsymbol{w})$ actually describes a distribution over functions. Every single value of $\boldsymbol{w}$ defines a linear function:
$f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x}, \qquad P(\boldsymbol{w}) \to P(f_{\boldsymbol{w}})$
• Gaussian Process: a Gaussian distribution over functions. It directly models the distribution of our regression functions $f$. Gaussian: they have a mean $\mu(\boldsymbol{x})$ and a covariance $k(\boldsymbol{x}, \boldsymbol{x}')$:
$f(\boldsymbol{x}) \sim GP(\mu(\boldsymbol{x}), k(\boldsymbol{x}, \boldsymbol{x}'))$
• Covariance functions are essentially the same thing as a kernel: they specify a similarity measure between points $\boldsymbol{x}$ ("similar" → high correlation)
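To see the "distribution over functions" concretely, a few lines suffice to draw sample functions from a GP prior with zero mean and a squared-exponential covariance (a standard kernel choice we assume here; not from the slides):

```python
import numpy as np

def sqexp(a, b, length=1.0):
    """Squared-exponential covariance k(x, x') for scalar inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

xs = np.linspace(-5.0, 5.0, 100)
K = sqexp(xs, xs) + 1e-8 * np.eye(len(xs))       # jitter for numerical stability

# Each row is one function f ~ GP(0, k) evaluated on the grid xs.
samples = np.random.default_rng(4).multivariate_normal(np.zeros(len(xs)), K, size=3)
```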
Gaussian Process (II)
• Fully Bayesian treatment of many problems thanks to the Gaussian structure: closed-form solutions are available (usually this needs Gaussian likelihoods as well)
• Due to the use of covariance functions (~kernels), this is also true for very complex non-linear models
• A very powerful framework:
• Non-linear Bayesian regression for machine learning
• E.g. full shape models which combine statistical approaches ("PCA") with more general assumptions, like "smoothness"
Gaussian Process: Shapes
Lüthi, Marcel, et al. "Gaussian Process Morphable Models." arXiv preprint arXiv:1603.07254 (2016).
http://shapemodelling.cs.unibas.ch/
https://www.futurelearn.com/courses/statistical-shape-modelling
Bayesian Model Selection
• Bayesian methods average over whole model classes $M$, e.g. over all $\boldsymbol{w}$ values for a given polynomial degree $M$
• The marginal likelihood captures the average fit of a model $M$ to given data $D$; it is the evidence for a given model class:
$P(D \mid M) = \int P(D \mid \boldsymbol{w}, M)\, P(\boldsymbol{w} \mid M)\, \mathrm{d}\boldsymbol{w}$
• Different models can be compared with respect to their marginal likelihoods: find models which fit a dataset well on average
• Model selection: select the best one
• Model "averaging": predict with a weighted average over all models
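For Bayesian linear regression this integral has a closed form: marginalizing $\boldsymbol{w} \sim \mathcal{N}(0, \sigma_w^2\boldsymbol{I})$ out of $\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{w} + \boldsymbol{\varepsilon}$ leaves $\boldsymbol{y} \sim \mathcal{N}(0,\; \sigma_w^2\boldsymbol{X}^T\boldsymbol{X} + \sigma^2\boldsymbol{I})$. A sketch of this evidence computation (our own function, SciPy assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, y, sigma, sigma_w):
    """Log marginal likelihood log P(y | X) for Bayesian linear regression.

    With w ~ N(0, sigma_w^2 I) and y = X^T w + noise, the marginal is
    y ~ N(0, sigma_w^2 X^T X + sigma^2 I); X holds data points as columns.
    """
    N = y.shape[0]
    C = sigma_w**2 * (X.T @ X) + sigma**2 * np.eye(N)
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(y)
```

Evaluating log_evidence on feature matrices of different polynomial degrees ranks the model classes, as on the following slides.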
Bayesian Model Selection (II)
• The marginal likelihood is a normalized distribution: it measures the degree of fit to data vs. the complexity of the model. Normalization has a natural regularization effect: in a complex model, many datasets are quite likely because there is some suitable parameter which explains each of them well (e.g. high-degree polynomials). But if many datasets have a high likelihood, the likelihood of an individual dataset must be rather low, because the marginal likelihood is normalized: the "area" is always 1.
[Figure: evidence over datasets for $M_1$ (simple model), $M_2$ (intermediate model), $M_3$ (complex model)]
Figs: Bishop PRML, 2006
Bayesian Model Selection (III)
• Evidence for the polynomial example:
[Figure: model evidence (log) as a function of the polynomial degree]
Figs: Bishop PRML, 2006
Summary: Regression
• Regression: machine learning with a continuous label
• Least squares regression: ML estimation; corresponds to a Gaussian observation error
• Regularization: MAP estimation
• Reduces overfitting
• Regularized least squares: ridge regression
• Bayesian regression
• Posterior distribution of regression models: $P(\boldsymbol{w} \mid D)$
• Average all models for a prediction: the predictive distribution $P(y \mid \boldsymbol{x}, D)$
• Uncertainty treatment with Bayesian inference: belief updates
Summary: Probabilistic Methods (I)
[Diagram: overview of methods, grouped by classification vs. regression and probabilistic vs. discriminative: Bayes Classifier, Naïve Bayes, Logistic Regression, Bayesian Regression, Least Squares Regression, Gaussian Process, and (discriminative, not probabilistic) SVM, Decision Tree, ANN, Perceptron.
Bold: part of this block; italic: not in this lecture.]
Summary: Probabilistic Methods (II)
• Build & construct a model, according to idea and concept: $P(y \mid \boldsymbol{x}, \boldsymbol{w})$
• Estimate parameters: maximum likelihood
• The idiomatic probabilistic way of learning
• Realize the shortcomings of the result, due to a lack of data
• Estimation with prior knowledge: MAP estimation
• MAP includes our knowledge about the problem in the estimation
• Full Bayesian treatment
• Express certainty by considering all possible solutions
• Weighted averaging: the weight is the degree of fit with the training data
Summary: Probabilistic Methods (III)

Method: Naïve Bayes, bag-of-words
• Model: $P(y \mid \mathbf{w}) \propto \prod_w P(w \mid y)\, P(y)$, with $P(w \mid y) = h_{w,y}$
• ML estimate: $h_w = \dfrac{N_w}{\sum_{w'} N_{w'}}$
• Shortcoming of ML: unseen words: zero counts
• Prior knowledge: pseudo-counts: each word already seen once
• MAP estimate: $h_w = \dfrac{N_w + 1}{\sum_{w'} (N_{w'} + 1)}$
• Bayes ($P(\boldsymbol{w} \mid D)$, $P(y \mid \boldsymbol{x}, D)$): (did not discuss) Latent Dirichlet Allocation

Method: Logistic Regression
• Model: $P(y \mid \boldsymbol{x}) = \dfrac{1}{1 + \exp(-\boldsymbol{w}^T\boldsymbol{x})}$
• ML estimate: $\sum_i \big(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\big)\,\boldsymbol{x}_i^T \stackrel{!}{=} 0$, Iteratively Reweighted Least Squares
• Shortcoming of ML: separable data: infinite certainty
• Prior knowledge: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma^2\boldsymbol{I})$
• MAP estimate: $\sum_i \big(y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\big)\,\boldsymbol{x}_i^T - \dfrac{1}{\sigma^2}\boldsymbol{w}^T \stackrel{!}{=} 0$, Iteratively Reweighted Least Squares
• Bayes: average the classifiers resulting from all possible classification hyperplanes

Method: Linear Regression
• Model: $P(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid \boldsymbol{w}^T\boldsymbol{x}, \sigma^2)$
• ML estimate: $\boldsymbol{w}_{\mathrm{ML}} = (\boldsymbol{X}\boldsymbol{X}^T)^{-1}\boldsymbol{X}\boldsymbol{y}$
• Shortcoming of ML: underdetermined solution & overfitting
• Prior knowledge: small $\boldsymbol{w}$, minimal influence of a feature: $P(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid 0, \sigma_w^2\boldsymbol{I})$
• MAP estimate: $\boldsymbol{w}_{\mathrm{MAP}} = (\boldsymbol{X}\boldsymbol{X}^T + \lambda\boldsymbol{I})^{-1}\boldsymbol{X}\boldsymbol{y}$
• Bayes: average over all $\boldsymbol{w}$, weighted by how well they explain the training data: a Gaussian with linear mean and quadratic covariance