Copyright © Andrew W. Moore Slide 1
Probabilistic and Bayesian Analytics
Andrew W. Moore, Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
Copyright © Andrew W. Moore Slide 2
Probability
• The world is a very uncertain place.
• 30 years of Artificial Intelligence and Database research danced around this fact.
• And then a few AI researchers decided to use some ideas from the eighteenth century.
• We will review the fundamentals of probability.
Copyright © Andrew W. Moore Slide 3
Discrete Random Variables: The Binary Case
• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples:
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have Ebola
Copyright © Andrew W. Moore Slide 4
Probabilities
• We write P(A), the probability of A, for “the fraction of all possible worlds in which A is true”.
• There are several ways of assigning probabilities, but we will not discuss them now.
Copyright © Andrew W. Moore Slide 5
Venn Diagram for visualizing A
Sample space of all possible worlds. Its area is 1.
Worlds in which A is false.
Worlds in which A is true.
P(A) = area of the oval.
Copyright © Andrew W. Moore Slide 6
The Axioms of Probability
• 0 <= P(A) <= 1
• P(A is true in every world) = 1
• P(A is true in no world) = 0
• P(A or B) = P(A) + P(B) if A and B are disjoint.
These axioms were introduced by the Russian school in the early 1900s: Kolmogorov, Liapunov, Khinchine, Chebyshev.
Copyright © Andrew W. Moore Slide 7
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
Copyright © Andrew W. Moore Slide 8
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have A true
Copyright © Andrew W. Moore Slide 9
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of overlapping events A and B.]
Copyright © Andrew W. Moore Slide 10
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of overlapping events A and B, with the regions P(A or B) and P(A and B) marked.]
Simple addition and subtraction.
Copyright © Andrew W. Moore Slide 11
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove: P(not A) = P(~A) = 1 - P(A)
• How?
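One way to see it: A and ~A are disjoint, and (A or ~A) = True, so the axioms give 1 = P(True) = P(A or ~A) = P(A) + P(~A); rearranging yields P(~A) = 1 - P(A).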
Copyright © Andrew W. Moore Slide 12
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B)
• How?
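One way: A = (A ^ B) or (A ^ ~B), and the two disjuncts are disjoint (their conjunction is False, which has probability 0), so the addition axiom gives P(A) = P(A ^ B) + P(A ^ ~B).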
Copyright © Andrew W. Moore Slide 13
Multivalued Random Variables
• Suppose A can take on more than 2 values.
• A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k}.
• Thus…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
Copyright © Andrew W. Moore Slide 14
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(A = v_1 or A = v_2 or … or A = v_i) = Σ_{j=1}^{i} P(A = v_j)
Copyright © Andrew W. Moore Slide 15
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(A = v_1 or A = v_2 or … or A = v_i) = Σ_{j=1}^{i} P(A = v_j)
• And thus we can prove
Σ_{j=1}^{k} P(A = v_j) = 1
Copyright © Andrew W. Moore Slide 16
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(B ^ [A = v_1 or A = v_2 or … or A = v_i]) = Σ_{j=1}^{i} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 17
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(B ^ [A = v_1 or A = v_2 or … or A = v_i]) = Σ_{j=1}^{i} P(B ^ A = v_j)
• And thus we can prove
P(B) = Σ_{j=1}^{k} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 18
Elementary Probability in Pictures
• P(~A) + P(A) = 1
Copyright © Andrew W. Moore Slide 19
Elementary Probability in Pictures
• P(B) = P(B ^ A) + P(B ^ ~A)
Copyright © Andrew W. Moore Slide 20
Elementary Probability in Pictures
Σ_{j=1}^{k} P(A = v_j) = 1
Copyright © Andrew W. Moore Slide 21
Elementary Probability in Pictures
P(B) = Σ_{j=1}^{k} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 22
Conditional Probability
• P(A|B) = the fraction of worlds in which B is true that also have A true.
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”
Copyright © Andrew W. Moore Slide 23
Conditional Probability
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(H|F) = fraction of flu worlds in which you have a headache
= (# worlds with flu and headache) / (# worlds with flu)
= (Area of “H and F” region) / (Area of “F” region)
= P(H ^ F) / P(F)
Copyright © Andrew W. Moore Slide 24
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Copyright © Andrew W. Moore Slide 25
Probabilistic Inference
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu.”
Is this reasoning good?
Copyright © Andrew W. Moore Slide 26
Probabilistic Inference
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(F ^ H) = P(F) P(H|F) = (1/40)(1/2) = 1/80
P(F|H) = P(F ^ H) / P(H) = (1/80)/(1/10) = 1/8 = 0.125
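A quick numeric check of this inference in R (a sketch; the variable names are ours, the numbers come from the slide):

pH   <- 1/10      # P(Headache)
pF   <- 1/40      # P(Flu)
pHgF <- 1/2       # P(Headache | Flu)
pFH  <- pF * pHgF     # chain rule: P(F ^ H) = P(H|F) P(F) = 1/80
pFgH <- pFH / pH      # definition: P(F | H) = P(F ^ H) / P(H)
pFgH                  # 0.125 -- an eighth, not the feared 50-50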
Copyright © Andrew W. Moore Slide 27
Another way to understand the intuition
Thanks to Jahanzeb Sherwani for contributing this explanation.
Copyright © Andrew W. Moore Slide 28
What we just did…
P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule. P(B) is called the prior probability, P(A|B) is called the likelihood, and P(B|A) is the posterior probability.
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
Copyright © Andrew W. Moore Slide 29
More General Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
Copyright © Andrew W. Moore Slide 30
More General Forms of Bayes Rule
P(A = v_i | B) = P(B | A = v_i) P(A = v_i) / Σ_{k=1}^{n_A} P(B | A = v_k) P(A = v_k)
Copyright © Andrew W. Moore Slide 31
Useful Easy-to-prove facts
P(A|B) + P(~A|B) = 1
Σ_{k=1}^{n_A} P(A = v_k | B) = 1
Copyright © Andrew W. Moore Slide 32
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
Copyright © Andrew W. Moore Slide 33
Using Bayes Rule in Games
The “Winner” envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The “Loser” envelope contains 3 beads (one red and two black) and no money.
Question: Someone draws an envelope at random and offers to sell it to you. How much should you pay for the envelope?
Copyright © Andrew W. Moore Slide 34
Using Bayes Rule to Gamble
The “Winner” envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The “Loser” envelope contains 3 beads (one red and two black) and no money.
Interesting question: Before deciding, you are allowed to see one bead drawn from the envelope.
Suppose it’s black: how much should you pay for the envelope? Suppose it’s red: how much should you pay?
Copyright © Andrew W. Moore Slide 35
Computation…
$1.00
P(Winner | black bead) = P(Winner ^ BB) / P(BB) = P(Winner) P(BB | Winner) / P(BB)
P(BB) = P(BB ^ Winner) + P(BB ^ Loser)
= P(Winner) P(BB | Winner) + P(Loser) P(BB | Loser)
= (1/2)(2/4) + (1/2)(2/3) = 7/12
P(Winner | black bead) = (1/4) / (7/12) = 3/7 ≈ 0.43, so you should pay at most about $0.43 for the envelope.
Now P(Winner | red bead) = P(Winner ^ RB) / P(RB) = (1/4) / [(1/2)(2/4) + (1/2)(1/3)] = (1/4) / (5/12) = 3/5 = 0.60
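The same computation in R (a sketch; all the names are ours):

pW <- 1/2; pL <- 1/2          # prior: Winner / Loser envelope
pB_W <- 2/4; pB_L <- 2/3      # P(black bead | Winner), P(black bead | Loser)
pR_W <- 2/4; pR_L <- 1/3      # P(red bead | Winner), P(red bead | Loser)
pB <- pW*pB_W + pL*pB_L       # P(black bead) = 7/12
pW*pB_W / pB                  # P(Winner | black bead) = 3/7, about 0.43
pR <- pW*pR_W + pL*pR_L       # P(red bead) = 5/12
pW*pR_W / pR                  # P(Winner | red bead) = 3/5 = 0.60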
Copyright © Andrew W. Moore Slide 36
Joint Distributions
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
A = 1 if the person is male and 0 if female; B = 1 if the person is married and 0 if not; C = 1 if the person is sick and 0 if not.
Copyright © Andrew W. Moore Slide 37
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Copyright © Andrew W. Moore Slide 38
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
[Venn diagram of A, B, C with each of the eight regions labeled by its probability.]
Copyright © Andrew W. Moore Slide 39
Data for the JD example
• The Census dataset: 48842 instances, a mix of continuous and discrete features (train = 32561, test = 16281); 45222 if instances with unknown values are removed (train = 30162, test = 15060).
Available at: http://ftp.ics.uci.edu/pub/machine-learning-databases
Donors: Ronny Kohavi and Barry Becker (1996).
Copyright © Andrew W. Moore Slide 40
Variables in Census
• 1-age: continuous.
• 2-workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• 3-fnlwgt (final weight): continuous.
• 4-education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• 5-education-num: continuous.
• 6-marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
• 7-occupation:
• 8-relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Copyright © Andrew W. Moore Slide 41
Variables in Census
• 9-race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• 10-sex: Female [0], Male [1].
• 11-capital-gain: continuous.
• 12-capital-loss: continuous.
• 13-hours-per-week: continuous.
• 14-native-country: nominal.
• 15-salary: >50K (rich) [2], <=50K (poor) [1].
Copyright © Andrew W. Moore Slide 42
First census records (train)
> traincensus[1:5,]
  age workcl   fwgt educ educn marital occup relat race sex gain loss hpwk
1  39      6  77516   13    13       3     9     4    1   1 2174    0   40
2  50      2  83311   13    13       1     5     3    1   1    0    0   13
3  38      1 215646    9     9       2     7     4    1   1    0    0   40
4  53      1 234721    7     7       1     7     3    5   1    0    0   40
5  28      1 338409   13    13       1     6     1    5   0    0    0   40
  country salary
1       1      1
2       1      1
3       1      1
4       1      1
5      13      1
>
Copyright © Andrew W. Moore Slide 43
Computing frequencies for sex
> countsex = table(traincensus$sex)
> probsex = countsex/32561
> countsex
    0     1
10771 21790
> probsex
        0         1
0.3307945 0.6692055
>
Copyright © Andrew W. Moore Slide 44
Computing the joint probabilities
> jdcensus = traincensus[,c(10,13,15)]
> dim(jdcensus)
[1] 32561 3
> jdcensus[1:3,]
  sex hpwk salary
1   1   40      1
2   1   13      1
3   1   40      1
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]
[1] 8220
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.2524492
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.02484567
Copyright © Andrew W. Moore Slide 45
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.0421363
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.01136329
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.3309174
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.09754
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.1336875
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.1070606
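A more compact alternative (a sketch; hpwkcat and jd are our own names): R's table() produces all eight joint probabilities at once.

# discretize hours-per-week at 40.5, then cross-tabulate
hpwkcat <- cut(jdcensus$hpwk, breaks = c(-Inf, 40.5, Inf),
               labels = c("<=40.5", ">40.5"))
jd <- table(sex = jdcensus$sex, hpwk = hpwkcat,
            salary = jdcensus$salary) / nrow(jdcensus)
jd    # a 2 x 2 x 2 array; its entries match the cells above and sum to 1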
Copyright © Andrew W. Moore Slide 46
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 47
Using the Joint
P(Poor ^ Male) = 0.4654
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 48
Using the Joint
P(Poor) = 0.7604
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 49
Inference with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
Copyright © Andrew W. Moore Slide 50
Inference with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
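The same inference can be read off the jd table from the sketch after Slide 45 (the "0"/"1"/"2" labels follow the coding on Slide 41, but are assumptions about the table's dimnames):

pPoorMale <- sum(jd["1", , "1"])   # sum P(row) over rows with sex==1 (male) and salary==1 (poor)
pPoor     <- sum(jd[ , , "1"])     # sum P(row) over rows with salary==1 (poor)
pPoorMale / pPoor                  # P(Male | Poor)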
Copyright © Andrew W. Moore Slide 51
Inference is a big deal
• I’ve got this evidence; what’s the probability this conclusion is true?
• I’ve got a stiff neck: what’s the probability I have meningitis?
• I see my lights are on and it’s 9pm. What’s the probability my spouse is already asleep?
• There are plenty of applications of Bayesian inference in medicine, pharmacology, machine fault diagnosis, etc.
Copyright © Andrew W. Moore Slide 52
Where do Joint Distributions come from?
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some algebra
Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2
P(B|~A) = 0.1
P(C|A^B) = 0.1
P(C|A^~B) = 0.8
P(C|~A^B) = 0.3
P(C|~A^~B) = 0.1
Then you can automatically compute the JD using the chain rule:
P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
In another lecture: Bayes Nets, a systematic way to do this.
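As a sketch, the chain-rule construction in R (all names are ours; the numbers are the ones above):

jd <- expand.grid(A = 0:1, B = 0:1, C = 0:1)               # the 8 rows of the JD
pA1 <- 0.7                                                 # P(A=1)
pB1 <- c("0" = 0.1, "1" = 0.2)                             # P(B=1 | A=a)
pC1 <- c("00" = 0.1, "01" = 0.3, "10" = 0.8, "11" = 0.1)   # P(C=1 | A=a, B=b)
pa <- ifelse(jd$A == 1, pA1, 1 - pA1)
pb <- ifelse(jd$B == 1, pB1[as.character(jd$A)], 1 - pB1[as.character(jd$A)])
pc <- ifelse(jd$C == 1, pC1[paste0(jd$A, jd$B)], 1 - pC1[paste0(jd$A, jd$B)])
jd$Prob <- pa * pb * pc
sum(jd$Prob)    # the eight probabilities sum to 1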
Copyright © Andrew W. Moore Slide 53
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to see one of the most impressive learning algorithms you’ll come across in the entire course….
Copyright © Andrew W. Moore Slide 54
Learning a joint distribution
Build a JD table for your attributes in which the probabilities are unspecified.
Then fill in each row with
P̂(row) = (# records matching row) / (total number of records)
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
(For example, 0.25 is the fraction of all records in which A and B are True but C is False.)
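In R this learning step is essentially one line (a sketch, assuming a data frame d with 0/1 columns A, B and C):

phat <- table(A = d$A, B = d$B, C = d$C) / nrow(d)   # P-hat(row) for all 8 rows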
Copyright © Andrew W. Moore Slide 55
Example of Learning a Joint
• This Joint was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
Copyright © Andrew W. Moore Slide 56
Where are we?
• We have recalled the fundamentals of probability.
• We have become content with what JDs are and how to use them.
• And we even know how to learn JDs from data.
Copyright © Andrew W. Moore Slide 57
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation.
• A Density Estimator learns a mapping from a set of attributes to a Probability:
Input Attributes → Density Estimator → Probability
Copyright © Andrew W. Moore Slide 58
Density Estimation
• Compared with the other two main kinds of model:
Input Attributes → Regressor → Prediction of real-valued output
Input Attributes → Density Estimator → Probability
Input Attributes → Classifier → Prediction of categorical output
Copyright © Andrew W. Moore Slide 59
Evaluating density estimation
Input Attributes → Regressor → Prediction of real-valued output: Test-set Accuracy
Input Attributes → Density Estimator → Probability: ?
Input Attributes → Classifier → Prediction of categorical output: Test-set Accuracy
Test-set criterion for estimating performance on future data. *See the Decision Tree or Cross Validation lecture for more detail.
Copyright © Andrew W. Moore Slide 60
Evaluating a density estimator
• Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is (under the assumption that all records were independently generated from the Density Estimator’s JD):
P̂(dataset|M) = P̂(x_1 ^ x_2 ^ … ^ x_R | M) = Π_{k=1}^{R} P̂(x_k|M)
Copyright © Andrew W. Moore Slide 61
The Auto-mpg dataset
Donor: Quinlan, R. (1993). Number of Instances: 398 minus 6 missing = 392 (training: 196, test: 196). Number of Attributes: 9 including the class attribute.
Attribute Information:
1. mpg: continuous (discretized: bad <= 25, good > 25)
2. cylinders: multi-valued discrete
3. displacement: continuous (discretized: low <= 200, high > 200)
4. horsepower: continuous (discretized: low <= 90, high > 90)
5. weight: continuous (discretized: low <= 3000, high > 3000)
6. acceleration: continuous (discretized: low <= 15, high > 15)
7. model year: multi-valued discrete (discretized: 70-74, 75-77, 78-82)
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Note: horsepower has 6 missing values.
Copyright © Andrew W. Moore Slide 62
The auto-mpg dataset
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
…
27.0 4 140.0 86.00 2790. 15.6 82 1 "ford mustang gl"
44.0 4 97.00 52.00 2130. 24.6 82 2 "vw pickup"
32.0 4 135.0 84.00 2295. 11.6 82 1 "dodge rampage"
28.0 4 120.0 79.00 2625. 18.6 82 1 "ford ranger"
31.0 4 119.0 82.00 2720. 19.4 82 1 "chevy s-10"
Copyright © Andrew W. Moore Slide 63
A subset of auto-mpg
      levelmpg modelyear maker
 [1,] "bad"    "70-74"   "america"
 [2,] "bad"    "70-74"   "america"
 [3,] "bad"    "70-74"   "america"
 [4,] "bad"    "70-74"   "america"
 [5,] "bad"    "70-74"   "america"
 [6,] "bad"    "70-74"   "america"
…
 [7,] "good"   "78-82"   "america"
 [8,] "good"   "78-82"   "europe"
 [9,] "good"   "78-82"   "america"
[10,] "good"   "78-82"   "america"
[11,] "good"   "78-82"   "america"
Copyright © Andrew W. Moore Slide 64
Mpg: Train set and test set
indices_samples = sample(1:392, 196)
mpgtrain = smallmpg[indices_samples,]
dim(mpgtrain)
[1] 196 3
mpgtest = smallmpg[-indices_samples,]
# computing the first joint probability
a1 = dim(mpgtrain[mpgtrain[,1]=="bad" & mpgtrain[,2]=="70-74"
                  & mpgtrain[,3]=="america",])[1] / 196   # mpgtrain has 196 rows
Copyright © Andrew W. Moore Slide 65
The joint density estimate
levelmpg modelyear maker   probability
bad      70-74     america 0.2397959
bad      70-74     asia    0.01530612
bad      70-74     europe  0.03061224
bad      75-77     america 0.2091837
bad      75-77     asia    0.01020408
bad      75-77     europe  0.04591837
bad      78-82     america 0.04591837
bad      78-82     asia    0.005102041
bad      78-82     europe  0
good     70-74     america 0.02040816
good     70-74     asia    0.01530612
good     70-74     europe  0.03061224
good     75-77     america 0.04081633
good     75-77     asia    0.04081633
good     75-77     europe  0.02551020
good     78-82     america 0.1071429
good     78-82     asia    0.07142857
good     78-82     europe  0.04591837
Copyright © Andrew W. Moore Slide 66
P̂(mpgtest|M) = P̂(x_1 ^ x_2 ^ … ^ x_196 | M) = Π_{k=1}^{196} P̂(x_k|M) = 1.45 × 10^-222 (in this case)
This probability has been computed by treating any joint probability as at least 1/10^20 = 10^-20.
Copyright © Andrew W. Moore Slide 67
A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
192 Training Set Records
mpg modelyear maker
good 75to78 asia
bad  70to74 america
bad  75to78 europe
bad  70to74 america
bad  70to74 america
bad  70to74 asia
bad  70to74 asia
bad  75to78 america
:    :      :
:    :      :
:    :      :
bad  70to74 america
good 79to83 america
bad  75to78 america
good 79to83 america
bad  75to78 america
good 79to83 america
good 79to83 america
bad  70to74 america
good 75to78 europe
bad  75to78 europe
Copyright © Andrew W. Moore Slide 68
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
Copyright © Andrew W. Moore Slide 69
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
P̂(dataset|M) = P̂(x_1 ^ x_2 ^ … ^ x_196 | M) = Π_{k=1}^{196} P̂(x_k|M) = 3.4 × 10^-203 (in this case)
“dataset” can be mpgtrain or mpgtest.
Copyright © Andrew W. Moore Slide 70
Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities:
log P̂(dataset|M) = log Π_{k=1}^{R} P̂(x_k|M) = Σ_{k=1}^{R} log P̂(x_k|M)
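A sketch of this in R, assuming the mpgtrain/mpgtest split from Slide 64 (key and phat are our own names):

key  <- function(d) paste(d[,1], d[,2], d[,3])     # one string per record
phat <- table(key(mpgtrain)) / nrow(mpgtrain)      # learned joint, P-hat(row)
p    <- phat[key(mpgtest)]                         # per-record probabilities
p[is.na(p)] <- 1e-20      # floor unseen rows at 10^-20, as on Slide 76
sum(log(p))               # log-likelihood of the test set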
Copyright © Andrew W. Moore Slide 71
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
log P̂(dataset|M) = Σ_{k=1}^{R} log P̂(x_k|M) = -466.19 (in this case)
In our case = -510.79.
Copyright © Andrew W. Moore Slide 72
Summary: The Good News
• We have seen a way to learn a density estimator from data.
• Density estimators can be used to:
• Rank the records by probability, and so detect outliers.
• Do inference: compute P(E1|E2).
• Build Bayesian classifiers (seen later).
Copyright © Andrew W. Moore Slide 73
Summary: The Bad News
• Density estimation by directly learning the joint is trivial, mindless and dangerous.
Copyright © Andrew W. Moore Slide 74
Using a test set
An independent test set with 196 cars has a worse log likelihood
(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
…Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
Copyright © Andrew W. Moore Slide 75
Overfitting Density Estimators
If this ever happens, it means there are certain combinations that we learn are impossible:
log P̂(testset|M) = Σ_{k=1}^{R} log P̂(x_k|M) = -∞ if P̂(x_k|M) = 0 for any x_k
Copyright © Andrew W. Moore Slide 76
Using a test set
The only reason that our test set didn’t score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20.
We need Density Estimators that are less prone to overfitting
Copyright © Andrew W. Moore Slide 77
Naïve Density Estimation
The problem with the joint density estimator is that it simply mirrors the training sample.
We need something that generalizes more usefully.
The naïve model generalizes strongly:
It assumes that each attribute is distributed independently of the others.
Copyright © Andrew W. Moore Slide 78
A note about independence
• Assume A and B are Boolean Random Variables. Then “A and B are independent” if and only if P(A|B) = P(A)
• “A and B are independent” is often notated as A ⊥ B
Copyright © Andrew W. Moore Slide 79
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(A^B) = P(A|B) P(B) = P(A) P(B)
(Since P(A|B) = P(A^B)/P(B), independence gives P(A) = P(A^B)/P(B), i.e. P(A^B) = P(A) P(B).)
• Assume P(A|B) = P(A)
• Then P(B|A) = P(A^B)/P(A); by the previous result this is P(A) P(B)/P(A) = P(B)
Copyright © Andrew W. Moore Slide 80
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(~A|B) = P(~A^B)/P(B) = [P(B) - P(A^B)]/P(B) = 1 - P(A^B)/P(B) = 1 - P(A|B) = 1 - P(A) = P(~A)
• Assume P(A|B) = P(A)
• Then P(A|~B) = P(A^~B)/P(~B) = [P(A) - P(A^B)]/[1 - P(B)] = [P(A) - P(A)P(B)]/[1 - P(B)] = P(A)[1 - P(B)]/[1 - P(B)] = P(A)
Copyright © Andrew W. Moore Slide 81
Multivalued Independence
For multivalued Random Variables A and B,
A ⊥ B if and only if
∀u,v: P(A = u | B = v) = P(A = u)
from which you can then prove things like…
∀u,v: P(A = u ^ B = v) = P(A = u) P(B = v)
∀u,v: P(B = v | A = u) = P(B = v)
Copyright © Andrew W. Moore Slide 82
Independently Distributed Data
• Let x[i] denote the i’th field of record x.
• The independently distributed assumption says that for any i, v, u_1, u_2, … u_{i-1}, u_{i+1}, … u_M:
P(x[i] = v | x[1] = u_1, x[2] = u_2, …, x[i-1] = u_{i-1}, x[i+1] = u_{i+1}, …, x[M] = u_M) = P(x[i] = v)
• Or in other words, x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• This is often written as x[i] ⊥ {x[1], x[2], … x[i-1], x[i+1], … x[M]}
Copyright © Andrew W. Moore Slide 83
Back to Naïve Density Estimation
• Let x[i] denote the i’th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• Example:
• Suppose that each record is generated by randomly shaking a green die and a red die
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
Copyright © Andrew W. Moore Slide 84
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
Copyright © Andrew W. Moore Slide 85
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
P(A ^ ~B ^ C ^ ~D)
= P(A | ~B ^ C ^ ~D) P(~B ^ C ^ ~D)
= P(A) P(~B ^ C ^ ~D)
= P(A) P(~B | C ^ ~D) P(C ^ ~D)
= P(A) P(~B) P(C ^ ~D)
= P(A) P(~B) P(C | ~D) P(~D)
= P(A) P(~B) P(C) P(~D)
Copyright © Andrew W. Moore Slide 86
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed. Then:
P(x[1] = u_1, x[2] = u_2, …, x[M] = u_M) = Π_{k=1}^{M} P(x[k] = u_k)
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference.
• But how do we learn a Naïve Density Estimator?
Copyright © Andrew W. Moore Slide 87
Learning a Naïve Density Estimator
P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)
Another trivial learning algorithm!
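As a sketch in R (assuming the mpgtrain data frame from Slide 64; naive is our own name): learn one marginal table per attribute, then multiply marginals to score any row.

naive <- lapply(mpgtrain, function(col) table(col) / length(col))
# naive estimate of P(levelmpg="bad" ^ modelyear="70-74" ^ maker="america")
naive[["levelmpg"]]["bad"] * naive[["modelyear"]]["70-74"] * naive[["maker"]]["america"]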
Copyright © Andrew W. Moore Slide 88
Comparison
Joint DE: Can model anything.
Naïve DE: Can model only very boring distributions.
Joint DE: Can model a noisy copy of the original sample.
Naïve DE: Cannot do that.
Joint DE: Given 100 records and more than 6 Boolean attributes, it will screw up badly.
Naïve DE: Given 100 records and 10,000 multivalued attributes, it will be fine.
Copyright © Andrew W. Moore Slide 89
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes.
[Figures: a tiny part of the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 90
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes.
[Figures repeated from the previous slide.]
Copyright © Andrew W. Moore Slide 91
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes.
[Figures: the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 92
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes.
[Figures repeated from the previous slide.]
Copyright © Andrew W. Moore Slide 93
“Weight vs. MPG”: The best that Naïve can do
[Figures: the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 94
Reminder: The Good News
• We have two ways to learn a density estimator from data.
• *In other lectures we will see more sophisticated density estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more).
• Density estimators can be used to:
• Detect anomalies (“outliers”).
• Do inference: compute P(E1|E2).
• Build Bayesian classifiers.