Copyright © Andrew W. Moore Slide 1
Probabilistic and Bayesian Analytics
Andrew W. Moore, Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
Copyright © Andrew W. Moore Slide 2
Probability
• The world is a very uncertain place.
• 30 years of Artificial Intelligence and Database research danced around this fact.
• And then a few AI researchers decided to use some ideas from the eighteenth century.
• We will review the fundamentals of probability.
Copyright © Andrew W. Moore Slide 3
Discrete Random Variables: The Binary Case
• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples:
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have Ebola
Copyright © Andrew W. Moore Slide 4
Probabilities
• We write P(A), the probability of A, for “the fraction of all possible worlds in which A is true”.
• There are several ways of assigning probabilities, but we will not discuss them now.
Copyright © Andrew W. Moore Slide 5
Venn Diagram for visualizing A
Sample space of all possible worlds. Its area is 1.
Worlds in which A is false.
Worlds in which A is true.
P(A) = area of the oval.
Copyright © Andrew W. Moore Slide 6
The Axioms of Probability
• 0 <= P(A) <= 1
• P(A is true in every world) = 1
• P(A is true in no world) = 0
• P(A or B) = P(A) + P(B) if A and B are disjoint.
These axioms were introduced by the Russian school in the early 1900s: Kolmogorov, Liapunov, Khinchine, Chebyshev.
Copyright © Andrew W. Moore Slide 7
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
Copyright © Andrew W. Moore Slide 8
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have A true
Copyright © Andrew W. Moore Slide 9
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of overlapping events A and B.]
Copyright © Andrew W. Moore Slide 10
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Venn diagram of overlapping events A and B, with the regions P(A or B) and P(A and B) marked.]
Simple addition and subtraction.
Copyright © Andrew W. Moore Slide 11
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove: P(not A) = P(~A) = 1 - P(A)
• How?
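One way to see it: A and ~A are disjoint, and (A or ~A) = True, so the axioms give 1 = P(True) = P(A or ~A) = P(A) + P(~A); rearranging yields P(~A) = 1 - P(A).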
Copyright © Andrew W. Moore Slide 12
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B)
• How?
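One way: A = (A ^ B) or (A ^ ~B), and the two disjuncts are disjoint (their conjunction is False, which has probability 0), so the addition axiom gives P(A) = P(A ^ B) + P(A ^ ~B).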
Copyright © Andrew W. Moore Slide 13
Multivalued Random Variables
• Suppose A can take on more than 2 values.
• A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, …, v_k}.
• Thus…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
Copyright © Andrew W. Moore Slide 14
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(A = v_1 or A = v_2 or … or A = v_i) = Σ_{j=1}^{i} P(A = v_j)
Copyright © Andrew W. Moore Slide 15
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(A = v_1 or A = v_2 or … or A = v_i) = Σ_{j=1}^{i} P(A = v_j)
• And thus we can prove
Σ_{j=1}^{k} P(A = v_j) = 1
Copyright © Andrew W. Moore Slide 16
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(B ^ [A = v_1 or A = v_2 or … or A = v_i]) = Σ_{j=1}^{i} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 17
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = v_i ^ A = v_j) = 0 if i ≠ j
P(A = v_1 or A = v_2 or … or A = v_k) = 1
• It’s easy to prove that
P(B ^ [A = v_1 or A = v_2 or … or A = v_i]) = Σ_{j=1}^{i} P(B ^ A = v_j)
• And thus we can prove
P(B) = Σ_{j=1}^{k} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 18
Elementary Probability in Pictures
• P(~A) + P(A) = 1
Copyright © Andrew W. Moore Slide 19
Elementary Probability in Pictures
• P(B) = P(B ^ A) + P(B ^ ~A)
Copyright © Andrew W. Moore Slide 20
Elementary Probability in Pictures
Σ_{j=1}^{k} P(A = v_j) = 1
Copyright © Andrew W. Moore Slide 21
Elementary Probability in Pictures
P(B) = Σ_{j=1}^{k} P(B ^ A = v_j)
Copyright © Andrew W. Moore Slide 22
Conditional Probability
• P(A|B) = the fraction of worlds in which B is true that also have A true.
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”
Copyright © Andrew W. Moore Slide 23
Conditional Probability
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(H|F) = fraction of flu worlds in which you have a headache
= (# worlds with flu and headache) / (# worlds with flu)
= (Area of “H and F” region) / (Area of “F” region)
= P(H ^ F) / P(F)
Copyright © Andrew W. Moore Slide 24
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Copyright © Andrew W. Moore Slide 25
Probabilistic Inference
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu.”
Is this reasoning good?
Copyright © Andrew W. Moore Slide 26
Probabilistic Inference
[Venn diagram of events F and H.]
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(F ^ H) = P(F) P(H|F) = (1/40)(1/2) = 1/80
P(F|H) = P(F ^ H) / P(H) = (1/80)/(1/10) = 1/8 = 0.125
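A quick numeric check of this inference in R (a sketch; the variable names are ours, the numbers come from the slide):

pH   <- 1/10      # P(Headache)
pF   <- 1/40      # P(Flu)
pHgF <- 1/2       # P(Headache | Flu)
pFH  <- pF * pHgF     # chain rule: P(F ^ H) = P(H|F) P(F) = 1/80
pFgH <- pFH / pH      # definition: P(F | H) = P(F ^ H) / P(H)
pFgH                  # 0.125 -- an eighth, not the feared 50-50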
Copyright © Andrew W. Moore Slide 27
Another way to understand the intuition
Thanks to Jahanzeb Sherwani for contributing this explanation.
Copyright © Andrew W. Moore Slide 28
What we just did…
P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule. P(B) is called the prior probability, P(A|B) is called the likelihood, and P(B|A) is the posterior probability.
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
Copyright © Andrew W. Moore Slide 29
More General Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
Copyright © Andrew W. Moore Slide 30
More General Forms of Bayes Rule
P(A = v_i | B) = P(B | A = v_i) P(A = v_i) / Σ_{k=1}^{n_A} P(B | A = v_k) P(A = v_k)
Copyright © Andrew W. Moore Slide 31
Useful Easy-to-prove facts
P(A|B) + P(~A|B) = 1
Σ_{k=1}^{n_A} P(A = v_k | B) = 1
Copyright © Andrew W. Moore Slide 32
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
Copyright © Andrew W. Moore Slide 33
Using Bayes Rule in Games
The “Winner” envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The “Loser” envelope contains 3 beads (one red and two black) and no money.
Question: Someone draws an envelope at random and offers to sell it to you. How much should you pay for the envelope?
Copyright © Andrew W. Moore Slide 34
Using Bayes Rule to Gamble
The “Winner” envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The “Loser” envelope contains 3 beads (one red and two black) and no money.
Interesting question: Before deciding, you are allowed to see one bead drawn from the envelope.
Suppose it’s black: how much should you pay for the envelope? Suppose it’s red: how much should you pay?
Copyright © Andrew W. Moore Slide 35
Computation…
$1.00
P(Winner | black bead) = P(Winner ^ BB) / P(BB) = P(Winner) P(BB | Winner) / P(BB)
P(BB) = P(BB ^ Winner) + P(BB ^ Loser)
= P(Winner) P(BB | Winner) + P(Loser) P(BB | Loser)
= (1/2)(2/4) + (1/2)(2/3) = 7/12
P(Winner | black bead) = (1/4) / (7/12) = 3/7 ≈ 0.43, so you should pay at most about $0.43 for the envelope.
Now P(Winner | red bead) = P(Winner ^ RB) / P(RB) = (1/4) / [(1/2)(2/4) + (1/2)(1/3)] = (1/4) / (5/12) = 3/5 = 0.60
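The same computation in R (a sketch; all the names are ours):

pW <- 1/2; pL <- 1/2          # prior: Winner / Loser envelope
pB_W <- 2/4; pB_L <- 2/3      # P(black bead | Winner), P(black bead | Loser)
pR_W <- 2/4; pR_L <- 1/3      # P(red bead | Winner), P(red bead | Loser)
pB <- pW*pB_W + pL*pB_L       # P(black bead) = 7/12
pW*pB_W / pB                  # P(Winner | black bead) = 3/7, about 0.43
pR <- pW*pR_W + pL*pR_L       # P(red bead) = 5/12
pW*pR_W / pR                  # P(Winner | red bead) = 3/5 = 0.60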
Copyright © Andrew W. Moore Slide 36
Joint Distributions
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
A = 1 if the person is male and 0 if female; B = 1 if the person is married and 0 if not; C = 1 if the person is sick and 0 if not.
Copyright © Andrew W. Moore Slide 37
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Copyright © Andrew W. Moore Slide 38
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
[Venn diagram of A, B, C with each of the eight regions labeled by its probability.]
Copyright © Andrew W. Moore Slide 39
Data for the JD example
• The Census dataset: 48842 instances, a mix of continuous and discrete features (train = 32561, test = 16281); 45222 if instances with unknown values are removed (train = 30162, test = 15060).
Available at: http://ftp.ics.uci.edu/pub/machine-learning-databases
Donors: Ronny Kohavi and Barry Becker (1996).
Copyright © Andrew W. Moore Slide 40
Variables in Census
• 1-age: continuous.
• 2-workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• 3-fnlwgt (final weight): continuous.
• 4-education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• 5-education-num: continuous.
• 6-marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
• 7-occupation:
• 8-relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Copyright © Andrew W. Moore Slide 41
Variables in Census
• 9-race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• 10-sex: Female [0], Male [1].
• 11-capital-gain: continuous.
• 12-capital-loss: continuous.
• 13-hours-per-week: continuous.
• 14-native-country: nominal.
• 15-salary: >50K (rich) [2], <=50K (poor) [1].
Copyright © Andrew W. Moore Slide 42
First census records (train)
> traincensus[1:5,]
  age workcl   fwgt educ educn marital occup relat race sex gain loss hpwk
1  39      6  77516   13    13       3     9     4    1   1 2174    0   40
2  50      2  83311   13    13       1     5     3    1   1    0    0   13
3  38      1 215646    9     9       2     7     4    1   1    0    0   40
4  53      1 234721    7     7       1     7     3    5   1    0    0   40
5  28      1 338409   13    13       1     6     1    5   0    0    0   40
  country salary
1       1      1
2       1      1
3       1      1
4       1      1
5      13      1
>
Copyright © Andrew W. Moore Slide 43
Computing frequencies for sex
> countsex = table(traincensus$sex)
> probsex = countsex/32561
> countsex
    0     1
10771 21790
> probsex
        0         1
0.3307945 0.6692055
>
Copyright © Andrew W. Moore Slide 44
Computing the joint probabilities
> jdcensus = traincensus[,c(10,13,15)]
> dim(jdcensus)
[1] 32561 3
> jdcensus[1:3,]
  sex hpwk salary
1   1   40      1
2   1   13      1
3   1   40      1
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]
[1] 8220
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.2524492
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.02484567
Copyright © Andrew W. Moore Slide 45
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.0421363
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.01136329
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.3309174
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.09754
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.1336875
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.1070606
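A more compact alternative (a sketch; hpwkcat and jd are our own names): R's table() produces all eight joint probabilities at once.

# discretize hours-per-week at 40.5, then cross-tabulate
hpwkcat <- cut(jdcensus$hpwk, breaks = c(-Inf, 40.5, Inf),
               labels = c("<=40.5", ">40.5"))
jd <- table(sex = jdcensus$sex, hpwk = hpwkcat,
            salary = jdcensus$salary) / nrow(jdcensus)
jd    # a 2 x 2 x 2 array; its entries match the cells above and sum to 1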
Copyright © Andrew W. Moore Slide 46
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 47
Using the Joint
P(Poor ^ Male) = 0.4654
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 48
Using the Joint
P(Poor) = 0.7604
P(E) = Σ_{rows matching E} P(row)
Copyright © Andrew W. Moore Slide 49
Inference with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
Copyright © Andrew W. Moore Slide 50
Inference with the Joint
P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
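The same inference can be read off the jd table from the sketch after Slide 45 (the "0"/"1"/"2" labels follow the coding on Slide 41, but are assumptions about the table's dimnames):

pPoorMale <- sum(jd["1", , "1"])   # sum P(row) over rows with sex==1 (male) and salary==1 (poor)
pPoor     <- sum(jd[ , , "1"])     # sum P(row) over rows with salary==1 (poor)
pPoorMale / pPoor                  # P(Male | Poor)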
Copyright © Andrew W. Moore Slide 51
Inference is a big deal
• I’ve got this evidence; what’s the probability this conclusion is true?
• I’ve got a stiff neck: what’s the probability I have meningitis?
• I see my lights are on and it’s 9pm. What’s the probability my spouse is already asleep?
• There are plenty of applications of Bayesian inference in medicine, pharmacology, machine fault diagnosis, etc.
Copyright © Andrew W. Moore Slide 52
Where do Joint Distributions come from?
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some algebra
Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2
P(B|~A) = 0.1
P(C|A^B) = 0.1
P(C|A^~B) = 0.8
P(C|~A^B) = 0.3
P(C|~A^~B) = 0.1
Then you can automatically compute the JD using the chain rule:
P(A=x ^ B=y ^ C=z) = P(C=z | A=x ^ B=y) P(B=y | A=x) P(A=x)
In another lecture: Bayes Nets, a systematic way to do this.
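As a sketch, the chain-rule construction in R (all names are ours; the numbers are the ones above):

jd <- expand.grid(A = 0:1, B = 0:1, C = 0:1)               # the 8 rows of the JD
pA1 <- 0.7                                                 # P(A=1)
pB1 <- c("0" = 0.1, "1" = 0.2)                             # P(B=1 | A=a)
pC1 <- c("00" = 0.1, "01" = 0.3, "10" = 0.8, "11" = 0.1)   # P(C=1 | A=a, B=b)
pa <- ifelse(jd$A == 1, pA1, 1 - pA1)
pb <- ifelse(jd$B == 1, pB1[as.character(jd$A)], 1 - pB1[as.character(jd$A)])
pc <- ifelse(jd$C == 1, pC1[paste0(jd$A, jd$B)], 1 - pC1[paste0(jd$A, jd$B)])
jd$Prob <- pa * pb * pc
sum(jd$Prob)    # the eight probabilities sum to 1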
Copyright © Andrew W. Moore Slide 53
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to see one of the most impressive learning algorithms you’ll come across in the entire course….
Copyright © Andrew W. Moore Slide 54
Learning a joint distribution
Build a JD table for your attributes in which the probabilities are unspecified.
Then fill in each row with
P̂(row) = (# records matching row) / (total number of records)
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
(For example, 0.25 is the fraction of all records in which A and B are True but C is False.)
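In R this learning step is essentially one line (a sketch, assuming a data frame d with 0/1 columns A, B and C):

phat <- table(A = d$A, B = d$B, C = d$C) / nrow(d)   # P-hat(row) for all 8 rows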
Copyright © Andrew W. Moore Slide 55
Example of Learning a Joint
• This Joint was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
Copyright © Andrew W. Moore Slide 56
Where are we?
• We have recalled the fundamentals of probability.
• We have become content with what JDs are and how to use them.
• And we even know how to learn JDs from data.
Copyright © Andrew W. Moore Slide 57
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation.
• A Density Estimator learns a mapping from a set of attributes to a Probability:
Input Attributes → Density Estimator → Probability
Copyright © Andrew W. Moore Slide 58
Density Estimation
• Compared with the other two main kinds of model:
Input Attributes → Regressor → Prediction of real-valued output
Input Attributes → Density Estimator → Probability
Input Attributes → Classifier → Prediction of categorical output
Copyright © Andrew W. Moore Slide 59
Evaluating density estimation
Input Attributes → Regressor → Prediction of real-valued output: Test-set Accuracy
Input Attributes → Density Estimator → Probability: ?
Input Attributes → Classifier → Prediction of categorical output: Test-set Accuracy
Test-set criterion for estimating performance on future data. *See the Decision Tree or Cross Validation lecture for more detail.
Copyright © Andrew W. Moore Slide 60
Evaluating a density estimator
• Given a record x, a density estimator M can tell you how likely the record is: P̂(x|M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is (under the assumption that all records were independently generated from the Density Estimator’s JD):
P̂(dataset|M) = P̂(x_1 ^ x_2 ^ … ^ x_R | M) = Π_{k=1}^{R} P̂(x_k|M)
Copyright © Andrew W. Moore Slide 61
The Auto-mpg dataset
Donor: Quinlan, R. (1993). Number of Instances: 398 minus 6 missing = 392 (training: 196, test: 196). Number of Attributes: 9 including the class attribute.
Attribute Information:
1. mpg: continuous (discretized: bad <= 25, good > 25)
2. cylinders: multi-valued discrete
3. displacement: continuous (discretized: low <= 200, high > 200)
4. horsepower: continuous (discretized: low <= 90, high > 90)
5. weight: continuous (discretized: low <= 3000, high > 3000)
6. acceleration: continuous (discretized: low <= 15, high > 15)
7. model year: multi-valued discrete (discretized: 70-74, 75-77, 78-82)
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Note: horsepower has 6 missing values.
Copyright © Andrew W. Moore Slide 62
The auto-mpg dataset
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
…
27.0 4 140.0 86.00 2790. 15.6 82 1 "ford mustang gl"
44.0 4 97.00 52.00 2130. 24.6 82 2 "vw pickup"
32.0 4 135.0 84.00 2295. 11.6 82 1 "dodge rampage"
28.0 4 120.0 79.00 2625. 18.6 82 1 "ford ranger"
31.0 4 119.0 82.00 2720. 19.4 82 1 "chevy s-10"
Copyright © Andrew W. Moore Slide 63
A subset of auto-mpg
      levelmpg modelyear maker
 [1,] "bad"    "70-74"   "america"
 [2,] "bad"    "70-74"   "america"
 [3,] "bad"    "70-74"   "america"
 [4,] "bad"    "70-74"   "america"
 [5,] "bad"    "70-74"   "america"
 [6,] "bad"    "70-74"   "america"
…
 [7,] "good"   "78-82"   "america"
 [8,] "good"   "78-82"   "europe"
 [9,] "good"   "78-82"   "america"
[10,] "good"   "78-82"   "america"
[11,] "good"   "78-82"   "america"
Copyright © Andrew W. Moore Slide 64
Mpg: Train set and test set
indices_samples = sample(1:392, 196)
mpgtrain = smallmpg[indices_samples,]
dim(mpgtrain)
[1] 196 3
mpgtest = smallmpg[-indices_samples,]
# computing the first joint probability
a1 = dim(mpgtrain[mpgtrain[,1]=="bad" & mpgtrain[,2]=="70-74"
                  & mpgtrain[,3]=="america",])[1] / 196   # mpgtrain has 196 rows
Copyright © Andrew W. Moore Slide 65
The joint density estimate
levelmpg modelyear maker   probability
bad      70-74     america 0.2397959
bad      70-74     asia    0.01530612
bad      70-74     europe  0.03061224
bad      75-77     america 0.2091837
bad      75-77     asia    0.01020408
bad      75-77     europe  0.04591837
bad      78-82     america 0.04591837
bad      78-82     asia    0.005102041
bad      78-82     europe  0
good     70-74     america 0.02040816
good     70-74     asia    0.01530612
good     70-74     europe  0.03061224
good     75-77     america 0.04081633
good     75-77     asia    0.04081633
good     75-77     europe  0.02551020
good     78-82     america 0.1071429
good     78-82     asia    0.07142857
good     78-82     europe  0.04591837
Copyright © Andrew W. Moore Slide 66
P̂(mpgtest|M) = P̂(x_1 ^ x_2 ^ … ^ x_196 | M) = Π_{k=1}^{196} P̂(x_k|M) = 1.45 × 10^-222 (in this case)
This probability has been computed by treating any joint probability as at least 1/10^20 = 10^-20.
Copyright © Andrew W. Moore Slide 67
A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
192 Training Set Records
mpg modelyear maker
good 75to78 asia
bad  70to74 america
bad  75to78 europe
bad  70to74 america
bad  70to74 america
bad  70to74 asia
bad  70to74 asia
bad  75to78 america
:    :      :
:    :      :
:    :      :
bad  70to74 america
good 79to83 america
bad  75to78 america
good 79to83 america
bad  75to78 america
good 79to83 america
good 79to83 america
bad  70to74 america
good 75to78 europe
bad  75to78 europe
Copyright © Andrew W. Moore Slide 68
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
Copyright © Andrew W. Moore Slide 69
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
P̂(dataset|M) = P̂(x_1 ^ x_2 ^ … ^ x_196 | M) = Π_{k=1}^{196} P̂(x_k|M) = 3.4 × 10^-203 (in this case)
“dataset” can be mpgtrain or mpgtest.
Copyright © Andrew W. Moore Slide 70
Log Probabilities
Since probabilities of datasets get so small we usually use log probabilities:
log P̂(dataset|M) = log Π_{k=1}^{R} P̂(x_k|M) = Σ_{k=1}^{R} log P̂(x_k|M)
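A sketch of this in R, assuming the mpgtrain/mpgtest split from Slide 64 (key and phat are our own names):

key  <- function(d) paste(d[,1], d[,2], d[,3])     # one string per record
phat <- table(key(mpgtrain)) / nrow(mpgtrain)      # learned joint, P-hat(row)
p    <- phat[key(mpgtest)]                         # per-record probabilities
p[is.na(p)] <- 1e-20      # floor unseen rows at 10^-20, as on Slide 76
sum(log(p))               # log-likelihood of the test set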
Copyright © Andrew W. Moore Slide 71
A small dataset: Miles Per Gallon
192 Training Set Records
mpg modelyear maker
(same 192-record table as Slide 67)
log P̂(dataset|M) = Σ_{k=1}^{R} log P̂(x_k|M) = -466.19 (in this case)
In our case = -510.79.
Copyright © Andrew W. Moore Slide 72
Summary: The Good News
• We have seen a way to learn a density estimator from data.
• Density estimators can be used to:
• Rank the records by probability, and so detect outliers.
• Do inference: compute P(E1|E2).
• Build Bayesian classifiers (seen later).
Copyright © Andrew W. Moore Slide 73
Summary: The Bad News
• Density estimation by directly learning the joint is trivial, mindless and dangerous.
Copyright © Andrew W. Moore Slide 74
Using a test set
An independent test set with 196 cars has a worse log likelihood
(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
…Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
Copyright © Andrew W. Moore Slide 75
Overfitting Density Estimators
If this ever happens, it means there are certain combinations that we learn are impossible:
log P̂(testset|M) = Σ_{k=1}^{R} log P̂(x_k|M) = -∞ if P̂(x_k|M) = 0 for any x_k
Copyright © Andrew W. Moore Slide 76
Using a test set
The only reason that our test set didn’t score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20.
We need Density Estimators that are less prone to overfitting
Copyright © Andrew W. Moore Slide 77
Naïve Density Estimation
The problem with the joint density estimator is that it simply mirrors the training sample.
We need something that generalizes more usefully.
The naïve model generalizes strongly:
It assumes that each attribute is distributed independently of the others.
Copyright © Andrew W. Moore Slide 78
A note about independence
• Assume A and B are Boolean Random Variables. Then “A and B are independent” if and only if P(A|B) = P(A)
• “A and B are independent” is often notated as A ⊥ B
Copyright © Andrew W. Moore Slide 79
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(A^B) = P(A|B) P(B) = P(A) P(B)
(Since P(A|B) = P(A^B)/P(B), independence gives P(A) = P(A^B)/P(B), i.e. P(A^B) = P(A) P(B).)
• Assume P(A|B) = P(A)
• Then P(B|A) = P(A^B)/P(A); by the previous result this is P(A) P(B)/P(A) = P(B)
Copyright © Andrew W. Moore Slide 80
Independence Theorems
• Assume P(A|B) = P(A)
• Then P(~A|B) = P(~A^B)/P(B) = [P(B) - P(A^B)]/P(B) = 1 - P(A^B)/P(B) = 1 - P(A|B) = 1 - P(A) = P(~A)
• Assume P(A|B) = P(A)
• Then P(A|~B) = P(A^~B)/P(~B) = [P(A) - P(A^B)]/[1 - P(B)] = [P(A) - P(A)P(B)]/[1 - P(B)] = P(A)[1 - P(B)]/[1 - P(B)] = P(A)
Copyright © Andrew W. Moore Slide 81
Multivalued Independence
For multivalued Random Variables A and B,
A ⊥ B if and only if
∀u,v: P(A = u | B = v) = P(A = u)
from which you can then prove things like…
∀u,v: P(A = u ^ B = v) = P(A = u) P(B = v)
∀u,v: P(B = v | A = u) = P(B = v)
Copyright © Andrew W. Moore Slide 82
Independently Distributed Data
• Let x[i] denote the i’th field of record x.
• The independently distributed assumption says that for any i, v, u_1, u_2, … u_{i-1}, u_{i+1}, … u_M:
P(x[i] = v | x[1] = u_1, x[2] = u_2, …, x[i-1] = u_{i-1}, x[i+1] = u_{i+1}, …, x[M] = u_M) = P(x[i] = v)
• Or in other words, x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• This is often written as x[i] ⊥ {x[1], x[2], … x[i-1], x[i+1], … x[M]}
Copyright © Andrew W. Moore Slide 83
Back to Naïve Density Estimation
• Let x[i] denote the i’th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], … x[i-1], x[i+1], … x[M]}
• Example:
• Suppose that each record is generated by randomly shaking a green die and a red die
• Dataset 1: A = red value, B = green value
• Dataset 2: A = red value, B = sum of values
• Dataset 3: A = sum of values, B = difference of values
• Which of these datasets violates the naïve assumption?
Copyright © Andrew W. Moore Slide 84
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
Copyright © Andrew W. Moore Slide 85
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
P(A ^ ~B ^ C ^ ~D)
= P(A | ~B ^ C ^ ~D) P(~B ^ C ^ ~D)
= P(A) P(~B ^ C ^ ~D)
= P(A) P(~B | C ^ ~D) P(C ^ ~D)
= P(A) P(~B) P(C ^ ~D)
= P(A) P(~B) P(C | ~D) P(~D)
= P(A) P(~B) P(C) P(~D)
Copyright © Andrew W. Moore Slide 86
Naïve Distribution General Case
• Suppose x[1], x[2], … x[M] are independently distributed. Then:
P(x[1] = u_1, x[2] = u_2, …, x[M] = u_M) = Π_{k=1}^{M} P(x[k] = u_k)
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference.
• But how do we learn a Naïve Density Estimator?
Copyright © Andrew W. Moore Slide 87
Learning a Naïve Density Estimator
P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)
Another trivial learning algorithm!
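As a sketch in R (assuming the mpgtrain data frame from Slide 64; naive is our own name): learn one marginal table per attribute, then multiply marginals to score any row.

naive <- lapply(mpgtrain, function(col) table(col) / length(col))
# naive estimate of P(levelmpg="bad" ^ modelyear="70-74" ^ maker="america")
naive[["levelmpg"]]["bad"] * naive[["modelyear"]]["70-74"] * naive[["maker"]]["america"]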
Copyright © Andrew W. Moore Slide 88
Comparison
Joint DE: Can model anything.
Naïve DE: Can model only very boring distributions.
Joint DE: Can model a noisy copy of the original sample.
Naïve DE: Cannot do that.
Joint DE: Given 100 records and more than 6 Boolean attributes, it will screw up badly.
Naïve DE: Given 100 records and 10,000 multivalued attributes, it will be fine.
Copyright © Andrew W. Moore Slide 89
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes.
[Figures: a tiny part of the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 90
Empirical Results: “MPG”
The “MPG” dataset consists of 392 records and 8 attributes.
[Figures repeated from the previous slide.]
Copyright © Andrew W. Moore Slide 91
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes.
[Figures: the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 92
Empirical Results: “Weight vs. MPG”
Suppose we train only from the “Weight” and “MPG” attributes.
[Figures repeated from the previous slide.]
Copyright © Andrew W. Moore Slide 93
“Weight vs. MPG”: The best that Naïve can do
[Figures: the DE learned by “Joint”, and the DE learned by “Naïve”.]
Copyright © Andrew W. Moore Slide 94
Reminder: The Good News
• We have two ways to learn a density estimator from data.
• *In other lectures we will see more sophisticated density estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more).
• Density estimators can be used to:
• Detect anomalies (“outliers”).
• Do inference: compute P(E1|E2).
• Build Bayesian classifiers.