Copyright © Andrew W. Moore Slide 1
Probabilistic and Bayesian Analytics
Andrew W. Moore, Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/[email protected]
Probability
• The world is a very uncertain place
• 30 years of Artificial Intelligence and Database research danced around this fact
• And then a few AI researchers decided to use some ideas from the eighteenth century.
• We will review the fundamentals of probability.
Discrete Random Variables: The Binary Case
• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples:
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have Ebola
Probabilities
• We denote by P(A), the probability of A, "the fraction of all occurrences in which A is true".
• There are several ways of assigning probabilities, but we will not discuss that now.
Venn Diagram for visualizing A
Sample space of all possible occurrences. Its area is 1.
Occurrences in which A is False
Occurrences in which A is True
P(A) = Area of the oval region
The Axioms of Probability
• 0 <= P(A) <= 1
• P(A is true in every occurrence) = 1
• P(A is true in no occurrence) = 0
• P(A or B) = P(A) + P(B) if A and B are disjoint.

These axioms were introduced by the Russian school in the early 1900s: Kolmogorov, Lyapunov, Khinchin, Chebyshev
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have A true
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
(Venn diagram: overlapping regions A and B)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
(Venn diagram: P(A or B) is the combined area of A and B; P(A and B) is their overlap)
Simple addition and subtraction
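Since the or-rule is just area bookkeeping, it can be checked by brute-force enumeration over a small finite sample space. A minimal Python sketch (the particular sets A and B here are arbitrary illustrative choices):

```python
# Check P(A or B) = P(A) + P(B) - P(A and B) on a uniform
# sample space of 12 equally likely "occurrences".
worlds = set(range(12))
A = {0, 1, 2, 3, 4}   # occurrences where A is true (arbitrary choice)
B = {3, 4, 5, 6}      # occurrences where B is true (arbitrary choice)

def P(event):
    # Probability = fraction of occurrences in which the event holds.
    return len(event) / len(worlds)

lhs = P(A | B)                 # P(A or B): union of the two regions
rhs = P(A) + P(B) - P(A & B)   # simple addition and subtraction
assert abs(lhs - rhs) < 1e-12
```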
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(not A) = P(~A) = 1 - P(A)
• How?
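One standard answer to the "How?" (a reconstruction, not taken from the slides): apply the or-rule to A and ~A, which are disjoint and together cover everything:

```latex
P(A \lor \lnot A) = P(A) + P(\lnot A) - P(A \land \lnot A),
\qquad P(A \lor \lnot A) = P(\mathrm{True}) = 1,
\qquad P(A \land \lnot A) = P(\mathrm{False}) = 0
\;\Longrightarrow\; P(\lnot A) = 1 - P(A)
```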
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:
P(A) = P(A ^ B) + P(A ^ ~B)
• How?
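Again, a standard derivation (not spelled out on the slide): A is the disjunction of the disjoint events A ^ B and A ^ ~B, so the or-rule gives

```latex
P(A) = P\big((A \land B) \lor (A \land \lnot B)\big)
     = P(A \land B) + P(A \land \lnot B) - P(A \land B \land \lnot B)
```

and the last term is P(False) = 0.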
Multivalued Random Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}
• Thus…
P(A = vi ^ A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = vi ^ A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
• It's easy to prove that
P(A = v1 or A = v2 or … or A = vj) = Σ_{i=1}^{j} P(A = vi)
An easy fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = vi ^ A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
• It's easy to prove that
P(A = v1 or A = v2 or … or A = vj) = Σ_{i=1}^{j} P(A = vi)
• And thus we can prove
Σ_{j=1}^{k} P(A = vj) = 1
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = vi ^ A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
• It's easy to prove that
P(B ^ [A = v1 or A = v2 or … or A = vj]) = Σ_{i=1}^{j} P(B ^ A = vi)
Another fact about Multivalued Random Variables:
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P(A = vi ^ A = vj) = 0 if i ≠ j
P(A = v1 or A = v2 or … or A = vk) = 1
• It's easy to prove that
P(B ^ [A = v1 or A = v2 or … or A = vj]) = Σ_{i=1}^{j} P(B ^ A = vi)
• And thus we can prove
P(B) = Σ_{j=1}^{k} P(B ^ A = vj)
Elementary Probability in Pictures
• P(~A) + P(A) = 1
Elementary Probability in Pictures
• P(B) = P(B ^ A) + P(B ^ ~A)
Elementary Probability in Pictures
Σ_{j=1}^{k} P(A = vj) = 1
Elementary Probability in Pictures
P(B) = Σ_{j=1}^{k} P(B ^ A = vj)
Conditional Probability
• P(A|B) = the fraction of occurrences in which B is true that also have A true.
(Venn diagram: overlapping regions F and H)
H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
“Headaches are rare and flu is rarer, but if you’re coming down with ‘flu there’s a 50-50 chance you’ll have a headache.”
Conditional Probability
(Venn diagram: overlapping regions F and H)
H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(H|F) = fraction of flu sufferers who have a headache
= #occurrences with flu and headache / #occurrences with flu
= Area of "H and F" region / Area of "F" region
= P(H ^ F) / P(F)
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Probabilistic Inference
(Venn diagram: overlapping regions F and H)
H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu”
Is this reasoning good?
Probabilistic Inference
(Venn diagram: overlapping regions F and H)
H = "Have a headache"
F = "Coming down with Flu"
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
P(F ^ H) = P(F) P(H|F) = (1/40)(1/2) = 1/80
P(F|H) = P(F ^ H) / P(H) = (1/80)/(1/10) = 1/8 = 0.125
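The slide's arithmetic can be reproduced in a few lines of Python:

```python
# Numbers from the slide: P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
p_h = 1 / 10
p_f = 1 / 40
p_h_given_f = 1 / 2

# Chain rule: P(F ^ H) = P(F) P(H|F)
p_f_and_h = p_f * p_h_given_f
# Definition of conditional probability: P(F|H) = P(F ^ H) / P(H)
p_f_given_h = p_f_and_h / p_h

assert abs(p_f_and_h - 1 / 80) < 1e-12
assert abs(p_f_given_h - 0.125) < 1e-12   # not the naive 0.5
```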
Another way to understand the intuition
Thanks to Jahanzeb Sherwani for contributing this explanation:
What we just did…
P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)
This is Bayes Rule. P(B) is called the prior probability, P(A|B) is called the likelihood, and P(B|A) is the posterior probability.
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
More General Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
More General Forms of Bayes Rule
P(A = vi | B) = P(B | A = vi) P(A = vi) / Σ_{k=1}^{nA} P(B | A = vk) P(A = vk)
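A sketch of this multivalued form in Python; the three-valued prior and likelihood vectors are made-up numbers for illustration:

```python
# P(A=vi | B) = P(B|A=vi) P(A=vi) / sum_k P(B|A=vk) P(A=vk)
prior = [0.5, 0.3, 0.2]        # P(A=vk), hypothetical values
likelihood = [0.9, 0.5, 0.1]   # P(B|A=vk), hypothetical values

evidence = sum(l * p for l, p in zip(likelihood, prior))   # P(B)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

# The posteriors over all values of A sum to 1.
assert abs(sum(posterior) - 1.0) < 1e-12
```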
Useful Easy-to-prove facts
P(A|B) + P(~A|B) = 1
Σ_{k=1}^{nA} P(A = vk | B) = 1
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
Using Bayes Rule in Games
The "Winner" envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The "Loser" envelope contains 3 beads (one red and two black) and no money.
Question: Someone draws an envelope at random and offers to sell it to you. How much should you pay for the envelope?
Using Bayes Rule to Gamble
The "Winner" envelope contains one dollar and 4 beads (2 red and 2 black).
$1.00
The "Loser" envelope contains 3 beads (one red and two black) and no money.
Interesting question: Before deciding, you are allowed to see one bead drawn from the envelope.
Suppose it is black: how much should you pay for the envelope? Suppose it is red: how much should you pay for the envelope?
Calculation…
$1.00
P(Winner | black bead) = P(Winner ^ BB) / P(BB)
= P(Winner) P(BB | Winner) / P(BB)
P(BB) = P(BB ^ Winner) + P(BB ^ Loser)
= P(Winner) P(BB | Winner) + P(Loser) P(BB | Loser)
= (1/2)(2/4) + (1/2)(2/3) = 7/12
P(Winner | black bead) = (1/4)/(7/12) = 3/7 ≈ 0.43
So I would pay no more than 43 cents for the envelope.
Now P(Winner | red bead) = P(Winner ^ RB) / P(RB)
= (1/4)/((1/2)(2/4) + (1/2)(1/3)) = (1/4)/(5/12) = 3/5 = 0.60
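The envelope numbers can be verified exactly with Python's fractions (BB and RB abbreviate "black bead" and "red bead"):

```python
from fractions import Fraction as F

p_winner = p_loser = F(1, 2)   # an envelope is drawn at random
p_bb_winner = F(2, 4)          # Winner holds 2 red, 2 black beads
p_bb_loser = F(2, 3)           # Loser holds 1 red, 2 black beads

p_bb = p_winner * p_bb_winner + p_loser * p_bb_loser
p_winner_given_bb = p_winner * p_bb_winner / p_bb
assert p_bb == F(7, 12)
assert p_winner_given_bb == F(3, 7)   # about 0.43

p_rb = p_winner * (1 - p_bb_winner) + p_loser * (1 - p_bb_loser)
p_winner_given_rb = p_winner * (1 - p_bb_winner) / p_rb
assert p_rb == F(5, 12)
assert p_winner_given_rb == F(3, 5)   # 0.60
```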
Joint Distributions
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
A = 1 if the person is male and 0 if female, B = 1 if the person is married and 0 if not, and C = 1 if the person is sick and 0 if not
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
(Venn diagram of A, B, C with the eight region probabilities 0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10)
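The eight-row table can be held directly in a Python dict keyed by (A, B, C); a quick check that the slide's numbers obey the axioms and yield marginals:

```python
# The slide's joint distribution over Boolean A, B, C.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Axiom check: the eight numbers must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# A marginal is a row sum, e.g. P(A=1) = 0.05 + 0.10 + 0.25 + 0.10 = 0.5
p_a = sum(p for (a, b, c), p in joint.items() if a == 1)
assert abs(p_a - 0.5) < 1e-9
```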
Data for the JD example
• The Census dataset: 48842 instances, mix of continuous and discrete features (train=32561, test=16281); 45222 if instances with unknown values are removed (train=30162, test=15060)
Available at: http://ftp.ics.uci.edu/pub/machine-learning-databases
Donors: Ronny Kohavi and Barry Becker (1996).
Variables in Census
• 1-age: continuous.
• 2-workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
• 3-fnlwgt (final weight): continuous.
• 4-education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
• 5-education-num: continuous.
• 6-marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
• 7-occupation:
• 8-relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Variables in Census
• 9-race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
• 10-sex: Female[0], Male[1].
• 11-capital-gain: continuous.
• 12-capital-loss: continuous.
• 13-hours-per-week: continuous.
• 14-native-country: nominal
• 15-salary: >50K (rich)[2], <=50K (poor)[1].
First census records (train)
> traincensus[1:5,]
  age workcl fwgt educ educn marital occup relat race sex gain loss hpwk
1 39 6 77516 13 13 3 9 4 1 1 2174 0 40
2 50 2 83311 13 13 1 5 3 1 1 0 0 13
3 38 1 215646 9 9 2 7 4 1 1 0 0 40
4 53 1 234721 7 7 1 7 3 5 1 0 0 40
5 28 1 338409 13 13 1 6 1 5 0 0 0 40
  country salary
1 1 1
2 1 1
3 1 1
4 1 1
5 13 1
>
Computing frequencies for sex
> countsex=table(traincensus$sex)
> probsex=countsex/32561
> countsex
    0     1
10771 21790
> probsex
        0         1
0.3307945 0.6692055
>
Computing the joint probabilities
> jdcensus=traincensus[,c(10,13,15)]
> dim(jdcensus)
[1] 32561 3
> jdcensus[1:3,]
  sex hpwk salary
1 1 40 1
2 1 13 1
3 1 40 1
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]
[1] 8220
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.2524492
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.02484567
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.0421363
> dim(jdcensus[jdcensus$sex==0 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.01136329
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.3309174
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk<=40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.09754
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==1,])[1]/dim(jdcensus)[1]
[1] 0.1336875
> dim(jdcensus[jdcensus$sex==1 & jdcensus$hpwk>40.5 & jdcensus$salary==2,])[1]/dim(jdcensus)[1]
[1] 0.1070606
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes:

P(E) = Σ_{rows matching E} P(row)
Using the Joint
P(Poor Male) = 0.4654

P(E) = Σ_{rows matching E} P(row)
Using the Joint
P(Poor) = 0.7604

P(E) = Σ_{rows matching E} P(row)
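The row-summing formula is a one-liner over such a table. A Python sketch using the toy A, B, C joint from the earlier slides (the query shown is an arbitrary example, not the census "Poor Male" query):

```python
# Toy joint over Boolean A, B, C from the earlier slides.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    # P(E) = sum of P(row) over the rows matching E.
    return sum(p for row, p in joint.items() if event(row))

p = prob(lambda r: r[0] == 1 and r[2] == 0)   # P(A ^ ~C)
assert abs(p - 0.30) < 1e-9                   # 0.05 + 0.25
```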
Inference with the Joint

P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
Inference with the Joint

P(E1 | E2) = P(E1 ^ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
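The division above is an instance of the general two-row-sum recipe. A self-contained Python sketch on the toy A, B, C joint (the full census joint is not reproduced in this transcript, so the query here is an illustrative stand-in):

```python
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    # P(E) = sum of P(row) over the rows matching E.
    return sum(p for row, p in joint.items() if event(row))

def cond(e1, e2):
    # P(E1|E2) = P(E1 ^ E2) / P(E2), each computed by a row sum.
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

p = cond(lambda r: r[0] == 1, lambda r: r[2] == 0)   # P(A=1 | C=0)
assert abs(p - 0.30 / 0.70) < 1e-9
```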
Inference is an important matter
• I have this evidence; what is the probability that this conclusion is true?
• I have a stiff neck: what is the probability that I have meningitis?
• I see lights on in my house and it is 9pm. What is the probability that my wife is already asleep?
• There are many applications of Bayesian Inference in Medicine, Pharmacy, Machine Fault Diagnosis, etc.
Where do Joint Distributions come from?
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some algebra
Example: Suppose you knew
P(A) = 0.7
P(B|A) = 0.2
P(B|~A) = 0.1
P(C|A^B) = 0.1
P(C|A^~B) = 0.8
P(C|~A^B) = 0.3
P(C|~A^~B) = 0.1
Then you can automatically compute the JD using the chain rule
P(A=x ^ B=y ^ C=z) = P(C=z|A=x ^ B=y) P(B=y|A=x) P(A=x)
In another lecture: Bayes Nets, a systematic way to do this.
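With the given numbers, the chain rule fills in all eight rows of the joint mechanically; a Python sketch:

```python
# Given facts from the slide.
p_a = 0.7
p_b_given_a = {1: 0.2, 0: 0.1}                # P(B|A), P(B|~A)
p_c_given_ab = {(1, 1): 0.1, (1, 0): 0.8,     # P(C|A^B), P(C|A^~B)
                (0, 1): 0.3, (0, 0): 0.1}     # P(C|~A^B), P(C|~A^~B)

joint = {}
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            pa = p_a if a else 1 - p_a
            pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
            pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
            # Chain rule: P(A=x ^ B=y ^ C=z) = P(C=z|A,B) P(B=y|A) P(A=x)
            joint[(a, b, c)] = pc * pb * pa

# The constructed JD obeys the axioms: all 8 rows sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```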
Where do Joint Distributions come from?
• Idea Three: Learn them from data!
Prepare to see one of the most impressive learning algorithms you’ll come across in the entire course….
Learning a joint distribution
Build a JD table for your attributes in which the probabilities are unspecified
Then fill in each row with
P̂(row) = (records matching row) / (total number of records)
A B C Prob
0 0 0 ?
0 0 1 ?
0 1 0 ?
0 1 1 ?
1 0 0 ?
1 0 1 ?
1 1 0 ?
1 1 1 ?
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Fraction of all records in which A and B are True but C is False
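The counting recipe is a few lines of Python; the records list below is a tiny made-up dataset, not the census data:

```python
from collections import Counter

# Hypothetical dataset of (A, B, C) records.
records = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0),
           (1, 1, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0)]

counts = Counter(records)
# P-hat(row) = (records matching row) / (total number of records)
joint_hat = {row: n / len(records) for row, n in counts.items()}

# The estimated probabilities automatically sum to 1.
assert abs(sum(joint_hat.values()) - 1.0) < 1e-9
assert joint_hat[(1, 1, 0)] == 3 / 8   # fraction of records matching 1,1,0
```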
Example of Learning a Joint
• This Joint was obtained by learning from three attributes in the UCI "Adult" Census Database [Kohavi 1995]
Where are we?
• We have recalled the fundamentals of probability
• We have become content with what JDs are and how to use them
• And we even know how to learn JDs from data.
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attributes to a Probability

Input Attributes → Density Estimator → Probability
Density Estimation
• Compared with the other two main kinds of model:

Input Attributes → Regressor → Prediction of real-valued output
Input Attributes → Density Estimator → Probability
Input Attributes → Classifier → Prediction of categorical output
Evaluating density estimation

Input Attributes → Regressor → Prediction of real-valued output (evaluated by test-set accuracy)
Input Attributes → Density Estimator → Probability (evaluated by ?)
Input Attributes → Classifier → Prediction of categorical output (evaluated by test-set accuracy)

Test-set criterion for estimating performance on future data.* (*See the Decision Tree or Cross Validation lecture for more detail.)
Evaluating a density estimator
• Given a record x, a density estimator M can tell you how likely the record is:

P̂(x|M)

• Given a dataset with R records, a density estimator can tell you how likely the dataset is:

P̂(dataset|M) = P̂(x1 ^ x2 ^ … ^ xR | M) = Π_{k=1..R} P̂(xk|M)

(Under the assumption that all records were independently generated from the Density Estimator's JD.)
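The product above is straightforward to compute. A minimal Python sketch, using a hypothetical two-attribute joint and a tiny dataset (the probabilities and records are invented for illustration):

```python
# Dataset likelihood under a density estimator M:
# P_hat(dataset | M) = product over records of P_hat(x_k | M),
# assuming records are independently generated from M's joint.
import math

# Hypothetical learned joint: row -> probability.
joint = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
dataset = [(0, 0), (1, 1), (0, 1)]

likelihood = math.prod(joint[x] for x in dataset)
print(likelihood)  # product of 0.5, 0.1 and 0.2
```

Even this three-record dataset already has likelihood 0.01; with hundreds of records the product underflows quickly, which is why the log form below is used in practice.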
The Auto-mpg dataset
Donor: Quinlan, R. (1993). Number of instances: 398 minus 6 missing = 392 (training: 196, test: 196). Number of attributes: 9 including the class attribute.
Attribute information:
1. mpg: continuous (discretized: bad <= 25, good > 25)
2. cylinders: multi-valued discrete
3. displacement: continuous (discretized: low <= 200, high > 200)
4. horsepower: continuous (discretized: low <= 90, high > 90)
5. weight: continuous (discretized: low <= 3000, high > 3000)
6. acceleration: continuous (discretized: low <= 15, high > 15)
7. model year: multi-valued discrete (discretized: 70-74, 75-77, 78-82)
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Note: horsepower has 6 missing values.
The auto-mpg dataset
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
…
27.0 4 140.0 86.00 2790. 15.6 82 1 "ford mustang gl"
44.0 4 97.00 52.00 2130. 24.6 82 2 "vw pickup"
32.0 4 135.0 84.00 2295. 11.6 82 1 "dodge rampage"
28.0 4 120.0 79.00 2625. 18.6 82 1 "ford ranger"
31.0 4 119.0 82.00 2720. 19.4 82 1 "chevy s-10"
A subset of auto-mpg
      levelmpg modelyear maker
 [1,] "bad"    "70-74"   "america"
 [2,] "bad"    "70-74"   "america"
 [3,] "bad"    "70-74"   "america"
 [4,] "bad"    "70-74"   "america"
 [5,] "bad"    "70-74"   "america"
 [6,] "bad"    "70-74"   "america"
…
 [7,] "good"   "78-82"   "america"
 [8,] "good"   "78-82"   "europe"
 [9,] "good"   "78-82"   "america"
[10,] "good"   "78-82"   "america"
[11,] "good"   "78-82"   "america"
Mpg: Train set and test set
indices_samples = sample(1:392, 196)
mpgtrain = smallmpg[indices_samples, ]
dim(mpgtrain)
[1] 196 3
mpgtest = smallmpg[-indices_samples, ]
# computing the first joint probability (divide by the 196 training records)
a1 = dim(mpgtrain[mpgtrain[,1] == "bad"
                  & mpgtrain[,2] == "70-74"
                  & mpgtrain[,3] == "america", ])[1] / 196
The joint density estimate
levelmpg modelyear maker   probability
bad      70-74     america 0.2397959
bad      70-74     asia    0.01530612
bad      70-74     europe  0.03061224
bad      75-77     america 0.2091837
bad      75-77     asia    0.01020408
bad      75-77     europe  0.04591837
bad      78-82     america 0.04591837
bad      78-82     asia    0.005102041
bad      78-82     europe  0
good     70-74     america 0.02040816
good     70-74     asia    0.01530612
good     70-74     europe  0.03061224
good     75-77     america 0.04081633
good     75-77     asia    0.04081633
good     75-77     europe  0.02551020
good     78-82     america 0.1071429
good     78-82     asia    0.07142857
good     78-82     europe  0.04591837
P̂(mpgtest|M) = P̂(x1|M) × P̂(x2|M) × … × P̂(x196|M) = Π_{k=1..196} P̂(xk|M) = 1.45 × 10^-222 in this case

This probability has been computed considering that any joint probability is taken to be at least 1/10^20 = 10^-20.
A small dataset: Miles Per Gallon

From the UCI repository (thanks to Ross Quinlan)

192 Training Set Records

mpg  modelyear maker
good 75to78   asia
bad  70to74   america
bad  75to78   europe
bad  70to74   america
bad  70to74   america
bad  70to74   asia
bad  70to74   asia
bad  75to78   america
:    :        :
bad  70to74   america
good 79to83   america
bad  75to78   america
good 79to83   america
bad  75to78   america
good 79to83   america
good 79to83   america
bad  70to74   america
good 75to78   europe
bad  75to78   europe
A small dataset: Miles Per Gallon
(the same training records as above)

P̂(dataset|M) = Π_{k=1..196} P̂(xk|M) = 3.4 × 10^-203 in this case

Here "dataset" can be mpgtrain or mpgtest.
Log Probabilities

Since probabilities of datasets get so small we usually use log probabilities:

log P̂(dataset|M) = log Π_{k=1..R} P̂(xk|M) = Σ_{k=1..R} log P̂(xk|M)
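As a sketch, the sum of logs can be computed like this, including a hard floor of 10^-20 for rows the estimator has never seen, as the slides describe (the joint and test records below are invented for illustration):

```python
# Log-likelihood of a dataset: sum of log-probabilities instead of a
# product, so the score does not underflow. A floor of 1e-20 keeps a
# single zero-probability record from sending the score to -infinity.
import math

joint = {(0, 0): 0.6, (0, 1): 0.4}   # hypothetical learned joint
testset = [(0, 0), (0, 1), (1, 1)]   # (1, 1) was never seen in training

FLOOR = 1e-20
loglik = sum(math.log(max(joint.get(x, 0.0), FLOOR)) for x in testset)
print(loglik)  # dominated by the log(1e-20) term for the unseen record
```

Note how a single unseen record contributes log(10^-20) ≈ -46, swamping the two well-modeled records; this is the overfitting problem discussed below.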
A small dataset: Miles Per Gallon
(the training records again)

log P̂(dataset|M) = Σ_{k=1..R} log P̂(xk|M) = -466.19 in this case

In our course example: -510.79.
Summary: The Good News
• We have seen a way to learn a density estimator from data.
• Density estimators can be used to:
  • Sort the records by probability, and thereby detect anomalous cases ("outliers").
  • Do inference: compute P(E1|E2).
  • Build Bayesian classifiers (covered later).
Summary: The Bad News
• Density estimation by directly learning the joint is trivial, mindless and dangerous.
Using a test set

An independent test set with 196 cars has a much worse log-likelihood (actually it's a billion quintillion quintillion quintillion quintillion times less likely).

… Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!
Overfitting Density Estimators

If this ever happens, it means there are certain combinations that we learn are impossible:

log P̂(testset|M) = Σ_{k=1..R} log P̂(xk|M) = -∞ if P̂(xk|M) = 0 for any k
Using a test set

The only reason that our test set didn't score -infinity is that my code is hard-wired to always predict a probability of at least one in 10^20.

We need Density Estimators that are less prone to overfitting.
Naïve Density Estimation

The problem with the joint density estimator is that it simply mirrors the training sample.

We need something that generalizes more usefully.

The naïve model generalizes more effectively:

it assumes that each attribute is distributed independently of the others.
A note about independence
• Assume A and B are Boolean random variables. Then "A and B are independent" if and only if P(A|B) = P(A).
• "A and B are independent" is often notated as A ⊥ B.
Independence Theorems
• Assume P(A|B) = P(A). Then P(A^B) = P(A|B) P(B) = P(A) P(B).
  (Since P(A|B) = P(A^B)/P(B), independence gives P(A) = P(A^B)/P(B).)
• Assume P(A|B) = P(A). Then P(B|A) = P(A^B)/P(A). By the previous result, P(B|A) = P(A)P(B)/P(A) = P(B).
Independence Theorems
• Assume P(A|B) = P(A). Then P(~A|B) = P(~A):
  P(~A|B) = P(~A^B)/P(B) = [P(B) - P(A^B)]/P(B) = 1 - P(A^B)/P(B) = 1 - P(A) (by independence) = P(~A).
• Assume P(A|B) = P(A). Then P(A|~B) = P(A):
  P(A|~B) = P(A^~B)/P(~B) = [P(A) - P(A^B)]/[1 - P(B)] = [P(A) - P(A)P(B)]/[1 - P(B)] = P(A)[1 - P(B)]/[1 - P(B)] = P(A).
Multivalued Independence

For multivalued random variables A and B, A ⊥ B if and only if

∀u,v: P(A=u | B=v) = P(A=u)

from which you can then prove things like…

∀u,v: P(A=u ^ B=v) = P(A=u) P(B=v)
∀u,v: P(B=v | A=u) = P(B=v)
Independently Distributed Data
• Let x[i] denote the i'th field of record x.
• The independently distributed assumption says that for any i, v, u1, u2, …, u_{i-1}, u_{i+1}, …, uM:

P(x[i]=v | x[1]=u1, x[2]=u2, …, x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, …, x[M]=uM) = P(x[i]=v)

• Or in other words, x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.
• This is often written as x[i] ⊥ {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.
Back to Naïve Density Estimation
• Let x[i] denote the i'th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}.
• Example:
  • Suppose that each record is generated by randomly shaking a green die and a red die.
  • Dataset 1: A = red value, B = green value
  • Dataset 2: A = red value, B = sum of values
  • Dataset 3: A = sum of values, B = difference of values
  • Which of these datasets violates the naïve assumption?
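The question above can be checked by brute force: enumerate all 36 equally likely (red, green) outcomes and test whether P(A=u ^ B=v) = P(A=u) P(B=v) for every pair of values. A small sketch (the helper function is mine, not from the slides):

```python
# Empirical independence check for the dice example.
from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def independent(pairs):
    """True iff P(A=u ^ B=v) == P(A=u) P(B=v) for all value pairs."""
    n = len(pairs)
    pa = Counter(a for a, b in pairs)
    pb = Counter(b for a, b in pairs)
    pab = Counter(pairs)
    return all(
        Fraction(pab[(a, b)], n) == Fraction(pa[a], n) * Fraction(pb[b], n)
        for a, b in product(pa, pb)
    )

ds1 = [(r, g) for r, g in outcomes]          # red value, green value
ds2 = [(r, r + g) for r, g in outcomes]      # red value, sum of values
ds3 = [(r + g, r - g) for r, g in outcomes]  # sum, difference

print(independent(ds1))  # True:  the two dice are independent
print(independent(ds2))  # False: the sum depends on the red value
print(independent(ds3))  # False: sum and difference share parity
```

Exact `Fraction` arithmetic avoids any floating-point ambiguity in the equality test.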
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?

P(A^~B^C^~D)
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
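The derivation above reduces any row of the joint to a product of marginals. A minimal Python sketch, with hypothetical marginal probabilities (the numbers and the helper function are mine):

```python
# Computing a row of the joint from a naive model: with independence,
# P(A ^ ~B ^ C ^ ~D) = P(A) P(~B) P(C) P(~D).
p = {"A": 0.3, "B": 0.6, "C": 0.5, "D": 0.1}  # hypothetical marginals

def naive_row(assignment, marginals):
    """assignment: dict attr -> True/False. Multiplies P(attr) for True
    attributes and 1 - P(attr) for False ones."""
    prob = 1.0
    for attr, value in assignment.items():
        prob *= marginals[attr] if value else 1.0 - marginals[attr]
    return prob

print(naive_row({"A": True, "B": False, "C": True, "D": False}, p))
# 0.3 * 0.4 * 0.5 * 0.9 = 0.054
```

Storing M marginals instead of 2^M joint entries is the entire space saving of the naïve model.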
Naïve Distribution General Case
• Suppose x[1], x[2], …, x[M] are independently distributed. Then:

P(x[1]=u1, x[2]=u2, …, x[M]=uM) = Π_{k=1..M} P(x[k]=uk)

• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• So we can do any inference.
• But how do we learn a Naïve Density Estimator?
Learning a Naïve Density Estimator

P̂(x[i]=u) = (# records in which x[i] = u) / (total number of records)

Another trivial learning algorithm!
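This counting rule can be sketched in a few lines of Python, one marginal table per attribute (the records below are invented, loosely echoing the mpg example):

```python
# Learning a naive density estimator: for each attribute i and value u,
# P_hat(x[i]=u) = (# records with x[i]=u) / (total records).
from collections import Counter

# Hypothetical (mpg level, model year, maker) records.
records = [
    ("bad", "70-74", "america"),
    ("bad", "70-74", "america"),
    ("good", "78-82", "europe"),
    ("good", "78-82", "america"),
]

n = len(records)
# One Counter per attribute position, then normalize to probabilities.
marginals = [Counter(rec[i] for rec in records) for i in range(3)]
p_hat = [{u: c / n for u, c in m.items()} for m in marginals]

print(p_hat[0]["bad"])      # 2 of 4 records have mpg level "bad"
print(p_hat[2]["america"])  # 3 of 4 records have maker "america"
```

Note how little data each estimate needs: every record contributes to every marginal, unlike the joint table where each record only touches one cell.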
Comparison

Joint DE | Naïve DE
Can model anything | Can model only very boring distributions
Can model a noisy copy of the original sample | Cannot do that
Given 100 records and more than 6 Boolean attributes will screw up badly | Given 100 records and 10,000 multivalued attributes will be fine
Empirical Results: "MPG"
The "MPG" dataset consists of 392 records and 8 attributes.
[Figure: a tiny part of the DE learned by "Joint", alongside the DE learned by "Naive".]
Empirical Results: "Weight vs. MPG"
Suppose we train only from the "Weight" and "MPG" attributes.
[Figure: the DE learned by "Joint", alongside the DE learned by "Naive".]
"Weight vs. MPG": The best that Naïve can do
[Figure: the DE learned by "Joint", alongside the DE learned by "Naive".]
Reminder: The Good News
• We now have two ways to learn a density estimator from data.
• *In later classes we will see more sophisticated density estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more).
• Density estimators can be used to:
  • Detect anomalies ("outliers").
  • Make inferences: compute P(E1|E2).
  • Build Bayesian classifiers.