8/3/2019 Andrew Rosenberg- Lecture 1.2: Probability and Statistics CSC 84020 - Machine Learning
1/48
Lecture 1.2: Probability and Statistics
CSC 84020 - Machine Learning
Andrew Rosenberg
January 29, 2009
Today
Probability and Statistics
Background
What exposure have you had to probability and statistics?
Conditional probabilities? Bayes rule? The difference between a posterior, a conditional, and a prior?
Artificial Intelligence
Classical Artificial Intelligence
Expert Systems, Theorem Provers, Shakey, Chess
Largely characterised by determinism.
Artificial Intelligence
Modern Artificial Intelligence
Fingerprint ID, Internet Search, Vision (facial ID, etc.), Speech Recognition, Asimo, Jeopardy (http://www.research.ibm.com/deepqa/)
Statistical modeling to generalize from data.
Natural Intelligence?
Brief Tangent
Is there a role for probability and statistics in Natural Intelligence?
Caveats about Probability and Statistics
Black Swans and The Long Tail.
Black Swans
In the 17th century, all observed swans were white.
Therefore, based on evidence, it was deemed impossible for a swan to be anything other than white.
In the early 18th century, black swans were discovered in Western Australia.
Black Swans are rare, sometimes unpredictable, events that have extreme impact.
Almost all statistical models underestimate the likelihood of unseen events.
The Long Tail
Many events follow an exponential distribution.
These distributions typically have a very long tail, that is, a long region with relatively low probability mass.
Often, interesting events occur in the Long Tail, but it is difficult to accurately model the behavior in this region of the distribution.
Probability Theory
Example: Boxes and Fruit.
Two boxes: 1 red, 1 blue.
In the red box there are 2 apples and 6 oranges. In the blue box there are 3 apples and 1 orange.
Boxes and Fruit
Suppose we draw from the Red box 40% of the time and the Blue box 60% of the time.
We are equally likely to draw any piece of fruit once the box is selected.
The identity of the box is a random variable, B. The identity of the fruit is a random variable, F.
B can take one of two values: r (red box) or b (blue box). F can take one of two values: a (apple) or o (orange).
We want to answer questions like "What is the total probability of picking an apple?" and "Given that I chose an orange, what is the probability that it was drawn from the blue box?"
Some basics
The probability of an event is the fraction of times that an event occurs out of some number of trials, as the number of trials approaches infinity.
Probabilities lie in the range [0, 1].
Mutually exclusive events are those that cannot occur simultaneously. The sum of the probabilities of all mutually exclusive events must equal 1.
If two events are independent, p(X, Y) = p(X)p(Y) and p(X | Y) = p(X).
Joint Probability
Joint probability table of the example:

         o   a
  blue   1   3  |  4
  red    6   2  |  8
         7   5  | 12

Let n_ij be the number of times event i and event j simultaneously occur, for example selecting an orange from the blue box. Then

  p(X = x_i, Y = y_j) = n_ij / N
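The table and formula above can be sketched in code; the counts below are the slide's hypothetical 12 trials:

```python
# Counts n_ij from the slide's table: keys are (box, fruit) pairs.
counts = {
    ("blue", "orange"): 1, ("blue", "apple"): 3,
    ("red",  "orange"): 6, ("red",  "apple"): 2,
}
N = sum(counts.values())  # total number of trials, 12

# Joint probability: p(B = b, F = f) = n_ij / N
joint = {bf: n / N for bf, n in counts.items()}

# Marginals, obtained by summing over the other variable
p_box = {b: sum(p for (bi, f), p in joint.items() if bi == b) for b in ("red", "blue")}
p_fruit = {f: sum(p for (b, fi), p in joint.items() if fi == f) for f in ("apple", "orange")}
```

The joint probabilities sum to 1, and each marginal is a row or column total divided by N.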
Joint Probability
A more generalized representation of a joint probability.

Let n_ij be the number of times event i and event j simultaneously occur, for example selecting an orange from the blue box. With column totals c_i and row totals r_j:

  c_i = Σ_j n_ij,   r_j = Σ_i n_ij,   N = Σ_i Σ_j n_ij

  p(X = x_i, Y = y_j) = n_ij / N
Marginalization
Now consider the probability of X irrespective of Y:

  p(X = x_i) = c_i / N

The number of instances in column i is the sum of the instances in each cell:

  c_i = Σ_{j=1}^{L} n_ij

Therefore, we can marginalize, or sum, over Y:

  p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)
Conditional Probability
Now consider only instances where X = x_i. The fraction of these instances where Y = y_j is written p(Y = y_j | X = x_i). This is a conditional probability: the probability of y given x.

  p(Y = y_j | X = x_i) = n_ij / c_i

Also,

  p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i)
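A small sketch checking both identities against the same hypothetical counts table used earlier:

```python
# Counts n_ij for (box, fruit) pairs, as in the slide's table.
counts = {("blue", "orange"): 1, ("blue", "apple"): 3,
          ("red",  "orange"): 6, ("red",  "apple"): 2}
N = sum(counts.values())

def p_joint(box, fruit):
    # p(B = b, F = f) = n_ij / N
    return counts[(box, fruit)] / N

def p_box(box):
    # marginal p(B = b) = c_i / N
    return sum(n for (b, f), n in counts.items() if b == box) / N

def p_fruit_given_box(fruit, box):
    # conditional p(F = f | B = b) = n_ij / c_i
    c = sum(n for (b, f), n in counts.items() if b == box)
    return counts[(box, fruit)] / c

# Product rule: p(B, F) = p(F | B) p(B) holds for every cell
for (b, f) in counts:
    assert abs(p_joint(b, f) - p_fruit_given_box(f, b) * p_box(b)) < 1e-12
```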
Sum and Product Rules
In general we will use p(X) to refer to a distribution over a random variable, and p(x_i) to refer to the distribution evaluated at a particular value.

Sum Rule

  p(X) = Σ_Y p(X, Y)

Product Rule

  p(X, Y) = p(Y | X) p(X)
Bayes Theorem
  p(Y | X) = p(X | Y) p(Y) / p(X)

The denominator can be viewed as a normalization term:

  p(X) = Σ_Y p(X | Y) p(Y)
Return to Boxes and Fruit
Now we can return to the question "If an orange was chosen, what box did it come from?", or define the distribution p(B | F = o).

  p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                   = (3/4 · 4/10) / (9/20)
                   = 3/4 · 4/10 · 20/9
                   = 2/3

  p(B = b | F = o) = 1 − 2/3 = 1/3
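The full calculation, sketched using the priors and likelihoods given in the slides:

```python
# Priors and likelihoods from the example:
# p(B = r) = 4/10, p(F = o | B = r) = 6/8, p(F = o | B = b) = 1/4
p_box = {"red": 4/10, "blue": 6/10}
p_orange_given = {"red": 6/8, "blue": 1/4}

# Normalizer: p(F = o) = sum_b p(F = o | B = b) p(B = b)
p_orange = sum(p_orange_given[b] * p_box[b] for b in p_box)

# Bayes' rule: p(B = r | F = o)
posterior_red = p_orange_given["red"] * p_box["red"] / p_orange
posterior_blue = 1 - posterior_red
```

This reproduces the slide's result: the posterior for the red box is 2/3, up from a prior of 4/10.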
Interpretation of Bayes Rule
  p(B | F) = p(F | B) p(B) / p(F)

p(B) is called the prior of B. This is information we have before observing anything about the fruit that was drawn.
p(B | F) is called the posterior probability, or simply the posterior. This is the distribution of B after observing F.
In our example, the prior probability of B = r was 4/10, but the posterior was 2/3.
The probability that the box was red increased after observation of F.
Continuous Probabilities
So far we have been dealing with discrete probabilities, where X can take one of M discrete values. What if X could take continuous values?
(Enter calculus.)
The probability of a real-valued random variable falling within (x, x + δx) is p(x) δx as δx → 0. p(x) is the probability density, or probability density function, over x. Thus the probability that x lies in an interval (a, b) is given by:

  p(x ∈ (a, b)) = ∫_a^b p(x) dx
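A sketch of this integral computed numerically; the standard normal density is assumed here purely as an example p(x):

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    # density of a normal distribution, used as an example p(x)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def prob_interval(pdf, a, b, n=100000):
    # p(x in (a, b)) = integral_a^b p(x) dx, approximated with the trapezoid rule
    h = (b - a) / n
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, n):
        total += pdf(a + i * h)
    return total * h

p = prob_interval(gauss_pdf, -1.0, 1.0)  # ~0.6827 for a standard normal
```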
Graphical Example of continuous probabilities.
Continuous Probability Identities
  p(x) ≥ 0

  ∫ p(x) dx = 1

Sum Rule

  p(x) = ∫ p(x, y) dy

Product Rule

  p(x, y) = p(y | x) p(x)
Expected Values
Given a random variable x characterized by a distribution p (x ),what is the expected value of x ?
The expectation of x:

  E[x] = Σ_x p(x) x

or

  E[x] = ∫ p(x) x dx
Expected Values Example 1
What is the expected value when rolling one die?
  x   p(x)
  1   1/6
  2   1/6
  3   1/6
  4   1/6
  5   1/6
  6   1/6
Distribution of Dice values
Expected Values Example 1
  E[x] = Σ_x p(x) x
       = 1 · (1/6) + 2 · (1/6) + 3 · (1/6) + 4 · (1/6) + 5 · (1/6) + 6 · (1/6)
       = 21/6
       = 3.5

From N observations, the expectation can be estimated by the sample mean:

  E[x] ≈ (1/N) Σ_{i=1}^{N} x_i
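A quick simulation sketch: the sample mean of many rolls should approach the exact expectation of 3.5:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
N = 100000
rolls = [random.randint(1, 6) for _ in range(N)]

# Sample-mean estimate: (1/N) * sum_i x_i
sample_mean = sum(rolls) / N

# Exact expectation: sum_x p(x) * x, with p(x) = 1/6 for each face
exact = sum(x * (1 / 6) for x in range(1, 7))
```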
Expected Values Example 2
What is the expected value when rolling two dice?
  x    p(x)
  2    1/36
  3    2/36
  4    3/36
  5    4/36
  6    5/36
  7    6/36
  8    5/36
  9    4/36
  10   3/36
  11   2/36
  12   1/36
Expected Values Example 2
  E[x] = Σ_x p(x) x
       = 2 · (1/36) + 3 · (2/36) + 4 · (3/36) + 5 · (4/36) + 6 · (5/36) + 7 · (6/36)
         + 8 · (5/36) + 9 · (4/36) + 10 · (3/36) + 11 · (2/36) + 12 · (1/36)
       = 252/36
       = 7
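The same table and expectation can be derived by enumerating all 36 outcomes; exact fractions are used to avoid rounding:

```python
from fractions import Fraction

# Build p(x) for the sum of two fair dice by enumerating every (d1, d2) pair.
p = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        s = d1 + d2
        p[s] = p.get(s, Fraction(0)) + Fraction(1, 36)

# E[x] = sum_x p(x) * x, which should come out to 252/36 = 7 exactly
expected = sum(x * px for x, px in p.items())
```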
Distribution of Dice values
Distribution of values of one die
Distribution of values of two dice
Distribution of values of three dice
Distribution of values of four dice
Multinomial Distribution

If a variable, x, can take 1-of-K states, we can represent this variable as being drawn from a multinomial distribution.
We say the probability of x being a member of state k is μ_k, the elements of a vector μ, where

  Σ_{k=1}^{K} μ_k = 1

  p(x | μ) = Π_{k=1}^{K} μ_k^{x_k}
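A minimal sketch of the 1-of-K representation, with a hypothetical parameter vector μ = (0.2, 0.3, 0.5):

```python
def multinomial_p(x, mu):
    # p(x | mu) = prod_k mu_k ** x_k, where x is a one-hot (1-of-K) vector
    p = 1.0
    for xk, muk in zip(x, mu):
        p *= muk ** xk
    return p

mu = [0.2, 0.3, 0.5]
assert abs(sum(mu) - 1.0) < 1e-12  # the mu_k must sum to 1

# One-hot x picks out a single state; here, state 2 of 3
p_state2 = multinomial_p([0, 1, 0], mu)
```

Because x is one-hot, the product collapses to the single μ_k of the active state.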
Expected Value of a Multinomial Distribution
Expectation
  E[x | μ] = Σ_x p(x | μ) x = (μ_0, μ_1, . . . , μ_{K−1})^T
Gaussian Distribution
As the number of dice increases, the multinomial distribution approaches a Gaussian Distribution, or Normal Distribution.
One dimensional:

  N(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

D-dimensional:

  N(x | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )
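The one-dimensional density, written out directly from the formula above (function name is illustrative):

```python
import math

def normal_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The density peaks at the mean; for a standard normal the peak is 1/sqrt(2*pi)
peak = normal_pdf(0.0, 0.0, 1.0)
```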
Gaussian Example
Image from wikipedia.
Gaussian Example
Image from wikipedia.
Expectation of a Gaussian
  E[x | μ, σ²] = ∫ N(x | μ, σ²) x dx
               = ∫ (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) ) x dx

or

  E[x | μ, Σ] = ∫ N(x | μ, Σ) x dx
              = ∫ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) ) x dx

We'll need some calculus for this, so next time.
Variances
The variance of x describes how much variability there is around the mean, E[x].

  var[f] = E[(f(x) − E[f(x)])²]
  var[f] = E[f(x)²] − E[f(x)]²
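Both forms can be checked numerically on a fair die:

```python
# Check var[x] = E[x^2] - E[x]^2 for one fair die.
values = range(1, 7)
p = 1 / 6

e_x = sum(x * p for x in values)                    # E[x] = 3.5
e_x2 = sum(x * x * p for x in values)               # E[x^2]
var_def = sum((x - e_x) ** 2 * p for x in values)   # E[(x - E[x])^2]
var_id = e_x2 - e_x ** 2                            # E[x^2] - E[x]^2
```

Both expressions give the same value (35/12 for a fair die), as the identity promises.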
Covariance
The covariance of two random variables, x and y, expresses to what extent the two vary together.

  cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])]
            = E_{x,y}[xy] − E[x] E[y]

If two variables are independent, their covariance equals zero. (Know how to prove this.)
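A simulation sketch: for two independently drawn uniform variables, the estimate E[xy] − E[x]E[y] should be near zero:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
N = 200000
xs = [random.random() for _ in range(N)]
ys = [random.random() for _ in range(N)]  # drawn independently of xs

e_x = sum(xs) / N
e_y = sum(ys) / N
e_xy = sum(x * y for x, y in zip(xs, ys)) / N

# cov[x, y] = E[xy] - E[x] E[y]; approximately 0 for independent variables
cov = e_xy - e_x * e_y
```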
How does Machine Learning use Probabilities?
The expectation of a function is the guess. The covariance is the confidence in this guess.
These are simple operations...
But how can we find the best estimate of p(x)?
Bye
Next
Linear Algebra
Vector Calculus