random variables and distributions comp5318 knowledge discovery and data mining
Post on 19-Dec-2015
224 views
TRANSCRIPT
Random Variables and Distributions
COMP5318
Knowledge Discovery and Data Mining
Examples
Examples
• We have heard of statements like “Height is Normally Distributed”
mean
Standarddeviation
Why distributions are important
• Distribution capture the essence of data associated with a particular variable(s) (e.g., height).
• If we know height is Normally distributed then a small random sample is enough to provide a very good idea about the general population.
• Can answer questions like: what is the probability of finding a 2 meter tall Australian?
• Need to understand the concept of random variable.
Random Variable
• Let S be the sample space.
• A random variable X is a function
X: SReal
Suppose we toss a coin twice. Let X be the random variable number of heads
Random Variable(Number of Heads in two coin tosses)
S X
TT 0
TH 1
HT 1
HH 2
We also associate a probability with X attaining that value.
Random Variable(Number of Heads in two coin tosses)
S Prob X
TT 1/4 0
TH 1/4 1
HT 1/4 1
HH 1/4 2
X P(X=x)
0 1/4
1 1/2
2 1/4
Random Variables follow a Distribution
• The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm.
• The frequency of words in a text is a random variable which follows a Zipf distribution.
• The speed of a hurricane is a random variable which follows a Cauchy distribution.
• The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution.
• The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution.
• The number of web hits in a given time period is a r.v. which follows a Pareto distribution.
• Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!
Distribution Definitions
• Discrete Probability Distribution
• Continuous Probability Distribution
• Cumulative Distribution Function
Discrete Distribution
• A r.v. X is discrete if it takes countably many values {x1,x2,….}
• The probability function or probability mass function for X is given by – fX(x)= P(X=x)
• From previous example
otherwise
x
x
x
xf X
0
24/1
12/1
04/1
)(
Continuous Distributions
• A r.v. X is continuous if there exists a function fX such that
b
a
X
X
X
dxxfbxaP
dxxf
f
)()(
1)(
0
Example: Continuous Distribution
• Suppose X has the pdf
• This is the Uniform (0,1) distribution
otherwise
xxf X 0
101)(
Binomial Distribution
• A coin flips Heads with probability p. Flip it n times and let X be the number of Heads. Assume flips are independent.
• Let f(x) =P(X=x), then
otherwise
nxppx
nxf
xnx
0
,...1,0)1()(
Binomial Example
• Let p =0.5; n = 5 then
• In Matlab >>binopdf(4,5,0.5)
1562.0)5.01(5.04
5)4( 454
XP
Normal Distribution
• X has a Normal (Gaussian) distribution with parameters μ and σ if
• X is standard Normal if μ =0 and σ =1. It is denoted as Z.
• If X ~ N(μ, σ2) then
2
2)(
2
1exp
2
1)(
xxf
ZX
~
Normal Example
• The number of spam emails received by a email server in a day follows a Normal Distribution N(1000,500). What is the probability of receiving 2000 spam emails in a day?
• Let X be the number of spam emails received in a day. We want P(X = 2000)?
• The answer is P(X=2000) = 0;
• It is more meaningful to ask P(X >= 2000);
Normal Example
• This is
• In Matlab: >> 1 –normcdf(2000,1000,500)
• The answer is 1 – 0.9772 = 0.0228 or 2.28%
• This type of analysis is so common that there is a special name for it: cumulative distribution function F.
2000
)(1)2000(1)2000( dxxfXPXP
x
dxxfxXPxF )()()(
Outliers
• In data mining we are often interested in outliers– especially in high dimensional data which we
cannot easily visualize
• A knowledge of distributions can be very useful in this context.
• Lets see how?
Outliers in Normal Distribution
• Conventionally something is considered an outlier if it is at least three standard deviations away from the mean:
• Lets assume we have a standard Normal Distribution: N(0,1)
• We want P(X < -3) + P(X >3)• = normcdf(-3,0,1) + 1 – normcdf(3,0,1)=0.0027
Outliers using Univariate Normal Distribution
• Typically we are given data and we want to find outliers in the data –if any.
• Here are the steps:
1. Make the assumption that the data come from a Normal distribution.
2. Estimate the parameters of the Normal distribution.
3. Find all data points which are more than three standard deviations away from the mean.
Outliers in Multidimensional Data
• Recall, in the Iris data, we have four attributes and one class label.
• This is an example of multidimensional data set.• Look at the exponent of the Normal distribution.
• This is the square of the distance from a point x to the mean μ in units of standard deviation σ
)()( 22
xxx
Outliers in Multidimensional Data
• In multidimensional data this can be generalized to:
• This is called the Mahalanobis Distance (squared)
• Σ is d x d matrix called the variance-covariance matrix
)'()( 1 xx
Variance-Covariance Matrix
ddd
d
d
..
..
..
1
212
111
If the Data set is an N x d matrix then
In Matlab
• Suppose we generate a random 100x5 data >> data = rand(100,5);
• The covariance matrix is >>cv =cov(data)
0.0998 -0.0022 0.0006 -0.0080 -0.0025
-0.0022 0.0933 -0.0051 -0.0100 -0.0010
0.0006 -0.0051 0.0810 -0.0085 0.0083
-0.0080 -0.0100 -0.0085 0.0820 0.0071
-0.0025 -0.0010 0.0083 0.0071 0.0859
Intuitive: Mahalanobis Distance
Distribution of Mahalanobis Distance
• It turns out that if an N x d data set A if from a multivariate Normal Distribution then the Mahalanobis distance follows a a Chi-Square distribution with d degrees of freedom.
Chi-Square Distribution
Curse of dimensionality
Algorithm for Finding Outliers
>>chi2inv(.975,d)
Homework
• Define first, second, third quantile in terms of cumulative distribution function?
• Use that to understand the previous algorithm.
• Start looking up Matlab help files in the Statistics toolbox.
• Also, figure out what is the meaning of “estimating the parameter of a distribution from data”.