random variables and distributions comp5318 knowledge discovery and data mining

Random Variables and Distributions

COMP5318

Knowledge Discovery and Data Mining

Examples

• We have all heard statements like “Height is Normally distributed”.

[Figure: Normal distribution curve, annotated with its mean and standard deviation]

Why distributions are important

• Distributions capture the essence of the data associated with one or more variables (e.g., height).

• If we know height is Normally distributed then a small random sample is enough to provide a very good idea about the general population.

• We can then answer questions like: what is the probability of finding a 2 meter tall Australian?

• To do this, we need to understand the concept of a random variable.

Random Variable

• Let S be the sample space.

• A random variable X is a function from the sample space to the real numbers:

$X : S \to \mathbb{R}$

Suppose we toss a coin twice. Let X be the random variable “number of heads”.

Random Variable(Number of Heads in two coin tosses)

S     X
TT    0
TH    1
HT    1
HH    2

We also associate a probability with X attaining that value.

Random Variable(Number of Heads in two coin tosses)

S     Prob   X
TT    1/4    0
TH    1/4    1
HT    1/4    1
HH    1/4    2

X     P(X=x)
0     1/4
1     1/2
2     1/4
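The two tables above can be reproduced by enumerating the sample space. The following is a minimal Python sketch (the slides themselves use Matlab); the names `sample_space`, `X` and `pmf` are illustrative, and the coin is assumed fair so that all four outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S for two tosses: every ordered pair of H/T, each equally likely.
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]

# The random variable X maps each outcome in S to a real number: its number of heads.
def X(outcome):
    return outcome.count("H")

# P(X = x) is the total probability of all outcomes that X maps to x.
counts = Counter(X(outcome) for outcome in sample_space)
pmf = {x: Fraction(n, len(sample_space)) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

The outcomes TH and HT both map to X = 1, which is why that value carries probability 1/2.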

Random Variables follow a Distribution

• The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm.

• The frequency of words in a text is a random variable which follows a Zipf distribution.

• The speed of a hurricane is a random variable which follows a Cauchy distribution.

• The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution.

• The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution.

• The number of web hits in a given time period is a r.v. which follows a Pareto distribution.

• Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!

Distribution Definitions

• Discrete Probability Distribution

• Continuous Probability Distribution

• Cumulative Distribution Function

Discrete Distribution

• A r.v. X is discrete if it takes countably many values {x_1, x_2, …}.

• The probability function or probability mass function for X is given by $f_X(x) = P(X = x)$.

• From the previous example:

$$f_X(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise} \end{cases}$$

Continuous Distributions

• A r.v. X is continuous if there exists a function $f_X$ such that

$$f_X(x) \ge 0 \ \text{for all } x, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1, \qquad P(a < X < b) = \int_a^b f_X(x)\,dx$$

Example: Continuous Distribution

• Suppose X has the pdf

$$f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

• This is the Uniform(0,1) distribution.

Binomial Distribution

• A coin lands Heads with probability p. Flip it n times and let X be the number of Heads. Assume the flips are independent.

• Let f(x) = P(X = x); then

$$f(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

Binomial Example

• Let p = 0.5 and n = 5; then

$$P(X = 4) = \binom{5}{4}\, 0.5^4\, (1 - 0.5)^{5-4} = 0.15625$$

• In Matlab: >> binopdf(4,5,0.5)
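As a cross-check of the binomial formula, here is a small Python sketch using only the standard library. The function name `binomial_pmf` is an assumption chosen for illustration, with the same argument order as the Matlab call `binopdf(x, n, p)`.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p); zero outside x = 0, 1, ..., n."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    # Number of ways to place x heads among n flips, times the
    # probability of any one such sequence.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(binomial_pmf(4, 5, 0.5))                          # 0.15625
print(sum(binomial_pmf(x, 5, 0.5) for x in range(6)))   # 1.0
```

The second line confirms that the probabilities over x = 0, ..., 5 sum to one, as any probability mass function must.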

Normal Distribution

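• X has a Normal (Gaussian) distribution with parameters μ and σ if

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$

• X is standard Normal if μ = 0 and σ = 1. It is denoted as Z.

• If X ~ N(μ, σ²) then

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$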

Normal Example

• The number of spam emails received by an email server in a day follows a Normal distribution with mean 1000 and standard deviation 500. What is the probability of receiving 2000 spam emails in a day?

• Let X be the number of spam emails received in a day. We want P(X = 2000).

• The answer is P(X = 2000) = 0, because a continuous r.v. takes any single value with probability zero.

• It is more meaningful to ask for P(X >= 2000).

Normal Example

• This is

$$P(X \ge 2000) = 1 - P(X < 2000) = 1 - \int_{-\infty}^{2000} f(x)\,dx$$

• In Matlab: >> 1 - normcdf(2000,1000,500)

• The answer is 1 - 0.9772 = 0.0228, or 2.28%.

• This type of analysis is so common that there is a special name for it: the cumulative distribution function F.

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
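The same tail probability can be checked in Python with `statistics.NormalDist` from the standard library, whose `cdf` method plays the role of F. This is a sketch that assumes, as in the Matlab call, that 500 is the standard deviation; the variable names are illustrative.

```python
from statistics import NormalDist

# X ~ Normal with mean 1000 and standard deviation 500.
spam = NormalDist(mu=1000, sigma=500)

# P(X >= 2000) = 1 - F(2000), where F is the cumulative distribution function.
p_at_least_2000 = 1 - spam.cdf(2000)
print(round(p_at_least_2000, 4))  # 0.0228

# Standardising gives the same answer: (2000 - 1000) / 500 = 2, so P(Z >= 2).
z = (2000 - spam.mean) / spam.stdev
print(round(1 - NormalDist().cdf(z), 4))  # 0.0228
```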

Outliers

• In data mining we are often interested in outliers, especially in high-dimensional data which we cannot easily visualize.

• A knowledge of distributions can be very useful in this context.

• Let's see how.

Outliers in Normal Distribution

• Conventionally, a point is considered an outlier if it is at least three standard deviations away from the mean.

• Let's assume we have a standard Normal distribution: N(0,1).

• We want P(X < -3) + P(X > 3)

• = normcdf(-3,0,1) + 1 - normcdf(3,0,1) = 0.0027
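To see why three standard deviations is the conventional cut-off, the two-sided tail probability can be tabulated for k = 1, 2, 3. This Python sketch uses the standard-library `NormalDist`; the helper name `two_sided_tail` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

def two_sided_tail(k):
    """P(Z < -k) + P(Z > k): the chance of landing more than k sigma from the mean."""
    return Z.cdf(-k) + (1 - Z.cdf(k))

for k in (1, 2, 3):
    print(k, round(two_sided_tail(k), 4))
# 1 0.3173
# 2 0.0455
# 3 0.0027
```

Under the Normal assumption, only about 0.27% of points fall beyond three standard deviations, so a point out there is rare enough to be worth inspecting.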

Outliers using Univariate Normal Distribution

• Typically we are given data and we want to find outliers in the data, if any.

• Here are the steps:

1. Make the assumption that the data come from a Normal distribution.

2. Estimate the parameters of the Normal distribution.

3. Find all data points which are more than three standard deviations away from the mean.
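The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `three_sigma_outliers` and the sample `readings` are made up, and the parameters are estimated with the sample mean and the sample standard deviation.

```python
from statistics import mean, stdev

def three_sigma_outliers(data, k=3.0):
    """Return the points lying more than k standard deviations from the mean."""
    # Step 1 is the modelling assumption: the data come from a Normal distribution.
    # Step 2: estimate the parameters of that distribution from the data.
    mu = mean(data)
    sigma = stdev(data)
    # Step 3: keep every point further than k * sigma from the estimated mean.
    return [x for x in data if abs(x - mu) > k * sigma]

# Twenty readings between 9.0 and 10.9, plus one suspicious value.
readings = [9.0 + 0.1 * i for i in range(20)] + [30.0]
print(three_sigma_outliers(readings))  # [30.0]
```

Note that the outlier itself inflates the estimated mean and standard deviation, so with very small samples an extreme point may fail to exceed three estimated standard deviations.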

Outliers in Multidimensional Data

• Recall, in the Iris data, we have four attributes and one class label.

• This is an example of a multidimensional data set.

• Look at the exponent of the Normal distribution.

• Ignoring the factor -1/2, this is the square of the distance from a point x to the mean μ in units of standard deviation σ:

$$\frac{(x - \mu)^2}{\sigma^2} = (x - \mu)\,(\sigma^2)^{-1}\,(x - \mu)$$

Outliers in Multidimensional Data

• In multidimensional data this can be generalized to:

$$(x - \mu)'\,\Sigma^{-1}\,(x - \mu)$$

• This is called the Mahalanobis Distance (squared).

• Σ is a d x d matrix called the variance-covariance matrix.
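For intuition, the squared Mahalanobis distance can be computed by hand when d = 2, where the inverse of Σ has a simple closed form. The Python sketch below is restricted to two dimensions to stay within the standard library; the function name `mahalanobis_sq` and the example matrices are assumptions for illustration.

```python
def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' cov^{-1} (x - mu) for d = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2 x 2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # y = cov^{-1} (x - mu), then the result is (x - mu)' y.
    y0 = inv[0][0] * dx[0] + inv[0][1] * dx[1]
    y1 = inv[1][0] * dx[0] + inv[1][1] * dx[1]
    return dx[0] * y0 + dx[1] * y1

mu = (0.0, 0.0)

# With the identity covariance this is just the squared Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_sq((3.0, 4.0), mu, identity))   # 25.0

# Variance 4 along the first axis: two units there is only one standard deviation.
stretched = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis_sq((2.0, 0.0), mu, stretched))  # 1.0
print(mahalanobis_sq((0.0, 2.0), mu, stretched))  # 4.0
```

The last two lines show the point of the measure: the same Euclidean step counts for less along a direction in which the data vary more.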

Variance-Covariance Matrix

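$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{pmatrix}$$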

If the Data set is an N x d matrix then
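The formula that completed this sentence did not survive in the text. One common choice, assumed here, is the sample covariance: entry (i, j) is the sum over rows of (x_ki - mean_i)(x_kj - mean_j), divided by N - 1, which is also the default normalization of Matlab's `cov`. A minimal Python sketch, with the hypothetical function name `covariance_matrix`:

```python
def covariance_matrix(data):
    """Sample variance-covariance matrix of an N x d data set (a list of N rows)."""
    n = len(data)
    d = len(data[0])
    # Column means: one per attribute.
    means = [sum(row[j] for row in data) / n for j in range(d)]
    # Entry (i, j) averages the products of deviations of attributes i and j,
    # using the N - 1 denominator (an assumption, matching Matlab's cov default).
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]

# A 3 x 2 data set whose second attribute is exactly twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
for row in covariance_matrix(data):
    print(row)
# [1.0, 2.0]
# [2.0, 4.0]
```

The diagonal holds the variance of each attribute and the matrix is symmetric, as in the Matlab output for `cov(data)`.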

In Matlab

• Suppose we generate random 100 x 5 data: >> data = rand(100,5);

• The covariance matrix is: >> cv = cov(data)

    0.0998   -0.0022    0.0006   -0.0080   -0.0025
   -0.0022    0.0933   -0.0051   -0.0100   -0.0010
    0.0006   -0.0051    0.0810   -0.0085    0.0083
   -0.0080   -0.0100   -0.0085    0.0820    0.0071
   -0.0025   -0.0010    0.0083    0.0071    0.0859

Intuitive: Mahalanobis Distance

Distribution of Mahalanobis Distance

• It turns out that if an N x d data set A is drawn from a multivariate Normal distribution, then the squared Mahalanobis distance of its points follows a Chi-Square distribution with d degrees of freedom.

Chi-Square Distribution

Curse of dimensionality

Algorithm for Finding Outliers

>>chi2inv(.975,d)
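The body of the algorithm is not reproduced in this text, so what follows is one plausible reading rather than the slide's own procedure: estimate the mean and covariance, compute each point's squared Mahalanobis distance, and flag points whose distance exceeds the 0.975 quantile of the Chi-Square distribution with d degrees of freedom. The Python sketch is restricted to d = 2, where that quantile has the closed form -2 ln(1 - p); for general d one would use a library routine such as `chi2inv`. All function names are hypothetical.

```python
import math
import random

def mean_and_cov_2d(points):
    """Estimate the mean vector and (sxx, sxy, syy) of 2-D points, N - 1 denominator."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq_2d(p, mu, cov):
    """(p - mu)' Sigma^{-1} (p - mu), with the 2 x 2 inverse written out."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_quantile_2dof(p):
    """Inverse CDF of Chi-Square with 2 degrees of freedom: F(x) = 1 - exp(-x / 2)."""
    return -2.0 * math.log(1.0 - p)

def find_outliers_2d(points, p=0.975):
    mu, cov = mean_and_cov_2d(points)
    threshold = chi2_quantile_2dof(p)
    return [pt for pt in points if mahalanobis_sq_2d(pt, mu, cov) > threshold]

# 200 points from a standard bivariate Normal, plus one planted far-away point.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points.append((8.0, -8.0))

outliers = find_outliers_2d(points)
print(round(chi2_quantile_2dof(0.975), 4))  # 7.3778
print((8.0, -8.0) in outliers)              # True
```

Because the threshold is the 0.975 quantile, roughly 2.5% of genuinely Normal points are expected to be flagged as well, so the result is a list of candidates to inspect rather than a verdict.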

Homework

• Define the first, second and third quartiles in terms of the cumulative distribution function.

• Use that to understand the previous algorithm.

• Start looking up Matlab help files in the Statistics toolbox.

• Also, figure out what “estimating the parameters of a distribution from data” means.
