Probability and basic statistics with R

Quantitative Data Analysis: Probability and basic statistics

Upload: alberto-labarga

Post on 29-Jan-2015


DESCRIPTION

Quantitative Data Analysis - Part III: Probability and basic statistics- Master in Global Environmental Change -IE University

TRANSCRIPT

Page 1: Probability and basic statistics with R

Quantitative Data Analysis

Probability and basic statistics

Page 2: Probability and basic statistics with R

probabilityThe most familiar way of thinking about probability is within a framework of repeatable random experiments. In this view the probability of an event is defined as the limiting proportion of times the event would occur given many repetitions.

Page 3: Probability and basic statistics with R

ProbabilityInstead of exclusively relying on knowledge of the proportion of times an event occurs in repeated sampling, this approach allows the incorporation of subjective knowledge, so-called prior probabilities, that are then updated. The common name for this approach is Bayesian statistics.

Page 4: Probability and basic statistics with R

The Fundamental Rules of Probability

Rule 1: Probability is always non-negative.
Rule 2: For a given sample space, the sum of probabilities is 1.
Rule 3: For disjoint (mutually exclusive) events, P(A ∪ B) = P(A) + P(B).

Page 5: Probability and basic statistics with R

Counting

Permutations (order is important)

Combinations (order is not important)

Page 6: Probability and basic statistics with R

Probability functions

The factorial function

factorial(n)

gamma(n+1)

Combinations can be calculated with choose(n, k)
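As a quick check of these functions (the numbers below are illustrative):

```r
# Permutations of 3 items chosen from 5: 5!/(5-3)! = 60
factorial(5) / factorial(2)
# Combinations of 3 items chosen from 5 (order irrelevant): 10
choose(5, 3)
# factorial(n) is the same as gamma(n + 1)
factorial(6) == gamma(7)
```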

Page 7: Probability and basic statistics with R

Simple statistics

mean(x): arithmetic average of the values in x
median(x): median value in x
var(x): sample variance of x
cor(x,y): correlation between vectors x and y
quantile(x): vector containing the minimum, lower quartile, median, upper quartile, and maximum of x
rowMeans(x): row means of dataframe or matrix x
colMeans(x): column means
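For example, with a small made-up vector:

```r
x <- c(2, 4, 4, 6, 9)
mean(x)      # 5
median(x)    # 4
var(x)       # 7 (the sample variance divides by n - 1)
quantile(x)  # minimum, lower quartile, median, upper quartile, maximum
```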

Page 8: Probability and basic statistics with R

cumulative probability function

The cumulative probability function is, for any value of x, the probability of obtaining a sample value that is less than or equal to x.

curve(pnorm(x),-3,3)

Page 9: Probability and basic statistics with R

Probability density function

The probability density is the slope of this curve (its 'derivative').

curve(dnorm(x),-3,3)

Page 10: Probability and basic statistics with R

Continuous Probability Distributions

Page 11: Probability and basic statistics with R

Continuous Probability Distributions

R has a wide range of built-in probability distributions, for each of which four functions are available: the probability density function (which has a d prefix); the cumulative probability (p); the quantiles of the distribution (q); and random numbers generated from the distribution (r).

Page 12: Probability and basic statistics with R

Normal distribution

par(mfrow=c(2,2))
x <- seq(-3, 3, 0.01)
y <- exp(-abs(x))
plot(x, y, type="l")
y <- exp(-abs(x)^2)
plot(x, y, type="l")
y <- exp(-abs(x)^3)
plot(x, y, type="l")
y <- exp(-abs(x)^8)
plot(x, y, type="l")

Page 13: Probability and basic statistics with R

Normal distribution

norm.R

Page 14: Probability and basic statistics with R

Exercise

Suppose we have measured the heights of 100 people. The mean height was 170 cm and the standard deviation was 8 cm. We can ask three sorts of questions about data like these: what is the probability that a randomly selected individual will be:

shorter than a particular height?
taller than a particular height?
between one specified height and another?
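These three questions map directly onto pnorm (the cut-off heights below are illustrative, not taken from the slides):

```r
# P(shorter than 160 cm)
pnorm(160, mean = 170, sd = 8)
# P(taller than 185 cm)
pnorm(185, mean = 170, sd = 8, lower.tail = FALSE)
# P(between 165 cm and 180 cm)
pnorm(180, mean = 170, sd = 8) - pnorm(165, mean = 170, sd = 8)
```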

Page 15: Probability and basic statistics with R

Exercise

normal.R

Page 16: Probability and basic statistics with R

The central limit theorem

If you take repeated samples from a population with finite variance and calculate their averages, then the averages will be normally distributed.
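A quick simulation illustrates this: even though the exponential distribution is strongly skewed, the averages of repeated samples from it are close to normal (a sketch, not from the slides):

```r
set.seed(42)
# 10,000 averages, each from a sample of 30 exponential(1) values
means <- replicate(10000, mean(rexp(30)))
hist(means, main = "Sampling distribution of the mean")
mean(means)   # close to the population mean of 1
```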

Page 17: Probability and basic statistics with R

Checking normality

fishes.R
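The fishes.R script is not reproduced in this transcript; the usual check is a quantile-quantile plot, sketched here with simulated data standing in for the fish measurements:

```r
set.seed(1)
y <- rnorm(100, mean = 20, sd = 3)  # stand-in data, not the actual fish data
qqnorm(y)
qqline(y)         # points close to the straight line suggest normality
shapiro.test(y)   # a formal test; a large p-value is consistent with normality
```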

Page 18: Probability and basic statistics with R

Checking normality

Page 19: Probability and basic statistics with R

The gamma distribution

The gamma distribution is useful for describing a wide range of processes where the data are positively skewed (i.e. non-normal, with a long tail on the right).

Page 20: Probability and basic statistics with R

The gamma distribution

x <- seq(0.01, 4, 0.01)
par(mfrow=c(2,2))
y <- dgamma(x, .5, .5)
plot(x, y, type="l")
y <- dgamma(x, .8, .8)
plot(x, y, type="l")
y <- dgamma(x, 2, 2)
plot(x, y, type="l")
y <- dgamma(x, 10, 10)
plot(x, y, type="l")

gammas.R

Page 21: Probability and basic statistics with R

The gamma distribution

Here α is the shape parameter and β is the scale parameter. Special cases of the gamma distribution are the exponential (α = 1) and chi-squared (α = ν/2, β = 2). The mean of the distribution is αβ, the variance is αβ², the skewness is 2/√α and the kurtosis is 6/α.
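Treating β as the scale parameter (matching the quoted mean αβ), the stated moments can be checked by simulation:

```r
set.seed(1)
alpha <- 2
beta <- 3
y <- rgamma(100000, shape = alpha, scale = beta)
mean(y)   # close to alpha * beta = 6
var(y)    # close to alpha * beta^2 = 18
```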

Page 22: Probability and basic statistics with R

The gamma distribution

gammas.R

Page 23: Probability and basic statistics with R

Exercise

Page 24: Probability and basic statistics with R

Exercise

fishes2.R

Page 25: Probability and basic statistics with R

The exponential distribution
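The transcript carries no code for this slide; as with the other distributions, dexp, pexp, qexp and rexp are built in, for example:

```r
# Density of the exponential (the gamma with shape 1)
curve(dexp(x, rate = 1), 0, 4, ylab = "density")
pexp(2, rate = 1)    # P(X <= 2) = 1 - exp(-2)
qexp(0.5, rate = 1)  # the median, log(2)
```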

Page 26: Probability and basic statistics with R

Quantitative Data Analysis

Hypothesis testing


Page 31: Probability and basic statistics with R

Why Test?

Statistics is an experimental science, not really a branch of mathematics. It is a tool that can tell you whether data are accidentally or genuinely similar. It does not give you certainty.

Page 32: Probability and basic statistics with R

Steps in hypothesis testing

1. Set the null hypothesis and the alternative hypothesis.
2. Calculate the p-value.
3. Decision rule: if the p-value is less than 5%, reject the null hypothesis; otherwise the null hypothesis remains valid. In any case, you must give the p-value as a justification for your decision.
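The three steps in miniature, with made-up data (the sample and the H0 value of 20 are illustrative):

```r
set.seed(7)
y <- rnorm(20, mean = 21, sd = 3)  # 1. the data; H0: true mean is 20
res <- t.test(y, mu = 20)          # 2. calculate the p-value
res$p.value < 0.05                 # 3. reject H0 if TRUE
```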

Page 33: Probability and basic statistics with R


Types of Errors

A Type I error occurs when we reject a true null hypothesis (i.e. reject H0 when it is TRUE).

A Type II error occurs when we do not reject a false null hypothesis (i.e. do NOT reject H0 when it is FALSE).

                 H0 true    H0 false
Reject           Type I     correct
Do not reject    correct    Type II

Page 34: Probability and basic statistics with R

Critical regions and power

The table shows schematically the relation between the relevant probabilities under the null and alternative hypotheses (α is the probability of a Type I error, β of a Type II error):

                            do not reject        reject
Null hypothesis is true     1 - α                α (Type I error)
Null hypothesis is false    β (Type II error)    1 - β (power)

Page 35: Probability and basic statistics with R

Significance

It is common in hypothesis testing to set the probability of a Type I error to some value called the significance level. These levels are usually set to 0.1, 0.05 or 0.01. If the null hypothesis is true and the probability of observing the value of the current test statistic is lower than the significance level, then the hypothesis is rejected. Sometimes, instead of setting a pre-defined significance level, the p-value is reported. It is also called the observed significance level.

Page 36: Probability and basic statistics with R

inference.ppt - © Aki Taanila


Significance Level

When we reject the null hypothesis there is a risk of drawing a wrong conclusion.
The risk of drawing a wrong conclusion (called the p-value or observed significance level) can be calculated.
The researcher decides the maximum risk (called the significance level) he is ready to take.
The usual significance level is 5%.

Page 37: Probability and basic statistics with R

P-value

We start from the basic assumption: the null hypothesis is true.
The p-value is the probability of getting a value equal to or more extreme than the sample result, given that the null hypothesis is true.
Decision rule: if the p-value is less than 5%, reject the null hypothesis; if the p-value is 5% or more, the null hypothesis remains valid.
In any case, you must give the p-value as a justification for your decision.

Page 38: Probability and basic statistics with R

Interpreting the p-value

p < .01: overwhelming evidence (highly significant)
.01 ≤ p < .05: strong evidence (significant)
.05 ≤ p < .10: weak evidence (not significant)
p ≥ .10: no evidence (not significant)

Page 39: Probability and basic statistics with R

Power analysis

The power of a test is the probability of rejecting the null hypothesis when it is false. It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false. In an ideal world, we would obviously make β as small as possible. But the smaller we make the probability of committing a Type II error, the greater we make the probability of committing a Type I error, rejecting the null hypothesis when, in fact, it is correct. Most statisticians work with α = 0.05 and β = 0.2, so the power of a test is defined as 1 − β = 0.8.

Page 40: Probability and basic statistics with R

Confidence

A confidence interval with a particular confidence level is intended to give this assurance: if the statistical model is correct, then, taken over all the data that might have been obtained, the procedure for constructing the interval would deliver an interval containing the true value of the parameter with the proportion of the time set by the confidence level.

Page 41: Probability and basic statistics with R

Don't Complicate Things

Use the classical tests:

var.test to compare two variances (Fisher's F)
t.test to compare two means (Student's t)
wilcox.test to compare two means with non-normal errors (Wilcoxon's rank test)
prop.test (binomial test) to compare two proportions
cor.test (Pearson's or Spearman's rank correlation) to correlate two variables
chisq.test (chi-squared test) or fisher.test (Fisher's exact test) to test for independence in contingency tables

Page 42: Probability and basic statistics with R

Comparing Two Variances

Before comparing two means, verify that the variances are not significantly different:

var.test(set1, set2)

This performs Fisher's F test. If the variances are significantly different, you can transform the output (y) variable to equalize variances, or you can still use t.test (Welch's modified test).
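A minimal sketch with simulated samples; the F statistic reported by var.test is just the ratio of the two sample variances:

```r
set.seed(3)
set1 <- rnorm(30, sd = 2)
set2 <- rnorm(30, sd = 2)
res <- var.test(set1, set2)
res$statistic   # equals var(set1) / var(set2)
res$p.value     # a large value means no evidence the variances differ
```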

Page 43: Probability and basic statistics with R

Comparing Two Means

Student's t-test (t.test) assumes the samples are independent, the variances constant, and the errors normally distributed. It will use the Welch-Satterthwaite approximation (the default, with less power) if the variances are different. This test can also be used for paired data.

The Wilcoxon rank sum test (wilcox.test) is used for independent samples with errors that are not normally distributed. If you transform the data to get constant variance, you will probably have to use this test.
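Both tests take the two samples directly (simulated data for illustration):

```r
set.seed(2)
a <- rnorm(25, mean = 10, sd = 2)
b <- rnorm(25, mean = 12, sd = 2)
t.test(a, b)        # Welch's test by default (var.equal = FALSE)
wilcox.test(a, b)   # rank-based alternative for non-normal errors
```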

Page 44: Probability and basic statistics with R

Student's t

The test statistic is the number of standard errors by which the two sample means are separated:

t = (mean(A) - mean(B)) / SE, where SE = sqrt(var(A)/nA + var(B)/nB)
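That statistic can be computed by hand and matches what t.test reports (simulated data; the Welch form, with variances not assumed equal):

```r
set.seed(5)
a <- rnorm(15, mean = 10, sd = 2)
b <- rnorm(15, mean = 12, sd = 2)
se <- sqrt(var(a) / length(a) + var(b) / length(b))  # SE of the difference
t_stat <- (mean(a) - mean(b)) / se
t_stat   # identical to t.test(a, b)$statistic
```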

Page 45: Probability and basic statistics with R

Power analysis

So how many replicates do we need in each of two samples to detect a difference of 10% with power = 80% when the mean is 20 (i.e. delta = 2) and the standard deviation is about 3.5?

power.t.test(delta=2,sd=3.5,power=0.8)

You can work out what size of difference your sample of 30 would allow you to detect, by specifying n and omitting delta:

power.t.test(n=30,sd=3.5,power=0.8)

Page 46: Probability and basic statistics with R

Paired Observations

The measurements will not be independent. Use t.test with paired=T; you are now doing a single-sample test of the differences against 0. When you can do a paired t-test, you should always do the paired test: it is more powerful, and it deals with blocking, spatial correlation, and temporal correlation.
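A sketch with made-up before/after measurements, showing the equivalence to a one-sample test of the differences:

```r
set.seed(9)
before <- rnorm(10, mean = 20, sd = 3)
after  <- before + rnorm(10, mean = 1, sd = 1)  # same plots, remeasured
t.test(after, before, paired = TRUE)
t.test(after - before, mu = 0)  # identical p-value
```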

Page 47: Probability and basic statistics with R

Sign Test

Used when you can't measure a difference but can see it. Use the binomial test (binom.test) for this. Binomial tests can also be used to compare proportions (prop.test).
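For example, if treatment A looked better in 9 of 12 trials (made-up counts), the sign test asks whether that is compatible with a 50:50 split:

```r
# Two-sided by default; H0: each outcome is equally likely
binom.test(9, 12, p = 0.5)
```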

Page 48: Probability and basic statistics with R

Chi-squared contingency tables

The contingencies are all the events that could possibly happen. A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample.

Page 49: Probability and basic statistics with R

Chi-square Contingency Tables

Deals with count data. Suppose there are two characteristics (hair colour and eye colour). The null hypothesis is that they are uncorrelated. Create a matrix that contains the data and apply chisq.test(matrix). This will give you a p-value for the matrix values given the assumption of independence.
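A sketch with hypothetical counts (the numbers are invented for illustration):

```r
# Rows: hair colour; columns: eye colour
counts <- matrix(c(38, 14, 11, 51), nrow = 2,
                 dimnames = list(hair = c("fair", "dark"),
                                 eyes = c("blue", "brown")))
res <- chisq.test(counts)
res$p.value    # small: reject independence of hair and eye colour
res$expected   # expected frequencies should all be at least 5
```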

Page 50: Probability and basic statistics with R

Fisher's Exact Test

Used for the analysis of contingency tables when one or more of the expected frequencies is less than 5. Use fisher.test(x).

Page 51: Probability and basic statistics with R

Compare two proportions

It turns out that 196 men were promoted out of 3270 candidates, compared with 4 promotions out of only 40 candidates for the women.

prop.test(c(4,196),c(40,3270))

Page 52: Probability and basic statistics with R

Correlation and covariance

Covariance is a measure of how much two variables change together.
The Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables.

Page 53: Probability and basic statistics with R

Correlation and Covariance

Are two parameters correlated significantly?

Create and attach the data.frame.
Apply cor(data.frame).
To determine the significance of a correlation between two vectors, apply cor.test(x, y).
You have three options: Kendall's tau (method = "k"), Spearman's rank (method = "s"), or (default) Pearson's product-moment correlation (method = "p").
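A minimal example with invented paired measurements:

```r
x <- c(1.2, 2.1, 2.9, 4.2, 5.1, 5.8)
y <- c(2.0, 3.9, 6.1, 8.2, 9.8, 12.1)
cor(x, y)                     # close to 1: strong linear association
cor.test(x, y)                # Pearson's product-moment (the default)
cor.test(x, y, method = "s")  # Spearman's rank correlation
```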

Page 54: Probability and basic statistics with R

Kolmogorov-Smirnov Test

Are two sample distributions significantly different? Or does a sample distribution arise from a specific distribution?

ks.test(A,B)
