data analysis statistics. inferential statistics

Data Analysis Statistics

Post on 18-Dec-2015


Page 1: Data Analysis Statistics. Inferential statistics

Data Analysis

Statistics

Page 2: Data Analysis Statistics. Inferential statistics

Inferential statistics

Page 3: Data Analysis Statistics. Inferential statistics

Hypothesis testing

Page 4: Data Analysis Statistics. Inferential statistics

Normal distribution: a probability distribution

About 99.7% of scores are within 3 standard deviations of the mean
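The share of a normal distribution lying within a given number of standard deviations can be checked numerically. The sketch below uses only the Python standard library; the helper name `share_within` is an illustrative choice, not something from the slides.

```python
import math

def share_within(k: float) -> float:
    """Probability that a normal variable falls within k standard deviations of its mean."""
    # For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd of the mean: {share_within(k):.4f}")
```

For k = 3 this gives roughly 0.9973, which is where the "within 3 sd" rule comes from.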

Page 5: Data Analysis Statistics. Inferential statistics

Who cares…

• The most useful distribution in inferential statistics.

• We can translate any normal variable, X, into the standardized value, Z, to make assumptions about the whole population. Use when comparing means or proportions.

• Example:

• Suppose you were the city police and you wanted to know how many photo radar tickets you could expect to collect next year so that you can develop your budget...

Page 6: Data Analysis Statistics. Inferential statistics

• Last year the mean number of tickets for all locations was 9000 with a standard deviation of 500 tickets. What is the probability that you will give out between 7500 tickets (your lowball guess) and 9625 (your highball guess)?

• Calculate Z score

• …what type of scale must you have to calculate Z scores?

• …what reasons can you think of for wanting to calculate a Z score for your research?
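The photo radar question can be worked through as a short sketch. It assumes, as the example implies, that ticket counts are normally distributed with mean 9000 and standard deviation 500. The function names are illustrative.

```python
import math

def z_score(x: float, mean: float, sd: float) -> float:
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / sd

def normal_cdf(z: float) -> float:
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

mean_tickets = 9000
sd_tickets = 500

z_low = z_score(7500, mean_tickets, sd_tickets)    # (7500 - 9000) / 500 = -3.0
z_high = z_score(9625, mean_tickets, sd_tickets)   # (9625 - 9000) / 500 = 1.25

# Probability of landing between the lowball and highball guesses
probability = normal_cdf(z_high) - normal_cdf(z_low)
print(f"Z low = {z_low}, Z high = {z_high}, probability = {probability:.3f}")
```

Under these assumptions the probability comes out at about 0.89, so the budget range covers roughly 89% of likely outcomes.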

Page 7: Data Analysis Statistics. Inferential statistics

Z tests, another application

• You have been asked to conduct a survey on customer satisfaction at the food court. Customers indicate their perceptions on a 5 point scale where 1=very unfriendly and 5=very friendly. Assume this is an interval scale and that previous studies have shown that a normal distribution of scores is expected.

Page 8: Data Analysis Statistics. Inferential statistics

Z tests, assumptions about mean

• You think: perhaps customers think that the service is neither friendly nor unfriendly. Ho: mean is equal to 3.0

• H1: mean is not equal to 3.0

• Establish significance/confidence level = 0.05 / 95% confidence, therefore Z = +/- 1.96

• You do a study with a sample of 225 interviews and the mean is 3.78. The standard deviation is 1.5.

• Do we accept or reject the null hypothesis?

Page 9: Data Analysis Statistics. Inferential statistics

[Figure: A sampling distribution, with the lower limit and upper limit marked]

Page 10: Data Analysis Statistics. Inferential statistics

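Critical values of μ

Critical value - upper limit

$$\mu + Z S_{\bar{X}} \quad \text{or} \quad \mu + Z \frac{S}{\sqrt{n}}$$

$$= 3.0 + 1.96 \left( \frac{1.5}{\sqrt{225}} \right) = 3.0 + 1.96(0.1) = 3.196$$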

Page 11: Data Analysis Statistics. Inferential statistics

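Critical values of μ

Critical value - lower limit

$$\mu - Z S_{\bar{X}} \quad \text{or} \quad \mu - Z \frac{S}{\sqrt{n}}$$

$$= 3.0 - 1.96 \left( \frac{1.5}{\sqrt{225}} \right) = 3.0 - 1.96(0.1) = 2.804$$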

Page 12: Data Analysis Statistics. Inferential statistics

Range of acceptability: 2.804 to 3.196, centered on the hypothesized mean of 3.0

The sample mean of 3.78 falls outside this range, therefore reject Ho and say that the sample results are significant at the .05 level of significance
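The food court test can be reproduced in a few lines. This is a minimal sketch using the figures from the slides (hypothesized mean 3.0, sample mean 3.78, standard deviation 1.5, n = 225, Z = 1.96); the variable names are illustrative.

```python
import math

hypothesized_mean = 3.0   # Ho: mean is equal to 3.0
sample_mean = 3.78
sample_sd = 1.5
n = 225
z_critical = 1.96         # two-tailed test at the 0.05 level

# Standard error of the mean: S / sqrt(n) = 1.5 / 15 = 0.1
standard_error = sample_sd / math.sqrt(n)

# Critical values of the mean (the range of acceptability)
lower_limit = hypothesized_mean - z_critical * standard_error
upper_limit = hypothesized_mean + z_critical * standard_error

# Equivalent check: the observed Z statistic against +/- 1.96
z_observed = (sample_mean - hypothesized_mean) / standard_error
reject_null = abs(z_observed) > z_critical

print(f"range of acceptability: {lower_limit:.3f} to {upper_limit:.3f}")
print(f"observed Z = {z_observed:.2f}, reject Ho: {reject_null}")
```

Comparing the sample mean with the two limits and comparing the observed Z with 1.96 are the same test expressed in different units.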

Page 13: Data Analysis Statistics. Inferential statistics

Type I and Type II Errors

                 Accept null           Reject null

Null is true     Correct - no error    Type I error

Null is false    Type II error         Correct - no error

Page 14: Data Analysis Statistics. Inferential statistics

If sample is small…

• Small usually means less than 30

• Do a t test instead

Page 15: Data Analysis Statistics. Inferential statistics

Is this statistically significant?

• Chi-square test: a hypothesis test that allows for investigation of statistical significance in the analysis of a frequency distribution (or cross tab)

• Categorical data such as sex, education or dichotomous answers may be statistically analyzed

• Tests the “goodness of fit” of the sample with expected population results

Page 16: Data Analysis Statistics. Inferential statistics

Chi-square example

• Through observation research we have identified that of the sample of 100 people who got photo radar tickets, 60 were female and 40 were male. We expected that the proportions should be equal (.5 probability for each sex). Our null hypothesis is that the observed sample frequencies are consistent with these expected population proportions, tested at the 0.05 level of significance.

• If the calculated chi-square is above the critical chi-square for this level (3.84) we reject the null hypothesis. This is the case: the calculated chi-square is 4.0, so the observed values differ significantly from the expected values.
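The chi-square statistic for this example can be computed by hand or with a short sketch like the one below. It uses the observed counts (60 and 40), the expected counts under equal proportions (50 and 50), and the critical value 3.84 quoted on the slide; the function name is illustrative.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [60, 40]      # female, male ticket recipients in the sample
expected = [50, 50]      # equal proportions expected for a sample of 100
critical_value = 3.84    # from the slide: 0.05 level, one degree of freedom

chi_square = chi_square_statistic(observed, expected)
reject_null = chi_square > critical_value

print(f"chi-square = {chi_square}, reject Ho: {reject_null}")
```

Each category contributes (10)^2 / 50 = 2.0, giving a total of 4.0, which is just above 3.84.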

Page 17: Data Analysis Statistics. Inferential statistics

Estimation of population parameters: Confidence

• The population mean and standard deviation are unknown; we do know the sample mean and standard deviation….

• We take a sample of students with children and ask them to identify how much they would be willing to pay per hour for on-campus childcare. Our sample size is 30. The student population with children is estimated to be 300.

Page 18: Data Analysis Statistics. Inferential statistics

• The sample mean is $2.60.

• This is called a point estimate.

• How close is this sample mean to the population mean? How confident are we?

• Confidence level: the percentage indicating the long-run probability that the results will be correct. Usually 95%. The confidence interval is the range around the point estimate that is expected to contain the population mean at that level.
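A confidence interval around the point estimate might be sketched as follows. The slides give the sample mean ($2.60) and sample size (30) but not the sample standard deviation, so the value of 0.50 used here is purely hypothetical. Using Z = 1.96 for the 95% level is also an assumption; with a sample this small a t value could be used instead.

```python
import math

sample_mean = 2.60       # point estimate from the slide
n = 30                   # sample size from the slide
sample_sd = 0.50         # HYPOTHETICAL: not given in the slides
z_value = 1.96           # assumed 95% confidence level

standard_error = sample_sd / math.sqrt(n)
margin_of_error = z_value * standard_error

lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print(f"95% confidence interval: ${lower_bound:.2f} to ${upper_bound:.2f}")
```

A larger standard deviation or a smaller sample widens the interval; a larger sample narrows it.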

Page 19: Data Analysis Statistics. Inferential statistics

Relationship between variables

Correlation and regression analysis

Page 20: Data Analysis Statistics. Inferential statistics

Types of questions

• Is employee productivity associated with pay incentives?

• Is salary level correlated with type of degree or designation?

• Is willingness to pay student fee levies for daycare correlated with whether one has a child?

• Are students' grades influenced by length of term?

Page 21: Data Analysis Statistics. Inferential statistics

Measures of association

• A general term that refers to a number of bivariate statistical techniques used to measure the strength of a relationship between two variables

• Correlation coefficient (r): most popular. Is a measure of the covariation or association between two variables. It ranges from +1 to -1

Page 22: Data Analysis Statistics. Inferential statistics

Measures of association

• Coefficient of determination (r²)

• The proportion of the total variance of a variable that is accounted for by knowing the value of another variable. Often shown as a correlation matrix.

• We have calculated r = -.65 when investigating whether the number of years of university is correlated with unemployment. Since r² = .42, we know that about 40% of the variance in unemployment can be explained by variance in years of university
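The link between the correlation coefficient and the coefficient of determination is simple arithmetic: squaring r = -0.65 gives 0.4225. The sketch below shows that step along with one common way to compute r from paired data. The small data set is hypothetical and chosen only so the result is easy to verify; `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

# Value quoted on the slide
r_slide = -0.65
r_squared_slide = r_slide ** 2           # coefficient of determination
print(f"r = {r_slide}, r squared = {r_squared_slide:.4f}")

# HYPOTHETICAL data: a perfectly linear negative relationship
years_of_university = [1, 2, 3, 4]
unemployment_rate = [8, 6, 4, 2]
r_example = pearson_r(years_of_university, unemployment_rate)
print(f"example r = {r_example}")
```

The sign of r gives the direction of the association; squaring it discards the sign and leaves the share of variance explained.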

Page 23: Data Analysis Statistics. Inferential statistics

Regression analysis

• Bivariate linear regression: a measure of linear association that investigates a straight line relationship.

• Assuming that there is an association between students' performance and length of term, can we predict a student's GPA given the distribution of their courses along semesters?

• Uses interval data
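A straight-line fit by least squares can be sketched in a few lines. The data below (term length in weeks and GPA) are hypothetical and serve only to illustrate the calculation; `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Forecast the dependent variable for a given value of the independent variable."""
    return intercept + slope * x

# HYPOTHETICAL data: length of term (weeks) and GPA
term_weeks = [8, 10, 12, 14]
gpa = [2.6, 2.8, 3.0, 3.2]

slope, intercept = fit_line(term_weeks, gpa)
forecast = predict(13, slope, intercept)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, forecast at 13 weeks = {forecast:.2f}")
```

With real data the points will not fall exactly on the line, and the coefficient of determination indicates how well the line fits.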

Page 24: Data Analysis Statistics. Inferential statistics

Regression analysis

• Multiple regression analysis: an analysis of association that simultaneously investigates the effect of two or more variables on a single, interval-scaled dependent variable

Page 25: Data Analysis Statistics. Inferential statistics

Summary

• Chi-square allows you to test whether an observed sample distribution fits some given distribution. Are the groups in your cross tab independent?

• Z and t tests are used to determine if a sample mean or proportion differs significantly from a hypothesized value, or if the means or proportions of two samples are significantly different.

• Simple correlation measures the relationship of one variable to another. Correlation coefficient (r) indicates the strength of the association and direction of the association. The coefficient of determination measures the amount of the total variance in the DV that is accounted for by knowing the value of the independent variable. The results are often shown in a correlation matrix.

• Bivariate regression investigates a straight-line relationship between one IV and one DV. This can be done by plotting a scatter diagram or by using the least squares method. This is used to forecast values of the DV given values of the IV. The goodness of fit may be evaluated by calculating the coefficient of determination. Multiple regression analysis allows for simultaneous investigation of two or more IVs on the DV

Page 26: Data Analysis Statistics. Inferential statistics

Type of Scale   Numerical Operation                                  Descriptive Statistics

Nominal         Counting                                             Frequency; cross tab; percentage; mode

Ordinal         Rank ordering                                        (plus…) Median; range; percentile

Interval        Arithmetic operations on intervals between numbers   (plus…) Mean; standard deviation; variance

Ratio           Arithmetic operations on actual quantities           (plus…) Geometric mean; coefficient of variation

Page 27: Data Analysis Statistics. Inferential statistics

Selecting appropriate univariate statistical method

Scale: Nominal scale

Business problem: Identify sex of key executives

Statistical question to be asked: Is the number of female executives equal to the number of male executives?

Possible test of statistical significance: Chi-square test

Page 28: Data Analysis Statistics. Inferential statistics

Scale: Nominal scale

Business problem: Indicate percentage of key executives who are male

Statistical question to be asked: Is the proportion of male executives the same as the hypothesized proportion?

Possible test of statistical significance: Z test

Page 29: Data Analysis Statistics. Inferential statistics

Scale: Ordinal scale

Business problem: Compare actual and expected evaluations

Statistical question to be asked: Does the distribution of scores for a scale with categories of poor, good, excellent differ from an expected distribution?

Possible test of statistical significance: Chi-square test

Page 30: Data Analysis Statistics. Inferential statistics

Scale: Interval or ratio scale

Business problem: Compare actual and hypothetical values of average salary

Statistical question to be asked: Is the sample mean significantly different from the hypothesized population mean?

Possible test of statistical significance: Z-test (sample is large); t-test (sample is small)

Page 31: Data Analysis Statistics. Inferential statistics

Determining Sample Size

• What data do you need to consider

– Variance or heterogeneity of population

– The degree of acceptable error (confidence interval)

– Confidence level

– Generally, we need to make judgments on all these variables

Page 32: Data Analysis Statistics. Inferential statistics

Determining Sample Size

• Variance or heterogeneity of population

– Previous studies? Industry expectations? Pilot study?

– Sequential sampling

– Rule of thumb: the value of standard deviation is expected to be 1/6 of the range.

Page 33: Data Analysis Statistics. Inferential statistics

Determining Sample Size

• Formula

• $N = \left( \frac{ZS}{E} \right)^2$

Z = standardization value indicating confidence level

S = sample standard deviation

E = acceptable magnitude of error
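The formula can be applied directly, as in this minimal sketch. The inputs are illustrative assumptions rather than values from this slide: Z = 1.96 for 95% confidence, S = 1.5 (borrowed from the earlier food court example), and an acceptable error of 0.2 scale points. Rounding up to the next whole respondent is also a choice made here.

```python
import math

def required_sample_size(z: float, s: float, e: float) -> int:
    """N = (Z * S / E)^2, rounded up to a whole number of respondents."""
    return math.ceil((z * s / e) ** 2)

z_value = 1.96           # assumed 95% confidence level
std_dev = 1.5            # assumed estimate of the standard deviation
acceptable_error = 0.2   # assumed acceptable magnitude of error

n_required = required_sample_size(z_value, std_dev, acceptable_error)
print(f"required sample size: {n_required}")
```

Because E is in the denominator and the ratio is squared, halving the acceptable error roughly quadruples the required sample.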

It's not the size that matters….