announcements first quiz next monday (week 3) at 6:15-6:45 summary: recap first lecture:...


Upload: angela-willis

Post on 14-Jan-2016


Page 1: Announcements First quiz next Monday (Week 3) at 6:15-6:45 Summary:  Recap first lecture: Descriptive statistics – Measures of center and spread  Normal

Announcements

• First quiz next Monday (Week 3) at 6:15-6:45

Summary:

• Recap first lecture: Descriptive statistics – Measures of center and spread

• Normal density (Section 1.3)

• SAS procedures for analyzing univariate data: Proc MEANS, Proc UNIVARIATE

CSC 323 Data analysis and Statistical software I

Page 2

Describing distributions with numbers

The distribution of the data is described through its center and its spread.

For symmetric distributions use the mean and the standard deviation

For skewed distributions, use the five number summary: Min, Q1, Median, Q3, Max

The median is the midpoint of a distribution, the number such that half the observations are below it and the other half are above it.

Q1 is the first quartile or 25th percentile, the point such that 25% of the observations are below it.

Q3 is the third quartile or 75th percentile, the point such that 25% of the observations are above it.

Page 3

Example

Highway fuel economy (miles per gallon) for model year 2001 cars

13 16 19 21 22 24 24 25 26 28 30 30 68

N=13

Median=?

Q1=?

Q3=?

Possible outliers?
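One way to work through this example is sketched below in Python. It assumes the convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the overall median when N is odd), and it flags possible outliers with the 1.5 x IQR rule. Other software may use slightly different quartile definitions, and the function name `five_number_summary` is made up for this illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # observations below the median position
    upper = data[half + 1:] if n % 2 else data[half:]  # observations above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]
minimum, q1, med, q3, maximum = five_number_summary(mpg)

# 1.5 x IQR rule: points far outside the quartiles are possible outliers
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in mpg if x < low_fence or x > high_fence]

print("Five number summary:", minimum, q1, med, q3, maximum)
print("Possible outliers:", outliers)
```

Under these assumptions the sketch gives Median = 24, Q1 = 20, Q3 = 29, and flags 68 as a possible outlier.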

Page 4

Example: SAT Math scores of 224 Computer Science students

In a large university, data were collected to study the academic achievements of computer science majors. We’ll consider the SAT math scores of 224 first year CS students.

The average SATM score is 595.28 with s.d. s= 86.40

Histogram of the SATM scores

Are the average and s.d. good descriptions of the SATM scores distribution?

Roughly 68% of the students have scores between 510 and 680.

Roughly 95% of the students have scores between 422 and 768.

How did I compute these intervals?

Page 5

Interpreting the s.d. value

For many lists of observations, especially if their histogram is bell-shaped:

1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average

2. Roughly 95% of the observations lie within 2 standard deviations of the average

[Figure: bell-shaped curve with 68% of the area between Ave - s.d. and Ave + s.d., and 95% of the area between Ave - 2 s.d. and Ave + 2 s.d.]
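As a small illustration, the rule can be applied to the SATM example (average 595.28, s.d. 86.40). The helper name `interval` is only for this sketch.

```python
mean, sd = 595.28, 86.40  # sample average and s.d. of the SATM scores

def interval(k):
    """Return the interval within k standard deviations of the average."""
    return mean - k * sd, mean + k * sd

for k, share in [(1, "68%"), (2, "95%")]:
    low, high = interval(k)
    print(f"Roughly {share} of scores lie between {low:.2f} and {high:.2f}")
```

After rounding, these are approximately the intervals 510 to 680 and 422 to 768 quoted for the SATM scores.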

Page 6

CS students example: Descriptive statistics

Mean = 595.28
Std Deviation = 86.40
Max = 800
Min = 300
Q1 = 540
Median = 600.00
Q3 = 650
IQR = 110
1.5 x IQR = 165
5th percentile = 460
95th percentile = 750

[Figure: histogram of the SATM scores, with 95% of scores between 422 and 768]

Page 7

Analysis of the scores for male and female students:

[Figure: side-by-side box plots of SATM scores for men and SATM scores for women]

Page 8

Exploratory Data Analysis:

1. Always plot your data

2. Look for overall patterns & striking deviations such as outliers

3. Calculate a numerical summary to describe the center and the spread:

• Symmetric distributions: Mean and standard deviation

• Asymmetric distributions: 5 number summary {Min, Q1, Median, Q3, Max}

4. NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve

Page 9

Density curves

Density curves describe the overall pattern of distributions.

A density curve:

• Is always on or above the horizontal axis

• Has area exactly 100% underneath it

The density curve is a mathematical model that can be used to describe empirical distributions

SAT math scores for CS students

Page 10

Normal distribution

Normal curves provide a simple compact way to describe symmetric, bell-shaped distributions.

SAT math scores for CS students

Normal curve

Page 11

The normal curve has the following expression:

$$y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

It is centered on the mean $\mu$ and has spread equal to the standard deviation $\sigma$.
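The expression can be evaluated directly. The sketch below writes it out in Python and, as a sanity check, compares it with `NormalDist.pdf` from the standard library. The function name `normal_density` is an illustrative choice, and the SATM mean and s.d. are used only as sample inputs.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 595.28, 86.40

# The curve is highest at the mean and symmetric around it
print(normal_density(mu, mu, sigma))
print(normal_density(mu - 50, mu, sigma), normal_density(mu + 50, mu, sigma))

# Cross-check against the library implementation
print(math.isclose(normal_density(600, mu, sigma), NormalDist(mu, sigma).pdf(600)))
```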

Page 12

Two normal curves with the same mean but different standard deviation.

Page 13

Money spent in a supermarket

Is the normal curve a good approximation?

Page 14

The area under the histogram, i.e. the percentages of the observations, can be approximated by the corresponding area under the normal curve.

If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed). The approximating normal density curve is uniquely defined by the average and the standard deviation of the observations!

SAT math scores for CS students

Page 15

The variable SAT math scores is approximately normally distributed with mean = 595.28 (sample average) and std deviation = 86.40 (sample standard deviation).

SAT math scores for CS students

Page 16

The standard normal curve

The normal approximation is commonly used in statistics.

There is a special normal curve that is well known: the standard normal distribution, which has mean = 0 and standard deviation = 1.

Simple mathematical formula:

$$y = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)$$

The curve is perfectly symmetric around 0.

Page 17

Benchmarks under the standard normal curve

[Figure: normal curve with 50% of the area on each side of the mean]

In the normal distribution N(μ, σ):

• Approximately 68% of the observations are between μ - σ and μ + σ (within 1 standard deviation of the mean)

• Approximately 95% of the observations are between μ - 2σ and μ + 2σ (within 2 standard deviations of the mean)

• Approximately 99.7% of the observations are between μ - 3σ and μ + 3σ (within 3 standard deviations of the mean)
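These benchmarks can be checked numerically. The following sketch uses Python's standard-library `statistics.NormalDist`; the helper name `area_within` is made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")
```

Because any normal variable can be standardized, the same areas apply to every N(μ, σ), not just the standard normal.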

Page 18

Normal distribution function F(z)

It is defined as the area under the standard normal curve to the left of z, that is, F(z)=P(Z<=z).

The values of F(z) are tabulated; see Table A in your book appendix.

[Figure: cumulative distribution function F(z) plotted against z, for z from -4 to 4]

Page 19

Standard normal probabilities F(z)=P(Z<=z)

Page 20

Application of the normal distribution to the data

The normal distribution can be used to approximate the distribution of the data when the data have a symmetric, bell-shaped histogram!

Result:

If X is normally distributed N(m,s) with mean m and standard deviation s, then the standardized value of X, given by Z=(X-m)/s, is a standard normal variable N(0,1) with mean 0 and standard deviation 1.

Thus, we can compute the relative frequencies for any normal distribution by standardizing and using the probability Table A.

Page 21

Example

Mean = 595.28 Std Dev. s = 86.40

Problem: What is the percentage of CS students that had SAT math scores less than 700?

Answer: Use the normal approximation - X is N(595.28, 86.40). The answer is the area under the normal density curve for X< 700

Standardize: subtract the average & divide by the standard deviation

X< 700 equivalent to Z=(X-595.28)/86.40<(700-595.28)/86.40=1.212

The distribution of the SATM scores for the CS students is approximately normal with mean 595.28 and s.d. 86.40: N(595.28, 86.40)

Page 22

Answer: The answer is the area under the normal density curve for X< 700

Standardize: subtract the mean, then divide by the standard deviation

X< 700 equivalent to Z=(X-595.28)/86.40<(700-595.28)/86.40=1.212

Look at Table A. We need to find the area to the left of Z=1.212, which rounds to Z=1.21

Result: about 88.7% of the CS students have an SATM score of 700 or lower

Z=1.21

F(z)=.8869
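The table lookup can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library in place of Table A. Because it does not round z to two decimals first, its result may differ slightly from a table lookup.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40

# Step 1: standardize the cutoff
z = (700 - mean) / sd

# Step 2: area to the left of z under the standard normal curve
p_standardized = NormalDist().cdf(z)

# Same area computed directly on the original scale
p_direct = NormalDist(mean, sd).cdf(700)

print(f"z = {z:.3f}")
print(f"P(X < 700) is about {p_standardized:.3f}")
```

Both routes give the same area, roughly 0.887, which is why standardizing and using a single table is enough for any normal distribution.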

Page 23

How do we compute it?

We use the values of the standard Normal distribution function F(z)=P(Z<=z).

Problem: What is the percentage of CS students that had SAT math scores between 600 and 750?

Approximate answer:

1) Standardize: 600 < X < 750 is equivalent to

$$\frac{600 - 595.28}{86.40} < Z < \frac{750 - 595.28}{86.40}$$
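The remaining steps follow the same pattern: standardize both endpoints, then subtract the area to the left of the lower endpoint from the area to the left of the upper one. A hedged sketch in Python, where the function name `prob_between` is hypothetical and `statistics.NormalDist` stands in for Table A:

```python
from statistics import NormalDist

def prob_between(low, high, mean, sd):
    """Normal-approximation share of observations with low < X < high."""
    std_normal = NormalDist()
    z_low = (low - mean) / sd    # standardized lower endpoint
    z_high = (high - mean) / sd  # standardized upper endpoint
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

share = prob_between(600, 750, mean=595.28, sd=86.40)
print(f"P(600 < X < 750) is about {share:.3f}")
```

Under the normal approximation this comes to roughly 44% of the students.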

Page 24

Summary: Normal distribution calculations

Follow these steps:

1. State the problem. Calculate the sample average and the s.d., and define the interval you are interested in

2. Standardize

3. Compute the area under the standard normal density curve using Table A.

Page 25

Inverse Problem: What is the lowest SAT math score that a student must have to be in the top 25% of all CS students in the sample?

[Figure: normal curve with the top 25% of the area shaded; the cutoff score is marked "?"]

Find the value x, such that 25% of observations fall at or above it.

Mean = 595.28 Std Dev. s = 86.40

Sample Q3=650
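The inverse problem reverses the usual steps: find the standard value z* with 75% of the area to its left, then transform back with x = mean + z* x s.d. A minimal sketch, assuming the normal approximation holds and using `NormalDist.inv_cdf` in place of a reverse table lookup:

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40

# Top 25% means 75% of the area lies to the left of the cutoff
z_star = NormalDist().inv_cdf(0.75)

# Transform the standard value back to SAT units
x_star = mean + z_star * sd

print(f"z* = {z_star:.3f}, cutoff score is about {x_star:.0f}")
```

The normal approximation puts the cutoff near 654, close to the sample Q3 of 650.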

Page 26

Example: During a study on machine performance, the time between machine failures was recorded for 39 similar machines. From the data, the average time = 23.35 hours and the sample standard deviation = 1.67 hours.

1. What is the percentage of machines that failed after 24 hours?

2. What is the percentage of machines with failure time between 20 and 22 hours?

3. How short should the failure time be for a machine to be in the bottom 10%?

Page 27

Answers

The observations are on the variable time of failure X, which is approximately normal N(23.35, 1.67).

1. What is the percentage of machines that failed after 24 hours?

Compute the percentage for X>24, which is equal to the area under the normal distribution to the right of 24. Standardize X>24 as

$$\frac{X - 23.35}{1.67} > \frac{24 - 23.35}{1.67} = 0.39$$

or equivalently Z>0.39.

Use the standard normal probability tables. The area under the standard normal to the right of 0.39 is equal to

(Area to the right of 0.39) = 1 - (Area to the left of 0.39) = 1 - 0.6517 = 0.3483

The answer is 0.3483. About 35% of the machines failed after 24 hours.

Page 28

2. What is the percentage of machines with failure time between 20 and 22 hours?

We need to compute the area under the normal distribution for 20 < X < 22. This is computed by subtracting: (Area for X<22) - (Area for X<20).

Standardize. X<22 is, in standard units,

$$Z = \frac{X - 23.35}{1.67} < \frac{22 - 23.35}{1.67} = -0.81$$

X<20 is, in standard units,

$$Z = \frac{X - 23.35}{1.67} < \frac{20 - 23.35}{1.67} \approx -2.00$$

Use the standard normal probability tables.

The area under the standard normal distribution for Z<-0.81 is 0.2090

The area under the standard normal distribution for Z<-2.00 is 0.0228

The answer is 0.2090 - 0.0228 = 0.1862

18.62% of the machines have failure time between 20 and 22 hours.

Page 29

3. How short should the failure time be for a machine to be in the bottom 10%?

We need to compute the value x* for X~N(23.35, 1.67), such that the area under the normal distribution on the left of x* is equal to 0.1.

[Figure: normal curve centered at 23.35, with an area of 0.1 shaded to the left of x*]

From the normal probability tables, the standard value z* that corresponds to an area P(Z<z*)=0.1 is z*=-1.28.

Thus, transforming the z-value back to the x-units, we have

x* = -1.28 x st.dev. + mean = -1.28 x 1.67 + 23.35 = 21.21

So the bottom 10% of the machines have failure time equal to 21.21 hours or shorter.
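All three answers for the machine-failure example can be cross-checked with a few lines of Python. This is only a sketch under the normal approximation N(23.35, 1.67), using `statistics.NormalDist` rather than the printed tables. Small differences from the table-based answers are expected, because the tables round z to two decimals.

```python
from statistics import NormalDist

failure_time = NormalDist(mu=23.35, sigma=1.67)

# 1. Share of machines that failed after 24 hours: area to the right of 24
p_after_24 = 1 - failure_time.cdf(24)

# 2. Share with failure time between 20 and 22 hours: difference of two left areas
p_20_to_22 = failure_time.cdf(22) - failure_time.cdf(20)

# 3. Failure time that cuts off the bottom 10%
x_bottom_10 = failure_time.inv_cdf(0.10)

print(f"P(X > 24) is about {p_after_24:.3f}")
print(f"P(20 < X < 22) is about {p_20_to_22:.3f}")
print(f"bottom 10% cutoff is about {x_bottom_10:.2f} hours")
```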

Page 30

Normal approximations

Is the normal approximation appropriate for these data?

[Figure: histogram with a normal curve overlaid; the curve overestimates the area in one region and underestimates it in another]

Use the normal approximation ONLY when the histogram of the observations is bell-shaped!

Page 31

Normal quantile plots

A useful tool for assessing if the data come from a normal distribution is a graph called normal quantile plot.

If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are approximately normal. Deviations from a straight line indicate that the data are not normal.
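To make the idea concrete, the sketch below computes the points of a normal quantile plot by hand for the fuel economy data from the earlier example. It assumes the plotting positions (i - 0.5)/n for the i-th smallest observation; statistical packages may use slightly different positions. The function name `normal_quantile_points` is hypothetical.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()
    # enumerate() is 0-based, so (i + 0.5)/n equals (rank - 0.5)/n for rank = i + 1
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]
points = normal_quantile_points(mpg)

for z, x in points:
    print(f"{z:6.2f}  {x}")
```

Plotting the observations against the z column gives the normal quantile plot. The largest value, 68, sits far above the line suggested by the other points, which is the kind of deviation the plot is meant to reveal.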

Page 32

SAS for E.D.A.

PROC MEANS, PROC UNIVARIATE: to compute descriptive statistics

PROC CHART (GCHART): to plot histograms

PROC UNIVARIATE: to plot histograms, normal probability plots, boxplots