
Post on 14-Jan-2016


Announcements

• First quiz next Monday (Week 3) at 6:15-6:45

Summary: Recap of the first lecture:

• Descriptive statistics – Measures of center and spread
• Normal density (Section 1.3)
• SAS procedures for analyzing univariate data: Proc MEANS, Proc UNIVARIATE

CSC 323 Data analysis and Statistical software I

Describing distributions with numbers

The distribution of the data is described through its center and its spread.

For symmetric distributions use the mean and the standard deviation

For skewed distributions, use the five number summary: Min, Q1, Median, Q3, Max

The median is the midpoint of a distribution, the number such that half the observations are below it and the other half are above it.

Q1 is the first quartile or 25th percentile, the point such that 25% of the observations are below it.

Q3 is the third quartile or 75th percentile, the point such that 25% of the observations are above it.

Example

Fuel economy (miles per gallon) per model year 2001 cars on highway

13 16 19 21 22 24 24 25 26 28 30 30 68

N=13

Median=?

Q1=?

Q3=?

Possible outliers?
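One way to check the answers by machine is the minimal Python sketch below. It assumes the textbook convention in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data (with the overall median left out when N is odd); statistical software such as SAS may use a slightly different quartile definition. The helper name `five_number_summary` is made up for this illustration, and outliers are flagged with the 1.5 x IQR rule.

```python
from statistics import median

# Highway fuel economy (mpg) from the example above, already sorted.
mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: quartiles are the medians of the lower and upper halves,
    excluding the overall median when the sample size is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


minimum, q1, med, q3, maximum = five_number_summary(mpg)

# 1.5 x IQR rule: points beyond these fences are suspected outliers.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in mpg if x < low_fence or x > high_fence]

print("Median:", med)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Suspected outliers:", outliers)
```

Under this convention the single value 68 falls above the upper fence, which is why it stands out as a possible outlier.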

Example: SAT Math score of 224 Computer Science Students

In a large university, data were collected to study the academic achievements of computer science majors. We’ll consider the SAT math scores of 224 first year CS students.

The average SATM score is 595.28 with s.d. s= 86.40

Histogram of the SATM Scores

Are the average and s.d. good descriptions of the SATM scores distribution?

Roughly 68% of the students have scores between 510 and 680

Roughly 95% of the students have scores between 422 and 768

How did I compute these intervals?

Interpreting the s.d. value

For many lists of observations, especially if their histogram is bell-shaped:

1. Roughly 68% of the observations in the list lie within 1 standard deviation from the average

2. And 95% of the observations lie within 2 standard deviations from the average

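[Figure: bell-shaped curve centered at the average; 68% of the area lies between Ave - s.d. and Ave + s.d., and 95% lies between Ave - 2 s.d. and Ave + 2 s.d.]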
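The intervals quoted for the SATM scores come from this rule: take the average and step one or two standard deviations to each side. A small sketch of the arithmetic, using the mean and s.d. from the example (the function name `sd_interval` is only illustrative):

```python
def sd_interval(average, sd, k):
    """Interval of +/- k standard deviations around the average."""
    return average - k * sd, average + k * sd


satm_mean = 595.28
satm_sd = 86.40

for k, share in [(1, "68%"), (2, "95%")]:
    low, high = sd_interval(satm_mean, satm_sd, k)
    print(f"Roughly {share} of scores between {low:.2f} and {high:.2f}")
```

The one-s.d. interval works out to 508.88 to 681.68, which the slide rounds to 510 and 680; the two-s.d. interval is 422.48 to 768.08.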

CS students example: Descriptive statistics

Mean = 595.28    Std Deviation = 86.40
Min = 300    Max = 800
Q1 = 540    Median = 600.00    Q3 = 650
IQR = 110    1.5 x IQR = 165
5th percentile = 460    95th percentile = 750

[Figure: histogram of the SATM scores; 95% of scores lie between 422 and 768]

Analysis of the scores for male and female students:

[Figure: box plots of SATM scores for men and SATM scores for women]

Exploratory Data Analysis:

1. Always plot your data

2. Look for overall patterns & striking deviations such as outliers

3. Calculate a numerical summary to describe the center and the spread:
• Symmetric distributions: Mean and standard deviation
• Asymmetric distributions: 5 number summary {Min, Q1, Median, Q3, Max}

4. NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve

Density curves

Density curves describe the overall pattern of distributions.

A density curve
• Is always on or above the horizontal axis
• Has area exactly 100% underneath it.

The density curve is a mathematical model that can be used to describe empirical distributions

SAT math scores for CS students

Normal distribution

Normal curves provide a simple compact way to describe symmetric, bell-shaped distributions.

[Figure: SAT math scores for CS students, with a normal curve overlaid]

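The normal curve has the following expression:

$$ y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where $\mu$ is the mean and $\sigma$ is the standard deviation.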

It is centered on the mean and has spread equal to the standard deviation

Two normal curves with the same mean but different standard deviation.

Money spent in a supermarket

Is the normal curve a good approximation?

The area under the histogram, i.e. the percentages of the observations, can be approximated by the corresponding area under the normal curve.

If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed). The approximating normal density curve is uniquely defined by the average and the standard deviation of the observations.

SAT math scores for CS students

The variable SAT math scores is approximately normally distributed, with mean = 595.28 (sample average) and std deviation = 86.40 (sample standard deviation).

SAT math scores for CS students

The standard normal curve

Simple mathematical formula:

$$ y = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) $$

The curve is perfectly symmetric around 0

The normal approximation is commonly used in statistics.

There is a special normal curve that is well known:

The standard normal distribution has mean =0 and standard deviation =1

Benchmarks under the standard normal curve

[Figure: normal curve, with 50% of the area on each side of the mean]

In the normal distribution N(μ, σ):
• Approximately 68% of the observations are between μ - σ and μ + σ (within 1 standard deviation from the mean)
• Approximately 95% of the observations are between μ - 2σ and μ + 2σ (within 2 standard deviations from the mean)
• Approximately 99.7% of the observations are between μ - 3σ and μ + 3σ (within 3 standard deviations from the mean)
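These benchmarks can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) to compute the area under the standard normal curve within k standard deviations of the mean; the helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} s.d. of the mean: {area_within(k):.4f}")
```

Because any N(μ, σ) variable can be standardized, the same three areas apply to every normal distribution, not only the standard one.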

Normal distribution function F(z)

It is defined as the area under the standard normal to the left of z, that is F(z)=P(Z<=z). The values of F(z) are tabulated; see Table A in your book appendix.

[Figure: cumulative distribution function of the standard normal, F(z)=P(Z<=z), plotted for z from -4 to 4]

Application of the normal distribution to the data

The normal distribution can be used to approximate the distribution of the data, when the data have a symmetric histogram!

Result:

If X is normally distributed N(m,s) with mean m and standard deviation s, then the standardized value of X, given by Z=(X-m)/s, is a standard normal variable N(0,1), with mean 0 and standard deviation 1.

Thus, we can compute the relative frequencies for any normal distribution, by standardizing and using the probability Table A.

Example

Mean = 595.28 Std Dev. s = 86.40

Problem: What is the percentage of CS students that had SAT math scores less than 700?

Answer: Use the normal approximation. The distribution of the SATM scores for the CS students is approximately normal with mean 595.28 and s.d. 86.40, so X is N(595.28, 86.40). The answer is the area under the normal density curve for X < 700.

Standardize: subtract the mean, then divide by the standard deviation.

X < 700 is equivalent to Z=(X-595.28)/86.40 < (700-595.28)/86.40 = 1.212

Look at Table A: we need to find the area to the left of Z=1.212.

Z=1.212

F(z)=.8869 (Table A entry for z = 1.21)

Result: about 88.7% of the CS students have an SATM score of 700 or lower.
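The table lookup can also be done in software. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for Table A: standardize 700, then evaluate the standard normal distribution function at the resulting z.

```python
from statistics import NormalDist

satm_mean = 595.28
satm_sd = 86.40

# Standardize: subtract the mean, then divide by the standard deviation.
z = (700 - satm_mean) / satm_sd

# F(z) = P(Z <= z) for the standard normal.
p_below_700 = NormalDist().cdf(z)

print(f"z = {z:.3f}")
print(f"P(X < 700) is approximately {p_below_700:.3f}")
```

The computed area is about 0.887. A printed table rounds z to two decimals, so a lookup can differ from the computed value in the third or fourth decimal place.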

How do we compute it?

We use the values of the standard Normal distribution function F(z)=P(Z<=z).

Problem: What is the percentage of CS students that had SAT math scores between 600 and 750?

Approximate answer:

1) Standardize

$$ 600 < X < 750 \iff \frac{600 - 595.28}{86.40} < \frac{X - 595.28}{86.40} < \frac{750 - 595.28}{86.40} $$
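The remaining step is to subtract the two areas: (area to the left of the upper z-value) minus (area to the left of the lower z-value). A hedged sketch of the full calculation, again using `statistics.NormalDist` in place of Table A:

```python
from statistics import NormalDist

satm_mean = 595.28
satm_sd = 86.40
standard_normal = NormalDist()

# 1) Standardize both endpoints of the interval 600 < X < 750.
z_low = (600 - satm_mean) / satm_sd
z_high = (750 - satm_mean) / satm_sd

# 2) Area between the endpoints = F(z_high) - F(z_low).
p_between = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(f"z_low = {z_low:.2f}, z_high = {z_high:.2f}")
print(f"P(600 < X < 750) is approximately {p_between:.2f}")
```

Under the normal approximation, roughly 44% of the students scored between 600 and 750.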

Summary: Normal distribution calculations

Follow these steps:

1. State the problem. Calculate the sample average and the s.d. and define the interval you are interested in.

2. Standardize.

3. Compute the area under the standard normal density curve using Table A.

Inverse Problem: What is the lowest SAT math score that a student must have to be in the top 25% of all CS students in the sample?

[Figure: normal curve with the top 25% of the area shaded to the right of the unknown score]

Find the value x, such that 25% of observations fall at or above it.

Mean = 595.28 Std Dev. s = 86.40

Sample Q3=650
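The inverse problem runs the usual steps backwards: find the standard value z* with 75% of the area to its left, then convert it back to the score scale with x = mean + z* x s.d. A minimal sketch, assuming `NormalDist.inv_cdf` from the Python standard library as the inverse table lookup:

```python
from statistics import NormalDist

satm_mean = 595.28
satm_sd = 86.40

# Top 25% means 75% of the area lies to the left of the cutoff.
z_star = NormalDist().inv_cdf(0.75)

# Transform the z-value back to SAT math score units.
x_star = satm_mean + z_star * satm_sd

print(f"z* = {z_star:.3f}")
print(f"Cutoff score under the normal approximation: {x_star:.1f}")
```

The normal approximation gives a cutoff of about 654, close to the sample Q3 of 650.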

Example: During a study on machine performance, the time between machine failures was recorded for 39 similar machines. From the data, the average time = 23.35 hours and the sample standard deviation = 1.67 hours.

1. What is the percentage of machines that failed after 24 hours?
2. What is the percentage of machines with failure time between 20 and 22 hours?
3. How short should the failure time be for a machine to be in the bottom 10%?

Answers

The observations are on the variable Time of failure X that is approximately normal N(23.35, 1.67).

1. What is the percentage of machines that failed after 24 hours?

Compute the percentage for X > 24, which is equal to the area under the normal distribution to the right of 24. Standardize X > 24 as

$$ \frac{X - 23.35}{1.67} > \frac{24 - 23.35}{1.67} = 0.39 $$

or equivalently Z > 0.39. Use the standard normal probability tables. The area under the standard normal to the right of 0.39 is equal to

(Area to the right of 0.39) = 1 - (Area to the left of 0.39) = 1 - 0.6517 = 0.3483

The answer is 0.3483. About 35% of the machines failed after 24 hours.

2. What is the percentage of machines with failure time between 20 and 22 hours?

We need to compute the area under the normal distribution for 20 < X < 22. This is computed by subtracting: (Area for X < 22) - (Area for X < 20).

Standardize. X < 22 is, in standard units,

$$ Z = \frac{X - 23.35}{1.67} < \frac{22 - 23.35}{1.67} = -0.81 $$

and X < 20 is, in standard units,

$$ Z = \frac{X - 23.35}{1.67} < \frac{20 - 23.35}{1.67} \approx -2.00 $$

Use the standard normal probability tables.

The area under the standard normal distribution for Z < -0.81 is 0.2090

The area under the standard normal distribution for Z < -2.00 is 0.0228

The answer is 0.2090 - 0.0228 = 0.1862

18.62% of the machines have a failure time between 20 and 22 hours.

3. How short should the failure time be for a machine to be in the bottom 10% ?

We need to compute the value x* for X~N(23.35, 1.67) such that the area under the normal distribution to the left of x* is equal to 0.1.

[Figure: normal curve centered at 23.35, with an area of 0.1 shaded to the left of x*]

From the normal probability tables, the standard value z* that corresponds to an area P(Z<z*)=0.1 is z*=-1.28.

Thus, transforming the z-value back to the x-units, we have

x*=-1.28*st.dev.+mean=-1.28*1.67+23.35=21.21

So the bottom 10% of the machines have a failure time of 21.21 hours or shorter.
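All three machine-failure answers can be cross-checked in a few lines. The sketch below assumes the same normal approximation N(23.35, 1.67) and uses `statistics.NormalDist` from the Python standard library, which standardizes internally. Results differ slightly from the table-based answers because the tables round z to two decimals.

```python
from statistics import NormalDist

# Normal approximation for the time between failures, in hours.
failure_time = NormalDist(mu=23.35, sigma=1.67)

# 1. Area to the right of 24 hours.
p_after_24 = 1 - failure_time.cdf(24)

# 2. Area between 20 and 22 hours.
p_20_to_22 = failure_time.cdf(22) - failure_time.cdf(20)

# 3. Failure time with 10% of the area to its left.
x_bottom_10 = failure_time.inv_cdf(0.10)

print(f"1. P(X > 24) is approximately {p_after_24:.2f}")
print(f"2. P(20 < X < 22) is approximately {p_20_to_22:.2f}")
print(f"3. Bottom 10% cutoff is approximately {x_bottom_10:.1f} hours")
```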

Normal approximations

Is the normal approximation appropriate for these data?

[Figure: histogram with a normal curve overlaid; the curve overestimates the area in one region and underestimates it in another]

Use the normal approximation ONLY when the histogram of the observations is bell-shaped!

Normal quantile plots

A useful tool for assessing if the data come from a normal distribution is a graph called normal quantile plot.

If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are approximately normal. Deviations from a straight line indicate that the data are not normal.
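To make the idea concrete, the sketch below builds the points of a normal quantile plot by hand: each sorted observation is paired with the standard normal quantile for its rank. Two choices here are assumptions for illustration, not part of the lecture: the plotting position (i - 0.5)/n, which is one common convention among several, and the use of the correlation between the two coordinates as a rough numeric stand-in for "close to a straight line". The data are the fuel economy values from the earlier example.

```python
from statistics import NormalDist, mean


def normal_quantile_points(values):
    """Pair each sorted observation with a standard normal quantile.

    Assumption: plotting position (i - 0.5) / n for the i-th smallest value.
    """
    data = sorted(values)
    n = len(data)
    standard_normal = NormalDist()
    return [(standard_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(data, start=1)]


def straightness(points):
    """Pearson correlation of the plot coordinates (1.0 = perfectly straight)."""
    zs = [z for z, _ in points]
    xs = [x for _, x in points]
    z_bar, x_bar = mean(zs), mean(xs)
    cov = sum((z - z_bar) * (x - x_bar) for z, x in points)
    var_z = sum((z - z_bar) ** 2 for z in zs)
    var_x = sum((x - x_bar) ** 2 for x in xs)
    return cov / (var_z * var_x) ** 0.5


mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]

with_outlier = straightness(normal_quantile_points(mpg))
without_outlier = straightness(normal_quantile_points(mpg[:-1]))

print(f"Straightness with the value 68:    {with_outlier:.2f}")
print(f"Straightness without the value 68: {without_outlier:.2f}")
```

Dropping the value 68 brings the points much closer to a straight line. On an actual plot, that observation would sit far above the line formed by the rest of the data.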

SAS for E.D.A.

PROC MEANS, PROC UNIVARIATE: to compute descriptive statistics

PROC CHART (GCHART): to plot histograms

PROC UNIVARIATE: to plot histograms, normal probability plots, boxplots
