descriptive - stat.duke.edukfl5/lock_rree_descriptive_2009.pdf · descriptive statistics • before...

Post on 19-Jul-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

DescriptiveDescriptiveDescriptiveDescriptiveStatisticsStatisticsStatisticsStatistics

Kari LockHarvard University

Department of StatisticsRigorous Research in Engineering Education

12/3/0912/3/09

Descriptive StatisticsDescriptive Statistics

• Before making inferences about the population, we need to describe the data observed in our sample

• visual displays: graphs, plots• summary statistics: numerical summaries of the datasummary statistics: numerical summaries of the data

• The appropriate descriptive techniques depend on the pp p p q ptype of data collected

VisualizationVisualizationOne Categorical VariableOne Categorical VariableOne Categorical VariableOne Categorical Variable

G l di l th b t f l i h t

Pie Chart Bar Chart

Goal: display the number or percent of people in each category

VisualizationVisualizationOne Quantitative VariableOne Quantitative Variable

Boxplot

One Quantitative VariableOne Quantitative Variable

Histogram

9010

0

e015

<‐ 50th percentile (median)<‐ 75th percentile

7080

Exa

m S

cor e

Freq

uenc

y

510

l

<‐ 25th percentile

6050 60 70 80 90 100

0 Outlier

Outlier

• Goal: Visually display the entire distribution of observed

Exam Score

Goal: Visually display the entire distribution of observed numbers.  Location, spread, shape and outliers are all visible.

VisualizationVisualizationOne Quantitative VariableOne Quantitative VariableOne Quantitative VariableOne Quantitative Variable

• Symmetric:

Sk d• Skewed:

• Outlier – values that fall outside an overall pattern.

DistributionDistribution

• If a histogram is theoretically known, it is smoothed over with a di t ib ti Thi di t ib ti t h t th h th ti ldistribution.   This distribution represents what the hypothetical histogram would look like if we were to take many samples

1500

1500

Freq

uenc

y

1000

Freq

uenc

y

1000

500

500

-3 -2 -1 0 1 2 3

0

-3 -2 -1 0 1 2 3

0

VisualizationVisualizationOne Categorical One QuantitativeOne Categorical One QuantitativeOne Categorical, One QuantitativeOne Categorical, One Quantitative

Side-by-Side Boxplots Side-by-Side Boxplotsy p

9010

0

s

9010

0

Exa

m S

core

s

7080

Exa

m S

core

s

070

80

5060

Males Females50

60Control Treatment

Compare the numerical distributions of two (or more)

Gender

Males Females

Experimental Group

Compare the numerical distributions of two (or more) different groups

VisualizationVisualizationOne Categorical Two QuantitativeOne Categorical Two QuantitativeOne Categorical, Two QuantitativeOne Categorical, Two Quantitative

Side-by-Side Boxplots

00

res

9010

Exa

m S

cor

7080

60

Control Treatment

• Often it’s not obvious whether two groups areExperimental Group

Often it s not obvious whether two groups are significantly different.  That’s what statistics is for!

VisualizationVisualizationTwo QuantitativeTwo QuantitativeTwo QuantitativeTwo Quantitative

0

Scatterplot

9010

0

tOutlier

080

Pos

t-Tes

60 70 80 90 100

7

• G l Vi li th l ti hi b t t tit ti i bl

60 70 80 90 100

Pre-Test

• Goal: Visualize the relationship between two quantitative variables.  Note: Both variables have to be measured on the same set of individuals

Summary StatisticsSummary StatisticsCategorical VariableCategorical VariableCategorical VariableCategorical Variable

• Proportion: Report the proportion of observations that fall in each category.  g y

Summary StatisticsSummary StatisticsOne Quantitative VariableOne Quantitative VariableOne Quantitative VariableOne Quantitative Variable

Mean – the “average”: add everything up and divide by the sample size.

Median – the “middle” point at which 50% of the data are below and 50% are abovedata are below and 50% are above.

Standard Deviation – A measure of the averageStandard Deviation  A measure of the average distance from the mean.  Indicates how “spread out” the data are.out  the data are. Variance = (Standard Deviation)2

Summary StatisticsSummary StatisticsOne Quantitative VariableOne Quantitative VariableOne Quantitative VariableOne Quantitative Variable

Lower standard deviation (0.33):Lower standard deviation (0.33):

Higher standard deviation (1.32): Same mean

Summary StatisticsSummary StatisticsTwo Quantitative VariablesTwo Quantitative Variables

• Correlation (r) a measure of linear association between

Two Quantitative VariablesTwo Quantitative Variables

• Correlation (r) a measure of linear association between two numeric variables. 

• Correlation runs between ‐1 and 1.  Values closer to ‐1 and 1 represent stronger relationships, values closer to 0 are weaker.– Positive Correlation: as one variable goes up the other goes up

N i C l i i bl h h– Negative Correlation: as one variable goes up the other goes down

– Zero Correlation means no linear associationZero Correlation means no linear association

CorrelationCorrelation10

0

r = .85

901

Test

7080

Pos

t-T

60 70 80 90 100

Pre-Test

Correlation CautionsCorrelation Cautions

Correlation CautionsCorrelation Cautions

• Correlation only measures linear associations:• Correlation only measures linear associations:

r = 06

84

6

y

3 2 1 0 1 2 3

02

-3 -2 -1 0 1 2 3

x

OutliersOutliers

• Correlation can be highly affected by outliers:

r = .78

Correlation can be highly affected by outliers:

r = .17

8090

8590

6070

y 80y2

4050

Outlier 7075

30 40 50 60 70 80 90

x

75 80 85

x2

OutliersOutliers

• Mean and standard deviation are also highly influencedMean and standard deviation are also highly influenced by outliers:

• Mean = 3, Standard Deviation = 0.21:

• Mean = 3.8, Standard Deviation = 1.86:

Outlier

OutliersOutliers

l d f h l• ALWAYS plot your data to see if you have outliers

f d h l h ld f• If you do have extreme outliers, you should first check that you haven’t made an error entering the ddata

If h li l i i h ld• If the outliers are legitimate, you should run your analyses with and without the outliers to see how 

h h li i fl h lmuch the outliers influence the results

top related