graphs & stats


Upload: burt

Post on 05-Jan-2016


Page 1: GRAPHS & STATS

GRAPHS & STATS

20 September 2014 Sherubtse Training

More on scatterplots

Exporting data

Overview of statistics

T-tests

Page 2: GRAPHS & STATS

HtWt Data

What kinds of interesting questions can we ask? What graphs would we make to answer them?

• Is there a difference in height between UWICE & SFS personnel? Does it differ for males vs. females?

• Is there a difference in weight between UWICE & SFS personnel? Does it differ for males vs. females?

• Is there a relationship between height and weight for UWICE personnel? How about for SFS personnel?

• Is there a relationship between height and weight for males? How about for females?

Page 3: GRAPHS & STATS
Page 4: GRAPHS & STATS

Just for fun... add a column of calculated data (BMI) and then summarize the data by SEX and INSTITUTE

HtWt$BMI <- [equation]
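A minimal sketch of this step, assuming the standard BMI definition (weight in kg divided by height in metres, squared) and columns named kg, cm, sex and institute. The small HtWt data frame below is invented for illustration; it is not the training data set.

```r
# Hypothetical stand-in for the HtWt data (values invented for illustration)
HtWt <- data.frame(
  institute = factor(c("UWICE", "UWICE", "UWICE", "UWICE", "SFS", "SFS", "SFS", "SFS")),
  sex       = factor(c("M", "F", "M", "F", "M", "F", "M", "F")),
  cm        = c(172, 158, 168, 161, 180, 165, 175, 160),
  kg        = c(68, 52, 63, 55, 82, 60, 74, 54)
)

# BMI = weight (kg) / height (m)^2; heights are in cm, so convert to metres first
HtWt$BMI <- HtWt$kg / (HtWt$cm / 100)^2

# Mean BMI for every combination of sex and institute
bmi.summary <- aggregate(BMI ~ sex + institute, data = HtWt, FUN = mean)
print(bmi.summary)
```

aggregate() is only one way to summarize by group; tapply() would give the same means as a table.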

Page 5: GRAPHS & STATS

Create this scatterplot of heights vs. weights for just UWICE personnel

Alternative format for plot():

plot(data=UWICE, kg~cm)
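One possible way to build this plot, sketched with an invented HtWt data frame (the column names cm, kg and institute are assumptions taken from the code on these slides). It shows the x, y form of plot() next to the formula form.

```r
# Hypothetical stand-in for the HtWt data (values invented for illustration)
HtWt <- data.frame(
  institute = factor(c("UWICE", "UWICE", "UWICE", "UWICE", "SFS", "SFS", "SFS", "SFS")),
  cm        = c(172, 158, 168, 161, 180, 165, 175, 160),
  kg        = c(68, 52, 63, 55, 82, 60, 74, 54)
)

# Keep only the UWICE rows
UWICE <- subset(HtWt, institute == "UWICE")

# x, y form: first argument on the x-axis, second on the y-axis
plot(UWICE$cm, UWICE$kg, xlab = "Height (cm)", ylab = "Weight (kg)")

# Formula form: y ~ x, with the columns looked up in 'data'
plot(kg ~ cm, data = UWICE, xlab = "Height (cm)", ylab = "Weight (kg)")
```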

Page 6: GRAPHS & STATS

Add a regression line to the UWICE scatterplot and determine if the linear relationship is significant

Fit the model with lm() and store it (e.g., as lm.UWICE), then call summary() on the fitted model

significant or not?

add the regression line:
abline(lm.UWICE, col="red")

one way to add the p-value as text:
text(locator(1), "p = 0.0016", cex=1.5, col="red")
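The steps on this slide can be sketched as follows. The UWICE values are invented, so the p-value will not match the 0.0016 shown on the slide. Because locator() needs a mouse click, this sketch places the label at fixed coordinates instead.

```r
# Hypothetical UWICE heights (cm) and weights (kg), invented for illustration
UWICE <- data.frame(
  cm = c(155, 158, 161, 165, 168, 172, 176, 181),
  kg = c(50, 54, 53, 60, 62, 68, 70, 79)
)

# Fit weight as a linear function of height and inspect the result
lm.UWICE <- lm(kg ~ cm, data = UWICE)
print(summary(lm.UWICE))

# The p-value for the slope is in the coefficient table of the summary
p.value <- summary(lm.UWICE)$coefficients["cm", "Pr(>|t|)"]

# Scatterplot with the fitted line and the p-value written in the top-left area
plot(kg ~ cm, data = UWICE)
abline(lm.UWICE, col = "red")
text(x = min(UWICE$cm), y = max(UWICE$kg),
     labels = paste("p =", signif(p.value, 2)),
     pos = 4, cex = 1.5, col = "red")
```

Reading the p-value from the model object avoids retyping it by hand when the data change.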

Page 7: GRAPHS & STATS

Add a thick dashed blue line at x = 170 to indicate which UWICE staff can receive special travel privileges

(HINT: use ?par to figure out the plot arguments for setting line type & line width)

What are the sexes of the 3 tall persons who will get special privileges? Use identify() to label

identify(UWICE$cm, UWICE$kg, labels=UWICE$sex, n=3)
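One possible answer to the exercise, as a sketch on invented data: lty sets the line type and lwd the line width (both documented under ?par). identify() needs mouse clicks, so a non-interactive alternative that selects the tall persons by subsetting is shown as well.

```r
# Hypothetical UWICE data, invented for illustration
UWICE <- data.frame(
  cm  = c(155, 158, 161, 165, 168, 172, 176, 181),
  kg  = c(50, 54, 53, 60, 62, 68, 70, 79),
  sex = factor(c("F", "F", "F", "M", "F", "M", "F", "M"))
)

plot(kg ~ cm, data = UWICE)

# Vertical line at x = 170: dashed (lty), thick (lwd), blue (col)
abline(v = 170, lty = "dashed", lwd = 3, col = "blue")

# Non-interactive alternative to identify(): select and label everyone taller than 170 cm
tall <- UWICE[UWICE$cm > 170, ]
text(tall$cm, tall$kg, labels = tall$sex, pos = 1)
print(tall)
```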

Page 8: GRAPHS & STATS

Do the same with the SFS scatterplot...(is the relationship significant?)

Page 9: GRAPHS & STATS

You can put all the data in a single graph with different institutions represented by different colors

METHOD #1

1) Figure out the lower & upper limits for x- and y-axes

2) Plot the first (e.g., UWICE) data and regression line, setting the xlim and ylim arguments

3) Add the additional data, using points()

4) Add the lines (abline()), using colors matching each institution's points
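The four steps of Method #1 might look like this. The two data frames are invented for illustration, and the colors follow the legend used later in the deck (UWICE purple, SFS blue).

```r
# Hypothetical data for the two institutes (values invented for illustration)
UWICE <- data.frame(cm = c(155, 158, 161, 165, 168, 172, 176, 181),
                    kg = c(50, 54, 53, 60, 62, 68, 70, 79))
SFS   <- data.frame(cm = c(160, 163, 167, 170, 174, 178, 183, 188),
                    kg = c(58, 57, 64, 69, 72, 80, 84, 90))

# 1) Axis limits wide enough for both data sets
x.lim <- range(c(UWICE$cm, SFS$cm))
y.lim <- range(c(UWICE$kg, SFS$kg))

# 2) First data set and its regression line
plot(kg ~ cm, data = UWICE, xlim = x.lim, ylim = y.lim, col = "purple", pch = 16)
abline(lm(kg ~ cm, data = UWICE), col = "purple")

# 3) Second data set added to the existing plot
points(kg ~ cm, data = SFS, col = "blue", pch = 15)

# 4) Regression line for the second data set, in the matching color
abline(lm(kg ~ cm, data = SFS), col = "blue")
```

Without the shared limits in step 1, any SFS points outside the UWICE range would not be drawn.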

Page 10: GRAPHS & STATS

Add a Legend

'legend' is not an argument in plot(), so we add it as a separate line of code

legend(x="topleft", legend=c("UWICE","SFS"), fill=c("purple","blue"), inset=.02, bty="n")

what does each of these arguments mean?

Page 11: GRAPHS & STATS

You can put all the data in a single graph with different institutions represented by different colors

METHOD #2

1) Figure out the lower & upper limits for x- and y-axes

2) Plot the full data, setting the xlim and ylim arguments and 2 colors & pch values*

3) Add the lines (abline()), using colors matching each institution's points

4) Add the legend

* col=c("blue","purple")[HtWt$institute], pch=c(16,15)[HtWt$institute] (this indexing relies on HtWt$institute being a factor)
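A sketch of Method #2 on invented data. Indexing a vector by a factor uses the factor's integer codes, and factor levels are sorted alphabetically by default, so here "SFS" is level 1 and "UWICE" is level 2. Check levels() on the real data before relying on the order.

```r
# Hypothetical combined data (values invented for illustration)
HtWt <- data.frame(
  institute = factor(rep(c("UWICE", "SFS"), each = 5)),
  cm        = c(155, 161, 168, 172, 181, 160, 167, 174, 178, 188),
  kg        = c(50, 53, 62, 68, 79, 58, 64, 72, 80, 90)
)

print(levels(HtWt$institute))  # alphabetical: "SFS" then "UWICE"

# One color and one plotting symbol per row, chosen by the factor level
inst.col <- c("blue", "purple")[HtWt$institute]
inst.pch <- c(16, 15)[HtWt$institute]

plot(HtWt$cm, HtWt$kg, xlim = range(HtWt$cm), ylim = range(HtWt$kg),
     col = inst.col, pch = inst.pch, xlab = "cm", ylab = "kg")

# One regression line per institute, in the matching color
abline(lm(kg ~ cm, data = HtWt, subset = institute == "SFS"), col = "blue")
abline(lm(kg ~ cm, data = HtWt, subset = institute == "UWICE"), col = "purple")
```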

Page 12: GRAPHS & STATS

How would you change the legend boxes to match the points in the scatterplot (UWICE = circle, SFS = square)?
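One possible answer, as a sketch: replace the fill argument with pch and col, so the legend draws the same symbols as the plot. The data are invented, and the symbol assignment follows the question on this slide (UWICE = circle, pch 16; SFS = square, pch 15).

```r
# Hypothetical data (values invented for illustration)
UWICE <- data.frame(cm = c(155, 161, 168, 172, 181), kg = c(50, 53, 62, 68, 79))
SFS   <- data.frame(cm = c(160, 167, 174, 178, 188), kg = c(58, 64, 72, 80, 90))

plot(kg ~ cm, data = UWICE, col = "purple", pch = 16,
     xlim = range(c(UWICE$cm, SFS$cm)), ylim = range(c(UWICE$kg, SFS$kg)))
points(kg ~ cm, data = SFS, col = "blue", pch = 15)

# pch + col give point symbols in the legend; fill would draw filled boxes instead
leg <- legend(x = "topleft", legend = c("UWICE", "SFS"),
              col = c("purple", "blue"), pch = c(16, 15),
              inset = 0.02, bty = "n")
```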

Page 13: GRAPHS & STATS

Exporting Data

To transfer a matrix or data frame via the clipboard:
write.table(HtWt, "clipboard", sep="\t")
then paste into Excel

...to a tab-delimited text file:
write.table(HtWt, "c:/mydata.txt", sep="\t")
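A small sketch of the file export, written to a temporary folder so it runs on any system (the "clipboard" target shown above is a Windows feature). The data frame is invented. row.names = FALSE and quote = FALSE are optional additions that usually give a cleaner file for spreadsheets.

```r
# Hypothetical data (values invented for illustration)
HtWt <- data.frame(
  institute = c("UWICE", "UWICE", "SFS", "SFS"),
  cm        = c(172, 158, 180, 165),
  kg        = c(68, 52, 82, 60)
)

# Write a tab-delimited text file into R's temporary directory
out.file <- file.path(tempdir(), "mydata.txt")
write.table(HtWt, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back to confirm the export worked
check <- read.delim(out.file)
print(check)
```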

Page 14: GRAPHS & STATS

Intro to Statistics

Page 15: GRAPHS & STATS

Sample Statistic

Population Parameter (e.g., Height)

DESCRIPTIVE STATISTICS: What is the mean height of our sample of persons?

INFERENTIAL STATISTICS: Is the mean height of our sample a good measure of true population height?

Page 16: GRAPHS & STATS

DESCRIPTIVE STATISTICS

Mean: our best estimate of the true population mean

Standard Deviation: our best estimate of the true population variability

N

Standard Error

Confidence Interval

Page 17: GRAPHS & STATS

Mean: our best estimate of the true population mean

Standard Deviation: our best estimate of the true population variability

N

INFERENTIAL STATISTICS

Standard Error, Confidence Interval: how good is our estimate (from the sample) of the true population mean?

Page 18: GRAPHS & STATS

Descriptive Statistics

Summarize the data we have collected:

• mean, median, mode

• range, variance, standard deviation, interquartile range

• graphical summaries of the data (e.g., histogram, boxplot)

Why do we need it?

It’s difficult to just look at raw data and understand what they mean
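The summaries listed above map onto base R functions roughly as follows. The heights are invented for illustration. R has no built-in function for the statistical mode, so it is taken here from a frequency table.

```r
# Hypothetical sample of heights in cm (values invented for illustration)
heights <- c(158, 160, 161, 165, 165, 168, 172, 175, 180)

# Measures of centre
h.mean   <- mean(heights)
h.median <- median(heights)
h.mode   <- as.numeric(names(which.max(table(heights))))  # most frequent value

# Measures of spread
h.range <- range(heights)   # minimum and maximum
h.var   <- var(heights)     # sample variance
h.sd    <- sd(heights)      # sample standard deviation
h.iqr   <- IQR(heights)     # interquartile range

# Graphical summaries
hist(heights, main = "Histogram of heights", xlab = "cm")
boxplot(heights, ylab = "cm")
```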

Page 19: GRAPHS & STATS

Inferential Statistics

Use a sample of data to make conclusions and predictions about the population we sampled from

Often used to determine if there are differences between populations or if a ‘treatment’ affected a population

Why do we need it?

We often don’t have the time or money to collect data from the entire population we are interested in. For inferential statistics, conclusions are only reliable if we sampled properly!

Page 20: GRAPHS & STATS

Truth + Chance = Sample Statistic

We use the sample data to make our best prediction about the population (the data we don't have), and then quantify the chance that we’re wrong (standard errors & confidence intervals)

But no matter how fancy the statistics or how pretty the graphs, conclusions are only reliable if we sampled properly!

Page 21: GRAPHS & STATS

What is a Normal Distribution?

Page 22: GRAPHS & STATS

Does ‘Normal’ Exist?

Page 23: GRAPHS & STATS

Does ‘Non-Normal’ Exist?

Page 24: GRAPHS & STATS

When are Data Non-Normal?

When multiple processes or populations are combined in a single data set...

Heights of children aged 5 - 12

Page 25: GRAPHS & STATS

When are Data Non-Normal?

When the population has many values close to zero or some other natural limit...

Page 26: GRAPHS & STATS

When are Data Non-Normal?

When some extreme values skew the population... (here, also bounded by zero)

THE SUPER RICH

COMPANY EXECUTIVES

Page 27: GRAPHS & STATS

When are Data Non-Normal?

When the data follow a process that naturally generates non-normal distributions

POISSON DISTRIBUTION: Counts of rare events, e.g., accidents (lower bound of zero)

EXPONENTIAL DISTRIBUTION: Population growth

BINOMIAL DISTRIBUTION: Proportion (%) data

Page 28: GRAPHS & STATS

What Can We Do With Non-Normal Data?

• Check the data for errors; then

• Transform data to approximate a normal distribution; OR

• Apply nonparametric statistics

Page 29: GRAPHS & STATS

Mean: our best estimate of the true population mean

Standard Deviation: our best estimate of the true population variability

N

INFERENTIAL STATISTICS

Standard Error, Confidence Interval: how good is our estimate (from the sample) of the true population mean?

Page 30: GRAPHS & STATS

What is the Standard Error?

 

Standard Error (SE): sd / sqrt(n)

• Standard deviation of the sample means

• Tells us if the sample mean is a good estimate of the true population mean

• Used to calculate the 95% confidence interval
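A short sketch of the formula above, using invented heights. It shows that the standard error shrinks as the sample size grows even when the standard deviation stays about the same.

```r
# Hypothetical sample of heights in cm (values invented for illustration)
heights <- c(158, 160, 161, 165, 165, 168, 172, 175, 180)

n  <- length(heights)
se <- sd(heights) / sqrt(n)   # SE = sd / sqrt(n)

# Same standard deviation, but pretending the sample were four times larger:
# the standard error is halved
se.larger.sample <- sd(heights) / sqrt(4 * n)

print(c(SE = se, SE.with.4n = se.larger.sample))
```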

Page 31: GRAPHS & STATS

What is the 95% Confidence Interval?

 

• If we sample from the same population many times, about 95% of the samples will have confidence intervals that include the true population parameter

• The true population parameter (e.g., mean) is likely to be within the 95% CI of a sample (if the samples are unbiased). A wide 95% CI tells us that our sample mean is not a very reliable estimate of the true mean. With wide 95% CIs, it is hard to know from the samples whether or not two populations are truly different

95% Confidence interval: mean ± 1.96 × SE (for large samples; with small samples, the multiplier comes from the t-distribution and is larger than 1.96)
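The "sample many times" idea can be checked with a small simulation. This is a sketch under assumed values: the population mean, standard deviation, sample size and number of repetitions are all arbitrary choices made for illustration.

```r
set.seed(42)        # make the simulation repeatable

true.mean <- 170    # assumed population mean (arbitrary)
true.sd   <- 8      # assumed population standard deviation (arbitrary)
n         <- 30     # size of each sample
reps      <- 1000   # number of repeated samples

# For each repeated sample, build mean +/- 1.96 * SE and record
# whether that interval contains the true mean
covered <- replicate(reps, {
  x  <- rnorm(n, mean = true.mean, sd = true.sd)
  se <- sd(x) / sqrt(n)
  (mean(x) - 1.96 * se) <= true.mean && true.mean <= (mean(x) + 1.96 * se)
})

coverage <- mean(covered)   # proportion of intervals that include the true mean
print(coverage)
```

The proportion should come out close to 0.95. It tends to fall a little below with small n, which is why a t-based multiplier is used for small samples.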

Page 32: GRAPHS & STATS

Are plant heights significantly different between control & fertilized treatments?

N = 30 per treatment

Control: mean = 17.2, s = 2.1; 17.2 (95% CI 16.4 – 18.0)

Fertilized: mean = 18.9, s = 2.2; 18.9 (95% CI 18.1 – 19.7)

SIGNIFICANTLY DIFFERENT

N = 5 per treatment

Control: 17.2 (95% CI 14.6 – 19.8)

Fertilized: 18.9 (95% CI 16.2 – 21.6)

NOT SIGNIFICANTLY DIFFERENT
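The slide does not say how its intervals were calculated. As a sketch, a t-based interval (mean ± t × s / sqrt(N)) built from the means and standard deviations shown reproduces the slide's numbers to one decimal place. The helper name ci95 is made up for this example.

```r
# Hypothetical helper: t-based 95% confidence interval from summary statistics
ci95 <- function(m, s, n) {
  se     <- s / sqrt(n)
  t.crit <- qt(0.975, df = n - 1)   # t multiplier; approaches 1.96 as n grows
  c(lower = m - t.crit * se, upper = m + t.crit * se)
}

# N = 30 per treatment: narrow intervals that do not overlap
control.30    <- ci95(17.2, 2.1, 30)
fertilized.30 <- ci95(18.9, 2.2, 30)

# N = 5 per treatment: same means and standard deviations, much wider intervals
control.5     <- ci95(17.2, 2.1, 5)
fertilized.5  <- ci95(18.9, 2.2, 5)

print(round(rbind(control.30, fertilized.30, control.5, fertilized.5), 1))
```

With N = 5 the intervals overlap heavily, which matches the slide's "not significantly different" reading.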

Page 33: GRAPHS & STATS

Exploratory Data Analysis

Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:

• data errors/outliers

• if your data meet assumptions of parametric statistical tests

• interesting patterns

Before you do any statistical tests, you should already have an idea what the results will be

Page 34: GRAPHS & STATS

Errors in Data Collection / Entry

• Decimal in wrong place

• Same category spelled many ways

• Data collected in different measurement units

• Forgot to collect some data

• Numbers typed incorrectly when transferred from paper (sloppy handwriting, etc.)

Page 35: GRAPHS & STATS

OUTLIER: A data point that is much smaller or much larger than other data in the sample

Why do we care? A few outliers can change the sample mean, increase the variance of sample data, and change the p-value of a parametric statistical test

How do we find potential outliers in our data?

• Look at data ranges, histograms & boxplots

• For correlation & regression analyses, look at scatterplots
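A sketch of the first check, using invented heights in which one value has its decimal point in the wrong place (1720 instead of 172.0). boxplot.stats() applies the same rule that boxplot() uses to draw points beyond the whiskers; it flags candidates to investigate and is not proof of an error.

```r
# Hypothetical heights in cm; the last value is a deliberate data-entry error
heights <- c(158, 160, 161, 165, 168, 172, 175, 1720)

print(range(heights))        # the maximum is clearly implausible for a height in cm

# Values lying beyond the boxplot whiskers
flagged <- boxplot.stats(heights)$out
print(flagged)

# Effect of the single extreme value on the mean and standard deviation
with.outlier    <- c(mean = mean(heights), sd = sd(heights))
without.outlier <- c(mean = mean(heights[heights != flagged]),
                     sd   = sd(heights[heights != flagged]))
print(rbind(with.outlier, without.outlier))

boxplot(heights, ylab = "cm")
```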

Page 36: GRAPHS & STATS

p = 0.06

For correlations/regressions, outliers may fall within the normal range of data... but plotting the scatterplot reveals outliers

Page 37: GRAPHS & STATS

[Figure: two regression panels, one with p = 0.06 and one with p = 0.001]

A single outlier can change the regression equation and the significance of the relationship

Page 38: GRAPHS & STATS

WHAT SHOULD I DO WITH OUTLIERS?

[Flowchart]

Are data entered correctly? (YES / NO)

Are data from the population of inference? (YES / NO)

Remove outliers before analyzing data

Page 39: GRAPHS & STATS

WHAT SHOULD I DO WITH TRUE OUTLIERS?*

[Flowchart]

Transform data (if appropriate)

Use nonparametric statistics

Analyze data with and without outliers

Do study conclusions change? (YES / NO)

Keep outliers & report results

Report & discuss both results

Remove outliers & report results, but discuss your justification for removing the outliers

* ALWAYS KEEP GOOD RECORDS OF YOUR DATA EXPLORATION ACTIVITIES AND ANY CHANGES YOU MAKE TO THE ORIGINAL DATA!

Page 40: GRAPHS & STATS

OUTLIERS

It is wrong to remove outliers from analyses just because they don't fit with the other data!

Outliers can tell us interesting information about a population—conduct more research to understand what causes these unusual data.

Page 41: GRAPHS & STATS

Exploratory Data Analysis

Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:

• data errors/outliers

• if your data meet assumptions of parametric statistical tests

• interesting patterns

Before you do any statistical tests, you should already have an idea what the results will be

Page 42: GRAPHS & STATS

Do data come from a normally distributed population?

• Sample data are assumed to represent the distribution of the population. Non-normal data are not 'wrong', they just represent processes that naturally generate other types of distributions.

• With small sample sizes, it can be difficult to tell if data come from a normally distributed population. Consider what you know about the underlying process.

Page 43: GRAPHS & STATS

Evaluating Normality

Understand which processes generate non-normal data, then...

• Visual assessment:
  o histograms
  o normal Q-Q plots

• Normality tests:
  o Shapiro-Wilk (shapiro.test())
  o Anderson-Darling (from pkg ‘nortest’)
  o Pearson chi-square (from pkg ‘nortest’)
  o Kolmogorov-Smirnov (from pkg ‘nortest’)
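A sketch of the visual checks and the Shapiro-Wilk test using base R only (the 'nortest' tests are not shown here). Both samples are simulated: one from a normal distribution and one from an exponential distribution, as an example of strongly skewed data.

```r
set.seed(1)   # make the simulated samples repeatable

normal.sample <- rnorm(200, mean = 170, sd = 8)   # simulated "normal" data
skewed.sample <- rexp(200, rate = 1)              # simulated skewed data, bounded by zero

# Visual assessment: histograms and normal Q-Q plots side by side
old.par <- par(mfrow = c(2, 2))
hist(normal.sample, main = "Normal sample")
hist(skewed.sample, main = "Skewed sample")
qqnorm(normal.sample); qqline(normal.sample)   # points should follow the line
qqnorm(skewed.sample); qqline(skewed.sample)   # points curve away from the line
par(old.par)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw.normal <- shapiro.test(normal.sample)
sw.skewed <- shapiro.test(skewed.sample)
print(c(normal = sw.normal$p.value, skewed = sw.skewed$p.value))
```

With small samples these tests have little power, so the plots and what is known about the underlying process still matter.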

Page 44: GRAPHS & STATS

Exploratory Data Analysis

Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:

• data errors/outliers (for scatterplots, graph it!)

• if your data meet assumptions of parametric statistical tests

• interesting patterns

Before you do any statistical tests, you should already have an idea what the results will be

Page 45: GRAPHS & STATS
Page 46: GRAPHS & STATS

T-Test

Page 47: GRAPHS & STATS

T-Test

For determining if population means are different

Two-sample t-test: Compare the means of two independent groups (don’t need same sample sizes), e.g., is the mean height of Bhutanese college men different from that of USA college men?

Paired t-test: Compare the means of paired groups, e.g., is the mean weight of USA college men different before and after a 3-month exercise & diet program?
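To illustrate the difference between the two tests, here is a sketch with invented before/after weights for six people. Each position in the two vectors belongs to the same person, which is what makes the data paired.

```r
# Hypothetical weights (kg) of the same six people before and after a program
before <- c(82, 75, 90, 68, 77, 85)
after  <- c(79, 74, 86, 67, 75, 80)

# Paired t-test: works on the within-person differences (before - after)
paired <- t.test(before, after, paired = TRUE)
print(paired)

# Treating the same numbers as two independent groups ignores the pairing
unpaired <- t.test(before, after)
print(unpaired)
```

In this invented example the paired test gives a much smaller p-value, because the large differences between people are removed from the comparison.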

Page 48: GRAPHS & STATS

Is the height of shrubs different in grazed and ungrazed areas?

[Figure: study plots labelled G (grazed) and U (ungrazed)]

          GRAZED   UNGRAZED
          56       72
          27       26
          51       64
          40       42
          32       46

AVERAGE   41.2     50

Page 49: GRAPHS & STATS
Page 50: GRAPHS & STATS

TWO SAMPLE T-TEST

STEP ONE: Look at the data! (Make the point plot)

STEP TWO: Do the t-test: t.test (grazed, ungrazed)
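The two steps can be sketched with the shrub values from the table earlier in the deck. stripchart() is one possible way to make the point plot; the slides do not say which function was used. By default, t.test() runs Welch's version of the two-sample test, which does not assume equal variances.

```r
# Shrub measurements from the grazed / ungrazed table
grazed   <- c(56, 27, 51, 40, 32)
ungrazed <- c(72, 26, 64, 42, 46)

# STEP ONE: look at the data (one point per shrub, one column per treatment)
stripchart(list(grazed = grazed, ungrazed = ungrazed),
           vertical = TRUE, pch = 16, ylab = "Shrub height")

# STEP TWO: two-sample t-test (Welch's test is the default in R)
result <- t.test(grazed, ungrazed)
print(result)

# Pieces typically reported: group means and SDs, t statistic, df, p-value
print(c(mean.grazed = mean(grazed), sd.grazed = sd(grazed),
        mean.ungrazed = mean(ungrazed), sd.ungrazed = sd(ungrazed)))
```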

Page 51: GRAPHS & STATS

[METHODS] "We conducted a two-sample t-test to compare shrub height in grazed versus ungrazed plots."

[RESULTS] "We did not find a significant difference in shrub height between grazed (M = 41.20; SD = 12.28) and ungrazed (M = 50.00; SD = 18.28) plots in this study, t(7) = -0.89; p = 0.40."

NOTE: Remember to report sample size for each treatment in METHODS.