
Introduction to Statistics

What Statistics Books Try To Teach You But Don't

Joe King

University of Washington


Contents

Part I  Introduction to Statistics

1 Principles of Statistics
  1.1 Variables
    1.1.1 Types of Variables
    1.1.2 Sample vs. Population
  1.2 Terminology
  1.3 Hypothesis Testing
    1.3.1 Assumptions
    1.3.2 Type I & II Error
    1.3.3 What does Rejecting Mean?
  1.4 Writing in APA Style
  1.5 Final Thoughts

2 Description of A Single Variable
  2.1 Where's the Middle?
  2.2 Variation
  2.3 Skew and Kurtosis
  2.4 Testing for Normality
  2.5 Data
  2.6 Final Thoughts

Part II  Correlations and Mean Testing

3 Relationships Between Two Variables
  3.1 Covariance
  3.2 Pearson's Correlation
  3.3 R Squared
  3.4 Point Biserial Correlation
  3.5 Spurious Relationships
  3.6 Final Thoughts

4 Means Testing
  4.1 Assumptions
  4.2 T-Test
    4.2.1 Independent Samples
    4.2.2 Dependent Samples
    4.2.3 Effect Size
  4.3 Analysis of Variance

Part III  Latent Variables

5 Latent Constructs and Reliability
  5.1 Reliability

Part IV  Regression

6 Regression: The Basics
  6.1 Foundation Concepts
  6.2 Final Thoughts
  6.3 Bibliographic Note

7 Linear Regression
  7.1 Basics of Linear Regression
    7.1.1 Sums of Squares
  7.2 Model
    7.2.1 Simple Linear Regression
    7.2.2 Multiple Linear Regression
  7.3 Interpretation of Parameter Estimates
    7.3.1 Continuous
      7.3.1.1 Transformation of Continuous Variables
        7.3.1.1.1 Natural Log of Variables
    7.3.2 Categorical
      7.3.2.1 Nominal Variables
      7.3.2.2 Ordinal Variables
  7.4 Model Comparisons
  7.5 Assumptions
  7.6 Diagnostics
    7.6.1 Residuals
      7.6.1.1 Normality of Residuals
        7.6.1.1.1 Tests
        7.6.1.1.2 Plots
  7.7 Final Thoughts

8 Logistic Regression
  8.1 The Basics
  8.2 Regression Modeling Binomial Outcomes
    8.2.1 Estimation
    8.2.2 Regression for Binary Outcomes
      8.2.2.1 Logit
      8.2.2.2 Probit
      8.2.2.3 Logit or Probit?
    8.2.3 Model Selection
  8.3 Further Reading
  8.4 Conclusions


Part I

Introduction to Statistics


Chapter 1

Principles of Statistics

Statistics is scary to most students, but it does not have to be. The trick is to build up your knowledge base one step at a time, so you acquire the building blocks necessary to understand the more advanced statistics. This paper will go from a very simple understanding of variables and statistics to more complex analyses for describing data. This mini-book will give several formulas for calculating parameters, yet you will rarely have to work these out on paper or enter the numbers into a spreadsheet.

This first chapter will look at some of the basic principles of statistics: the basic concepts that are necessary to understand statistical inference. These may seem simple, and many readers will already be familiar with some of them, but it is best to start any work of statistics with the basic principles as a strong foundation.

1.1 Variables

First we start with the basics. What is a variable? Essentially, a variable is a construct we observe. There are two kinds of variables: manifest (or observed) variables and latent variables. Latent variables are ones we cannot measure directly; we infer them by measuring other manifest variables (socio-economic status is a classic example). Manifest variables we measure directly and can model, or we can use them to construct more complex latent variables; for example, we may measure parents' education and parents' income and combine those into the construct of socio-economic status.

1.1.1 Types of Variables

There are four primary categories of manifest variables: nominal, ordinal, interval, and ratio. The first two are categorical variables. Nominal variables are strictly categorical and have no discernible hierarchy or order to them; examples include race, religion, or states. Ordinal is also categorical, but it has a natural order to it. Likert scales (strongly disagree, disagree, neutral, agree, strongly agree) are one of the most common examples of an ordinal variable. Other examples include class status (freshman, sophomore, junior, senior) and level of education obtained (high school, bachelor's, master's, etc.).

The continuous variables are interval and ratio. These are not categorical with a set number of values; they can take any value between two values. An example of a continuous variable is an exam score: your score may take any value from 0% to 100%. An interval scale has no absolute zero, so we cannot make judgements about the ratio between two values. Temperature is a good example: Celsius and Fahrenheit have no meaningful absolute zero for the temperatures we experience, so we cannot say 30 degrees Fahrenheit is twice as warm as 15 degrees Fahrenheit. A ratio scale is still continuous but has an absolute zero, so we can make judgements about ratios: I can say a student who got an 80% on an exam did twice as well as a student who got a 40%.

1.1.2 Sample vs. Population

One of the primary interests in statistics is to try to generalize from our sample to a population. A population does not always have to be the population of a state or nation as we usually think of the word. Let's say, for example, the head of UW Medicine came to me and asked me to do a workplace climate survey of all the nursing staff at UW Medical Center. While there are a lot of nurses there, I could conceivably give my survey to each and every one of them. This would mean I would not have a problem of generalizability, because I would know the attitudes of my entire population.

Unfortunately statistics is rarely this clean, and you will not have access to an entire population. Therefore I must collect data that is representative of the population I want to study; this is a sample. This distinction is important to note because different notation is used for samples versus populations. For example,


x̄ is generally a sample mean, while µ is used for the population mean. Rarely will you be able to know the population mean, which is where this becomes a huge issue. Many statistics books put all the notation at the beginning of the book, yet I feel this is not a good idea. I will introduce notation as it becomes relevant and specifically discuss it when it is necessary. Do not be alarmed if you find yourself coming back to earlier chapters to recall notation; it happens to everyone, and committing this to memory is truly a lifelong affair.

1.2 Terminology

There is also the matter of terminology. This will be discussed before the primary methods because the terminology can get confusing. Unfortunately statistics tends to change its terminology and to have multiple words for the same concept, which differ between journals, disciplines, and different coursework.

One area where this is most true is when talking about types of variables. We classified variables by how they are measured above, but how they fit into our research question is different. Basic statistics books still talk about variables as independent or dependent variables. Although these terms have fallen out of favor in a lot of disciplines, especially the methodology literature, they still bear weight and so will be discussed. We will talk about which variables are independent and dependent based on the models we run when we get to those models, but in general the dependent variable is the one we are interested in knowing about. In short, we want to know how our independent variables influence our dependent variable(s).

Now of course there are different names for the dependent and independent variables depending on what we are studying. Almost universally the dependent variable is called the outcome variable, which seems justified given it is the outcome we are studying. It is the independent variable which has been given many names: in many cases it is called the regressor (in regression models), predictor (again generally in regression), or covariate. I prefer the second term and don't like the third. The first one seems too tied to regression modelling and not as general as predictor. Covariate has different meanings in different tests, so in my opinion it can be confusing. Predictor can also be confusing because some people may conflate it with causation, which would be a very wrong assumption to make. I will usually use the term independent variable or predictor, for lack of better terms, as these are the more common ones you will see in the literature.

1.3 Hypothesis Testing

The basis from which we start our research is the null hypothesis. This simply says there is no relationship between the variables we are studying. When we "reject" the null hypothesis, we are saying we accept the alternative hypothesis, which says the null hypothesis is not true and there is a "significant" relationship between the variable(s) we are studying.

1.3.1 Assumptions

There are many types of assumptions that we must make in our analysis in order for our coefficients to beunbiased.

1.3.2 Type I & II Error

So we have a hypothesis associated with a research question. This mini-book will look at ways to explore hypotheses and how we can either support or not support them. First we must establish a few basics about hypothesis testing. We have to have some basis for determining whether the questions we are testing are true or not, yet we also don't want to make hasty judgements about whether our hypothesis is correct. This leads us to committing errors in our judgements, and there are two primary errors in this context. Type I error is when we reject the null hypothesis when it is correct. Type II error is when we fail to reject the null hypothesis when it is wrong. While we attempt to avoid both types of errors, the latter is more acceptable than the former. This is because we do not want to make a hasty decision and claim an important relationship between variables when none exists. If we say there is no relationship when in fact there is one, that is a more conservative mistake that hopefully future research will correct.
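As a rough sketch of what a Type I error rate means, we can simulate it in R (the data below are made up purely for illustration): if the null hypothesis really is true, a test at a significance level of 0.05 should falsely reject it about 5% of the time.

# Sketch: estimating the Type I error rate by simulation.
set.seed(42)
p_values <- replicate(10000, {
  x <- rnorm(30)          # group 1, drawn from the same population
  y <- rnorm(30)          # group 2, drawn from the same population
  t.test(x, y)$p.value    # two-sample t-test under a true null hypothesis
})
mean(p_values < 0.05)     # proportion of false rejections, roughly 0.05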

1.3.3 What does Rejecting Mean?

When we try to reject the null hypothesis we must first determine our significance level, which by convention is generally 0.05. This is done purely by convention, and it is currently debated whether it is still of practical use given today's computing technology. When we reject the null hypothesis, all we are saying is that the chance of finding a result as large or larger is less than the significance level. This does not mean that your research question merits any major practical effect. Rejecting the null hypothesis may be important, but so can failing to reject it. For example, if there was a school where lower income groups and higher income groups were performing "significantly differently" on exams 5 years ago, and I came in and tested again and found "no statistically significant differences," I would find that to be highly important. It would mean there was a change in the test scores and there is now some relative parity.

The next concern is practical significance. My result may be statistically significant, yet there may be no real reason to think it will make a difference if implemented in policy or clinical settings. This is where other measures come into play, like effect sizes, which will be discussed later. One should also note that larger sample sizes can make even a very small effect "statistically significant," and a small sample size can mask a significant result. All of these must be considerations. One should not take a black and white approach to answering research questions; a result is not simply significant or not.

1.4 Writing in APA Style

One thing to be cautious about is how to write up your results and present them in a manner which is both ethical and concise. This includes graphics, tables, and paragraphs. These should make the main points of what you want to say while not misrepresenting your results. If you are going to be doing a lot of writing for publication, you should pick up a copy of the APA Manual (American Psychological Association, 2009).

1.5 Final Thoughts

A lot was discussed in this first part. These concepts will be revisited in later sections as we begin to implement them. There are many books and articles which expand on these concepts further. I ask that you constantly keep an open mind as researchers and realize statistics can never tell us "truth"; it can only hint at it, or point us in the right directions, and the process of scientific inquiry never ends.


Chapter 2

Description of A Single Variable

So when we have variables we want to understand their nature. Our first job is to describe our data before we can start to do any tests. There are two things we want to know about our data: first, where the center of the mass of the data is, and second, how far from that center our data are distributed. The middle is calculated by the measures of central tendency (discussed momentarily); how far the data fall from that middle tells us how much variability there is in our data. This is also called uncertainty or the dispersion parameter. These concepts are more generally known as the location and scale parameters: location is the middle of the distribution, where on the real number line the middle lies; scale is how far away from the middle our data go. These concepts are common to all statistical distributions, although for now our focus is on the normal distribution. This is also known as the Gaussian distribution and is widely used in statistics for its satisfying mathematical properties and because it allows us to run many types of analyses.

2.1 Where’s the Middle?

The best way to describe data is to use the measures of central tendency, or what the middle of a set of values is. This includes the mean, median, and mode. The equation to find the mean is in 2.1. The equation has some notation which requires some discussion, as you will see it in a lot of formulas. The ∑ is the summation sign, which tells us to sum everything to its right. The i = 1 below the summation sign simply means start at the first value in the variable, and the N at the top means go all the way to the end (that is, the number of responses seen in that variable).

\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}    (2.1)

If we return to our x vector we get 2.2

\bar{x} = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3    (2.2)

Our mean is influenced by all the numbers equally, so our example of variable y would give a different mean by formula 2.3.

\bar{y} = (1 + 1 + 2 + 3 + 4 + 5)/6 = 16/6 = 2.67    (2.3)

The addition of the extra one weighed our mean down. As we will see, individual values can produce dramatic changes in our mean, especially when the number of values we have is low. Finally, we represent the mean in several ways: the Greek letter µ represents the population mean, while the mean of a sample can be denoted with a flat bar on top, so we would say x̄ = 3. The mean is also known as the expected value, so we can write it as E(x) = 3.

For categorical data there are two great measures. The first is the median, which is simply the middle number of a set, as for the set of values in 2.4.

1, 2, \underbrace{3}_{\text{Median} = 3}, 4, 5    (2.4)

Now if there is an even number of values we take the mean of the two middle values 2.5

1, 1, \underbrace{2, 3}_{\text{Median} = 2.5}, 4, 5    (2.5)

The mode is simply the most common number in a set. In the last example, 1 is the mode since it occurs twice while the other values occur once. You may get bi-modal data, where two numbers are tied as the most common, or data with even more modes.

These last two measures of the middle of a distribution are mostly of interest for categorical data. The mode is rarely useful for interval or ordinal data, although the median can be of help there. The mean is the most relevant for continuous data and the one that will be used a lot in statistics. The mean is more commonly referred to as the average; it is computed by taking the sum of all of the values and dividing by the number of values.
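As a rough sketch, these measures can be computed in R using the small vectors from the examples above; note that base R has no built-in function for the statistical mode, so a small helper is defined for illustration.

x <- c(1, 2, 3, 4, 5)
mean(x)      # 3
median(x)    # 3

y <- c(1, 1, 2, 3, 4, 5)
median(y)    # 2.5, the mean of the two middle values

# Base R's mode() does something unrelated, so define a helper for the statistical mode.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(y) # 1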

2.2 Variation

We now know how to get the mean, but much of the time we also want to know how much variation is in our data. When we talk about variation we are asking why we get different values in the data set. So, going back to our previous example of [1, 2, 3, 4, 5], we want to know why we got these values and not all 3s or 4s. A more practical example is why one student scores a 40 on an exam, another 80, another 90, another 50, and so on. This measure of variation is called variance. It is also called the dispersion parameter in the statistics literature, and the word dispersion will be used in discussion of other models.

The variance for the normal distribution is found by first taking the difference between each value and the sample mean. Those differences are then squared, and their sum is divided by the number of observations, as seen below in taking the variance of x. Taking the square root of the variance gives the standard deviation for the normal distribution. Formula 2.6 shows the equation for this.

\mathrm{Var}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}    (2.6)

Formula 2.7 below shows how we take the formula above and use our previous variable x to calculate the sample variance.


\mathrm{Var}(x) = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} = \frac{4 + 1 + 0 + 1 + 4}{5} = \frac{10}{5} = 2    (2.7)

A plot of the normal distribution, with lines marking the distances of 1, 2, and 3 standard deviations from the mean, is shown in 2.1.

Figure 2.1: Normal Distribution (bands at 1 standard deviation = 68.2%, 2 standard deviations = 95.4%, 3 standard deviations = 99.7%)

Now is when we start getting into the discussion of distributions, specifically the normal distribution. The standard deviation is one property of the normal distribution. The standard deviation is a great way to understand how data are spread out, and it gives us an idea of how close to the mean our sample is. The rule for the normal distribution is that 68% of the population will be within one standard deviation of the mean, 95% will be within two standard deviations, and 99% will be within three standard deviations. This is shown in Figure 2.1, which has a mean of 20 and a standard deviation of two.

There is another form of variation that is good to see: the interquartile range. This shows the middle 50% of the data, going from the 75th percentile down to the 25th percentile. One good graphing technique for this is a box and whisker plot, shown in Figure 2.2. The line in the middle of the box is the middle of the distribution, and the box is the interquartile range. The whiskers typically extend to the most extreme points within 1.5 times the interquartile range of the box, and the dots beyond the whiskers are plotted as outliers.

2.3 Skew and Kurtosis

Two other concepts which help us evaluate a single normal variable are skew and kurtosis. These are not talked about as much, but they are still important. Skew is when more of the sample lies on one side of the mean than the other. Negative skew is where the peak of the curve is to the right of the mean (the tail going to the left); positive skew is where the peak of the distribution is to the left and the tail goes to the right.

Kurtosis is how flat or peaked a distribution looks. A distribution with a more peaked shape is called leptokurtic, and a flatter shape is called platykurtic. Although skewness and kurtosis can make a distribution violate normality, they do not always.
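As a sketch, skewness and excess kurtosis could be computed as below; this uses one common moment-based definition (packages such as moments offer similar functions), so treat the exact formulas here as illustrative rather than the only convention.

# Moment-based skewness and excess kurtosis, using the population standard deviation.
skewness <- function(v) {
  m <- mean(v); s <- sqrt(mean((v - m)^2))
  mean((v - m)^3) / s^3
}
excess_kurtosis <- function(v) {
  m <- mean(v); s <- sqrt(mean((v - m)^2))
  mean((v - m)^4) / s^4 - 3
}

set.seed(1)
z <- rnorm(1000)
skewness(z)         # near 0 for symmetric data
excess_kurtosis(z)  # near 0 for normal data; positive = leptokurtic, negative = platykurtic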


Figure 2.2: Box and Whisker Plot

2.4 Testing for Normality

Can we test for normality? Well, we can, and should. One way is to use descriptive statistics and to look at a histogram. Below you can see a histogram of a normal distribution. We can overlay a normal curve on it and see whether the data look normal. This is not a "test" per se, but it gives us a good idea of what our data look like. This is shown in 2.3.


Figure 2.3: A Histogram of the normal distribution above with the normal curve overlaid

We could also examine a P-P plot. This is a plot with a line at a 45 degree angle going from the bottom left to the upper right of the plot; the closer the points are to that line, the closer to normality the distribution is. The same principle lies behind a Q-Q plot (Q meaning quantiles).
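A sketch of these checks in R, using simulated data for illustration:

set.seed(2)
v <- rnorm(200, mean = 20, sd = 2)

# Histogram with a normal curve overlaid
hist(v, freq = FALSE)
curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE)

# Q-Q plot: points close to the line suggest normality
qqnorm(v)
qqline(v)

# A formal test of normality
shapiro.test(v)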

2.5 Data

I will try to give examples of data analysis and its interpretation. One good data set is on cars released in 1993 (Lock, 1993); the names of the variables and more information on the data set can be found in the appendix.
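The variable names used in the regression chapters (Price, MPG.city, Fuel.tank.capacity) match the Cars93 data set that ships with R's MASS package; assuming that is the data in use, it could be loaded like this.

library(MASS)   # Cars93 is included with the MASS package
data(Cars93)
str(Cars93[, c("Price", "MPG.city", "Fuel.tank.capacity")])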

2.6 Final Thoughts

A lot of the concepts discussed here are necessary for a basic understanding of statistics, although you should not feel you have to have this entire chapter memorized; you may need to come back to these concepts from time to time. Do not focus on memorizing formulas either; focus on what the formulas tell you about the concept. With today's computing power, your concern will be understanding what the output is telling you and how to connect that to your research question. While it is good to know how the numbers are calculated, the point is to understand how to use them in your tests.


Part II

Correlations and Mean Testing


Chapter 3

Relationships Between Two Variables

In the first part of this book we just looked at describing variables. Now we look at how they are related and how to test the strength of those relationships. This is a difficult task, and it will take time to master not only the concepts but their implementation. Course homework is actually the easiest way to do statistics: you are given a research question, told what to run, and asked to report your results. In real analysis you will have to decide for yourself what test best fits your data and your research question. While I will provide some equations, it is best to look at them just to see what they are doing and what they mean; it is less important to memorize them. This first part will look at basic correlations and testing of means (t-tests and ANOVA).

Much of statistics is correlational research. It is research where we look at how one variable changes when another changes, yet causal inferences will not be assessed. It is very tempting to use the word cause or to imply some directionality in your research, but you need to refrain from it unless you have a lot of evidence to justify it, as the ethical standard for determining causality is high. If you wish to learn more about causality, see (Pearl, 2009a; Pearl, 2009b).

3.1 Covariance

Before discussing correlations we have to discuss the idea of covariance. One of the most basic ways to associate variables is by getting a variance-covariance matrix. A matrix is like a spreadsheet, with each cell holding a value. The diagonal going from upper left to lower right holds the variance of each variable (as it pairs the same variable in the row and the column), and the other cells hold the covariance between two variables. The idea of covariance is similar to variance, except we want to know how one variable varies with another: if one changes in one direction, how will the other variable change? Do note that we are only talking about continuous variables here (for the most part interval and ratio scales are treated the same and the distinction is rarely made in statistical testing, so when I mention continuous it may be either interval or ratio without compromising the analysis). The formula for covariance is in 3.1.

\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}    (3.1)

As one can see, it takes the deviations from the means, multiplies them together, and then divides by the sample size. This gives a good measure of the relationship between the two variables. While this concept is necessary and a bedrock of many statistical tools, it is not very intuitive: it is not standardized in any way that allows us to quickly grasp the relationship, and this is what leads us to correlations.
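A small sketch of this in R, on made-up vectors, showing both the covariance of two variables and a variance-covariance matrix; note that cov() divides by N − 1 rather than the N used in formula 3.1.

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Covariance as in formula 3.1 (dividing by N)
sum((x - mean(x)) * (y - mean(y))) / length(x)

# R's cov() divides by N - 1
cov(x, y)

# Variance-covariance matrix: variances on the diagonal, covariances off it
cov(data.frame(x, y))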

3.2 Pearsons Correlation

A correlation is essentially a standardized covariance. We take the covariance and divide it by the product of the standard deviations, as in 3.2:

r_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (3.2)

If we dissect this formula it is not as scary as it looks. The top of the equation is simply the covariance. The bottom is (essentially) the variation of x and the variation of y multiplied together; taking the square root converts that to the scale of standard deviations. This puts the correlation coefficient onto a metric of -1 to 1. A correlation of 0 means no association whatsoever; a correlation of 1 is a perfect correlation. So let's say we are looking at the association of temperatures between two cities: if the correlation were 1 and both cities were measured in the same units, then when city A's temperature went up by one degree, city B's would also go up by one degree (strictly, a correlation of 1 only guarantees a perfect linear relationship; the size of the change depends on the units and the slope). If the correlation is -1, it is a perfect inverse correlation, so if the temperature of city A goes up one degree, city B's goes DOWN one degree. In social science the correlations are never this clean or clear to understand. Since the metrics can differ between variables, one must be careful about when you do a correlation and how you interpret it. Also remember a correlation is non-directional: it treats the two variables symmetrically, so a correlation of .5 between city A and city B is the same correlation whether you think of A predicting B or B predicting A.

Pearson's correlations are reported with an "r" and then the coefficient, followed by the significance level; for example, r = 0.5, p < .05 if significant.
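In R this could look like the following sketch, on the same made-up vectors as above:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cor(x, y)        # Pearson's r
cor.test(x, y)   # r with a significance test, for reporting r and the p-value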

3.3 R Squared

When we get a Pearson's correlation coefficient we can take the square of that value, and that is what is called the percentage of variance explained. So if we get a correlation of .5, then the square of that is .25, and we can say that 25% of the variation in one variable is accounted for by the other variable. Of course, as the correlation increases, so does the amount of variance explained.

3.4 Point Biserial Correlation

One special case where a categorical variable can be used with Pearson's r is the point-biserial correlation. If you have a binary variable, you can calculate the correlation between its two categories and another variable, provided that other variable is continuous. This is similar to the t-test we will examine later. The test looks at whether or not there is a significant difference between the two groups of the dichotomous variable. When we ask whether it is "significant," we want to determine whether or not the difference is due to random chance. We already know there is going to be random variability in any sample we take, but we want to know whether the difference between the two groups is due to this randomness or reflects a genuine difference between the groups.

3.5 Spurious Relationships

So let's say we get a Pearson's r = .5; so what now? Can we say there is a direct relationship between the variables? No, because we don't know whether the relationship is direct or not. There are many examples of "spurious relationships." For example, if I looked at the rate of illness students report to the health center at their university and the relative timing of exams, I would most likely find a decent (probably moderate) correlation. Now, before any student starts using this statement as a reason to cancel tests: there is no reason to believe your exams are causing you to get sick! Well, what is it then? Something we DIDN'T measure: stress! Stress weakens the immune system, and stress is higher during periods of examinations, so you are more likely to get ill. If we just looked at correlations we would only be looking at the surface, so take the results but use them with caution, as they may not be telling the whole story.

3.6 Final Thoughts

This may seem like a short chapter given the heavy use of correlations, but much of the basics of this chapter will be used in future statistical analyses. One of the primary points to take from this is that correlation is not in any way measuring causality, and this point cannot be stressed enough. Correlations are a good way of looking at associations, but that is all; they help us explore data and work towards more advanced statistical models which can help us support or not support our hypotheses. While correlations can be used, use them with caution.


Chapter 4

Means Testing

This chapter goes a bit more into exploring the differences between groups. If we have a nominal or ordinal variable and we want to see whether its categories are statistically different on a continuous variable, there are several tests we can do. We already looked at the point-biserial correlation, which is one test. This chapter examines the t-test, which gives a bit more detail, and Analysis of Variance (ANOVA), which covers the case where the number of groups is greater than 2 (the letter denoting the number of groups is generally "k", as "n" denotes sample size, so ANOVA applies when k > 2, i.e. k ≥ 3). Here we want to know whether the difference in the means is statistically significant.

4.1 Assumptions

The first assumption we will make is that the continuous variables we are measuring are normally distributed, and we learned how to test that earlier. Another assumption we must make is called "homogeneity of variance." This means the variance is the same for both groups (it doesn't have to be exactly the same, but similar; it will be somewhat different due to randomness, but the question is whether the variances differ enough to be statistically different). If this assumption is untenable we will have to correct the degrees of freedom, which will influence whether our t-statistic is significant or not.

This can be shown in the two figures below. Figure 4.1 shows a difference in the means (means of 10 and 20) but with the same variance of 4.

Figure 4.1: Same Variance

Figure 4.2 has the same means, but one variance is 4 and the other is 16 (a standard deviation of 4).

4.2 T-Test

The t-test is similar to the point-biserial, as we want to know whether two groups are statistically different.

Figure 4.2: Different Variances

So we will look at the first equation, 4.1. The numerator is the difference between the two sample means; the denominator is the standard error of that difference, which combines the two sample variances: s² denotes the sample variance and n the sample size for each group.

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (4.1)

The degrees of freedom are given by 4.2.

df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}    (4.2)

The above equations assume unequal sample sizes and unequal variances. The equations simplify if you have the same variance or the same sample size in each group, although this generally occurs only in experimental settings where sample size and other parameters can be more strictly controlled.

In the end we want to see whether there is a statistical difference between groups. If we look at data from the National Educational Longitudinal Study of 1988 (baseline year), we can see how this works. If we look at the difference between genders on science scores, we can do a t-test, and we find there is a significant mean difference. The means by gender are in 4.1.

         Mean      SD
Male     52.1055   10.42897
Female   51.1838   10.03476

Table 4.1: Means and Standard Deviations of Male and Female Test Scores

Our analysis shows t(10963) = 4.712, p < .05. However, the test of whether the variances are the same is significant, F = 13.2, p < .05, so we have to use the results with equal variances not assumed. This changes our results to t(10687.3) = 4.701, p < .05. You can see the main difference is that our degrees of freedom dropped, and thus our t-statistic dropped.

This time it didn’t matter, our sample size was so large that both values were significant, but in some teststhis may not be the case. If the test of equal variances rejects the null hypothesis but the test of unequal

17

Page 18: Introduction to Statistics

CHAPTER 4. MEANS TESTING 4.3. ANALYSIS OF VARIANCE

variances does not reject, even if levenes test is not significant, you should really be cautious about how youwrite it up.
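A sketch of this workflow in R: the data below are simulated with means and standard deviations similar to Table 4.1 purely for illustration, since the NELS:88 data are not included here.

set.seed(3)
male   <- rnorm(500, mean = 52.1, sd = 10.4)
female <- rnorm(500, mean = 51.2, sd = 10.0)

var.test(male, female)                  # F test of equal variances
t.test(male, female, var.equal = TRUE)  # equal variances assumed (pooled t-test)
t.test(male, female)                    # Welch test: variances not assumed (R's default)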

4.2.1 Independent Samples

The above example was an independent samples t-test. This means the participants are independent of each other, and so their responses will be too.

4.2.2 Dependent Samples

This is a slightly different version of the t-test where you still have two means, but the samples are not independent of each other. A classic example of this is a pre-test, post-test design. Another is longitudinal data where a measure was collected in one year and the same test was given again at a later date.
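A minimal sketch of a dependent (paired) samples t-test in R, on simulated pre/post scores made up for illustration:

set.seed(4)
pre  <- rnorm(40, mean = 50, sd = 10)
post <- pre + rnorm(40, mean = 2, sd = 5)  # post-test scores correlated with pre-test

t.test(post, pre, paired = TRUE)           # paired t-test on the same cases measured twice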

4.2.3 Effect Size

The effect size r is used in this part. The equation for this is in 4.3:

r = \sqrt{\frac{t^2}{t^2 + df}}    (4.3)
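Applied to the t-test reported above (t = 4.701, df = 10687.3), this gives a very small effect despite the significant p-value; a quick check in R:

t_val <- 4.701
df    <- 10687.3
sqrt(t_val^2 / (t_val^2 + df))   # about 0.045, a very small effect size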

4.3 Analysis of Variance

Analysis of Variance (ANOVA) is used when you have more than two groups. Here we will look at what happens when we have race and standardized test scores. The problem we will encounter is determining which groups are significantly different. ANOVA adds some steps to the analysis (the equations get quite complex, so we will just go through the analysis steps). First you see whether any of the means are statistically different. This is called an omnibus test and follows the F distribution (the F and t distributions are similar to the normal but have "fatter tails," which allow for more outliers, but this is of little consequence to the applied analysis). We get an F statistic for both Levene's test and the omnibus test. In this analysis we get five group means, shown below in 4.2:

Race                        Mean    SD
Asian, Pacific Islander     56.83   10.69
Hispanic                    46.72    8.53
Black, Not Hispanic         45.44    8.29
White, Not Hispanic         52.91   10.03
American Indian, Alaskan    45.91    8.13

Table 4.2: Means and Standard Deviations of Race Groups Test Scores

Table 4.3 shows the mean differences. After we reject the omnibus test, we need to see which pairs of groups are significantly different. We do this with post-hoc tests. For simplicity I have put the results in a matrix where the numbers are the differences between the groups; those with an asterisk (*) beside them are statistically significant. This is not how SPSS presents it (it will give you rows), but this layout is easily made. There are many post-hoc tests one can do. The ones done below are Tukey and Games-Howell, and both reject the same mean-difference groups. There are a lot more post-hoc tests, but these two do different things: Tukey adjusts for different sample sizes, while Games-Howell corrects for heterogeneity of variance. If you do a few types of post-hoc tests and the results agree, this gives credence to your conclusion. If not, you should go back to see whether there is a real difference or re-examine your assumptions.


             Asian-PI    Hispanic   Black      White     AI-Alaskan
Asian-PI     0
Hispanic     10.1092*    0
Black        11.3907*    1.2815*    0
White        3.9193*     -6.1899*   -7.4714*   0
AI-Alaskan   10.9178*    0.8086     -0.4729    6.9985*   0
Note: PI = Pacific Islander; AI = American Indian; * = statistically significant

Table 4.3: Mean Differences Among Race Groups
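In R, the omnibus test and a Tukey post-hoc analysis could be run as in the sketch below; the data are simulated for illustration (the real analysis above used the NELS:88 scores, which are not included), and Games-Howell is not in base R.

set.seed(5)
score <- c(rnorm(50, 57, 10), rnorm(50, 47, 9), rnorm(50, 45, 8), rnorm(50, 53, 10))
race  <- factor(rep(c("Asian-PI", "Hispanic", "Black", "White"), each = 50))

fit <- aov(score ~ race)
summary(fit)    # omnibus F test across the group means
TukeyHSD(fit)   # Tukey post-hoc pairwise comparisons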


Part III

Latent Variables


Chapter 5

Latent Constructs and Reliability

Sometimes in statistics we have variables we want to study, but we can't measure them directly. This means we have to use multiple measures (called manifest variables) which come together to measure the construct we are trying to understand. Unfortunately we can't say for certain that the variables we measure are informing on the overall construct we want to test, so we need measures to test this. Some examples of latent variables include socio-economic status and intelligence. We can't measure socio-economic status directly, but we can look at income, education, neighborhood, and other measures to get an overall gauge of the construct.

5.1 Reliability

One measure of reliability (also called internal consistency) is Cronbach's alpha. It is on a metric from 0 to 1; the closer to one, the more reliable the measure is. A measure less than .7 is generally considered too weak to be reliable. The required level of reliability depends on your measure (some tests that are critical, like standardized test scores, may have stricter requirements). In the end it comes down to the researcher to defend whether a measure is reliable or not.
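As a sketch on simulated items, Cronbach's alpha can be computed directly from the item covariance matrix; the alpha() function in the psych package is a commonly used alternative that reports the same kind of result.

set.seed(6)
common <- rnorm(200)   # a shared "true score" driving all items, for illustration
items  <- data.frame(i1 = common + rnorm(200),
                     i2 = common + rnorm(200),
                     i3 = common + rnorm(200),
                     i4 = common + rnorm(200))

C <- cov(items)
k <- ncol(items)
alpha <- (k / (k - 1)) * (1 - sum(diag(C)) / sum(C))  # Cronbach's alpha
alpha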


Part IV

Regression


Chapter 6

Regression: The Basics

Regression techniques make up a major portion of social science statistical inference. Regression models are also called linear models (this will be generalized later, but for now we will stick with linear models), as we try to fit a line to our data. These methods allow us to create models to predict certain variables of interest. This section will be quite deep, since regression requires a lot of concepts to consider, but as in past portions of this book we will take it one step at a time, starting with basic principles and moving to more advanced ones. The principle of regression is that we have a set of variables (known as predictors, or independent variables) that we want to use to predict an outcome (known as the dependent variable, though that term has fallen out of favor in more advanced statistics classes and works). We then have a slope for each independent variable, which tells us the relationship between that predictor and the outcome.

If you find yourself not understanding something, come back to the more fundamental portions of regression and it will sink in. This type of method is so diverse that people spend careers learning and using this modeling procedure, so it is not expected that you pick it up in one quarter; you are just laying the foundations for the use of it.

6.1 Foundation Concepts

So how do we try to predict an outcome? It comes back to the concept of variance. Remember that early in this book we looked at variance as simply variation in a variable: there are different values for different cases (i.e., different scores on a test for different students). Regression allows us to use a set of predictors to explain the variation in our outcome.

Now we will look at the equations themselves and the notation that we will use. The basic equation of a regression model (or linear model) is 6.1.

y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \varepsilon    (6.1)

This basic equation may look scary, but it is not. There are some basic parts to the equation which will be relevant to the future understanding of these models, so let us go left to right. The y is our outcome variable, the variable whose behavior we want to predict. The β₀ is the intercept of the model (where the regression line crosses the y-axis on a coordinate plane). The βᵢxᵢ term is actually two components together: the x's are the predictor variables, and the β's are the slopes for each predictor, which tell us the relationship between that predictor and the outcome variable. The summation sign is there, yet unlike other times it has been used, at the top is the letter p instead of n; this is because p stands for the number of predictors, not the number of cases. The ε is the "error" term, which accounts for the variability in the outcome that the predictors don't explain.

6.2 Final Thoughts

This brief chapter introduced regression as a concept, or more generally linear modeling. I don't say linear regression (which is the next chapter), as that is just one form of regression; many more types of regression will appear in future chapters. There are many books on regression, and at the end of each chapter I will note very good ones. One extraordinary one is Gelman and Hill (2007), which I refer to a lot in creating this chapter.


6.3 Bibliographic Note

Many books have been written on regression. I have used many as inspiration and references for this work, although much of the information is freely available online. On top of Gelman and Hill (2007) for doing regression, there are Everitt and Hothorn (2010), Chatterjee and Hadi (2006), the free book Faraway (2002), and other excellent books available for purchase, Faraway (2004) and Faraway (2005). More theory-based books are Venables and Ripley (2002), Andersen and Skovgaard (2010), Bingham (2010), Rencher and Schaalje (2008), and Sheather (2009). As you can tell, most of these books use R, which is my preferred statistical package. Some books focus on SPSS and do a good job at that, one notable one being Field (2009); more advanced but still very good are Tabachnick and Fidell (2006) and Stevens (2009). Stevens (2009) would not make a good textbook but is an excellent reference, including SPSS and SAS instructions and syntax for almost all multivariate applications in the social sciences, and is a necessary reference for any social scientist.


Chapter 7

Linear Regression

Let's focus for a while on one type of regression: linear regression. This requires us to have an outcome variable that is continuous and normally distributed. When we have a continuous, normally distributed outcome, we can use least squares to calculate the parameter estimates. Other forms of regression use maximum likelihood, which will be discussed in later chapters, although for this model the least squares estimates are also the maximum likelihood estimates.

7.1 Basics of Linear Regression

The first regression technique we will learn, and the most common one used, is for an outcome that is continuous in nature (interval or ratio, it does not matter). Linear regression uses an analytic technique called least squares. We will see how this works graphically and then how the equations give us the numbers for our analysis.

What linear regression does is look at the plot of x and y and try to fit a straight line that is closest to all of the points. Figure 7.1 shows how this is done. I randomly drew values for both x and y, and the line is the regression line that best fits the data. As the plot shows, the line doesn't fit perfectly; it is just the "best fitting line." The difference between the actual data and the line is what is termed the residuals, as it is what is not being captured by the model. The better the line fits and the less residual there is, the more strongly the predictor will predict the outcome.


Figure 7.1: Simple Regression Plot

7.1.1 Sums of Squares

When discussing the sums of squares we get two equations. The first is the sums of squares for the model, in 7.1. This is the difference between our predicted values and the mean, and it reflects how well our model is fitting; we want this number to be as high as possible.


SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2    (7.1)

The second is the sums of squares residual (or error). This is the difference between the predicted and actual values of the outcome; we want it to be as low as possible. It is shown in 7.2.

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (7.2)

The total sums of squares can be found by summing the SSR and SSE, or by 7.3.

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (7.3)

Table 7.1 shows how this can be arranged. We commonly report sums of squares and degrees of freedom along with the F statistic; the mean squares are less important but will be shown for the purposes of the examples in this book.

Source             Sums of Squares   DF          Mean Square           F Ratio
Regression         SSR               p           MSR = SSR / p         F = MSR / MSE
Residual (Error)   SSE               n - p - 1   MSE = SSE / (n-p-1)
Total              SST               n - 1

Table 7.1: ANOVA Table

7.2 Model

First let's look at the simplest model: if we had one predictor, it would be a simple linear regression, 7.4. As shown, β₀ is the intercept of the model, also called the "y intercept"; it is where on the coordinate plane the regression line crosses the y-axis when x = 0. The β₁ is the parameter estimate for the predictor beside it, the x. This shows the magnitude and direction of the relationship to the outcome variable. Finally, ε is the residual; this is how much the data deviate from the regression line. It is also called the "error term," the difference between the predicted values of the outcome and the actual values.

y = β0 + β1x1 + ε (7.4)

With more than one predictor we have multiple linear regression; with two or more predictors the model looks like 7.5. Note the subscript p stands for the number of parameters, so there will be a β and an x for each independent variable.

y = β0 + β1x1 + β2x2 + · · ·+ βpxp + ε (7.5)


7.2.1 Simple Linear Regression

If we have the raw data, we can find the equations by hand. While in the era of very high speed computersit is rare you will have to manually compute these statistics we should still look at the equations to see howwe derive the slopes. The slope below is how to calculate the beta coefficient for a simple linear regression.We square values so we get an approximation of the distance from the best fitting line as shown in 7.6. If wejust added the numbers up, some would be below the line, and some above giving us negative and positivevalues respectively so they would add to zero (as is one of the assumptions of error term). Squaring makessure we have this issue removed.

\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (7.6)

Equation 7.7 shows how the intercept is calculated in a simple linear regression. This is where the regression line crosses the y-axis when x = 0.

\beta_0 = \bar{y} - \beta_1 \bar{x}    (7.7)

Finally we come to our residuals. When we plug values of x into the equation, we get the "fitted values"; these are the values predicted by the regression equation, signified by ŷ. When we subtract the predicted (fitted) value from the actual outcome value, as in 7.8, we see how far our actual values fall from the line, and it gives us an idea of which values are furthest from the regression line.

\varepsilon = y - \hat{y}    (7.8)

We can also find how much of the variability in our outcome is being explained by our predictors. When we run this model we will get a Pearson's correlation coefficient (r). We can square this number (as we did in correlation) and get the amount of variance explained. This can be written in several ways; see 7.9.

r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (7.9)

We do need to adjust our r-squared value to account for the complexity of the model. Whenever we add a predictor, we will always explain more variance; the question is whether it is truly explaining variance for theoretical reasons or just randomly adding explained variation. The adjusted r-squared should be comparable to the non-adjusted value; if they are substantially different, you should look at your model more closely. The adjusted r-squared can be particularly sensitive to sample size, so smaller samples will show bigger differences in adjusted r-squared values. It is best to report both if they vary by a non-trivial amount.

\text{Adjusted } r^2 = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} = 1 - (1 - r^2)\frac{n - 1}{n - p - 1}    (7.10)
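As a sketch on toy data (the numbers below are made up for illustration), formulas 7.6 through 7.10 can be computed by hand and checked against R's lm():

x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope, formula 7.6
b0 <- mean(y) - b1 * mean(x)                                     # intercept, formula 7.7

y_hat    <- b0 + b1 * x      # fitted values
residual <- y - y_hat        # residuals, formula 7.8

SSE <- sum(residual^2)
SST <- sum((y - mean(y))^2)
r2  <- 1 - SSE / SST                                             # formula 7.9
n <- length(y); p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)                   # formula 7.10

c(b0 = b0, b1 = b1, r2 = r2, adj_r2 = adj_r2)
coef(lm(y ~ x))              # same slope and intercept from lm()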

We can look at an example with data. Let's look at our cars example: let's see if we can predict the price of a vehicle based on its miles per gallon (MPG) of fuel used while driving in the city.


> mod1<-lm(Price~MPG.city);summary(mod1)

Call:

lm(formula = Price ~ MPG.city)

Residuals:

Min 1Q Median 3Q Max

-10.437 -4.871 -2.152 1.961 38.951

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 42.3661 3.3399 12.685 < 2e-16 ***

MPG.city -1.0219 0.1449 -7.054 3.31e-10 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 7.809 on 91 degrees of freedom

Multiple R-squared: 0.3535, Adjusted R-squared: 0.3464

F-statistic: 49.76 on 1 and 91 DF, p-value: 3.308e-10

We find that it is a significant predictor of price. Our first test is similar to ANOVA: the F test. Here we reject the null hypothesis, F(1, 91) = 49.76, p < .001. We then look at the significance of our individual predictor. It is significant; here we report two statistics, the parameter estimate (β) and the t-test associated with it. Miles per gallon in the city is significant, with β = −1.0219, t(91) = −7.054, p < .001. The first interesting thing is that there is an inverse relationship: as one variable increases, the other decreases. Here we can say that for every one-MPG increase in city fuel economy, there is a drop in price of $1,000.¹ We can also look at the r² value to see how well the model is fitting. The r² = 0.3535 and the adjusted r² = 0.3464; while the adjusted value is slightly lower, it is not a major issue, so we can trust this value.

¹ I say $1,000 and not one dollar because this is the unit of measurement; be sure when interpreting data that you use the unit of measurement, unless the data are transformed (which will be discussed later).

7.2.2 Multiple Linear Regression

Multiple linear regression is similar to simple regression except that we place more than one predictor in the equation. This is how most models in social science are run, since we expect more than one variable to be related to our outcome.

y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \varepsilon    (7.11)

Let's go back to the data and add to our model above not only miles per gallon in the city but also fuel tank capacity.

> mod3<-lm(Price~MPG.city+Fuel.tank.capacity);summary(mod3)

Call:

lm(formula = Price ~ MPG.city + Fuel.tank.capacity)

Residuals:

Min 1Q Median 3Q Max

-18.526 -4.055 -2.055 2.618 38.669

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 10.1104 11.6462 0.868 0.38763

MPG.city -0.4608 0.2395 -1.924 0.05747 .

Fuel.tank.capacity 1.1825 0.4104 2.881 0.00495 **



---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 7.514 on 90 degrees of freedom

Multiple R-squared: 0.4081, Adjusted R-squared: 0.395

F-statistic: 31.03 on 2 and 90 DF, p-value: 5.635e-11

We reject the null hypothesis with F(2, 90) = 31.03, p < .05. We have r² = .408 and adjusted r² = 0.395, so this model is fitting well and we can explain around 40% of the variance with these two parameter estimates. Interestingly, miles per gallon fails to remain significant in the model, β = −0.4608, t(90) = −1.924, p = 0.057. This is one of those times where significance is close, and most people who hold rigidly to an alpha of .05 would say it isn't important. I don't hold such views; while this predictor seems less important than in the last model, it is still worth mentioning as a possible predictor, though in the presence of fuel tank capacity it has less predictive power. Fuel tank capacity is strongly related to price, β = 1.1825, t(90) = 2.881, p < .05. The relationship here is positive, so the greater the fuel tank capacity, the higher the price. We could speculate that larger vehicles, with larger capacity, will be more expensive. We have also seen consistently that miles per gallon in the city is inversely related, which may also reflect size: larger vehicles may get less fuel efficiency but be more expensive, while smaller cars may be more fuel efficient and yet cheaper. I am not an expert on vehicle pricing, so we will just trust the data from this small sample.

7.3 Interpretation of Parameter Estimates

7.3.1 Continuous

When a variable is continuous, interpretation is generally straightforward. We interpret the coefficient to mean that a one unit increase in the predictor corresponds to an increase in y of β units. So let's say you have the equation y = β0 + 2x + ε. Here the 2 is the parameter estimate (β), so we say that for each unit increase in x, y increases by 2 units. By "unit" we mean the original measurement scale of each variable. So if x is income in thousands of dollars and y is a test score, then each one thousand dollar increase in income (x) corresponds to a score 2 points higher on the exam.

This changes if we transform our variables. If we standardize our x values, we would say that each standard deviation increase in x increases y by two units. If we standardize both y and x, we would say that a one standard deviation increase in x corresponds to a two standard deviation increase in y.
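A minimal sketch of this in R, assuming the car data used above are attached so that Price and MPG.city are visible; scale() centers and standardizes a variable:

mod_zx  <- lm(Price ~ scale(MPG.city))         # only x standardized
mod_zxy <- lm(scale(Price) ~ scale(MPG.city))  # both standardized (a "beta weight")
coef(mod_zx); coef(mod_zxy)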

If we take the log of our outcome, then we would say a one thousand dollar increase in income corresponds to a 2 log-unit increase in y. Note that when statisticians (and almost all scientists) say log, they mean the natural log. To transform back to the original units you take the exponential function, so e^y if you had taken the log of the outcome (reasons for doing this will be discussed under testing assumptions). If we take the log of both y and x, then we can talk about percents: a one percent increase in x corresponds to a 2 percent increase in y. Although to get back to original units, exponentiation is still necessary.
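A short sketch of the back-transformation, again assuming the car data are attached; the model name mod_log and the value of 25 MPG are illustrative only:

mod_log  <- lm(log(Price) ~ MPG.city)
pred_log <- predict(mod_log, newdata = data.frame(MPG.city = 25))  # prediction in log units
exp(pred_log)                                                      # back to thousands of dollars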

Looking at the models above: in the simple linear regression with just MPG in the city, for each one MPG increase in the city, the price goes down by 1.0219 thousand dollars, because the coefficient is negative and the relationship is inverse. In our multiple regression model, for each one gallon increase in fuel tank capacity the price increases by 1.1825 thousand dollars, because that coefficient is positive.

7.3.1.1 Transformation of Continuous Variables

Sometimes it's necessary to transform our variables. This can be done to make interpretation easier or more relevant to our research question, or to allow our model to meet its assumptions.


7.3.1.1.1 Natural Log of Variables Here we will explore what happens when we take the log of continuous variables.

> mod2<-lm(log(Price)~MPG.city);summary(mod2)

Call:

lm(formula = log(Price) ~ MPG.city)

Residuals:

Min 1Q Median 3Q Max

-0.58391 -0.19678 -0.04151 0.19854 1.06634

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.15282 0.13741 30.223 < 2e-16 ***

MPG.city -0.05756 0.00596 -9.657 1.33e-15 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.3213 on 91 degrees of freedom

Multiple R-squared: 0.5061, Adjusted R-squared: 0.5007

F-statistic: 93.26 on 1 and 91 DF, p-value: 1.33e-15

Here we have taken the natural logarithm of our outcome variable. This will later be shown to be advantageous when looking at our assumptions and violations of them. It can also make model interpretation different and sometimes easier. Now, instead of the original units, the outcome is in log units, so we would say that for each one MPG increase, the price decreases by about 5.6% (100 × (e^−0.0576 − 1) ≈ −5.6%); the coefficient is negative, so the relationship is still inverse. Notice that the percent of variance explained increased dramatically, from 35% to 50%; this is due to the transformation.

> mod3<-lm(log(Price)~log(MPG.city));summary(mod3)

Call:

lm(formula = log(Price) ~ log(MPG.city))

Residuals:

Min 1Q Median 3Q Max

-0.61991 -0.21337 -0.03462 0.19766 1.05362

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.5237 0.4390 17.14 <2e-16 ***

log(MPG.city) -1.5119 0.1421 -10.64 <2e-16 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.3052 on 91 degrees of freedom

Multiple R-squared: 0.5543, Adjusted R-squared: 0.5494

F-statistic: 113.2 on 1 and 91 DF, p-value: < 2.2e-16

This model looks at what happens when we take the natural log of both the outcome and the predictor. This is also interpreted differently: now both estimates are in percents. So for each one percent increase in MPG in the city, the price decreases by about 1.512 percent. The model estimates have also changed because of our transformation.

7.3.2 Categorical

When our predictors are categorical, we need to be careful about how they enter the model. They cannot simply be added as numerical values or words; doing so would produce wrong estimates, because the model would treat them as continuous variables.

7.3.2.1 Nominal Variables

For nominal variables we must recode the levels of the factor. One way to do this is dummy coding, where one level is coded "1" on each dummy variable and the remaining levels are coded "0". If we denote the number of levels as k, the total number of dummy variables we can include for that factor is k − 1. For example, suppose we have a variable of different sporting events: football, basketball, soccer, and baseball. The total number of dummy variables we can have is 3. The coding, which statistics programs can do for us, is shown in Table 7.2.

Factor Levels    Dummy 1    Dummy 2    Dummy 3
Football         1          0          0
Basketball       0          1          0
Soccer           0          0          1
Baseball         0          0          0

Table 7.2: How Nominal Variables are Recoded in Regression Models using Dummy Coding

As you can see, the baseball level of our sports variable is coded all zeros. This is the baseline group against which the other groups are compared. Dummy coding works well when there is a natural baseline group (like treatment vs. control in medical studies), but ours does not have a natural baseline, so we can use another type of coding called contrast coding.

Factor Levels    Contrast 1    Contrast 2    Contrast 3
Football         -1            0             0
Basketball       0             -1            0
Soccer           0             0             -1
Baseball         1             1             1

Table 7.3: How Nominal Variables are Recoded in Regression Models using Contrast Coding

As you can see, each column sums to 0 across the factor levels. Of course, in real data sets the levels of a factor may not have equal numbers of observations; the different levels (or groups) may have different sizes. So if there were 25 participants who played football and only 23 baseball players, finding contrast values that sum to zero would be more difficult. Luckily, many software programs handle this type of coding automatically.
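A hedged sketch of how this looks in R. The sport variable and its levels are hypothetical, mirroring Tables 7.2 and 7.3; note that R's built-in sum-to-zero coding (contr.sum) may place the signs differently than Table 7.3, but each column still sums to zero:

sport <- factor(c("Football", "Basketball", "Soccer", "Baseball"),
                levels = c("Baseball", "Football", "Basketball", "Soccer"))
model.matrix(~ sport)             # default treatment (dummy) coding; Baseball is the all-zero baseline
contrasts(sport) <- contr.sum(4)  # switch to sum-to-zero (contrast) coding: 4 levels -> 3 columns
contrasts(sport)                  # lm() will now use this coding for sport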

If these dummy (or contrast) variables were our only predictors, the model would be equivalent to an Analysis of Variance, and the intercept would be the mean of the baseline group. This is not so if more predictors are added, as the model would then be an Analysis of Covariance.

7.3.2.2 Ordinal Variables

For ordinal variables, we can generally enter them in the model as a single variable without dummy coding. This is because the linearity assumption is relatively tenable: we expect the categories to be naturally ordered and increasing. The interpretation is that as you go up one category, the value of y changes by the amount of the parameter estimate (the beta coefficient for that variable).

7.4 Model Comparisons

In many research settings we want to know how much we add to the fit of a model when we add or remove one or more predictors. When we do model comparisons, we must ensure the models are nested: we add or remove predictor(s), but otherwise the models measure the same things. For example, in the models above we used MPG and fuel tank capacity. We may want to know how much adding fuel capacity improves the fit over the MPG-only model, or how adding MPG compares to a model with fuel capacity already in it. We cannot directly compare a simple regression containing only fuel capacity with another containing only MPG.
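A hedged sketch of such a comparison in R, assuming the car data used above are attached and mod1 is the simple MPG-only model; the two-predictor model is refit under a fresh name here because the text reuses the name mod3 for a later log model:

mod_full <- lm(Price ~ MPG.city + Fuel.tank.capacity)
anova(mod1, mod_full)  # partial F test: does adding Fuel.tank.capacity improve the nested model?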

7.5 Assumptions

The assumptions for regression depend on the type of regression being used. For continuous outcomes, the assumptions are that the errors are homoscedastic and normally distributed, that the outcome is linearly related to the predictors, and that observations are independent of one another. We first look at the assumptions of linear regression and how to test them, and then discuss corrections when they are violated.
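Before the formal tests below, R's built-in diagnostic plots give a quick visual check of several of these assumptions at once; a minimal sketch, assuming mod1 is the simple model fit earlier:

par(mfrow = c(2, 2))  # arrange the four diagnostic panels in a grid
plot(mod1)            # residuals vs. fitted, normal Q-Q, scale-location, and leverage plots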

7.6 Diagnostics

We need to make sure our model meets its assumptions, and we need to see whether we can correct for the times our assumptions are violated.

7.6.1 Residuals

First we need to look at our residuals. Remember that residuals are the observed y values minus the predicted y values. For this exercise I will use the cars data from above, as it is a good data set for discussing regression. For the purposes of examining our assumptions, let us stick with the simple regression in which price of vehicles is the outcome and miles per gallon in the city is the predictor. I will provide the R commands and output along with a discussion of them.
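A quick sketch of that definition, assuming the car data are attached and the simple model is stored as mod1; the check should return TRUE when there are no missing values:

res_by_hand <- Price - fitted(mod1)                      # observed minus predicted
all.equal(unname(residuals(mod1)), unname(res_by_hand))  # TRUE: the same quantity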

7.6.1.1 Normality of Residuals

First let's look at our assumption of normality. We assume our errors are normally distributed with mean 0 and some unknown variance. We can test this formally; my preferred test is the Shapiro-Wilk test, which works well for sample sizes from 3 to 5000 (Shapiro and Wilk (1965)).

7.6.1.1.1 Tests Let's look at the models above and see whether the normality assumption is met. First we test "mod1", which uses the variables in their original form.

> shapiro.test(residuals (mod1))

Shapiro-Wilk normality test

data: residuals(mod1)

W = 0.8414, p-value = 1.434e-08

As you can see, the results aren't pretty: we reject the null hypothesis for this test, W = 0.8414, p < .001, which means there is enough evidence to say that the sample deviates from the theoretical normal distribution the test expects. For this test the null hypothesis is that the sample does conform to a normal distribution, so unlike most testing, we do not want to reject it.


> shapiro.test(residuals (mod2))

Shapiro-Wilk normality test

data: residuals(mod2)

W = 0.9675, p-value = 0.02022

Fitting a second model with the log of the outcome helps some, but we still can't say our assumption is tenable, W = 0.9675, p < .05.

> shapiro.test(residuals (mod3))

Shapiro-Wilk normality test

data: residuals(mod3)

W = 0.9779, p-value = 0.1154

This time we cannot reject the null hypothesis, W = 0.9779, p > .05, so taking the log of both the outcome and the predictor lets us approximate the normal distribution; at the very least, there isn't enough evidence to say our distribution differs significantly from the theoretical (expected) normal distribution.

7.6.1.1.2 Plots Now let's look at plots. Two plots are important: a histogram and a QQ plot. The histogram lets us look at the distribution of residual values, and the QQ plot plots our residuals against what we would expect from a theoretical normal distribution. In the QQ plot, the line represents where we want our residuals to fall; points on the line indicate a match with the theoretical normal distribution.
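A minimal sketch of producing both plots for the first model, assuming mod1 from above:

hist(residuals(mod1), breaks = 20, freq = FALSE,
     main = "Distribution of Residuals", xlab = "residuals(mod1)")
qqnorm(residuals(mod1))  # residual quantiles against theoretical normal quantiles
qqline(residuals(mod1))  # the reference line the points should follow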

[Figure 7.2: Histogram of Studentized Residuals for Model 1 — density of residuals(mod1) alongside a Normal Q-Q plot]

The first set of plots shows us what we expected from the statistics above. Our residuals do not conform to a normal distribution: we can see heavy right skew in the residuals, and the QQ plot is very non-normal at the extremes.

[Figure 7.3: Histogram of Studentized Residuals for Model 2 — density of residuals(mod2) alongside a Normal Q-Q plot]

As we saw in the statistics, taking the log of our outcome made things better, but not quite enough to make the normality assumption tenable. We still see too much right skew in the distribution.

[Figure 7.4: Histogram of Studentized Residuals for Model 3 — density of residuals(mod3) alongside a Normal Q-Q plot]

This looks much better! Our distribution looks much more normal. The QQ plot still shows some deviation at the top and bottom, but the Shapiro-Wilk test gives no evidence against normality, so the assumption is tenable and this is OK.


7.7 Final Thoughts

Linear regression is used very widely in statistics, most notably because of the pleasing mathematical properties of the normal distribution. Its ease of interpretation and wide implementation in software packages add to its appeal. One should be cautious in using it, though, and check that the model's errors are (approximately) normally distributed.


Chapter 8

Logistic Regression

So now we begin to discuss what happens when our outcome is not continuous. Logistic regression deals with the case where the outcome is binary, that is, it can take on only one of two values (almost universally coded 0 and 1). This has many applications: graduate or not graduate, contract an illness or not, get a job or not, and so on. It does pose some problems for interpretation at times, because it is not as easy to study.

8.1 The Basics

We have to model events that take on values of 0 or 1. The problem with linear regression in this setting is that it requires us to fit a straight line, which cannot work since our values are bounded. This means we must move to a distribution other than the normal distribution.

8.2 Regression Modeling Binomial Outcomes

Contingency tables are useful when we have one categorical covariate, but they are not possible when we have a continuous predictor or multiple predictors. Even if there is one variable of primary interest in relationship to the outcome, researchers still try to control for the effects of other covariates. This leads to the use of a regression model to test the relationship between a binary outcome and one or several predictors.

8.2.1 Estimation

The basic regression model taught in introductory statistics classes is linear regression. This has a continuous outcome, and estimation is done by least squares: a line is fit to the data such that the (squared) differences between each data point and the line are at a minimum. With a binomial outcome we cannot use this estimation technique. The binomial model estimates proportions, which are bounded between 0 and 1, while a least squares model may give estimates outside these bounds. Therefore we turn to maximum likelihood and a class of models known as "Generalized Linear Models" (GLMs)¹.

$\underbrace{E(y)}_{\text{Random Component}} = \underbrace{\beta_0 + \sum_{i=1}^{p} \beta_i x_i}_{\text{Systematic Component}}$ ² (8.1)

The random component is the outcome variable; it is called the random component because we want to know why there is variation in this variable. The systematic component is the linear combination of our covariates and the parameter estimates. When our outcome is continuous we do not have to worry about establishing a linear relationship, as we assume one exists if the covariates are related to the outcome. When we have categorical outcomes we cannot have this direct linear relationship, so GLMs provide a link function that allows a linear relationship to exist on the transformed scale.

8.2.2 Regression for Binary Outcomes

Two of the most common link functions are the logit and probit functions. These allow us to look at a linear relationship between our outcome and our covariates. In Figure 8.1 you can see there is not a lot of difference between logit and probit; the difference lies in the interpretation of coefficients (discussed below). The green line shows how a traditional regression line is not an appropriate fit, because its predictions go outside the range of the data (the blue dots). The logit and probit fits model the probability of a success. The figure also shows that there is little difference in the actual model fit between the two, so logit and probit models will usually lead to very similar substantive conclusions; the primary difference is in the interpretation of the results. While we do not have a true r² coefficient, there is a pseudo r², created by Nagelkerke (1992), which gives a general sense of how much variation is being explained by the predictors.
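A hedged sketch of the kind of simulated comparison behind Figure 8.1; the coefficients and sample size here are made up for illustration:

set.seed(1)
x <- rnorm(200, mean = 5, sd = 5)
y <- rbinom(200, size = 1, prob = plogis(-2 + 0.5 * x))  # binary outcome generated on a logit scale
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))
fit_ols    <- lm(y ~ x)  # the straight-line fit, which can predict outside [0, 1]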

¹ For SPSS users, do not confuse this with the General Linear Model, which performs ANOVA, ANCOVA and MANOVA.
² Some authors use α to denote the intercept term, although β0 remains the most popular notation and will continue to be used here.



Figure 8.1: Logit, Probit and OLS regression lines; data simulated from R


8.2.2.1 Logit

The most common model in education research is the logit model, also known as logistic regression. There are two equations we can work with: equation 8.2 gives the log odds of a positive response (a "success").

$\text{logit}[\pi(x)] = \log\left(\dfrac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_p x_p$ (8.2)

The probability of a positive response is calculated from equation 8.3.

$\pi(x) = \dfrac{e^{\beta_0 + \beta_p x_p}}{1 + e^{\beta_0 + \beta_p x_p}}$ (8.3)

Fitted values (either log odds or probabilities) are usually what statistical programs report, and they use the covariate values from the sample. A researcher can also plug in covariate values for hypothetical participants and obtain a probability for those values. One caution: make sure the values you plug in are within the range of the observed data (i.e., if your sample ages are 18-24, do not solve the equation for a 26 year old), since the model was not fitted with data in that range.
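A hedged sketch of moving between the two scales, using simulated data and a made-up covariate value of 4; predict() with type = "response" applies equation 8.3 for you:

set.seed(1)
x <- rnorm(200, 5, 5)
y <- rbinom(200, 1, plogis(-2 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)
eta <- predict(fit, newdata = data.frame(x = 4))              # fitted log odds
plogis(eta)                                                   # e^eta / (1 + e^eta), the probability
predict(fit, newdata = data.frame(x = 4), type = "response")  # same probability, computed directly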

8.2.2.2 Probit

The probit function is similar, except that it assumes an underlying latent normal distribution and maps probabilities, which are bounded between 0 and 1, onto z scores, as shown in equation 8.4. Agresti (2007, p. 72) uses the example of a probability of 0.05, whose probit is -1.645; that is, 1.645 standard deviations below the mean.

$\Phi^{-1}[\pi(x)] = \beta_0 + \beta_p x_p$ (8.4)
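A quick numerical check of the Agresti example in R:

qnorm(0.05)    # -1.645: the probit (z score) for a probability of 0.05
pnorm(-1.645)  # back to roughly 0.05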

8.2.2.3 Logit or Probit?

As can be seen in Figure 8.1, the model fit for logistic and probit regression is very similar, and this is usually true. It is also possible to rescale coefficients to go from logit to probit or vice versa. Amemiya (1981) showed that dividing a logit coefficient by about 1.6 approximates the probit coefficient. Andrew Gelman (2006) ran simulations and found scaling factors between 1.6 and 1.8 to work well, which corresponds to Agresti (2007), who mentions the scaling being between 1.6 and 1.8.
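A hedged sketch of checking that scaling empirically; the data are simulated, so the exact ratio will vary from run to run but should land near the 1.6-1.8 range:

set.seed(2)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.3 + 1.2 * x))
b_logit  <- coef(glm(y ~ x, family = binomial(link = "logit")))["x"]
b_probit <- coef(glm(y ~ x, family = binomial(link = "probit")))["x"]
b_logit / b_probit  # the logit slope is roughly 1.6-1.8 times the probit slope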

8.2.3 Model Selection

Researchers tend to fit multiple models to find the best fitting model consistent with their theoretical framework, and there are several ways to evaluate which model fits best. Sequential model building is a technique frequently used to look at the effect of adding predictors to a regression model, and the same framework applies to other regression models as well. In linear regression the comparison of nested models uses an F test (since the null hypothesis of the model uses an F distribution); models estimated by maximum likelihood use the likelihood ratio test, which follows a chi-squared distribution. Shmueli (2010) examines the differences between building a model to explain the relationship of predictors to an outcome and building a model to predict an outcome from future data sources; the article also discusses information criteria such as the AIC and BIC measures used to assess model fit.
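A hedged sketch of comparing nested logistic models in R, with simulated data standing in for a real study:

set.seed(3)
x1 <- rnorm(300); x2 <- rnorm(300)
y  <- rbinom(300, 1, plogis(-0.5 + x1 + 0.5 * x2))
fit1 <- glm(y ~ x1,      family = binomial)
fit2 <- glm(y ~ x1 + x2, family = binomial)
anova(fit1, fit2, test = "Chisq")  # likelihood ratio (chi-squared) test of the added predictor
AIC(fit1, fit2)                    # information criteria: lower values indicate better fit
BIC(fit1, fit2)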

8.3 Further Reading

This chapter borrows heavily from Alan Agresti (2007), who is well known and respected for his work in categorical data analysis. Some books that cover many statistical models yet still do a good job with logistic regression are Tabachnick & Fidell (2006) and Stevens (2009); the first is great as a textbook, while Stevens is a dense book with both SPSS syntax and SAS code that works well as a reference. Gelman and Hill (2007) is rapidly becoming a classic in statistical inference; its computation is focused on R, which has not hit mainstream academia much, but the authors provide some supplemental material at the end of the book for other programs. For those interested in R, another great book is Faraway (2005). Andy Field (2009) has a classic book called "Discovering Statistics Using SPSS" which blends SPSS and statistical concepts very nicely and is good at explaining difficult statistical ideas. For students who wish to explore categorical data analysis conceptually, there are a few good books: I recommend Agresti (2002), which is a different book from his 2007 text, with a focus on theory yet still many great examples of application. Long's (1997) book explores maximum likelihood methods focusing on categorical outcomes.


It combines the conceptual and mathematical ideas of maximum likelihood. Finally, McCullagh and Nelder (1989) is a seminal work on generalized linear models (the citation here is to their well known second edition).

8.4 Conclusions

This chapter looked at logistic regression in an introductory manner. There is more to analyzing binomial outcomes, and reading some of the works above can help. This is especially important for researchers whose outcomes will be binomial. These principles also act as a starting point for learning about other categorical outcomes, such as nominal outcomes with more than two categories or ordinal outcomes (often used for Likert scales).


Bibliography

Agresti, A. (2007). An Introduction to Categorical Data Analysis. Hoboken, NJ: Wiley-Blackwell. doi:10.1002/0470114754

Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic Literature, 19(4), 1483–1536. doi:10.2298/EKA0772055N

Andersen, P. K., & Skovgaard, L. T. (2010). Regression with Linear Predictors (Statistics for Biology and Health). New York, NY: Springer.

Chatterjee, S., & Hadi, A. S. (2006). Regression Analysis by Example (4th ed.). Hoboken, NJ: Wiley-Interscience.

Everitt, B. S., & Hothorn, T. (2010). A Handbook of Statistical Analyses Using R (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC.

Faraway, J. J. (2005). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models (Chapman & Hall/CRC Texts in Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.

Faraway, J. J. (2004). Linear Models with R (Chapman & Hall/CRC Texts in Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.

Faraway, J. J. (2002). Practical Regression and ANOVA Using R.

Field, A. (2009). Discovering Statistics Using SPSS (Introducing Statistical Methods). Thousand Oaks, CA: Sage Publications.

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York, NY: Cambridge University Press.

Gelman, A. (2006). Take logit coefficients and divide by approximately 1.6 to get probit coefficients. Retrieved from http://www.andrewgelman.com/2006/06/take_logit_coef/

Lock, R. (1993). 1993 new car data. Journal of Statistics Education, 1(1). Retrieved from http://www.amstat.org/PUBLICATIONS/JSE/v1n1/datasets.lock.html

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: SAGE Publications.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed., Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Boca Raton, FL: Chapman and Hall/CRC.

Nagelkerke, N. J. D. (1992). Maximum Likelihood Estimation of Functional Relationships. New York, NY: Springer-Verlag.

Pearl, J. (2009a). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.

Pearl, J. (2009b). Causality: Models, Reasoning and Inference. Cambridge University Press.

Rencher, A., & Schaalje, B. (2008). Linear Models in Statistics (2nd ed.). Wiley-Interscience.

Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3-4), 591–611. doi:10.1093/biomet/52.3-4.591

Sheather, S. J. (2009). A Modern Approach to Regression with R. New York, NY: Springer. Retrieved from http://www.springerlink.com/content/978-0-387-09607-0

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.

Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences (5th ed.). New York, NY: Routledge Academic.

Tabachnick, B. G., & Fidell, L. S. (2006). Using Multivariate Statistics (5th ed.). Upper Saddle River, NJ: Allyn & Bacon.

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York, NY: Springer.
