

3/4/2016 Correlation

http://www.derekborman.com/230_web_book/module4/correlation/index.html 1/15

Chapter 13: Pearson's r Correlation And Regression

Objectives for This Chapter

- Understand how and why we use correlation analyses.
- Be able to construct and interpret a scatterplot.
- Understand how correlation magnitude and direction describe different types of relationships.
- Understand how to calculate and interpret the coefficient of determination.
- Be able to apply the steps of hypothesis testing to a correlation analysis.
- List and explain the assumptions of a correlation analysis.
- Apply regression concepts/calculations to prediction and decision-making.
- Distinguish among the alternative correlation analyses.

What will correlations do for me?

Suppose that you woke up one morning and discovered that you had been given the gift of being able to predict the future. Suddenly, you found yourself able to predict, with some degree of accuracy, things like:

- How long someone will live
- Whether the stock market will go up or down
- Whether someone will become a criminal
- Whether a surgery will prolong a cancer patient's life
- Whether a depressed person will commit suicide
- Whether a person will make a productive employee
- Whether a football team will make a first down on the next play
- Whether somebody's marriage will survive or end in divorce

Would you use your gift for good and not for evil? Or would you say, "No thanks! Not interested."? Well, in fact you are about to be given such a gift, and I am going to give it to you! It comes in the form of the correlation coefficient. Developed by Karl Pearson over a hundred years ago, it is officially called the Pearson Product Moment Correlation Coefficient. However, it is often referred to simply as Pearson's r. That's right. We've gone from z, to t, to F, and finally to r. No wonder statistics is so tough. Statisticians don't even understand the alphabet.

The correlation coefficient is an extremely powerful and useful statistical tool that allows you to evaluate whether or not a relationship exists between two things. These "things" are referred to as variables. Perhaps more importantly, not only does the correlation coefficient tell you whether or not a real (non-chance) relationship exists, it allows you to tell just how strong that relationship is. Finally (and here is where predicting the future comes in), if a real relationship exists between two variables, then the correlation coefficient allows you to make predictions about one variable based solely on knowledge of the other variable. In other words, if a relationship exists between a child watching violent television programs and his or her aggressive behavior in school, then the correlation coefficient will allow you to predict how violent a child would be in school based solely on how much violent television the child watched at home!

Think about the kind of power that gives you! Insurance companies use the correlation coefficient to predict how likely a person is to get into an accident. This enables them to set insurance rates so that high-risk drivers pay more. Mutual fund managers use correlation coefficients to decide what a stock's price is likely to do and use that information to decide whether to buy or sell stocks. Healthcare providers use correlation coefficients to decide whether or not someone will likely die if they do not get a certain surgery, and so they make decisions accordingly. Psychologists use correlation coefficients to predict whether or not a sexually violent predator (a person repeatedly convicted of sex crimes) is likely to commit another crime if he is released from prison.

How do scatterplots and correlations work?

Introduction to Scatterplots

Scatterplots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose. Scatterplots show how much one variable is affected by another. The relationship between two variables is called their correlation.

Scatterplots usually consist of a large body of data points. The plot is constructed by plotting two related, numeric variables on the x axis and y axis of a graph. The actual dots or data points represent the intersections of the values of the two variables. The closer the data points come (when plotted) to making a straight line, the higher the correlation between the two variables, or the stronger the relationship. When we talk about how strong a relationship is, we refer to this as the magnitude of a correlation. We'll get into more specifics shortly.

If the data points make a "line" going from lower left to upper right, then the variables are said to have a positive correlation. Positive correlations are frequently referred to as direct correlations. If the "line" goes from upper left to lower right, the variables have a negative correlation. Negative correlations are often referred to as inverse correlations. If there is no identifiable, linear trend, then we would conclude that there is no correlation between the two variables.

What else do we notice about the scatterplots? We see that both variables are numeric. Though we will talk briefly about a few other types of correlations, in this chapter we are going to focus on the bivariate correlation. The bivariate correlation is always an analysis of the relationship between two numeric variables on either an INTERVAL or RATIO scale.

Positive Correlation Example

[Figure: school_salary_scatterplot.png — years of education vs. starting salary]

The best way to give you an idea of how correlation works is to discuss a scatterplot example or two. Suppose that you are interested in answering the question: Is a person's starting salary (the amount they are paid when they start their first job) related to how many years of education they have successfully completed? In other words, you are interested in finding out if more school will translate into more money when you get a job.

As you look at this scatterplot, does anything jump out at you? Can you say anything about the relationship between a person's starting salary and the number of years of education that he or she successfully completes? Does it seem like they are related at all? Who would you say tended to have higher starting salaries, people with less education or more education?

A careful review of the graph clearly indicates that people who had less education (less than a high school diploma) tended to also have lower starting salaries. On the other hand, as the amount of education a person got went up, so did their starting salary. People with the most education also tended to have the highest starting salaries. This is an example of a positive correlation or positive relationship.

In answer to our question about whether or not there is a relationship between the amount of education one receives and their starting salary, the evidence we have presented would suggest that "Yes, there is a positive relationship between amount of education and a person's starting salary." As a result, you might conclude that you should pursue as much education as you can, because it is likely to mean that you will receive a higher salary!

Negative Correlation Example

Okay, now you know what a positive relationship is; it is a relationship where, as values on one variable (e.g., years of education) go up, the values on the other variable (e.g., starting salary) also go up. There is another kind of relationship besides a positive relationship: a negative relationship.

Suppose you were interested in using data to find out if there is really a relationship between a person's exercise patterns and their risk of heart attacks. What you are really asking is: Is there a relationship between the average number of minutes of exercise a person gets each week and that person's risk of coronary heart disease? What do you think would be the answer to this question? Just based on what you hear on television and read in newspapers and magazines, you probably would conclude that there is a relationship between how much a person exercises and his or her risk of heart disease, right?

[Figure: exrcse_chol_scatterplot.png — minutes of daily exercise vs. cholesterol level]

Well, you could use the very same procedure we looked at above to answer the question. Suppose we asked 30 people the following questions: How many minutes, on average, do you exercise each day? What is your cholesterol level? If we ask a bunch of people these two questions, then we will have two related, numeric data points for each participant in our study. You are ready to put your real question to the test. You could plot each person's data on a scatterplot. You may very well see something similar to what you see here.

Do you notice anything about what happens to a person's cholesterol level as their amount of daily exercise increases? Look at people who exercise less than 12.96 minutes a day: do they tend to have higher or lower cholesterol levels than those who exercise more than 12.96 minutes a day?

Based on these data, it would appear that as the number of minutes a day a person exercises goes up, their cholesterol level goes down. This is an example of a negative relationship. You can generally identify a negative relationship in a scatterplot because the tendency is for the dots to go from the upper left to the lower right portion of the graph.

Spring Break!!

Now that you know how to see if there is a positive or negative relationship between two variables, the next question to ask is: How strong is that relationship? In other words: What is the magnitude of the correlation? Why is this an important question? Let's take a look at another example.


Suppose Spring Break is fast approaching and you and some of your friends are thinking about heading south to Cancun, Mexico, for some rest, relaxation and fun. Being the wise consumer/traveler that you are, you ask yourself, "Is there a relationship between how expensive a hotel is and how much people enjoy their stay?" Your thinking is that if there is a relationship showing that the more people pay, the more fun they tend to have, then you may want to pay a little more for your hotel. On the other hand, if people who stay at cheaper hotels have as much as or more fun, why pay the extra bucks?

[Figure: hotel_fun_scatter1.png — hotel cost vs. reported fun, a weak positive relationship]

To answer your question, you ask a bunch of people who have been to Cancun over spring break about how much their hotel cost and also ask them to rate how much fun they had on a scale of 0 to 10 (0 = no fun at all and 10 = so much fun that your head exploded). Next, you plot their responses on a scatterplot. This plot provides an example of a "weak" positive relationship. Notice that you can tell that there is a positive relationship between hotel cost and how much fun people who have stayed there reported (as one variable goes up so does the other). HOWEVER, notice that there is a lot of variability. What I mean by variability is that there are a lot of people who only paid around $500 who had almost as much fun or more fun than people who paid much more!

[Figure: hotel_fun_scatter2.png — hotel cost vs. reported fun, a strong positive relationship]

But what if your results had turned out differently? What if your results indicated a different relationship between the variables that you were investigating? This is an example of a "strong" positive relationship. In a strong correlation (whether it is positive or negative) the dots tend to group to form more of a straight line. Notice how in this scatterplot there is a very clear trend that tells you: As people pay more for their hotel, they also tend to report having more fun during their stay.

Why is it so important to know whether a relationship is strong or weak? Because it allows us to make predictions! Let's get back to our Spring Break example. Suppose that you have a certain amount of money to spend and you want to be as certain as possible to get the most for your money. Should you pay top dollar for your hotel or go on the cheap side? If you had gotten the results depicted in the first scatterplot, you might be less inclined to spend a lot on your hotel, because it is difficult to predict with accuracy how much more fun you would have in an expensive hotel. On the other hand, the results in the second scatterplot depict a strong, positive relationship; it is very clear that higher price is related to more fun. Given this information, one could clearly say that people who paid $900 reported having much more fun than people who paid $500. Therefore, you could predict that if you paid top dollar for your lodging, you would have more fun. Note, this does not mean you will have more fun, only that it appears likely that you will. You cannot say that paying more will cause you to have more fun, only that since it appears cost and fun are related, it seems reasonable to predict that you may have more fun.

The main point of this is that a strong relationship allows you to make much more accurate predictions than a weak relationship. And if I may also add... ACT RESPONSIBLY ON SPRING BREAK! There. I said it and I'm glad that I did.

This chapter incorporates multiple excerpts from The Radical Statistician: A Beginner's Guide to Unleashing the Power of Applied Statistics in The Real World (5th Ed.). Jim Higgins Publishing, 2006.

Create your own scatterplot


In this activity, you will be able to create scatterplots for correlated data of different sample sizes and computed r values (correlation coefficients) of different strengths and directions. Just use the appropriate sliders. You can even create your own scatterplot by entering x and y data points for each dot on the scatterplot. If you want to see some real-world data depicted, open up one of the sample data sets. Get in touch with your inner stats geek and enjoy!!


What do I do with a correlation coefficient?

As it turns out, we need more than scatterplots to really make the best use of correlation analyses. After all, scatterplots can be time-consuming to construct. And while scatterplots can be useful for getting an idea about whether or not a relationship exists between two variables, it can be very difficult to differentiate between weak relationships (which are still important) and instances where there is no relationship. What we need is a way to take all the information we have collected and somehow reduce it to a single summary number that tells us:

1. Is there a relationship between the two variables?
2. Is the relationship, if it exists, positive or negative?
3. How strong is the relationship?

In addition, it would be great if the same number could be used to actually make predictions about one variable based on information about the other. For example, wouldn't it be great if this number would allow us to decide how much fun we want to have in Cancun and have it tell us how much we would need to spend on our hotel? As a matter of fact, there is such a number. It is called the correlation coefficient, and we're going to talk about that right now.

A correlation coefficient is a single summary number that gives you a good idea about how closely one variable is related to another variable. The statistical letter associated with the correlation coefficient is r, sometimes referred to as Pearson's r. Just as we have done in past chapters, we will assess computed and critical statistical values. In this chapter we will assess r-comp and r-crit values. We will also become familiar with a new table of critical values associated with the r distribution.

In order for you to be able to understand this new statistical tool, we will need to start with a scatterplot and then work our way to the correlation coefficient. We will not be manually calculating the correlation coefficient but rather discussing its characteristics and uses.

Should I tell my boss to take this job and shove it?


Okay. So, you've been working at the same place for three years. You see the same people, day in and day out. You know what they have for lunch each day. You even know about their weird little habits, like when Bill and Tom have their weekly contest to see who can endure having their finger stapled the most times. There is certainly a lot to consider in deciding whether to stay with this company.

Now, a good statistician would want to collect and analyze some data before doing anything rash. You know that you're going to conduct a correlation analysis. So, what two numeric variables would you want to relate? What variables would allow you to predict an important factor related to your possible future with the company? We could start by trying to find out if there is any kind of relationship between the number of months people have worked for the company and their hourly wage using a scatterplot. Let's look at the data and the scatterplot.

[Figure: data_time_money.jpg — months with company and hourly wage data]
[Figure: time_wage_scatterplot.png — months working for company vs. hourly wage]

Okay. Pretty much what we expected. It appears that we have a positive correlation between Months Working for Company and Hourly Wage. In other words, the longer you are with the company, the more you get paid. Now, don't you feel bad for wanting to march in and give your boss an atomic wedgie? Well, don't be too hard on yourself just yet. You need to dig into this a little more to really figure out what's going on here.

Correlation does not equal causation

Note that these findings do not mean that time with the company actually causes higher compensation (correlation does not equal causation); they only show that there is a relationship between the two variables and that the relationship tends to be positive in nature. To be correlated only means that two variables are related. You cannot say that one of them "causes" the other. Correlation tells you that as one variable changes, the other seems to change in a predictable way. If you want to show that one variable actually causes changes in another variable, then you need to use a different kind of statistical test. Additionally, you would probably want to know something about average pay for different months with the company. Your correlation analysis doesn't provide these types of statistics.


What is a good correlation coefficient?

When we take all of the data above and calculate a correlation coefficient, we end up with r-comp = +.95. If your correlation coefficient is a positive number, you can tell, just by looking at it, that there is a positive relationship between the two variables. Remember that a positive relationship means that as values on one variable increase (go up), the values on the other variable tend to increase (go up) in a predictable manner. In other words, the values of the two variables rise and fall together, in a predictable way.


[Figure: perfect_correlation.jpg — scatterplots of perfect negative and perfect positive correlations]

One of the most important things to remember about correlation coefficients is that they always fall between -1.00 and +1.00. That never changes. A correlation coefficient of -1.00 tells you that there is a perfect negative relationship between the two variables. This means that as values on one variable increase there is a perfectly predictable decrease in values on the other variable. In other words, as one variable goes up, the other goes in the opposite direction (it goes down). A correlation coefficient of +1.00 tells you that there is a perfect positive relationship between the two variables. This means that as values on one variable increase there is a perfectly predictable increase in values on the other variable. In other words, as one variable goes up so does the other. A correlation coefficient of 0.00 tells you that there is a zero correlation, or no relationship, between the two variables. In other words, as one variable changes (goes up or down) you can't really say anything about what happens to the other variable.

Of course, our understanding of the world is imperfect. Psychologists and scientists rarely end up with a perfect correlation coefficient. Most correlation coefficients tend to be somewhat lower than plus or minus 1.00. The closer a correlation coefficient is to 0.00, the weaker the relationship is and the less able you are to tell exactly what happens to one variable based on knowledge of the other variable. The closer a correlation coefficient approaches plus or minus 1.00, the stronger the relationship is and the more accurately you are able to predict what happens to one variable based on the knowledge you have of the other variable.

So, should you stick with this job or move on to greener pastures? Read on.

Squaring your Correlation: The Coefficient of Determination

Many students who are new to the concept of correlation coefficients make the mistake of thinking that a correlation coefficient is a percentage. They tend to think that when r = .90, it means that 90% of the changes in one variable are accounted for by the other variable. Even worse, some think that this means that any predictions you make will be 90% accurate. This is not correct! A correlation coefficient is a "ratio", not a percent.

However, it is very easy to translate the correlation coefficient into a percentage. All you have to do is "square the correlation coefficient", which means that you multiply it by itself. So, if the symbol for a correlation coefficient is r, then the symbol for this new statistic is simply r², which can be called r squared. There is a name for this new statistic: the coefficient of determination. The coefficient of determination tells you the percentage of the variability in one variable that is directly related to the variability in the other variable. Another way to think about this is that the coefficient of determination is the percentage of characteristics that the two variables have in common. These characteristics underlie the values of our two variables and can be discussed (hypothetically) in concrete terms. More on this in a minute.
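In code, the translation is a single squaring step. A minimal sketch with illustrative r values (not drawn from any real study):

```python
# Squaring a correlation coefficient turns the ratio into a shared-variance percentage
for r in (0.90, 0.35, 0.80):
    r_squared = r ** 2  # coefficient of determination
    print(f"r = {r:.2f} -> r squared = {r_squared:.2%}")
```

Note that r = .90 yields r² = 81%, not 90%: the correlation coefficient itself is never a percentage.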


[Figure: overlap_diagrams.png — Venn diagrams of overlapping variability between Variables A and B]

For now, let's visualize. Overlapping variability can be depicted nicely with Venn diagrams, as you see here.

In the first example, there is no overlap between the two variables. This means that there is no relationship andthat whatever we know about Variable A tells us nothing at all about Variable B.

In the second example, there is some overlap. The correlation coefficient is r = 0.35. If we square that to get the coefficient of determination (r²), we get 12.25%. This tells us that 12.25% of how a person scored on Variable B is directly related to how he/she scored on Variable A (and vice-versa). In other words, if we know a person's score on Variable A, we know about 12.25% of what there is to know about how they scored on Variable B! We could also say that the two variables have 12.25% of their underlying characteristics in common. That is quite an improvement over the first example. We could actually make an enhanced prediction based on our knowledge.

In the third example, there is even more overlap between the two variables. The correlation coefficient is r = 0.80. By squaring r we can see that fully 64% of the variation in scores on Variable B is directly related to scores on Variable A. Predictions based on one variable or the other would be even more accurate. That is, in the second example the variables shared only 12.25% of their variability, whereas in this third example they share 64%.

So what we know is that larger correlations allow us to make better predictions and provide us with more understanding about the characteristics that two variables have in common.

But you still don't know whether you should stay with your current employer or break away to start up that dating website focusing exclusively on people who are allergic to peanuts. Read on.

Steps of Hypothesis Testing

Thus far we've not really addressed the topic of statistical significance. We'll do that now. We'll be using the same procedures as before: state the null and alternative hypotheses, set the alpha level, compare computed and critical values, draw conclusions, and so forth. Let's walk through the steps of hypothesis testing for the hourly wage x months worked example.

1. H0: ρ = 0. We've introduced a new symbol. The symbol ρ is read as "rho." Rho is the correlation within a population. For correlation analyses, we're not expressing the null in terms of average differences. Instead, the null states that there is no relationship between the selected variables in the population. Stating this in words for our example: There is no relationship between the number of months that someone works for this company and that person's current hourly wage.

2. H1: ρ ≠ 0 or ρ > 0. Our alternative hypothesis probably would be a directional hypothesis (ρ > 0) in this example. When you set out to collect data from employees at the company, you would have anticipated that as people work more months, they are paid more. Stating the directional alternative in words: There is a significant positive relationship between the number of months that someone works for this company and that person's current hourly wage.


3. Set α = .05.

4. Reject H0 if |r-comp| ≥ r-crit (r.05 for df = N - 2). In this case, r.05 (df = 8) = .632. Notice a few things about our rejection statement. First, r-comp is expressed as an absolute value because we can have positive and negative coefficients in correlation analyses. Second, degrees of freedom for this type of correlation analysis always equals the number of pairs of scores minus 2, and for our example, we have 10 pairs of scores. Finally, our r-crit is .632, found by using our df and alpha level to find the critical correlation value in our Table of Critical r Values for Pearson Correlation. What we are essentially stating in this step is that if our computed correlation coefficient (positive or negative) is equal to or greater than .632, then we will reject our initial assumption that there is no relationship between our variables.

5. The fifth step is to perform the computation. In this chapter we won't actually be performing the Pearson's r calculation. Instead, I will simply give the computed value to you. For our example, r-comp = .95. As far as r-comp goes, that's pretty big. Our critical r value was .632. Because r-comp is greater than r-crit, we will be rejecting our null hypothesis and concluding that there is a significant, non-chance correlation between the two variables.

6a. A Pearson's r, bivariate correlation analysis was conducted to determine whether there is a relationship between the number of months that someone works for this company and that person's current hourly wage.

6b. There was sufficient evidence to reject the null hypothesis; r(8) = +.95, p < .05. If our |r-comp| value was less than r-crit, then we would have said, "There was insufficient evidence to reject the null hypothesis..."

6c. The statistically significant r coefficient of +.95 indicates a strong positive relationship between the variables in this study. Based on these results, we can reliably conclude that those who work more months for the company make more hourly income; conversely, those who have worked a shorter period for the company make less money. Though our research method does not allow us to conclude which of our variables causes the other, we can be highly confident that we can make accurate predictions about the value of one variable based on the value of the other.

6d. This step will focus on explaining the coefficient of determination. Recall that when we square r, the resulting value provides a percentage measure of the amount of variability shared by two variables. Also keep in mind that simply squaring r doesn't provide a lot of useful information. Your interpretation of r² is what brings more meaning to your conclusions. So, let's have a go at it. The coefficient of determination for this study is .90. This means that months of work and hourly wage share 90% of their variance. In other words, 90% of the characteristics underlying the measurements of the first variable are part of the second variable. Such characteristics might include the work-related skillset, productivity, consistency, understanding of company politics, level of responsibility, etc. In other words, as one works for a longer period, characteristics such as those listed would vary (most likely increase). Alternatively, we could say that when one is paid more, characteristics such as those listed would increase.
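The arithmetic behind the coefficient of determination is simply squaring; a quick sketch:

```python
# Squaring the computed correlation gives the coefficient of determination.
r = 0.95
r_squared = r ** 2          # 0.9025, i.e. about 90% shared variance
print(round(r_squared, 2))
```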

That's all for our steps of hypothesis testing. We've concluded that there is a strong correlation between the two variables. This means that you can be reasonably sure that if you continue with the company, your pay should continue to increase. We can't predict this unconditionally, because it's still possible that chance (though highly unlikely) accounts for these findings. Moreover, given that the most "months worked" was 50, we can't really draw any solid conclusions about pay for one who has been with the company 60 months, 70 months, and so forth.

It is looking more and more like staying with the company may be in your best financial interests. Still, I'm not yet comfortable telling you to strap in for the long haul and just get used to Nancy's strange habit of mixing pencil-sharpener shavings into her morning cup of Joe.

We've got one more topic to cover.

Regression

Now we are moving on up to the big time! You are going to learn how to do something that is very remarkable—you are about to learn how to predict the future without having to call the Psychic Friends Network. What’s more—your predictions are likely to be much more accurate!

The Regression Line


[Figure: scatterplot of Time With Company vs. Hourly Pay with a regression line.]

After you have found a statistically significant correlation coefficient, there is one more thing that you can do, and it is one of the coolest things in statistics: you can make predictions about one variable based on your knowledge of the other variable. The stronger the relationship is between the two variables (larger correlation coefficients, which also mean larger coefficients of determination), the more accurate any predictions you make are likely to be.

Notice that, in addition to being a scatterplot showing the relationship between Time With Company and Hourly Pay, you now see a line drawn through the middle of the group of dots. This line, called the regression line (you will learn more about this later), was made possible by our friend the correlation coefficient.

The regression line is a kind of "moving average" that is drawn through the balancing point between the dots at each point on your X-axis. This line is the one and only line that could be drawn in a manner so that the cumulative difference between it and every individual dot is the smallest. There is a formula for calculating the position and slope of the regression line, but we will not be addressing that here.
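Though the chapter skips the hand calculation, the line's position and slope can be found with an ordinary least-squares fit. Here is a hedged sketch using NumPy on hypothetical data (not the chapter's actual scores):

```python
# Fit the least-squares regression line y = slope*x + intercept.
# The (months, wage) pairs are hypothetical stand-ins.
import numpy as np

months = np.array([2, 5, 9, 14, 18, 20, 27, 35, 42, 50])
wage = np.array([15, 17, 18, 21, 24, 26, 29, 33, 36, 40])

slope, intercept = np.polyfit(months, wage, 1)  # degree-1 polynomial = a line
print(round(slope, 2), round(intercept, 2))
```

One property worth noticing: the fitted line always passes through the point (X-bar, Y-bar), the "balancing point" described above.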

Predicting one score from another

If you are trying to predict the compensation of an employee who has worked for the company for, say, 20 months, then your best single guess is the average compensation paid to people who have worked for 20 months with the company. However, we have only one person in our sample who has worked for 20 months. You certainly can't pull an average out of that. What you can see is that this particular employee is paid about $26/hour. So, if you knew that an employee had worked for the company for 20 months, and knew nothing else about the employee, your best guess about the compensation that employee receives is around 26 dollars per hour. In this graph, these two pieces of data appear to intersect right on the regression line. Seems simple enough.

Well, we probably need to be a little more precise. The following formula is used for predicting one individual score based on another:

Y' = b(X - X-bar) + Y-bar

where Y' is the predicted Y value, b = slope, X is the raw score used to predict a corresponding Y value, and X-bar and Y-bar represent the averages for the X and Y variables.

For our ongoing example, the average number of months worked for the company is 26. Let's call this variable our X variable. The average hourly pay is $28.70. Let's call this variable our Y variable. A calculation of b yields a value of .63. Now, let's pick a single X value for our prediction. How about a value that isn't even in our data set? Let's predict the hourly pay based on 60 months worked for the company. So, X = 60. We've got everything we need to perform our calculation and maybe reach some additional conclusions about whether you want to continue to work in that sweat shop. Let's walk through the calculation.

Y' = .63 (60 - 26) + 28.7

Y' = .63 (34) + 28.7

Y' = 21.42 + 28.7


Y' = $50.12

So, our best prediction is that if you stay with the company for 60 months, you can expect to earn somewhere around $50.12 per hour. There are other factors, such as company policy, the economy, etc., that can make this a bad prediction. However, this is our best guess. The pay doesn't sound too bad. I say go for it! Swallow all of that rage you have for your sadistic boss and continue in your 9-to-5 life of indentured servitude.
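The walk-through above can be sketched as a tiny function, using the chapter's values (b = .63, X-bar = 26, Y-bar = 28.7); the function name is my own:

```python
# Prediction formula: Y' = b(X - X_bar) + Y_bar, with the chapter's values.
def predict_wage(x, b=0.63, x_bar=26.0, y_bar=28.7):
    """Predict hourly wage from months worked via the regression equation."""
    return b * (x - x_bar) + y_bar

print(round(predict_wage(60), 2))  # .63 * 34 + 28.7 = 50.12
```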

If you look at the chart above and imagine that regression line extending out to the right a little farther, it looks like 60 months of work and $50.12 per hour will intersect right on the line. And actually, that's the way it always is. Our best prediction is always where two variables intersect on the regression line. Recall that the regression line bisects the points on the scatterplot, such that the distances between the points and the line are minimized. In other words, the line is the best frame of reference for minimal error. And when error is minimized, we are making our best predictions. That's why our best predictions always fall somewhere on the line.

Check out this Khan Academy video for a great explanation using Excel!

Fitting a line to data | Regression | Probability and Statistics | Khan Academy

Of course, this is no guarantee that our prediction will be perfect. Let's talk about error and how confident we canbe when making our predictions.

Error & Confidence Intervals

[Figure: regression error illustration.]

As we already know, the larger our sample size, the closer our sample distribution comes to a standard normal curve (68-95-99). Let's suppose that instead of collecting Hourly Pay and Time With Company from just 10 individuals, we collect such data from 5,000 people at this large company. If we were to plot all of those data in a single scatterplot, we would have a lot of dots.


Not only would we have a lot of dots across our entire scatterplot, but we would also have quite a few Hourly Pay scores for each score on the x-axis (number of months with the company).

Because every month on the x-axis will have a distribution of scores on the y-axis, we know that we will have central tendency and spread. We also know that whenever we have a large distribution of scores, that group of scores will likely form a normal curve. Based on our normal curve, we assume that for a given number of months with the company, approximately 68% of those polled would report wages within ±1 standard deviation unit of the group average, or in this case, one standard deviation of the regression line. Recall that our best guess for a group average is on our regression line. We would also assume that approximately 95% of hourly pay scores for a given month would fall within ±2 standard deviation units of the regression line, and that approximately 99% of hourly pay scores for a given month would be within ±3 standard deviation units of the regression line. We would make these assumptions for every y distribution for a given x value and every x distribution for a given y value.

Because we can make certain assumptions about the distribution of one variable for each score on the other variable, we are able to add a little more information to the prediction made in the previous section. What more can we add? We can add a confidence interval around our predicted value. Confidence intervals for the regression line are conceptually very similar to the confidence intervals that we calculated for sample means in a previous chapter.

Suppose that we wanted to know the 95% confidence interval for the hourly pay that corresponds on the regression line with 20 months worked at the company. In fact, this predicted wage is $24.90. In other words, we would predict that if someone worked at this company for 20 months, she would be making $24.90 per hour. Now remember, this value represents the predicted average for a range of wages at 20 months. So, in addition to predicting hourly wage, we could also find a confidence interval for that wage. We won't do the math. However, the confidence interval around a y value of $24.90 is $22.67 to $27.12. In other words, we can be 95% confident that the population/true hourly wage for 20 months worked is somewhere between $22.67 and $27.12. That's a pretty tight confidence interval. This should give you more confidence in using the relationship between hourly pay and time with company to think about some of your career decisions.
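The idea can be sketched with a simplified interval. Note the assumptions: this uses the rough standard error of estimate s_y * sqrt(1 - r²), a hypothetical standard deviation for hourly pay (s_y = 4.0), and the t cutoff for df = 8. It will not reproduce the chapter's exact $22.67 to $27.12 interval, because the full formula also accounts for sample size and distance from the mean of X:

```python
# Hedged sketch: an approximate 95% interval around a predicted value using the
# simplified standard error of estimate, s_est = s_y * sqrt(1 - r^2).
# s_y = 4.0 is a hypothetical standard deviation for hourly pay; 2.306 is the
# two-tailed t cutoff for df = 8.
import math

def approx_interval(y_pred, s_y, r, t_crit):
    s_est = s_y * math.sqrt(1 - r ** 2)   # standard error of estimate
    margin = t_crit * s_est
    return (y_pred - margin, y_pred + margin)

low, high = approx_interval(24.90, s_y=4.0, r=0.95, t_crit=2.306)
print(round(low, 2), round(high, 2))
```

The main takeaway survives the simplification: the stronger the correlation, the smaller sqrt(1 - r²), and the tighter the interval around your prediction.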

Create your own regression line

It might help you to understand the regression line a little better if you were able to create your own. Play around with the interactive regression simulation below. Click to add points to the graph. Notice how the location of the points relates to the location and orientation of the regression line and the strength/direction of your correlation. Think about how all of these factors affect your ability to accurately predict one score from another.


Assumptions For Correlation Analyses (L.I.N.E.S.)

You must always keep in mind that in order to make the most of your claims based on a Pearson's r analysis, you must do your best to meet the assumptions for running a valid correlation analysis. Remember that assumptions comprise a checklist of characteristics that we require of our data. If we can check off everything on the list, then our analysis and conclusions will be viewed as having more validity and reliability.

1. LINEAR: The relationship, if there is any, between the two variables is best characterized by a straight line. This is called a linear relationship. The best way to check this is to plot the variables on a scatterplot and see if there is a clear trend from lower left to upper right (a positive relationship) or from the upper left to the lower right (a negative relationship). If the relationship seems to change directions somewhere in the scatterplot, this means that you do not have a linear relationship. Instead, it would be curvilinear, and Pearson's r is not the best type of correlation coefficient to use. The correlation coefficient will be weaker when the data are curvilinear, but you will not be able to tell just from the r value. That's why it's important always to construct a scatterplot.

2. INDEPENDENCE: The pairs of scores for each participant are independent of all other pairs of scores.

3. NORMAL: The traits you are measuring are normally distributed in the population. In other words, even though the data in your sample may not be perfectly normally distributed (if you plot them in a histogram they do not form a bell-shaped curve), you are pretty sure that if you could collect data from the entire population the results would be normally distributed.

4. EQUAL VARIANCE: The data can be described as having homoscedasticity, which means that the spread of scores on the Y variable is roughly equal across each value of the X variable. The illustration is probably more easily understood than a verbal description. Again, one of the easiest ways to assess homoscedasticity is to plot the variables on a scatterplot and make sure the "spread" of the dots is approximately equal along the entire length of the distribution. In this illustration, the scatterplot on the left is homoscedastic, and the scatterplot on the right is heteroscedastic. Heteroscedasticity is a bad thing and will result in a weaker correlation.

5. SCALE: Your data on both variables are measured on either an interval or ratio scale. Recall that interval scales have equal intervals between points on your scale but they do not have a true zero point. Ratio scales have both equal intervals between points on their scale and they do have a true zero point.
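As a quick illustration of the LINEAR assumption above, here is a sketch (assuming SciPy, with made-up data) of how a perfectly curvilinear relationship can produce a Pearson's r near zero, something only a scatterplot would catch:

```python
# A perfect U-shaped (curvilinear) relationship yields a Pearson's r of ~0,
# which only a scatterplot would reveal.
from scipy import stats

x = list(range(-5, 6))       # -5 through 5
y = [xi ** 2 for xi in x]    # y is completely determined by x, but not linearly

r, p = stats.pearsonr(x, y)
print(round(r, 3))  # about 0, despite the perfect relationship
```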

Are There Other Types of Correlation Analyses?

In fact, there are other types. If your data do not meet the assumptions for a Pearson's r analysis, you are not out of luck. There are other tools available to you for assessing whether or not relationships exist between two variables. These tools have different assumptions. In any event, you should be aware that they exist and know that they are there to help you in those situations where you cannot use Pearson's r.

Please note that the term "dichotomous" is very similar to "binomial." Both terms indicate a nominal-level variable that has two categories (e.g., yes/no, male/female, above/below the poverty line).

Spearman’s Rank Order Correlation Coefficient (rs)

When your data meet all of the assumptions for Pearson's r except that one or both of the variables are measured on an ordinal scale rather than on an interval or ratio scale, you can use Spearman's rank order correlation coefficient. For example, if we correlated place in a foot race (1st, 2nd, 3rd, etc.) with scores on an achievement-motivation test, we would be assessing a relationship between an ordinal-level variable and an interval-level variable.

Sometimes this coefficient is referred to as Spearman’s rho. The symbol for it is rs.
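The foot-race example can be sketched with SciPy; the places and motivation scores below are hypothetical:

```python
# Spearman's rho relates ranked data; hypothetical race places vs. motivation.
from scipy import stats

place = [1, 2, 3, 4, 5, 6]             # ordinal: finishing position
motivation = [98, 94, 85, 80, 72, 60]  # interval: achievement-motivation scores

rho, p = stats.spearmanr(place, motivation)
print(round(rho, 2))  # -1.0 here: motivation falls steadily as place number rises
```

Note the sign: because 1st place is the *smallest* number, faster runners with higher motivation produce a negative rho.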

Biserial Correlation Coefficient (rbis)

Sometimes your data meet all of the assumptions for Pearson's r except that one of the variables has been "artificially" converted to a dichotomous variable. For example, you might take a multiple choice test item with four possible alternatives and "recode" it so that a person has either answered it correctly (1) or incorrectly (0). You have taken a variable that could conceivably be considered an interval scale with four points and reduced it to a scale with only two points. Another example would occur if you took the variable "total income" and recoded it into two groups such as "below the poverty line" (0) and "above the poverty line" (1). In this case, you have taken a ratio-level variable and converted it into a dichotomous variable which has only two possible values.

If you correlated above/below the poverty line with IQ, you would be assessing the relationship between an artificially dichotomous variable and an interval-level variable. In cases such as those just described, you would use a correlation coefficient called the biserial correlation coefficient. The symbol for this correlation coefficient is rbis.

Point Biserial Correlation Coefficient (rpbis)


The point biserial correlation coefficient is very similar to the biserial correlation coefficient. The big difference is related to the "dichotomous" variable. Whereas in the biserial correlation coefficient one variable is continuous and the other would be continuous had it not been artificially made into a dichotomous variable, in the point biserial correlation coefficient the dichotomous variable is "naturally" dichotomous. For example, whether you smoke or don't smoke is a dichotomous distinction. Either you do or you don't. There are only two possibilities.

When you have one variable that is continuous on an interval or ratio scale and the other is naturally dichotomous, the point biserial correlation coefficient is the best choice to use when it comes to measuring whether or not a relationship exists between the variables. For example, if we wanted to correlate gender (male/female) with verbal comprehension scores on a test, we would be assessing the relationship between a naturally occurring, dichotomous variable (gender) and an interval-level variable (verbal comprehension scores).

The symbol for this correlation coefficient is rpbis.
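SciPy provides this coefficient directly; here is a sketch using the smoking example, with hypothetical data:

```python
# Point-biserial r between naturally dichotomous smoking status (0/1) and a
# continuous score; the data are hypothetical.
from scipy import stats

smoker = [0, 0, 0, 0, 1, 1, 1, 1]
fitness = [82, 78, 90, 85, 65, 70, 60, 68]  # hypothetical fitness scores

r_pb, p = stats.pointbiserialr(smoker, fitness)
print(round(r_pb, 2))  # strongly negative: smokers score lower in this sample
```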

All Correlation Coefficients Are Interpreted In the Same Way

Whatever correlation coefficient you use, it is interpreted in generally the same way. The value of the correlation coefficient must be between -1.00 and +1.00; larger correlation coefficients (farther from 0) mean stronger relationships; squaring the correlation coefficient tells you the amount of variation in one variable that is accounted for by the other variable.

Self Test (Under Construction)

Self-test for chapter | Answers to self-test

Some content adapted from others' work. See home page for specifics.

LAST UPDATED: 2015­12­04 11:55 AM
