ASSOCIATION BETWEEN VARIABLES: SCATTERGRAMS Topic #11

Upload: derick-butler

Post on 17-Dec-2015



Like Father, Like Son

Though it is not especially relevant to political science, suppose we want to research the following bivariate hypothesis that involves two interval and continuous variables.

FATHER’S HEIGHT (actual height in inches) ===(+)===> ADULT SON’S HEIGHT (actual height in inches)

[unit of analysis: father-adult son pairs]

Like Father, Like Son (cont.)

We select a random sample of n = 1078 pairs of fathers and their adult sons and collect the relevant data. The first five cases and the last case appear below. Note that the unit of analysis is father-adult son pairs and that observed values have been very precisely measured and recorded, so each case probably has a unique recorded value on each variable.

Pair ID    Father’s Height (inches)    Son’s Height (inches)
1          66.67                       68.42
2          69.83                       70.32
3          65.19                       69.76
4          65.15                       73.85
5          64.66                       70.17
. . .
1078       62.31                       62.09

Like Father, Like Son (cont.)

We cannot straightforwardly crosstabulate these variables, because the variables are continuous, each having an infinite number of possible values, so each case would be in a unique row and column.

So what should we do? One possibility is to create class intervals for both variables (as discussed in Topic #5 on histograms) — in effect, to turn them into discrete variables and then proceed as before.

For example, we might create class intervals for both variables with this recoding scheme:

Short     less than 65 inches
Medium    65-70 inches
Tall      greater than 70 inches

Pair ID    FH               SH
1          66.67  Med       68.42  Med
2          69.83  Med       70.32  Tall
3          65.19  Med       69.76  Med
4          65.15  Med       73.85  Tall
5          64.66  Short     70.17  Tall

Like Father, Like Son (cont.)

• We then can set up a crosstabulation worksheet and begin to tally cases as shown above.

• If the hypothesis is true (and if on average fathers and sons are about the same height), we would expect most cases to fall in the main diagonal, i.e., in the S&S, M&M, and T&T cells.
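The recoding and tallying steps can be sketched in a few lines of Python. This is only an illustration and not part of the original handout: the function name height_class is hypothetical, and counting a height of exactly 65 or 70 inches as Medium is an assumption read off the recoding scheme given earlier.

```python
from collections import Counter

# First five father-son pairs from the data listing (heights in inches).
pairs = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
]


def height_class(inches):
    """Recode a height as S (< 65), M (65-70), or T (> 70).

    Assumption: the boundary values 65 and 70 are treated as Medium.
    """
    if inches < 65:
        return "S"
    if inches <= 70:
        return "M"
    return "T"


# Tally each pair into its (father's class & son's class) cell.
cells = Counter(
    f"{height_class(father)}&{height_class(son)}" for father, son in pairs
)

for cell, count in sorted(cells.items()):
    print(cell, count)
```

With only five cases the tally says little, but the same loop applied to all 1078 pairs would fill in the full 3x3 worksheet.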

Like Father, Like Son (cont.)

But note that this approach is not very satisfactory.

Pair ID F’s Height S’s Height Cell

1 66.67 68.42 M&M

2 69.83 70.32 M&T

3 65.19 69.76 M&M

4 65.15 73.85 M&T

5 64.66 70.17 S&T

Like Father, Like Son (cont.)

• Creating class intervals when you have gone to the trouble of measuring continuous variables quite precisely entails throwing away valuable information that bears on the hypothesis of interest.
  – This problem can be mitigated by creating more refined class intervals.
  – But what we really should do is have “infinitely” refined class intervals that match the very precise information we have collected.

• A very nice analytical device called a scattergram (or scatterdiagram or scatterplot), similar in its basic logic to a crosstabulation, allows us to do just this.

Like Father, Like Son (cont.)

• First, we need to set up a scattergram template or worksheet, which is similar in logic to that for a crosstabulation but reflects the continuous character of both variables.

• Figure 1 shows the general template for such a scattergram.

• We draw a horizontal interval scale [just as in a histogram] representing values of the independent variable (corresponding to the column variable in a crosstab).
  – This scale should be appropriately labeled and calibrated to encompass the full range of observed values found in the data (but it needn’t, and probably shouldn’t, be much wider than this).

• We then erect a vertical interval scale that similarly represents values of the dependent variable (corresponding to the row variable in a crosstab).

Like Father, Like Son (cont.)

• Just as cases are placed in cells of a crosstabulation defined by the intersection of the row and column corresponding to the particular combination of (discrete) variable values that characterizes the case, each case in a scattergram is plotted at the point defined by the intersection of the horizontal and vertical lines corresponding to the particular combination of (continuous) variable values that characterizes the case.
  – In a sense, each case falls (almost always) in its own unique and tiny cell.

• Figure 2 shows the scattergram worksheet and the plotted points for each of the five father-son pairs listed in the previous slides.
  – To facilitate locating each point, I have put a (1" x 1") grid over the scattergram.

Some Guidelines for Creating Scattergrams

• Graph paper is very useful for making hand-drawn scattergrams, as you will do in Problem Set #11.

• Draw the interval measurement scales on the left and bottom margins of the scattergram.

• The end points of each scale must accommodate the maximum and minimum (or range of) values observed in the data for each variable.

• But (as a general rule) the end points should not be much more extreme than these maximums and minimums.
  – In other words, when the data has been plotted, there should not be a lot of unnecessary “white space” in the finished (presentation grade) scattergram.

• Scale the axes so that the scattergram is either approximately square or somewhat wider than tall (SPSS does the latter by default).
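The guideline about scale end points can be sketched as a small calculation. This is an illustration only: the helper name axis_limits and the choice to round outward to whole inches are assumptions, and the data are just the six pairs listed in the handout, not Pearson's full sample.

```python
import math


def axis_limits(values, step=1.0):
    """Return (low, high) end points that cover the observed range,
    rounded outward to the nearest multiple of `step`."""
    low = math.floor(min(values) / step) * step
    high = math.ceil(max(values) / step) * step
    return low, high


# The six cases listed in the handout (pairs 1-5 and pair 1078), in inches.
fathers = [66.67, 69.83, 65.19, 65.15, 64.66, 62.31]
sons = [68.42, 70.32, 69.76, 73.85, 70.17, 62.09]

x_low, x_high = axis_limits(fathers)  # horizontal scale: independent variable
y_low, y_high = axis_limits(sons)     # vertical scale: dependent variable

print("horizontal scale:", x_low, "to", x_high)
print("vertical scale:  ", y_low, "to", y_high)
```

Rounding outward by at most one step keeps every point on the worksheet while avoiding unnecessary white space around the data.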

Like Father, Like Son (cont.)

• A statistician named Karl Pearson actually conducted such a study of father-son pairs over 100 years ago in England.

• Having collected height data on 1078 father-son pairs, Pearson realized that a list of 1078 pairs of numbers would be impossible to grasp as raw data and that a crosstabulation using class intervals would have the problems discussed previously.

• Pearson therefore developed the scattergram, an alternate analytic device appropriate for analyzing association in such data.

• Figure 3 shows the scattergram of Pearson’s data.
  – This scattergram is taken from David Freedman et al., Statistics, p. 110.

Questions Pertaining to the Pearson Scattergram

• What is the significance of the 45° line in the scattergram?
• On average, are the two generations about the same height? If not, which generation is taller on average?
• What is the approximate average height of the sons? Of the fathers?
• What is the significance of the two vertical dotted lines?
• What is the average height of all sons whose fathers are about 72 inches tall?
• How does the average height of sons vary with the height of their fathers?
• Is there an association between the variables?
• Is it positive or negative?
• How strong is the association between the two variables?

Questions Pertaining to the Pearson Scattergram (cont.)

• Points that lie (approximately) on the 45° line are cases in which it is (approximately) true that FH = SH.

• Also note that a clear majority (maybe 60%) of points lie “northwest” of the 45° line, indicating that most sons are taller than their fathers.

• Move a vertical line left and right until it appears that half of the points lie on either side. This line corresponds to the median height of the fathers; their mean height is about the same.

• There is a distinct “football-shaped” cloud of points running from “southwest” (Short-Short) to “northeast” (Tall-Tall), indicating a positive association.
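The first two readings of the scattergram (points relative to the 45° line, and the median height of fathers) can be sketched numerically. This is an illustration only, using just the six pairs listed in the handout rather than Pearson's 1078 pairs, so the numbers it produces are not estimates for the full sample; the variable names are hypothetical.

```python
from statistics import median

# The six cases listed in the handout (father's height, son's height), in inches.
pairs = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]

# A point lies "northwest" of the 45-degree line when SH > FH,
# i.e. when the son is taller than his father.
sons_taller = sum(1 for father, son in pairs if son > father)
share_above = sons_taller / len(pairs)

# The vertical line with half of the points on either side
# sits at the median of the fathers' heights.
median_father = median([father for father, _ in pairs])

print(f"{sons_taller} of {len(pairs)} points lie above the 45-degree line")
print(f"median father's height: {median_father:.2f} inches")
```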

Questions Pertaining to the Pearson Scattergram (cont.)

• The points inside the two vertical dotted lines represent all cases in which the father is about 72 inches tall.
  – On average these 72-inch fathers have sons who are shorter than their fathers, though at the same time taller than the average of all sons.

• We can draw other vertical strips and find the average height of the sons of these fathers.

• The line of averages indicates that the average height of sons increases with fathers’ height (positive association).

• But there is still a lot of dispersion in sons’ height within each vertical strip, so the association is far from perfect.
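The vertical-strip idea can be sketched as follows. This is an illustration only: the function name strip_mean and the half-inch strip width are assumptions, and the data are just the six pairs listed in the handout, so the strip around 72 inches discussed above happens to be empty here.

```python
from statistics import mean

# The six cases listed in the handout (father's height, son's height), in inches.
pairs = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]


def strip_mean(pairs, center, half_width=0.5):
    """Average son's height among pairs whose father's height lies within
    `half_width` inches of `center`; None if the strip contains no cases."""
    sons = [son for father, son in pairs if abs(father - center) <= half_width]
    return mean(sons) if sons else None


# Average son's height in a few vertical strips.
for center in (65, 67, 70, 72):
    print(center, strip_mean(pairs, center))
```

Plotting these strip averages against the strip centers for the full data set would trace out the line of averages.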

Association Visualized

• The following figures show other scattergrams for small hypothetical data sets.
  – These were produced by the Statistical “Applet” on Correlation and Regression Demo available at the course web site.
  – These scattergrams correspond directly to Tables 1B-1E near the beginning of the previous handout on crosstabulations, in that they show differing degrees of association running from high positive through zero to moderate negative.
  – The (Pearson) correlation coefficient is the standard measure of association between two interval variables.
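For readers who want to see the correlation coefficient computed, here is a minimal sketch. The handout itself does not give the formula; the code uses the standard definition (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations). The function name pearson_r and the small data sets are hypothetical and chosen only to show perfect positive, zero, and perfect negative association.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # points on a rising line
r_zero = pearson_r(x, [1, 2, 3, 2, 1])        # no linear trend
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling line

print(r_positive, r_zero, r_negative)
```

The coefficient ranges from -1 to +1, with values near zero indicating little or no linear association.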

Crosstabs vs. Scattergram

• It may at first blush seem puzzling that
  – in a standard crosstabulation a concentration of cases running from “northwest” to “southeast” (the main diagonal) reflects a positive association, while
  – in a scattergram the same pattern reflects a negative association (and vice versa).

Crosstabs vs. Scattergram (cont.)

• This reflects only a cosmetic difference in setting things up:
  – in a crosstabulation, the values of the row (vertical) variable are usually placed in descending order (lowest value on top, highest value at the bottom), while
  – in a scattergram the values of the vertical (row) variable are invariably (and more sensibly) placed in ascending order (lowest value at the bottom, highest value on top).

Crosstabs vs. Scattergram (cont.)

• But remember that scattergrams and crosstabulations are essentially similar devices.
• We can show how directly they are logically connected, and also
  – how the former is much more informative than the latter.
• Suppose that, having constructed Pearson’s scattergram, we want also to construct a full crosstabulation of the data using the short-medium-tall class intervals we previously worked with.

Crosstabs vs. Scattergram (cont.)

This can be accomplished simply by
• superimposing the appropriate 3x3 grid (table) on the scattergram, and then
• counting the number of plotted points in each resulting cell.

Test 1: Blue Book Score by MC Score

Here is some more topical (and politically relevant) bivariate data

Data as of 11/05/08

Obama Vote in 2008 By Kerry Vote in 2004

• A scattergram of more topical interest is shown on the following slide.

• Since there are only 50 cases (states), the plotted points can be individually labeled.
  – What is the significance of the four quadrants of the scattergram?
  – What is the significance of the blue diagonal line?
  – What is the significance of the green diagonal line?

Note: DC not included in scattergram

SPSS Scattergrams

• SPSS can readily create scattergrams.
  – You can use SPSS (or Excel) for PS #11 if you wish (see Note 5).

• For example, let’s open the PRESIDENTIAL ELECTION data file, giving state-by-state vote totals for each Presidential candidate in each election, and then
  – find the variables dem2000, rep2000, dem2004, and rep2004;
  – compute the DEMOCRATIC PERCENT OF THE TWO-PARTY VOTE in each election,
    • by clicking on Transform => Compute and entering this expression in the Compute Variables dialog box:
      – d2pc2000 = 100 * dem2000/(dem2000 + rep2000)
    • and likewise for d2pc2004; and then
  – produce the following scattergram by clicking on Graphs => Scatter... => Simple/Define and then in the Simple Scatterplot dialog box put d2pc2004 on the Y Axis and d2pc2000 on the X Axis.
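The Democratic percent of the two-party vote is simply 100 times the Democratic vote divided by the sum of the Democratic and Republican votes in the same election. Outside SPSS, the same computation might be sketched in Python as follows; the function name two_party_pct and the vote totals are hypothetical and are not taken from the PRESIDENTIAL ELECTION data file.

```python
def two_party_pct(dem_votes, rep_votes):
    """Democratic percent of the two-party vote in one election."""
    return 100 * dem_votes / (dem_votes + rep_votes)


# Hypothetical vote totals for a single state, for illustration only.
dem2000, rep2000 = 450_000, 550_000
dem2004, rep2004 = 480_000, 520_000

# Each percentage uses the Democratic and Republican totals
# from the same election year.
d2pc2000 = two_party_pct(dem2000, rep2000)
d2pc2004 = two_party_pct(dem2004, rep2004)

print(d2pc2000, d2pc2004)
```

In the scattergram each state would then be plotted at the point (d2pc2000, d2pc2004).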

Histogram of Scores on Test 2

Scattergram of P2 by MC2

Scattergram of Score2 by Score 1

Correlation Matrix