1 types of data. 2 data as we get started in this chapter say as a research project we want to learn...


Types of Data

Data

As we get started in this chapter, say that as a research project we want to learn more about faculty at WSC. Say we gather information from faculty about 1) their highest educational degree, 2) how often they cuss during the day, and 3) how long they have been in Wayne. The data collected for a particular study are referred to as a data set. Some data collected from faculty might look like this (note that each row holds the measurements on one element and each column is a variable):

Faculty    Degree   Cuss   In Wayne
Person 1   PhD      0      22
Person 2   EdD      3      35
Person 3   MFA      0      15
Person 4   PhD      237    16

Elements

Any data set provides information about some group of individual elements. In my faculty example, faculty are the elements. In other studies the elements can be people, states, organizations, objects, and many other things.


What is a variable?

Each element in a data set may have one or more characteristics of interest. Each characteristic is called a variable.

For any variable in a study, each element has to be assigned a value. So each element has a “measurement” taken and the value is assigned.

For the most part, in our class the measurements have already taken place.

We tend to look at variables on subjects or elements in which we are interested. Each element has a value on each variable.


Qualitative or Categorical variable

The variable Degree in our example is a categorical variable. The data, or observed values, from the people on the variable just yield a categorical response. In my example I have things like PhD, MFA, and EdD.

Note that sometimes in a data set numbers may be used to express the values on the variable, but all we really have are categories of responses. For example, we could have 1 = EdD, 2 = PhD, 3 = MFA, and in the data set all you would see are the numbers. But the numbers really just represent different categories.

Check out the difference between nominal and ordinal data.


Quantitative or Numerical variable

In our example the variables how often they cuss during the day and how long they have been in Wayne are numerical variables. The data, or observed values, from the people on the variables yield a numerical response.


Describing Categorical Data

Here we study ways of describing a variable that is categorical or qualitative.


Say 50 people you know purchased a soft drink from a machine recently. A variable of interest might be the BRAND PURCHASED. Say the brands are made up of the 5 soft drinks Coke Classic, Diet Coke, Dr. Pepper, Pepsi-Cola, and Sprite (of course there are more varieties of soft drinks, but this is an illustrative example.)

Here each specific brand represents a different value on the variable brand purchased. Each specific brand represents a nonoverlapping class – each specific class represents a mutually exclusive category.

Here the variable brand purchased is a categorical or qualitative variable - values of the variable represent categories.

One thing that makes sense to do is ask each of the 50 people what they purchased. Then we could count the number of people who purchased Coke Classic and the others. The total number of people of the 50 who purchased Coke Classic would be the frequency.

Soft Drink     Frequency   Relative Frequency   Percent Frequency
Coke Classic   19          0.38                 38
Diet Coke      8           0.16                 16
Dr. Pepper     5           0.10                 10
Pepsi-Cola     13          0.26                 26
Sprite         5           0.10                 10
Total          50          1.00                 100


The first two columns on the previous screen, the Soft Drink and Frequency columns, make up what is called a frequency distribution. It is a tabular summary of data showing the number, or frequency, of items in each of several nonoverlapping classes.

The third column shows the relative frequency. We need the second column to create the third. To get the relative frequency in each row, take the frequency in that row and divide by the total frequency.

The fourth column shows the percent frequency. The fourth column equals the third column multiplied by 100.
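The three columns can be computed in a few lines. The following is a minimal sketch in Python, not something from the slides: the list of purchases is rebuilt from the frequencies in the soft drink table, and the function name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Rebuild the 50 purchases from the frequencies in the soft drink table.
purchases = (
    ["Coke Classic"] * 19
    + ["Diet Coke"] * 8
    + ["Dr. Pepper"] * 5
    + ["Pepsi-Cola"] * 13
    + ["Sprite"] * 5
)


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percent frequency)}."""
    counts = Counter(values)  # frequency: how many times each category occurs
    n = len(values)           # total frequency (number of observations)
    return {
        category: (freq, freq / n, 100 * freq / n)
        for category, freq in counts.items()
    }


table = frequency_table(purchases)
for brand, (freq, rel, pct) in table.items():
    print(f"{brand:<13} {freq:>3} {rel:>5.2f} {pct:>4.0f}")
```

Each row printed should line up with the matching row of the table: the frequency, then the frequency divided by 50, then that share multiplied by 100.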


Do you know why we put information in columns? Because then we can call’um as we see’um. Sorry:)

So, the frequency, relative frequency and percent frequency distributions are different ways of summarizing information about a categorical variable.

Notes about our table.

1) The total, or sum, of the frequency column is equal to the number of observations, sometimes called n, in general.

2) The total, or sum, of the relative frequency column is equal to 1.

3) The total, or sum, of the percent frequency column is equal to 100 (sometimes it may be a little off due to rounding of decimal places).


In our example here we had 50 people and we asked what soft drink they purchased. Studies occur that have thousands of people and they are asked several questions. Using a computer can help in the counting of responses.

Bar Graphs

Bar graphs just put the frequency, relative frequency and percent frequency distributions into visual form. The form is a graph with certain properties.

The horizontal axis does not have numbers on it and the axis represents the categories. In our soft drink example we would put each brand in a different location on the axis.


Imagine you have a piece of construction paper that is red. Do you remember way back when in school you would cut strips of paper and then curl the paper with the scissors? Well, we will not need to curl the paper here!

I mention this silly example because I want you to think about cutting strips that are one inch wide. The height of each strip would then represent the frequency, relative frequency or percent frequency on the variable. You would tape each strip onto the graph above each category. (You could also put the bars sideways.)

So the vertical axis, or height, in the bar graphs is either the frequency, relative frequency or percent frequency distributions.

In constructing the bar graph on a qualitative variable, a space is left between each bar to help us remember we have a qualitative variable.
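The paper-strip idea can be tried out without any plotting software. The sketch below is only an illustration under my own assumptions: it draws the bars sideways as text, uses the percent frequencies from the soft drink table, and the function name `text_bars` and the choice of one `#` per 2 percentage points are made up here.

```python
# Percent frequencies from the soft drink table.
percent = {
    "Coke Classic": 38,
    "Diet Coke": 16,
    "Dr. Pepper": 10,
    "Pepsi-Cola": 26,
    "Sprite": 10,
}


def text_bars(percent_by_category, scale=2):
    """Build one sideways bar per category; one '#' stands for `scale` points."""
    lines = []
    for category, pct in percent_by_category.items():
        bar = "#" * round(pct / scale)  # bar length plays the role of height
        lines.append(f"{category:<13} {bar} {pct}%")
    return lines


for line in text_bars(percent):
    print(line)
    print()  # blank line: the space left between bars for a qualitative variable
```

The longest bar belongs to the category with the highest percent frequency, just as the tallest paper strip would.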

[Figure: bar graph with Percent on the vertical axis (0 to 50) and the categories Rural, Suburban, and Urban on the horizontal axis.]

This is an example of what a percent frequency graph would look like. The variable is “what is the type of area in which you live” and the height of each bar is the percent frequency. (See how each bar is like a cut out from a piece of paper?)


Pie Charts

Say we order a pizza pie and it is cut up into pieces. Below I show a pizza pie cut into slices that meet in the middle. If you get a quarter of the pie, you get one of the sections shown. 0.25 of the pie is an example of the relative frequency. So, pie charts show each category getting its relative share of the pie.

A pie chart could really be the frequency, relative frequency or percent frequency pie, but the size of each piece of the pie is always the relative frequency.


Remember that a circle has 360 degrees. A way to think about this is if you go from “12 o’clock” on the pie to “3 o’clock” you have gone 90 degrees. A way to construct a pie chart is to give each category its relative frequency times the 360 degrees. From the earlier example Coke Classic had a relative frequency of 0.38 and will thus take 0.38(360) = 136.8 degrees.

On a bar chart, you could take an 8.5 by 11 sheet of paper and cut out a one-inch strip 11 inches long. Then cut this strip into the same number of pieces as the number of categories, where the length of each piece is the relative frequency of the group times 11. For Coke Classic you would have a piece 0.38(11) = 4.18 inches long.

The relative frequency is a very important descriptor for a qualitative variable and is the basis for bar and pie charts.
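The same arithmetic can be done for every category at once. This is a small sketch, assuming the relative frequencies from the soft drink table; the function names `pie_angles` and `strip_lengths` are invented for the example.

```python
# Relative frequencies from the soft drink table.
relative_frequency = {
    "Coke Classic": 0.38,
    "Diet Coke": 0.16,
    "Dr. Pepper": 0.10,
    "Pepsi-Cola": 0.26,
    "Sprite": 0.10,
}


def pie_angles(rel_freq):
    """Degrees of the circle for each category: relative frequency times 360."""
    return {category: rel * 360 for category, rel in rel_freq.items()}


def strip_lengths(rel_freq, strip_inches=11):
    """Length of each piece cut from a strip: relative frequency times its length."""
    return {category: rel * strip_inches for category, rel in rel_freq.items()}


angles = pie_angles(relative_frequency)
lengths = strip_lengths(relative_frequency)
for category in relative_frequency:
    print(f"{category:<13} {angles[category]:6.1f} degrees  {lengths[category]:5.2f} inches")
```

Because the relative frequencies add up to 1, the angles add up to 360 degrees and the pieces add up to the full 11 inches.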

Describing Numerical Data

Here we study ways of describing a variable that is numerical.


Numerical variables have values that are real numbers. Remember that categorical variables may use numbers, but the variable really has values that represent groups.

Example of a categorical variable: eye color 1 = blue, 2 = green, 3 = red (especially on Friday morning).

Our initial method of describing a numerical variable will be basically the same as with a qualitative variable, with some modification in our understanding.

Let’s consider the variable age. Consider the first 20 people you see today. Consider yourself if you look in the mirror, but just count yourself once. The age of these folks could be 1 day to 110 years in Nebraska, right?


Remember, a frequency distribution is a tabular summary of data showing the number, or frequency, of items in each of several nonoverlapping classes.

With a variable like eye color (qualitative), we typically make each color a class. But with a variable like age (quantitative), if we make each age a class then we could have so many classes that the distribution is hard to interpret. The authors suggest grouping the ages into classes and having anywhere from 5 to 15 classes.

Let’s digress for a minute and think about a data set. Say I have data on people. Say I have social security number, eye color, age and blood alcohol level last Thursday night at 11:30. On the next screen I have what the data might look like in Excel, or other computer programs. Note each column is a variable. Each row represents a person in this example. Thus in each row we see the values of the variables for each person.

SS#         Eye color   Age   Blood alcohol level
123456789   Blue        22    .00
987654321   Blue        21    .016
567891234   Green       19    .010
345678912   Blue        27    .00
654321987   Brown       20    .00
000000000   Red         22    .023


The reason for my digression was to have you begin to think about data sets. Typically, a variable is in a column. The values down the column are for different people (or whatever the subject might be). I believe it is useful to think about data as you consider statistical ideas. Here we are looking at how to describe a column of data, one variable.

Now, when we have a numerical variable like age we have to think about how many classes to have. We want each class to have more than a few people in it. For now, let’s not worry too much about how many classes to have.

The “width” of each class should be equal. Using age as an example, we might have classes that have 5 consecutive ages included. The first class might be 10-14 year olds, then 15-19 year olds and so on.


Class “limits” need to be considered. Each person should be in only one class. Each class has a lower limit and an upper limit and these limits are exclusive to the class.

On the next screen I have an example of the frequency, relative frequency and percent frequency distributions for the variable age for 50 people.

The frequency column is just the counting of the number of people in each class. The relative frequency is the frequency of each class divided by the total number of people in the data set. The percent frequency is the relative frequency times 100.

(Look back at the distributions we had for the qualitative variable. Does it look the same?)

Age     Frequency   Relative Frequency   Percent Frequency
10-14   19          0.38                 38
15-19   8           0.16                 16
20-24   5           0.10                 10
25-29   13          0.26                 26
30-34   5           0.10                 10
Total   50          1.00                 100
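To see how individual ages end up in classes like these, here is a minimal sketch of the grouping step. The 15 ages are hypothetical, made up for illustration (they are not the 50 people in the table), and the function name `class_label` is my own; the classes follow the rule above of equal widths of 5 starting at age 10.

```python
from collections import Counter

# Hypothetical ages, made up for illustration (not the 50 people in the table).
ages = [12, 14, 10, 17, 19, 23, 21, 25, 29, 26, 33, 30, 13, 18, 27]


def class_label(age, start=10, width=5):
    """Return the class an age falls in, for example 17 -> '15-19'."""
    lower = start + ((age - start) // width) * width  # lower class limit
    return f"{lower}-{lower + width - 1}"             # upper limit = lower + width - 1


counts = Counter(class_label(age) for age in ages)
n = len(ages)
for label in sorted(counts, key=lambda text: int(text.split("-")[0])):
    freq = counts[label]
    print(f"{label}  {freq}  {freq / n:.2f}  {100 * freq / n:.0f}")
```

Every age lands in exactly one class, so the frequencies still add up to the number of people.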


Do you know why we put information in columns? Because then we can call’um as we see’um. Sorry:)

So, the frequency, relative frequency and percent frequency distributions are different ways of summarizing information about a numerical variable.

Notes about our table.

1) The total, or sum, of the frequency column is equal to the number of observations, n.

2) The total, or sum, of the relative frequency column is equal to 1.

3) The total, or sum, of the percent frequency column is equal to 100.


Bar graphs are used for qualitative variables. The equivalent for quantitative variables is called a histogram.

Histograms just put the frequency, relative frequency and percent frequency distributions into visual form. The form is a graph with certain properties.

The variable of interest is put along the horizontal axis. We would have the variable age on the axis.


Imagine you have a piece of construction paper that is blue. Do you remember way back when in school you would cut strips of paper and then curl the paper with the scissors? Well, we will not need to curl the paper here!

I mention this silly example because I want you to think about cutting strips that are all the same width and are as wide as the class width (remember class widths are equal). The height of each strip would then represent the frequency, relative frequency or percent frequency on the variable. You would tape each strip onto the graph above each class.

So the vertical axis, or height, in the histogram is either the frequency, relative frequency or percent frequency.

In constructing the histogram on a quantitative variable THERE IS NO SPACE between each bar to help us remember we have a quantitative variable.


Pie Charts

The authors do not mention it, but pie charts could be made in a similar fashion to what we saw before.

Cumulative Distributions

Have you ever accumulated a bunch of junk in your room? Yeah, me too. Each day more stuff just shows up. So tomorrow I will have all the stuff I have today and more.

Cumulative distributions are kind of like my story. When you look at the frequency distributions we just saw, a slight modification can make them into cumulative distributions. For the cumulative frequency, start with the first class in the first row. The cumulative value for this row is the frequency.


But the cumulative value for the second row is the frequency for the first row plus the frequency for the second row. So to get the cumulative frequency for a given row, add up the frequencies for that row and all previous rows.

The cumulative relative frequency and cumulative percent frequency are found as before: cumulative relative frequency is cumulative frequency divided by total and the cumulative percent frequency is the cumulative relative frequency times 100.
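The running totals are easy to check by computer. The sketch below assumes the age frequencies from the earlier table (19, 8, 5, 13, 5) and uses Python's `itertools.accumulate` to add each row to all the rows before it; the variable names are my own.

```python
from itertools import accumulate

# Classes and frequencies from the age frequency distribution.
classes = ["10-14", "15-19", "20-24", "25-29", "30-34"]
frequencies = [19, 8, 5, 13, 5]

# Each entry is the frequency of that row plus all previous rows.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]  # the last running total is the total frequency

cumulative_relative = [c / total for c in cumulative]
# Same as cumulative relative frequency times 100, written to avoid rounding noise.
cumulative_percent = [100 * c / total for c in cumulative]

for label, c, r, p in zip(classes, cumulative, cumulative_relative, cumulative_percent):
    print(f"{label}  {c:>2}  {r:.2f}  {p:.0f}")
```

The last row always reaches the total frequency, a cumulative relative frequency of 1, and a cumulative percent frequency of 100.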

What’s a henway? About 4 or 5 pounds!

What’s an Ogive? It is what we call a graph of a cumulative frequency distribution. The horizontal axis has values of the variable and the vertical axis has the appropriate cumulative frequency.

[Figure: histogram with Freq on the vertical axis (0 to 18) and the age groups 10 to 19, 20 to 29, 30 to 39, 40 to 49, and 50 to 59 on the horizontal axis.]

What is the most frequently occurring age group in this example? How many times does it occur? (the group is 30-39 and the frequency 17)

[Figure: ogive with Cumulative Freq on the vertical axis (0 to 60) and Age on the horizontal axis (0 to 60).]

This is a frequency Ogive (or polygon). Note here that what was accumulated was just the frequency. The highest frequency is 50 because that was the total number of folks in the study. What would be the highest value if we had a cumulative relative frequency? (1, right?)

Summary

With both categorical and numerical data, one way to summarize the data is to look at frequency information of groups. The categorical data are already in natural groups and the numerical data have to be grouped. Then we might look at the frequency, relative frequency, or the percent frequency of each group.

The relative frequency of a group = group frequency divided by the total frequency across all groups.

The percent frequency is the relative frequency times 100.

Bar charts, pie charts and histograms are based on these ideas.

Shape of histograms

[Figure: example histogram shapes: left skewed, symmetrical, right skewed, mound shaped.]

With left-skewed data the left “tail” is longer, but there are relatively few responses on the left side.

With right-skewed data the right “tail” is longer, but there are relatively few responses on the right side.

With mound-shaped data most of the data are piled in the middle and both “tails” look about the same, or are mirror images.


Cumulative Distributions

Age     Frequency   Relative Frequency   Percent Frequency
10-14   19          0.38                 38
15-19   8           0.16                 16
20-24   5           0.10                 10
25-29   13          0.26                 26
30-34   5           0.10                 10
Total   50          1.00                 100

Table from previous slide

We saw this table before in a previous section. It is a frequency distribution of ages for a group of 50 people.

On the next slide I will show what the cumulative distribution would be for this example.

Age     Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
10-14   19                     0.38                            38
15-19   27                     0.54                            54
20-24   32                     0.64                            64
25-29   45                     0.90                            90
30-34   50                     1.00                            100


Here are some additional methods for describing data


[Figure: number line marked 0, 10, 100.]

Here I have the number line. We start on the left at zero and as we move to the right we increase in value. (We can move to the left of zero into the negative numbers.)

Now, if we want a number line vertically we tip it from the 100 side up, as my arrow shows. This is the normal convention in all of math. One exception is the stem-and-leaf plot or display. In this one area we tip the line up from the zero side.

As an example of some data say 50 people take a test that has up to 150 points. Possible values on the test are 70, 120, 132, 79 and so on.


With the 50 people who took the test, let’s imagine we sort the data from low score to high score. In other words, we might arrange the data so that the lowest score is first, and then the next highest and so on. The last score is the highest.

When we have a 3-digit number, as with a test score, we have in general xxy. I write xxy because we will call the xx part the stem and the y part the leaf. The number 69, for example, would be stem 06, or 6, and leaf 9. The number 123 would be stem 12 and leaf 3.

Say we had two people on the test with score 69 and 68. In sorted order we would have 68, 69. In stem-and-leaf form we would have stem 6 and leafs 8 and 9. On the next slide I will have the beginning of a stem-and-leaf display similar to what you might see in the book.

 6 | 8 9
 7 |
 8 |
 9 | 1 2 2 2 4 5 5 6 7 8 8 8
10 |
11 |
12 |
13 |

Here I have the stem and leaf for the scores in the 60’s and the 90’s. Note the frequency of scores in the 60’s was two (two leafs) and the frequency of scores in the 90’s was 12.

The stem-and-leaf is like a histogram, but the leafs of each number are used in a stack to the side of the stem. Longer stacks represent more frequent groups.
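A stem-and-leaf display can be built by splitting each score into its last digit (the leaf) and everything in front of it (the stem). The following is a rough sketch, not the book's procedure: the scores are the ones readable from the display on this slide (only the 60's and 90's are filled in), and the function name `stem_and_leaf` is made up.

```python
from collections import defaultdict

# Scores read off the slide's display: two in the 60's and twelve in the 90's.
scores = [68, 69, 91, 92, 92, 92, 94, 95, 95, 96, 97, 98, 98, 98]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for value in sorted(values):       # sort from low score to high score
        stem, leaf = divmod(value, 10)  # 69 -> stem 6, leaf 9; 123 -> stem 12, leaf 3
        stems[stem].append(leaf)
    return dict(stems)


display = stem_and_leaf(scores)
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Stems with no scores still get a row, so the length of each stack can be compared the way bar heights are compared in a histogram.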


Crosstabulations, or crosstabs for short.

Sometimes in statistics we will want to work with two variables (bivariate) at a time. The reason is that we think the two variables are somehow related. As an example, what do you think is the relationship between student grades and the number of classes they skip in a semester? My hunch would be the more classes skipped the lower the grade. But we may want to research this idea.

A crosstab is a tabular summary of two variables. On the next screen I create a basic crosstab with students’ midterm scores in the rows and the students’ final exam scores in the columns. 30 students were observed.

                   Final exam
Midterm      <60   60 to <70   70 to <80   80 to <90   90+   Total
<60          2     2           0           0           0     4
60 to <70    0     3           3           0           0     6
70 to <80    0     0           3           3           0     6
80 to <90    0     0           0           2           2     4
90+          0     0           0           0           10    10
Total        2     5           6           5           12    30

Here I have a frequency crosstab. Note that 2 students had midterm scores < 60 and final exam scores < 60, for example.

The “total” row is really the frequency distribution for the variable Final exam score, and the “total” column is the frequency distribution for the variable midterm score.

In this table, we could put percentages in several formats.


Percentages in the cells

We call each part of the table a cell and depending on what we want to think about we can calculate different percentages.

Column percentages

When we look at a given column we have a final exam score and each row is a midterm score. If we use a column total as the basis of a percentage then we see percents of midterm score at each final exam score.

Row percentages

When we look at a given row we have a midterm score and each column is a final exam score. If we use the row total as the basis for a percentage then we see the percents of final exam score for a given midterm score.


Overall percentages

Since we had 30 students in the data, each cell divided by 30 tells us the proportion of times that cell came up; multiply by 100 to get the percent.
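The three kinds of percentages differ only in what each cell is divided by. Here is a minimal sketch using the counts from the midterm and final exam crosstab; the variable names are my own, and the code is meant as an illustration rather than a standard routine.

```python
labels = ["<60", "60 to <70", "70 to <80", "80 to <90", "90+"]

# Rows are midterm classes, columns are final exam classes (counts from the crosstab).
counts = [
    [2, 2, 0, 0, 0],
    [0, 3, 3, 0, 0],
    [0, 0, 3, 3, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 0, 10],
]

row_totals = [sum(row) for row in counts]        # midterm frequency distribution
col_totals = [sum(col) for col in zip(*counts)]  # final exam frequency distribution
n = sum(row_totals)                              # 30 students in all

# Row percentages: each cell divided by its row total.
row_pct = [[100 * c / t for c in row] for row, t in zip(counts, row_totals)]
# Column percentages: each cell divided by its column total.
col_pct = [[100 * c / col_totals[j] for j, c in enumerate(row)] for row in counts]
# Overall percentages: each cell divided by the number of students.
overall_pct = [[100 * c / n for c in row] for row in counts]

print("Row percentages for midterm", labels[0], [round(p) for p in row_pct[0]])
print("Column percentages for final", labels[0], [round(row[0]) for row in col_pct])
```

For example, of the 4 students with a midterm below 60, half also scored below 60 on the final, while all of the students who scored below 60 on the final had a midterm below 60.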

Scatter Diagram

Say we talk to people and we ask them their years of schooling and income last year. For each person we would have two values. On the next screen I show a scatter plot, where each person is a dot in the graph.

Scatterplot

[Figure: scatterplot with y = income on the vertical axis and x = years of schooling on the horizontal axis.]

In a graph we put the variable of interest on the y axis. Here it is thought that knowing the years of schooling for a person will better help us understand income and schooling is put on the x axis. In other words, certain values of income are ‘matched’ with schooling amounts.

Note here in the scatterplot that in general the higher the schooling, the higher the income. Thus, knowing schooling will permit better prediction of income.


In a scatterplot, when the dots seem to be going up hill we say there is a positive, or direct, relationship between the variables. This means that the higher the value on one variable, the higher the value of the other variable.

Dots going down hill suggest a negative, or indirect, relationship. This means as the value of one variable is getting higher the value on the other is getting lower.

[Figure: three scatterplots of y against x, labeled Positive relationship, Negative relationship, and No relationship?]


Caution: Correlation is not causation!

If in the scatter plot x and y have a positive relationship, then we say that as x goes up in value from one subject to the next, in general the y value goes up as well. In this sense we could also say x and y are positively correlated.

But statistically we cannot say that x causes y. We need other theories from our studies to be able to talk about causation.

Does every acre planted with corn yield the same number of bushels of corn? No, but the yield is correlated with the amount of water added. In fact, folks into ag would say up to a certain point adding water causes a higher yield!

Correlation must be present to have causation. But causation may not be there just because there is correlation.


Say you have this really large city and suburbs outside the city limits. Also say that there are many locations of the same franchise business (like McD’s, for example).

Say that the more miles a store location is from the city center, the more sales a given store will have. Do the miles from the city center cause the sales to be higher?

No, maybe here the people with more income moved out from the city center and the fact that they have more income permits them to purchase more (and they do purchase more because they like the product).

So, miles from the city center and sales are both influenced by a third variable. Miles from city center and sales are merely correlated.
