
Introduction to Statistical Data Analysis
Lecture 1: Working with Data Sets

James V. Lambers

Department of Mathematics
The University of Southern Mississippi

Outline:
Introduction
Statistical Software: The R Project
Types of Statistics
Data Collection and Usage
Data Display
Measures of Central Tendency
Measures of Dispersion

James V. Lambers Statistical Data Analysis 1 / 76


Introduction

This course is an introduction to statistical data analysis.

The purpose of the course is to acquaint students with fundamental techniques for gathering data, describing data sets, and most importantly, drawing conclusions based on data.

Topics that will be covered include probability, probability distributions, sampling, confidence intervals, hypothesis testing, correlation, and regression.


The R Project

To illustrate and work with concepts and techniques presented in this course, we will use a software tool known as R, which provides a programming environment for statistical computing and graphics. It is freely available for download from the site

http://www.r-project.org/

Throughout this course, as concepts are presented, relevant R functions and sample code will be given.


Descriptive Statistics

The purpose of descriptive statistics is to summarize and display data in such a way that it can readily be interpreted. Examples of descriptive statistics are as follows:

- The average, or mean, is a convenient way of describing a set of many numbers with just a single number.

- A chart is useful for organizing and summarizing data in meaningful ways.


Example

Consider a list of test scores in a class with many students:

78  60 89 80 77 83 79 61 73 73
67 100 62 68 64 57 72 71 98 71
59  99 94 72 52 68 73 79 71 82
81  56 61 64 67 70 75 30 68 94

The average of all of these test scores is approximately 72.5, which suggests that the overall performance of the class on the test was a C.
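As a quick check of the figure quoted above, the mean can be computed in R. This is a minimal sketch; the variable names scores and class_mean are chosen here for illustration and do not come from the lecture.

```r
# The 40 test scores from the example, entered row by row.
scores <- c(78,  60, 89, 80, 77, 83, 79, 61, 73, 73,
            67, 100, 62, 68, 64, 57, 72, 71, 98, 71,
            59,  99, 94, 72, 52, 68, 73, 79, 71, 82,
            81,  56, 61, 64, 67, 70, 75, 30, 68, 94)

# mean() adds up the scores and divides by how many there are.
class_mean <- mean(scores)
print(class_mean)   # 72.45, which rounds to the 72.5 quoted above
```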


Example, cont’d

We can also gauge the overall performance of the class with this chart in which the scores are categorized according to their letter grade (assuming “straight-scale” letter-grading):

Range     Number of scores in range
90-100    5
80-89     5
70-79     14
60-69     11
0-59      5

which shows that the majority of the students earned C’s or D’s.
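The counts in this chart can be reproduced in R with cut and table. The sketch below is one possible approach, not necessarily how the chart was made; the break points and labels are assumptions chosen to match the straight-scale grade ranges.

```r
# The 40 test scores from the example.
scores <- c(78,  60, 89, 80, 77, 83, 79, 61, 73, 73,
            67, 100, 62, 68, 64, 57, 72, 71, 98, 71,
            59,  99, 94, 72, 52, 68, 73, 79, 71, 82,
            81,  56, 61, 64, 67, 70, 75, 30, 68, 94)

# Left-closed classes [0,60), [60,70), ..., [90,101); the last break is 101
# so that a score of 100 falls inside the top class.
grade_breaks <- c(0, 60, 70, 80, 90, 101)
grade_labels <- c("0-59", "60-69", "70-79", "80-89", "90-100")

grade_counts <- table(cut(scores, breaks = grade_breaks,
                          labels = grade_labels, right = FALSE))
print(grade_counts)
```

Adding the counts for 70-79 and 60-69 gives 25 of the 40 scores, which is the majority mentioned above.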


Inferential Statistics

The other, much more sophisticated branch of statistics is inferential statistics, which is used to make actual claims about an entire (large) population based on a (relatively small) sample of data.

Related topics:

- Confidence intervals

- Hypothesis testing

- Goodness-of-fit tests

- Correlation and regression


Example

For example, suppose that a pollster wanted to determine the percentage of all registered voters in California that would support a certain ballot measure.

It would not be practical to question the entire population consisting of all of these voters, as there are millions of them.

Instead, the pollster would question a sample consisting of a reasonable number of these voters (such as, for example, 200 voters), and then use inferential statistics to draw a conclusion about the voting preference of the entire population based on the data obtained from the sample.
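The idea of estimating a population percentage from a sample can be sketched with a small simulation. Everything here is hypothetical: the "true" level of support (55%) is an assumption made up for illustration, since a real pollster would not know it, and the variable names are not from the lecture.

```r
set.seed(1)               # fixed seed so the sketch is reproducible
true_support <- 0.55      # hypothetical population proportion (assumption)
sample_size  <- 200       # sample size mentioned in the example

# Simulate 200 randomly chosen voters; TRUE means the voter supports the measure.
responses <- runif(sample_size) < true_support

# The proportion of supporters in the sample is the estimate
# of the proportion of supporters in the whole population.
estimated_support <- mean(responses)
print(estimated_support)
```

The estimate will usually be close to, but not exactly equal to, the population value; quantifying that gap is the job of the inferential techniques listed earlier.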


The Distinction

The essential difference between descriptive and inferential statistics lies in the size of the population about which conclusions are being made.

In descriptive statistics, conclusions are made about a relatively small population based on direct observations of every member of that population.

In inferential statistics, conclusions are made about a relatively large population based on descriptive statistics applied to a small sample from that population.


Ethics in Statistics

The example of inferential statistics given above, concerning a pollster, can be expanded to illustrate important aspects of ethics in statistics.

In order to draw sound conclusions about a large population, it is essential that a sample of that population be representative of that population; otherwise, the sample is said to be biased.


1936 Presidential Election

A biased sample occurred during the presidential election of 1936, in which a poll of a sample of voters was conducted in order to determine whether the majority would vote for Franklin D. Roosevelt, the Democratic candidate, or Alf Landon, the Republican candidate.

The conclusion made from the poll was that Landon would win the election, when in fact Roosevelt won.


Where Did They Go Wrong?

The poll yielded an incorrect conclusion because telephone directories were used to obtain voter names, and in 1936, telephones existed primarily in more affluent households, which tended to vote Republican.

That is, the method of polling led to an unintentional bias.

In some cases, unfortunately, a sample can be biased intentionally, in order to reach a false conclusion that supports one’s agenda.


Internet Polling

Just as telephone polling was problematic decades ago, internet polling is problematic today.

It is very difficult to ensure that voters in an internet poll vote only once, and it is impossible to ensure that those who vote are actually representative of any given population.

For this reason, such polls are generally labeled as “unscientific”, although this disclaimer is not always noted by those who read the results of such polls.


Worst Practices in Data Display

Another example of a questionable or unethical use of statistics is the tactic of emphasizing differences through display.

Suppose that over a period of three years, the average price of a home in a certain city has increased from $380,000 to $390,000 to $400,000.

This data can be displayed in different ways to either emphasize or de-emphasize the increase.


[Figure: Different approaches to displaying the same increase in home prices over a three-year period]


Manipulation of Axes

Note that both charts display exactly the same data, but whereas the chart on the left uses a vertical scale that has the effect of making the yearly increase seem negligible, the chart on the right uses a vertical scale that makes this same increase seem much more dramatic.

People who report statistics can, unfortunately, use tactics like this to subtly influence consumers of the information that they provide.
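The effect of the vertical scale can be reproduced in R with the ylim argument of plot. This is only a sketch: the two axis ranges and the year labels 1 to 3 are assumptions chosen for illustration, and the original charts may have used different scales or a different chart type.

```r
years  <- 1:3                          # hypothetical labels for the three years
prices <- c(380000, 390000, 400000)    # average home prices from the example

# Total increase over the period, as a percentage of the starting price.
pct_increase <- 100 * (prices[3] - prices[1]) / prices[1]
print(pct_increase)                    # a little over 5 percent

# Two vertical scales (both are assumptions chosen for illustration).
wide_scale   <- c(0, 1000000)          # the increase looks negligible
narrow_scale <- c(375000, 405000)      # the same increase looks dramatic

par(mfrow = c(1, 2))                   # draw the two plots side by side
plot(years, prices, type = "b", ylim = wide_scale,
     xlab = "Year", ylab = "Average price ($)", main = "Wide scale")
plot(years, prices, type = "b", ylim = narrow_scale,
     xlab = "Year", ylab = "Average price ($)", main = "Narrow scale")
```

Both panels plot the same three numbers; only ylim differs.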


Data Collection and Usage

In this section, we discuss various approaches to data collection, and the ramifications of each. It is important to consider both the source of the data, and the method of measurement used during its collection. First, we give some definitions.

- Data (singular: datum) are values assigned to observations that are made about a population.

- A parameter is data that describes a characteristic of a population, such as the average income of all members of the labor force within a city.

- By contrast, a statistic is data that describes a characteristic of a sample, such as the most popular candy bar among the members of a focus group.

- Information is data transformed into useful facts, typically through inferential statistics.


Example

Suppose that a large corporation that has hundreds of stores throughout the United States wants to determine the trend of its sales from year to year.

The average revenue of all of its stores would be considered a parameter, where the population consists of all stores.

However, the corporation could consider just a sample of its stores and compute the average revenue for this subset, which would be a statistic.

Suppose that this average is found to be dropping from year to year. From this data, the corporation could glean the essential information that it is in danger of going bankrupt if this trend continues, and must act before it is too late.
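The difference between a parameter and a statistic can be illustrated with a short simulation. All of the numbers below are made up for illustration (300 stores, revenues between 500 and 1500 thousand dollars, a sample of 30 stores); none of them come from the lecture.

```r
set.seed(42)                 # fixed seed so the sketch is reproducible
n_stores <- 300              # hypothetical number of stores

# Hypothetical annual revenue per store, in thousands of dollars (made-up values).
revenue <- round(runif(n_stores, min = 500, max = 1500))

# Parameter: computed from every member of the population (all stores).
population_mean <- mean(revenue)

# Statistic: computed from a random sample of 30 of the stores.
sampled_stores <- sample(revenue, size = 30)
sample_mean    <- mean(sampled_stores)

print(c(parameter = population_mean, statistic = sample_mean))
```

The statistic is cheaper to obtain than the parameter, at the cost of differing from it by some sampling error.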


Data Sources

We now examine various sources of data.

Regardless of the type of source, data can be categorized as either primary data, which is data collected by an individual or organization for their own use, or secondary data, which is data collected by others (such as a government agency).

Regardless of whether one collects one's own data or obtains it from elsewhere, it is essential to ensure that this data is collected from a sample that is representative of the population that is being studied.


Direct Observation

Direct observation is an approach to data collection in which subjects of the observation are in their natural environment.

That is, there is little or no interaction between the subjects and the observer.

Some examples are observing animals in the wild or people in public places. An advantage of this approach is that the subjects are not influenced by the data collection process, which helps ensure more reliable data.

A disadvantage is lack of control over the sample, thus making it difficult to ensure that it is representative of the population of interest.


Experiments

A clinical trial for a new medication is an example of an experiment, which is another type of data source.

In an experiment, unlike with direct observation, a statistician has more control over the makeup of the sample, to ensure that it is representative of the population of interest.

On the other hand, because the participants are aware that data is being collected from them, they might (even unintentionally) be biased, thus influencing this data.


Surveys

In surveys, subjects are asked direct questions in order to produce the desired data.

In this approach, it is essential to avoid two kinds of bias: bias due to the subjects not being a representative sample of the population, and bias due to the form of the questions being asked, which can substantially influence the data.


Levels of Measurement Scales

Now that we know of some sources from which data can be gathered, we also need to know about the ways in which it can be measured, and the ramifications of each.


Nominal

Nominal measurement is a purely qualitative form of measurement, in which observations are assigned to categories, such as one’s gender, occupation, or state of residence.

It does not make sense to perform arithmetic operations or order comparisons on such measurements, even if the categories are labeled numerically (for example, zip codes).


Ordinal

The “next step up” from nominal measurement, on the spectrum from qualitative to quantitative, is ordinal measurement.

Such measurements can be either qualitative or quantitative, and they can be ranked; examples would be the order of finish in a race, or the number of stars given to a movie by a critic as a rating.

However, other mathematical operations do not make sense; for instance, one cannot claim that a movie that earns four stars is twice as good as a movie that earns two stars, or that the difference in quality between any 2-star movie and any 4-star movie is the same.


Interval

Interval measurements are purely quantitative, and can be added or subtracted.

An example would be temperature, since differences in temperature measurements are meaningful.

However, interval measurements cannot be multiplied or divided; that is, one hundred degrees is not considered twice as warm as fifty degrees.


Ratio

The most versatile form of measurement is ratio measurement.

For such measurements, addition, subtraction, multiplication, division and comparison are valid.

Examples of ratio measurement are age, weight, or salary.

What distinguishes ratio measurements from interval measurements is that there is a “zero point” that makes ratios have meaning.

A useful rule of thumb is the “twice as much” rule: if doubling a measurement has a consistent meaning, then the measurement is a ratio measurement rather than an interval measurement.
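The four levels of measurement can be loosely mirrored in R's data types. The sketch below is an illustration only: the example values are made up, and reading the "degrees" in the temperature example as Fahrenheit is an assumption.

```r
# Nominal: categories with no order; R stores these as an unordered factor.
state <- factor(c("MS", "CA", "MS", "TX"))
state_counts <- table(state)                 # counting categories is meaningful

# Ordinal: categories that can be ranked; an ordered factor allows comparisons.
rating <- factor(c("2 stars", "4 stars", "3 stars"),
                 levels = c("1 star", "2 stars", "3 stars", "4 stars"),
                 ordered = TRUE)
second_beats_first <- rating[2] > rating[1]  # TRUE: ranking is meaningful

# Interval: differences are meaningful, but ratios depend on the unit.
temp_f <- c(50, 100)                         # assumed to be degrees Fahrenheit
temp_c <- (temp_f - 32) * 5 / 9              # the same temperatures in Celsius
ratio_f <- temp_f[2] / temp_f[1]             # 2 in Fahrenheit
ratio_c <- temp_c[2] / temp_c[1]             # about 3.78 in Celsius

# Ratio: a true zero point makes "twice as much" meaningful in any unit.
weight_lb <- c(80, 160)
weight_kg <- weight_lb * 0.45359237          # pounds to kilograms
ratio_lb <- weight_lb[2] / weight_lb[1]      # 2
ratio_kg <- weight_kg[2] / weight_kg[1]      # still 2
```

The temperature ratio changes when the unit changes, which is why "twice as warm" has no consistent meaning, whereas the weight ratio is 2 in either unit.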


Frequency Distributions

A frequency distribution is a table that lists specific intervals, called classes, along with the number of data observations that fall into each class.

The number of observations belonging to a particular class is called a frequency.


Example

Suppose that a survey of 100 voters is taken, in which the age of each respondent is recorded. The ages of the respondents are

48 55 73 54 36 82 30 37 63 50
25 64 48 84 34 18 69 72 66 64
60 47 24 63 65 50 51 31 63 72
51 75 37 85 77 48 29 38 84 43
67 68 29 35 42 50 42 24 33 64
67 86 38 65 73 72 61 58 68 47
63 55 49 38 65 41 31 66 35 77
20 41 55 65 18 73 70 56 26 76
23 25 50 67 60 51 35 48 61 36
40 61 79 23 45 21 82 63 50 61


Example, cont’d

Since voters must be at least 18 years of age, classes could be chosen as follows: 18-27, 28-37, and so on, up to 78-87, since the maximum age among all respondents is 86. Then, the frequency distribution is

Age Range    Number of Respondents
18-27        11
28-37        14
38-47        12
48-57        18
58-67        24
68-77        14
78-87        7

Frequency distribution of ages of 100 voters surveyed


Frequency Distributions in R

Suppose that the 100 ages from the preceding example are stored in a text file, called ages.txt, as a simple list of numbers separated by spaces. To create this frequency distribution in R, the following commands can be used:

> ages = scan("ages.txt")
> breaks = seq(min(ages), max(ages)+10, by=10)
> freq = table(cut(ages, breaks, right=FALSE))
> freq

[18,28) [28,38) [38,48) [48,58) [58,68) [68,78) [78,88)
     11      14      12      18      24      14       7


Dissection of R Code

In Windows, by default, R assumes that files are stored in your My Documents folder; otherwise, a full pathname should be specified as the argument to scan.

The min and max functions return the minimum and maximum values, respectively, of their argument.

The seq function returns a sequence of numbers with specified starting value, ending value, and spacing. In this case, 10 is added to the maximum value to ensure that it is included in a class.


Dissection, cont’d

The cut function determines which class each element of its first argument belongs to, where the classes are specified by the second argument. The third argument right=FALSE is used to specify that the right endpoint of each class is not included in the class.

Finally, the table function generates the frequency distribution from the output of cut; the result is stored in the variable freq.
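For readers who want to run these commands without first creating ages.txt, here is a self-contained sketch of the same steps. The only change, made for convenience, is that the ages are typed directly into a vector rather than read with scan.

```r
# The 100 ages from the example, entered row by row instead of read from ages.txt.
ages <- c(48, 55, 73, 54, 36, 82, 30, 37, 63, 50,
          25, 64, 48, 84, 34, 18, 69, 72, 66, 64,
          60, 47, 24, 63, 65, 50, 51, 31, 63, 72,
          51, 75, 37, 85, 77, 48, 29, 38, 84, 43,
          67, 68, 29, 35, 42, 50, 42, 24, 33, 64,
          67, 86, 38, 65, 73, 72, 61, 58, 68, 47,
          63, 55, 49, 38, 65, 41, 31, 66, 35, 77,
          20, 41, 55, 65, 18, 73, 70, 56, 26, 76,
          23, 25, 50, 67, 60, 51, 35, 48, 61, 36,
          40, 61, 79, 23, 45, 21, 82, 63, 50, 61)

# Class boundaries 18, 28, ..., 88; adding 10 to the maximum (86)
# guarantees that the last boundary lies above every observation.
breaks <- seq(min(ages), max(ages) + 10, by = 10)

# cut() assigns each age to a left-closed class [a, b); table() counts per class.
freq <- table(cut(ages, breaks, right = FALSE))
print(freq)
```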


Class Selection

In determining the classes for a frequency distribution, the following guidelines should be observed:

- All classes should be of equal size, so that the number of observations in each class can be compared in a meaningful way.

- There should be between 5 and 15 classes. Using too few classes fails to give a sense of the distribution of observations, and having too many classes makes comparing classes less useful.

- Classes should not be “open-ended”, if possible. For example, if observations are ages, there should not be a class of “over age 50”.

- Classes should be exhaustive, so that all data observations can be included.

Note that the frequency distribution in the preceding example follows these guidelines; had classes spanned 20 years instead of 10, there would have been too few.
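The remark about 20-year classes can be checked with a few lines of R. This is a minimal sketch that uses only the minimum (18) and maximum (86) ages from the example; the helper count_classes is a hypothetical name introduced here for illustration.

```r
age_min <- 18    # youngest respondent in the example
age_max <- 86    # oldest respondent in the example

# Number of equal-width classes needed to cover the data for a given class width,
# using the same seq() construction as in the frequency distribution code.
count_classes <- function(width) {
  breaks <- seq(age_min, age_max + width, by = width)
  length(breaks) - 1
}

classes_10 <- count_classes(10)   # 7 classes: within the 5-to-15 guideline
classes_20 <- count_classes(20)   # 4 classes: fewer than the recommended 5
print(c(width_10 = classes_10, width_20 = classes_20))
```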


Variations

Some variations on a frequency distribution are:

- In a relative frequency distribution, all frequencies are divided by the total number of observations, in order to obtain the percentage of observations in each class. As before, classes should be exhaustive, so that the total of all relative frequencies is 100%.

- A cumulative frequency distribution lists, for each class, the percentage of observations that are less than or equal to the values in the class.

- A histogram is a bar graph in which the height of each bar is the number of observations in a class.
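The first two variations can be computed directly from the frequencies in the age example. The sketch below starts from the seven class counts given earlier; the variable names are chosen here for illustration.

```r
# Frequencies from the age example, labeled by class.
freq <- c("18-27" = 11, "28-37" = 14, "38-47" = 12, "48-57" = 18,
          "58-67" = 24, "68-77" = 14, "78-87" = 7)

# Relative frequency: each count as a percentage of all observations.
rel_freq <- 100 * freq / sum(freq)

# Cumulative frequency: running total of the relative frequencies.
cum_freq <- cumsum(rel_freq)

print(rbind(relative = rel_freq, cumulative = cum_freq))
```

Because this sample has exactly 100 observations, the relative frequencies equal the counts; the cumulative row reaches 100% at the last class, as it must when the classes are exhaustive.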


Histograms

A histogram can easily be created in R, using the hist command. For example, from the age data used in previous examples, the command

hist(ages)

produces the histogram shown on the next slide.

With this simple usage of hist, the classes are chosen automatically; a second argument, breaks, can be used to specify the classes manually. For example,

hist(ages, breaks=c(18,27.5,37.5,47.5,57.5,67.5,77.5,87))

produces a histogram that conforms to the frequency distribution given in the preceding example.
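To see which observations land in which class without drawing anything, hist can be called with plot = FALSE. The sketch below assumes a short hypothetical vector x in place of the age data. Note that these breaks are not equally spaced (the first and last classes are 9.5 wide), in which case hist draws densities rather than counts by default; the counts remain available in the returned object.

```r
# Hypothetical integer ages; not the lecture's data set.
x <- c(18, 22, 27, 28, 35, 37, 41, 52, 66, 86)

# Breaks placed halfway between the integer class limits, as in the command above.
b <- c(18, 27.5, 37.5, 47.5, 57.5, 67.5, 77.5, 87)

# plot = FALSE returns the binning without drawing the histogram.
h <- hist(x, breaks = b, plot = FALSE)
print(h$counts)   # number of observations in each class
```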


Histogram Example

Histogram of age data produced in R


Stem-and-Leaf Display

A stem-and-leaf display is a table for displaying integer-valued observations in which each observation is decomposed into a “leaf”, which is the ones digit, and a “stem”, which consists of the rest of the digits.

The display consists of two columns; the left column lists stems and the right column lists all leaves with their corresponding stems.

An advantage of using a stem-and-leaf display is that all of the original observations are actually visible in the display, as opposed to a frequency distribution that only lists the number of observations that fall within each class.


Stem-and-Leaf Display of Age Data

1 | 88
2 | 01334455699
3 | 011345556677888
4 | 01122357788889
5 | 00000111455568
6 | 00111133333444555566777889
7 | 022233356779
8 | 224456
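Since every observation is visible in the display, the data can be rebuilt from it. The following sketch assumes the leaves are read off stem by stem as strings; the helper names stems and leaves are illustrative. The rebuilt vector has 100 values, with mean 52.55 and median 52.5, in agreement with the values reported for the age data later in the lecture.

```r
# Leaves read off the display, one string per stem.
stems  <- 1:8
leaves <- c("88", "01334455699", "011345556677888", "01122357788889",
            "00000111455568", "00111133333444555566777889",
            "022233356779", "224456")

# Each observation is 10 * stem + leaf.
ages <- unlist(mapply(function(s, l) 10 * s + as.integer(strsplit(l, "")[[1]]),
                      stems, leaves, SIMPLIFY = FALSE))

length(ages)   # number of observations recovered
stem(ages)     # base R's own stem-and-leaf display of the same data
```

The layout printed by stem may differ from the display above, since R chooses its own scale for the stems.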


Pie Charts

A pie chart is a circle divided into sectors that are associated with classes.

The central angle of each sector is equal to the relative frequency of the corresponding class, multiplied by 360 degrees.

As a result, the size of each sector is indicative of the relative frequency of each class. It is best to also use colors to distinguish the classes.

A pie chart for the age data used in previous examples is shown on the next slide. It is generated using the R command

pie(freq)

where freq is the frequency distribution generated earlier.
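The rule for the central angles can be checked with a few lines of R. The class labels and counts below are hypothetical and serve only to illustrate the arithmetic.

```r
# Class frequencies; hypothetical counts for illustration.
freq <- c(A = 5, B = 10, C = 25)

rel_freq <- freq / sum(freq)   # relative frequency of each class
angles   <- 360 * rel_freq     # central angle of each sector, in degrees
print(angles)
```

The angles add up to 360 degrees because the relative frequencies add up to 1.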


Pie Chart Example

Pie chart generated from frequency distribution of age data


Bar Charts

A bar chart is like a histogram, except that the height of each bar is determined by a specific data value, rather than the frequency of a class.

Thus, a bar chart is used to highlight the actual values in the data set, as opposed to a pie chart, which highlights the relative sizes of classes.

The bar chart shown on the next slide is generated in R from the age data using the command

barplot(sort(ages))


Bar Chart Example

Bar chart generated from sorted age data


Line Charts

A line chart is useful for illustrating a relationship between two sets of data, particularly when there is a large number of observations.

Observations are plotted as points on the chart, and the x- and y-coordinates of the points are obtained from the observations of each data set.

The points are then connected to help depict the relationship between the sets.
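The lecture does not give an R command for line charts; one possible sketch uses the base plot function with type = "b", which draws both the points and the connecting segments. The paired vectors x and y below are hypothetical.

```r
# Hypothetical paired observations (for example, a year and a measured value).
x <- c(2003, 2001, 2004, 2002, 2005)
y <- c(14, 10, 17, 12, 21)

# Sort by x so that the connecting line runs from left to right.
ord <- order(x)

# type = "b" draws the points and the segments that connect them.
plot(x[ord], y[ord], type = "b", xlab = "x", ylab = "y")
```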


Measures of Central Tendency

It is highly desirable to be able to characterize a data set using a single value.

Suppose that a data set consists of numerical values, and that the observations are plotted as points on the real number line.

Then, a number that is at the “center” of these points can serve as such a characterizing value.

This value is called a measure of central tendency.


Mean

Given a set of $n$ numerical observations $\{x_1, x_2, \ldots, x_n\}$ of a population, the mean of the set is

\[ \mu = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

When the observations are drawn from a sample, rather than an entire population, then the mean is denoted by $\bar{x}$:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

The mean can be defined more concisely using sigma notation:

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
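As a quick check of the definition, the sum can be formed explicitly and compared with R's mean function. The data below are hypothetical.

```r
x <- c(4, 8, 15, 16, 23, 42)   # hypothetical observations
n <- length(x)

manual <- sum(x) / n           # the definition: sum of the values divided by n
print(manual)
print(mean(x))                 # the built-in function gives the same value
```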


The Mean in R

To compute the mean of a data set in R, the mean function can be used.

For example, with the age data used in previous examples, we have:

> mean(ages)

[1] 52.55


Weighted Mean

In some instances, a measure of central tendency needs to be computed from the values in a data set, in which some values should be assigned more weight than others.

This leads to the notion of a weighted mean

\[ \mu = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}. \]

The weights must all be positive.


Example

Suppose that an overall course grade is computed by weighting a homework average $h$ by 10%, two test grades $t_1$ and $t_2$ by 25% each, and a final exam $f$ by 40%.

Then the overall grade is

\[ \frac{10h + 25t_1 + 25t_2 + 40f}{10 + 25 + 25 + 40}. \]


Weighted Mean in R

To compute a weighted mean in R, the weighted.mean function can be used.

The first argument is a vector of observations, and the second argument is a vector of weights.

For example, suppose the homework average is 80, the test scores are 75 and 85, and the final exam score is 90. Then, with the weights of 10%, 25%, 25%, and 40% from the preceding example, the weighted mean is

> grades <- c(80,75,85,90)

> weighted.mean(grades,c(10,25,25,40))

[1] 84
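The same grade can be computed directly from the weighted mean formula, using the 10%, 25%, 25%, and 40% weights of the grading scheme. The variable names weights and manual are illustrative.

```r
grades  <- c(80, 75, 85, 90)   # homework, test 1, test 2, final exam
weights <- c(10, 25, 25, 40)   # percentages from the grading scheme

# Weighted mean from the formula: sum of w_i * x_i divided by the sum of w_i.
manual <- sum(weights * grades) / sum(weights)
print(manual)

# Rescaling all weights by the same factor leaves the weighted mean unchanged.
print(weighted.mean(grades, weights / 100))
```

Because the formula divides by the sum of the weights, the weights do not need to add up to 100 or to 1.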


Mean of Grouped Data

When data observations are summarized in a frequency distribution, an approximation of their mean can readily be obtained.

Suppose that the frequency distribution has $n$ classes, with frequencies $f_1, f_2, \ldots, f_n$.

Furthermore, suppose that the $i$th class has a representative value $c_i$; for example, it could be the average of the lower and upper bounds of the class.


Approximating the Mean

Then an approximation of the mean is

\[ \mu \approx \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i}. \]

This is a weighted mean of the class values, in which the frequencies are the weights. It follows that if each class contains only a single value, then the approximation is exact.
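The single-value case can be illustrated with table, which counts how often each distinct value occurs. The data and the names ci, fi, and grouped_mean below are hypothetical.

```r
v <- c(2, 3, 3, 5, 5, 5, 7)     # hypothetical observations

tab <- table(v)                 # counts of each distinct value
ci  <- as.numeric(names(tab))   # class values: the distinct values themselves
fi  <- as.vector(tab)           # frequencies of those values

grouped_mean <- sum(ci * fi) / sum(fi)
print(grouped_mean)
print(mean(v))                  # agrees with the grouped computation
```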


Example

Consider the frequency distribution of age data given earlier. The classes are age ranges 18-27, 28-37, and so on.

If we average the upper and lower bounds of each class, we obtain representative values of the classes.

In R, this can be accomplished using the following statements, and the breaks variable that was defined earlier.

> breaks

[1] 18 28 38 48 58 68 78 88

> class_midpoints=(breaks[1:7]+(breaks[2:8]-1))/2

> class_midpoints

[1] 22.5 32.5 42.5 52.5 62.5 72.5 82.5


Vectors in R

Note that components of a vector are accessed using indices enclosed in square brackets, and that the first component of each vector has the index of 1.

Also, a contiguous portion of a vector can be extracted by specifying a range of indices with a colon.

For example, breaks[1:7] is a vector consisting of the first 7 elements, numbered 1 through 7, of breaks.
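A short illustration of these indexing rules, using the values of breaks shown earlier:

```r
breaks <- c(18, 28, 38, 48, 58, 68, 78, 88)   # values of breaks shown earlier

breaks[1]      # first component (indices start at 1)
breaks[1:7]    # components 1 through 7: the lower bounds of the seven classes
breaks[2:8]    # components 2 through 8: the same vector shifted by one position

# Subtracting the two slices component by component gives the class widths.
breaks[2:8] - breaks[1:7]
```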


Example, cont’d

Now, an approximate mean can be computed using the formula for the mean of grouped data:

> sum(class_midpoints*freq)/sum(freq)

[1] 52.5

Note that this approximation is very close to the actual mean of 52.55.

Also, note that vectors of the same length can be multiplied; the result is a vector of products of corresponding components of the vectors.

Then, sum can be used to compute the sum of all of the components of a vector.
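The whole computation can be run on its own as follows. The class counts in freq are an assumption: they are tallied here from the stem-and-leaf display of the age data, and should be replaced by the freq object from the earlier example if it differs.

```r
breaks <- c(18, 28, 38, 48, 58, 68, 78, 88)
class_midpoints <- (breaks[1:7] + (breaks[2:8] - 1)) / 2

# Class counts tallied from the stem-and-leaf display of the age data (assumed).
freq <- c(11, 14, 12, 18, 24, 14, 7)

products <- class_midpoints * freq          # component-by-component products
approx_mean <- sum(products) / sum(freq)    # grouped-data approximation of the mean
print(approx_mean)
```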


Median

The median of a data set is, informally, the value such that half of the values in the set are less than the median, and half are greater than the median.

Specifically, if the number $n$ of observations in the set is odd, then the median is the middle value of the set, at position $(n + 1)/2$, if the values are sorted.

If $n$ is even, then the median is defined to be the average of the values at positions $n/2$ and $n/2 + 1$.

The median function in R can be used to compute the median of a vector of observations. For example, using the age data, we have

> median(ages)

[1] 52.5
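The two cases of the definition can be written out directly. The function name median_by_hand is illustrative and is not part of R; in practice the built-in median function should be used.

```r
# Median computed directly from the definition (illustrative helper).
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

median_by_hand(c(7, 1, 5))       # odd number of values
median_by_hand(c(7, 1, 5, 3))    # even number of values
```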


Mode

The mode of a data set is the value that occurs most often within the set. It is possible for a data set to have more than one mode.

There is no built-in function in R for computing the mode of a data set (the mode function reports an object's storage mode instead), but if v is a vector containing all of the values of a data set, the following statements can be used to find its modes.

> vtable=table(v)

> where <- vtable==max(vtable)

> names(vtable)[where]


Code Dissection

The first statement

vtable=table(v)

creates a one-row table from v, in which the data values of v are the header names of the columns in vtable, and the values in the one row of vtable are the counts of those values in v.

The second statement

where <- vtable==max(vtable)

marks the positions within the table at which the counts are equal to the maximum. The variable where is a logical vector, with the same number of elements as there are distinct values in v. Each element of where is TRUE if the count of the corresponding value is equal to the maximum, and FALSE otherwise.


Code Dissection, cont’d

The third statement

names(vtable)[where]

uses the names function to extract the column names from vtable, which are also the distinct values in the original data set in v.

Then, the subscript [where] extracts only those column names in which the corresponding counts are equal to the maximum, which are the modes.
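The three statements can be wrapped in a small function. The name find_modes is illustrative and is not part of R. One detail worth noting is that names returns character strings, so for numeric data a conversion with as.numeric is added here.

```r
# Illustrative helper (not a base R function) wrapping the three statements.
find_modes <- function(v) {
  vtable <- table(v)
  where <- vtable == max(vtable)
  # names() returns character strings; convert back for numeric data.
  as.numeric(names(vtable)[where])
}

find_modes(c(3, 1, 3, 2, 1, 5))   # the values 1 and 3 both occur twice
```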


Choosing a Measure

Given these three measures of central tendency, it is natural to ask which one should be used.

The mean can be skewed if the data set contains outliers, thus making it an unreliable measure.

The median, on the other hand, is not susceptible to such bias.

Finally, the mode is not often used, except with nominal data, which cannot be compared or added anyway.
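The effect of an outlier on the mean and the median can be seen in a small experiment. The data below are hypothetical.

```r
x <- c(30, 32, 35, 36, 38)   # hypothetical observations
x_out <- c(x, 400)           # the same data plus one extreme value

c(mean = mean(x), median = median(x))
c(mean = mean(x_out), median = median(x_out))
```

The single extreme value moves the mean from 34.2 to about 95.2, while the median only moves from 35 to 35.5.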


Measures of Dispersion

A measure of central tendency is quite limited in its ability to describe a data set.

For example, the values may be clustered closely around the mean or median, or they may be widely spread out.

As such, we can use a measure of dispersion that describes how far individual data values deviate from a measure of central tendency.


Range

The range of a set of data observations is simply the difference between the largest and smallest values.

This measure of dispersion has the advantage that it is very easy to compute.

However, it uses very little of the data, and is unduly influenced by outliers.

The range function in R returns the smallest and largest values of a set of observations; the range is the difference between them.

> range(ages)

[1] 18 86
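To obtain the range as a single number, the two values returned by range can be subtracted with diff. The vector below is hypothetical, chosen to have the same extremes as the age data.

```r
x <- c(18, 25, 40, 63, 86)   # hypothetical values with the same extremes as the age data

r <- range(x)    # vector of (minimum, maximum)
diff(r)          # the range as a single number: maximum minus minimum
```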


Population Variance

The variance of a population, denoted by $\sigma^2$, is obtained from the deviation of each observation from the mean:

\[ \sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2. \]

An equivalent formula, that is less tedious for larger populations, is

\[ \sigma^2 = \left( \frac{1}{N} \sum_{j=1}^{N} x_j^2 \right) - \mu^2. \]
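The equivalence of the two formulas can be checked numerically. The population below is hypothetical, and the names var_def and var_shortcut are illustrative.

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical population
N  <- length(x)
mu <- mean(x)

var_def      <- sum((x - mu)^2) / N    # definition: mean squared deviation
var_shortcut <- sum(x^2) / N - mu^2    # equivalent formula: mean of squares minus squared mean

print(c(var_def, var_shortcut))
```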


Sample Variance

The formula for the variance of a sample, denoted by $s^2$, is slightly different:

\[ s^2 = \frac{1}{N - 1} \sum_{j=1}^{N} (x_j - \bar{x})^2. \]

The division by $N - 1$ instead of $N$ is intended to compensate for the tendency of the sample variance, when dividing by $N$, to underestimate the population variance.

The var function in R computes the sample variance of a vector of observations that is given as an argument.
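A small check that var uses the divisor $N - 1$, with hypothetical data:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample
N <- length(x)

s2 <- sum((x - mean(x))^2) / (N - 1)   # sample variance from the formula
print(s2)
print(var(x))                          # var also divides by N - 1

# Rescaling by (N - 1) / N gives the value obtained when dividing by N, which is smaller.
print(var(x) * (N - 1) / N)
```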


Standard Deviation

For both a population and a sample, the standard deviation is the square root of the variance. That is, the standard deviation of a population is

\[ \sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2}, \]

whereas for a sample, we have

\[ s = \sqrt{\frac{1}{N - 1} \sum_{j=1}^{N} (x_j - \bar{x})^2}. \]

An advantage of the standard deviation over the variance, as a measure of dispersion, is that the standard deviation is measured using the same units as the original data.


Standard Deviation in R

The sd function in R computes the sample standard deviation of a given vector of observations. For example, from the age data, we obtain

> var(ages)

[1] 325.0379

> sd(ages)

[1] 18.02881
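Since sd always uses the sample formula, a population standard deviation has to be obtained by rescaling. The sketch below uses hypothetical data; the name sigma is illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical observations
N <- length(x)

s     <- sd(x)                        # sample standard deviation (divisor N - 1)
sigma <- sqrt(var(x) * (N - 1) / N)   # population standard deviation (divisor N)

print(c(s = s, sigma = sigma))
```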


Standard Deviation of Grouped Data

For grouped data in a relative frequency distribution, with $n$ classes, class values $c_j$ (for example, the midpoint of the values in the class), and relative frequencies $f_j$, $j = 1, 2, \ldots, n$, the population standard deviation can be computed as follows:

\[ \sigma = \sqrt{\left( \sum_{j=1}^{n} c_j^2 f_j \right) - \mu^2}. \]
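As a sketch, this formula can be applied to the age classes used earlier. The class counts are an assumption, tallied here from the stem-and-leaf display of the age data, and $\mu$ is taken to be the grouped mean.

```r
class_midpoints <- c(22.5, 32.5, 42.5, 52.5, 62.5, 72.5, 82.5)
freq <- c(11, 14, 12, 18, 24, 14, 7)   # counts tallied from the stem-and-leaf display (assumed)

f  <- freq / sum(freq)                 # relative frequencies
mu <- sum(class_midpoints * f)         # grouped mean
sigma <- sqrt(sum(class_midpoints^2 * f) - mu^2)
print(sigma)
```

The result is about 17.6, reasonably close to the value of 18.03 that sd reports for the ungrouped age data.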


Empirical Rule

The empirical rule states that if the distribution of a set of observations is “bell-shaped”, meaning that the distribution is symmetric around the mean and decreases toward zero away from the mean, then approximately 68%, 95%, and 99.7% of the observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
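These percentages can be reproduced under the assumption that "bell-shaped" is idealized as a normal distribution, which this lecture has not yet introduced. The function pnorm gives the proportion of a standard normal distribution that lies below a given value.

```r
k <- 1:3

# Proportion of a normal distribution lying within k standard deviations of its mean.
within_k <- pnorm(k) - pnorm(-k)
print(round(100 * within_k, 1))
```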


Chebyshev’s Theorem

Another rule of thumb, that applies even to distributions that are not bell-shaped or symmetric, is Chebyshev’s Theorem, which states that if $k > 1$, then at least

\[ \left( 1 - \frac{1}{k^2} \right) 100\% \]

of the observations fall within $k$ standard deviations of the mean.
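For example, the bound is 75% for $k = 2$ and about 88.9% for $k = 3$. The sketch below evaluates the bound and compares it with a deliberately skewed, hypothetical data set; the name chebyshev_bound is illustrative.

```r
# Chebyshev lower bound, in percent (illustrative helper).
chebyshev_bound <- function(k) (1 - 1 / k^2) * 100

chebyshev_bound(c(2, 3))

# Compare with a deliberately skewed, hypothetical data set.
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 8, 40)
mu <- mean(x)
sigma <- sqrt(mean((x - mu)^2))          # population standard deviation
100 * mean(abs(x - mu) <= 2 * sigma)     # observed percentage within 2 standard deviations
```

The observed percentage is at least as large as the bound, as the theorem guarantees.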


Quartiles

Another measure of dispersion is the use of quartiles, which are obtained by dividing a data set into four segments that, as much as possible, contain an equal number of observations.

Just as the median is the “middle” value of the data set, the first quartile, denoted by $Q_1$, is the median of the “lower half” of the data, and the third quartile, denoted by $Q_3$, is the median of the “upper half” of the data.

There are various ways of determining what constitutes the lower and upper halves; some statisticians include the median in these halves if it is an actual observation, but some do not.
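The effect of such conventions is visible in R itself, where two base functions can give different first quartiles for the same data. In this sketch, with hypothetical data, fivenum returns the median of the lower half, while quantile with its default rule interpolates between sorted values.

```r
x <- 1:8   # hypothetical data with an even number of observations

# Second entry of fivenum: the median of the lower half (1, 2, 3, 4).
fivenum(x)[2]

# R's default quantile rule interpolates between sorted values instead.
quantile(x, 0.25)
```

Here the two results are 2.5 and 2.75. For large data sets the difference is usually small.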


Interquartile Range and Outliers

Once the first and third quartiles are computed, the interquartile range, denoted by IQR, is defined by

\[ \mathrm{IQR} = Q_3 - Q_1. \]

This value is used to measure the spread of the center half of the data, and to identify outliers.

A rule of thumb is to classify any values less than $Q_1 - 1.5\,\mathrm{IQR}$, or greater than $Q_3 + 1.5\,\mathrm{IQR}$, as outliers.
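The rule of thumb can be applied in R as follows. The data and the names lower and upper are hypothetical.

```r
x <- c(12, 15, 17, 18, 20, 21, 23, 24, 26, 70)   # hypothetical data

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr    # values below this are classified as outliers
upper <- q3 + 1.5 * iqr    # values above this are classified as outliers
x[x < lower | x > upper]
```

With the values reported for the age data ($Q_1 = 37.75$, $Q_3 = 66$, IQR $= 28.25$), the same rule gives limits of $-4.625$ and $108.375$, so none of the ages, which lie between 18 and 86, is classified as an outlier.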


Quartiles in R

The following R statements illustrate the computation of $Q_1$, $Q_3$ and the IQR, in order:

> quantile(ages,0.25)

25%

37.75

> quantile(ages,0.75)

75%

66

> IQR(ages)

[1] 28.25


Five-point Summary

The five-point summary of a data set consists of the minimum value, $Q_1$, the median (also denoted by $Q_2$), $Q_3$, and the maximum value.

It can be obtained using the summary function in R, which also reports the mean. For example, from the age data, we obtain

> summary(ages)

Min. 1st Qu. Median Mean 3rd Qu. Max.

18.00 37.75 52.50 52.55 66.00 86.00


Box-and-Whisker Plot

These measures can be used to construct a box-and-whisker plot, which displays the interquartile range and outliers.

A box is drawn with opposing boundaries placed at $Q_1$ and $Q_3$, with a parallel line drawn within the box at the median.

Then, perpendicular lines, which are the “whiskers”, are drawn from $Q_1$ to the minimum value, and from $Q_3$ to the maximum value.

The length of the box is equal to the IQR, and if the length of either of the whiskers is more than 1.5 times the length of the box, then the value at the end of the whisker is an outlier.


Box-and-Whisker Plots in R

A box-and-whisker plot can be produced in R using the boxplot command.

For example, the plot shown on the next slide is obtained from the age data used in earlier examples using the command

boxplot(ages)
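The numbers behind such a plot can be inspected with boxplot.stats. Note that, by default, R ends each whisker at the most extreme observation lying within 1.5 box lengths of the box and draws any remaining points individually, which differs slightly from drawing the whiskers all the way to the minimum and maximum. The data below are hypothetical.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 50)   # hypothetical data with one extreme value

b <- boxplot.stats(x)   # uses the default coefficient of 1.5
b$stats   # lower whisker end, lower box edge, median, upper box edge, upper whisker end
b$out     # observations beyond the whiskers, drawn as individual points
```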


Box-and-Whisker Plot Example

Box-and-whisker plot produced from age data
