Displaying data and distributions with graphs
Source: suhasini/teaching301/stat301graphical.pdf

Objectives
Displaying data and distributions with graphs
- Variables
- Types of variables (CIS p40-41)
- Distribution of a variable
- Bar graphs for categorical variables (CIS p42)
- Histograms for quantitative variables (CIS p43-46)
- Interpreting histograms (CIS p56-57)
Further reading: https://www.openintro.org/stat/textbook.php?stat_book=os (from now on abbreviated to OS3), Section 1.6.3

Topics: Populations and Samples
- Learning objectives
  - Understand what a population, a sample and random variables are.
  - Know if a random variable is binary, categorical or numerical (continuous or discrete).
  - Understand what the distribution of a variable is and that a histogram is a good way of depicting the distribution.
  - Know how to plot a relative frequency histogram in Statcrunch.
  - Know what the main features of a histogram of a variable look like.

Samples, Populations and Variables

Definitions
- A population is a group of individuals of interest. They are not necessarily humans. The population usually comprises millions or billions of individuals, or is infinite. The population is generally never observed in its entirety.
  - Example: Individuals can be individual people, companies, animals or any object of interest.
- A sample is a subset of the population.
  - The ratio of the sample size to the population size has essentially no influence on inference. Typically this ratio will be close to zero.
Further reading: https://www.openintro.org/stat/textbook.php?stat_book=os Section 1.6 (numerical data)

Different types of variable

Numerical variables
- Usually numerical values, where arithmetic operations, such as adding and averaging, make sense (adding bus numbers together makes no sense!).
  - Example: Your height, your age, your blood iron level, the number of credit cards you own.
- If the numerical variable can only take integers (such as 1, 2, 3) it is called numerical discrete.
  - Averages of numerical discrete variables behave almost like continuous variables (the average of integers need not be an integer).
- If it can take “any” number within a range it is called numerical continuous (such as weights, heights etc).

Categorical variables
- Example: Your blood type (A, B, AB, O), your favorite drink, the bus you take to university.
- If there are only two categories, we call it a binary variable.
- One can always give a categorical variable a numerical labelling. For example, one can label your favorite color as Green = 1, Blue = 2, Orange = 3. However, the labelling is arbitrary and could just as well be Green = 2, Blue = 3, Orange = 1.
- The “average” favorite color in a class has no meaning.
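
The point about arbitrary labels can be checked numerically. The sketch below uses a made-up class of six students (hypothetical data, not from the slides) together with the two labellings mentioned above. The "average colour" changes when the labelling changes, which is why it carries no meaning.

```python
from statistics import mean

# Hypothetical favourite colours for a class of six students (illustrative only).
favourite_colours = ["Green", "Blue", "Blue", "Orange", "Green", "Blue"]

# The two arbitrary labellings from the slide.
labelling_1 = {"Green": 1, "Blue": 2, "Orange": 3}
labelling_2 = {"Green": 2, "Blue": 3, "Orange": 1}


def average_label(values, labelling):
    """Average of the numeric codes assigned to each category."""
    return mean(labelling[v] for v in values)


avg_1 = average_label(favourite_colours, labelling_1)
avg_2 = average_label(favourite_colours, labelling_2)

# Same students, same answers, but a different "average" under each labelling.
print(avg_1, avg_2)
```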

Question Time

Population and Sample: Example 1
- 261 people (most are verified customers) submitted a review for this fitness tracker. This is a sample from all customers who have bought such a fitness tracker (but it is biased, since response is voluntary).
- The variable of interest is the score from 1-5 that an individual gives the tracker.
- The variable is numerical discrete.

Statcrunch: Inputting data
- Transferring data into Statcrunch
- It is interesting to analyze Amazon scores in Statcrunch.
- Find a product of interest whose scores you want to input into Statcrunch.
- I will use the tracker on the previous slide.
- Transform the percentages into numbers:
  - 5 star: 261*0.72 = 187.92, rounded to 188
  - 4 star: 261*0.08 = 20.88, rounded to 21, etc.
- These numbers have to be put into the Statcrunch spreadsheet. The easiest, but most boring, method is simply to type a column of 188 5's, then below that 21 4's and so forth (there are other methods too).
- Once you have one long column, where the number of rows is equal to the number of reviewers, save the data by Data -> Save.
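
As one of the "other methods", the long column could be generated rather than typed. The following is a minimal Python sketch, not a Statcrunch feature; the function name `ratings_column` is hypothetical, and only the two shares quoted above (72% and 8%) are used because the remaining ones are not listed on the slide.

```python
def ratings_column(n_reviews, shares):
    """Expand star-rating shares into one long column of individual scores.

    `shares` maps a star rating to the fraction of reviewers who gave it.
    Each count is rounded to the nearest whole reviewer.
    """
    column = []
    for stars in sorted(shares, reverse=True):
        count = round(n_reviews * shares[stars])
        column.extend([stars] * count)
    return column


# Only the two shares stated on the slide; the other ratings are not given there.
shares = {5: 0.72, 4: 0.08}
column = ratings_column(261, shares)

# Number of 5s and 4s in the generated column.
print(column.count(5), column.count(4))
```

Because each count is rounded separately, the rounded counts for all five ratings may not add up to exactly the number of reviewers, so it is worth checking the total before saving.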

Population and sample: Example 2
- In a medical check-up, a blood sample is often taken to check whether, for example, your iron level is healthy.
- The variable of interest is the concentration of iron in a blood sample. This is numerical continuous.
- The concentration of iron will fluctuate from sample to sample. Several factors can determine the concentration of iron in a given sample:
  - Food eaten that day
  - Bowel movements.
- The population in this case is all the blood samples that could be taken over a period of time for that individual.
- Given a few blood samples, the doctor will be able to estimate the mean iron level in the body.

Population and sample: Example 3
- The population of interest is calves (it is impossible to observe all of them). However, we do have a small sample: 44 calves monitored biweekly from birth to 7 weeks.
- This data set has several interesting variables:
  - Weights are numerical continuous (though since they are rounded to the nearest pound it “feels” numerical discrete).
  - The treatment the calf is given is categorical.

What the data looks like in Statcrunch:
- The above is a snapshot from the calf data.
- Tag # corresponds to an individual.
- Igg and TSP correspond to treatments that each calf was given. These are categorical variables.
- Wt 1 corresponds to the weight at week 1. This is a numerical variable (continuous or discrete?).
- The data for each individual/case is called an observation.

Population and sample: Example 4
Old Faithful is a famous geyser in Yellowstone National Park. It has been erupting water for thousands of years.
- Over 272 consecutive eruptions were recorded. Both the duration of an eruption and the waiting time between eruptions were recorded.
- These are numerical continuous variables.
- The population of interest is all the eruptions over a period of time.

Example 5: Poultry science
- Does the type of light a hen is exposed to influence the strength of its eggs?
- The population would be all chickens in the world.
- Below is a sample from two light treatments and their shell strength (this is an experimental design).

Distribution of a variable
- We previously defined a variable as a quantity in a population that we are interested in.
  - Example: Heights of people, major of a student, favorite colour.
- A variable by definition is 'random' in the sense that it can vary from individual to individual.
- The distribution of a random variable gives the probability or frequency of all outcomes that the variable can take.
  - If the variable is the subject a student majors in at A&M, the distribution of the variable is the proportion of students majoring in each subject.
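
To make the idea of a distribution concrete, the sketch below computes the proportion of students in each major for a small made-up sample. The list of majors is hypothetical and only illustrates the calculation; it says nothing about actual enrolment at A&M.

```python
from collections import Counter

# Hypothetical majors for ten students (illustrative only).
majors = ["Biology", "Statistics", "Biology", "Engineering", "Biology",
          "Engineering", "Statistics", "Biology", "Engineering", "Biology"]

counts = Counter(majors)
n = len(majors)

# The sample distribution: proportion of students in each category.
distribution = {major: count / n for major, count in counts.items()}

for major, proportion in sorted(distribution.items()):
    print(f"{major:12s} {proportion:.2f}")
```

The proportions always add up to 1, and plotting them as bars gives the bar chart of the variable.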

Distributions
- The distribution of a variable is rarely known. However, we can get some idea of what it 'looks' like by analysing the distribution of a sample.
- Of course, a whole bunch of numbers is very hard to comprehend.
- The best course of action is to plot the distribution.
- Various plotting tools can be used:
  - Bar chart for categorical data
  - Histogram for numerical data
- The distribution of each variable will have certain characteristics, which we should look out for when we make a plot.

Bar charts for categorical data
When the variable is categorical, the categories in the graph can be ordered any way we want (alphabetical, by increasing value, by personal preference, etc.).
- Bar graphs. Each category is represented by a bar with length equal to a count or percent (starting from the x-axis).

Histogram for numerical data
- Definition: The range is the smallest interval containing all the observations. It is an interval which starts with the smallest number and ends with the largest in the data set.
- A histogram breaks the range of values of a variable into “bins” (intervals) and displays either the count or percent of the observations that fall into each bin.
- Depending on whether counts or percentages are used, it is either a histogram or a relative frequency histogram.
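
The binning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how Statcrunch works internally: the function `histogram` is hypothetical, bins are taken as half-open intervals [left, left + width), and the weights are made-up numbers, not the calf data.

```python
def histogram(data, bin_width, start=None):
    """Count the observations in consecutive bins [left, left + bin_width)."""
    if start is None:
        start = min(data)
    n_bins = int((max(data) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // bin_width)] += 1
    edges = [start + i * bin_width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical weights in pounds (illustrative only).
weights = [82, 85, 88, 90, 91, 93, 95, 97, 99, 101, 104, 110]

edges, counts = histogram(weights, bin_width=10, start=80)

# Dividing each count by the sample size gives the relative frequencies.
relative = [c / len(weights) for c in counts]

for left, c, r in zip(edges, counts, relative):
    print(f"[{left}, {left + 10}): count={c}, relative frequency={r:.2f}")
```

The counts give the histogram and the relative frequencies give the relative frequency histogram; the two plots have the same shape and differ only in the scale of the vertical axis.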

Histogram and Relative Frequency Histogram
- We plot both for the weights of 44 newborn calves.
What proportion of the newborn calf weights are between 85 and 100 pounds?

Relative frequency histogram of Amazon data
- Amazon gives the relative frequency histogram of the ratings for all its products.
If you rotate the plot you get the Amazon plot.

In Statcrunch: loading data
- Download the data set of interest onto your computer (know the folder where the data lies; usually it will be in the Download folder).
- Log into Statcrunch. To upload the data into Statcrunch:
  - Data -> Load -> From file -> on my computer
    - You will see a new window.
    - Choose File selects the data set.
    - The delimiter determines how columns in the file are separated. Usually whitespace or comma is fine.
    - Click Load File (at bottom of page).

- In plots:

In Statcrunch: Making a plot
To make a plot:
- Graph -> Histogram
- You will get a menu like the one on the right.
- Select the column(s) you want plotted.
- Select Type (I usually use relative frequency).
- Using Bins Width you can select the width of each bin. There is usually a default chosen by the software.
- We discuss later how bin widths can influence the information the plot conveys.
- If you check Marker, then it will include the location of the average in the plot.

Strength of eggs
- Below are the relative frequency histograms for the Haugh units of eggs under white and red light. What do you observe?

Most common distribution shapes
Symmetric distribution
- A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution
- A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
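
One rough way to put "extends much farther out" into numbers, which is not taken from the slides, is to compare how far the largest and the smallest observations lie from the median. The sketch below does this for two made-up samples; the function names and the threshold `ratio=2.0` are arbitrary assumptions, and the check is no substitute for looking at the histogram.

```python
from statistics import median


def tail_lengths(data):
    """Return (left, right): distances from the median to the min and to the max."""
    m = median(data)
    return m - min(data), max(data) - m


def describe_shape(data, ratio=2.0):
    """Very rough label based on tail lengths; `ratio` is an arbitrary threshold."""
    left, right = tail_lengths(data)
    if right > ratio * left:
        return "skewed to the right"
    if left > ratio * right:
        return "skewed to the left"
    return "roughly symmetric"


# Hypothetical samples (illustrative only).
house_prices = [150, 160, 170, 180, 190, 200, 220, 250, 400, 900]  # in $1000s
heights = [160, 163, 165, 167, 168, 170, 171, 173, 175, 178]       # in cm

print(describe_shape(house_prices))
print(describe_shape(heights))
```

Since it relies only on the two extreme observations, this check is very sensitive to a single unusual value, which is one more reason to plot the data.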

Complex, multimodal distribution
- Not all histograms have a simple overall shape.
- Multiple modes suggest a mixture of subpopulations.
It can be difficult to identify shapes if the sample size is small.

Examples of Shapes
- Distributions which tend to be nearly symmetric are:
  - Heights of a particular gender.
  - Lengths of bird bills.
  - Other biological lengths.

- Distributions which tend to be skewed:
  - RIGHT SKEWED: The price of houses. Many are moderately priced, but the expensive ones can be extremely expensive.
  - LEFT SKEWED: The gestation period of a baby. Common knowledge tells us that the due date is at 40 weeks; births several weeks early do happen, while births much later than the due date are rare.

- Distributions with multiple modes:
  - The number of M&Ms in a bag (several modes because of the different types).
  - Anything where several subpopulations, each with their own characteristics, are put together.

- Distributions which tend to be flat (no modes):
  - Numbers in a lottery
  - The outcome of a die.
  - In a flat distribution all outcomes are equally likely. One outcome cannot be favoured over another (for example, in a die, a 4 occurring more often than a 6). Such distributions are very rare.
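
A flat distribution can be illustrated by simulating a fair die. The sketch below is a simulation under assumed settings (a fixed seed and 60,000 rolls, both chosen arbitrarily); with that many rolls each relative frequency should come out close to 1/6.

```python
import random
from collections import Counter

random.seed(301)  # fixed seed so the run is reproducible

n_rolls = 60_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

counts = Counter(rolls)

# Relative frequency of each face; for a fair die all should be near 1/6.
relative = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, freq in relative.items():
    print(f"face {face}: {freq:.3f}")
```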

Question Time
- Which characteristic below best describes the distribution (histogram) for the length of adult (all adults) feet?
A. Bimodal.
B. Symmetric and bell shaped.

Question Time
1. Temperature in College Station in July (in Fahrenheit).
2. The time a student takes to complete a midterm (in minutes).
3. The result of an 8 sided die.
4. The number of cars a student owns in College Station.
A B C

- As you can see from the previous question, different variables will have different distributions.
- You cannot change the distribution. It is what it is.
- There is a common misconception that as the sample size increases (for example the number of people surveyed), the distribution will become increasingly symmetric and bell shaped.
- This is not true. Variables can have all sorts of shapes.
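
The misconception can be tested with a small simulation. The sketch below draws samples of increasing size from a right-skewed variable (an exponential distribution with mean 1, chosen here purely as an example) and prints the relative frequencies of the first six bins. The seed, sample sizes and bin width are arbitrary assumptions.

```python
import random

random.seed(301)  # fixed seed so the run is reproducible


def relative_frequencies(data, bin_width, n_bins):
    """Relative frequency of the first n_bins bins starting at 0.

    Observations beyond the last bin are not counted in any bin.
    """
    counts = [0] * n_bins
    for x in data:
        i = int(x // bin_width)
        if i < n_bins:
            counts[i] += 1
    return [c / len(data) for c in counts]


results = {}
for n in (100, 10_000, 100_000):
    # Hypothetical right-skewed variable: exponential with mean 1.
    sample = [random.expovariate(1.0) for _ in range(n)]
    results[n] = relative_frequencies(sample, bin_width=0.5, n_bins=6)
    print(n, [f"{r:.2f}" for r in results[n]])
```

With more observations the histogram becomes smoother, but it stays right skewed: the first bin remains the tallest and the bars keep falling away to the right. A larger sample gives a clearer picture of the distribution; it does not change the distribution.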

Question Time
- Waiting time between eruptions of Old Faithful
A. Eruptions tend to occur every 50-65 minutes or every 70-90 minutes.
B. The time between eruptions can be anywhere from 40 to 100 minutes and all times are equally likely.
C. Eruption times are unimodal with the average time about 80 minutes.

Influence of bin width on perception
- We plot a histogram of the M&M data set using 4 different bin widths.
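
The effect of bin width can also be seen without a plot. The sketch below bins the same made-up data set (two clusters of values, not the M&M data) with three different bin widths; the function `bin_counts` and the chosen widths are assumptions for illustration.

```python
def bin_counts(data, bin_width, start):
    """Counts of observations in consecutive bins of the given width, starting at `start`."""
    n_bins = int((max(data) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // bin_width)] += 1
    return counts


# Hypothetical data with two clusters (illustrative only).
data = [10, 11, 11, 12, 12, 12, 13, 13, 14,
        20, 21, 21, 22, 22, 22, 23, 23, 24]

for width in (1, 5, 20):
    print(f"bin width {width:2d}: {bin_counts(data, width, start=10)}")
```

With narrow bins both clusters and the gap between them are visible, with a moderate width only the two clusters remain, and with a very wide bin everything collapses into a single bar and the two modes disappear.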

Question Time
Here are three different plots of the same Old Faithful Geyser data set.
Which plot communicates the most information about the data?
(A) Top
(B) Middle
(C) Bottom

Population and sample histogram
- The histogram of the sample will look different from the histogram of the population (which is strictly called a density). Compare the two histograms below.
Summary