Reading Statistics and Research - Dalhousie University (web.cs.dal.ca/~anwar/ds/lec2.pdf)
Descriptive Statistics
Scales of Measurement
– Nominal
• No numerical or quantitative properties. A way to classify groups or categories.
• Gender: Male and Female
• Major: RC or PH
– Ordinal
• Used to rank and order the levels of the variable being studied. No particular value is placed between the numbers in the rating scale.
• Movie Ratings: 4 Stars, 3 Stars, 2 Stars, and 1 Star
Descriptive Statistics Scales of Measurement Cont.
– Interval
• Difference between the numbers on the scale is meaningful and intervals are equal in size. NO absolute zero.
• Allows for comparisons between things being measured
• Temperatures on a thermometer: The difference between 60 and 70 is the same as the difference between 90 and 100. You cannot say that 70 degrees is twice as hot as 35 degrees; it is only 35 degrees warmer.
– Ratio
• Scales that do have an absolute zero point that indicates the absence of the variable being studied. Can form ratios.
• Weight: 100 pounds is ½ of 200.
• Time
http://en.wikipedia.org/wiki/68-95-99.7_rule
Descriptive Statistics
– Normal Curve
– Bimodal Curve
Descriptive Statistics
– Positively Skewed
– Negatively Skewed
You MUST see this: http://www.exceluser.com/explore/statsnormal.htm
Central Limit Theorem
• The CLT states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.
http://www.statisticalengineering.com/central_limit_theorem.htm
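A small simulation can make the theorem concrete. The sketch below is only an illustration under assumed settings (uniform draws, 30 values per sample, 5000 samples, a fixed seed); none of these numbers come from the slides. It checks that the sample means cluster around 0.5 and that roughly 68% of them fall within one standard deviation, as a normal curve would suggest.

```python
import random
import statistics

def sample_means(n_per_sample, n_samples, seed=0):
    """Means of n_samples samples, each made of n_per_sample uniform draws on [0, 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.random() for _ in range(n_per_sample)])
        for _ in range(n_samples)
    ]

# Assumed settings for illustration only.
means = sample_means(n_per_sample=30, n_samples=5000)
center = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of sample means within one standard deviation of their center;
# for a normal distribution this share is about 68%.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(round(center, 3), round(spread, 3), round(within_one_sd, 3))
```

The individual draws are uniform, not bell-shaped, yet the distribution of their means is close to normal.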
Descriptive Statistics
• Frequency Distributions
– In tables, the frequency distribution is constructed by summarizing data in terms of the number or frequency of observations in each category, score, or score interval
– In graphs, the data can be concisely summarized into bar graphs, histograms, or frequency polygons
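As a rough sketch of how a frequency table and a crude text histogram might be built, the snippet below counts illustrative scores (the same list that the mode example in these slides uses). The variable names are assumptions made for this example.

```python
from collections import Counter

# Illustrative scores: the first list from the mode example in these slides.
scores = [3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6]
frequency = Counter(scores)

# Frequency table: one row per distinct score, with a bar of '#' as a simple histogram.
print("Score Frequency")
for score in sorted(frequency):
    count = frequency[score]
    print(f"{score:>5} {count:>9} {'#' * count}")
```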
Descriptive Statistics
• Measures of Central Tendency
– Mode
• The most frequently occurring score
• 3 3 3 4 4 4 5 5 5 6 6 6 6: Mode is 6
• 3 3 3 4 4 4 5 5 6 6 7 7 8: Modes are 3 and 4
– Median
• The score that divides a group of scores in half, with 50% falling above and 50% falling below the median.
• 3 3 3 5 8 8 8: The median is 5
• 3 3 5 6: The median is 4 (average of the two middle numbers)
– Mean
• Preferred whenever possible and is the only measure of central tendency that is used in advanced statistical calculations:
– More reliable and accurate
– Better suited to arithmetic calculations
• Basically, an average of all scores. Add up all scores and divide by the total number of scores.
• 2 3 4 6 10: Mean is 5 (25/5)
Descriptive Statistics
• Measures of Central Tendency
– Your Turn!
– Mode • Example: 2 3 4 4 4 6 8 9 10 11 11
– Median • Example: 2 3 4 4 4 6 8 9 10 11 11
– Mean • Example: 2 3 4 4 4 6 8 9 10 11 11
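One way to check your answers is with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration, and `multimode` is used so that tied modes would all be reported.

```python
import statistics

exercise = [2, 3, 4, 4, 4, 6, 8, 9, 10, 11, 11]

mode_values = statistics.multimode(exercise)  # a list, so ties are all reported
median_value = statistics.median(exercise)    # middle score of the sorted data
mean_value = statistics.mean(exercise)        # sum of scores / number of scores

print(mode_values, median_value, round(mean_value, 2))
```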
Descriptive Statistics
• Measures of Variability (Dispersion)
– Range
• Calculated by subtracting the lowest score from the highest score.
• Used only for Ordinal, Interval, and Ratio scales, as the data must be ordered
– Example: 2 3 4 6 8 11 24 (Range is 22)
– Variance
• The extent to which individual scores in a distribution of scores differ from one another
– Standard Deviation
• The square root of the variance
• Most widely used measure to describe the dispersion among a set of observations in a distribution.
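The three measures can be sketched on the range example's scores. The slides do not say whether the sample or population formula is meant; the sample version (dividing by n - 1) is assumed here.

```python
import math
import statistics

scores = [2, 3, 4, 6, 8, 11, 24]  # scores from the range example

score_range = max(scores) - min(scores)        # highest minus lowest
sample_variance = statistics.variance(scores)  # assumption: sample formula (n - 1)
sample_sd = statistics.stdev(scores)           # square root of the sample variance

print(score_range, round(sample_variance, 2), round(sample_sd, 2))
```

`statistics.pvariance` and `statistics.pstdev` give the population versions (dividing by n) if that is what a course expects.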
Descriptive Statistics
• Standard Scores: Z-Scores and T-Scores
– Z-Scores
• Most widely used standard score in statistics
– It is the number of standard deviations above or below the mean.
• A Z score of 1.5 means that the score is 1.5 standard deviations above the mean; a Z score of -1.5 means that the score is 1.5 standard deviations below the mean
• Always have the same meaning in all distributions
• To find a percentile rank, first convert to a Z score and then find percentile rank off a normal-curve table
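The conversion can be sketched as follows, with the standard normal CDF from Python's `statistics.NormalDist` standing in for the printed normal-curve table. The raw score, mean, and standard deviation below are hypothetical, and the percentile rank is only meaningful if the scores are roughly normally distributed.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Number of standard deviations that raw lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

def percentile_rank(z):
    """Percent of a normal distribution falling below z (replaces the table lookup)."""
    return 100 * NormalDist().cdf(z)

# Hypothetical values, not taken from the slides.
z = z_score(raw=115, mean=100, sd=10)
print(z, round(percentile_rank(z), 2))
```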
Descriptive Statistics
• Standard Scores: Z-Scores and T-Scores
– T-Scores
• Most commonly used standard score for reporting performance
• May be converted from Z-scores and are always rounded to two figures, thereby eliminating decimals
• Always reported in positive numbers
• The mean is always 50 and the standard deviation is always 10.
– A T-score of 70 is 2 SDs above the mean
– A T-score of 20 is 3 SDs below the mean
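Because the mean is fixed at 50 and the standard deviation at 10, the conversion from a Z-score is usually written T = 50 + 10z. The function name below is an assumption made for this sketch.

```python
def t_score_from_z(z):
    """Convert a Z-score to a T-score (mean 50, SD 10), rounded to a whole number."""
    return round(50 + 10 * z)

# The two examples from the slide: 2 SDs above and 3 SDs below the mean.
print(t_score_from_z(2), t_score_from_z(-3))
```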
t Score
• The t statistic (aka, the t score) is defined by:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}}$$
• Where:
– $\bar{x}$ is the sample mean,
– μ is the population mean,
– s is the standard deviation of the sample,
– and N is the sample size.
• The distribution of this t statistic is called the t distribution or the Student t distribution. It can be used whenever samples are drawn from populations possessing a bell-shaped distribution (i.e., approximately normal).
Source: http://stattrek.com/help/glossary.aspx?target=t_score
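A minimal sketch of the t statistic defined above, computed as (sample mean - population mean) / (s / sqrt(N)). The sample reuses the scores from the mean example; the population mean of 4 is a hypothetical value chosen only to give a nonzero result.

```python
import math
import statistics

def t_statistic(sample, population_mean):
    """t = (sample mean - population mean) / (s / sqrt(N))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (sample_mean - population_mean) / (s / math.sqrt(n))

sample = [2, 3, 4, 6, 10]                   # scores from the mean example (mean is 5)
t = t_statistic(sample, population_mean=4)  # population mean of 4 is hypothetical
print(round(t, 4))
```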
Chapter 3: Bivariate Correlation
http://www.readingstats.com/fifth/funny3.htm
Descriptive Statistics
• Correlation or Covariation
– A correlation coefficient is a statistical summary of the degree or magnitude and direction of the relationship or association between two variables
– It is possible to have a negative or positive correlation
• Linear Regression
– The purpose of a regression equation is to make predictions on a new sample of observations from the findings on a previous sample
Bivariate Correlation
• Mainly, the concept of ‘relationship’ between two variables.
• One variable explains the change in the other variable (it does not necessarily cause the change).
• The concept of "relationship" cannot be investigated if we measure each person in a group one time only on a single variable.
Example
• Math and Science Scores:
– What kind of a relationship? Brainstorm.
Person  Math  Science
Will    7     5
Jill    9     3
Bill    8     4
How Correlations Can Be Revealed
1- Scatter Diagrams
• Has two axes labeled with the two variables in focus.
• Dots are drawn to represent scores (at the intersection of the two variables).
• A scatter diagram (aka scatter plot) reveals the relationship between the two variables through the pattern formed by the set of dots.
• The types of relationships that can be revealed include: high-high, high-low, low-high, and low-low (other namings may be found)
• DisAdv: Interpreting a scatterplot can be subjective. Other approaches are needed.
Scatter Plot Examples
http://www.icoachmath.com
http://www.qimacros.com
2- The Correlation Coefficient
• A decimal number that lies between -1 and 1.
• Abbreviated as r.
• Correlations according to (r) can carry one of many modifying adjectives.
• It is either positive (direct relationship; 0 < r ≤ 1), negative (indirect/inverse relationship; -1 ≤ r < 0), or zero (no correlation).
• The resulting values of r are also described as high, strong, low, or weak.
3- The Correlation Matrix
• For more than two variables.
• The correlation of a variable with itself is not reported (it is trivially 1), so the diagonal is left blank.
Simple Example of a Correlation Matrix (imaginary results)
                Age    Depression   Smoking   Family Values
Age             ---    0.79         -0.33     0.02
Depression             ---          0.84      0.21
Smoking                             ---       0.65
Family Values                                 ---
• Two correlation matrices can be combined for the same subjects and the same tests if the test was run twice, say in two different formats.
• Always remember, correlation does not mean causation!
Which correlation to use?
• Pearson Product Moment
• Kendall tau
• Biserial Correlation
• Spearman rho
• Phi correlation
4- Pearson’s Product Moment
• Most frequently used when:
– The two variables are quantitative
– Each variable is measured to produce raw scores.
• Usually represented by (r).
• See this site for online calculator: http://www.wessa.net/corr.wasp
• Good reference: http://davidmlane.com/hyperstat/A34739.html
• Formula for Pearson’s:
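The formula image did not survive in this transcript. The sketch below assumes the standard textbook definition, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²), which may be laid out differently on the original slide. It is applied to the Math and Science scores from the earlier example.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Math and Science scores for Will, Jill, and Bill.
math_scores = [7, 9, 8]
science_scores = [5, 3, 4]
r = pearson_r(math_scores, science_scores)
print(r)
```

With these three people, every point of Math gained goes with a point of Science lost, so the coefficient lands at the negative end of the scale.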
5- Spearman’s Rho
• The two variables are quantitative
• The two variables are measured to produce ranks.
• The correlation is computed between the ranks.
• Spearman’s is also called rank-order correlation.
• Referred to as (r_s or ρ).
• Please see: http://davidmlane.com/hyperstat/desc_biv.html
• A good example:
• http://www.experiment-resources.com/spearman-rank-correlation-coefficient.html
Example: Queries submitted to complete search task X.
Naïve Users   Expert Users
13            12
14            14
15            13
22            17
11            9
17            15
23            19
16            13
9             8
24            17
p-values:
Variables Naïve Expert
Naïve 0 < 0.0001
Expert < 0.0001 0
Correlation matrix (Spearman):
Variables Naïve Expert
Naïve 1 0.939
Expert 0.939 1
Coefficients of determination (Spearman):
Variables Naïve Expert
Naïve 1 0.882
Expert 0.882 1
Correlation matrix (Pearson):
Variables Naïve Expert
Naïve 1 0.950
Expert 0.950 1
p-values:
Variables Naïve Expert
Naïve 0 < 0.0001
Expert < 0.0001 0
Coefficients of determination (R²):
Variables Naïve Expert
Naïve 1 0.902
Expert 0.902 1
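The coefficients in these tables can be reproduced by hand. The sketch below assumes that Spearman's rho is computed as Pearson's r on the ranks, with tied scores sharing the average of their ranks; the helper names are chosen for this example and are not XLSTAT functions.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

naive = [13, 14, 15, 22, 11, 17, 23, 16, 9, 24]
expert = [12, 14, 13, 17, 9, 15, 19, 13, 8, 17]

r = pearson_r(naive, expert)
rho = spearman_rho(naive, expert)
print(round(r, 3), round(r ** 2, 3))      # Pearson r and its coefficient of determination
print(round(rho, 3), round(rho ** 2, 3))  # Spearman rho and its square
```

Rounded to three decimals, the results agree with the correlation and coefficient-of-determination tables above.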
[Figure: scatter plot of the data, Naïve on the x-axis (5 to 25) and Expert on the y-axis (6 to 20). Slide panel labels: Data, Scatter Plot, Pearson, Spearman.]
Correlation in Excel
Go to Excel and XLSTAT
6- Kendall’s Tau, Point Biserial, and Biserial
• Tau:
– Does a better job (produces lower corr. values) in the case of ties.
– Close to rho.
• Point Biserial
– Correlation between variable 1 (raw data) and variable 2 (a real dichotomy, such as 0’s and 1’s)
• Biserial
– Correlation between variable 1 (raw data) and variable 2 (an artificial dichotomy, such as a player’s height and productivity (<10 points or >=10 points))
7- Phi, Tetrachoric, and Cramér’s V Correlations
• If both variables are dichotomous in nature.
• Phi
– True dichotomies (both variables)
– Example: Gender and Car ownership.
• Tetrachoric
– Artificial dichotomies (both variables)
– Example: Height and Weight when forced into dichotomies.
– Rare in scientific research.
• V
– Two nominal variables
– Its value: 0-1
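For the phi case, a common textbook formula works directly from the four cell counts of a 2x2 table: phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The sketch below assumes that formula; the counts are hypothetical and are not taken from the slides.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table of counts laid out as [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows are the two genders, columns are owns a car / does not.
phi = phi_coefficient(a=30, b=20, c=10, d=40)
print(round(phi, 3))
```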
More Principles to Remember
• In reporting correlations in research reports, you report both the r value and the p value.
Interpretation of the Strength of Correlations
• .00 - .20 – Very Weak
• .21 - .40 – Weak
• .41 - .60 – Moderate
• .61 - .80 – Strong
• .81 - 1.00 – Very Strong
Different statisticians may have similar but slightly different scales.
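The scale above can be applied mechanically, as in this sketch. The function name is an assumption, the sign of r is ignored because strength depends only on magnitude, and values that fall between the listed bands (such as .205) are assigned to the lower band.

```python
def correlation_strength(r):
    """Label the strength of a correlation using the scale from this slide."""
    size = abs(r)  # direction does not affect strength
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size <= 0.20:
        return "Very Weak"
    if size <= 0.40:
        return "Weak"
    if size <= 0.60:
        return "Moderate"
    if size <= 0.80:
        return "Strong"
    return "Very Strong"

# Values taken from the correlation tables earlier in these slides.
for value in (0.939, -0.33, 0.02, 0.65):
    print(value, correlation_strength(value))
```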
Warnings
• Correlation does not mean cause.
• It indicates the ‘shared variance’
• Coefficient of determination (r²) is obtained by squaring the coefficient value, which reduces exaggeration and gets rid of negatives.
• Outliers: can cause exaggeration or overstating. Get rid of them before publishing.
• Here is how to decide on outliers: http://www.5min.com/Video/How-to-Interpret-Box-and-Whisker-Plots-275636648
• Linearity: does not require all points on ‘the’ line.
• Relationship strength: depends on the adjectives researchers use (subjectivity (no standard rules)).