reading statistics and research - dalhousie universityweb.cs.dal.ca/~anwar/ds/lec2.pdfspearman rho...

Descriptive Statistics

Scales of Measurement

– Nominal

• No numerical or quantitative properties. A way to classify groups or categories.

• Gender: Male and Female

• Major: RC or PH

– Ordinal

• Used to rank and order the levels of the variable being studied. No particular value is placed between the numbers in the rating scale.

• Movie Ratings: 4 Stars, 3 Stars, 2 Stars, and 1 Star

Descriptive Statistics Scales of Measurement Cont.

– Interval

• Difference between the numbers on the scale is meaningful and intervals are equal in size. NO absolute zero.

• Allows for comparisons between things being measured

• Temperatures on a thermometer: The difference between 60 and 70 is the same as the difference between 90 and 100. You cannot say that 70 degrees is twice as hot as 35 degrees; it is only 35 degrees warmer.

– Ratio

• Scales that do have an absolute zero point that indicates the absence of the variable being studied. Can form ratios.

• Weight: 100 pounds is ½ of 200.

• Time

http://en.wikipedia.org/wiki/68-95-99.7_rule







– Normal Curve

– Bimodal Curve


– Positively Skewed

– Negatively Skewed

You MUST see this: http://www.exceluser.com/explore/statsnormal.htm




Central Limit Theorem

• The CLT states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.

http://www.statisticalengineering.com/central_limit_theorem.htm
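A small simulation can make the CLT statement concrete. The sketch below is only an illustration under assumed settings: the uniform(0, 1) population, the sample size of 30, and the 2000 repetitions are choices made here, not values from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

SAMPLE_SIZE = 30    # observations averaged per sample (assumed value)
NUM_SAMPLES = 2000  # number of sample means collected (assumed value)

# Each observation comes from a uniform(0, 1) population, which is flat,
# not bell-shaped. The CLT concerns the distribution of the sample means.
sample_means = [
    statistics.fmean(random.random() for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory for uniform(0, 1): population mean 0.5, population SD sqrt(1/12),
# so the sample means should have SD close to sqrt(1/12) / sqrt(SAMPLE_SIZE).
expected_sd = (1 / 12) ** 0.5 / SAMPLE_SIZE ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (theory: 0.500)")
print(f"SD of sample means:   {sd_of_means:.4f} (theory: {expected_sd:.4f})")
```

A histogram of `sample_means` should look roughly bell-shaped even though the underlying population is flat.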



• Frequency Distributions

– In tables, the frequency distribution is constructed by summarizing data in terms of the number or frequency of observations in each category, score, or score interval

– In graphs, the data can be concisely summarized into bar graphs, histograms, or frequency polygons
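As a rough sketch of both forms, the snippet below builds a frequency table and a text-only bar graph. The score list is the first one used in the Mode example in these slides; the printed layout is just one possible presentation.

```python
from collections import Counter

# Scores borrowed from the Mode example in these slides.
scores = [3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6]

# Counter maps each distinct score to the number of times it occurs.
frequency = Counter(scores)

print("Score  Frequency")
for score in sorted(frequency):
    count = frequency[score]
    # The run of '#' characters acts as a crude horizontal bar graph.
    print(f"{score:>5}  {count:>9}  {'#' * count}")
```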


• Measures of Central Tendency

– Mode

• The most frequently occurring score

• 3 3 3 4 4 4 5 5 5 6 6 6 6: Mode is 6

• 3 3 3 4 4 4 5 5 6 6 7 7 8: Modes are 3 and 4

– Median

• The score that divides a group of scores in half, with 50% falling above and 50% falling below the median.

• 3 3 3 5 8 8 8: The median is 5

• 3 3 5 6: The median is 4 (average of the two middle numbers)

– Mean

• Preferred whenever possible and is the only measure of central tendency that is used in advanced statistical calculations:

– More reliable and accurate

– Better suited to arithmetic calculations

• Basically, an average of all scores. Add up all scores and divide by the total number of scores.

• 2 3 4 6 10: Mean is 5 (25/5)
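The worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

# Mode: multimode returns every value tied for the highest frequency.
modes_single = statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6])
modes_tied = statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8])

# Median: the middle score, or the average of the two middle scores.
median_odd = statistics.median([3, 3, 3, 5, 8, 8, 8])
median_even = statistics.median([3, 3, 5, 6])

# Mean: sum of all scores divided by the number of scores.
mean_value = statistics.mean([2, 3, 4, 6, 10])

print("modes:", modes_single, modes_tied)
print("medians:", median_odd, median_even)
print("mean:", mean_value)
```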


• Measures of Central Tendency

– Your Turn!

– Mode • Example: 2 3 4 4 4 6 8 9 10 11 11

– Median • Example: 2 3 4 4 4 6 8 9 10 11 11

– Mean • Example: 2 3 4 4 4 6 8 9 10 11 11


• Measures of Variability (Dispersion)

– Range

• Calculated by subtracting the lowest score from the highest score.

• Used only for Ordinal, Interval, and Ratio scales, as the data must be ordered

– Example: 2 3 4 6 8 11 24 (Range is 22)

– Variance

• The extent to which individual scores in a distribution of scores differ from one another, measured as the average squared deviation of the scores from the mean

– Standard Deviation

• The square root of the variance

• Most widely used measure to describe the dispersion among a set of observations in a distribution.
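The three measures can be sketched on the Range example data. The slides do not say whether the variance divides by N (population) or N - 1 (sample), so both are shown here as an assumption-free comparison.

```python
import statistics

data = [2, 3, 4, 6, 8, 11, 24]  # scores from the Range example

# Range: highest score minus lowest score.
data_range = max(data) - min(data)

# Variance: average squared deviation from the mean.
population_variance = statistics.pvariance(data)  # divides by N
sample_variance = statistics.variance(data)       # divides by N - 1

# Standard deviation: square root of the variance.
population_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

print("range:", data_range)
print(f"variance: population {population_variance:.2f}, sample {sample_variance:.2f}")
print(f"SD:       population {population_sd:.2f}, sample {sample_sd:.2f}")
```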


• Standard Scores: Z-Scores and T-Scores

– Z-Scores

• Most widely used standard score in statistics

– It is the number of standard deviations above or below the mean.

• A Z score of 1.5 means that the score is 1.5 standard deviations above the mean; a Z score of -1.5 means that the score is 1.5 standard deviations below the mean

• Always have the same meaning in all distributions

• To find a percentile rank, first convert to a Z score and then find percentile rank off a normal-curve table
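A minimal sketch of the two steps (convert to Z, then look up the percentile), using the standard normal distribution from Python's `statistics` module in place of a printed table. The raw score, mean, and standard deviation are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical values: a raw score of 65 on a test with mean 50 and SD 10.
raw_score = 65
mean = 50
sd = 10

# Z score: number of standard deviations above (+) or below (-) the mean.
z = (raw_score - mean) / sd

# Percentile rank: area under the standard normal curve to the left of z,
# which is what a normal-curve table reports.
percentile_rank = NormalDist().cdf(z) * 100

print(f"z = {z}, percentile rank = {percentile_rank:.1f}")
```

The lookup assumes the scores are approximately normally distributed.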


• Standard Scores: Z-Scores and T-Scores

– T-Scores

• Most commonly used standard score for reporting performance

• May be converted from Z-scores and are always rounded to two figures, which eliminates decimals

• Always reported in positive numbers

• The mean is always 50 and the standard deviation is always 10.

– A T-score of 70 is 2 SDs above the mean

– A T-score of 20 is 3 SDs below the mean
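Since the T-score scale has mean 50 and standard deviation 10, the conversion from a Z-score can be sketched as T = 50 + 10z, rounded to a whole number. The function name below is hypothetical.

```python
def t_score_from_z(z):
    """Convert a Z-score to a T-score (mean 50, SD 10), rounded to drop decimals."""
    return round(50 + 10 * z)


# The two examples from the slide: 2 SDs above and 3 SDs below the mean.
print(t_score_from_z(2.0))
print(t_score_from_z(-3.0))
```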

t Score

• The t statistic (aka, the t score) is defined by:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{N}}$$

• Where:

– x̄ is the sample mean,

– μ is the population mean,

– s is the standard deviation of the sample,

– and N is the sample size.

• The distribution of this t statistic is called the t distribution or the Student t distribution. It can be used whenever samples are drawn from populations possessing a bell-shaped distribution (i.e., approximately normal).
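The t statistic divides the difference between the sample mean and the population mean by s divided by the square root of N. A minimal sketch of that calculation follows; the sample values and the population mean are hypothetical, picked only to show the arithmetic.

```python
import math
import statistics

# Hypothetical sample and hypothesised population mean (not from the slides).
sample = [52, 48, 55, 50, 45]
population_mean = 47

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (N - 1 divisor)

# t = (sample mean - population mean) / (s / sqrt(N))
t_statistic = (sample_mean - population_mean) / (sample_sd / math.sqrt(n))

print(f"sample mean = {sample_mean}, s = {sample_sd:.3f}, t = {t_statistic:.3f}")
```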

Source: http://stattrek.com/help/glossary.aspx?target=t_score


Chapter 3: Bivariate Correlation

http://www.readingstats.com/fifth/funny3.htm



• Correlation or Covariation

– A correlation coefficient is a statistical summary of the degree or magnitude and direction of the relationship or association between two variables

– It is possible to have a negative or positive correlation

• Linear Regression

– The purpose of a regression equation is to make predictions on a new sample of observations from the findings on a previous sample
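To illustrate the prediction idea, here is a minimal sketch of a simple least-squares regression line fitted to one sample and then applied to a new observation. The slides do not specify a fitting method, so ordinary least squares is an assumption, and the data are hypothetical.

```python
# Hypothetical "previous sample" of paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares slope: co-variation of x and y divided by variation of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
# The fitted line passes through the point of means.
intercept = mean_y - slope * mean_x


def predict(x):
    """Predict y for a new observation x using the fitted line."""
    return intercept + slope * x


print(f"y = {intercept:.2f} + {slope:.2f} * x")
print(f"prediction for x = 6: {predict(6):.2f}")
```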

Bivariate Correlation

• Mainly concerned with the concept of a ‘relationship’ between two variables.

• One variable explains the change in the other variable; it does not necessarily cause the change.

• The concept of "relationship" cannot be investigated if we measure each person in a group one time only on a single variable.

Example

• Math and Science Scores:

– What kind of a relationship? Brainstorm.

Person   Math   Science
Will     7      5
Jill     9      3
Bill     8      4

How Correlations Can Be Revealed

1- Scatter Diagrams

• Has two axes labeled with the two variables in focus.

• Dots are drawn to represent scores (at the intersection of the two variables).

• A scatter diagram (aka scatter plot) reveals the relationship between the two variables through the pattern formed by the set of dots.

• The types of relationships that can be revealed include: high-high, high-low, low-high, and low-low (other names may be found)

• Disadvantage: Interpreting a scatterplot can be subjective. Other approaches are needed.

Scatter Plot Examples

http://www.icoachmath.com

http://www.qimacros.com


2- The Correlation Coefficient

• A decimal number that lies between -1 and 1.

• Abbreviated as r.

• Correlations described by r can carry one of many modifying adjectives.

• It is positive (direct relationship) when 0 < r ≤ 1, negative (indirect/inverse relationship) when -1 ≤ r < 0, or zero (no correlation).

• The adjectives high, strong, low, and weak are also used to describe the resulting values of r.

3- The Correlation Matrix

• For more than two variables.

• The correlation of a variable with itself is always 1 and carries no information, so the diagonal entries are often left blank or shown as ---.

Simple Example of a Correlation Matrix (imaginary results)

                Age    Depression   Smoking   Family Values
Age             ---    0.79         -0.33     0.02
Depression             ---          0.84      0.21
Smoking                             ---       0.65
Family Values                                 ---

• Two correlation matrices can be combined for the same subjects and the same tests if the test was run twice, say in two different formats.

• Always remember, correlation does not mean causation!

Which correlation to use?

• Pearson Product Moment

• Kendall tau

• Biserial Correlation

• Spearman rho

• Phi correlation

4- Pearson’s Product Moment

• Most frequently used when:

– The two variables are quantitative

– Each variable is measured to produce raw scores.

• Usually represented by (r).

• See this site for online calculator: http://www.wessa.net/corr.wasp

• Good reference: http://davidmlane.com/hyperstat/A34739.html

• Formula for Pearson’s:
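As a sketch of the calculation, the function below uses the standard textbook definition of Pearson's r: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The function name is hypothetical, and the data are the Math and Science scores from the earlier example.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


math_scores = [7, 9, 8]     # Will, Jill, Bill
science_scores = [5, 3, 4]  # Will, Jill, Bill

r_math_science = pearson_r(math_scores, science_scores)
print(f"r = {r_math_science:.2f}")
```

For these three people every point gained in Math is matched by a point lost in Science, so r comes out as a perfect negative correlation.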


5- Spearman’s Rho

• The two variables are quantitative

• The two variables are measured to produce ranks.

• The correlation is computed between the ranks.

• Spearman’s is also called rank-order correlation.

• Referred to as r_s or ρ (rho).

• Please see: http://davidmlane.com/hyperstat/desc_biv.html

• A good example: http://www.experiment-resources.com/spearman-rank-correlation-coefficient.html










Example: Queries submitted to complete search task X.

Naïve Users   Expert Users
13            12
14            14
15            13
22            17
11            9
17            15
23            19
16            13
9             8
24            17

p-values:

Variables   Naïve      Expert
Naïve       0          < 0.0001
Expert      < 0.0001   0

Correlation matrix (Spearman):

Variables   Naïve   Expert
Naïve       1       0.939
Expert      0.939   1

Coefficients of determination (Spearman):

Variables   Naïve   Expert
Naïve       1       0.882
Expert      0.882   1

Correlation matrix (Pearson):

Variables   Naïve   Expert
Naïve       1       0.950
Expert      0.950   1

p-values:

Variables   Naïve      Expert
Naïve       0          < 0.0001
Expert      < 0.0001   0

Coefficients of determination (R²):

Variables   Naïve   Expert
Naïve       1       0.902
Expert      0.902   1

[Figure: scatter plot of the data, Expert (vertical axis) against Naïve (horizontal axis)]

Correlation in Excel

Go to Excel and XLSTATs
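For readers without Excel, the same coefficients can be approximated in plain Python. This is a sketch, not the XLSTAT procedure: it assumes Spearman's rho is computed as Pearson's r on the ranks, with tied scores sharing the average of their ranks. The function names are hypothetical; the data are the Naïve and Expert query counts from the example.

```python
import math

naive = [13, 14, 15, 22, 11, 17, 23, 16, 9, 24]
expert = [12, 14, 13, 17, 9, 15, 19, 13, 8, 17]


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank-order correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


r = pearson_r(naive, expert)
rho = spearman_rho(naive, expert)
print(f"Pearson r    = {r:.3f}, r squared   = {r * r:.3f}")
print(f"Spearman rho = {rho:.3f}, rho squared = {rho * rho:.3f}")
```

Rounded to three decimals, these values agree with the correlation and coefficient-of-determination tables in the example.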

6- Kendall’s Tau, Point Biserial, and Biserial

• Tau:

– Does a better job (produces lower correlation values) in the case of ties.

– Close to rho.

• Point Biserial

– Correlation between Variable 1 (raw data) and Variable 2 (a real dichotomy, such as 0’s and 1’s)

• Biserial

– Correlation between Variable 1 (raw data) and Variable 2 (an artificial dichotomy, such as a player’s height and productivity (<10 points or >=10 points))

7- Phi, Tetrachoric, and Cramer’s V Correlations

• Used when both variables are dichotomous in nature.

• Phi

– True dichotomies (both variables)

– Example: Gender and Car ownership.

• Tetrachoric

– Artificial dichotomies (both variables)

– Example: Height and Weight when forced into dichotomies.

– Rare in scientific research.

• Cramer’s V

– Two nominal variables

– Its value: 0-1
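For two true dichotomies, phi can be sketched from the four cell counts of a 2x2 table using the usual textbook formula (ad - bc) divided by the square root of the product of the row and column totals. The counts below are hypothetical and serve only to show the arithmetic; the function name is also an assumption.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table with cells [[a, b], [c, d]]."""
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals.
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Hypothetical counts for Gender (rows) by Car ownership (columns):
#               owns car   no car
#   group 1        30        10
#   group 2        10        30
phi = phi_coefficient(30, 10, 10, 30)
print(f"phi = {phi:.2f}")
```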

More Principles to Remember

• When reporting correlations in research reports, report both the r value and the p-value.

Interpretation of the Strength of Correlations

• .00 - .20 – Very Weak

• .21 - .40 – Weak

• .41 - .60 – Moderate

• .61 - .80 – Strong

• .81 - 1.00 – Very Strong

Different statisticians may have similar but slightly different scales.
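The scale above can be turned into a small lookup, sketched below. Two assumptions are made here that the slide does not state: the sign of r is ignored (only the magnitude is labelled), and values falling between two bands, such as .205, are assigned to the higher band.

```python
def strength_label(r):
    """Label the strength of a correlation using the scale on this slide."""
    magnitude = abs(r)  # assumption: strength ignores the direction of r
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if magnitude <= 0.20:
        return "Very Weak"
    if magnitude <= 0.40:
        return "Weak"
    if magnitude <= 0.60:
        return "Moderate"
    if magnitude <= 0.80:
        return "Strong"
    return "Very Strong"


# Values taken from the imaginary correlation matrix earlier in the slides.
for value in (0.79, -0.33, 0.02, 0.84):
    print(value, strength_label(value))
```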

Warnings

• Correlation does not mean cause.

• It indicates the ‘shared variance’

• The coefficient of determination (r²) is obtained by squaring the correlation coefficient; this reduces exaggeration and gets rid of negatives.

• Outliers: can exaggerate or overstate a correlation. Get rid of them before publishing.

• Here is how to decide on outliers: http://www.5min.com/Video/How-to-Interpret-Box-and-Whisker-Plots-275636648

• Linearity: does not require all points on ‘the’ line.

• Relationship strength: depends on the adjectives researchers use (subjectivity (no standard rules)).
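One common convention behind box-and-whisker plots flags a score as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that convention, which is an assumption rather than a rule stated in these slides, to the data from the Range example. Quartile definitions vary between tools, so other software may place the fences slightly differently.

```python
import statistics

data = [2, 3, 4, 6, 8, 11, 24]  # scores from the Range example

# Quartiles using the statistics module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the usual box-plot whisker limit).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print("outliers:", outliers)
```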

















