Reading Statistics and Research - Dalhousie University (web.cs.dal.ca/~anwar/ds/Lec2.pdf)

Descriptive Statistics

Scales of Measurement

– Nominal

• No numerical or quantitative properties. A way to classify groups or categories.

• Gender: Male and Female

• Major: RC or PH

– Ordinal

• Used to rank and order the levels of the variable being studied. No particular value is placed between the numbers in the rating scale.

• Movie Ratings: 4 Stars, 3 Stars, 2 Stars, and 1 Star

Descriptive Statistics Scales of Measurement Cont.

– Interval

• Difference between the numbers on the scale is meaningful and intervals are equal in size. NO absolute zero.

• Allows for comparisons between things being measured

• Temperatures on a thermometer: The difference between 60 and 70 is the same as the difference between 90 and 100. You cannot say that 70 degrees is twice as hot as 35 degrees; it is only 35 degrees warmer.

– Ratio

• Scales that do have an absolute zero point that indicates the absence of the variable being studied. Can form ratios.

• Weight: 100 pounds is ½ of 200 pounds.

• Time

Descriptive Statistics

– Normal Curve

– Bimodal Curve

Descriptive Statistics

– Positively Skewed

– Negatively Skewed

You MUST see this: http://www.exceluser.com/explore/statsnormal.htm

Central Limit Theorem

• The CLT states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.

http://www.statisticalengineering.com/central_limit_theorem.htm
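A small simulation can make the theorem concrete. The sketch below is only an illustration: the uniform distribution, the sample size of 30, the 2000 repetitions, and the function name `sample_means` are all assumptions chosen for the example, not part of the lecture.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Return `trials` means, each computed from n independent Uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]

means = sample_means(n=30, trials=2000)

# Uniform(0, 1) has mean 0.5 and variance 1/12, so the means should cluster
# around 0.5 with a spread near sqrt(1/12) / sqrt(30), roughly 0.053.
print(round(statistics.fmean(means), 3))
print(round(statistics.stdev(means), 3))
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are flat (uniform), which is the point of the theorem.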

Descriptive Statistics

• Frequency Distributions

– In tables, the frequency distribution is constructed by summarizing data in terms of the number or frequency of observations in each category, score, or score interval

– In graphs, the data can be concisely summarized into bar graphs, histograms, or frequency polygons
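As a rough sketch of both forms, the snippet below counts how often each score occurs and prints a text-only bar for each count. The scores are borrowed from the first mode example in this lecture; the `#` bars are just a stand-in for a real bar graph.

```python
from collections import Counter

scores = [3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6]

# Table form: number of observations for each distinct score.
freq = Counter(scores)

print("Score  Frequency")
for score in sorted(freq):
    # Graph form: one '#' per observation, a crude horizontal bar graph.
    print(f"{score:>5}  {freq[score]:>9}  {'#' * freq[score]}")
```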

Descriptive Statistics

• Measures of Central Tendency

– Mode

• The most frequently occurring score

• 3 3 3 4 4 4 5 5 5 6 6 6 6: Mode is 6

• 3 3 3 4 4 4 5 5 6 6 7 7 8: Modes are 3 and 4

– Median

• The score that divides a group of scores in half, with 50% falling above and 50% falling below the median.

• 3 3 3 5 8 8 8: The median is 5

• 3 3 5 6: The median is 4 (average of the two middle numbers)

– Mean

• Preferred whenever possible and is the only measure of central tendency that is used in advanced statistical calculations:

– More reliable and accurate

– Better suited to arithmetic calculations

• Basically, an average of all scores. Add up all scores and divide by the total number of scores.

• 2 3 4 6 10: Mean is 5 (25/5)
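The worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` is used rather than `mode` because one of the examples has two modes.

```python
import statistics

# Mode: multimode returns every value tied for most frequent.
print(statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6]))  # [6]
print(statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8]))  # [3, 4]

# Median: the middle score, or the average of the two middle scores.
print(statistics.median([3, 3, 3, 5, 8, 8, 8]))  # 5
print(statistics.median([3, 3, 5, 6]))           # 4.0

# Mean: sum of the scores divided by the number of scores.
print(statistics.fmean([2, 3, 4, 6, 10]))        # 5.0
```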

Descriptive Statistics

• Measures of Central Tendency

– Your Turn!

– Mode • Example: 2 3 4 4 4 6 8 9 10 11 11

– Median • Example: 2 3 4 4 4 6 8 9 10 11 11

– Mean • Example: 2 3 4 4 4 6 8 9 10 11 11

Descriptive Statistics

• Measures of Variability (Dispersion)

– Range

• Calculated by subtracting the lowest score from the highest score.

• Used only for Ordinal, Interval, and Ratio scales, as the data must be ordered

– Example: 2 3 4 6 8 11 24 (Range is 22)

– Variance

• The extent to which individual scores in a distribution of scores differ from one another

– Standard Deviation

• The square root of the variance

• Most widely used measure to describe the dispersion among a set of observations in a distribution.
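The three measures can be sketched on the range example above. The slides do not say whether the population or the sample formula is intended; the population versions (`pvariance`, `pstdev`) are used here as an assumption, and `statistics.variance` / `statistics.stdev` would give the sample versions instead.

```python
import statistics

scores = [2, 3, 4, 6, 8, 11, 24]

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)
print(score_range)  # 22

# Variance: average squared distance of each score from the mean
# (population form, dividing by N; an assumption for this sketch).
variance = statistics.pvariance(scores)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(scores)

print(variance, std_dev)
```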

Descriptive Statistics

• Standard Scores: Z-Scores and T-Scores

– Z-Scores

• Most widely used standard score in statistics

• It is the number of standard deviations above or below the mean.

• A Z score of 1.5 means that the score is 1.5 standard deviations above the mean; a Z score of -1.5 means that the score is 1.5 standard deviations below the mean

• Always have the same meaning in all distributions

• To find a percentile rank, first convert to a Z score and then find percentile rank off a normal-curve table
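The conversion and the table lookup can be sketched as follows. `NormalDist().cdf` from the standard library plays the role of the normal-curve table, so the percentile is only meaningful if the scores are roughly normally distributed. The raw score of 115, mean of 100, and standard deviation of 10 are made-up numbers for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def percentile_rank(z):
    """Percent of a normal distribution falling below z (the normal-curve table lookup)."""
    return 100 * NormalDist().cdf(z)

z = z_score(115, mean=100, sd=10)    # hypothetical score, mean, and SD
print(z)                             # 1.5
print(round(percentile_rank(z), 1))  # 93.3
```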

Descriptive Statistics

• Standard Scores: Z-Scores and T-Scores

– T-Scores

• Most commonly used standard score for reporting performance

• May be converted from Z-scores and are always rounded to two figures, which eliminates decimals

• Always reported in positive numbers

• The mean is always 50 and the standard deviation is always 10.

– A T-score of 70 is 2 SDs above the mean

– A T-score of 20 is 3 SDs below the mean
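Since the mean is fixed at 50 and the standard deviation at 10, the conversion from a Z-score amounts to T = 50 + 10 × Z, rounded to a whole number. A minimal sketch, with `t_score_from_z` as a name chosen here for illustration:

```python
def t_score_from_z(z):
    """Convert a Z-score to a T-score (mean 50, SD 10), rounded to a whole number."""
    return round(50 + 10 * z)

print(t_score_from_z(2))   # 70, i.e. 2 SDs above the mean
print(t_score_from_z(-3))  # 20, i.e. 3 SDs below the mean
print(t_score_from_z(0))   # 50, exactly at the mean
```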

t Score

• The t statistic (aka, the t score) is defined by:

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{N}} $$

• Where:

– x̄ is the sample mean,

– μ is the population mean,

– s is the standard deviation of the sample,

– and N is the sample size.

• The distribution of this t statistic is called the t distribution or the Student t distribution. It can be used whenever samples are drawn from populations possessing a bell-shaped distribution (i.e., approximately normal).

Source: http://stattrek.com/help/glossary.aspx?target=t_score
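Note that this t statistic is a different quantity from the T-score standard score described earlier. A minimal sketch of the computation follows; the sample reuses the scores from the mean example in this lecture, and the population mean of 4 is a made-up value for illustration.

```python
import math
import statistics

def t_statistic(sample, mu):
    """t = (sample mean - population mean) / (s / sqrt(N))."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (N - 1 denominator)
    return (x_bar - mu) / (s / math.sqrt(n))

sample = [2, 3, 4, 6, 10]
t = t_statistic(sample, mu=4)  # mu = 4 is a hypothetical population mean
print(round(t, 3))             # 0.707
```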

Chapter 3: Bivariate Correlation

http://www.readingstats.com/fifth/funny3.htm

Descriptive Statistics

• Correlation or Covariation

– A correlation coefficient is a statistical summary of the degree or magnitude and direction of the relationship or association between two variables

– It is possible to have a negative or positive correlation

• Linear Regression

– The purpose of a regression equation is to make predictions on a new sample of observations from the findings on a previous sample
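The regression idea can be sketched with an ordinary least-squares line: fit the slope and intercept on a previous sample, then apply them to a new observation. The data (hours studied and test scores) and the names `fit_line` and `predict` are invented for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(x, slope, intercept):
    return intercept + slope * x

# Previous sample (hypothetical): hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)

# Prediction for a new observation that was not in the previous sample.
predicted = predict(6, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(predicted, 2))  # 4.1 47.7 72.3
```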

Bivariate Correlation

• Mainly, the concept of ‘relationship’ between two variables.

• One variable explains the change in the other variable; it does not cause the change.

• The concept of "relationship" cannot be investigated if we measure each person in a group one time only on a single variable.

Example

• Math and Science Scores:

– What kind of a relationship? Brainstorm.

Person   Math   Science
Will     7      5
Jill     9      3
Bill     8      4

How Correlations Can Be Revealed

1- Scatter Diagrams

• Has two axes labeled with the two variables in focus.

• Dots are drawn to represent scores (at the intersection of the two variables).

• A scatter diagram (aka scatter plot) reveals the relationship between the two variables through the pattern formed by the set of dots.

• The types of relationships that can be revealed include: high-high, high-low, low-high, and low-low (other namings may be found)

• Disadvantage: Interpreting a scatterplot can be subjective. Other approaches are needed.

Scatter Plot Examples

http://www.icoachmath.com

http://www.qimacros.com

2- The Correlation Coefficient

• A decimal number that lies between -1 and 1.

• Abbreviated as r.

• Correlations described by r can carry one of many modifying adjectives.

• A correlation is positive (direct relationship; 0 < r ≤ 1), negative (indirect/inverse; -1 ≤ r < 0), or zero (no correlation).

• The size of r is also described as high, strong, low, or weak.

3- The Correlation Matrix

• For more than two variables.

• No correlation is reported between a variable and itself: that value is always 1, so the diagonal is usually left blank (---).

Simple Example of a Correlation Matrix (imaginary results)

                Age   Depression   Smoking   Family Values
Age             ---   0.79         -0.33     0.02
Depression            ---          0.84      0.21
Smoking                            ---       0.65
Family Values                                ---

• Two correlation matrices can be combined into one table for the same subjects and the same tests if the test was run twice, say in two different formats.

• Always remember, correlation does not mean causation!
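A matrix like the one above is simply the set of pairwise coefficients for every pair of variables. The sketch below builds the upper triangle with a hand-written Pearson function; the variable names A, B, and C and their values are invented for illustration and have no connection to the imaginary results in the slide.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for five people on three variables.
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 4, 1, 2],
}

# Upper triangle only: each pair appears once and the diagonal is skipped.
matrix = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}

for (a, b), r in matrix.items():
    print(f"{a} vs {b}: {r:+.2f}")
```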

Which correlation to use?

• Pearson Product Moment

• Kendall tau

• Biserial Correlation

• Spearman rho

• Phi correlation

4- Pearson’s Product Moment

• Most frequently used when:

– The two variables are quantitative

– Each variable is measured to produce raw scores.

• Usually represented by (r).

• See this site for online calculator: http://www.wessa.net/corr.wasp

• Good reference: http://davidmlane.com/hyperstat/A34739.html

• Formula for Pearson’s:
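One standard way to write Pearson's r for N paired raw scores is given below. The notation is an assumption of this transcript, not taken from the slide: x̄ and ȳ denote the sample means of the two variables.

$$ r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} $$

The numerator measures how the two variables vary together; the denominator rescales it so that r always falls between -1 and 1.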

Example

Queries submitted to complete search task X.

Naïve Users   Expert Users
13            12
14            14
15            13
22            17
11            9
17            15
23            19
16            13
9             8
24            17

Pearson

Correlation matrix (Pearson):

Variables   Naïve      Expert
Naïve       1          0.950
Expert      0.950      1

p-values:

Variables   Naïve      Expert
Naïve       0          < 0.0001
Expert      < 0.0001   0

Coefficients of determination (R²):

Variables   Naïve      Expert
Naïve       1          0.902
Expert      0.902      1

Spearman

Correlation matrix (Spearman):

Variables   Naïve      Expert
Naïve       1          0.939
Expert      0.939      1

p-values:

Variables   Naïve      Expert
Naïve       0          < 0.0001
Expert      < 0.0001   0

Coefficients of determination (Spearman):

Variables   Naïve      Expert
Naïve       1          0.882
Expert      0.882      1

[Figure: Scatter Plot of the Data, with Naïve on the horizontal axis (5 to 25) and Expert on the vertical axis (6 to 20)]
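The Pearson and Spearman coefficients reported for this example can be reproduced from the raw data with a short script. This is a sketch, not the tool used for the slide: Spearman's rho is computed here as Pearson's r on the ranks, with tied scores given the average of their rank positions (an assumption about tie handling that happens to match the reported values). The p-values are not computed.

```python
import math

naive = [13, 14, 15, 22, 11, 17, 23, 16, 9, 24]
expert = [12, 14, 13, 17, 9, 15, 19, 13, 8, 17]

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of the scores."""
    return pearson(average_ranks(xs), average_ranks(ys))

r = pearson(naive, expert)
rho = spearman(naive, expert)
print(f"Pearson r = {r:.3f}, r squared = {r * r:.3f}")            # 0.950, 0.902
print(f"Spearman rho = {rho:.3f}, rho squared = {rho * rho:.3f}")  # 0.939, 0.882
```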

Correlation in Excel

Go to Excel and XLSTAT

6- Kendall’s Tau, Point Biserial, and Biserial

• Tau:

– Does a better job (produces lower corr. values) in the case of ties.

– Close to rho.

• Point Biserial

– Correlation between Variable 1 (raw data) and Variable 2 (real dichotomy such as 0’s and 1’s)

• Biserial

– Correlation between Variable 1 (raw data) and Variable 2 (artificial dichotomy, such as a player’s height and productivity (<10 points or >=10 points))

7- Phi, Tetrachoric, and Cramer’s V Correlations

• If both variables are dichotomous in nature.

• Phi

– True dichotomies (both variables)

– Example: Gender and Car ownership.

• Tetrachoric

– Artificial dichotomies (both variables)

– Example: Height and Weight when forced into dichotomies.

– Rare in scientific research.

• Cramer’s V

– Two nominal variables

– Its value: 0-1
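For two true dichotomies, phi can be computed directly from the four cell counts of a 2x2 table. The sketch below uses the standard formula phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)); the counts and the row and column labels are invented for illustration. For a 2x2 table, Cramer's V equals the absolute value of phi.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table with cell counts laid out as [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical counts: rows are the two levels of the first dichotomy,
# columns are the two levels of the second (e.g. owns a car / does not).
phi = phi_coefficient(a=30, b=10, c=10, d=30)
print(phi)       # 0.5
print(abs(phi))  # Cramer's V for the same 2x2 table
```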

More Principles to Remember

• In reporting correlations in research reports, you report both the r value and the p value.

Interpretation of the Strength of Correlations

• .00 - .20 – Very Weak

• .21 - .40 – Weak

• .41 - .60 – Moderate

• .61 - .80 – Strong

• .81 - 1.00 – Very Strong

Different statisticians may have similar but slightly different scales.
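The scale above can be turned into a small lookup. Two choices in this sketch are assumptions rather than part of the slide: the scale is applied to the absolute value of r so that negative correlations are labeled by size, and values falling between the listed bands (for example .205) are assigned to the higher band.

```python
def strength(r):
    """Label the size of a correlation coefficient using the scale above."""
    size = abs(r)  # assumption: strength ignores the sign of r
    if size > 1:
        raise ValueError("r must lie between -1 and 1")
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"

def direction(r):
    """Positive (direct), negative (inverse), or none for r equal to zero."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

for r in (0.950, -0.33, 0.02):
    print(r, direction(r), strength(r))
```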

Warnings

• Correlation does not mean cause.

• It indicates the ‘shared variance’

• Coefficient of determination (r²): obtained by squaring the coefficient value. Squaring reduces exaggeration (for example, r = 0.50 gives r² = 0.25) and gets rid of negatives.

• Outliers: can cause exaggeration or overstating. Get rid of them before publishing.

• Here is how to decide on outliers: http://www.5min.com/Video/How-to-Interpret-Box-and-Whisker-Plots-275636648

• Linearity: does not require all points on ‘the’ line.

• Relationship strength: depends on the adjectives researchers use (it is subjective; there are no standard rules).