psyc 6130c univariate analysis prof. james elder
TRANSCRIPT
PSYC 6130C UNIVARIATE ANALYSIS
Prof. James Elder
Introduction
PSYC 6130, PROF. J. ELDER 3
What is (are) statistics?
• A branch of mathematics concerned with understanding and summarizing collections of numbers
• A collection of numerical facts
• Estimates of population parameters, derived from samples
What is this course about?
• Applied statistics
• Emphasizes methods, not proofs
• Descriptive statistics
• Inferential statistics
Fall Term
Date | Title | Readings | Notes
10-Sep-08 | Introduction; Probability; Descriptive statistics | 1.1-1.3; 5.1-5.5, 5.7; 2.1, 2.2, 2.5, 2.7-2.9, 2.12, 2.13 |
17-Sep-08 | The normal distribution | 3.1-3.4 | Lab 1
24-Sep-08 | Introduction to hypothesis testing; t-tests | 4; 7 |
1-Oct-08 | Rosh Hashanah – No Classes | |
8-Oct-08 | t-tests | 7 | Lab 2
15-Oct-08 | Statistical power and effect size | 8 | Assignment 1 due
22-Oct-08 | Correlation and regression | 9 |
29-Oct-08 | One-way independent ANOVA | 11 | Lab 3
5-Nov-08 | Multiple comparisons | 12.1-12.12 |
12-Nov-08 | Multiple comparisons | 12.1-12.12 | Lab 4
19-Nov-08 | Two-way ANOVA | 13.1-13.11, 13.14 | Assignment 2 due
26-Nov-08 | Review | |
3-Dec-08 | Exam | |
Winter Term
Date | Title | Readings | Deadlines
7-Jan-09 | Repeated measures ANOVA | 14 |
14-Jan-09 | Two-way mixed design ANOVA | 14 | Lab 5; deadline for choosing project topic
21-Jan-09 | Reading Week | |
28-Jan-09 | Multiple regression | 15 | Lab 6
4-Feb-09 | The general linear model | 16 | Assignment 3 due; drop date is Feb 1
11-Feb-09 | The binomial distribution | 5.6, 5.8-5.10 | Lab 7
18-Feb-09 | Reading Week – No Classes | |
25-Feb-09 | Chi-square tests | 6 |
4-Mar-09 | Resampling and nonparametric techniques | 18 | Lab 8
11-Mar-09 | Student Presentations | |
18-Mar-09 | Student Presentations | | Assignment 4 due
25-Mar-09 | Review | |
1-Apr-09 | Exam | |
Some Background
(Howell Ch. 1)
Variables and Constants
• Constants are properties that never change (e.g., the speed of light in a vacuum, ~3 × 10^8 m/s).
• Most physiological and psychological parameters of interest vary considerably
– Between individuals (e.g., intelligence quotient)
– Within individuals (e.g., heart rate)
• Any variable whose variation is somewhat unpredictable is called a random variable (rv).
Scales of measurement
• Nominal scale: values are categories, having no meaningful correspondence to numbers.
• Ordinal scale: ordering is meaningful, but exact numerical values (if they exist) are not.
• Interval scale: values are numerically meaningful, and interval between two values is meaningful.
– Example: Celsius temperature scale. It takes the same amount of energy to raise the temperature of a gram of water from 20 °C to 21 °C as it does to raise it from 30 °C to 31 °C.
• Ratio scale: ratio of two values is also meaningful.
– Example: Kelvin temperature scale. A gram of H₂O at 300 K has twice the energy of a gram of H₂O at 150 K.
– Ratio scales require a 0-point corresponding to a complete lack of the substance being measured.
• Example: a gram of H₂O at 0 K has no heat (particles are motionless).
Continuous vs Discrete Variables
• A continuous variable may assume any real value within some range
• A discrete variable may assume only a countable number of values: intermediate values are not meaningful.
Independent vs Dependent Variables
• Experiments involve independent and dependent variables.
– The independent variable is controlled by the experimenter.
– The dependent variable is measured.
– We seek to detect and model effects of the independent variable on the dependent variable.
• Example: In a visual search task, subjects are asked to find the odd-man-out in a display of discrete items (e.g., a horizontal bar amongst vertical bars).
– The number of items in the display is an independent variable.
– Reaction time is the main dependent variable.
– Typically, we observe a roughly linear relationship between the number of items and the reaction time.
Experimental vs Correlational Research
• Experimental study:
– Researcher controls the independent variable.
– Seek to detect effects on the dependent variable.
– Direction of causation may be inferred (but may be indirect).
• Correlational study:
– There are no independent or dependent variables.
– No variables are under control of the researcher.
– Seek to find statistical relationships (dependencies) between variables.
– Direction of causation may not normally be inferred.
Correlational Studies: Examples
Populations vs Samples
• In human science, we typically want to characterize and make inferences not about a particular person (e.g., Uncle Bob) but about all people, or all people with a certain property (e.g., all people suffering from a bipolar disorder).
• These groups of interest are called populations.
• Typically, these populations are too large and inaccessible to study.
• Instead, we study a subset of the group, called a sample.
• In order to make reliable inferences about the population, samples are ideally randomly selected.
• The population properties of interest are called parameters.
• The corresponding measurements made on our samples are called statistics. Statistics are approximations (estimates) of parameters.
Different Types of Populations and Samples
• Outside of human science, populations do not necessarily refer to humans
– e.g. populations may be of bees, algae, quarks, stock prices, pork belly futures, ozone levels, etc…
• In clinical and social psychology you will often be conducting large-n studies on human populations.
• In cognitive psychology, you will often be doing small-n within-subject studies involving repeated trials on the same subject.
– Here, you may think of the ‘population’ as being the infinite set of responses you would obtain were you able to continue the experiment indefinitely.
– The sample is the set of responses you were able to collect in a finite number of trials (e.g., 5000) on the same subject.
Summation Notation
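Let $X_i$ = number of siblings for respondent $i$
Let $Y_i$ = number of children for respondent $i$
i | X_i | Y_i
1 | 1 | 2
2 | 2 | 1
3 | 2 | 1
… | … | …
N | 4 | 0
Then
$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$
where $N$ = number of respondents in sample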
Some Summation Rules
1. Often abbreviate $\sum_{i=1}^{N} X_i$ as $\sum X_i$
2. $\sum (X_i + Y_i) = \sum X_i + \sum Y_i$
since $(X_1 + Y_1) + (X_2 + Y_2) = (X_1 + X_2) + (Y_1 + Y_2)$ (associative property of addition)
Similarly, $\sum (X_i - Y_i) = \sum X_i - \sum Y_i$
3. $\sum C = NC$, where $C$ is a constant, since adding $C$ to itself $N$ times yields $N$ $C$'s.
4. $\sum C X_i = C \sum X_i$
since $C X_1 + C X_2 = C (X_1 + X_2)$ (multiplication is distributive over addition)
But note that
5. $\sum X_i Y_i \neq \sum X_i \sum Y_i$
since $X_1 Y_1 + X_2 Y_2 \neq (X_1 + X_2)(Y_1 + Y_2) = X_1 Y_1 + X_1 Y_2 + X_2 Y_1 + X_2 Y_2$
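The summation rules can be checked numerically. The sketch below is only an illustration: it treats the four visible rows of the siblings/children table as if they were a complete sample of N = 4, and the constant C = 3 is an arbitrary choice.

```python
# Visible rows of the table, treated here as a complete sample (an assumption).
X = [1, 2, 2, 4]  # number of siblings
Y = [2, 1, 1, 0]  # number of children
C = 3             # arbitrary constant chosen for illustration
N = len(X)

# Rule 2: the sum of sums (differences) equals the sum (difference) of the sums.
rule2_add = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)
rule2_sub = sum(x - y for x, y in zip(X, Y)) == sum(X) - sum(Y)

# Rule 3: summing a constant N times gives N * C.
rule3 = sum(C for _ in range(N)) == N * C

# Rule 4: a constant factor can be moved outside the sum.
rule4 = sum(C * x for x in X) == C * sum(X)

# Rule 5: the sum of products is NOT, in general, the product of the sums.
sum_of_products = sum(x * y for x, y in zip(X, Y))
product_of_sums = sum(X) * sum(Y)

print(rule2_add, rule2_sub, rule3, rule4)  # all True
print(sum_of_products, product_of_sums)    # 6 versus 36
```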
Summary
• What is (are) statistics
• Variables and constants
• Scales of measurement
• Continuous and discrete variables
• Independent and dependent variables
• Experimental and correlational research
• Populations and samples
• Summation Notation
Descriptive Statistics
(Howell Ch. 2)
Frequency Tables
1991 U.S. General Social Survey: Number of Brothers and Sisters
Number of siblings | Frequency | Percent | Valid Percent | Cumulative Percent
0 | 74 | 4.88 | 4.92 | 4.92
1 | 236 | 15.56 | 15.68 | 20.60
2 | 276 | 18.19 | 18.34 | 38.94
3 | 236 | 15.56 | 15.68 | 54.62
4 | 209 | 13.78 | 13.89 | 68.50
5 | 118 | 7.78 | 7.84 | 76.35
6 | 80 | 5.27 | 5.32 | 81.66
7 | 81 | 5.34 | 5.38 | 87.04
8 | 58 | 3.82 | 3.85 | 90.90
9 | 47 | 3.10 | 3.12 | 94.02
10 | 34 | 2.24 | 2.26 | 96.28
11 | 22 | 1.45 | 1.46 | 97.74
12 | 11 | 0.73 | 0.73 | 98.47
13 | 9 | 0.59 | 0.60 | 99.07
14 | 5 | 0.33 | 0.33 | 99.40
15 | 3 | 0.20 | 0.20 | 99.60
16 | 1 | 0.07 | 0.07 | 99.67
17 | 2 | 0.13 | 0.13 | 99.80
18 | 1 | 0.07 | 0.07 | 99.87
21 | 1 | 0.07 | 0.07 | 99.93
26 | 1 | 0.07 | 0.07 | 100.00
Valid total | 1505 | 99.21 | 100.00 |
Missing: DK | 4 | 0.26 | |
Missing: NA | 8 | 0.53 | |
Missing total | 12 | 0.79 | |
Total | 1517 | 100.00 | |
Bar Graphs and Histograms
Grouped Frequency Distributions
• What are the apparent limits?
• What are the real limits?
Statistics Canada 2001 Census: Age of Respondent
X | f
<5 | 581
5-9 | 661
10-14 | 740
15-19 | 701
20-24 | 689
25-29 | 674
30-34 | 731
35-39 | 903
40-44 | 930
45-49 | 838
50-54 | 746
55-59 | 608
60-64 | 434
65-69 | 383
70-74 | 345
75-79 | 288
80-84 | 174
85+ | 97
Percentiles and Percentile Ranks
• Percentile: The score at or below which a given % of scores lie.
• Percentile Rank: The percentage of scores at or below a given score
Linear Interpolation to Compute Percentile Ranks
What if you have a 23-year-old respondent and would like to know her percentile rank?
Let $x$ = age (percentile) and $y$ = percentile rank.
Then the linear (affine) interpolation model is: $y = ax + b$
There are 2 unknowns ($a$ and $b$). If we have two data points $(x_1, y_1)$ and $(x_2, y_2)$ on either side of the value of interest, we can solve:
$y_1 = a x_1 + b$
$y_2 = a x_2 + b$
$\Rightarrow a = \frac{y_2 - y_1}{x_2 - x_1}$
Thus
$y = ax + b$
$= ax + y_1 - a x_1$
$= y_1 + a (x - x_1)$
$= y_1 + \frac{y_2 - y_1}{x_2 - x_1} (x - x_1)$
Statistics Canada 2001 Census: Age of Respondent
Age | Frequency | Percent | Cumulative Percent
<5 | 581 | 5.5 | 5.5
5-9 | 661 | 6.3 | 11.8
10-14 | 740 | 7.0 | 18.8
15-19 | 701 | 6.7 | 25.5
20-24 | 689 | 6.5 | 32.0
25-29 | 674 | 6.4 | 38.4
30-34 | 731 | 6.9 | 45.4
35-39 | 903 | 8.6 | 54.0
40-44 | 930 | 8.8 | 62.8
45-49 | 838 | 8.0 | 70.8
50-54 | 746 | 7.1 | 77.9
55-59 | 608 | 5.8 | 83.6
60-64 | 434 | 4.1 | 87.8
65-69 | 383 | 3.6 | 91.4
70-74 | 345 | 3.3 | 94.7
75-79 | 288 | 2.7 | 97.4
80-84 | 174 | 1.7 | 99.1
85+ | 97 | 0.9 | 100.0
Total | 10523 | 100.0 |
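As a rough worked example of the interpolation formula, the sketch below estimates the percentile rank of a 23-year-old. The choice of real limits is an assumption: ages are taken as age at last birthday, so the 20-24 group is treated as running from 20 up to 25. With limits of 19.5 and 24.5 instead, the answer would shift slightly. The function name `interpolate` is illustrative.

```python
def interpolate(x, x1, y1, x2, y2):
    """Value of y at x on the straight line through (x1, y1) and (x2, y2)."""
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)

# Cumulative percents from the census table: 25.5 below the 20-24 group,
# 32.0 at its upper limit. Real limits of 20 and 25 are an assumption.
rank_23 = interpolate(23, x1=20, y1=25.5, x2=25, y2=32.0)
print(round(rank_23, 1))  # about 29.4
```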
Linear Interpolation to Compute Percentiles
What if you want to know what the median age is?
To compute percentiles, simply swap the x's and y's in the formula:
$x = x_1 + \frac{x_2 - x_1}{y_2 - y_1} (y - y_1)$
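A minimal sketch of the swapped formula, applied to the median (50th percentile) of the census ages. The 50% point falls in the 35-39 group, whose cumulative percents run from 45.4 to 54.0. As an assumption, the real limits of that group are taken to be 35 and 40 (age at last birthday); the function name `inverse_interpolate` is illustrative.

```python
def inverse_interpolate(y, x1, y1, x2, y2):
    """Value of x at which the line through (x1, y1) and (x2, y2) reaches y."""
    return x1 + (x2 - x1) / (y2 - y1) * (y - y1)

# Cumulative percents from the census table; the real limits are an assumption.
median_age = inverse_interpolate(50.0, x1=35, y1=45.4, x2=40, y2=54.0)
print(round(median_age, 1))  # between 37 and 38 under these assumed limits
```

Under these assumptions the interpolated median lies between 37 and 38, which is compatible with a median of 37 when age is recorded in whole years.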
Measures of Central Tendency
• The mode – applies to ratio, interval, ordinal or nominal scales.
• The median – applies to ratio, interval and ordinal scales
• The mean – applies to ratio and interval scales
AGE: Mean = 37.1, Median = 37, Mode = 41
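The three measures can be computed with Python's standard `statistics` module. The ages below are hypothetical values made up for illustration; they are not the census data summarised above.

```python
import statistics

# Hypothetical ages in years, invented for this example.
ages = [23, 31, 35, 37, 41, 41, 52]

mode = statistics.mode(ages)      # most frequent value
median = statistics.median(ages)  # middle score (50th percentile)
mean = statistics.mean(ages)      # arithmetic average

print(mode, median, round(mean, 1))
```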
The Mode
• Defined as the most frequent value (the peak)
• Applies to ratio, interval, ordinal and nominal scales
• Sensitive to sampling error (noise)
• Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks
Mode = 41
The Median
• Defined as the 50th percentile
• Applies to ratio, interval and ordinal scales
• Can be used for open-ended distributions
Median = 37
The Mean
• Applies only to ratio or interval scales
• Sensitive to outliers
Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
Sample mean: $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
$\bar{X} = 37.1$
Properties of the Mean
1. Suppose a constant $C$ is added (or subtracted) to every score in your sample:
$X_i \rightarrow X_i + C$
Then the mean also increases (decreases) by $C$:
$\bar{X} \rightarrow \bar{X} + C$
2. Suppose every score in your sample is multiplied (divided) by a constant $C$:
$X_i \rightarrow C X_i$
Then the mean is also multiplied (divided) by $C$:
$\bar{X} \rightarrow C \bar{X}$
3. $\sum (X_i - \bar{X}) = 0$
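These three properties are easy to confirm numerically. The scores and the constant below are hypothetical values chosen so the arithmetic is simple.

```python
import math
import statistics

X = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical scores
C = 3.0                        # arbitrary constant
mean_X = statistics.mean(X)    # 5.0

# Property 1: adding C to every score adds C to the mean.
shifted_mean = statistics.mean([x + C for x in X])

# Property 2: multiplying every score by C multiplies the mean by C.
scaled_mean = statistics.mean([C * x for x in X])

# Property 3: deviations from the mean sum to zero.
deviation_sum = sum(x - mean_X for x in X)

print(math.isclose(shifted_mean, mean_X + C))
print(math.isclose(scaled_mean, C * mean_X))
print(math.isclose(deviation_sum, 0.0, abs_tol=1e-12))
```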
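Properties of the Mean (cont'd)
Least-squares property: the mean minimizes the sum of squared deviations:
$\sum (X_i - \bar{X})^2 \leq \sum (X_i - X)^2 \quad \forall X$
Proof:
$\sum (X_i - X)^2$ has a minimum where $\frac{d}{dX} \sum (X_i - X)^2 = 0$ and $\frac{d^2}{dX^2} \sum (X_i - X)^2 > 0$
$\frac{d}{dX} \sum (X_i - X)^2 = -2 \sum (X_i - X) = 0 \Rightarrow X = \frac{1}{N} \sum X_i = \bar{X}$
$\frac{d^2}{dX^2} \sum (X_i - X)^2 = 2N > 0$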
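The least-squares property can also be seen numerically. The sketch below, using hypothetical scores, evaluates the sum of squared deviations around a grid of candidate centres and confirms that none does better than the mean. It is a spot check, not a proof; the helper name `sum_sq_dev` is illustrative.

```python
def sum_sq_dev(scores, center):
    """Sum of squared deviations of the scores around an arbitrary center."""
    return sum((x - center) ** 2 for x in scores)

X = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical scores
mean_X = sum(X) / len(X)       # 5.0

# Candidate centres from 2.0 to 8.0 in steps of 0.5 (includes the mean).
candidates = [mean_X + 0.5 * k for k in range(-6, 7)]
best = min(candidates, key=lambda c: sum_sq_dev(X, c))

print(best == mean_X, sum_sq_dev(X, mean_X))
```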
Measures of Variability (Dispersion)
• Range – applies to ratio, interval, ordinal scales
• Semi-interquartile range – applies to ratio, interval, ordinal scales
• Variance (standard deviation) – applies to ratio, interval scales
Range
• Interval between lowest and highest values
• Generally unreliable – changing one value (highest or lowest) can cause large change in range.
Range = 79 drinks
Semi-Interquartile Range
• The interquartile range is the interval between the first and third quartiles, i.e. between the 25th and 75th percentiles.
• The semi-interquartile range is half the interquartile range.
• Can be used with open-ended distributions
• Unaffected by extreme scores
N: Valid = 19769, Missing = 6004
Median = 4
Percentiles: 25th = 2, 50th = 4, 75th = 7
SIQ = 2.5 drinks
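The SIQ value follows directly from the reported quartiles. A small sketch of the arithmetic, with the illustrative helper name `semi_interquartile_range`:

```python
def semi_interquartile_range(q1, q3):
    """Half the distance between the 25th and 75th percentiles."""
    return (q3 - q1) / 2

# Quartiles reported above for the drinks data.
q1, q3 = 2, 7
iqr = q3 - q1
siq = semi_interquartile_range(q1, q3)
print(iqr, siq)  # 5 and 2.5
```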
Population Variance and Standard Deviation
$X_i - \mu$ is known as the deviation of sample $i$.
Thus $SS = \sum (X_i - \mu)^2$ is known as the sum of squared deviations.
The population variance is simply the mean squared deviation:
$\sigma^2 = \frac{1}{N} \sum (X_i - \mu)^2$
The population standard deviation is simply the square root of the variance:
$\sigma = \sqrt{\frac{1}{N} \sum (X_i - \mu)^2}$
The standard deviation is particularly sensitive to outliers, due to the squaring operation.
Sample Variance and Standard Deviation
$X_i - \bar{X}$ is known as the deviation of sample $i$.
Thus $SS = \sum (X_i - \bar{X})^2$ is known as the sum of squared deviations.
The mean squared sample deviation $\frac{1}{N} \sum (X_i - \bar{X})^2$ is a biased estimator of the population variance: it tends to underestimate $\sigma^2$.
A minor modification makes the sample variance $s^2$ unbiased:
$s^2 = \frac{1}{N-1} \sum (X_i - \bar{X})^2$
The corrected sample standard deviation is given by
$s = \sqrt{\frac{1}{N-1} \sum (X_i - \bar{X})^2}$
$s$ is not an unbiased estimator of $\sigma$, but is close enough for most purposes.
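To make the difference between the two divisors concrete, the sketch below computes both versions by hand for a small hypothetical sample and compares them with Python's `statistics` module, where `pvariance` divides by N and `variance` and `stdev` divide by N - 1.

```python
import math
import statistics

X = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical sample
N = len(X)
mean_X = sum(X) / N

SS = sum((x - mean_X) ** 2 for x in X)  # sum of squared deviations

mean_sq_dev = SS / N   # biased: divides by N
s2 = SS / (N - 1)      # unbiased sample variance: divides by N - 1
s = math.sqrt(s2)      # corrected sample standard deviation

print(mean_sq_dev, s2, round(s, 3))
print(math.isclose(mean_sq_dev, statistics.pvariance(X)))
print(math.isclose(s2, statistics.variance(X)))
print(math.isclose(s, statistics.stdev(X)))
```

The divide-by-N value is always the smaller of the two, consistent with its tendency to underestimate the population variance.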
Degrees of Freedom
The degrees of freedom ($df$) is the number of independent measurements available for estimating a population parameter.
The calculation of $s^2$ involves $\bar{X}$. Knowing $\bar{X}$ and $N - 1$ of the sample values allows us to infer the value of the remaining sample value. Thus only $N - 1$ of the sample values are independent, and $df = N - 1$.
Computational Formulas for Variance
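The deviational formula for the sum of squares: $SS = \sum (X_i - \bar{X})^2$
More efficient to use the computational formula: $SS = \sum X_i^2 - N \bar{X}^2$
Why are these equivalent?
$\sum (X_i - \bar{X})^2 = \sum (X_i^2 - 2 X_i \bar{X} + \bar{X}^2)$
$= \sum X_i^2 - 2 \bar{X} \sum X_i + \sum \bar{X}^2$
$= \sum X_i^2 - 2 N \bar{X}^2 + N \bar{X}^2$
$= \sum X_i^2 - N \bar{X}^2$
Thus
$s^2 = \frac{1}{N-1} \left( \sum X_i^2 - N \bar{X}^2 \right)$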
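A quick numerical check of the equivalence, using hypothetical scores. One caveat worth keeping in mind: in floating-point arithmetic the computational formula subtracts two large, nearly equal numbers when the mean is large relative to the spread, so it can lose precision even though it is algebraically identical.

```python
import math

X = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical scores
N = len(X)
mean_X = sum(X) / N

# Deviational formula: sum of squared deviations from the mean.
ss_deviational = sum((x - mean_X) ** 2 for x in X)

# Computational formula: sum of squares minus N times the squared mean.
ss_computational = sum(x ** 2 for x in X) - N * mean_X ** 2

s2 = ss_computational / (N - 1)  # sample variance

print(ss_deviational, ss_computational, s2)
```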
Properties of the Standard Deviation
1. Suppose a constant $C$ is added (or subtracted) to every score in your sample:
$X_i \rightarrow X_i + C$
Then the standard deviation does not change.
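Properties of the Standard Deviation (cont'd)
2. Suppose every score in your sample is multiplied (divided) by a constant $C$:
$X_i \rightarrow C X_i$
Then the standard deviation is also multiplied (divided) by $C$ (for $C > 0$; for negative $C$ the factor is $|C|$):
$s \rightarrow C s$
Proof:
$s_{old} = \sqrt{\frac{1}{N-1} \sum (X_i - \bar{X})^2}$
$s_{new} = \sqrt{\frac{1}{N-1} \sum (C X_i - C \bar{X})^2}$
$= C \sqrt{\frac{1}{N-1} \sum (X_i - \bar{X})^2}$
$= C s_{old}$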
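Both properties of the standard deviation can be checked with a few lines of Python. The scores and constant are hypothetical; the last check illustrates that a negative multiplier scales the standard deviation by its absolute value, since a standard deviation cannot be negative.

```python
import math
import statistics

X = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical scores
C = 3.0                        # arbitrary positive constant
s = statistics.stdev(X)

s_shifted = statistics.stdev([x + C for x in X])   # property 1: unchanged
s_scaled = statistics.stdev([C * x for x in X])    # property 2: multiplied by C
s_negated = statistics.stdev([-C * x for x in X])  # negative factor: multiplied by |C|

print(math.isclose(s_shifted, s))
print(math.isclose(s_scaled, C * s))
print(math.isclose(s_negated, C * s))
```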
Standard Deviation Example
$\bar{X}$ = 5.7 drinks
$s$ = 5.8 drinks
cf. SIQ = 2.5 drinks
range = 79 drinks
Skew
• The mean and median are identical for symmetric distributions.
• Skew tends to push the mean away from the median, toward the tail (but not always)
Median=3
Mean=6.7
Skewness
• Properties of skewness
– Positive for positive skew (tail to the right)
– Negative for negative skew (tail to the left)
– Dimensionless
– Invariant to shifting or scaling data (adding a constant or multiplying by a positive constant; multiplying by a negative constant flips the sign)
Sample skewness $= \frac{N}{(N-1)(N-2)} \frac{\sum (X_i - \bar{X})^3}{s^3}$
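A minimal sketch of the sample skewness formula, with the bias-correction factor N / ((N - 1)(N - 2)) and the corrected sample standard deviation s. The function name and the two small data sets are hypothetical, chosen to show a right-tailed and a symmetric case.

```python
import math

def sample_skewness(scores):
    """Sample skewness: N / ((N-1)(N-2)) * sum((X - mean)^3) / s^3. Needs N >= 3."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    third = sum((x - mean) ** 3 for x in scores)
    return n / ((n - 1) * (n - 2)) * third / s ** 3

right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]  # hypothetical, long tail to the right
symmetric = [1, 2, 3, 4, 5]               # hypothetical, symmetric

skew_right = sample_skewness(right_tailed)
skew_sym = sample_skewness(symmetric)

# Shifting and positive scaling leave skewness unchanged; negating flips its sign.
skew_rescaled = sample_skewness([2 * x + 10 for x in right_tailed])
skew_negated = sample_skewness([-x for x in right_tailed])

print(skew_right > 0, skew_sym == 0)
print(math.isclose(skew_rescaled, skew_right), math.isclose(skew_negated, -skew_right))
```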
Dealing with Outliers
• Trimming:
– Throw out the top and bottom k% of values (k=5%, for example).
– May be justified if there is evidence for confounding process interfering with the dependent variable being studied
• Example: participant blinks during presentation of a visual stimulus
• Example: participant misunderstands a question on a questionnaire.
• Transforming
– Scores are transformed by some function (e.g., log, square root)
– Often done to reduce or eliminate skewness
Log-Transforming Data
[Figure: two histograms; panel labels: skewness = 0.67, skewness = 0.08]
End of Lecture 1
Sept 10, 2008
Kurtosis
kurtosis > 0: leptokurtic (Laplacian)
kurtosis = 0: mesokurtic (Gaussian)
kurtosis < 0: platykurtic
Sample kurtosis $= \frac{N(N+1)}{(N-1)(N-2)(N-3)} \frac{\sum (X_i - \bar{X})^4}{s^4} - \frac{3(N-1)^2}{(N-2)(N-3)}$
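The kurtosis formula can be sketched in the same way. The function name and the two data sets are hypothetical: evenly spread scores give a flat (platykurtic) shape, while scores bunched at the centre with a few extreme values give a peaked, heavy-tailed (leptokurtic) shape.

```python
import math

def sample_kurtosis(scores):
    """Sample (excess) kurtosis with small-sample corrections. Needs N >= 4."""
    n = len(scores)
    mean = sum(scores) / n
    s2 = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sample variance
    fourth = sum((x - mean) ** 4 for x in scores)
    first_term = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth / s2 ** 2
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first_term - correction

flat = [1, 2, 3, 4, 5]                       # hypothetical, evenly spread
peaked = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]   # hypothetical, heavy tails

kurt_flat = sample_kurtosis(flat)      # negative: platykurtic
kurt_peaked = sample_kurtosis(peaked)  # positive: leptokurtic

print(round(kurt_flat, 2), kurt_peaked > 0)
```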
Summary
• Measures of central tendency
• Measures of dispersion
• Skew
• Kurtosis