data screening
DATA SCREENING
Wei-Jiun Shen, Ph.D.
Anything that can go wrong will go wrong
Why do we need to screen data?
Purpose
Detect and correct data errors
Detect and treat missing data
Detect and handle insufficiently sampled variables
Conduct transformations and standardizations
Detect and handle outliers
First concern
Accuracy of data file
  Descriptive statistics
  Graphic representations
Honest correlations
Missing data
  Pattern or amount
  Random or not
Outliers
MISSING DATA
The “blank” parts of a data set
Why is missing data a problem?
Systematic problem
  Biased sampling
    Demographic variables
  Inappropriate measurement procedure
    Behavioral items
Insufficient data for analysis
  Small sample
Misleading research results
  Biased data in, _______ out
Probability distribution of missingness
Consider the probability of missingness
  Are certain groups more likely to have missing values?
    E.g., are female respondents less likely to report their age?
  Are certain responses more likely to be missing?
    E.g., are respondents with high SPA less likely to report anxiety?
Certain analysis methods assume a certain probability distribution
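As a rough illustration of the first question above, the sketch below tabulates the share of missing values per group. The record layout, the field names (`sex`, `age`), and the toy values are hypothetical and are not data from this deck.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return {group: proportion of records whose value is missing (None)}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical toy data: None marks a respondent who did not report age.
records = [
    {"sex": "F", "age": None}, {"sex": "F", "age": 31},
    {"sex": "F", "age": None}, {"sex": "F", "age": 45},
    {"sex": "M", "age": 29}, {"sex": "M", "age": 52},
    {"sex": "M", "age": None}, {"sex": "M", "age": 38},
]
rates = missing_rate_by_group(records, "sex", "age")
print(rates)
```

A clear difference between the group rates suggests that missingness is related to group membership, which would argue against MCAR.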
Missing completely at random (MCAR)
Missingness is independent of every other measured variable (y2) and of the variable itself (y1).
E.g., SES = y2; depression = y1. If participants dropped out across the whole range of SES levels, then missingness on depression is independent of SES.
Little’s MCAR test in MVA (missing value analysis) indicates whether the data are MCAR (a nonsignificant result is desired).
Missing at random (MAR)
Missingness may depend on another measured variable (y2) but is independent of the variable itself (y1).
E.g., SES = y2; depression = y1. If only participants with high SES dropped out, then missingness on depression depends on SES.
MAR can be inferred if Little’s test is significant but missingness is predictable from other variables (other than the variable itself), as tested by the Separate Variance Test. MNAR (missing not at random) is indicated if this test shows that missingness is related to the DV.
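The SES/depression example can be sketched in code. This is not Little's MCAR test. It is a minimal stand-in for the separate-variance idea: a Welch t-test (via `scipy.stats.ttest_ind` with `equal_var=False`) comparing SES between cases with and without a depression score. The function name and the toy numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def separate_variance_test(predictor, target):
    """Welch t-test on `predictor`, splitting cases by whether `target` is missing."""
    predictor = np.asarray(predictor, dtype=float)
    target_missing = np.isnan(np.asarray(target, dtype=float))
    return stats.ttest_ind(
        predictor[target_missing],
        predictor[~target_missing],
        equal_var=False,  # separate (unpooled) variances
    )

# Hypothetical toy data: only high-SES participants lack a depression score.
ses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
depression = [12, 15, 11, 14, 13, 16, np.nan, np.nan, np.nan, np.nan]

result = separate_variance_test(ses, depression)
print(result.statistic, result.pvalue)
```

A significant result here means missingness on depression is predictable from SES, which is the pattern the slide describes for MAR.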
Treatment for missing data
Deleting cases or variables
  Descriptive statistics
Estimating missing data
Using a missing data correlation matrix
Treating missing data as data
Repeating analyses with and without missing data
Choosing among methods for dealing with missing data
  Pattern or amount
Deletion or preservation?
Deletion
  <5% missing
  MCAR/MAR
Preservation
  MNAR
  Small sample
Replacement
  Mean (grand or group)
  Regression (predict the missing value from the other IVs)
  Expectation maximization (form the missing data correlation matrix from an assumed distribution)
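The first two replacement options can be sketched with NumPy. This is a minimal illustration with one predictor and made-up numbers; the function names are hypothetical, and expectation maximization is not shown.

```python
import numpy as np

def mean_replacement(y):
    """Replace missing values (NaN) with the grand mean of the observed values."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)
    return filled

def regression_replacement(x, y):
    """Replace missing y values with predictions from a line fitted to complete cases."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    filled = y.copy()
    filled[~observed] = intercept + slope * x[~observed]
    return filled

# Hypothetical toy data: y is missing for two cases.
x = [1, 2, 3, 4, 5]
y = [2, 4, np.nan, 8, np.nan]

mean_filled = mean_replacement(y)
reg_filled = regression_replacement(x, y)
print(mean_filled)
print(reg_filled)
```

Mean replacement gives every missing case the same value and so shrinks the variance, while regression replacement uses the case's score on the predictor.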
OUTLIER
Cases with extreme values on variables
Why is outlier a problem?
Systematic problem
  Biased sampling
  Wrong population
Statistical problem
  ↑ error variance
  ↓ statistical power
  ↑ Type I and Type II error
  ↓ normality
Misleading research results
  Biased data in, _______ out
Influence of outlier
Influence = leverage × discrepancy
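One common way to put numbers on these ideas, not named in this deck, is to take leverage from the diagonal of the regression hat matrix, discrepancy from the residual, and influence from Cook's distance, which combines the two. The sketch below assumes a simple one-predictor regression and hypothetical toy data.

```python
import numpy as np

def leverage_and_influence(x, y):
    """Return hat-matrix leverage, residuals, and Cook's distance for y ~ x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept + predictor
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    residuals = y - hat @ y
    n, p = design.shape
    mse = residuals @ residuals / (n - p)
    # Cook's distance grows with both the residual and the leverage.
    cooks_d = (residuals**2 / (p * mse)) * leverage / (1 - leverage) ** 2
    return leverage, residuals, cooks_d

# Hypothetical toy data: the last case is far from the others on x
# and does not follow the y = x trend of the first five cases.
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 5]

leverage, residuals, cooks_d = leverage_and_influence(x, y)
print(np.round(leverage, 3))
print(np.round(cooks_d, 3))
```

In this toy example the last case pulls the fitted line toward itself, so its residual is small even though its influence is by far the largest.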
Treatment for outlier
Detecting outliers
  Standardized score (|z| > 2, 2.5, or 3)
  Graphical methods (P-P, Q-Q plot)
  Mahalanobis distance (χ² test)
Deletion or transformation
  Critical to analysis or not
  Preservation
Transformation
  Score alteration
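The two numeric detection methods above can be sketched as follows. The |z| > 3 cutoff is one of the three listed on the slide. The p < .001 criterion for the χ² test of Mahalanobis distance is an assumed convention, not something this deck states. Function names and data are hypothetical.

```python
import numpy as np
from scipy import stats

def zscore_outliers(values, cutoff=3.0):
    """Indices of cases whose |z| exceeds the cutoff (univariate outliers)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.flatnonzero(np.abs(z) > cutoff)

def mahalanobis_outliers(data, alpha=0.001):
    """Indices of cases whose squared Mahalanobis distance exceeds the
    chi-square critical value (df = number of variables), plus the distances."""
    data = np.asarray(data, dtype=float)
    diff = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    cutoff = stats.chi2.ppf(1 - alpha, df=data.shape[1])
    return np.flatnonzero(d2 > cutoff), d2

# Hypothetical univariate scores: the last value is extreme.
scores = [10, 11, 9, 10, 12, 10, 11, 9, 10, 8,
          10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
z_flags = zscore_outliers(scores)

# Hypothetical bivariate data: 20 cases near the line x2 = x1, plus one case
# that is ordinary on each variable alone but far from the joint pattern.
x1 = np.arange(20.0)
x2 = x1 + np.where(x1 % 2 == 0, 0.5, -0.5)
data = np.vstack([np.column_stack([x1, x2]), [2.0, 17.0]])
mahal_flags, d2 = mahalanobis_outliers(data)

print(z_flags, mahal_flags)
```

The last bivariate case would pass a z-score screen on either variable, which is why a multivariate check is listed separately.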
NORMALITY, LINEARITY & HOMOSCEDASTICITY
Basic assumptions
Key assumptions in GLM
Normality
Linearity
Homogeneity of variance
Interval level data
Independence of observations
Normality
Normal distribution
Test for normality
Skewness & Kurtosis
Test for normality
t-test for skewness & kurtosis scores
Kolmogorov-Smirnov test & Shapiro-Wilk test
Test statistics: Z, W
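A rough sketch of these checks with SciPy follows. The standard errors sqrt(6/N) for skewness and sqrt(24/N) for kurtosis are large-sample approximations assumed here. The Kolmogorov-Smirnov p-value is only approximate because the mean and SD are estimated from the same data. The function name and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def normality_summary(values):
    """Skewness/kurtosis z-scores plus Shapiro-Wilk and Kolmogorov-Smirnov results."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Statistic divided by an approximate standard error (assumption: large N).
    z_skew = stats.skew(values) / np.sqrt(6.0 / n)
    z_kurt = stats.kurtosis(values) / np.sqrt(24.0 / n)  # excess kurtosis
    shapiro_w, shapiro_p = stats.shapiro(values)
    # K-S against a standard normal after standardizing the scores.
    standardized = (values - values.mean()) / values.std(ddof=1)
    ks_stat, ks_p = stats.kstest(standardized, "norm")
    return {
        "z_skew": z_skew, "z_kurt": z_kurt,
        "shapiro_w": shapiro_w, "shapiro_p": shapiro_p,
        "ks_stat": ks_stat, "ks_p": ks_p,
    }

# Hypothetical, strongly right-skewed scores.
skewed = np.exp(np.linspace(0, 4, 60))
summary = normality_summary(skewed)
for name, value in summary.items():
    print(name, round(float(value), 4))
```

A skewness z-score beyond about ±1.96 or a significant Shapiro-Wilk result both point to a departure from normality.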
Test for normality
Plotting cumulative distribution function
Test for normality
P-P plot (probability) & Q-Q plot (quantile)
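The points of a normal Q-Q plot can be computed with `scipy.stats.probplot`, which also returns the correlation between the ordered scores and the theoretical quantiles. This is a minimal sketch on synthetic data; drawing the plot is left out, and the helper name is hypothetical.

```python
import numpy as np
from scipy import stats

def qq_points(values):
    """Return theoretical quantiles, ordered scores, and their correlation r."""
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    return theoretical, ordered, r

# Synthetic data: evenly spaced normal quantiles, and a skewed version of them.
near_normal = stats.norm.ppf((np.arange(1, 51) - 0.5) / 50)
skewed = np.exp(near_normal)

_, _, r_normal = qq_points(near_normal)
_, _, r_skewed = qq_points(skewed)
print(round(r_normal, 4), round(r_skewed, 4))
```

Points that fall close to a straight line (r near 1) are consistent with normality; the skewed scores bend away from the line and give a lower r.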
Linearity
Straight-line relationship between 2 variables
Homoscedasticity
Homogeneity of variance
Homogeneity of variance-covariance matrix
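Homogeneity of variance across groups is often checked with Levene's test. The deck does not name a specific test, so this choice is an assumption, as are the two made-up groups below.

```python
from scipy import stats

# Hypothetical scores: similar centers, very different spreads.
group_a = [10, 11, 9, 10, 12, 10, 11, 9]
group_b = [2, 18, 5, 15, 1, 19, 8, 12]

# scipy's default centers each group on its median (Brown-Forsythe variant).
statistic, p_value = stats.levene(group_a, group_b)
variances_differ = p_value < 0.05
print(round(statistic, 2), round(p_value, 4), variances_differ)
```

A significant result indicates that the homogeneity of variance assumption is violated for these groups.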
Homoscedasticity
Residual
COMMON DATA TRANSFORMATIONS
Data transformations
Direction   Skewness                Treatment
+           Moderate                New X = SQRT(X)
+           Substantial             New X = LG10(X)
+           Substantial with zero   New X = LG10(X + C)
+           Severe                  New X = 1/X
+           L-shaped with zero      New X = 1/(X + C)
-           Moderate                New X = SQRT(K - X)
-           Substantial             New X = LG10(K - X)
-           J-shaped                New X = 1/(K - X)
C = a constant added to each score so that the smallest score is 1.
K = a constant from which each score is subtracted so that the smallest score is 1; usually equal to the largest score + 1.
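The transformations in the table can be written with NumPy as a minimal sketch. The function and parameter names (`transform`, `direction`, `severity`, `shift`) are hypothetical; C and K follow the definitions given above.

```python
import numpy as np

def transform(x, direction="+", severity="moderate", shift=False):
    """Apply a skewness transformation.

    direction: "+" for positive skew, "-" for negative skew (reflect with K - X).
    severity:  "moderate" (square root), "substantial" (log10), "severe" (inverse).
    shift:     for positive skew with zeros, add C so the smallest score is 1.
    """
    x = np.asarray(x, dtype=float)
    if direction == "-":
        k = x.max() + 1.0          # K = largest score + 1
        base = k - x
    elif shift:
        c = 1.0 - x.min()          # C makes the smallest score equal to 1
        base = x + c
    else:
        base = x
    if severity == "moderate":
        return np.sqrt(base)
    if severity == "substantial":
        return np.log10(base)
    if severity == "severe":
        return 1.0 / base
    raise ValueError("severity must be 'moderate', 'substantial', or 'severe'")

# Hypothetical scores.
moderate_pos = transform([1, 4, 9])
log_with_zero = transform([0, 9, 99], severity="substantial", shift=True)
inverse_neg = transform([1, 5, 10], direction="-", severity="severe")
print(moderate_pos, log_with_zero, inverse_neg)
```

Reflecting with K - X reverses the order of the scores, so interpretation of the transformed variable has to be reversed as well.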
PRACTICE
Check list
Descriptive statistics
  Range
  Mean & SD
  Skewness & kurtosis
Missing data (missing value analysis)
Normal distribution
  Kolmogorov-Smirnov test (n > 50)
  Shapiro-Wilk test (n < 50)
  Skewness & kurtosis
  P-P plot
Outliers (univariate/multivariate: z-score/Mahalanobis distance)
Linearity
Homoscedasticity
Multicollinearity
Report
Try