TRANSCRIPT
Statistical Distortion: Consequences of Data Cleaning
Data Cleaning & Integration, CompSci 590.01, Spring 2017
Junyang Gao, Amir Rahimzadeh Ilkhechi, Yuhao Wen
Some content is based on: Tamraparni Dasu’s DSAA Tutorial, 2016
Tamraparni Dasu, Ji Meng Loh. “Statistical Distortion: Consequences of Data Cleaning.” VLDB, 2012
Biggest take-away points?
(For us:)
● Cleaner data do not necessarily imply more useful or usable data
● In practice, a simple cleaning strategy may outperform a more sophisticated method whose assumptions do not fit the data
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Data Cleaning or Data Mangling?
● Changed the shape:
a. Most frequent values (Mode)
b. Least frequent values (Anomalies)
● Moved good values
● Turned good values into glitches
How to measure data cleaning strategies?
● Three-dimensional data quality metric:
a. Statistical Distortion
b. Glitch Improvement
c. Cost
Experimental Framework
1. Glitch Index
● Weighted sum
● The lower the glitch index, the “cleaner” the data set
Glitch vector example (1 = glitched value, 0 = clean):

Name | City | State
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  1
  0  |  0   |  1
  0  |  0   |  1
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  0
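A minimal sketch of the weighted-sum glitch index over glitch vectors like the ones above; the per-field weights here are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical example: one 0/1 glitch indicator per field
# (Name, City, State) for each of nine records.
glitch_matrix = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
])

# Assumed field weights (sum to 1); purely illustrative.
weights = np.array([0.4, 0.3, 0.3])

# Glitch index: weighted sum of glitch indicators, averaged over records.
# Lower index means a "cleaner" data set.
glitch_index = float((glitch_matrix @ weights).mean())
print(glitch_index)
```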
Experimental Framework
2. Statistical Distortion
Distance between two distributions
1. Kullback–Leibler “distance”
● P, Q are two probability distributions over the same event space
● D_KL(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) )
Distance between two distributions
1. Kullback–Leibler divergence
● D_KL(P ‖ Q) = H(P, Q) − H(P)
● H(P): entropy of P; H(P, Q): cross-entropy of P and Q
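A minimal sketch of the KL divergence for discrete distributions, assuming Q > 0 wherever P > 0:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) = sum_x P(x) * log(P(x)/Q(x)).

    Equals cross-entropy H(P, Q) minus entropy H(P). Assumes p and q are
    probability distributions over the same event space, with q > 0
    wherever p > 0 (terms with p == 0 contribute nothing).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.511
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Note that it is a “distance” only in quotes: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.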
Distance between two distributions
2. Jensen–Shannon divergence
● Symmetrized and smoothed version of the Kullback–Leibler divergence
● JSD(P ‖ Q) = ½ D_KL(P ‖ M) + ½ D_KL(Q ‖ M), where M = ½ (P + Q)
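A sketch of the Jensen–Shannon divergence built directly on the KL definition; unlike KL, it is symmetric in P and Q:

```python
import math

def kl(p, q):
    # D_KL(P || Q) for discrete distributions; terms with p == 0 drop out.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # JSD(P || Q) = 1/2 D_KL(P || M) + 1/2 D_KL(Q || M), with M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.5, 0.5], [0.9, 0.1]
print(js_divergence(p, q))  # ≈ 0.102, same as js_divergence(q, p)
```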
Distance between two distributions
3. Earth Mover’s distance
● Minimum cost of converting P to Q; an instance of the transportation problem
● EMD(P, Q) = ( Σ_{i,j} f_ij d_ij ) / ( Σ_{i,j} f_ij ), minimized over feasible flows f
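For 1-D histograms on a shared, equally spaced support, the minimum transportation cost reduces to the area between the two CDFs; a sketch of that special case (the general case needs the linear program):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same equally
    spaced 1-D bins: equals the area between the two cumulative
    distributions (special case of the transportation problem)."""
    emd = 0.0
    cdf_diff = 0.0
    for pi, qi in zip(p, q):
        cdf_diff += pi - qi   # running difference of the two CDFs
        emd += abs(cdf_diff)  # mass that must cross this bin boundary
    return emd

print(emd_1d([1.0, 0.0], [0.0, 1.0]))  # 1.0: move all mass one bin over
```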
Experimental Framework
3. Cost
● Highly context dependent
● In this paper: glitch-based (percentage of glitches removed)
Experimental Framework
The “best” strategy depends on the user’s tolerance along three dimensions:
● Statistical Distortion
● Glitch Improvement
● Cost
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Applicable to:
● Structured
● Hierarchical
● Spatio-temporal
● Unstructured data
A hierarchical network example
[Figure: tree of nodes, e.g. N_1 → N_13 → N_132]
A hierarchical network example (cont.)
● Each node measures v variables (time series)
● For node N_ijk: X^t represents the collected data at time t
● F_t: the history up to time t−1
● A windowed version of F_t represents the time-step history from t−⍵ up to time t−1
Glitch Types:
● Multitype
● Co-occurring
● Stand-alone
Glitch Detection
A glitch detector is a function of X^t
Missing values
Glitch Detection (cont.)
Glitch detector is a function of X^t
Inconsistent values
Glitch Detection (cont.)
Glitch detector is also a function of other parameters
Outlier values
Glitch Detection (cont.)
Glitch Matrix:
Glitch Index*
*1 ⨉ p is a typo in the paper
Statistical Distortion Measure (EMD)*
*Slides adapted from a presentation by Pete Barnum
Linear programming approach for EMD:
Linear programming approach for EMD (as an instance of transportation problem):
Linear programming approach for EMD (constraint 1):
Linear programming approach for EMD (constraint 2):
Linear programming approach for EMD (constraint 3):
Linear programming approach for EMD (constraint 4):
Final result for EMD:
● EMD(P, Q) = ( Σ_{i,j} f*_ij d_ij ) / ( Σ_{i,j} f*_ij ), where f* is the optimal flow
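The transportation LP above can be sketched with `scipy.optimize.linprog`; the supplies, demands, and ground-distance matrix below are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(wp, wq, dist):
    """Solve the EMD transportation LP.

    wp: supplies of P's m clusters; wq: demands of Q's n clusters;
    dist: m x n ground-distance matrix. Assumes sum(wp) == sum(wq),
    so the supply/demand constraints hold with equality.
    """
    m, n = dist.shape
    c = dist.ravel()  # minimize sum_ij f_ij * d_ij, f flattened row-major
    # Row constraints: sum_j f_ij = wp_i (all of each supply is shipped).
    A_rows = np.zeros((m, m * n))
    for i in range(m):
        A_rows[i, i * n:(i + 1) * n] = 1
    # Column constraints: sum_i f_ij = wq_j (each demand is filled).
    A_cols = np.zeros((n, m * n))
    for j in range(n):
        A_cols[j, j::n] = 1
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([wp, wq]), bounds=(0, None))
    return res.fun / np.sum(wp)  # final result: cost / total flow

wp = np.array([0.6, 0.4])
wq = np.array([0.5, 0.5])
dist = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
print(emd_lp(wp, wq, dist))  # minimum cost: move 0.1 mass one unit
```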
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Experiment
Dataset:
● 20,000 time series, length at most 170, 3 variables
Glitches
● Inconsistencies
○ A1 >= 0
○ 0 <= A3 <= 1
○ If A3 is missing, A1 should not be populated
● Outliers
● Missing values
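The three consistency rules above can be sketched as a per-record detector; the function name is illustrative, and `None`/NaN denotes a missing value:

```python
import math

def inconsistency_flags(a1, a3):
    """Flag the three consistency rules for one record.

    Returns a dict mapping each rule to True where it is violated:
    A1 must be >= 0, A3 must lie in [0, 1], and if A3 is missing
    then A1 should not be populated.
    """
    a1_missing = a1 is None or (isinstance(a1, float) and math.isnan(a1))
    a3_missing = a3 is None or (isinstance(a3, float) and math.isnan(a3))
    return {
        "a1_negative": (not a1_missing) and a1 < 0,
        "a3_out_of_range": (not a3_missing) and not (0 <= a3 <= 1),
        "a1_without_a3": a3_missing and not a1_missing,
    }

print(inconsistency_flags(-2.0, 0.5))  # flags a1_negative
print(inconsistency_flags(3.0, None))  # flags a1_without_a3
```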
All graph in the following slides are from T. Dasu and J. Loh. "Statistical Distortion: Consequences of Data Cleaning." VLDB 2012.
Experiment
Sampling
● From D & DI with replacement, 50 pairs in total
● Sample size: 100, 500 (no significant impact)
Factors concerned:
● Attribute transformations
● Strategies
● Cost
● Sample size
Experiment
Analysis
● Cleaning Strategies
● Studying Cost
● Data Transformation and Cleaning
● Strategies and Attribute Distributions
● Strategies Evaluation
● Cleaning Cost
Analysis - Cleaning Strategies
Applied 5 strategies to each of the 100 test pairs of data streams.
Strategy | Missing values                                 | Inconsistent values                            | Outliers
1        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | Winsorization by attribute
2        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | Ignore
3        | Ignore                                         | Ignore                                         | Winsorization by attribute
4        | Replace with mean attribute from ideal dataset | Replace with mean attribute from ideal dataset | Ignore
5        | Replace with mean attribute from ideal dataset | Replace with mean attribute from ideal dataset | Winsorization (outliers only)
Weight   | 0.25                                           | 0.25                                           | 0.5
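Several strategies above winsorize outliers on a per-attribute basis. A minimal sketch, assuming 5th/95th percentile cut-offs as an illustration (the slides do not specify the exact cut-offs):

```python
import numpy as np

def winsorize(x, lower=5, upper=95):
    """Winsorize one attribute: clamp values outside the [lower, upper]
    percentile range to those percentiles. The 5/95 defaults are an
    assumption, not taken from the paper."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

# Two extreme values get pulled in toward the bulk of the attribute.
x = np.array([-100.0, 1, 2, 3, 4, 5, 6, 7, 8, 1000.0])
print(winsorize(x))
```

Unlike trimming, winsorization keeps the record count unchanged, which matters when the glitch index is averaged over records.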
Analysis - Studying Cost
● Cost ~ proportion of the glitches cleaned
● Process:
○ Compute a normalized glitch score for each time series
○ Rank them
○ Clean the top x% (x=0: nothing cleaned; x=100: everything cleaned)
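The ranking procedure above can be sketched as follows; the per-series glitch scores here are randomly generated placeholders:

```python
import numpy as np

# Hypothetical normalized glitch score per time series.
rng = np.random.default_rng(0)
glitch_scores = rng.random(1000)

def top_x_percent(scores, x):
    """Indices of the top x% glitchiest series, ranked by score
    (x = 0: clean nothing; x = 100: clean everything)."""
    scores = np.asarray(scores)
    k = int(round(len(scores) * x / 100.0))
    return np.argsort(scores)[::-1][:k]  # highest glitch score first

print(len(top_x_percent(glitch_scores, 10)))  # 100 series selected
```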
Analysis - Data Transformation and Cleaning
Strategy: 1
Gray: imputed missing values
X=Y line: untouched data
Black dots: Winsorized values
Analysis - Strategies and Attribute Distributions
Analysis - Strategies Evaluation
Figure 6
Analysis - Strategies Evaluation
● Single cleaning method
○ SAS PROC MI
○ Mean
● Winsorization only
● Using two methods
○ Impute + Winsorize
○ Mean + Winsorize
Analysis - Cleaning Cost
Strategy: Imputation + Winsorization