TRANSCRIPT
Statistical Distortion: Consequences of Data Cleaning
Data Cleaning & Integration, CompSci 590.01, Spring 2017
Junyang Gao, Amir Rahimzadeh Ilkhechi, Yuhao Wen
Some content is based on: Tamraparni Dasu’s DSAA Tutorial, 2016
Tamraparni Dasu, Ji Meng Loh. “Statistical Distortion: Consequences of Data Cleaning.” VLDB, 2012
Biggest take-away points?
(For us:)
● Cleaner data do not necessarily imply more useful or usable data
● In practice, a simple cleaning strategy may outperform a more sophisticated method whose assumptions do not fit the data
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Data Cleaning or Data Mangling?
● Changed the shape:
a. Most frequent values (Mode)
b. Least frequent values (Anomalies)
● Moved good values
● Turned good values into glitches
How to measure data cleaning strategies?
● Three-dimensional data quality metric:
a. Statistical Distortion
b. Glitch Improvement
c. Cost
Experimental Framework
1. Glitch Index
● Weighted sum
● The lower the glitch index, the “cleaner” the data set
Glitch vector example (1 = glitched value, 0 = clean):

Name | City | State
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  1
  0  |  0   |  1
  0  |  0   |  1
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  0
  0  |  0   |  0
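A minimal sketch of the weighted-sum glitch index over glitch vectors like the ones above; the per-field weights here are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical example: one 0/1 glitch indicator per field
# (Name, City, State) for each of nine records.
glitch_matrix = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
])

# Assumed field weights (sum to 1); purely illustrative.
weights = np.array([0.4, 0.3, 0.3])

# Glitch index: weighted sum of glitch indicators, averaged over records.
# Lower index means a "cleaner" data set.
glitch_index = float((glitch_matrix @ weights).mean())
print(glitch_index)
```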
Experimental Framework
2. Statistical Distortion
Distance between two distributions
1. Kullback–Leibler “distance”
● P, Q are two probability distributions over the same event space
● D_KL(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) )
Distance between two distributions
1. Kullback–Leibler divergence
● D_KL(P ‖ Q) = H(P, Q) − H(P)
● H(P): entropy of P; H(P, Q): cross-entropy of P and Q
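A minimal sketch of the KL divergence for discrete distributions, assuming Q > 0 wherever P > 0:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) = sum_x P(x) * log(P(x)/Q(x)).

    Equals cross-entropy H(P, Q) minus entropy H(P). Assumes p and q are
    probability distributions over the same event space, with q > 0
    wherever p > 0 (terms with p == 0 contribute nothing).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.511
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Note that it is a “distance” only in quotes: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.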
Distance between two distributions
2. Jensen–Shannon divergence
● Symmetrized and smoothed version of the Kullback–Leibler divergence
● JSD(P ‖ Q) = ½ D_KL(P ‖ M) + ½ D_KL(Q ‖ M), where M = ½ (P + Q)
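A sketch of the Jensen–Shannon divergence built directly on the KL definition; unlike KL, it is symmetric in P and Q:

```python
import math

def kl(p, q):
    # D_KL(P || Q) for discrete distributions; terms with p == 0 drop out.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # JSD(P || Q) = 1/2 D_KL(P || M) + 1/2 D_KL(Q || M), with M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.5, 0.5], [0.9, 0.1]
print(js_divergence(p, q))  # ≈ 0.102, same as js_divergence(q, p)
```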
Distance between two distributions
3. Earth Mover’s distance
● Minimum cost of converting P to Q; an instance of the transportation problem
● EMD(P, Q) = ( Σ_{i,j} f_ij d_ij ) / ( Σ_{i,j} f_ij ), minimized over feasible flows f
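For 1-D histograms on a shared, equally spaced support, the minimum transportation cost reduces to the area between the two CDFs; a sketch of that special case (the general case needs the linear program):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same equally
    spaced 1-D bins: equals the area between the two cumulative
    distributions (special case of the transportation problem)."""
    emd = 0.0
    cdf_diff = 0.0
    for pi, qi in zip(p, q):
        cdf_diff += pi - qi   # running difference of the two CDFs
        emd += abs(cdf_diff)  # mass that must cross this bin boundary
    return emd

print(emd_1d([1.0, 0.0], [0.0, 1.0]))  # 1.0: move all mass one bin over
```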
Experimental Framework
3. Cost
● Highly context dependent
● In this paper: glitch-based (percentage of glitches removed)
Experimental Framework
The “best” strategy depends on the user’s tolerance along three dimensions:
● Statistical Distortion
● Glitch Improvement
● Cost
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Applicable to:
● Structured
● Hierarchical
● Spatio-temporal
● Unstructured data
A hierarchical network example
[Figure: tree of nodes, e.g. N_1 → N_13 → N_132]
A hierarchical network example (cont.)
● Each node measures v variables (time series)
● For node N_ijk: X^t represents the collected data at time t
● F_t: the history up to time t−1
● A windowed version of F_t represents the time-step history from t−⍵ up to time t−1
Glitch Types:
● Multitype
● Co-occurring
● Stand-alone
Glitch Detection
A glitch detector is a function of X^t
Missing values
Glitch Detection (cont.)
Glitch detector is a function of X^t
Inconsistent values
Glitch Detection (cont.)
Glitch detector is also a function of other parameters
Outlier values
Glitch Detection (cont.)
Glitch Matrix:
Glitch Index*
*1 ⨉ p is a typo in the paper
Statistical Distortion Measure (EMD)*
*Slides adapted from a presentation by Pete Barnum
Linear programming approach for EMD:
Linear programming approach for EMD (as an instance of transportation problem):
Linear programming approach for EMD (constraint 1):
Linear programming approach for EMD (constraint 2):
Linear programming approach for EMD (constraint 3):
Linear programming approach for EMD (constraint 4):
Final result for EMD:
● EMD(P, Q) = ( Σ_{i,j} f*_ij d_ij ) / ( Σ_{i,j} f*_ij ), where f* is the optimal flow
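The transportation LP above can be sketched with `scipy.optimize.linprog`; the supplies, demands, and ground-distance matrix below are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(wp, wq, dist):
    """Solve the EMD transportation LP.

    wp: supplies of P's m clusters; wq: demands of Q's n clusters;
    dist: m x n ground-distance matrix. Assumes sum(wp) == sum(wq),
    so the supply/demand constraints hold with equality.
    """
    m, n = dist.shape
    c = dist.ravel()  # minimize sum_ij f_ij * d_ij, f flattened row-major
    # Row constraints: sum_j f_ij = wp_i (all of each supply is shipped).
    A_rows = np.zeros((m, m * n))
    for i in range(m):
        A_rows[i, i * n:(i + 1) * n] = 1
    # Column constraints: sum_i f_ij = wq_j (each demand is filled).
    A_cols = np.zeros((n, m * n))
    for j in range(n):
        A_cols[j, j::n] = 1
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([wp, wq]), bounds=(0, None))
    return res.fun / np.sum(wp)  # final result: cost / total flow

wp = np.array([0.6, 0.4])
wq = np.array([0.5, 0.5])
dist = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
print(emd_lp(wp, wq, dist))  # minimum cost: move 0.1 mass one unit
```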
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Experiment
Dataset:
● 20,000 time series, length at most 170, 3 variables
Glitches
● Inconsistencies
○ A1 >= 0
○ 0 <= A3 <= 1
○ If A3 is missing, A1 should not be populated
● Outliers
● Missing values
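The three consistency rules above can be sketched as a per-record detector; the function name is illustrative, and `None`/NaN denotes a missing value:

```python
import math

def inconsistency_flags(a1, a3):
    """Flag the three consistency rules for one record.

    Returns a dict mapping each rule to True where it is violated:
    A1 must be >= 0, A3 must lie in [0, 1], and if A3 is missing
    then A1 should not be populated.
    """
    a1_missing = a1 is None or (isinstance(a1, float) and math.isnan(a1))
    a3_missing = a3 is None or (isinstance(a3, float) and math.isnan(a3))
    return {
        "a1_negative": (not a1_missing) and a1 < 0,
        "a3_out_of_range": (not a3_missing) and not (0 <= a3 <= 1),
        "a1_without_a3": a3_missing and not a1_missing,
    }

print(inconsistency_flags(-2.0, 0.5))  # flags a1_negative
print(inconsistency_flags(3.0, None))  # flags a1_without_a3
```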
All graph in the following slides are from T. Dasu and J. Loh. "Statistical Distortion: Consequences of Data Cleaning." VLDB 2012.
Experiment
Sampling
● From D & DI with replacement, 50 pairs in total
● Sample size: 100, 500 (no significant impact)
Factors concerned:
● Attribute transformations
● Strategies
● Cost
● Sample size
Experiment
Analysis
● Cleaning Strategies
● Studying Cost
● Data Transformation and Cleaning
● Strategies and Attribute Distributions
● Strategies Evaluation
● Cleaning Cost
Analysis - Cleaning Strategies
Applied 5 strategies to each of the 100 test pairs of data streams.
Strategy | Missing values                                 | Inconsistent values                            | Outliers
1        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | Winsorization by attribute
2        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | Ignore
3        | Ignore                                         | Ignore                                         | Winsorization by attribute
4        | Replace with mean attribute from ideal dataset | Replace with mean attribute from ideal dataset | Ignore
5        | Replace with mean attribute from ideal dataset | Replace with mean attribute from ideal dataset | Winsorization (outliers only)
Weight   | 0.25                                           | 0.25                                           | 0.5
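Several strategies above winsorize outliers on a per-attribute basis. A minimal sketch, assuming 5th/95th percentile cut-offs as an illustration (the slides do not specify the exact cut-offs):

```python
import numpy as np

def winsorize(x, lower=5, upper=95):
    """Winsorize one attribute: clamp values outside the [lower, upper]
    percentile range to those percentiles. The 5/95 defaults are an
    assumption, not taken from the paper."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower, upper])
    return np.clip(x, lo, hi)

# Two extreme values get pulled in toward the bulk of the attribute.
x = np.array([-100.0, 1, 2, 3, 4, 5, 6, 7, 8, 1000.0])
print(winsorize(x))
```

Unlike trimming, winsorization keeps the record count unchanged, which matters when the glitch index is averaged over records.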
Analysis - Studying Cost
● Cost ~ proportion of the glitches cleaned
● Process:
○ Compute a normalized glitch score for each time series
○ Rank them
○ Clean the top x% (x=0: nothing cleaned; x=100: everything cleaned)
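The ranking procedure above can be sketched as follows; the per-series glitch scores here are randomly generated placeholders:

```python
import numpy as np

# Hypothetical normalized glitch score per time series.
rng = np.random.default_rng(0)
glitch_scores = rng.random(1000)

def top_x_percent(scores, x):
    """Indices of the top x% glitchiest series, ranked by score
    (x = 0: clean nothing; x = 100: clean everything)."""
    scores = np.asarray(scores)
    k = int(round(len(scores) * x / 100.0))
    return np.argsort(scores)[::-1][:k]  # highest glitch score first

print(len(top_x_percent(glitch_scores, 10)))  # 100 series selected
```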
Analysis - Data Transformation and Cleaning
Strategy: 1
Gray: imputed missing values
X=Y line: untouched data
Black dots: Winsorized values
Analysis - Strategies and Attribute Distributions
Analysis - Strategies Evaluation
Figure 6
Analysis - Strategies Evaluation
● Single cleaning method
○ SAS PROC MI
○ Mean
● Winsorization only
● Using two methods
○ Impute + Winsorize
○ Mean + Winsorize
Analysis - Cleaning Cost
Strategy: Imputation + Winsorization