Time Series Data Cleaning · 2020-06-03
TRANSCRIPT
Time Series Data Cleaning
Shaoxu Song
http://ise.thss.tsinghua.edu.cn/sxsong/
Dirty Time Series Data
• Unreliable Readings
– Sensor monitoring
– GPS trajectory
Figure 8: Inaccurate GPS points (a) in rivers, (b) in the ocean, and (c) outside North America.
forgot to log the end of a trip after dropping off passengers. Nevertheless, they certainly affect further analysis on the data, such as data-based human mobility models [42].
In the 2010 taxi dataset, for the month of May, there were 7.1 million ghost trips. Given the 154 million trips that took place that month, this corresponds to an error rate of about 4.60%. To better understand which of the overlapping trips are defective, we would need domain knowledge from expert users and TLC to perform data cleaning: all the trips or just a subset may be erroneous. The number of ghost trips is much smaller for the 2011 dataset: the error rate is only 0.20%. Since the taxi dataset for 2011 has considerably fewer invalid values compared to 2010, as described in Section 2.1, one possible explanation is that different cleaning procedures were used for these two years, and inconsistencies such as ghost trips were removed before the release of the 2011 dataset.
4 Discussion
In this paper, we discussed some of the challenges involved in cleaning spatio-temporal urban data. We presented a series of case studies using the NYC taxi data that illustrate data cleaning challenges and suggested potential methodologies to address these challenges. These methodologies form the basis for integrating cleaning with data exploration. Data cleaning is necessary for data exploration, and through data exploration, users can attain a better understanding of the data which can lead to the discovery of cleaning constraints and enable them to discern between errors and features. Data exploration, however, requires a complex trial-and-error process. Thus, usable tools are needed to guide and assist users in the cleaning process. As the case studies we discussed illustrate, this is particularly true for spatio-temporal data, where visual analytics and event detection techniques at different resolutions are essential to identify quality issues.
The case studies presented in Section 3 show that some cleaning decisions are not clear cut. Often, multiple datasets are required to help an expert decide whether a data point is erroneous or represents an important feature. While there has been preliminary work on the discovery of relationships across datasets [8], there are still many open problems in identifying relevant data that can be used to explain events within a large collection of datasets and in a systematic fashion.
Lack of sufficient knowledge is another issue that hampers data cleaning. Even though experts can (and should) be involved in most of the process, they may be unavailable, or it may be expensive to hire them for cleaning large datasets. Crowdsourcing systems could help the data analyst clean data more efficiently: user
J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K. Zhao: Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39(2): 63-77 (2016)
Dirty Time Series Data
• Misuse
• Flight: Accuracy of Travelocity is 0.95
• Stock: Accuracy of Stock in Yahoo! Finance is 0.93
Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 6(2) (2013)
[Figure: example time series (values 0-3 over points 1-31); series: Observation, Smooth, Truth]
Existing Cleaning Methods
• Smoothing Filter
– Moving Average (MA)
– Weighted Moving Average (WMA)
– Exponentially Weighted Moving Average (EWMA)
• Problem: they modify almost all the data values, not only the erroneous ones
E. S. Gardner Jr. Exponential smoothing: The state of the art, Part II. International Journal of Forecasting, 22(4):637-666, 2006.
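As a hedged sketch of the EWMA filter listed above (the smoothing factor `alpha` is an assumed parameter, not a value from the lecture), note that every point after the first is altered, which is exactly the problem this slide flags:

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    smoothed = [values[0]]  # initialise with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [1.0, 1.1, 5.0, 1.2, 1.3]  # a single spike at index 2
print(ewma(series))  # every value after the first is modified, not just the spike
```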
[Figure: example time series (values 0-3 over points 1-31); series: Observation, AR, Truth]
Existing Cleaning Methods
• Prediction Model
– Modify the observation by the prediction if the prediction is far distant from the observation
– autoregressive (AR) model
– AR(I)MA
• May over-change the data
– owing to the loosely defined notion of "far distant"
Yamanishi, Kenji, and Jun-ichi Takeuchi. "A unifying framework for detecting outliers and change points from non-stationary time series data." In SIGKDD, pages 676-681, 2002
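A minimal sketch of the prediction-model idea above, assuming an AR(1) with coefficient `phi` and a hand-picked `threshold` (both illustrative, not from the cited papers):

```python
def ar1_repair(values, phi=1.0, threshold=2.0):
    """Replace an observation by the AR(1) prediction from the last repaired
    value whenever it is 'far distant' (difference above threshold)."""
    repaired = [values[0]]
    for x in values[1:]:
        pred = phi * repaired[-1]  # one-step AR(1) prediction
        repaired.append(pred if abs(x - pred) > threshold else x)
    return repaired

print(ar1_repair([1.0, 1.1, 9.0, 1.2, 1.3]))
```

With a too-tight threshold the same routine over-changes: nearly every point is replaced by its prediction, illustrating the "owing to far distant" concern.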
Repairing Dirty Data Helps
• Time series classification
[Figure: classification accuracy (y-axis, 0-1) per dataset: Synthetic Control, GunPoint, CBF, FaceAll, OSULeaf, SwedishLeaf, 50words, Trace, TwoPatterns, wafer; methods compared: Clean, Dirty, IMR, SCREEN, EWMA, ARX, AR]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Intuition on Speed Constraints
• "Jumps" in values are often constrained
– Daily limit: in financial and commodity markets
– Temperatures in a week
– Fuel consumption
• Use speed constraints to identify dirty data
SCREEN
• Given
– Time series x = {x_1, x_2, ...}
– Speed constraint s = (s_min, s_max) on min/max speeds
• Find a repair x' of x
– Constraint satisfaction: for all i, j with 0 ≤ t_j − t_i ≤ w,
s_min ≤ (x'_j − x'_i) / (t_j − t_i) ≤ s_max
– Change minimization: Σ_{x_i ∈ x} |x_i − x'_i| is minimized
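The constraint-satisfaction part can be checked directly from the definition; a small sketch (the window `w` and speed bounds are parameters of the speed constraint, and points are assumed sorted by timestamp):

```python
def speed_violations(points, s_min, s_max, w):
    """Report index pairs (i, j), with 0 < t_j - t_i <= w, whose speed
    (x_j - x_i) / (t_j - t_i) falls outside [s_min, s_max]."""
    bad = []
    for i, (ti, xi) in enumerate(points):
        for j in range(i + 1, len(points)):
            tj, xj = points[j]
            if tj - ti > w:
                break  # points are in time order; later ones are farther away
            speed = (xj - xi) / (tj - ti)
            if not (s_min <= speed <= s_max):
                bad.append((i, j))
    return bad

pts = [(1, 1.0), (2, 1.1), (3, 9.0), (4, 1.2)]  # (timestamp, value); spike at t=3
print(speed_violations(pts, s_min=-1.0, s_max=1.0, w=2))
```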
Stream Data Cleaning under Speed Constraints
[Figure: Dirty, Smooth, and SCREEN repairs (values 0-3 over points 1-31)]
[Figure: repair illustration over time/value axes: within window w, the repair x'_i of x_i at time t_i is bounded below by x'_k + s_min(t_i − t_k) and above by x'_k + s_max(t_i − t_k), for a preceding repaired point x'_k at time t_k]
Employ Existing Repairing Approach
• Holistic algorithm
– Repairing relational data
– Under denial constraints
• Adaptation
– Time series as a relation
– Roughly express speed constraints as denial constraints
• Problem
– High computational costs
– Not guaranteed to eliminate all violations
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458-469, 2013.
ID Timestamp Value
1  1  1.13
2  2  1.24
3  3  1.19
4  4  1.3
¬(t_j < t_i + w ∧ x_j > x_i + (t_j − t_i) · s_max)
¬(t_j < t_i + w ∧ x_j < x_i + (t_j − t_i) · s_min)
A Lightweight Weapon
• Unlike the NP-hard problems in most data repairing scenarios
• The speed constraint-based repairing can be solved
– as an LP problem, in polynomial time
– considering the entire sequence as a whole (global optimum)
• Online computing, over streaming data
– Consider the local optimum in the current sliding window
– Using the Median Principle, in O(nw) time
[Figure: repair cost (0-5) and the median principle: the local repair of x_k considers the candidates x_i + s_min(t_k − t_i) and x_i + s_max(t_k − t_i) derived from points x_i in the window w]
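A hedged, simplified sketch of the median-principle repair: the published algorithm takes candidates from all points in the window w, while this version uses only the immediately preceding repaired point:

```python
def screen_local(points, s_min, s_max):
    """Median-principle repair, simplified to use only the immediately
    preceding repaired point: take the median of the observation and the
    speed bounds it must satisfy."""
    repaired = [points[0]]
    for t, x in points[1:]:
        tp, xp = repaired[-1]
        lo = xp + s_min * (t - tp)  # lowest value reachable at speed s_min
        hi = xp + s_max * (t - tp)  # highest value reachable at speed s_max
        repaired.append((t, sorted([x, lo, hi])[1]))  # median clamps x into [lo, hi]
    return repaired

pts = [(1, 1.0), (2, 1.1), (3, 9.0), (4, 1.2)]
print(screen_local(pts, s_min=-1.0, s_max=1.0))
```

The spike is pulled down to the maximum value the speed constraint allows, matching the "modify to max/min values allowed" behaviour noted later for large spikes.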
Effectiveness and Efficiency
• Global: the highest accuracy
• Local: much faster than Holistic
• Trade-off
[Figures: RMS error vs. error rate (0.05-0.45), and time cost in seconds (log scale) vs. error rate; methods: Global, Local, Holistic, EWMA]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
[Figure: Observation, SCREEN, and Truth (values 365-385 over time 1540-1575)]
Further Issue
• Speed constraint-based method
– Large spike errors: modified to the max/min values allowed
– Small errors: fail to be identified
Intuition on Speed Change
• Consider the likelihood of speeds within the allowed range
• No clear distribution pattern is observed on speeds
• Interesting pattern on speed changes in consecutive points
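Computing speeds and speed changes from consecutive points is straightforward; this sketch reproduces the deck's example (v1 = 3, v2 = −1, so u2 = v2 − v1 = −4):

```python
def speed_changes(points):
    """Speeds v_i between consecutive (timestamp, value) points, and the
    changes u_i = v_i - v_{i-1} between consecutive speeds."""
    speeds = [(x2 - x1) / (t2 - t1)
              for (t1, x1), (t2, x2) in zip(points, points[1:])]
    return [v2 - v1 for v1, v2 in zip(speeds, speeds[1:])]

pts = [(1, 0.0), (2, 3.0), (3, 2.0)]  # v1 = 3, v2 = -1, as in the slide
print(speed_changes(pts))  # u_2 = v_2 - v_1 = -4.0
```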
[Figures: probability distribution of speeds (m/s), showing no clear pattern, and a time/value sketch of repairs x'_i within the range implied by s_min and s_max]
[Figure: probability distribution of speed changes, concentrated around 0, with p.m.f. and fitted p.d.f.; example: speeds v_1 = 3 and v_2 = −1 give speed change u_2 = v_2 − v_1 = −4]
Statistical Approach
• Calculate the likelihood of a sequence w.r.t. the speed change
– employ the probability distribution of speed changes
• The cleaning problem is thus to find a repaired sequence with the maximum likelihood w.r.t. speed changes
– instead of the minimum change towards speed constraint satisfaction
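To make the likelihood concrete, a sketch that scores a series under an assumed Normal(0, sigma) model of speed changes; the papers learn the distribution from data, so the Gaussian and `sigma` here are illustrative stand-ins:

```python
import math

def log_likelihood(points, sigma=1.0):
    """Log-likelihood of a (timestamp, value) series under an assumed
    Normal(0, sigma) model of speed changes."""
    speeds = [(x2 - x1) / (t2 - t1)
              for (t1, x1), (t2, x2) in zip(points, points[1:])]
    ll = 0.0
    for v1, v2 in zip(speeds, speeds[1:]):
        u = v2 - v1  # speed change
        ll += -0.5 * (u / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return ll

clean = [(i, float(i)) for i in range(5)]                    # constant speed
dirty = [(0, 0.0), (1, 1.0), (2, 5.0), (3, 3.0), (4, 4.0)]  # spike at t=2
print(log_likelihood(clean) > log_likelihood(dirty))  # the spiky series is less likely
```

A maximum-likelihood repair then searches, within the cost budget, for the repaired sequence that maximizes this score.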
[Figure: Observation, SCREEN, Likelihood, and Truth (values 365-385 over time 1540-1575)]
Maximum Likelihood Repair Problem
• Given
– Time series x
– Repair cost budget δ
– Distribution of speed changes
• Find a repair x' of x such that
– Δ(x, x') ≤ δ
– the likelihood L(x') is maximized
• NP-hard
• Pseudo-polynomial time solvable
– DP, dynamic programming: O(n·θ_max²·δ), exact
– DPC, constant-factor approximation: O(n·θ_max²), for large budgets
– DPL, linear-time heuristic: fast, but with higher error
– QP, quadratic programming: on an approximate distribution
– SG, simple greedy: O(max(n, δ)), fastest
Effectiveness and Efficiency
• Significantly better accuracy than SCREEN
• SG is efficient, comparable to SCREEN, and still with better accuracy
[Figures: RMS error vs. data size (1000-13000), and time cost in seconds (log scale) vs. data size; methods: DP, DPC, DPL, SG, SCREEN, QP]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Consecutive Errors
• Speed constraints handle "spike" errors well, but not consecutive errors
[Figure: Truth, Observation, and SCREEN repairs (values 23.5-27.5 over time 360-390)]
Intuition
• Supervised by labeled truth of errors
• Labeling by user
– Check-in
• Labeling by machine
– precise equipment reports accurate air quality data in a relatively long sensing period
– crowd and participatory sensing generates unreliable observations in a constant manner
[Figure: check-in example with an erroneous location and its labeled truth]
Y. Zheng, F. Liu, and H. Hsieh. U-air: when urban air quality inference meets big data. In KDD, pages 1436–1444, 2013.
Approach
• Instead of directly modeling the values
– by an AR (autoregression) model, ignoring erroneous observations
• We model and predict the difference between errors and their corresponding labeled truths
– by an ARX model (autoregressive model with exogenous inputs)
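A hedged sketch of the "model the difference, not the value" idea: an assumed AR(1) with coefficient `phi` propagates labeled differences (truth minus observation) to unlabeled points; IMR's actual ARX model estimates its coefficients from the labels instead:

```python
def repair_via_differences(obs, labels, phi=0.8):
    """Model the difference d_i = truth_i - obs_i rather than the values:
    differences are known at labeled indices and propagated elsewhere by an
    assumed AR(1) with coefficient phi, then added back to the observations."""
    diff = [labels[i] - obs[i] if i in labels else None for i in range(len(obs))]
    for i in range(1, len(obs)):
        if diff[i] is None:
            diff[i] = phi * (diff[i - 1] if diff[i - 1] is not None else 0.0)
    return [o + d for o, d in zip(obs, diff)]

obs    = [24.0, 26.0, 26.2, 26.1, 24.2]  # consecutive shifted readings in the middle
labels = {0: 24.0, 1: 24.5, 4: 24.2}     # labeled truths (user check-ins, precise sensors)
print(repair_via_differences(obs, labels))
```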
[Figure: Observation, IMR, and AR repairs (values 23.5-27.5 over time 360-390)]
Iterative Minimum Repair (IMR)
• Rather than repairing in chronological order
• Iterative repairing
– minimally changes one point at a time, applying only the most confident repair
– high-confidence repairs in earlier iterations help the later repairing
• Major concerns
– Convergence
– Incremental computation among iterations
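The iterative loop can be sketched as follows; the candidate repair here (the neighbours' average) is an illustrative stand-in for IMR's ARX-based candidates, and `threshold` is an assumed parameter:

```python
def iterative_minimum_repair(values, threshold=1.0, max_iters=100):
    """Sketch of the iterative loop only: each round, compute one candidate
    repair per interior point and apply just the candidate with the minimum
    change; earlier repairs then inform later rounds."""
    x = list(values)
    for _ in range(max_iters):
        best = None  # (change magnitude, index, candidate value)
        for i in range(1, len(x) - 1):
            cand = (x[i - 1] + x[i + 1]) / 2  # assumed candidate: neighbours' average
            change = abs(cand - x[i])
            if change > threshold and (best is None or change < best[0]):
                best = (change, i, cand)
        if best is None:
            return x  # converged: no candidate exceeds the threshold
        x[best[1]] = best[2]
    return x

print(iterative_minimum_repair([1.0, 5.0, 5.2, 1.3, 1.4]))  # two consecutive errors
```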
[Figure: Observation, IMR, and ARX repairs (values 23.5-27.5 over time 360-390)]
Dealing with Consecutive Errors
• IMR shows significantly better results when there is a large number of consecutive errors
[Figure: RMS error vs. number of consecutive errors (0-30); methods: IMR, SCREEN, EWMA, ARX, AR]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Future Study
• More error types
– Periodic errors
• Timestamp error
– A single ride takes 20 years
http://gd.sina.com.cn/news/sz/2016-07-08/detail-ifxtwihp9790656.shtml
Thanks
Full text available at
1. Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2015.
2. Aoqian Zhang, Shaoxu Song, Jianmin Wang. Sequential Data Cleaning: A Statistical Approach. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2016.
3. Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. International Conference on Very Large Data Bases, VLDB, 2017.
http://ise.thss.tsinghua.edu.cn/sxsong/