Time Series Data Cleaning · 2020-06-03
TRANSCRIPT
Time Series Data Cleaning
Shaoxu Song
http://ise.thss.tsinghua.edu.cn/sxsong/
Dirty Time Series Data
• Unreliable Readings
– Sensor monitoring
– GPS trajectory
Figure 8: Inaccurate GPS points (a) in rivers, (b) in the ocean, and (c) outside North America.
forgot to log the end of a trip after dropping off passengers. Nevertheless, they certainly affect further analysis on the data, such as data-based human mobility models [42].
In the 2010 taxi dataset, for the month of May, there were 7.1 million ghost trips. Given the 154 million trips that took place that month, this corresponds to an error rate of about 4.60%. To better understand which of the overlapping trips are defective, we would need domain knowledge from expert users and TLC to perform data cleaning: all the trips or just a subset may be erroneous. The number of ghost trips is much smaller for the 2011 dataset: the error rate is only 0.20%. Since the taxi dataset for 2011 has considerably fewer invalid values compared to 2010, as described in Section 2.1, one possible explanation is that different cleaning procedures were used for these two years, and inconsistencies such as ghost trips were removed before the release of the 2011 dataset.
4 Discussion
In this paper, we discussed some of the challenges involved in cleaning spatio-temporal urban data. We presented a series of case studies using the NYC taxi data that illustrate data cleaning challenges and suggested potential methodologies to address these challenges. These methodologies form the basis for integrating cleaning with data exploration. Data cleaning is necessary for data exploration, and through data exploration, users can attain a better understanding of the data which can lead to the discovery of cleaning constraints and enable them to discern between errors and features. Data exploration, however, requires a complex trial-and-error process. Thus, usable tools are needed to guide and assist users in the cleaning process. As the case studies we discussed illustrate, this is particularly true for spatio-temporal data, where visual analytics and event detection techniques at different resolutions are essential to identify quality issues.
The case studies presented in Section 3 show that some cleaning decisions are not clear cut. Often, multiple datasets are required to help an expert decide whether a data point is erroneous or represents an important feature. While there has been preliminary work on the discovery of relationships across datasets [8], there are still many open problems in identifying relevant data that can be used to explain events within a large collection of datasets and in a systematic fashion.
Lack of sufficient knowledge is another issue that hampers data cleaning. Even though experts can (and should) be involved in most of the process, they may be unavailable, or it may be expensive to hire them for cleaning large datasets. Crowdsourcing systems could help the data analyst clean data more efficiently: user
J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K. Zhao: Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39(2): 63-77 (2016)
Dirty Time Series Data
• Misuse
• Flight: Accuracy of Travelocity is 0.95
• Stock: Accuracy of Stock in Yahoo! Finance is 0.93
Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 6(2) (2013)
[Figure: example time series (values 0-3 over points 1-31); series: Observation, Smooth, Truth]
Existing Cleaning Methods
• Smoothing Filter
– Moving Average (MA)
– Weighted Moving Average (WMA)
– Exponentially Weighted Moving Average (EWMA)
• Problem: they modify almost all the data values, not only the erroneous ones
E. S. Gardner Jr. Exponential smoothing: The state of the art, Part II. International Journal of Forecasting, 22(4):637-666, 2006.
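As a hedged sketch of the EWMA filter listed above (the smoothing factor `alpha` is an assumed parameter, not a value from the lecture), note that every point after the first is altered, which is exactly the problem this slide flags:

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    smoothed = [values[0]]  # initialise with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [1.0, 1.1, 5.0, 1.2, 1.3]  # a single spike at index 2
print(ewma(series))  # every value after the first is modified, not just the spike
```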
[Figure: example time series (values 0-3 over points 1-31); series: Observation, AR, Truth]
Existing Cleaning Methods
• Prediction Model
– Modify the observation by the prediction if the prediction is far distant from the observation
– autoregressive (AR) model
– AR(I)MA
• May over-change the data
– owing to the loosely defined notion of "far distant"
Yamanishi, Kenji, and Jun-ichi Takeuchi. "A unifying framework for detecting outliers and change points from non-stationary time series data." In SIGKDD, pages 676-681, 2002
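A minimal sketch of the prediction-model idea above, assuming an AR(1) with coefficient `phi` and a hand-picked `threshold` (both illustrative, not from the cited papers):

```python
def ar1_repair(values, phi=1.0, threshold=2.0):
    """Replace an observation by the AR(1) prediction from the last repaired
    value whenever it is 'far distant' (difference above threshold)."""
    repaired = [values[0]]
    for x in values[1:]:
        pred = phi * repaired[-1]  # one-step AR(1) prediction
        repaired.append(pred if abs(x - pred) > threshold else x)
    return repaired

print(ar1_repair([1.0, 1.1, 9.0, 1.2, 1.3]))
```

With a too-tight threshold the same routine over-changes: nearly every point is replaced by its prediction, illustrating the "owing to far distant" concern.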
Repairing Dirty Data Helps
• Time series classification
[Figure: classification accuracy (y-axis, 0-1) per dataset: Synthetic Control, GunPoint, CBF, FaceAll, OSULeaf, SwedishLeaf, 50words, Trace, TwoPatterns, wafer; methods compared: Clean, Dirty, IMR, SCREEN, EWMA, ARX, AR]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Intuition on Speed Constraints
• "Jumps" in values are often constrained
– Daily limit: in financial and commodity markets
– Temperatures in a week
– Fuel consumption
• Use speed constraints to identify dirty data
SCREEN
• Given
– Time series x = {x_1, x_2, ...}
– Speed constraint s = (s_min, s_max) on min/max speeds
• Find a repair x' of x
– Constraint satisfaction: for all i, j with 0 ≤ t_j − t_i ≤ w,
s_min ≤ (x'_j − x'_i) / (t_j − t_i) ≤ s_max
– Change minimization: Σ_{x_i ∈ x} |x_i − x'_i| is minimized
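The constraint-satisfaction part can be checked directly from the definition; a small sketch (the window `w` and speed bounds are parameters of the speed constraint, and points are assumed sorted by timestamp):

```python
def speed_violations(points, s_min, s_max, w):
    """Report index pairs (i, j), with 0 < t_j - t_i <= w, whose speed
    (x_j - x_i) / (t_j - t_i) falls outside [s_min, s_max]."""
    bad = []
    for i, (ti, xi) in enumerate(points):
        for j in range(i + 1, len(points)):
            tj, xj = points[j]
            if tj - ti > w:
                break  # points are in time order; later ones are farther away
            speed = (xj - xi) / (tj - ti)
            if not (s_min <= speed <= s_max):
                bad.append((i, j))
    return bad

pts = [(1, 1.0), (2, 1.1), (3, 9.0), (4, 1.2)]  # (timestamp, value); spike at t=3
print(speed_violations(pts, s_min=-1.0, s_max=1.0, w=2))
```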
Stream Data Cleaning under Speed Constraints
[Figure: Dirty, Smooth, and SCREEN repairs (values 0-3 over points 1-31)]
[Figure: repair illustration over time/value axes: within window w, the repair x'_i of x_i at time t_i is bounded below by x'_k + s_min(t_i − t_k) and above by x'_k + s_max(t_i − t_k), for a preceding repaired point x'_k at time t_k]
Employ Existing Repairing Approach
• Holistic algorithm
– Repairing relational data
– Under denial constraints
• Adaptation
– Time series as a relation
– Roughly express speed constraints as denial constraints
• Problem
– High computational costs
– Not guaranteed to eliminate all violations
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458-469, 2013.
ID Timestamp Value
1  1  1.13
2  2  1.24
3  3  1.19
4  4  1.3
¬(t_j < t_i + w ∧ x_j > x_i + (t_j − t_i) · s_max)
¬(t_j < t_i + w ∧ x_j < x_i + (t_j − t_i) · s_min)
A Lightweight Weapon
• Unlike the NP-hard problems in most data repairing scenarios
• The speed constraint-based repairing can be solved
– as an LP problem, in polynomial time
– considering the entire sequence as a whole (global optimum)
• Online computing, over streaming data
– Consider the local optimum in the current sliding window
– Using the Median Principle, in O(nw) time
[Figure: repair cost (0-5) and the median principle: the local repair of x_k considers the candidates x_i + s_min(t_k − t_i) and x_i + s_max(t_k − t_i) derived from points x_i in the window w]
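A hedged, simplified sketch of the median-principle repair: the published algorithm takes candidates from all points in the window w, while this version uses only the immediately preceding repaired point:

```python
def screen_local(points, s_min, s_max):
    """Median-principle repair, simplified to use only the immediately
    preceding repaired point: take the median of the observation and the
    speed bounds it must satisfy."""
    repaired = [points[0]]
    for t, x in points[1:]:
        tp, xp = repaired[-1]
        lo = xp + s_min * (t - tp)  # lowest value reachable at speed s_min
        hi = xp + s_max * (t - tp)  # highest value reachable at speed s_max
        repaired.append((t, sorted([x, lo, hi])[1]))  # median clamps x into [lo, hi]
    return repaired

pts = [(1, 1.0), (2, 1.1), (3, 9.0), (4, 1.2)]
print(screen_local(pts, s_min=-1.0, s_max=1.0))
```

The spike is pulled down to the maximum value the speed constraint allows, matching the "modify to max/min values allowed" behaviour noted later for large spikes.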
Effectiveness and Efficiency
• Global: the highest accuracy
• Local: much faster than Holistic
• Trade-off
[Figures: RMS error vs. error rate (0.05-0.45), and time cost in seconds (log scale) vs. error rate; methods: Global, Local, Holistic, EWMA]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
[Figure: Observation, SCREEN, and Truth (values 365-385 over time 1540-1575)]
Further Issue
• Speed constraint-based method
– Large spike errors: modified to the max/min values allowed
– Small errors: fail to be identified
Intuition on Speed Change
• Consider the likelihood of speeds within the allowed range
• No clear distribution pattern is observed on speeds
• Interesting pattern on speed changes in consecutive points
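Computing speeds and speed changes from consecutive points is straightforward; this sketch reproduces the deck's example (v1 = 3, v2 = −1, so u2 = v2 − v1 = −4):

```python
def speed_changes(points):
    """Speeds v_i between consecutive (timestamp, value) points, and the
    changes u_i = v_i - v_{i-1} between consecutive speeds."""
    speeds = [(x2 - x1) / (t2 - t1)
              for (t1, x1), (t2, x2) in zip(points, points[1:])]
    return [v2 - v1 for v1, v2 in zip(speeds, speeds[1:])]

pts = [(1, 0.0), (2, 3.0), (3, 2.0)]  # v1 = 3, v2 = -1, as in the slide
print(speed_changes(pts))  # u_2 = v_2 - v_1 = -4.0
```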
[Figures: probability distribution of speeds (m/s), showing no clear pattern, and a time/value sketch of repairs x'_i within the range implied by s_min and s_max]
[Figure: probability distribution of speed changes, concentrated around 0, with p.m.f. and fitted p.d.f.; example: speeds v_1 = 3 and v_2 = −1 give speed change u_2 = v_2 − v_1 = −4]
Statistical Approach
• Calculate the likelihood of a sequence w.r.t. the speed change
– employ the probability distribution of speed changes
• The cleaning problem is thus to find a repaired sequence with the maximum likelihood w.r.t. speed changes
– instead of the minimum change towards speed constraint satisfaction
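To make the likelihood concrete, a sketch that scores a series under an assumed Normal(0, sigma) model of speed changes; the papers learn the distribution from data, so the Gaussian and `sigma` here are illustrative stand-ins:

```python
import math

def log_likelihood(points, sigma=1.0):
    """Log-likelihood of a (timestamp, value) series under an assumed
    Normal(0, sigma) model of speed changes."""
    speeds = [(x2 - x1) / (t2 - t1)
              for (t1, x1), (t2, x2) in zip(points, points[1:])]
    ll = 0.0
    for v1, v2 in zip(speeds, speeds[1:]):
        u = v2 - v1  # speed change
        ll += -0.5 * (u / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return ll

clean = [(i, float(i)) for i in range(5)]                    # constant speed
dirty = [(0, 0.0), (1, 1.0), (2, 5.0), (3, 3.0), (4, 4.0)]  # spike at t=2
print(log_likelihood(clean) > log_likelihood(dirty))  # the spiky series is less likely
```

A maximum-likelihood repair then searches, within the cost budget, for the repaired sequence that maximizes this score.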
[Figure: Observation, SCREEN, Likelihood, and Truth (values 365-385 over time 1540-1575)]
Maximum Likelihood Repair Problem
• Given
– Time series x
– Repair cost budget δ
– Distribution of speed changes
• Find a repair x' of x such that
– Δ(x, x') ≤ δ
– the likelihood L(x') is maximized
• NP-hard
• Pseudo-polynomial time solvable
– DP, dynamic programming: O(n·θ_max²·δ), exact
– DPC, constant-factor approximation: O(n·θ_max²), for large budgets
– DPL, linear-time heuristic: fast, but with higher error
– QP, quadratic programming: on an approximate distribution
– SG, simple greedy: O(max(n, δ)), fastest
Effectiveness and Efficiency
• Significantly better accuracy than SCREEN
• SG is efficient, comparable to SCREEN, and still with better accuracy
[Figures: RMS error vs. data size (1000-13000), and time cost in seconds (log scale) vs. data size; methods: DP, DPC, DPL, SG, SCREEN, QP]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Consecutive Errors
• Speed constraints handle "spike" errors well, but not consecutive errors
[Figure: Truth, Observation, and SCREEN repairs (values 23.5-27.5 over time 360-390)]
Intuition
• Supervised by labeled truth of errors
• Labeling by user
– Check-in
• Labeling by machine
– precise equipment reports accurate air quality data in a relatively long sensing period
– crowd and participatory sensing generates unreliable observations in a constant manner
[Figure: check-in example with an erroneous location and its labeled truth]
Y. Zheng, F. Liu, and H. Hsieh. U-air: when urban air quality inference meets big data. In KDD, pages 1436–1444, 2013.
Approach
• Instead of directly modeling the values
– by an AR (autoregression) model, ignoring erroneous observations
• We model and predict the difference between errors and their corresponding labeled truths
– by an ARX model (autoregressive model with exogenous inputs)
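A hedged sketch of the "model the difference, not the value" idea: an assumed AR(1) with coefficient `phi` propagates labeled differences (truth minus observation) to unlabeled points; IMR's actual ARX model estimates its coefficients from the labels instead:

```python
def repair_via_differences(obs, labels, phi=0.8):
    """Model the difference d_i = truth_i - obs_i rather than the values:
    differences are known at labeled indices and propagated elsewhere by an
    assumed AR(1) with coefficient phi, then added back to the observations."""
    diff = [labels[i] - obs[i] if i in labels else None for i in range(len(obs))]
    for i in range(1, len(obs)):
        if diff[i] is None:
            diff[i] = phi * (diff[i - 1] if diff[i - 1] is not None else 0.0)
    return [o + d for o, d in zip(obs, diff)]

obs    = [24.0, 26.0, 26.2, 26.1, 24.2]  # consecutive shifted readings in the middle
labels = {0: 24.0, 1: 24.5, 4: 24.2}     # labeled truths (user check-ins, precise sensors)
print(repair_via_differences(obs, labels))
```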
[Figure: Observation, IMR, and AR repairs (values 23.5-27.5 over time 360-390)]
Iterative Minimum Repair (IMR)
• Rather than repairing in chronological order
• Iterative repairing
– minimally changes one point at a time, applying only the most confident repair
– high-confidence repairs in earlier iterations help the later repairing
• Major concerns
– Convergence
– Incremental computation among iterations
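The iterative loop can be sketched as follows; the candidate repair here (the neighbours' average) is an illustrative stand-in for IMR's ARX-based candidates, and `threshold` is an assumed parameter:

```python
def iterative_minimum_repair(values, threshold=1.0, max_iters=100):
    """Sketch of the iterative loop only: each round, compute one candidate
    repair per interior point and apply just the candidate with the minimum
    change; earlier repairs then inform later rounds."""
    x = list(values)
    for _ in range(max_iters):
        best = None  # (change magnitude, index, candidate value)
        for i in range(1, len(x) - 1):
            cand = (x[i - 1] + x[i + 1]) / 2  # assumed candidate: neighbours' average
            change = abs(cand - x[i])
            if change > threshold and (best is None or change < best[0]):
                best = (change, i, cand)
        if best is None:
            return x  # converged: no candidate exceeds the threshold
        x[best[1]] = best[2]
    return x

print(iterative_minimum_repair([1.0, 5.0, 5.2, 1.3, 1.4]))  # two consecutive errors
```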
[Figure: Observation, IMR, and ARX repairs (values 23.5-27.5 over time 360-390)]
Dealing with Consecutive Errors
• IMR shows significantly better results when there is a large number of consecutive errors
[Figure: RMS error vs. number of consecutive errors (0-30); methods: IMR, SCREEN, EWMA, ARX, AR]
Contents
• Constraint-based method (SIGMOD 2015): large spike errors
• Statistical method (SIGMOD 2016): small errors
• Supervised method (VLDB 2017): consecutive errors
Future Study
• More error types
– Periodic errors
• Timestamp error
– A single ride takes 20 years
http://gd.sina.com.cn/news/sz/2016-07-08/detail-ifxtwihp9790656.shtml
Thanks
Full text available at
1. Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2015.
2. Aoqian Zhang, Shaoxu Song, Jianmin Wang. Sequential Data Cleaning: A Statistical Approach. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2016.
3. Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. International Conference on Very Large Data Bases, VLDB, 2017.
http://ise.thss.tsinghua.edu.cn/sxsong/