ieee transactions on knowledge and data engineering, vol. 24, no. 3, march 2012...

14
A Framework for Similarity Search of Time Series Cliques with Natural Relations Bin Cui, Senior Member, IEEE, Zhe Zhao, and Wee Hyong Tok Abstract—A Time Series Clique (TSC) consists of multiple time series which are related to each other by natural relations. The natural relations that are found between the time series depend on the application domains. For example, a TSC can consist of time series which are trajectories in video that have spatial relations. In conventional time series retrieval, such natural relations between the time series are not considered. In this paper, we formalize the problem of similarity search over a TSC database. We develop a novel framework for efficient similarity search on TSC data. The framework addresses the following issues. First, it provides a compact representation for TSC data. Second, it uses a multidimensional relation vector to capture the natural relations between the multiple time series in a TSC. Lastly, the framework defines a novel similarity measure that uses the compact representation and the relation vector. We conduct an extensive performance study, using both real-life and synthetic data sets. From the performance study, we show that our proposed framework is both effective and efficient for TSC retrieval. Index Terms—Time series clique, natural relation, compact representation, similarity search. Ç 1 INTRODUCTION T IME series databases are prevalent in multimedia, finance, health, and traffic monitoring applications. These applications provide data analysts with the ability to perform sophisticated analysis on the time series data. In order to effectively and efficiently analyze time series data, many techniques [1], [2], [3], [4], [5], [6] have been proposed. These techniques are for processing multiple, individual time series. For example, motion data are extracted from the video clips of a basketball game, and stored in a time series database. In the time series database, there are multiple, individual time series. Each of the time series can refer to the information for a player at different time points in the game. Applications can then be built to provide users with the ability to analyze the scenes in the video which relates to a user-specified query. These results (i.e., scenes) that are returned can be ranked based on the similarity to the query. Existing techniques cannot handle complex time series queries which require the natural relations between the time series to be used during similarity matching. The natural relations are prevalent in many time series databases, where each time series is correlated with several other time series in the database. For example, in the time series database that we described earlier, multiple basketball players from Teams A and B appear in a scene. Each of the players in the game moves and performs a specific action. In one scene, a player from Team A passes a ball to a teammate. The teammate aims and shoots the ball into the basket. At the same time, players from Team B are actively trying to prevent the player from getting the ball and shooting the ball into the basket. The moving trajectories of all the players in the game are correlated to each other. The position and action of each player in the game is reflective of the current game situation. We can easily observe the natural relations that exist in the time series database. If we process each time series independently, we will not be able to make use of the inherent natural relations. We refer to groups of time series with natural relations as Time Series Cliques (TSC), which can be defined as fT 1 ;T 2 ; ... ;T n g, where T i ¼ fðx m ;t m Þkl ¼ 1; 2; ... ;L i g refers to a single time series, n is the number of the time series in the TSC and can be different values in different TSCs. For each T i , x m 2 IR k denotes an observed value at a certain time, where k represents the dimensionality of each element in the time series, t m 2 IR 1 describes a time tick, and L i is the length of the ith time series. The patterns of each time series and their natural relations can be extracted from the tuples in fT 1 ;T 2 ; ... ;T n g. In each TSC, the time series exhibit natural relations with each other. Efficient similarity retrieval on TSC data can be used in many real-world applications. An example of TSC is illustrated in Fig. 1. The TSC data consist of the motion feature vectors of multiple players in a hockey game, e.g., Fig. 1a shows the trajectories of five players in the court, and Fig. 1b represents time interval for each trajectory, respectively. Other real-world applications, where TSC data can be found, include: video, and audio clips, electrocardiogram (ECG) readings, read- ings from seismic stations, etc. These examples motivate the need for effective and efficient TSC similarity search algorithms. Based on the definition of TSC, we formally define TSC similarity search as follows: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 385 . B. Cui and Z. Zhao are with the Department of Computer Science and Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University, Science Building 1, Beijing 100871, China. E-mail: {bin.cui, zpegt}@pku.edu.cn. . W.H. Tok is with the Microsoft Corporation, Blk 635, Veerasamy Road #10-162, Singapore 200635. E-mail: [email protected]. Manuscript received 11 Mar. 2010; revised 20 July 2010; accepted 6 Sept. 2010; published online 23 Dec. 2010. Recommended for acceptance by Q. Li. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-03-0137. Digital Object Identifier no. 10.1109/TKDE.2010.270. 1041-4347/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society

Upload: others

Post on 16-Apr-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

A Framework for Similarity Search ofTime Series Cliques with Natural Relations

Bin Cui, Senior Member, IEEE, Zhe Zhao, and Wee Hyong Tok

Abstract—A Time Series Clique (TSC) consists of multiple time series which are related to each other by natural relations. The natural

relations that are found between the time series depend on the application domains. For example, a TSC can consist of time series

which are trajectories in video that have spatial relations. In conventional time series retrieval, such natural relations between the time

series are not considered. In this paper, we formalize the problem of similarity search over a TSC database. We develop a novel

framework for efficient similarity search on TSC data. The framework addresses the following issues. First, it provides a compact

representation for TSC data. Second, it uses a multidimensional relation vector to capture the natural relations between the multiple

time series in a TSC. Lastly, the framework defines a novel similarity measure that uses the compact representation and the relation

vector. We conduct an extensive performance study, using both real-life and synthetic data sets. From the performance study, we

show that our proposed framework is both effective and efficient for TSC retrieval.

Index Terms—Time series clique, natural relation, compact representation, similarity search.

Ç

1 INTRODUCTION

TIME series databases are prevalent in multimedia,finance, health, and traffic monitoring applications.

These applications provide data analysts with the ability toperform sophisticated analysis on the time series data. Inorder to effectively and efficiently analyze time series data,many techniques [1], [2], [3], [4], [5], [6] have beenproposed. These techniques are for processing multiple,individual time series. For example, motion data areextracted from the video clips of a basketball game, andstored in a time series database. In the time series database,there are multiple, individual time series. Each of the timeseries can refer to the information for a player at differenttime points in the game. Applications can then be built toprovide users with the ability to analyze the scenes in thevideo which relates to a user-specified query. These results(i.e., scenes) that are returned can be ranked based on thesimilarity to the query.

Existing techniques cannot handle complex time series

queries which require the natural relations between the time

series to be used during similarity matching. The natural

relations are prevalent in many time series databases, where

each time series is correlated with several other time series

in the database. For example, in the time series database that

we described earlier, multiple basketball players from

Teams A and B appear in a scene. Each of the players in

the game moves and performs a specific action. In onescene, a player from Team A passes a ball to a teammate.The teammate aims and shoots the ball into the basket. Atthe same time, players from Team B are actively trying toprevent the player from getting the ball and shootingthe ball into the basket. The moving trajectories of all theplayers in the game are correlated to each other. Theposition and action of each player in the game is reflective ofthe current game situation. We can easily observe thenatural relations that exist in the time series database. Ifwe process each time series independently, we will not beable to make use of the inherent natural relations.

We refer to groups of time series with natural relations asTime Series Cliques (TSC), which can be defined asfT1; T2; . . . ; Tng, where Ti ¼ fðxm; tmÞkl ¼ 1; 2; . . . ; Lig refersto a single time series, n is the number of the time series inthe TSC and can be different values in different TSCs. Foreach Ti, xm 2 IRk denotes an observed value at a certaintime, where k represents the dimensionality of each elementin the time series, tm 2 IR1 describes a time tick, and Li isthe length of the ith time series. The patterns of each timeseries and their natural relations can be extracted from thetuples in fT1; T2; . . . ; Tng. In each TSC, the time series exhibitnatural relations with each other. Efficient similarityretrieval on TSC data can be used in many real-worldapplications. An example of TSC is illustrated in Fig. 1. TheTSC data consist of the motion feature vectors of multipleplayers in a hockey game, e.g., Fig. 1a shows the trajectoriesof five players in the court, and Fig. 1b represents timeinterval for each trajectory, respectively. Other real-worldapplications, where TSC data can be found, include: video,and audio clips, electrocardiogram (ECG) readings, read-ings from seismic stations, etc.

These examples motivate the need for effective andefficient TSC similarity search algorithms. Based on thedefinition of TSC, we formally define TSC similarity searchas follows:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 385

. B. Cui and Z. Zhao are with the Department of Computer Science and KeyLaboratory of High Confidence Software Technologies (Ministry ofEducation), Peking University, Science Building 1, Beijing 100871, China.E-mail: {bin.cui, zpegt}@pku.edu.cn.

. W.H. Tok is with the Microsoft Corporation, Blk 635, Veerasamy Road#10-162, Singapore 200635. E-mail: [email protected].

Manuscript received 11 Mar. 2010; revised 20 July 2010; accepted 6 Sept.2010; published online 23 Dec. 2010.Recommended for acceptance by Q. Li.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2010-03-0137.Digital Object Identifier no. 10.1109/TKDE.2010.270.

1041-4347/12/$31.00 � 2012 IEEE Published by the IEEE Computer Society

Page 2: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

Given a query TSC objectQ ¼ fq1; q2; . . . ; qlg, where l is thenumber of the time series in Q and each time series qi ¼fðxm; tmÞkm ¼ 1; 2; . . . ; Lig, and a TSC database D ¼ fTSC1;TSC2; . . . ; TSCNg. Each TSC object TSCi ¼ fTi1; Ti2; . . . ;TiNig, and in TSCi, Tij ¼ fðxm; tmÞkm ¼ 1; 2; . . . ; nijg, where

Ni is the number of the time series in the ith element ofdatabaseD and nij is the length of the ith time series in TSCi.The similarity search on TSC is to find the most similar TSCsfrom database D via a predefined function distðQ;TSCiÞ,where

s ¼ arg mini¼1;2;...;N

ðdistðQ; TSCiÞÞ:

A key difference between TSC and existing time seriessimilarity search is that the former requires the use ofnatural relations between the time series. In contrast, if thedata are treated as single time series, the natural relationsbetween the multiple time series will be lost. The TSCsimilarity search problem is a novel fusion of time seriesand spatiotemporal similarity search. It provides a powerfulgeneralization of existing time series and spatiotemporalretrieval problems.

In order to reduce the computational complexity of theTSC similarity search problem, and avoid pairwise compar-ison between all the time series in multiple TSCs, a suitablecompact data structure, which captures the intrinsic proper-ties of the TSC and the inherent natural relations need to beidentified. In contrast, in the absence of such a datastructure, a brute-force approach will need to compare thetrajectories of one scene with the trajectories in other scenes.A TSC similarity search algorithm can effectively andefficiently make use of the compact data structure toidentify similar patterns in multiple time series. DuringTSC similarity search, the algorithm must be able to takeinto account the natural relations between the group of timeseries in TSCs. Most importantly, the similarity measuremust be generic, and easily used to capture natural relationsfor different application domains.

In this paper, we focus on the new problem ofperforming similarity search over TSC databases. In orderto validate the requirements for TSC processing, wepresented our preliminary study of TSC processing in [7].In this paper, we extended [7] with an in-depth investiga-tion and performance analysis of TSC processing usingdifferent algorithms. Specifically, this paper makes thefollowing additional contributions: First, we provide a

comprehensive analysis of related work. Second, we

present an in-depth discussion of the issues and solutions

for feature extraction, feature transformation, and similarity

matching algorithms for TSC processing. Third, we conduct

an extensive performance analysis, using both synthetic and

real-life data sets.The main contributions of the paper are

. We propose two data structures for Time SeriesClique processing, which merges the relation of timeseries with their shapes or patterns. First, wepropose a relation vector for TSC (RT) whicheffectively describes the intrinsic natural relationsthat exist between the multiple time series in a TSC.Second, we propose a Compact Representation forTSC (CT), which captures the dominant informationin a TSC.

. We propose algorithms that leverages RT and CT toeffectively and efficiently compute the distancebetween two TSCs.

. We conduct extensive performance analysis of theproposed algorithms using both synthetic and real-life data sets. The empirical results show that theproposed approach is superior compared with otherTSC processing systems.

The rest of this paper is organized as follows: Section 2

introduces the related work. In Section 3, we introduce a

brute-force matching method for TSC similarity search,

which performs exhaustive search over all the TSCs. In

Section 4, we propose a novel algorithm for similarity

search in TSC databases. In Section 5, we present a

comprehensive performance analysis of the proposed

algorithms. We conclude in Section 6.

2 RELATED WORK

There is a long stream of research on time series databases.

We classify the existing work into three categories, i.e., time

series matching, pattern and correlation mining on multiple

time series, and feature extraction and representation of

multiple time series relations.

2.1 Time Series Matching

In existing similarity search over time series databases, the

time series is transformed from its original form (fðXm;

tmÞkm ¼ 1; 2; . . . ; Lg) into a more compact representation.

386 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Fig. 1. Examples of TSC in hockey games. (a) Players’ trajectories in TSC1. (b) Player’s moving interval in TSC1. (c) Players’ trajectories in TSC2.(d) Player’s moving interval in TSC2.

Page 3: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

The search algorithm leverages on two steps: dimension-ality reduction [8], [9], [10], [1], [3], [11] and datarepresentation in the transformed space.

Various dimensionality reduction techniques have beenproposed for time series data transformation. Theseincludes: Discrete Fourier Transform (DFT), Singular ValueDecomposition (SVD) [12], [8], [13], Discrete WaveletTransform (DWT) [9], [10], and Piecewise AggregateApproximation (PAA) [8]. Another approach for dimen-sionality reduction is to make use of time series segmenta-tion [1], [3], [11].

Besides dimensionality reduction, the choice of timeseries representation is also important. Two types of timeseries representations are commonly used: numeric andsymbolic. One of the commonly used numeric representa-tion is the real sequence [6]. Symbolic representation isgenerated by a symbol table that reflects each data vector ofsequence into symbols [5]. The symbol table can be eitherpredefined or built from data sets.

Trajectory data can be considered as a specific form oftime series, which has been applied in moving object searchor video retrieval fields [3], [4], [11], [14], [6]. In these works,the trajectories first are segmented by their control points[4], [11], [14], or inflextions [3], and then differentrepresentation methods, i.e., either a scalable numericrepresentation in [14], [6] or symbolic representations in[3], [4], [11], [5] have been adopted for similarity measure.

2.2 Time Series Mining

Multiple time series mining has also been recently explored.Existing multiple time series research have focused onpattern mining and finding correlation between multipletime series, over patterns and observed values from groupof individual time series. Papadimitriou et al. [15] proposedthe SPIRIT system. SPIRIT performs incremental PrincipalComponent Analysis (PCA) over stream data, and deliversresults in real time. SPIRIT discovers the hidden variablesamong n input streams and automatically determines thenumber of hidden variables that will be used. The observedvalues of the hidden variables present the general pattern ofmultiple input series according to their distributions andcorrelations. BRAID [16] addressed the problem of dis-covering lag correlations between data steams. BRAIDfocuses on a time and space efficient method for finding theearliest and highest peak in the cross-correlation functionsbetween all pairs of streams.

Another closely related area of work is research oncomputing the group nearest neighbor query [17]. Thequery of Group KNN is a set of high-dimensional datapoints. The group nearest neighbor query returns a group ofdata points in the database which are similar to the set ofquery based on the patterns and relations. The data set usedin the Group KNN problem consists of independent points,and does not consider whether natural relations existbetween these points.

These existing approaches cannot be easily extended forthe TSC similarity search problem because of the lack of aclearly defined similarity measure. Most importantly,existing approaches are unable to deal with the naturalrelations that exist between the multiple time series.

2.3 Representation of Time Series Relations

Several approaches for capturing natural relations among

multiple time series have been proposed. Allen [18] defined

13 interval relations that exist between multiple time series.

These include: before, overlaps, during, etc. These relations

are used to describe the relative position of two intervals.

Fleischamn et al. [19] deployed Allen’s descriptor to capture

temporal information by representing the events in video

data based on a lexicon of hierarchical patterns of move-

ments. An improved interval relation description method

for local temporal relations, i.e., Time Series Knowledge

Representation (TSKR) was proposed by Morchen and

Ultsch in [20]. TSKR expresses temporal knowledge in time

series data. In addition, the temporal relation, 3D Z-String

[21] was used to represent moving objects’ spatiotemporal

relations. In 3D Z-string, the objects in a video are projected

onto the x-, y-, and time-axis to form three strings

representing the relations and relative positions. The

temporal overlapping and spatial position are well defined

in 3D Z-String.In summary, none of the above methods can address the

challenges of effective and efficient TSC similarity search.The existing methods lack a compact, powerful, andmeasurable representation for the patterns and the naturalrelations that are inherent in the TSCs. In addition, it is hardto identify a generic relation descriptor that can be appliedto various application domains.

3 BRUTE-FORCE TSC MATCHING

In this section, we first present a straightforward approach

for solving the TSC matching problem, which exhaustively

compares the time series in TSCs. In such a Brute-Force

approach for TSC similarity matching, all the possible

matches between two TSCs are identified. Once the possible

matches are found, the similarity between all the time series

in each matching result are computed. The maximum

similarity (i.e., minimum distance) is then used to represent

the similarity between two TSCs.Let us revisit the example shown in Fig. 1. We can

observe that there are two similar TSCs, i.e., TSC1 (Figs. 1aand 1b) and TSC2 (Figs. 1c and 1d). These TSCs areextracted from a computer simulation of a hockey ball game[22]. Specifically, Figs. 1a and 1c are the trajectories ofplayers in the TSC1 and TSC2, Figs. 1b and 1d are the timeintervals of respective trajectories in the TSCs.

To measure the similarity between the above TSCs, i.e.,TSC1 and TSC2, the Brute-Force approach is to find all thecorresponding players in TSC2 for all the players in theTSC1, which are the players whose trajectories andintervals in Figs. 1b and 1d are of the same color as theplayers’ in Figs. 1a and 1c. After finding the matchedplayers, we can measure the similarity of each pair bydistance function between trajectories, and calculate thesum of their distance as the value of distðTSC1; TSC2Þ.

The above approach can be easily generalized to findthe matching result between the time series in TSC1 andTSC2 for different application domains. A distancefunction DISðTi; TjÞ, where Ti 2 TSC1 and Tj 2 TSC2,can be used for comparing the similarities between two

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 387

Page 4: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

TSCs. The function DISðTi; TjÞ measures the similaritybetween two single time series. In the literature, variousdistance functions have been proposed. These range fromthe simple euclidean distance between two vectors to thestate-of-art distance functions described in [6]. Duringthe matching process, the sum of DISðTi; TjÞ for all thematched pairs is computed. Thereafter, the minimumdistance is recorded as the distance distBF between twoTSCs TSC1 and TSC2. The Brute-Force approach guaran-tees the minimum ofX

Ti2TSC1^T 0i2TSC2

DISðTi; T 0i Þ;

where T 0i 2 TSC2 matches Ti in TSC1. In summary, byenumerating all possible matches between the time series intwo TSCs, we can calculate the minimum distance of Brute-Force approach by the following equation:

distBF ðTSC1; TSC2Þ ¼ mink

X<Tm;Tn>2Mk

ðDISðTm; TnÞÞ !

;

ð1Þ

where Mk is subset of the Cartesian product between the setof time series in TSC1 and TSC2, each Mk represents apossible matching result between the two sets, Tm 2 TSC1

and Tn 2 TSC2.The Brute-Force TSC similarity matching approach is an

exhaustive matching-based algorithm. The approach isvery costly, as it requires an enumeration between allpairwise single time series. Meanwhile, this method canmeasure the patterns between time series in two TSCswell, but ignores the natural relations of time series insidethe TSC. In order to reduce the computational cost andsolve the challenge of evaluating the natural relations, wepresent a novel approach for TSC similarity search in nextsection, which exploits compact representation and rela-tion vector of TSCs.

4 OUR APPROACH

Our goal is to design a representation that describe andmeasure the general information extracted from a TSC. Therepresentation needs to capture both the intrinsic patternsand natural relations in a TSC. Overall, the computationalcomplexity of the proposed TSC similarity search algo-rithm, which leverages the representation, must be sig-nificantly smaller compared to the Brute-Force approach.

In order to achieve this, we propose two data structures,called CT (Compact representation of TSC) and RT(Relation vector of TSC), as a compact way of representinga TSC and its inherent natural relations. CT consists ofprincipal components (PC) that are extracted from themultiple time series in a TSC. RT captures the naturalrelations among the time series in a TSC, and is derived byanalyzing the relations which is represented in a matrix. ATSC is completely described by a combination of its CT andRT. The CT and RT can be used to effectively compute theoverall distance ( i.e., distðTSCi; TSCjÞ).

Using CT and RT representations for the TSCs, wepropose a novel framework for similarity search in TSCdatabases. The overall framework of our approach is shown

in Fig. 2. First, we generate the principal components of theobserved values in multiple time series using PrincipalComponent Analysis. Second, we make use of the trans-formed values on the first few PCs to form a (multiple)generalized time series—CT. Third, we make use of thefeature correlation of multiple time series to form relationdescriptors Rij for each pair of time series Ti and Tj in aTSC. Taking the earthquake waves (EW) in the flowchart asexample, the figure shows three time series extracted fromthree earthquake observations in Tibet, as well as the spatiallocations of these observations on the top right. It would bevery useful for geologists to detect earthquakes with similarpatterns, e.g., the nearby epicenter, the waveform of theearthquake, etc. We use PCA to extract the time series of thehidden variables to build CT. And to build RT, we firstgenerate relation descriptors for each pair of the time series,based on their shapes and the location information, andthen build relation matrix, which is factorized by SingularValue Decomposition to generate the RT. The CT and RT areused together to measure the similarity between two TSCs.

In general, there are n2 relation descriptors in a TSCcontaining n time series. Fourth, these high-dimensionalrelation descriptors are transformed into lower dimensionalvectors fRij based on their PCs. These transformed relationdescriptors are used to form the relation matrices RM ffRijg.Finally, we generate a signature using Singular ValueDecomposition of this matrix. The highest signatures ofthe series of matrices and the first transformation vector ofPCA W1 are used to obtain the RT feature.

4.1 Compact Representation of TSC

One of the key challenges of performing TSC similaritysearch is to efficiently match the patterns of multiple timeseries in TSCs. Using a compact representation can avoidthe need to perform pairwise comparisons of all the timeseries, which is computationally expensive.

In the database literature, PCA is commonly used in timeseries analysis [3], [15] for dimensionality reduction, and to

388 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Fig. 2. The flowchart of our approach.

Page 5: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

facilitate fast and accurate retrieval. Fig. 3 shows severalgroups of objects and their corresponding principalcomponents. The key idea in PCA is to transform a set ofobserved values in the high-dimensional vector space into anew vector space. Given a data matrix X, and a transforma-tion matrix W . Each column of X corresponds to a datavector in the original space. Each row of W corresponds to atransformation vector. The transformation vector trans-forms the data vectors from the original space to the valueof certain principal components. The PCA transformation isgiven as:

Y ¼ X �W:

Y denotes the data matrix in the new vector space. In thenew vector space, each dimension corresponds to theprincipal component of the original space. It is generatedfrom the original dimensions based on the statisticaldistribution of the observed values.

In our work, we make us of PCA to produce CT, acompact representation of the patterns of multiple timeseries in a TSC. Each CT consists of one or several timeseries of the principal components extracted from theoriginal TSC. In the SPIRIT system [15], these principalcomponents, also known as hidden variables, are usedprimarily to perform pattern detection for data streammanagement. In our scenario, a CT is used to represent thegeneral feature of multiple time series patterns and tomeasure the similarity of TSCs. CT provides a powerfulsummary of the time series in a TSC. It is generated basedon each time series’ observed values at every time tick andtheir (dis)/similarity.

Given a TSC, which consists of n time seriesT1; T2; . . . ; Tn, for each time series Ti, the observed value intime tick tj is xij . Denote l is the number of time ticks, i.e.,the length of time series, the TSC can also be represented byln-dimensional vectors where the vectors are the values of ntime series in a certain time tick. For each time tick, we builda n-dimensional vector Xj ¼ ðx1j ; x2j ; . . . ; xnjÞ. Each dimen-sion refers to a time series in T1; T2; . . . ; Tn. Note that, we cannormalize the time series with different lengths, which is ageneral solution for time series matching.

We perform PCA transformation on these n-dimensionalvectors to identify the first few principal components. In theimplementation, we choose the first n0; n0 � n PCs. These n0

PCs can capture the most dominant information of theoriginal data, even if n0 � n. We only consider these PCsand their associated observed values which have beentransformed in PCA dimensions.

As n is the dimensionality of the original space, the n�ntransformation matrix W is given as follows:

W ¼ ðW1;W2; . . . ;WnÞWi ¼ ðwi1 ; wi2 ; . . . ; winÞ

T ;

where each row Wi; i ¼ 1; . . . ; n in W is a n-dimensionaltransformation vector of a Principal Component. Thetransformation vector transforms the data from the originalspace to the dimensions of principal components by linearcombination.

Thus, the new observed values of time tick tj on thedimensions of transformed PCA space are

fXj ¼ Xj �W ¼ ðXj �W1; Xj �W2; . . . ; Xj �WnÞ: ð2Þ

The derived values in each time tick tj in CT are the first n0

principal components of Xj, which are the first n0 dimen-sions of fXj. The transformation matrix which transforms theorigin vector space into first n0 dimensions of PCs is

W 0 ¼ ðW1;W2; . . . ;Wn0 Þ: ð3Þ

Therefore, CT can be represented as

CT ¼ X �W 0 ¼ ðX1 �W 0; X2 �W 0; . . . ; Xl �W 0Þ; ð4Þ

where l is the length of time series. The l� n0 data matrix inthis equation represents n0 time series (CT) which aregeneralized from the multiple patterns of original n timeseries.

With the PCA transformation, the derived values in CTare the combination of each dimension of Xj; j ¼ 1; 2; . . . ; land such combination guarantees that the most dispersionrepresents the most useful information; hence, the patternsof multiple time series are generalized in CT according totheir appearances. Therefore, CT can be the signaturedescribing the general features of patterns in TSC, andusing CT in TSC similarity search can avoid matching timeseries between two TSCs. Comparing to the Brute-Forcemethod which finds the best matches between every timeseries pair, CT is a more general feature and is easier tomeasure the similarity, although information will be lost byonly reserving the first one or few principal components,measuring the similarity between CTs is much moreefficient than Brute-Force method. Moreover, the lostinformation can be compensated using RT which will beintroduced in next section.

4.2 Relation Vector Generation for TSC

The natural relations that are inherent in a TSC need to becaptured for effective TSC retrieval. In the previous section,we saw how the key information of multiple time series canbe represented using the CT feature. In this section, weshow how the natural relations that are inherent in the timeseries can be captured.

One of the key challenges for representation of thenatural relations is defining a general measure for capturingand describing the relations from different applicationdomains. In addition, the natural relations between multi-ple time series can be complex. In [21], [20], [19], [18],spatiotemporal relation descriptors of moving objects andtime series intervals have been proposed. However, such

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 389

Fig. 3. Example of PCs. (a) PCs of a group of objects. (b) First PCs oftwo groups.

Page 6: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

descriptors are generally not comparable and relevant to

specific application domains.We propose a novel concept of Relation vector for TSC

(RT). A RT is a multidimensional vector, which can be used

to describe the natural relations among multiple time series

in a TSC. The RT consists of a set of signatures, which are

extracted from a relation matrix. The relation matrix is

obtained based on the relation descriptors of every time

series pair in TSC. We first perform dimensionality

reduction on all the relation descriptors to transform the

relation descriptors to fewer meaningful dimensions, i.e.,

their principal components, according to the correlation of

dimensions in the relation descriptors of TSC. Then, we use

these transformed vectors to form a relation matrix of the

TSC. The extraction of RT can be summarized as follows:

. For every two time series in a TSC, a relationdescriptor is used to capture their intrinsic relation-ship. The relation descriptor is represented by ahigh-dimensional vector. Each dimension of thevector measures the difference between specificproperties of two time series. For example, thedimension can capture the difference of the samplingpositions for each time series, as well as the locationor interval in which the time series are obtained.

. We make use of the PCA transformation to trans-form the high-dimensional relation descriptors into asummarized space. The transformation vector corre-sponding to the distribution of the vectors in originspace is recorded as one component of RT. Thetransformed relation descriptors are used to form theother part of RT in the next step.

. The relation descriptors which are obtained afterperforming PCA constitutes a series of n� n ma-trices. We refer to these matrices as the RelationMatrix of the TSC. Without loss of generality, we usen to represent the number of time series in TSCobjects. Each matrix corresponds to one dimension ofthe principal component of the relation descriptors,and the ath row and bth column of the matricescorresponds to the relation descriptor between theath time series and bth time series. For each matrix,we use the singular value from the Singular ValueDecomposition as the signature. The signatures of theseries of matrices form the other part of RT feature.

It is important to note that the row or column order of

time series is generally not specified in the relation matrix.

Each column/row describes a time series’ relation with

other members in TSC. Hence, when the rows or columns of

the relation matrix are swapped, the similarity between the

RTs is preserved. This leads to the invariance nature of

signatures of the relation matrix to elementary row and

column transformations on the matrix.

4.2.1 Relation Descriptor between Time Series

In this section, we propose the notion of relation descriptor.

The relation descriptor is used to describe the natural

relation between two time series, e.g., Tm and Tn. Relation

descriptor is a high-dimensional vector, where each

dimension measures the difference between time series on

a specific property, e.g., the difference of the start location oftrajectories, the difference of begin time, etc.

The basic natural relations between Tm and Tn refer to

the differences between the typical observed values in the

time series and the associated time intervals. Such relation

is commonly found in various application fields, e.g., the

spatiotemporal relation of moving trajectories [21] in video

clips. Since the observed values and interval of a time series

are generally represented in uniform data format despite of

different application areas, we can define the difference on

these relations in the relation descriptor. In this paper, we

define two basic relations, i.e., variance in spatial domain

(VIS) and variance in temporal domain (VIT).Variance in spatial domain. Given a time series as

follows: fðx1; t1Þ; ðx2; t2Þ; . . . ; ðxl; tlÞg, where l is the length,fxig is the set of observed values, and ftig is the set of timeticks. The variance in spatial domain is a high-dimensionalvector, each dimension of which captures the differencebetween the sampled observed values of Tm and Tn

ðkxms1� xns1 k; kxms2

� xns2 k; . . . ; kxms�� xns�kÞ; ð5Þ

where xms1; xms2

; . . . ; xms�is the sampled observed values of

Tm, and xns1 ; xns2 ; . . . ; xns� is the sampled observed values ofTn. Note that s1; s2; . . . ; sl can be generated based on somesampling method; in our approach, we only adopted equallength sampling for simplicity.

Variance in temporal domain. VIT is a descriptor of therelation between intervals of Tm and Tn. Motivated by thediscussion in [18], [20], we propose a 4D vector V IT ðTm; TnÞto describe the differences of intervals in our approach

V IT ðTm; TnÞ ¼ ðtm1� tn1

; tml� tnl ; t3m;n; t4m;nÞ; ð6Þ

where tm1and tn1

are the first time ticks of Tm and Tn, tml

and tnl are the last time ticks, respectively, the first twodimensions of VIT are the differences of two time series’begin time and end time, which indicate the relation ofwhich time series begins first and which one ends first;besides, t3m;n and t4m;n, which are used to describe theoverlap status between two time series’ interval, are definedas follows:

t3m;n ¼0; m ¼ n; ð7Þtm1� tnl ; tm1

� tn1; ð70Þ

�ðtn1� tml

Þ; tmi< tn1

: ð700Þ

8<:t4m;n ¼

0; m ¼ n; ð8Þtml� tn1

; tm1� tn1

; ð80Þ�ðtnl � tm1

Þ; tm1> tnl : ð800Þ

8<:These four features can effectively model the pairwise

temporal relation of time series, such as duration andoverlapping status. It is also necessary to make V IT ðTm; TnÞand V IT ðTn; TmÞ satisfy the antisymmetry relation:V IT ðTm; TnÞ ¼ �V IT ðTn; TmÞ, because of the followingtwo reasons:

. VIT describes the relation between the temporalintervals of two time series, which will be used tomeasure the distance between two temporal intervals.

390 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Page 7: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

Therefore, it must satisfy: kV IT ðTm; TnÞk ¼ kV IT ðTn;TmÞk in measuring the similarity of two temporalinterval Tm and Tn.

. VIT is a part of RT which is used to represent andmeasure the similarity of pairwise natural relation.Therefore, VIT should be measurable to compute thesimilarity of two temporal relations.

The four components of proposed VIT measurementsatisfy the antisymmetry relation, the combination of whichcan represent all possible temporal relations between twotime series and can be further integrated with VIS to form acomplete relation descriptor for multiple time series in aclique.

In summary, the relation descriptor of Tm and Tn is thecombination of two components, i.e., the differencesbetween observed values and intervals. The relation de-scriptor is a high-dimensional vector which summarizes theoverall intrinsic relation in both spatial and temporaldomains by considering various properties of Tm and Tn.Such descriptor is simple and robust for different applica-tions as well as can be easily processed due to its vector form.

Note that, the VIS and VIT features are only two typicalinstances to describe the general information of relationsbetween time series. Besides these relations on spatial andtemporal domains, we can also explore other application-specific features to capture extra relations existing in timeseries cliques, such as the location of monitor station forearthquake waves, the personal ability or scores of theplayers that move in different trajectories. In many real-lifeapplications, these information may play a critical role indescribing the characters and organization of a TSC. Theseextra relations can be easily accumulated by appending newrelations to the relation descriptors, e.g., the locationdifference between earthquake monitor stations, and thecapability difference between players. In many applications,the feature selection is critical in similarity search problems.However, in this work, our purpose is to design a generalrelation descriptor to describe the overview information ofnatural relations among time series in a TSC for similaritymeasure base on our proposed framework; thus, we onlyadopt the common VIS and VIT features for clarity ofpresentation, and more domain-specific relations can beeasily integrated.

4.2.2 Relation Matrix Construction for TSC

Inspired by the ERMH-BoW method which generates arelative motion histogram in a matrix form to describerelations of local video motion feature [23], we generate aRelation Matrix for TSC based on the relation descriptorsintroduced previously to better represent the naturalrelations and perform further feature extraction.

The relation descriptors of each time series pair in TSCare vectors in a high-dimensional vector space. Since therelation descriptors are generated from the differences ofobserved values and time intervals, there may existsredundant information as well the high dimensionality ofsuch vectors may introduce expensive computational cost atthe similarity search stage. Furthermore, the dimensions ofrelation descriptors are isolated from each other; thus,natural relations based on multiple properties cannot be

well represented. Therefore, dimensionality reduction anddimension reorganization on the raw relation descriptorsneeds to be performed in order to obtain a goodrepresentation. In our approach, we make use of PrincipalComponent Analysis to transform the relation descriptorsin high-dimensional vector space to a new space. The newspace consists of fewer dimensions, which exhibit highercorrelation between the dimensions.

Given a TSC object which has n time series T1; T2; . . . ; Tn,we can generate n2 relation descriptors Rij of each pair oftime series Ti and Tj from the TSC object, where Rij ¼fV ISðTi; TjÞ; V IT ðTi; TjÞg if we only consider the spatialand temporal relations. Using these n2 vectors, we performPCA to find the Principal components according to thedata’s latent correlation and distribution on these dimen-sions. Set r as the dimensionality of each Rij, and r� rtransformation matrix W for PCA as ðW1;W2; . . . ;WrÞ, wereserve only first few PCs which contain the majorityinformation. The number of PCs remained is denoted as r0.Thus, we obtain the transformed relation descriptors fRij

fRij ¼ ðRij �W1; Rij �W2; . . . ; Rij �Wr0 Þ; ð9Þ

where we utilize r0 PCs and transform all the relationdescriptors Rij into r0-dimensional vectors fRij.

In the next step, we use the obtained n� nr0-dimensionalvectors fRij; i; j ¼ 1; . . . ; n to form Relation Matrix RM of theTSC object. Each element in RM’s ith row and jth column isthe relation descriptor of time series Ti and Tj

RMij ¼ fRij; i; j ¼ 1; 2; . . . ; l:

Because fRij is an r0-dimensional vector, RM is actually aseries of matrices RM1; RM2; . . . ; RMr0 , each of whichcorresponds to one dimension of fRij, represented by

fR1ij;fR2ij; . . . ; fRr0

ij

RM1ij ¼ fR1ij ¼ Rij �W1

RM2ij ¼ fR2ij ¼ Rij �W2

� � �RMr0ij ¼ fRr0

ij ¼ Rij �Wr0 :

8>>><>>>: ð10Þ

4.2.3 Generation of RT

The relation matrix has been generated to represent therelations in TSCs, however such data structure cannot bedirectly deployed for similarity measure of TSC objects. Afeasible way is to use vectors to represent natural relationsin TSCs for efficient processing, e.g., find a group ofsignatures to represent the relation matrix.

In the matrix RM, the ith row, i.e., ðfRi1; fRi2; . . . ;fRilÞ,represents the relation between time series Ti and othertime series in the TSC object. The similar meaning can alsobe shown from the ith column of RM.

In order to improve the efficiency of TSC similarityretrieval, the costly pairwise matching of time series needsto be avoided. Each row or each column of the RM describesa certain time series’ natural relation to other time series. Itis important to note that the row or column order of time

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 391

Page 8: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

series in an RM is not specified. This allows row or columnswap in the RM, which does not change the value of thesignatures. This ensures that the signatures extracted fromthe RM are invariant to elementary row and columntransformation. Consequently, the singular value for eachmatrix in RMi is used as the signature. The RM isrepresented by an r0-dimensional vector, where each valuein the ith dimension is the highest singular value of thecorresponding matrix RMi.

The singular value can be computed using the SingularValue Decomposition of each matrix. It is invariant toelementary row and column transformation. The basic ideaof SVD is as follows: For every matrix, A 2 IRm�r can bedecomposed into

A ¼ U�V T ;

where U 2 IRm�er, V 2 IRr�er, and � 2 IRer�er, with er � minðm;nÞ as the rank of A. Each row vi of V ½v1; . . . ; ver is the rightsingular vector of A and forms an orthonormal basis of itsrow space. Similarly, each column ui of U ½u1; . . . ; uer is theleft singular vector and forms a basis of the column space ofA. � diag½�1; . . . ; �er is a diagonal matrix with positivevalues �i, called the singular values of A.

SVD comes from two orthonormal transformations, i.e.,A ¼ PV T and A ¼ UQT , where U and V are orthonormalbasis of the row space and column space, respectively. TheSVD is a special case that V and U correspond to each other,i.e., PV T ¼ UQT ¼ U�V T , when the orthonormal transfor-mation satisfies that P and Q can effectively represent Awith a few first columns or rows, respectively. Suchstratification means that the transformation V or U trans-forms the vectors in A to a new space where the first fewvalues can effectively represent these vectors. In such case,the singular values �1; �2; . . . ; �er form the diagonal matrix �which can be used as the signatures of A. The importance ofeach value decreases in the order of �1; �2; . . . ; �er.

We use the first singular values of RM1; RM2 . . . ; RMr0 toform an r0-dimensional vector as the signature of therelation matrix. The elements of RM are the relationdescriptors after transforming to the first r0 PCs, and thesingular values from RMs of different TSCs is influenced bythe their transformation vectors of principal components;therefore, we use the singular value signature of RM alongwith the W1 in (9), which is the first transformation vector ofthe relation descriptors, as Relation Vector of Time SeriesClique (RT). The signature is a generalized descriptor of RMwhich stores the natural relation information extracted fromthe PCA space of the original relation descriptors, and W1

represents the transformation direction of the first PC.Therefore, RT is an r0 þ r-dimensional vector, where thefirst r0 dimensions store the Relation Matrix signature andthe other r dimensions store the transformation vector W1.The algorithm of generating the RT feature from TimeSeries Cliques is shown in Fig. 4.

A RT consists of two parts: The transform vector W1 in(9), and the singular values of the Relation Matrices. Thesetwo parts of a RT, correspond to the distribution of therelation descriptor and signatures of the Relation Matrix,respectively. In our proposed framework, PCA and SVD areused mainly to summarize all the Relation Descriptors to a

single descriptor. The single descriptor describes thepattern of the inner relation of the TSC, and is used forsimilarity matching. The W1 generated by PCA is used torecord the distribution information of the relation descrip-tors in a TSC. However, the W1 does not capture thedetailed information of the natural relation in the TSC. Foreach TSC, the relation matrices after processing PCAdescribe the pattern of TSC’s inner natural relations.Therefore, SVD is used to extract the signatures for thematrices, i.e., the first singular value, which are used asthe other component of RT. The use of PCA and SVD on thetransformed data are applied to relation descriptors fromdifferent domains. Hence, it is important to note that it isnot purely a dimensionality reduction process. PCAreduces the dimensionality and records the distributioninformation on the dimensions of extracted features ofrelation descriptors. SVD summarizes the matrices into avector using unique singular values by mixing the dimen-sions of time series.

The feature RT generated by our framework is apowerful way for representing the natural relations of aTSC. It has the following desirable properties. First, thesingular value captured in the RT is invariant to elementaryrow and column transformations in RM. This allows thesignature of a RM to be preserved even if the ordering oftwo time series in the relation matrix are changed. Thisenables multiple time series matching to be performedeffectively. Second, the RT is a simple, multidimensionalvector. This allow the distance between TSCs to bemeasured efficiently. Third, the RT is generated fromrelation descriptors of pairwise time series in a TSC, andthe relation descriptor is a simple and robust high-dimensional vector. This allows it to be used in manyreal-life applications.

4.3 Similarity Search on Time Series Cliques

We extract features from Time Series Cliques and generateCT and RT as the feature vectors. Specifically, CT provides acompact representation of characteristic patterns of time

392 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Fig. 4. The algorithm for RT feature extraction.

Page 9: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

series in a TSC, and RT represents the natural relationsamong the time series in a TSC. Both of these two descriptorsare generalized from detailed features, and hence can beefficiently used for similarity search in TSC databases.

Given two TSC objects which are represented byaforementioned features, e.g., TSCi ¼ fCTi; RTig andTSCj ¼ fCTj;RTjg, we define the distance function be-tween two TSCs distðTSCi; TSCjÞ based on the similaritymeasure for compact representation CT, i.e., DIS CðCTi;CTjÞ and relation vector RT, i.e., DIS RðRTi; RTjÞ. CTexpressed in (4) is the time series of PCs; thus, the distancefunction of DIS CðCTi; CTjÞ can be the same as the distancefunction between time series DISðTi; TjÞ in (1). Note that,based on the demand of specific application, any existingdistance function of time series matching can be appliedhere, as it is orthogonal to our mechanism. For simplicity ofpresentation, we deploy the euclidean Distance as thedefault distance function. As reported in Eamonn Keogh’stalk in Interface 2003 [24], he studied 11 different distancemeasures on time series matching, and concluded that nomethod performed better than traditional euclidean dis-tance and Time Warping distance overall.

RT is a ðr0 þ rÞ-dimensional vector including the RelationMatrix Signature and the transformation vector W1 for thefirst PC in the PCA of relation descriptors. The euclideandistance can be used to measure the similarity between theRelation Matrix Signatures, because similar Relation Ma-trices have similar distributions; thus, the � in A ¼ U�VT issimilar. We can see the relation between the Relation MatrixSignature and W1 from Fig. 3b which shows two groups ofobjects and their first PCs. In Fig. 3b, the smaller anglebetween the two first PCs and the more similar the datadistributions on the first PCs are, the more similar the twogroups of objects are. For these two groups of objects, wecan measure their similarity by their principal component’sdirection and the data distribution in such direction.Inspired by this example, the distance function ofDIS RðRTi; RTjÞ is defined as follows:

DIS Rð:; :Þ ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiEucðSðRMiÞ; SðRMjÞÞ � sin<Wi1 ;Wj1

>q

ð11Þ

SðRMÞ ¼ ðsvdðRM1Þ; svdðRM2Þ; . . . ; svdðRMr0 ÞÞ; ð12Þ

where svdðRMiÞ represents the first singular value of RMi,and EucðSðRMiÞ; SðRMjÞÞ is the euclidean distance be-tween SðRMiÞ and SðRMjÞ, sin<Wi1 ;Wj1

> is the sine valueof the angle between the first PCs from two RTs.

According to the distance functions defined on the twofeatures, the overall distance between two TSC objectsdistðTSCi; TSCjÞ can be calculated by the geometric valueof the two distances:

distðTSCi; TSCjÞ ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiDIS CðCTi; CTjÞ �DIS RðRTi; RTjÞ

q:

ð13Þ

The distance function distðTSCi; TSCjÞ to compute TSCsimilarity fuses the distance functions of DIS C and DIS Rwhich are the similarity measurement of time seriespatterns and their natural relations, respectively. In our

approach, we extract generalized pattern information frommultiple time series as CT and extract intrinsic naturalrelation vector RT based on pairwise relation matrices. Byusing these two components for TSC similarity evaluation,we solve the two challenges of effectively evaluating naturalrelationships and measuring the patterns of multiple timeseries in TSC; therefore, our approach can performeffectively and efficiently for similarity search in TSCdatabases. Note that, we can give different weights to twofeatures according to user preference or domain applica-tions if necessary.

5 PERFORMANCE ANALYSIS

In this section, we conduct an extensive performanceanalysis to evaluate the performance of the proposed TSCsimilarity search framework. We measure both the queryeffectiveness and the execution time for TSC retrieval.We compare our proposed approach named TSC with theBrute-Force approach presented in Section 3, and with thehidden-variable method in SPIRIT system [15]. SPIRIT isextended for TSC retrieval by using the hidden variables forsimilarity measurement between groups of time series. Thevalues of the hidden variables capture the general pattern ofmultiple input time series based on the value distributionand correlations.

Note that, the selection of distance function is orthogonalto the approaches, since we can simply apply differentfunctions to time series matching, such as euclideandistance, time wrapping distance. In our experiments, wehave evaluated different distance functions which yieldsimilar tendency on the performance comparison, whichdrew similar conclusion as in [24]. Due to the spaceconstraint, we only detail the evaluation on euclideandistance for performance analysis.

All experiments are conducted using Matlab. The hard-ware consists of a server with a Core 8 2.8 GHZ CPU, anduses the Ubuntu Linux OS. The performance analysisresults clearly show the superiority of our proposedapproach for similarity search over TSC databases.

5.1 Data Description

As there are no benchmark data set that can be used for TSCsimilarity search problems, we devise a TSC data set forperformance analysis. The TSC data set consists of bothreal-life and synthetic data sets. The real-life TSC data setconsists of earthquake wave and video trajectory (VT) data.The synthetic data sets are generated with differentconfigurations and parameters.

5.1.1 NHL Data Set

The first synthetic TSC data set is created based on the NHLdata set to simulate the groups of players in sport games.The NHL data set was used in [22], which consists ofNational Hockey League players’ 2D trajectories. Thetrajectories are obtained by digitizing the PhiladelphiaFlyers’ hockey games during the NHL 2001-2002 season.Every trajectory in the NHL data set is a 2D time seriesfðXi; YiÞji ¼ 1; 2; . . . ; lg, where the length l is equal to 256.

By checking the players’ movements in some sportsgame videos, we observe that the players’ trajectories in

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 393

Page 10: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

scenes have some common characters, e.g., a scene usuallyhas a few key players, the players move with various speed,and their appearances have different time interval withsome temporal overlaps. Based on these observation, wefirst use 1,000 NHL trajectories to form 200 time seriescliques, each of which contains five different trajectories.Fig. 1 shows some examples of the TSCs generated by NHLdata, which simulates the multiple players in the Hockeycourt. In each clique, we generate temporal intervals for thetrajectories based on the moving distance Dis and anaverage running speed. The running speed and start time ofthe players are created following the Gaussian distribution.The temporal duration of each trajectory is generalized bydividing Dis with speed.

The generated TSCs are pairwise dissimilar in general, asthey consist of different trajectories and are with differentspeed and duration. Using the 200 TSCs as seeds, we expandthem to 200 TSC categories, each of which is with 80 TSCs.We produce variations of TSCs by adding small timewarping (e.g., 5 percent of the series length), and inter-polated Gaussian noise on frequency [25], and varying thenumber of trajectories in the TSC from 3 to 5. Consequently,our synthetic NHL data set contains 200 categories and eachhas 80 similar TSCs, i.e., the TSC database has 16,000 TSCs intotal. The retrieval performance can be evaluated by theaverage precision and recall of a group of queries fromdifferent categories in the data set. In our approach, wechoose the original 200 TSCs seeds as the query set.

5.1.2 EW Data Set

The second TSC data set is a real-life data set whichrepresents the Earthquake Wave of a certain region. TheEarthquake Wave data set is provided by the Department ofPhysical Geography, Peking University. The data setcontains the earthquake waves captured by a distributednetwork of seismographic stations for more than one year(from June 30th 2004 to Sept. 4th 2005). Twenty-two stationsare deployed in Tibet, China. The locations of these stationsare from latitude 31 to 27.5 and longitude 90.5 to 85.5. Eachof the station captures a 3D time series 24 hours a day,which represents the observed earthquake information. Weperform preprocessing to smooth and resample the EWsusing Discrete Cosine Transformation, and split the timeseries in the interval of every hour. We generate 10,368 TSCobjects and 140,424 time series in total from this data set.Each TSC has multiple of time series (from 1 to 22) andrepresents the earthquake waves of the region within acertain period (1 hr). Fig. 5 shows an example of the EWdata set, which illustrates the earthquake waves from seven

earthquake monitor stations in 7 am, Jan. 1 2005, when asignificant earthquake took place. We can see that the burstof earthquake wave is detected by all stations.

It is meaningful that we can detect earthquake behaviorbased on the patterns and natural relations of the earth-quake waves from the stations. In our approach, we applyour TSC similarity matching method in an earthquakeretrieval application. We obtain the earthquake informationfrom the website of US Geological Survey, EarthquakeHazards Program.1 We query for the information of all theglobal earthquakes whose magnitudes are higher than 6,from June 30th 2004 to Sept. 4th, 2005 and get 168 earth-quakes as a result. One hundred forty two of these globalearthquakes can be detected by our stations in Tibet. We usethe set of 142 TSCs during the earthquake took place as thequery set.

Comparing earthquakes to other noises, e.g., a vehicle’spassing or a station’s breakdown, earthquake can bedetected by nearly all the stations and the earthquake wavesexpress differently based on the characters of the geologicallocation and distance between the seismic vertical. There-fore, based on the patterns of earthquake (similar to Fig. 5,but different in every earthquake) and the natural relationsof these stations (the delay of earthquake spreading to eachstations and their geological location), it is meaningful usingthe TSCs of the earthquakes to retrieve other earthquakesbased on the TSC similarity.

5.1.3 VT Data Set

Additionally, we use 54 hours of video clips downloadedfrom Youtube. We evaluate our methods on trajectory-based video retrieval. We segment the video clips into29,604 shots by comparing the difference of adjacent frameswith a predefined threshold and extract Motion FlowsTrajectory [26] from the video shots. For each shot, wegenerate a TSC with multiple trajectories. This videotrajectory data set contains video clips from different scenesand the quality of these clips varies. This data set is used toevaluate the robustness of our approach.

The retrieval performance in NHL data set and EW dataset is measured by precision/recall, which are commonlyused in information retrieval community. Precision mea-sures how precise the search results are (the number ofcorrect results divided by the number of all returnedresults). Recall measures the completeness of availablerelevant results (the number of correct results divided bythe number of results that should have been returned). Asthere is no annotated labels for VT data set to evaluate theretrieval performance, P@N, one of the popular metricsused in information retrieval, is used to measure thefraction of the top N videos retrieved that are relevant tothe users interest. We choose 20 different video shots, whichcontain different type of motions, e.g., fast or slow motionwith/without camera motion, as the set of queries and foreach query the returned videos in top N with the highestsimilarity, e.g., 3, 5, 7, and 10 will be presented to threeevaluators who determine whether the top-N returnedimages are relevant to the query video.

394 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

1. http://neic.usgs.gov/neis/epic/epic_global.html.

Fig. 5. An example of earthquake wave data.

Page 11: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

5.2 Performance Tuning

In the first experiment, we investigate two importantparameters of the proposed TSC retrieval method. The firstparameter is the number of PCs reserved in CT featuregeneration. Note in Section 4.1, we extract CT feature byusing first n0 PCs of multiple time series. The secondparameter is the reduced dimensionality of RelationDescriptor in RT in Section 4.2, i.e., r0 in (13), which is usedto capture the natural relations among the time series in aTSC. We use the synthetic NHL data set on this experiment.

We first evaluate the effect of CT feature on TSC retrievalperformance by varying the number of PCs n0. As themaximum number of PCs in CT generation is the number oftime series in TSC, and the minimum number of time seriesin the TSCs of our data set is 3, we conduct the experimentsby varying n0 from 1 to 3. The results of precision/recall forTSC retrieval are shown in Fig. 6.

We can see from Fig. 6a that the best performance isobtained when only the first PC is reserved. This is becausethe first PC summarizes sufficient general information ofmultiple patterns in TSCs, while the retrieval performanceby using more PCs may suffer from redundant and uselessinformation. Note that, when n0 > 1, we generate multiplerepresentative time series as CT features, and hence theincurred multiple time series matching may degrade overallperformance. Given relatively small number of time series ina TSC, our experimental result is consistent with the findings

in SPIRIT [15], where only 1 or 2 hidden variables aresuggested to represent the patterns of multiple time series.

Next, we investigate the performance of TSC retrieval byvarying the reduced dimensionality of Relation Descriptorin RT feature, r0, which is shown in Fig. 6b. From the figure,we can observe that as more PCs are used, the performanceis improved. This is because more information is includedin the relation descriptor of the RT. Different fromthe utilization of PCs in CT , the additional PCs increasethe dimensionality of relation vector. We can also observediminishing performance improvements as the PCs in-crease. As the most dominant information is reserved in thefirst few PCs, we can see that the performance improvementbecomes marginal when r0 is large than 5.

In the following experiments, we fix the two parametersfor performance comparison, i.e., n0 ¼ 1 and r0 ¼ 5.

5.3 Comparison with Other Methods

In this experiment, we compare the retrieval performance ofour approach with the Brute-Force matching algorithm andthe SPIRIT [15] method. The SPIRIT uses hidden variablesto represent the multiple time series that corresponds to theCT feature used in our approach. However, the SPIRIT isenhanced with adaptively choosing the number of hiddenvariables for better pattern representations. We use NHLdata set, EW data set, and VT data set for this experiment,and the results are shown in Fig. 7.

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 395

Fig. 6. Performance on parameter tuning. (a) Varying the number of PCs in CT. (b) Varying the number of PCs in RT.

Fig. 7. Performance comparison of effectiveness. (a) Precision recall on NHL data set. (b) Precision recall on EW data set.

Page 12: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

In Fig. 7a, we can see that the SPIRIT performs the worstamong the three methods on the NHL data set. This isbecause the compact representation of multiple time seriesin SPIRIT, i.e., hidden variables, can only capture thelimited pattern information, while ignore the intrinsicnatural relations between the multiple time series in theTSC, which are more valuable to distinguish different TSCs.The Brute-Force algorithm, comparing with the SPIRIT,yields better performance, because the algorithm matchesall the time series in TSC exhaustively to explore thepairwise time series similarity. This method can address thechallenge of matching different time series in TSC retrieval;however, the Brute-Force method also ignores the naturalrelations among the multiple time series in TSC. Therefore,the overall retrieval performance of Brute-Force method isalso limited.

We can see that in Fig. 7a, our proposed TSC approachusing the combination of CT and RT yields the bestperformance among the three algorithms. CT featurematches the similarity of group of time series by general-izing their characteristic patterns, and the RT informationcan capture the natural relations among time series in TSCs.The TSC approach can take both factors into effect for TSCretrieval, and hence improve the retrieval effectiveness inTSC databases. While the natural relations in a TSC cannotbe captured by the hidden variables of SPIRIT and Brute-Force method, which are important to evaluate thesimilarity between time series cliques.

The results in Fig. 7b are from the EW data set. Theseresults also show that the TSC approach yields the bestperformance, though all approaches degenerate comparedwith the results on NHL data set. This is because the real-life EW data have much noise and temporal variance, andtheir waves vary from different stations as shown in Fig. 5.Our approach, on the other side, although facing the sameproblem in generating CT, the RT generated from relationdescriptors describing the natural relations can compensatethe insufficiencies caused by data characters; hence, theCT+RT framework is more robust in different applications.

Fig. 8 shows the trajectory-based video retrieval resultsof three approaches on VT data set, where the performanceof our method is more than 10 percent better than SPIRITand Brute-Force methods. The combination of the twofeature descriptors in our approach, (CT and RT) provides a

scalable, effective, and efficient solution for TSC similaritysearch. In addition, the TSC retrieval mechanism integratesthe compact representation of multiple time series and thenatural relations among time series in cliques, and hencecan be effectively applied in various application domains.

5.4 Scalability and Robustness

In this section, we investigate the scalability and Robustnessof the proposed approaches. We first evaluate the timeefficiency for the three approaches in terms of the scalabilityof the database size in NHL data set. We vary the size of TSCdata set by utilizing 40, 80, 120, 160, 200 categories of TSCs,which results in the data size from 3,200 to 16,000 TSCs.

Fig. 9 shows the average time cost of these differentapproaches for one TSC retrieval in these data sets. We canobserve that SPIRIT and the proposed approach arecomparable and perform much better than the Brute-Forcemethod. The SPIRIT utilizes the hidden variables by PCAfor clique matching which is most efficient. Our TSCapproach explores both CT and RT for similarity computa-tion, and the extra computation on RT feature is neededcompared with SPIRIT. While the Brute-Force algorithmperforms badly by two orders of magnitude. This is becausethe Brute-Force algorithm needs to calculate all the possiblematches for each time series pair in the cliques. Note that,the Brute-Force method has to compute the similarity forAn2n1

combinations, where n1 and n2 are the numbers of timeseries in two TSCs and n1 � n2.

Table 1 shows the precision and recall of differentapproaches by varying the size on NHL and EW data sets.On NHL data set, we can see that, when the data size isrelatively small, e.g., 3,200 or 6,400, the performance ofBrute-Force is comparable with our TSC method. Byexhaustively calculating the similarity between time seriescliques, Brute-Force can yield satisfactory result on smalldata set where the cliques are more distinguishable.However, when the data size increases, our approach showssignificant advantage over the other two competitors; thesimilarity measurement based on CT and RT can success-fully identify the differences among cliques in a large TSCdatabase. On the EW data set, we can see that our methodhas the best precision when recall is 0.1 in the two-monthsubset of 1,440 TSCs, followed by the SPIRIT algorithm. Andthe precisions of all three approaches on recall larger than

396 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Fig. 8. Comparison on VT data set.

Fig. 9. Time efficiency of different approaches.

Page 13: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

0.1 are all very small, while the Brute-Force method have thebest results. When the data size grow to 10,368, five timeslarger than the subset, we can see that our methoddominates other methods, although the complete setcontains a lot of noise and some local earthquakes whosemagnitudes are lower than 6 but around 5 can be identifiedas earthquakes as well. Therefore, our method is morerobust in terms of the data size scalability. Overall, ourproposed TSC approach shows the superiority for both twoperformance metrics, i.e., efficiency and effectiveness.

We next proceed to evaluate the sensitivity to noise ofproposed method on synthetic NHL data set by producingdifferent variations of TSCs. Specifically, we vary theGaussian Noise on frequency domain from 10 to 25 percentof its origin value, and vary the time warping noise from 5to 20 percent. Note that, the default noise is 5 percent andthe results have been shown in Table 1. Table 2 demon-strates the results on precision and recall of differentapproaches. The results show the TSC’s robustness withrespect to noise in the sense that it can take sufficiently largedegree of variations. The proposed CT and RT featurevectors provide an effective and robust representation ofTSC data. Brute-Force method is more sensitive to the noisethough it compares each pair of time series from two TSCsexhaustively. The SPIRIT uses very compact representationwhich is sensitive to the noise on length and frequency;thus, the SPIRIT performs worst in this scenario although itis the most efficient.

6 CONCLUSION

In this paper, we focus on a novel Time Series Cliquessimilarity search problem. The goal is to retrieve a group oftime series with natural relations. This topic has directimpact on many real-world applications, e.g., object-basedvideo retrieval, earthquake monitoring, etc.

In order to capture the intrinsic properties of a TSC, wepropose two novel, compact structures, called CT and RTfeature vectors for TSC objects. Using a fusion of CT and RT

feature vectors, we can effectively capture the generalpattern of multiple time series and the natural relations thatexist between the time series. Most importantly, the vectorscan be used during TSC similarity search to eliminate theneed for pairwise comparisons of time series in multipleTSCs. To evaluate the effectiveness and efficiency of ourapproach, we conducted a comprehensive performanceanalysis using both synthetic and real-life data sets. To ourknowledge, the problem of time series clique similaritysearch has not been explored by previous works. Our workprovides the first and exploratory investigation of the timeseries clique similarity search problem. The proposedCT+RT framework provides a good foundation for futureresearch in time series clique similarity search for variousapplication domains.

ACKNOWLEDGMENTS

This research was supported by the National Natural ScienceFoundation of China under Grant Nos. 61073019, 60933004,60873063, and HGJ Grant No. 2011ZX01042-001-001.

REFERENCES

[1] E. Keogh, S. Chu, D. Hart, and M. Pazzani, “Segmenting TimeSeries: A Survey and Novel Approach,” Data Mining in Time SeriesDatabases, World Scientific, 2004.

[2] E.J. Keogh, S. Chu, D. Hart, and M.J. Pazzani, “An OnlineAlgorithm for Segmenting Time Series,” Proc. IEEE Int’l Conf. DataMining (ICDM), pp. 289-296, 2001.

[3] F.I. Bashir, A.A. Khokhar, and D. Schonfeld, “Real-Time MotionTrajectory-Based Indexing and Retrieval of Video Sequences,”IEEE Trans. Multimedia, vol. 9, no. 1, pp. 58-65, Jan. 2007.

[4] J.-W. Hsieh, S.-L. Yu, and Y.-S. Chen, “Motion-Based VideoRetrieval by Trajectory Matching,” IEEE Trans. Circuits andSystems for Video Technology, vol. 16, no. 3, pp. 396-409, Mar. 2006.

[5] L. Chen, M.T. Ozsu, and V. Oria, “Symbolic Representation andRetrieval of Moving Object Trajectories,” Proc. ACM SIGMM Int’lWorkshop Multimedia Information Retrieval, pp. 227-234, 2004.

[6] L. Chen, M.T. Ozsu, and V. Oria, “Robust and Fast SimilaritySearch for Moving Object Trajectories,” Proc. ACM SIGMOD Int’lConf. Management of Data (SIGMOD), pp. 491-502, 2005.

[7] Z. Zhao, B. Cui, W.H. Tok, and J. Zhao, “Efficient SimilarityMatching of Time Series Cliques with Natural Relations,” Proc.IEEE Int’l Conf. Data Eng. (ICDE), 2010.

[8] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra,“Dimensionality Reduction for Fast Similarity Search in LargeTime Series Databases,” J. Knowledge and Information Systems,vol. 3, pp. 263-286, 2000.

[9] T. Kahveci and A.K. Singh, “Variable Length Queries for TimeSeries Data,” Proc. IEEE Int’l Conf. Data Eng. (ICDE), pp. 273-282,2001.

CUI ET AL.: A FRAMEWORK FOR SIMILARITY SEARCH OF TIME SERIES CLIQUES WITH NATURAL RELATIONS 397

TABLE 1Precision/Recall by Varying the Data Size

TABLE 2Sensitivity to Noise

Page 14: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 …net.pku.edu.cn/~cuibin/Papers/2012 tkde.pdf · 2012-10-17 · A Framework for Similarity Search

[10] Y.-L. Wu, D. Agrawal, and A. El Abbadi, “A Comparison of DFTand DWT Based Similarity Search in Time-Series Databases,” Proc.Ninth Int’l Conf. Information and Knowledge Management (CIKM),pp. 488-495, 2000.

[11] T.-L. Le, A. Boucher, and M. Thonnat, “Subtrajectory-Based VideoIndexing and Retrieval,” Proc. 13th Int’l MultiMedia Modeling Conf.(MMM), pp. 418-427, 2007.

[12] K.V. Ravi Kanth, D. Agrawal, and A. Singh, “DimensionalityReduction for Similarity Searching in Dynamic Databases,”SIGMOD Record, vol. 27, no. 2, pp. 166-176, 1998.

[13] F. Korn, H.V. Jagadish, and C. Faloutsos, “Efficiently SupportingAd Hoc Queries in Large Data Sets of Time Sequences,” Proc.ACM SIGMOD Int’l Conf. Management of Data, pp. 289-300, 1997.

[14] C.-W. Su, H.-Y. Liao, H.-R. Tyan, C.-W. Lin, D.-Y. Chen, and K.-C.Fan, “Motion Flow-Based Video Retrieval,” IEEE Trans. Multi-media, vol. 9, no. 6, pp. 1193-1201, Oct. 2007.

[15] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming PatternDiscovery in Multiple Time-Series,” Proc. 31st Int’l Conf. Very LargeData Bases (VLDB), pp. 697-708, 2005.

[16] Y. Sakurai, S. Papadimitriou, and C. Faloutsos, “Braid: StreamMining through Group Lag Correlations,” Proc. ACM SIGMODInt’l Conf. Management of Data, pp. 599-610, 2005.

[17] K. Deng, H. Xu, S. Sadiq, Y. Lu, G. Fung, and H. Shen, “ProcessingGroup Nearest Group Query,” Proc. Int’l Conf. Data Eng. (ICDE),2009.

[18] J.F. Allen, “Maintaining Knowledge about Temporal Intervals,”Comm. ACM, vol. 26, no. 11, pp. 832-843, 1983.

[19] M. Fleischman, P. Decamp, and D. Roy, “Mining TemporalPatterns of Movement for Video Content Classification,” Proc.ACM Workshop Multimedia Information Retireval (MIR), pp. 183-192,2006.

[20] F. Morchen and A. Ultsch, “Efficient Mining of UnderstandablePatterns from Multivariate Interval Time Series,” Data Mining andKnowledge Discovery, vol. 15, no. 2, pp. 181-215, 2007.

[21] A.J.T. Lee, P. Yu, H.-P. Chiu, and R.-W. Hong, “3D Z-String: ANew Knowledge Structure to Represent Spatio-Temporal Rela-tions between Objects in a Video,” Pattern Recognition Letters,vol. 26, pp. 2500-2508, 2005.

[22] Y. Cai and N. Raymond, “Indexing Spatio-Temporal Trajectorieswith Chebyshev Polynomials,” Proc. ACM SIGMOD Int’l Conf.Management of Data, pp. 599-610, 2004.

[23] F. Wang, Y.-G. Jiang, and C.-W. Ngo, “Video Event DetectionUsing Motion Relativity and Visual Relatedness,” Proc. ACM Int’lConf. Multimedia, 2008.

[24] E. Keogh, “Doing Something Useless Slightly Faster: The State ofthe Art in Time Series Data Mining?” Proc. 35th Symp. Interface ofComputing Science and Statistics, 2003.

[25] M. Vlachos, J. Lin, E. Keogh, and D. Gunopulos, “A Wavelet-Based Anytime Algorithm for k-Means Clustering of Time Series,”Proc. Third SIAM Int’l Conf. Data Mining, 2003.

[26] C.-W. Su, H.-Y. Liao, H.-R. Tyan, C.-W. Lin, D.-Y. Chen, and K.-C.Fan, “Motion Flow-Based Video Retrieval,” IEEE Trans. Multi-media, vol. 9, no. 6, pp. 1193-1201, Oct. 2007.

Bin Cui is a professor in the Department ofComputer Science, Peking University. Hismajor research interests include index techni-ques, multi/high databases, multimedia retrie-val, and database performance. He is currentlyan associate editor of IEEE Transactions onKnowledge and Data Engineering, and hasserved on program committees of severaldatabase conferences, including SIGMOD,VLDB, and ICDE. He is a member of the

ACM, and a senior member of the IEEE.

Zhe Zhao received the BS degree from theDepartment of Computer Science, Peking Uni-versity, in 2008. He is currently working towardthe master’s degree in the Database Laboratory,Department of Computer Science, Peking Uni-versity, China. His research interests includetime series database, multimedia database, andweb mining. He is a student member of theACM.

Wee Hyong Tok received the PhD degree fromthe School of Computing, National University ofSingapore in 2009. He is a program managerwith the Microsoft SQL Server China Researchand Development group (SQLCRD). His re-search interest includes progressive algorithmsfor data streams, data management in socialnetworks, and database system architecture forthe cloud.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

398 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012