Theoretical Computer Science 533 (2014) 26–36


Closest periodic vectors in Lp spaces

Amihood Amir a,b,1, Estrella Eisenberg a, Avivit Levy c,d,∗,2, Noa Lewenstein e

a Department of Computer Science, Bar-Ilan University, Ramat-Gan 52900, Israel
b Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States
c Department of Software Engineering, Shenkar College, 12 Anna Frank, Ramat-Gan, Israel
d CRI, Haifa University, Mount Carmel, Haifa 31905, Israel
e Netanya College, Netanya, Israel


Article history: Received 13 March 2013; Received in revised form 4 March 2014; Accepted 7 March 2014. Communicated by M. Kiwi.

Keywords: String algorithms; Approximate periodicity; Closest vector

The problem of finding the period of a vector V is central to many applications. Let V′ be a periodic vector closest to V under some metric. We seek this V′, or more precisely we seek the smallest period that generates V′. In this paper we consider the problem of finding the closest periodic vector in Lp spaces. The measures of “closeness” that we consider are the metrics in the different Lp spaces. Specifically, we consider the L1, L2 and L∞ metrics. In particular, for a given n-dimensional vector V, we develop O(n²) time algorithms (a different algorithm for each metric) that construct the smallest period that defines such a periodic n-dimensional vector V′. We call that vector the closest periodic vector of V under the appropriate metric. We also show (three) Õ(n) time constant approximation algorithms for the period of the approximate closest periodic vector.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Exact data periodicity has been amply researched over the years [17]. Linear time algorithms for exploring the periodic nature of data represented as strings were suggested (e.g. [11]). Multidimensional periodicity [2,13,19] and periodicity in parameterized strings [7] were also explored. In addition, periodicity has played a role in efficient parallel string algorithms [12,3,4,8,9]. Many phenomena in the real world have a particular type of event that repeats periodically during a certain period of time. The ubiquity of cyclic phenomena in nature, in such diverse areas as Astronomy, Geology, Earth Science, Oceanography, Meteorology, Biological Systems, the Genome, Economics, and more, has led to a recent interest in periodicity. Examples of highly periodic events include road traffic peaks, load peaks on web servers, monitoring events in computer networks and many others. Finding periodicity in real-world data often leads to useful insights, because it sheds light on the structure of the data, and gives a basis to predict future events. Moreover, in some applications periodic patterns can point out a problem: in a computer network, for example, repeating error messages can indicate a misconfiguration, or even a security intrusion such as a port scan [18].

However, real data generally contain errors, either because they are inherent in the data, or because they are introduced by the data gathering process. Nevertheless, it is still valuable to detect and utilize the underlying periodicity. This calls for the notion of approximate periodicity. Given a data vector, we may not be confident of the measurement or suspect the periodic process to be inexact. We may, therefore, be interested in finding the closest periodic nature of the data, i.e., what is the smallest period of the closest periodic vector to the given input vector, where the term closest means that it has the smallest number of errors. It is natural to ask if such a period can be found efficiently. The error cause varies with the different phenomena. This fact is formalized by considering different metrics to define the error.

* Corresponding author.
E-mail addresses: [email protected] (A. Amir), [email protected] (A. Levy), [email protected] (N. Lewenstein).
1 Partly supported by NSF Grant CCR-09-04581 and ISF Grant 347/09.
2 Partly supported by ISF Grant 347/09.
http://dx.doi.org/10.1016/j.tcs.2014.03.019
0304-3975/© 2014 Elsevier B.V. All rights reserved.

Different aspects of approximate periodicity, inspired by various applications, were studied in the literature. In some applications, such as computational biology, the search is for periodic patterns (or, as coined in the literature, approximate multiple tandem repeats) and not necessarily full periodicity (e.g. [15]). Another example of such applications is monitoring events in computer networks, which inspired [14] to study the problem of finding approximate arithmetic progressions in a sequence of numbers. In many other applications, such as Astronomy, Geology, Earth Science, Oceanography and Meteorology, the studied phenomena have a full (inexact) periodic nature that should be discovered. The term approximate periodicity is typically used only for such applications, where terms such as approximate periodic patterns are used otherwise. In [5] approximate periodicity was studied under two metrics: the Hamming distance and the swap distance. Both these metrics are pseudo-local, which intuitively means that any error causes at most a constant number of mismatches (see [6]). In [6] it was shown that for a guaranteed small distance in pseudo-local metrics, the original cycle can be detected and corrected in the corrupted data.

The focus of this paper is on vector spaces. The common and natural metrics for vector spaces are L1, L2 and L∞. These metrics are not pseudo-local and, therefore, the methods of [6] do not apply. In this paper we tackle the problem of finding the period of the closest periodic vector under the L1, L2, and L∞ metrics. Specifically, given a vector V ∈ ℝⁿ and a metric L1, L2, or L∞, we seek a natural number p ≤ n/2 and a period P of length p, such that the distance between P^⌊n/p⌋ P′ and V is smallest, where P^i denotes P concatenated to itself i times and P′ is the prefix of P of length n − p·⌊n/p⌋. We prove the following theorems:

Theorem 1. Given a vector V ∈ ℝⁿ, for each of the metrics L1, L2 or L∞, a vector P which is the period of the closest periodic vector under the metric can be found in O(n²) time.

Theorem 2. Given a vector V ∈ Σⁿ, where Σ = {1, . . . , |Σ|}, then:

1. The period P of the closest periodic vector of V under the L2 metric can be approximated to a factor of √6 in O(n log n) time.
2. For any ε > 0, the period P of the closest periodic vector of V under the L1 metric can be approximated to a factor of 3 + ε in O((1/ε²) n log n log |Σ|) time.
3. For any ε > 0, the period P of the closest periodic vector under the L∞ metric can be approximated to a factor of 3 + ε in O((1/ε) n log n log |Σ|) time.

Remark. Note that if ε is a constant and |Σ| = O(n^c) for some constant c > 0, then the algorithms of Theorem 2 have time complexity Õ(n).

The proof of Theorem 1 is described in Section 3 and the proof of Theorem 2 is described in Section 4.

2. Preliminaries

In this section we give basic definitions of periodicity and related issues as well as a formal definition of the problem.

Definition 1. Let V = 〈V[1], V[2], . . . , V[n]〉 be an n-dimensional vector. A sub-vector of V is a vector T = 〈V[i], . . . , V[j]〉, where 1 ≤ i ≤ j ≤ n. Clearly, the dimension of T is j − i + 1.

The sub-vector T = 〈V[1], . . . , V[j]〉 is called a prefix of V and the sub-vector T = 〈V[i], . . . , V[n]〉 is called a suffix of V. For two vectors V = 〈V[1], V[2], . . . , V[n]〉, T = 〈T[1], T[2], . . . , T[m]〉 we say that the vector R = 〈V[1], V[2], . . . , V[n], T[1], T[2], . . . , T[m]〉 is the concatenation of V and T.

Definition 2. Let V be a vector. Denote by |V| the dimension of V and let |V| = n. V is called periodic if V = P^i pref(P), where i ∈ ℕ, i ≥ 2, P is a sub-vector of V such that 1 ≤ |P| ≤ n/2, P^i is the concatenation of P to itself i times, and pref(P) is a prefix of P. The smallest such sub-vector P is called the period of V. If V is not periodic, it is called aperiodic.

Notation. Throughout the paper we use p to denote a period dimension and P the period vector, i.e., |P| = p.

Definition 3. Let P be a vector of dimension p. Let n ∈ ℕ be such that 2·p ≤ n. The vector V_P is defined to be a periodic vector of dimension n with period P, i.e., V_P = P^⌊n/p⌋ pref(P), where pref(P) is the prefix of P of dimension n − ⌊n/p⌋·p.


Example. Given the vector P = 〈A, B, C〉 of dimension p = 3, let n = 10. Then V_P = 〈A, B, C, A, B, C, A, B, C, A〉.
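For concreteness, the construction of V_P in Definition 3 can be sketched in Python (a minimal illustration; the function name is ours, not the paper's):

```python
def periodic_vector(P, n):
    """Build V_P = P^floor(n/p) pref(P): P repeated floor(n/p) times,
    followed by the prefix of P that pads the dimension up to n."""
    p = len(P)
    assert 2 * p <= n, "Definition 3 requires 2*p <= n"
    reps = n // p                        # floor(n/p) full copies of P
    return P * reps + P[: n - reps * p]  # leftover prefix of length n mod p
```

For P = 〈A, B, C〉 and n = 10 this reproduces the example above.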

Definition 4. Let V be an n-dimensional vector over ℝ. Let d be a metric defined on ℝⁿ. V is called α-close to periodic under the metric d, if there exists a vector P over ℝᵖ, p ∈ ℕ, p ≤ n/2, such that d(V_P, V) ≤ α. The vector P is called an α-close period of V.

The Problem Definition. The problem is formally defined below.

Definition 5. Given a metric d, the Closest Periodic Vector Problem under d is the following:

INPUT: Vector V of dimension n over ℝ.
OUTPUT: The period P of the closest periodic vector of V under the metric d, and α such that P is an α-close period of V under d.

We will also need the following definition.

Definition 6. Given a metric d, let P be the vector of dimension p such that d(V, V_P) is minimal over all possible vectors of dimension p. We call P the p-dimension close period under d.

2.1. Lipsky–Porat approximated pattern matching with the L1, L2 and L∞ metrics

The approximation algorithms we design in Section 4 use the approximated pattern matching algorithms of Lipsky and Porat [16]. For the sake of completeness, we briefly state their results and their basic technique. The problems solved by the Lipsky–Porat algorithms are the following:

Definition 7.

– The String Matching with L2-Distance problem is:
Input: Text vector T = t1, t2, . . . , tn and P = p1, p2, . . . , pm, where t_i, p_i ∈ {1, . . . , |Σ|}.
Output: Results vector O[1, . . . , n − m + 1], where for every i, O[i] = √(∑_{j=1}^{m} (t_{i+j−1} − p_j)²).

– The Approximated String Matching with L1-Distance problem is:
Input: Text vector T = t1, t2, . . . , tn and P = p1, p2, . . . , pm, where t_i, p_i ∈ {1, . . . , |Σ|}, and 0 < ε < 1.
Output: Results vector O′[1, . . . , n − m + 1], s.t. O[i] ≤ O′[i] ≤ (1 + ε)O[i] for every i, where O[i] = ∑_{j=1}^{m} |t_{i+j−1} − p_j|.

– The Approximated String Matching with L∞-Distance problem is:
Input: Text vector T = t1, t2, . . . , tn and P = p1, p2, . . . , pm, where t_i, p_i ∈ {1, . . . , |Σ|}, and 0 < ε < 1.
Output: Results vector O′[1, . . . , n − m + 1], s.t. O[i] ≤ O′[i] ≤ (1 + ε)O[i] for every i, where O[i] = max_{j=1}^{m} |t_{i+j−1} − p_j|.

The basic tool used in their algorithms is convolution, defined as follows:

Definition 8. The convolution vector of two vectors T, P, |T| = n, |P| = m, denoted by T ⊗ P, is defined as the vector w such that:

w[i] = ∑_{j=1}^{m} T[i + j − 1] · P[j].

The convolution can be computed in O(n log m) time, in a computational model with word size O(log m), by using the Fast Fourier Transform (FFT) [10]. Lipsky and Porat [16] show that the efficient computation of convolutions can be utilized in order to achieve efficient algorithms for the above problems. Specifically, they prove the following:

– The string matching with L2-distance problem can be solved in time O(n log m).
– The approximated string matching with L1-distance problem can be solved in time O((1/ε²) n log m log |Σ|) with approximation factor 1 + ε, for any given ε.
– The approximated string matching with L∞-distance problem can be solved in time O((1/ε) n log m log |Σ|) with approximation factor 1 + ε, for any given ε.

We combine these results with an adaptation from [5] of the self-convolution vector, which in this paper we define slightly differently, as follows:


Definition 9. Let V be a vector of length n over alphabet Σ = {1, . . . , |Σ|}, and let V̄ be the vector V concatenated with n 0's. The self-convolution vector of V, v, is defined for every i, 1 ≤ i ≤ n, by

v[i] = ∑_{j=1}^{n} V̄[i + j − 1] · V[j].

Remark. Note that in the problems of Lipsky–Porat we can take T = V̄ and P = V and use the self-convolution of V instead of the convolution T ⊗ P in their algorithms. This way we get that for t = 1, 2, ∞, for every 1 ≤ s ≤ n, the result vector of their Lt-algorithm in position s gives the (maybe approximated) d_{Lt}(V_pre, V_suf), where V_pre (respectively, V_suf) is the prefix (respectively, suffix) of V of length n − s + 1, and d_{Lt} is the Lt-distance.

3. Closest periodic vectors in L1, L2 and L∞ spaces

3.1. Closest periodic vector in L2 space

We first examine the case where the metric d is L2. Given a vector V ∈ ℝⁿ, we seek another vector P, 1 ≤ |P| ≤ n/2, which minimizes the L2 distance between V and V_P. Formally,

INPUT: Vector V ∈ ℝⁿ.
OUTPUT: P ∈ ℝᵖ, 1 ≤ p ≤ n/2, minimizing d_{L2}(V_P, V) = √(∑_{i=1}^{n} (V_P[i] − V[i])²).

We begin by studying the case of a monochromatic vector and showing that it is easy to find a monochromatic vector closest to a given vector under the L2 metric. A monochromatic vector refers to a vector containing the same scalar in all the coordinates, namely for a scalar x the monochromatic vector is 〈x, x, . . . , x〉. We denote by xᵏ the k-dimensional vector 〈x, x, . . . , x〉. The well-known Lemma 1 (e.g. [20]) is the cornerstone of our method.

Lemma 1. Let V ∈ ℝᵏ. Then, the scalar x such that d_{L2}(V, xᵏ) is minimal can be found in time O(k). Moreover, x equals the average of V[1], . . . , V[k].

Using Lemma 1, we can now show that, given a dimension p, the p-dimension close period of V under the L2 metric can be computed in linear time.

Lemma 2. Let V ∈ ℝⁿ and let p be a dimension, p ≤ n/2. Then, P ∈ ℝᵖ such that d_{L2}(V_P, V) is minimal over all vectors of dimension p can be found in O(n) time.

Proof. Let j = n − p⌊n/p⌋, and let P = 〈x1, x2, . . . , xp〉. By the definition of the L2 metric (working with squared distances, which are minimized by the same P) we have that:

d_{L2}(V_P, V)² = ∑_{h=0}^{⌊n/p⌋−1} d_{L2}(〈x1, x2, . . . , xp〉, 〈V[h·p+1], V[h·p+2], . . . , V[h·p+p]〉)² + d_{L2}(〈x1, . . . , xj〉, 〈V[⌊n/p⌋p+1], V[⌊n/p⌋p+2], . . . , V[n]〉)²

= ∑_{i=1}^{j} d_{L2}(x_i^{⌊n/p⌋+1}, 〈V[i], V[i+p], V[i+2p], . . . , V[i + ⌊n/p⌋p]〉)² + ∑_{i=j+1}^{p} d_{L2}(x_i^{⌊n/p⌋}, 〈V[i], V[i+p], V[i+2p], . . . , V[i + (⌊n/p⌋ − 1)p]〉)².

Hence, it follows from Lemma 1 that P = 〈x1, . . . , xp〉, where for 1 ≤ i ≤ j we have:

x_i = (∑_{h=0}^{⌊n/p⌋} V[i + h·p]) / (⌊n/p⌋ + 1),

and for j < i ≤ p we have:

x_i = (∑_{h=0}^{⌊n/p⌋−1} V[i + h·p]) / ⌊n/p⌋.

Moreover, each x_i, 1 ≤ i ≤ p, can be found in time O(n/p). The lemma follows. □


We can now use the above in order to find the period of the closest periodic vector under the L2 metric. According to Lemma 2, the p-dimension close period for any given dimension p can be found in O(n) time. Hence, we can perform a brute-force search over every dimension between 1 and n/2 and check for the best over all p-dimension close periods, yielding our desired period P. Since finding the p-dimension close period for a given dimension takes O(n) time and there are O(n) different potential dimensions, the time for this algorithm is O(n²). This proves Theorem 1 for the L2 metric.
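The resulting O(n²) procedure can be sketched as follows (a Python illustration under our own naming; 0-based indexing, so residue class i collects V[i], V[i+p], . . .):

```python
import math

def close_period_l2(V, p):
    """Lemma 2: the p-dimension close period averages each residue class."""
    return [sum(V[i::p]) / len(V[i::p]) for i in range(p)]

def dist_l2(V, P):
    """d_{L2}(V_P, V), where V_P is P repeated up to dimension n."""
    p = len(P)
    return math.sqrt(sum((V[i] - P[i % p]) ** 2 for i in range(len(V))))

def closest_periodic_l2(V):
    """Theorem 1 for L2: brute force over all period dimensions 1..n/2."""
    best_P, best_d = None, math.inf
    for p in range(1, len(V) // 2 + 1):
        P = close_period_l2(V, p)
        d = dist_l2(V, P)
        if d < best_d:
            best_P, best_d = P, d
    return best_P, best_d
```

Each of the O(n) candidate dimensions takes O(n) work, matching the O(n²) bound.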

3.2. Closest periodic vector in L∞ space

We now examine the case where the metric is L∞. Given a vector V ∈ ℝⁿ, we seek another vector P, 1 ≤ |P| ≤ n/2, which minimizes the L∞ distance between V and V_P. Formally,

INPUT: Vector V ∈ ℝⁿ.
OUTPUT: P ∈ ℝᵖ, 1 ≤ p ≤ n/2, minimizing d_{L∞}(V_P, V) = max_{i=1}^{n} |V_P[i] − V[i]|.

As in Subsection 3.1, we use the fact that the closest monochromatic vector under the L∞-metric can be easily found [20].

Lemma 3. Let V ∈ ℝᵏ. The scalar x such that d_{L∞}(V, xᵏ) is minimal can be found in time O(k). Moreover, x equals the average of the maximal and minimal values among V[1], . . . , V[k].

From Lemma 3 we get that, given a dimension p, finding the p-dimension close period of V under the L∞ metric can be implemented efficiently. The framework of the proof is similar to that of Lemma 2.

Lemma 4. Let V ∈ ℝⁿ and let p be a dimension, p ≤ n/2. Then, P ∈ ℝᵖ such that d_{L∞}(V_P, V) is minimal over all vectors of dimension p can be found in O(n) time.

From Lemma 4 we deduce, similarly to the proof of Theorem 1 for L2, that finding the closest periodic vector to V under the L∞ metric can be implemented efficiently. This concludes the proof of Theorem 1 for the L∞ metric.
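A minimal L∞ counterpart of the L2 sketch (again our own illustration): by Lemma 3 the per-class optimum is the midrange, and the same brute force over all p ≤ n/2 yields the O(n²) algorithm.

```python
def close_period_linf(V, p):
    """Lemmas 3-4 (sketch): per residue class, the L-infinity-optimal
    scalar is the midrange (average of the class minimum and maximum)."""
    return [(min(V[i::p]) + max(V[i::p])) / 2 for i in range(p)]

def dist_linf(V, P):
    """d_{L-infinity}(V_P, V), where V_P is P repeated up to dimension n."""
    p = len(P)
    return max(abs(V[i] - P[i % p]) for i in range(len(V)))
```

Trying every p and keeping the smallest dist_linf value gives the period of Theorem 1 for L∞.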

3.3. Closest periodic vector in L1 space

We now turn to the case where the metric is L1. Given a vector V ∈ ℝⁿ, we seek another vector P, 1 ≤ |P| ≤ n/2, which minimizes the L1 distance between V and V_P. Formally,

INPUT: Vector V ∈ ℝⁿ.
OUTPUT: P ∈ ℝᵖ, 1 ≤ p ≤ n/2, minimizing d_{L1}(V_P, V) = ∑_{i=1}^{n} |V_P[i] − V[i]|.

Once again we focus on finding a monochromatic vector, xᵏ = 〈x, x, . . . , x〉, closest to a given vector in ℝᵏ. For the L1 metric, x is the median value of the input vector values, as the following well-known lemma shows [20].

Lemma 5. Let V ∈ ℝᵏ. The scalar x such that d_{L1}(V, xᵏ) is minimal is the median of V[1], . . . , V[k].

Given a dimension p, the p-dimension close period of V under the L1 metric can be computed efficiently, using Lemma 5. The framework of the proof is similar to that of Lemma 2.

Lemma 6. Let V ∈ ℝⁿ and let p be a dimension, p ≤ n/2. Then, P ∈ ℝᵖ such that d_{L1}(V_P, V) is minimal over all vectors of dimension p can be found in time p × T(⌈n/p⌉), where T(⌈n/p⌉) is the time to construct the ⌈n/p⌉-dimension monochromatic vector closest to a given ⌈n/p⌉-dimension vector.

Given Lemma 6, we need to compute the closest monochromatic vector xᵏ to a vector V ∈ ℝᵏ. Since, by Lemma 5, x is the median of V[1], . . . , V[k], which can be computed in O(k) time, this gives an O(n) computation for a given period dimension p (by Lemma 6). This gives an overall O(n²) time algorithm for finding the closest periodic vector in the L1 distance. This concludes the proof of Theorem 1 for the L1 metric.
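The L1 case differs from the previous sketches only in the per-class optimum, which by Lemma 5 is a median (our illustration; for an even class size any value between the two middle elements is optimal, and we take the lower one):

```python
def close_period_l1(V, p):
    """Lemmas 5-6 (sketch): per residue class, a median minimizes L1."""
    def lower_median(c):
        s = sorted(c)
        return s[(len(s) - 1) // 2]  # lower middle element
    return [lower_median(V[i::p]) for i in range(p)]

def dist_l1(V, P):
    """d_{L1}(V_P, V), where V_P is P repeated up to dimension n."""
    p = len(P)
    return sum(abs(V[i] - P[i % p]) for i in range(len(V)))
```

(Sorting is used here for brevity; a linear-time selection algorithm gives the O(k) median bound cited above.)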

4. Fast approximations of the closest periodic vector in L1, L2 and L∞ spaces

We have seen that the closest periodic vector, under the L1, L2 and L∞ metrics, of an n-dimensional vector can be found in time O(n²). In this section we show that the closest periodic vector can be approximated in almost linear time (with logarithmic factors). We need the following definition:


Definition 10. Let V be an n-dimensional vector. The s-shift vector of V, denoted by V^(s), is the concatenation of the s-length prefix of V and the (n − s)-length prefix of V, i.e., V^(s) is the n-dimensional vector whose elements are: V^(s)[i] = V[i], for i = 1, . . . , s, and V^(s)[i] = V[i − s], for i = s + 1, . . . , n.

Example. Let V = 〈1, 2, 3, 4, 5, 6, 7〉. Then V^(3) = 〈1, 2, 3, 1, 2, 3, 4〉.

The interesting property of the shift vector V^(s) is the following.

Observation 1. If V_pre is the prefix of V of length n − s and V_suf is the suffix of V of length n − s, then: d_{Lt}(V, V^(s)) = d_{Lt}(V_pre, V_suf), for t = 1, 2, ∞.

Example. For the above V and V^(3), d_{L1}(V, V^(3)) = |1−1| + |2−2| + |3−3| + |4−1| + |5−2| + |6−3| + |7−4| = d_{L1}(〈1, 2, 3, 4〉, 〈4, 5, 6, 7〉).
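Observation 1 is easy to check in code (a small Python demonstration with our own helper names):

```python
def shift_vector(V, s):
    """V^(s): the s-length prefix of V followed by the (n-s)-length prefix."""
    return V[:s] + V[: len(V) - s]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

V = [1, 2, 3, 4, 5, 6, 7]
# d(V, V^(3)) equals d(prefix, suffix), both of length n - 3 (Observation 1):
# positions 1..3 agree by construction, and position i > 3 compares
# V[i] with V[i - 3], exactly the prefix/suffix comparison.
assert shift_vector(V, 3) == [1, 2, 3, 1, 2, 3, 4]
assert l1(V, shift_vector(V, 3)) == l1(V[:4], V[3:]) == 12
```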

Remark. From Subsection 2.1 and Observation 1, we get that d_{Lt}(V, V^(s)) for all 1 ≤ s ≤ n can be computed or approximated to a factor of 1 + ε using the algorithms of [16] in times O(n log n), O((1/ε²) n log n log |Σ|) and O((1/ε) n log n log |Σ|), for t = 2, 1, ∞, respectively.

4.1. Approximation in L2 space

We begin by showing that the closest periodic vector under the L2 metric can be approximated in time O(n log n). Our algorithm will find a period length pmin for which the distance of the pmin-dimension close period will be no larger than √6 times the distance of the closest periodic vector.

Let A_p = ∑_{i=1}^{⌈n/p⌉−1} d_{L2}(V, V^(i·p))², where 1 ≤ p ≤ n/2. Note that, given d_{L2}(V, V^(s)) for all 1 ≤ s ≤ n, we can compute A_p for all 1 ≤ p ≤ n/2 in time O(n log n). Lemma 7 below uses A_p to find the period length that approximates the closest periodic vector of V.

Lemma 7. Let V be a vector of dimension n, and let 1 ≤ p ≤ n/2. Let V_P be the n-dimensional vector for which the p-dimensional vector P is the p-dimension close period of V under the L2 metric. Then

(a) (1/√(2(⌈n/p⌉ − 1))) · √A_p ≤ d_{L2}(V, V_P)
(b) d_{L2}(V, V_P) ≤ √(2/⌊n/p⌋) · √A_p

Proof. The key observation is that p splits V into ⌊n/p⌋ p-tuples, T1, . . . , T_{⌊n/p⌋}, and, possibly, one more h-tuple, T_{⌊n/p⌋+1}, where h < p. Henceforth in this proof we denote by d(V, W) the L2 distance squared, i.e.,

d(〈V[1], . . . , V[n]〉, 〈W[1], . . . , W[n]〉) = ∑_{i=1}^{n} (V[i] − W[i])².

Note that although d(V, W) is not a metric and, therefore, the triangle inequality does not hold, we can show another inequality. Let V1, V2, V3 be vectors of any given dimension. Denote a = d_{L2}(V1, V2), b = d_{L2}(V2, V3) and c = d_{L2}(V1, V3). We want to show the relation between d(V1, V2) + d(V2, V3) = a² + b² and d(V1, V3) = c². The worst case is when c = a + b, for which (a + b)² = a² + b² + 2ab ≤ 2(a² + b²) by the generalized means inequality (see [1]). We, therefore, have the following observation:

Observation 2. Let V1, V2, V3 be vectors of any given dimension. Then:

d(V1, V3) ≤ 2[d(V1, V2) + d(V2, V3)].

Now, d(V, V^(p)) = ∑_{i=1}^{⌈n/p⌉−1} d(T_i, T_{i+1}), where the last term might be a distance of an h-tuple and a p-tuple, defined as the distance of the h-tuple with the h-length prefix of the p-tuple.

Lower Bound. We have that d(V, V_P) = ∑_{i=1}^{⌈n/p⌉} d(T_i, P). Rearranging the summations and multiplying by ⌈n/p⌉ − 1 gives:

(⌈n/p⌉ − 1) d(V, V_P) = ∑_{1≤i<j≤⌈n/p⌉} [d(T_i, P) + d(T_j, P)].


By Observation 2 we have that:

2[d(T_i, P) + d(T_j, P)] ≥ d(T_i, T_j), for 1 ≤ i < j ≤ ⌈n/p⌉.

Therefore,

∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) ≤ 2 ∑_{1≤i<j≤⌈n/p⌉} [d(T_i, P) + d(T_j, P)] = 2(⌈n/p⌉ − 1) d(V, V_P).

However,

A_p = ∑_{i=1}^{⌈n/p⌉−1} d(V, V^(ip)) = ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) ≤ 2(⌈n/p⌉ − 1) d(V, V_P).

Dividing by 2(⌈n/p⌉ − 1) and taking the square roots of both sides gives inequality (a).

Upper Bound. Again, d(V, V_P) = ∑_{i=1}^{⌈n/p⌉} d(T_i, P) (recall that the last term might be a distance of an h-tuple and a p-tuple, defined as the distance of the h-tuple with the h-length prefix of the p-tuple). By the definition of close periodic vector we know that d(V, V_P) ≤ d(V, V_{T_i}) for all 1 ≤ i ≤ ⌊n/p⌋. Therefore,

⌊n/p⌋ d(V, V_P) ≤ ∑_{i=1}^{⌊n/p⌋} d(V, V_{T_i}).

Rewriting the summation gives:

∑_{i=1}^{⌊n/p⌋} d(V, V_{T_i}) = 2 ∑_{1≤i<j≤⌊n/p⌋} d(T_i, T_j) + ∑_{i=1}^{⌊n/p⌋} d(T_i, T_{⌊n/p⌋+1}).

Note that the last term in the summation is needed only if p does not divide n, i.e., if ⌊n/p⌋ ≠ ⌈n/p⌉. Now,

2A_p = 2 ∑_{i=1}^{⌈n/p⌉−1} d(V, V^(ip)) = 2 ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j).

Thus,

2 ∑_{1≤i<j≤⌊n/p⌋} d(T_i, T_j) + ∑_{i=1}^{⌊n/p⌋} d(T_i, T_{⌊n/p⌋+1}) ≤ 2 ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) = 2A_p.

It follows that:

⌊n/p⌋ d(V, V_P) ≤ ∑_{i=1}^{⌊n/p⌋} d(V, V_{T_i}) ≤ 2A_p, i.e., d(V, V_P) ≤ (2/⌊n/p⌋) A_p.

Taking the square root of both sides we get inequality (b). □

Lemma 8 is needed for the proof of Lemma 9.

Lemma 8 (Concatenation Lemma). Let V ∈ ℝⁿ. Let P be the p-dimension close period of V, p ≤ n/4, and let V_P be the n-dimensional vector for which P is the p-dimensional period. Let P′ be a 2p-dimension close period of V and V_{P′} the n-dimensional vector for which P′ is a 2p-dimensional period. Then d_{Lt}(V, V_{P′}) ≤ d_{Lt}(V, V_P), for t = 1, 2, ∞.

Proof. Clearly, if P′′ is the concatenation of P with itself, we get d_{Lt}(V, V_{P′′}) = d_{Lt}(V, V_P), t = 1, 2, ∞. Thus the distance of the 2p-dimension close period of V is no more than that value. □

Lemma 9. Given a vector V ∈ ℝⁿ, let pmin, n/4 < pmin ≤ n/2, be the dimension for which A_{pmin} is smallest. Then the pmin-dimension close period Pmin approximates the period P of the closest periodic vector of V to within a factor of √6.


Proof. Lemma 8 allows us to consider only vectors whose length is in the range (n/4, n/2]. Let Pmin be the pmin-dimension close period of V, and let P, of length p, be the period of the closest periodic vector of V. By the definition of the closest periodic vector we have: d_{L2}(V, V_P) ≤ d_{L2}(V, V_{Pmin}). By Lemma 7(b) and the fact that pmin ≤ n/2 we know that: d_{L2}(V, V_{Pmin}) ≤ √A_{pmin}. Now, because pmin was chosen as the one giving the minimum A_p, we have: √A_{pmin} ≤ √A_p. Also, because of Lemma 7(a) and the fact that p > n/4, which implies that √(⌈n/p⌉ − 1) ≤ √3, we get: √(A_p/6) ≤ d_{L2}(V, V_P). We conclude that d_{L2}(V, V_P) ≤ d_{L2}(V, V_{Pmin}) ≤ √6 · d_{L2}(V, V_P). □

We are now ready to present the approximation algorithm.

Approximation Algorithm for L2.

1. Compute A_p for all p = ⌈n/4⌉, . . . , ⌊n/2⌋.
2. Choose pmin for which A_{pmin} is smallest.
3. Compute the pmin-dimension close period of V.

Time. As explained above, we get from [16] that all A_p can be computed in time O(n log n); choosing the minimum is done in linear time, as is constructing the pmin-dimension close period. Thus the total time is O(n log n).
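The three steps above can be sketched end-to-end (our own illustration, assuming n ≥ 8 so that (n/4, n/2] contains an integer; for clarity the shift distances are computed naively in O(n²) here, whereas the paper obtains them all in O(n log n) via the convolution results of [16]):

```python
import math

def approx_period_l2(V):
    """sqrt(6)-approximation sketch: pick pmin in (n/4, n/2] minimizing
    A_p = sum_i d^2(V, V^(i*p)), then return the pmin-dimension close
    period (residue-class averages, Lemma 2)."""
    n = len(V)

    def A(p):
        # Observation 1: d^2(V, V^(s)) = d^2(prefix, suffix) of length n - s.
        return sum(
            sum((x - y) ** 2 for x, y in zip(V[: n - i * p], V[i * p :]))
            for i in range(1, math.ceil(n / p))
        )

    pmin = min(range(n // 4 + 1, n // 2 + 1), key=A)
    return [sum(V[i::pmin]) / len(V[i::pmin]) for i in range(pmin)]
```

On an exactly periodic input the smallest A_p in the candidate range is 0 and the true period is recovered.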

We have, therefore, proven the first part of Theorem 2.

4.2. Approximation in L1 space

We now describe a fast approximation of the closest periodic vector in L1 space. We first need a bounding lemma, similar to Lemma 7.

Let B_p = ∑_{i=1}^{⌈n/p⌉−1} d_{L1}(V, V^(i·p)), where 1 ≤ p ≤ n/2. Note that, given d_{L1}(V, V^(s)) for all 1 ≤ s ≤ n, we can compute B_p for all 1 ≤ p ≤ n/2 in time O(n log n).

Lemma 10. Let V be a vector of dimension n, and let 1 ≤ p ≤ n/2. Let V_P be the n-dimensional vector for which the p-dimensional vector P is the p-dimension close period of V under the L1 metric. Then

(a) (1/(⌈n/p⌉ − 1)) B_p ≤ d_{L1}(V, V_P)
(b) d_{L1}(V, V_P) ≤ (2/⌊n/p⌋) B_p

Proof. As above, p splits V into ⌊n/p⌋ p-tuples, T1, . . . , T_{⌊n/p⌋}, and, possibly, one more h-tuple, T_{⌊n/p⌋+1}, where h < p. Henceforth in this proof we denote by d(V, W) the L1 distance, i.e.,

d(〈V[1], . . . , V[n]〉, 〈W[1], . . . , W[n]〉) = ∑_{i=1}^{n} |V[i] − W[i]|.

Now, d(V, V^(p)) = ∑_{i=1}^{⌈n/p⌉−1} d(T_i, T_{i+1}), where the last term might be a distance of an h-tuple and a p-tuple, defined as the distance of the h-tuple with the h-length prefix of the p-tuple.

Lower Bound. We have that d(V, V_P) = ∑_{i=1}^{⌈n/p⌉} d(T_i, P). Rearranging the summations and multiplying by ⌈n/p⌉ − 1 gives:

(⌈n/p⌉ − 1) d(V, V_P) = ∑_{1≤i<j≤⌈n/p⌉} [d(T_i, P) + d(T_j, P)].

By the triangle inequality we have that:

d(T_i, P) + d(T_j, P) ≥ d(T_i, T_j), for 1 ≤ i < j ≤ ⌈n/p⌉.

Therefore,

∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) ≤ ∑_{1≤i<j≤⌈n/p⌉} [d(T_i, P) + d(T_j, P)] = (⌈n/p⌉ − 1) d(V, V_P).

However,


B_p = ∑_{i=1}^{⌈n/p⌉−1} d(V, V^(ip)) = ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) ≤ (⌈n/p⌉ − 1) d(V, V_P).

Dividing by ⌈n/p⌉ − 1 gives inequality (a).

Upper Bound. Again, d(V, V_P) = ∑_{i=1}^{⌈n/p⌉} d(T_i, P) (recall that the last term might be a distance of an h-tuple and a p-tuple, defined as the distance of the h-tuple with the h-length prefix of the p-tuple). By the definition of close periodic vector we know that d(V, V_P) ≤ d(V, V_{T_i}) for all 1 ≤ i ≤ ⌊n/p⌋. Therefore,

⌊n/p⌋ d(V, V_P) ≤ ∑_{i=1}^{⌊n/p⌋} d(V, V_{T_i}).

Rewriting the summation gives:

∑_{i=1}^{⌊n/p⌋} d(V, V_{T_i}) = 2 ∑_{1≤i<j≤⌊n/p⌋} d(T_i, T_j) + ∑_{i=1}^{⌊n/p⌋} d(T_i, T_{⌊n/p⌋+1}).

Note that the last term in the summation is needed only if p does not divide n, i.e., if ⌊n/p⌋ ≠ ⌈n/p⌉. Now,

2B_p = 2 ∑_{i=1}^{⌈n/p⌉−1} d(V, V^(ip)) = 2 ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j).

Thus,

2 ∑_{1≤i<j≤⌊n/p⌋} d(T_i, T_j) + ∑_{i=1}^{⌊n/p⌋} d(T_i, T_{⌊n/p⌋+1}) ≤ 2 ∑_{1≤i<j≤⌈n/p⌉} d(T_i, T_j) = 2B_p.

It follows that:

⌊n

p

⌋d(V , V P ) �

� np �∑

i=1

d(V , V Ti ) � 2B p i.e. d(V , V P ) � 2

� np � B p. �
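As a quick numerical sanity check of Lemma 10 (not part of the paper), the Python sketch below realizes the $p$-dimension close period as the coordinate-wise median of the aligned tuple entries — an assumption of this sketch, justified because the median minimizes a per-coordinate sum of absolute deviations — and verifies both inequalities on a random vector.

```python
import random
import statistics

def l1(a, b):
    # L1 distance of two equal-length sequences
    return sum(abs(x - y) for x, y in zip(a, b))

def tuples(V, p):
    # split V into ceil(n/p) chunks of length p; the last may be the shorter h-tuple
    return [V[i:i + p] for i in range(0, len(V), p)]

def close_period_l1(V, p):
    # Assumption of this sketch: the L1 close period is the coordinate-wise
    # median of the aligned tuple entries (the median minimizes a sum of
    # absolute deviations in each coordinate separately).
    T = tuples(V, p)
    return [statistics.median([t[j] for t in T if j < len(t)])
            for j in range(p)]

def dist_to_periodic(V, P):
    # d(V, V_P): L1 distance from V to the periodic extension of P, truncated to n
    n, p = len(V), len(P)
    return l1(V, (P * (n // p + 1))[:n])

def B(V, p):
    # B_p = sum over i >= 1 of d(V, V^(ip)), the L1 self-distance at shift i*p
    n = len(V)
    return sum(l1(V[:n - i * p], V[i * p:]) for i in range(1, -(-n // p)))

random.seed(1)
n, p = 23, 5
V = [random.randint(0, 9) for _ in range(n)]
d = dist_to_periodic(V, close_period_l1(V, p))
# Lemma 10: B_p / (ceil(n/p) - 1) <= d(V, V_P) <= 2 B_p / floor(n/p)
assert B(V, p) / (-(-n // p) - 1) <= d <= 2 * B(V, p) / (n // p)
```

The lower bound holds for any candidate period, while the upper bound relies on the median being an exact minimizer of $d(V, V_P)$.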

The concatenation lemma gives:

Lemma 11. Given a vector $V \in \mathbb{R}^n$, let $p_{\min}$, $n/4 < p_{\min} \le n/2$, be the value for which $B_{p_{\min}}$ is smallest. Then the $p_{\min}$-dimension close period $P$ approximates the period of the closest periodic vector of $V$ to within a factor of 3.

Approximation Algorithm for $L_1$.

1. Compute $B_p$ for all $p = \lceil n/4 \rceil, \ldots, \lfloor n/2 \rfloor$.
2. Choose $p_{\min}$ for which $B_{p_{\min}}$ is smallest.
3. Compute the $p_{\min}$-dimension close period of $V$.

Time. As explained above, we get from [16] that, for any $\varepsilon > 0$, $B_p$ can be approximated to within a factor of $(1 + \varepsilon)$ in time $O(\frac{1}{\varepsilon^2} n \log n \log |\Sigma|)$; choosing the minimum is done in linear time, as is constructing the $p_{\min}$-dimension close period. Thus the total time is $O(\frac{1}{\varepsilon^2} n \log n \log |\Sigma|)$. It should be noted, however, that the approximation factor is $3 + \varepsilon$ because of the approximation of the distances of the shift vectors.
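The three steps above can be sketched as follows (Python, illustrative only). For clarity the sketch computes each $B_p$ exactly, giving $O(n^2)$ time overall rather than the $\tilde O(n)$ time obtained with the $(1+\varepsilon)$-approximation of [16], and it realizes step 3 by a coordinate-wise median, which is an assumption of the sketch.

```python
import statistics

def approx_l1_period(V):
    """Sketch of the 3-step L1 approximation algorithm: try every
    candidate period length p in (n/4, n/2], pick the one minimizing
    B_p, and output the close period for that p (taken here as the
    coordinate-wise median of the aligned tuple entries)."""
    n = len(V)

    def B(p):
        # B_p = sum over i >= 1 of L1(V, V shifted by i*p)
        return sum(sum(abs(a - b) for a, b in zip(V[:n - i * p], V[i * p:]))
                   for i in range(1, -(-n // p)))

    candidates = range(n // 4 + 1, n // 2 + 1)   # n/4 < p <= n/2
    p_min = min(candidates, key=B)               # steps 1 and 2
    T = [V[i:i + p_min] for i in range(0, n, p_min)]
    return [statistics.median([t[j] for t in T if j < len(t)])
            for j in range(p_min)]               # step 3

# On an exactly 3-periodic vector, p = 3 has B_3 = 0 and is chosen.
period = approx_l1_period([1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
assert period == [1, 2, 3]
```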

We have, therefore, proven the second part of Theorem 2.

4.3. Approximation in L∞ space

Finally, we describe a fast approximation of the closest periodic vector in $L_\infty$ space. As in the previous cases, we need a bounding lemma for $L_\infty$; however, the approach is quite different from the previous two.


Lemma 12. Let $V$ be a vector of dimension $n$, and let $1 \le p \le n/2$. Let $V_P$ be the $n$-dimensional vector for which the $p$-dimensional vector $P$ is the $p$-dimension close period of $V$ under the $L_\infty$ metric. Then

(a) $\frac{1}{2} d_{L_\infty}(V, V^{(p)}) \le d_{L_\infty}(V, V_P)$;

(b) $d_{L_\infty}(V, V_P) \le \frac{\lceil n/p \rceil - 1}{2} d_{L_\infty}(V, V^{(p)})$.

Proof. Recall that $p$ splits $V$ into $\lfloor n/p \rfloor$ $p$-tuples, $T_1, \ldots, T_{\lfloor n/p \rfloor}$, and, possibly, one more $h$-tuple, $T_{\lfloor n/p \rfloor + 1}$, where $h < p$, so $d_{L_\infty}(V, V_P) = \max_{i=1}^{\lceil n/p \rceil} d_{L_\infty}(T_i, P)$ (the last term might be a distance of an $h$-tuple and a $p$-tuple, defined as the distance of the $h$-tuple with the $h$-length prefix of the $p$-tuple). However,

$$d_{L_\infty}\bigl(V, V^{(p)}\bigr) = \max_{i=1}^{\lceil n/p \rceil - 1} d_{L_\infty}(T_i, T_{i+1}).$$

Assume this maximum is achieved by the pair $T_{i_0}[j_0], T_{i_0+1}[j_0]$, i.e., $|T_{i_0}[j_0] - T_{i_0+1}[j_0]|$ is the largest distance in the shift vector.

Lower Bound. The smallest value that $d_{L_\infty}(V, V_P)$ can achieve is $\frac{|T_{i_0}[j_0] - T_{i_0+1}[j_0]|}{2}$. Thus

$$\frac{1}{2} d_{L_\infty}\bigl(V, V^{(p)}\bigr) \le d_{L_\infty}(V, V_P).$$

Upper Bound. For each $i$, $i = 1, \ldots, \lceil n/p \rceil - 1$, we have $d_{L_\infty}(T_i, T_{i+1}) = \max_{j=1}^{p} |T_i[j] - T_{i+1}[j]|$. Assume the maximum difference is the pair $|T_{i_0}[j_0] - T_{i_0+1}[j_0]|$, i.e., $d_{L_\infty}(V, V^{(p)}) = |T_{i_0}[j_0] - T_{i_0+1}[j_0]|$.

For each $j$, $1 \le j \le p$, sort the $\lceil n/p \rceil$ numbers $T_i[j]$, $i = 1, \ldots, \lceil n/p \rceil$, and let $m_j$ be the difference between the largest and smallest number. Let $m = \max_{j=1}^{p} m_j$ and let $j_1$ be such that $m = m_{j_1}$. Note that $d_{L_\infty}(V, V_P) = \frac{m}{2}$.

Consider all $\lceil n/p \rceil - 1$ pairs $|T_1[j_1] - T_2[j_1]|, |T_2[j_1] - T_3[j_1]|, \ldots, |T_{\lceil n/p \rceil - 1}[j_1] - T_{\lceil n/p \rceil}[j_1]|$. Now, if there is an $i_1$ for which $|T_{i_1}[j_1] - T_{i_1+1}[j_1]| = m$, then $d_{L_\infty}(V, V^{(p)}) = |T_{i_0}[j_0] - T_{i_0+1}[j_0]| \ge |T_{i_1}[j_1] - T_{i_1+1}[j_1]| = m \ge \frac{m}{2} = d_{L_\infty}(V, V_P)$. If no such pair exists, then by an averaging consideration we know that there exists $i_1$ for which $|T_{i_1}[j_1] - T_{i_1+1}[j_1]| \ge \frac{m}{\lceil n/p \rceil - 1}$. Thus we get $d_{L_\infty}(V, V^{(p)}) = |T_{i_0}[j_0] - T_{i_0+1}[j_0]| \ge |T_{i_1}[j_1] - T_{i_1+1}[j_1]| \ge \frac{m}{\lceil n/p \rceil - 1}$.

Therefore,

$$\frac{\lceil n/p \rceil - 1}{2} d_{L_\infty}\bigl(V, V^{(p)}\bigr) \ge \frac{m}{2}.$$

Thus

$$d_{L_\infty}(V, V_P) \le \frac{\lceil n/p \rceil - 1}{2} d_{L_\infty}\bigl(V, V^{(p)}\bigr). \qquad \Box$$

The concatenation lemma again gives:

Lemma 13. Given a vector $V \in \mathbb{R}^n$, let $p_{\min}$, $n/4 < p_{\min} \le n/2$, be the value for which $d_{L_\infty}(V, V^{(p_{\min})})$ is smallest. Then the $p_{\min}$-dimension close period $P$ approximates the period of the closest periodic vector of $V$ to within a factor of 3.

Approximation Algorithm for $L_\infty$.

1. Compute $d_{L_\infty}(V, V^{(p)})$ for all $p = \lceil n/4 \rceil, \ldots, \lfloor n/2 \rfloor$.
2. Choose $p_{\min}$ for which $d_{L_\infty}(V, V^{(p_{\min})})$ is smallest.
3. Compute the $p_{\min}$-dimension close period of $V$.

Time. As explained above, we get from [16] that, for any $\varepsilon > 0$, all $d_{L_\infty}(V, V^{(p)})$ can be approximated to within a factor of $(1 + \varepsilon)$ in time $O(\frac{1}{\varepsilon} n \log n \log |\Sigma|)$; choosing the minimum is done in linear time, as is constructing the $p_{\min}$-dimension close period. Thus the total time is $O(\frac{1}{\varepsilon} n \log n \log |\Sigma|)$. It should be noted, however, that the approximation factor is $3 + \varepsilon$ because of the approximation of the distances of the shift vectors.
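A sketch of these three steps (Python, illustrative only): it computes each shift distance exactly in $O(n)$ time, so $O(n^2)$ overall instead of the $(1+\varepsilon)$-approximation of [16], and takes the close period as the coordinate-wise midrange, which is an assumption of the sketch.

```python
def approx_linf_period(V):
    """Sketch of the 3-step L_inf approximation algorithm: compute
    d_Linf(V, V^(p)) exactly for every p in (n/4, n/2], take the
    minimizing p_min, and return its close period (taken here as the
    coordinate-wise midrange -- an assumption of this sketch)."""
    n = len(V)

    def shift_dist(p):
        # d_Linf(V, V^(p)) = max over k of |V[k] - V[k+p]|
        return max(abs(a - b) for a, b in zip(V[:n - p], V[p:]))

    p_min = min(range(n // 4 + 1, n // 2 + 1), key=shift_dist)  # steps 1, 2
    T = [V[i:i + p_min] for i in range(0, n, p_min)]
    cols = ([t[j] for t in T if j < len(t)] for j in range(p_min))
    return p_min, [(max(c) + min(c)) / 2 for c in cols]         # step 3

# A period-2 vector: shift distance at p = 4 is 0, so p = 4 is chosen first.
p_min, P = approx_linf_period([5, 0, 5, 0, 5, 0, 5, 0, 5, 0, 5, 0])
assert p_min == 4 and P == [5, 0, 5, 0]
```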

We have, therefore, proven the third and final part of Theorem 2.

Acknowledgements

The authors thank the anonymous reviewers of previous versions of this paper for their useful comments. In particular, we thank the reviewer for pointing out that a stronger inequality can be used in the proof of Lemma 7, enabling a better approximation factor.


References

[1] Generalized means, http://en.wikipedia.org/wiki/Generalized mean.
[2] A. Amir, G. Benson, Two-dimensional periodicity and its application, SIAM J. Comput. 27 (1) (1998) 90–106.
[3] A. Amir, G. Benson, M. Farach, Optimal parallel two dimensional pattern matching, in: Proc. of the 5th ACM Symp. on Parallel Algorithms and Architectures, 1993, pp. 79–85.
[4] A. Amir, G. Benson, M. Farach, Optimal parallel two dimensional text searching on a CREW PRAM, Inform. and Comput. 144 (1) (1998) 1–17.
[5] A. Amir, E. Eisenberg, A. Levy, Approximate periodicity, in: The 21st International Symposium on Algorithms and Computation (ISAAC), 2010, pp. 25–36.
[6] A. Amir, E. Eisenberg, A. Levy, E. Porat, N. Shapira, Cycle detection and correction, in: The 37th International Colloquium on Automata, Languages and Programming (ICALP), 2010, pp. 43–54.
[7] A. Apostolico, R. Giancarlo, Periodicity and repetitions in parameterized strings, Discrete Appl. Math. 156 (9) (2008) 1389–1398.
[8] R. Cole, M. Crochemore, Z. Galil, L. Gasieniec, R. Hariharan, S. Muthukrishnan, K. Park, W. Rytter, Optimally fast parallel algorithms for preprocessing and pattern matching in one and two dimensions, in: Proc. 34th IEEE FOCS, 1993, pp. 248–258.
[9] R. Cole, M. Crochemore, Z. Galil, L. Gasieniec, R. Hariharan, S. Muthukrishnan, K. Park, H. Ramesh, W. Rytter, Optimally fast parallel algorithms for preprocessing and pattern matching in one and two dimensions, in: Proc. 34th Annual IEEE FOCS, 1993, pp. 248–258.
[10] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, MIT Press and McGraw–Hill, 1992.
[11] M. Crochemore, An optimal algorithm for computing the repetitions in a word, Inform. Process. Lett. 12 (5) (1981) 244–250.
[12] Z. Galil, Optimal parallel algorithms for string matching, in: Proc. 16th ACM Symposium on Theory of Computing, vol. 67, 1984, pp. 144–157.
[13] Z. Galil, K. Park, Alphabet-independent two-dimensional witness computation, SIAM J. Comput. 25 (5) (1996) 907–935.
[14] B. Gfeller, Finding longest approximate periodic patterns, in: WADS, 2011, pp. 463–474.
[15] G.M. Landau, J.P. Schmidt, D. Sokol, An algorithm for approximate tandem repeats, J. Comput. Biol. 8 (1) (2001) 1–18.
[16] O. Lipsky, E. Porat, Approximated pattern matching with the ℓ1, ℓ2, and ℓ∞ metrics, Algorithmica 60 (2) (2011) 335–348.
[17] M. Lothaire, Combinatorics on Words, Addison–Wesley, Reading, Mass., 1983.
[18] S. Ma, J.L. Hellerstein, Mining partially periodic event patterns with unknown periods, in: The 17th International Conference on Data Engineering (ICDE), IEEE Computer Society, 2001, pp. 205–214.
[19] M. Régnier, L. Rostami, A unifying look at d-dimensional periodicities and space coverings, in: Proc. 4th Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 684, 1993, pp. 215–227.
[20] Q.F. Stout, Unimodal regression via prefix isotonic regression, Comput. Statist. Data Anal. 53 (2008) 289–297.