identifying patterns in time series data

Identifying Patterns in Time Series Data

Daniel Lewis04/06/06

Time Series Data

● Definition:– “An ordered set of m real-valued variables”

● How can patterns that occur in time series data be located?

● Euclidean Distance:

Comparing Time Series

Adaptive Piecewise Constant Approximation

● Segments of time series data are represented by 2 values, the mean value of all points in the segment and the right endpoint of the segment

● Allows large queries to be quickly compared, but APCA representations must be created first.

Adaptive Piecewise Constant Approximation (cont')

APCA (cont')

● Data Compression allows for faster search● Can be used for indexing large series● Can handle large queries

● But what if we want to identify patterns in streaming time series data?

Pattern Recognition

● For a given query q and a time series s, if the Euclidean distance between q and s < r, the series match

● r: user defined, application specific threshold

Detecting Patterns in Streaming Data

●Brute Force Method:

– P1 ,.., P

n : Set of patterns of length k

– S: Input Stream

– For every possible substring of length k in s, calculate the distance between the substring and all n patterns

– This method is obviously extremely costly in the case of a large pattern set and a large input stream

● O(n(|S| - k))

Speeding Up Pattern Identification

● Early Abandoning:– If at any point error > r 2, we can stop

computation

Wedge Creation

● Combine multiple patterns into a wedge:

– Define Upper Limit:● U

i = max( C

1i , .. , C

ki)

– Define Lower Limit:● L

i = min( C

1i , .. , C

ki)

– This produces a wedge such that:

Wedge Creation (cont')

Wedge Comparison

● Distance Between a Query and a Wedge:

● If distance > r, then the distance between the query and all component patterns > r, allowing you to eliminate multiple possible matches with a single comparison

Hierarchical Wedges

● The usefulness of any wedge is determined by the similarity of the patterns used in its construction.

● More similar patterns create smaller, more useful wedges

● Patterns can be combined in a tree-like pattern to produce a hierarchy of wedges

Hierarchical Wedges (cont')

Atomic Wedgie

● Preparation:– All patterns are clustered by similarity– The most similar patterns are combined into

wedges– The resulting wedges are combined to form

larger, less specific wedges

Atomic Wedgie (cont')

● Usage:– When streaming data arrives, each substring of

length k is first compared to the largest wedge, if dist > r, comparison stops, else, the distance is compared against the two component wedges, eliminating any branches where the distance exceeds r.

– Eventually, all branches are eliminated or a single (atomic) pattern is matched

Atomic Wedgie Optimization

● Optimization:– Summation is order independent– Large sections are less likely to increase error

than small sections

– Thus, if error is summed starting with the smallest sections first, the requirements for early abandon are more likely to be met earlier

Atomic Wedgie (cont')

● Advantages:– If wedges are well formed, large speed increases

can occur● A large number of similar possible patterns can be

analyzed quickly

● Disadvantages:– If wedges are poorly formed, the time required

will exceed the Brute Force Method● Dissimilar patterns are not handled well

Special Considerations

● The choice of r (similarity threshold) is of great importance:

– if r is too large, a substring can match too many patterns to be useful

– if r is too small, too little matching may occur

● Good Choice: – r = average distance from any pattern to its

nearest neighbor

Atomic Wedgie Results

References

Chakrabarti, K., Keogh, E., Mehrotra, S., and Pazzani, M., Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, ACM Transactions on Database Systems, Vol 27, 2002.

Wei, L., Keogh, E., Van Herle, H., Mafra-Neto, A., Atomic Wedgie: Efficient Query Filtering for Streaming Times Series, Data Mining, Fifth IEEE International Conference on 27-30 Nov. 2005 Page(s):490 - 497

Questions?

identifying patterns in time series data

Documents